jmlr jmlr2006 knowledge-graph by maker-knowledge-mining
1 jmlr-2006-A Direct Method for Building Sparse Kernel Learning Algorithms
Author: Mingrui Wu, Bernhard Schölkopf, Gökhan Bakır
Abstract: Many kernel learning algorithms, including support vector machines, result in a kernel machine, such as a kernel classifier, whose key component is a weight vector in a feature space implicitly introduced by a positive definite kernel function. This weight vector is usually obtained by solving a convex optimization problem. Based on this fact, we present a direct method to build sparse kernel learning algorithms by adding one more constraint to the original convex optimization problem, such that the sparseness of the resulting kernel machine is explicitly controlled while at the same time performance is kept as high as possible. A gradient based approach is provided to solve this modified optimization problem. Applying this method to the support vector machine results in a concrete algorithm for building sparse large margin classifiers. These classifiers essentially find a discriminating subspace that can be spanned by a small number of vectors, and in this subspace, the different classes of data are linearly well separated. Experimental results over several classification benchmarks demonstrate the effectiveness of our approach. Keywords: sparse learning, sparse large margin classifiers, kernel learning algorithms, support vector machine, kernel Fisher discriminant
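A minimal sketch of the general idea only, not the paper's algorithm: the expansion size is fixed in advance and both the expansion vectors and their coefficients are optimized for a margin-style objective, here with an RBF kernel, hinge loss, and a generic derivative-free optimizer standing in for the paper's dedicated gradient-based scheme. All function names and hyperparameters below are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def rbf(A, B, gamma=1.0):
    # Pairwise RBF kernel values between rows of A and rows of B.
    return np.exp(-gamma * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))

def sparse_kernel_classifier(X, y, n_z=3, lam=1e-2, gamma=1.0, seed=0):
    """y in {-1, +1}; returns expansion vectors Z, coefficients beta, and a predictor."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    z0 = X[rng.choice(len(X), n_z, replace=False)].ravel()   # initialize expansion points from data
    theta0 = np.concatenate([z0, np.zeros(n_z)])              # expansion points + coefficients

    def objective(theta):
        Z = theta[:n_z * d].reshape(n_z, d)
        beta = theta[n_z * d:]
        f = rbf(X, Z, gamma) @ beta                           # decision values
        return np.maximum(0.0, 1.0 - y * f).mean() + lam * beta @ beta

    res = minimize(objective, theta0, method="Powell")        # derivative-free, for brevity
    Z, beta = res.x[:n_z * d].reshape(n_z, d), res.x[n_z * d:]
    return Z, beta, lambda Xnew: np.sign(rbf(Xnew, Z, gamma) @ beta)

# Toy usage: two Gaussian blobs, a 3-vector expansion suffices.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.concatenate([-np.ones(50), np.ones(50)])
Z, beta, predict = sparse_kernel_classifier(X, y)
print("training accuracy:", (predict(X) == y).mean())
```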
2 jmlr-2006-A Graphical Representation of Equivalence Classes of AMP Chain Graphs
Author: Alberto Roverato, Milan Studený
Abstract: This paper deals with chain graph models under the alternative AMP interpretation. A new representative of an AMP Markov equivalence class, called the largest deflagged graph, is proposed. The representative is based on the revealed internal structure of the AMP Markov equivalence class. More specifically, the AMP Markov equivalence class decomposes into finer strong equivalence classes, and there exists a distinguished strong equivalence class among those forming the AMP Markov equivalence class. The largest deflagged graph is the largest chain graph in that distinguished strong equivalence class. A composed graphical procedure for obtaining the largest deflagged graph from any AMP Markov equivalent chain graph is presented. In general, the largest deflagged graph differs from the AMP essential graph, which is another representative of the AMP Markov equivalence class. Keywords: chain graph, AMP Markov equivalence, strong equivalence, largest deflagged graph, component merging procedure, deflagging procedure, essential graph
3 jmlr-2006-A Hierarchy of Support Vector Machines for Pattern Detection
Author: Hichem Sahbi, Donald Geman
Abstract: We introduce a computational design for pattern detection based on a tree-structured network of support vector machines (SVMs). An SVM is associated with each cell in a recursive partitioning of the space of patterns (hypotheses) into increasingly finer subsets. The hierarchy is traversed coarse-to-fine and each chain of positive responses from the root to a leaf constitutes a detection. Our objective is to design and build a network which balances overall error and computation. Initially, SVMs are constructed for each cell with no constraints. This “free network” is then perturbed, cell by cell, into another network, which is “graded” in two ways: first, the number of support vectors of each SVM is reduced (by clustering) in order to adjust to a pre-determined, increasing function of cell depth; second, the decision boundaries are shifted to preserve all positive responses from the original set of training data. The limits on the numbers of clusters (virtual support vectors) result from minimizing the mean computational cost of collecting all detections subject to a bound on the expected number of false positives. When applied to detecting faces in cluttered scenes, the patterns correspond to poses and the free network is already faster and more accurate than applying a single pose-specific SVM many times. The graded network promotes very rapid processing of background regions while maintaining the discriminatory power of the free network. Keywords: statistical learning, hierarchy of classifiers, coarse-to-fine computation, support vector machines, face detection
4 jmlr-2006-A Linear Non-Gaussian Acyclic Model for Causal Discovery
Author: Shohei Shimizu, Patrik O. Hoyer, Aapo Hyvärinen, Antti Kerminen
Abstract: In recent years, several methods have been proposed for the discovery of causal structure from non-experimental data. Such methods make various assumptions on the data generating process to facilitate its identification from purely observational data. Continuing this line of research, we show how to discover the complete causal structure of continuous-valued data, under the assumptions that (a) the data generating process is linear, (b) there are no unobserved confounders, and (c) disturbance variables have non-Gaussian distributions of non-zero variances. The solution relies on the use of the statistical method known as independent component analysis, and does not require any pre-specified time-ordering of the variables. We provide a complete Matlab package for performing this LiNGAM analysis (short for Linear Non-Gaussian Acyclic Model), and demonstrate the effectiveness of the method using artificially generated data and real-world data. Keywords: independent component analysis, non-Gaussianity, causal discovery, directed acyclic graph, non-experimental data
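The ICA-based recipe the abstract refers to can be illustrated in a few lines. This is a simplified sketch, not the authors' released Matlab package: it uses scikit-learn's FastICA, finds the row permutation of the unmixing matrix by brute force (so it only suits a handful of variables), and omits the second permutation step that prunes the estimated connection matrix to a strict causal order.

```python
import itertools
import numpy as np
from sklearn.decomposition import FastICA   # assumed available

def lingam_sketch(X):
    """X: (n_samples, n_vars); returns an estimated connection matrix B with x = B x + e."""
    n_vars = X.shape[1]
    W = FastICA(n_components=n_vars, random_state=0).fit(X).components_   # s ≈ W x
    # Permute rows of W so the diagonal is as far from zero as possible
    # (a simplification of the paper's permutation criterion).
    best = max(itertools.permutations(range(n_vars)),
               key=lambda p: np.sum(np.log(np.abs(W[list(p), range(n_vars)]) + 1e-12)))
    Wp = W[list(best)]
    Wp = Wp / np.diag(Wp)[:, None]            # normalize each row by its diagonal entry
    return np.eye(n_vars) - Wp                # B = I - W'

# Toy usage: x2 := 0.8 * x1 + uniform (non-Gaussian) noise.
rng = np.random.default_rng(0)
x1 = rng.uniform(-1, 1, 5000)
x2 = 0.8 * x1 + rng.uniform(-1, 1, 5000)
print(np.round(lingam_sketch(np.column_stack([x1, x2])), 2))   # entry [1, 0] should be near 0.8
```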
Author: Robert Castelo, Alberto Roverato
Abstract: Learning of large-scale networks of interactions from microarray data is an important and challenging problem in bioinformatics. A widely used approach is to assume that the available data constitute a random sample from a multivariate distribution belonging to a Gaussian graphical model. As a consequence, the prime objects of inference are full-order partial correlations, which are partial correlations between two variables given the remaining ones. In the context of microarray data the number of variables exceeds the sample size, and this precludes the application of traditional structure learning procedures because a sampling version of full-order partial correlations does not exist. In this paper we consider limited-order partial correlations, that is, partial correlations computed on marginal distributions of manageable size, and provide a set of rules that allow one to assess the usefulness of these quantities for deriving the independence structure of the underlying Gaussian graphical model. Furthermore, we introduce a novel structure learning procedure based on a quantity, obtained from limited-order partial correlations, that we call the non-rejection rate. The applicability and usefulness of the procedure are demonstrated on both simulated and real data. Keywords: Gaussian distribution, gene network, graphical model, microarray data, non-rejection rate, partial correlation, small-sample inference
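As a small illustration of the limited-order quantities the abstract builds on, the sketch below computes a first-order partial correlation from pairwise correlations; the paper's non-rejection rate, which aggregates many such tests over random variable subsets, is not reproduced here. Names are illustrative.

```python
import numpy as np

def first_order_partial_corr(x, y, z):
    # Partial correlation of x and y given a single conditioning variable z.
    r = np.corrcoef(np.vstack([x, y, z]))
    rxy, rxz, ryz = r[0, 1], r[0, 2], r[1, 2]
    return (rxy - rxz * ryz) / np.sqrt((1 - rxz**2) * (1 - ryz**2))

# x and y are correlated only through z, so the partial correlation is near zero.
rng = np.random.default_rng(0)
z = rng.normal(size=5000)
x = z + rng.normal(size=5000)
y = z + rng.normal(size=5000)
print(np.round(np.corrcoef(x, y)[0, 1], 2), np.round(first_order_partial_corr(x, y, z), 2))
```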
Author: Luis M. de Campos
Abstract: We propose a new scoring function for learning Bayesian networks from data using score+search algorithms. This is based on the concept of mutual information and exploits some well-known properties of this measure in a novel way. Essentially, a statistical independence test based on the chi-square distribution, associated with the mutual information measure, together with a property of additive decomposition of this measure, are combined in order to measure the degree of interaction between each variable and its parent variables in the network. The result is a non-Bayesian scoring function called MIT (mutual information tests) which belongs to the family of scores based on information theory. The MIT score also represents a penalization of the Kullback-Leibler divergence between the joint probability distributions associated with a candidate network and with the available data set. Detailed results of a complete experimental evaluation of the proposed scoring function and its comparison with the well-known K2, BDeu and BIC/MDL scores are also presented. Keywords: Bayesian networks, scoring functions, learning, mutual information, conditional independence tests
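A hedged sketch of the main ingredient described above: the empirical mutual information between a variable and a parent, turned into an independence test via the fact that 2N·MI is asymptotically chi-square under independence. The exact degrees-of-freedom bookkeeping and the decomposition over multiple parents used by the MIT score are not reproduced; function and variable names are illustrative.

```python
import numpy as np
from scipy.stats import chi2

def mi_test(x, pa, alpha=0.99):
    """x, pa: integer-coded 1-D arrays; returns (2N*MI, chi-square threshold)."""
    N = len(x)
    joint = np.zeros((x.max() + 1, pa.max() + 1))
    np.add.at(joint, (x, pa), 1.0)                      # contingency table
    joint /= N
    px, ppa = joint.sum(1, keepdims=True), joint.sum(0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        mi = np.nansum(joint * np.log(joint / (px * ppa)))
    dof = (joint.shape[0] - 1) * (joint.shape[1] - 1)
    return 2 * N * mi, chi2.ppf(alpha, dof)

# x genuinely depends on pa, so the statistic should far exceed the threshold.
rng = np.random.default_rng(0)
pa = rng.integers(0, 2, 5000)
x = (pa ^ (rng.random(5000) < 0.1)).astype(int)        # noisy copy of pa
print(mi_test(x, pa))
```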
Author: Shalabh Bhatnagar, Vivek S. Borkar, Madhukar Akarapu
Abstract: We study the problem of long-run average cost control of Markov chains conditioned on a rare event. In a related recent work, a simulation based algorithm for estimating performance measures associated with a Markov chain conditioned on a rare event has been developed. We extend ideas from this work and develop an adaptive algorithm for obtaining, online, optimal control policies conditioned on a rare event. Our algorithm uses three timescales or step-size schedules. On the slowest timescale, a gradient search algorithm for policy updates that is based on one-simulation simultaneous perturbation stochastic approximation (SPSA) type estimates is used. Deterministic perturbation sequences obtained from appropriate normalized Hadamard matrices are used here. The fast timescale recursions compute the conditional transition probabilities of an associated chain by obtaining solutions to the multiplicative Poisson equation (for a given policy estimate). Further, the risk parameter associated with the value function for a given policy estimate is updated on a timescale that lies in between the two scales above. We briefly sketch the convergence analysis of our algorithm and present a numerical application in the setting of routing multiple flows in communication networks. Keywords: Markov decision processes, optimal control conditioned on a rare event, simulation based algorithms, SPSA with deterministic perturbations, reinforcement learning
8 jmlr-2006-A Very Fast Learning Method for Neural Networks Based on Sensitivity Analysis
Author: Enrique Castillo, Bertha Guijarro-Berdiñas, Oscar Fontenla-Romero, Amparo Alonso-Betanzos
Abstract: This paper introduces a learning method for two-layer feedforward neural networks based on sensitivity analysis, which uses a linear training algorithm for each of the two layers. First, random values are assigned to the outputs of the first layer; these initial values are then updated based on sensitivity formulas, which use the weights in each of the layers, and the process is repeated until convergence. Since these weights are learnt by solving a linear system of equations, there is an important saving in computational time. The method also gives the local sensitivities of the least squares errors with respect to the input and output data, with no extra computational cost, because the necessary information becomes available without extra calculations. This method, called the Sensitivity-Based Linear Learning Method, can also be used to provide an initial set of weights, which significantly improves the behavior of other learning algorithms. The theoretical basis for the method is given, and its performance is illustrated by applying it to several well-known data sets, on which it is compared with several other learning algorithms. The results show a learning speed that is generally faster than that of existing methods. In addition, the method can be used as an initialization tool for other well-known methods, yielding significant improvements. Keywords: supervised learning, neural networks, linear optimization, least-squares, initialization method, sensitivity analysis
9 jmlr-2006-Accurate Error Bounds for the Eigenvalues of the Kernel Matrix
Author: Mikio L. Braun
Abstract: The eigenvalues of the kernel matrix play an important role in a number of kernel methods, in particular, in kernel principal component analysis. It is well known that the eigenvalues of the kernel matrix converge as the number of samples tends to infinity. We derive probabilistic finite sample size bounds on the approximation error of individual eigenvalues which have the important property that the bounds scale with the eigenvalue under consideration, reflecting the actual behavior of the approximation errors as predicted by asymptotic results and observed in numerical simulations. Such scaling bounds have so far only been known for tail sums of eigenvalues. Asymptotically, the bounds presented here have a slower than stochastic rate, but the number of sample points necessary to make this disadvantage noticeable is often unrealistically large. Therefore, under practical conditions, and for all but the largest few eigenvalues, the bounds presented here form a significant improvement over existing non-scaling bounds. Keywords: kernel matrix, eigenvalues, relative perturbation bounds
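A tiny numerical illustration of the convergence that these bounds quantify, under the usual 1/n scaling of the kernel matrix: the leading eigenvalues computed from growing samples of the same distribution stabilize, and the smaller eigenvalues are orders of magnitude below the largest one, which is why bounds that scale with the eigenvalue matter. The kernel and input distribution below are arbitrary choices.

```python
import numpy as np

def top_eigs(n, k=5, gamma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, 1))
    K = np.exp(-gamma * (X - X.T) ** 2)                     # RBF kernel matrix
    return np.sort(np.linalg.eigvalsh(K / n))[::-1][:k]     # eigenvalues of K/n

for n in (100, 400, 1600):
    print(n, np.round(top_eigs(n), 4))
```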
Author: Eyal Even-Dar, Shie Mannor, Yishay Mansour
Abstract: We incorporate statistical confidence intervals in both the multi-armed bandit and the reinforcement learning problems. In the bandit problem we show that given n arms, it suffices to pull the arms a total of O((n/ε²) log(1/δ)) times to find an ε-optimal arm with probability of at least 1 − δ. This bound matches the lower bound of Mannor and Tsitsiklis (2004) up to constants. We also devise action elimination procedures in reinforcement learning algorithms. We describe a framework that is based on learning the confidence interval around the value function or the Q-function and eliminating actions that are not optimal (with high probability). We provide model-based and model-free variants of the elimination method. We further derive stopping conditions guaranteeing that the learned policy is approximately optimal with high probability. Simulations demonstrate a considerable speedup and added robustness over ε-greedy Q-learning.
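A minimal sketch of the action-elimination idea in the bandit setting: sample every surviving arm, maintain Hoeffding-style confidence intervals, and drop any arm whose upper confidence bound falls below the best lower bound. The constants and stopping rule below are illustrative rather than the paper's exact procedure, though the sample complexity has the same O((n/ε²) log(1/δ)) flavor.

```python
import numpy as np

def successive_elimination(pull, n_arms, eps=0.1, delta=0.05, max_rounds=100_000):
    """pull(i) -> reward in [0, 1] for arm i; returns the index of the surviving arm."""
    arms = list(range(n_arms))
    means, counts = np.zeros(n_arms), np.zeros(n_arms)
    for t in range(1, max_rounds + 1):
        for i in arms:                                    # one pull per surviving arm
            counts[i] += 1
            means[i] += (pull(i) - means[i]) / counts[i]
        radius = np.sqrt(np.log(4 * n_arms * t * t / delta) / (2 * t))   # Hoeffding-style radius
        best_lower = max(means[i] for i in arms) - radius
        arms = [i for i in arms if means[i] + radius >= best_lower]      # eliminate bad arms
        if len(arms) == 1 or 2 * radius < eps:
            break
    return max(arms, key=lambda i: means[i])

# Bernoulli arms with means 0.2, 0.5, 0.55, 0.8; arm 3 should typically survive.
rng = np.random.default_rng(0)
true_means = [0.2, 0.5, 0.55, 0.8]
print(successive_elimination(lambda i: float(rng.random() < true_means[i]), 4))
```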
Author: Masashi Sugiyama
Abstract: The goal of active learning is to determine the locations of training input points so that the generalization error is minimized. We discuss the problem of active learning in linear regression scenarios. Traditional active learning methods using least-squares learning often assume that the model used for learning is correctly specified. In many practical situations, however, this assumption may not be fulfilled. Recently, active learning methods using “importance”-weighted least-squares learning have been proposed, which are shown to be robust against misspecification of models. In this paper, we propose a new active learning method also using the weighted least-squares learning, which we call ALICE (Active Learning using the Importance-weighted least-squares learning based on Conditional Expectation of the generalization error). An important difference from existing methods is that we predict the conditional expectation of the generalization error given training input points, while existing methods predict the full expectation of the generalization error. Due to this difference, the training input design can be fine-tuned depending on the realization of training input points. Theoretically, we prove that the proposed active learning criterion is a more accurate predictor of the single-trial generalization error than the existing criterion. Numerical studies with toy and benchmark data sets show that the proposed method compares favorably to existing methods. Keywords: Active Learning, Conditional Expectation of Generalization Error, Misspecification of Models, Importance-Weighted Least-Squares Learning, Covariate Shift.
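As a sketch of the importance-weighted least-squares ingredient referred to above, the snippet below reweights ordinary least squares by the density ratio p_test(x)/p_train(x) under covariate shift with a misspecified (linear) model. The densities are assumed known Gaussians here purely for illustration, and ALICE's active-learning criterion itself is not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x_train = rng.normal(0.0, 1.0, n)                 # training input density: N(0, 1)
test_mean, test_std = 1.0, 0.7                    # test input density (assumed known)
y = np.sin(x_train) + 0.1 * rng.normal(size=n)    # true function is nonlinear -> linear model misspecified

def gauss_pdf(x, mu, sd):
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

w = gauss_pdf(x_train, test_mean, test_std) / gauss_pdf(x_train, 0.0, 1.0)   # importance weights
Phi = np.column_stack([np.ones(n), x_train])      # linear basis phi(x) = (1, x)

theta_ols = np.linalg.lstsq(Phi, y, rcond=None)[0]
theta_iw = np.linalg.solve(Phi.T @ (w[:, None] * Phi), Phi.T @ (w * y))

x_test = rng.normal(test_mean, test_std, 10_000)
Phi_t = np.column_stack([np.ones_like(x_test), x_test])
for name, th in [("OLS", theta_ols), ("IWLS", theta_iw)]:
    print(name, "test MSE:", round(float(np.mean((Phi_t @ th - np.sin(x_test)) ** 2)), 4))
```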
12 jmlr-2006-Active Learning with Feedback on Features and Instances
Author: Hema Raghavan, Omid Madani, Rosie Jones
Abstract: We extend the traditional active learning framework to include feedback on features in addition to labeling instances, and we execute a careful study of the effects of feature selection and human feedback on features in the setting of text categorization. Our experiments on a variety of categorization tasks indicate that there is significant potential in improving classifier performance by feature re-weighting, beyond that achieved via membership queries alone (traditional active learning) if we have access to an oracle that can point to the important (most predictive) features. Our experiments on human subjects indicate that human feedback on feature relevance can identify a sufficient proportion of the most relevant features (over 50% in our experiments). We find that on average, labeling a feature takes much less time than labeling a document. We devise an algorithm that interleaves labeling features and documents which significantly accelerates standard active learning in our simulation experiments. Feature feedback can complement traditional active learning in applications such as news filtering, e-mail classification, and personalization, where the human teacher can have significant knowledge on the relevance of features. Keywords: active learning, feature selection, relevance feedback, term feedback, text classification
13 jmlr-2006-Adaptive Prototype Learning Algorithms: Theoretical and Experimental Studies
Author: Fu Chang, Chin-Chin Lin, Chi-Jen Lu
Abstract: In this paper, we propose a number of adaptive prototype learning (APL) algorithms. They employ the same algorithmic scheme to determine the number and location of prototypes, but differ in the use of samples or the weighted averages of samples as prototypes, and also in the assumption of distance measures. To understand these algorithms from a theoretical viewpoint, we address their convergence properties, as well as their consistency under certain conditions. We also present a soft version of APL, in which a non-zero training error is allowed in order to enhance the generalization power of the resultant classifier. Applying the proposed algorithms to twelve UCI benchmark data sets, we demonstrate that they outperform many instance-based learning algorithms, the k-nearest neighbor rule, and support vector machines in terms of average test accuracy. Keywords: adaptive prototype learning, cluster-based prototypes, consistency, instance-based prototype, pattern classification
Author: Katya Scheinberg
Abstract: We propose an active set algorithm to solve the convex quadratic programming (QP) problem that is the core of support vector machine (SVM) training. The underlying method is not new and is based on the extensive practice of the Simplex method and its variants for convex quadratic problems. However, its application to large-scale SVM problems is new. Until recently the traditional active set methods were considered impractical for large SVM problems. By adapting the methods to the special structure of SVM problems we were able to produce an efficient implementation. We conduct an extensive study of the behavior of our method and its variations on SVM problems. We present computational results comparing our method with Joachims' SVMlight (see Joachims, 1999). The results show that our method has overall better performance on many SVM problems. It seems to have a particularly strong advantage on more difficult problems. In addition, this algorithm has better theoretical properties and it naturally extends to the incremental mode. Since the proposed method solves the standard SVM formulation, as does SVMlight, the generalization properties of these two approaches are identical and we do not discuss them in the paper. Keywords: active set methods, support vector machines, quadratic programming
Author: Radu Stefan Niculescu, Tom M. Mitchell, R. Bharat Rao
Abstract: The task of learning models for many real-world problems requires incorporating domain knowledge into learning algorithms, to enable accurate learning from a realistic volume of training data. This paper considers a variety of types of domain knowledge for constraining parameter estimates when learning Bayesian networks. In particular, we consider domain knowledge that constrains the values or relationships among subsets of parameters in a Bayesian network with known structure. We incorporate a wide variety of parameter constraints into learning procedures for Bayesian networks, by formulating this task as a constrained optimization problem. The assumptions made in module networks, dynamic Bayes nets and context specific independence models can be viewed as particular cases of such parameter constraints. We present closed form solutions or fast iterative algorithms for estimating parameters subject to several specific classes of parameter constraints, including equalities and inequalities among parameters, constraints on individual parameters, and constraints on sums and ratios of parameters, for discrete and continuous variables. Our methods cover learning from both frequentist and Bayesian points of view, from both complete and incomplete data. We present formal guarantees for our estimators, as well as methods for automatically learning useful parameter constraints from data. To validate our approach, we apply it to the domain of fMRI brain image analysis. Here we demonstrate the ability of our system to first learn useful relationships among parameters, and then to use them to constrain the training of the Bayesian network, resulting in improved cross-validated accuracy of the learned model. Experiments on synthetic data are also presented. Keywords: Bayesian networks, constrained optimization, domain knowledge
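A toy example of the flavor of closed-form constrained estimate the abstract mentions: the maximum-likelihood parameters of a single multinomial CPT column when some parameters are constrained to be equal, which simply pools the corresponding counts. This illustrates one easy constraint class under stated assumptions, not the paper's general machinery; names are illustrative.

```python
import numpy as np

def mle_with_equalities(counts, tied_groups):
    """counts: outcome counts for one CPT column; tied_groups: lists of indices
    whose parameters are constrained to be equal. Returns the constrained MLE."""
    counts = np.asarray(counts, dtype=float)
    theta = counts / counts.sum()                        # unconstrained MLE
    for group in tied_groups:
        theta[group] = counts[group].sum() / (len(group) * counts.sum())   # pooled estimate
    return theta

print(mle_with_equalities([30, 10, 20], tied_groups=[[1, 2]]))
# unconstrained: [0.50, 0.17, 0.33]; constrained: [0.50, 0.25, 0.25]
```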
16 jmlr-2006-Bounds for Linear Multi-Task Learning
Author: Andreas Maurer
Abstract: We give dimension-free and data-dependent bounds for linear multi-task learning where a common linear operator is chosen to preprocess data for a vector of task specific linear-thresholding classifiers. The complexity penalty of multi-task learning is bounded by a simple expression involving the margins of the task-specific classifiers, the Hilbert-Schmidt norm of the selected preprocessor and the Hilbert-Schmidt norm of the covariance operator for the total mixture of all task distributions, or, alternatively, the Frobenius norm of the total Gramian matrix for the data-dependent version. The results can be compared to state-of-the-art results on linear single-task learning. Keywords: learning to learn, transfer learning, multi-task learning
Author: Magnus Ekdahl, Timo Koski
Abstract: In many pattern recognition/classification problems the true class conditional model and class probabilities are approximated for reasons of reducing complexity and/or of statistical estimation. The approximated classifier is expected to have worse performance, here measured by the probability of correct classification. We present an analysis valid in general, and easily computable formulas for estimating the degradation in probability of correct classification when compared to the optimal classifier. An example of such an approximation is the Naïve Bayes classifier. We show that the performance of the Naïve Bayes classifier depends on the degree of functional dependence between the features and labels. We also provide a sufficient condition for zero loss of performance. Keywords: Bayesian networks, naïve Bayes, plug-in classifier, Kolmogorov distance of variation, variational learning
Author: S. Sathiya Keerthi, Olivier Chapelle, Dennis DeCoste
Abstract: Support vector machines (SVMs), though accurate, are not preferred in applications requiring great classification speed, due to the number of support vectors being large. To overcome this problem we devise a primal method with the following properties: (1) it decouples the idea of basis functions from the concept of support vectors; (2) it greedily finds a set of kernel basis functions of a specified maximum size (dmax) to approximate the SVM primal cost function well; (3) it is efficient and roughly scales as O(n·dmax²), where n is the number of training examples; and (4) the number of basis functions it requires to achieve an accuracy close to the SVM accuracy is usually far less than the number of SVM support vectors. Keywords: SVMs, classification, sparse design
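A hedged sketch of the greedy flavor described above: basis centers are added one at a time from a random candidate pool, each time refitting on the selected kernel columns and keeping the candidate that most reduces the objective. To stay short, the sketch uses a regularized squared loss and a plain l2 penalty on the coefficients rather than the SVM primal, so it conveys the selection scheme, not the paper's exact method; names are illustrative.

```python
import numpy as np

def rbf(A, B, gamma=0.5):
    return np.exp(-gamma * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))

def greedy_basis_selection(X, y, d_max=10, lam=1e-2, n_candidates=30, seed=0):
    rng = np.random.default_rng(seed)
    chosen, beta = [], None
    for _ in range(d_max):
        cand = rng.choice(len(X), size=min(n_candidates, len(X)), replace=False)
        best = (np.inf, None, None)
        for j in cand:
            if j in chosen:
                continue
            idx = chosen + [int(j)]
            K = rbf(X, X[idx])                                   # n x |idx| kernel columns
            b = np.linalg.solve(K.T @ K + lam * np.eye(len(idx)), K.T @ y)
            obj = np.mean((K @ b - y) ** 2) + lam * b @ b
            if obj < best[0]:
                best = (obj, int(j), b)
        chosen.append(best[1])
        beta = best[2]
    return chosen, beta

# Toy regression: d_max basis functions approximate a sine curve.
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, (200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)
centers, beta = greedy_basis_selection(X, y)
pred = rbf(X, X[centers]) @ beta
print("centers:", len(centers), "train MSE:", round(float(np.mean((pred - y) ** 2)), 4))
```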
19 jmlr-2006-Causal Graph Based Decomposition of Factored MDPs
Author: Anders Jonsson, Andrew Barto
Abstract: We present Variable Influence Structure Analysis, or VISA, an algorithm that performs hierarchical decomposition of factored Markov decision processes. VISA uses a dynamic Bayesian network model of actions, and constructs a causal graph that captures relationships between state variables. In tasks with sparse causal graphs VISA exploits structure by introducing activities that cause the values of state variables to change. The result is a hierarchy of activities that together represent a solution to the original task. VISA performs state abstraction for each activity by ignoring irrelevant state variables and lower-level activities. In addition, we describe an algorithm for constructing compact models of the activities introduced. State abstraction and compact activity models enable VISA to apply efficient algorithms to solve the stand-alone subtask associated with each activity. Experimental results show that the decomposition introduced by VISA can significantly accelerate construction of an optimal, or near-optimal, policy. Keywords: Markov decision processes, hierarchical decomposition, state abstraction
20 jmlr-2006-Collaborative Multiagent Reinforcement Learning by Payoff Propagation
Author: Jelle R. Kok, Nikos Vlassis
Abstract: In this article we describe a set of scalable techniques for learning the behavior of a group of agents in a collaborative multiagent setting. As a basis we use the framework of coordination graphs of Guestrin, Koller, and Parr (2002a), which exploits the dependencies between agents to decompose the global payoff function into a sum of local terms. First, we deal with the single-state case and describe a payoff propagation algorithm that computes the individual actions that approximately maximize the global payoff function. The method can be viewed as the decision-making analogue of belief propagation in Bayesian networks. Second, we focus on learning the behavior of the agents in sequential decision-making tasks. We introduce different model-free reinforcement-learning techniques, collectively called Sparse Cooperative Q-learning, which approximate the global action-value function based on the topology of a coordination graph, and perform updates using the contribution of the individual agents to the maximal global action value. The combined use of an edge-based decomposition of the action-value function and the payoff propagation algorithm for efficient action selection results in an approach that scales only linearly in the problem size. We provide experimental evidence that our method outperforms related multiagent reinforcement-learning methods based on temporal differences. Keywords: collaborative multiagent system, coordination graph, reinforcement learning, Q-learning, belief propagation
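A tiny illustration of the payoff-propagation (max-plus) step on a coordination graph: three agents on a chain with two edge payoff functions. On a tree the propagated messages recover the globally optimal joint action, which the sketch verifies against brute force; the payoff tables are random and purely illustrative.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_actions = 3
f01 = rng.normal(size=(n_actions, n_actions))    # edge payoff between agents 0 and 1
f12 = rng.normal(size=(n_actions, n_actions))    # edge payoff between agents 1 and 2

# One forward and one backward sweep of max-plus messages along the chain 0 - 1 - 2.
m_0to1 = f01.max(axis=0)                          # best response of agent 0, per a1
m_2to1 = f12.max(axis=1)                          # best response of agent 2, per a1
m_1to0 = (f01 + m_2to1[None, :]).max(axis=1)      # max-marginal for a0
m_1to2 = (f12 + m_0to1[:, None]).max(axis=0)      # max-marginal for a2

# Each agent maximizes its incoming messages (ties are measure-zero with random payoffs).
a0 = int(np.argmax(m_1to0))
a1 = int(np.argmax(m_0to1 + m_2to1))
a2 = int(np.argmax(m_1to2))

best = max(itertools.product(range(n_actions), repeat=3),
           key=lambda a: f01[a[0], a[1]] + f12[a[1], a[2]])
print((a0, a1, a2), tuple(best))                  # the two joint actions coincide
```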
22 jmlr-2006-Considering Cost Asymmetry in Learning Classifiers
23 jmlr-2006-Consistency and Convergence Rates of One-Class SVMs and Related Algorithms
24 jmlr-2006-Consistency of Multiclass Empirical Risk Minimization Methods Based on Convex Loss
25 jmlr-2006-Distance Patterns in Structural Similarity
28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting
29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification
30 jmlr-2006-Evolutionary Function Approximation for Reinforcement Learning
32 jmlr-2006-Expectation Correction for Smoothed Inference in Switching Linear Dynamical Systems
34 jmlr-2006-Generalized Bradley-Terry Models and Multi-Class Probability Estimates
36 jmlr-2006-In Search of Non-Gaussian Components of a High-Dimensional Distribution
37 jmlr-2006-Incremental Algorithms for Hierarchical Classification
40 jmlr-2006-Infinite-σ Limits For Tikhonov Regularization
43 jmlr-2006-Large Scale Multiple Kernel Learning (Special Topic on Machine Learning and Optimization)
44 jmlr-2006-Large Scale Transductive SVMs
45 jmlr-2006-Learning Coordinate Covariances via Gradients
46 jmlr-2006-Learning Factor Graphs in Polynomial Time and Sample Complexity
47 jmlr-2006-Learning Image Components for Object Recognition
48 jmlr-2006-Learning Minimum Volume Sets
49 jmlr-2006-Learning Parts-Based Representations of Data
52 jmlr-2006-Learning Spectral Clustering, With Application To Speech Separation
53 jmlr-2006-Learning a Hidden Hypergraph
54 jmlr-2006-Learning the Structure of Linear Latent Variable Models
57 jmlr-2006-Linear State-Space Models for Blind Source Separation
58 jmlr-2006-Lower Bounds and Aggregation in Density Estimation
63 jmlr-2006-New Algorithms for Efficient High-Dimensional Nonparametric Classification
64 jmlr-2006-Noisy-OR Component Analysis and its Application to Link Analysis
65 jmlr-2006-Nonparametric Quantile Estimation
66 jmlr-2006-On Model Selection Consistency of Lasso
67 jmlr-2006-On Representing and Generating Kernels by Fuzzy Equivalence Relations
68 jmlr-2006-On the Complexity of Learning Lexicographic Strategies
69 jmlr-2006-One-Class Novelty Detection for Seizure Analysis from Intracranial EEG
70 jmlr-2006-Online Passive-Aggressive Algorithms
73 jmlr-2006-Pattern Recognition for Conditionally Independent Data
74 jmlr-2006-Point-Based Value Iteration for Continuous POMDPs
75 jmlr-2006-Policy Gradient in Continuous Time
76 jmlr-2006-QP Algorithms with Guaranteed Accuracy and Run Time for Support Vector Machines
77 jmlr-2006-Quantile Regression Forests
78 jmlr-2006-Rearrangement Clustering: Pitfalls, Remedies, and Applications
80 jmlr-2006-Segmental Hidden Markov Models with Random Effects for Waveform Modeling
81 jmlr-2006-Some Discriminant-Based PAC Algorithms
82 jmlr-2006-Some Theory for Generalized Boosting Algorithms
84 jmlr-2006-Stability Properties of Empirical Risk Minimization over Donsker Classes
85 jmlr-2006-Statistical Comparisons of Classifiers over Multiple Data Sets
86 jmlr-2006-Step Size Adaptation in Reproducing Kernel Hilbert Space
87 jmlr-2006-Stochastic Complexities of Gaussian Mixtures in Variational Bayesian Approximation
88 jmlr-2006-Streamwise Feature Selection
92 jmlr-2006-Toward Attribute Efficient Learning of Decision Lists and Parities
93 jmlr-2006-Universal Kernels
94 jmlr-2006-Using Machine Learning to Guide Architecture Simulation
95 jmlr-2006-Walk-Sums and Belief Propagation in Gaussian Graphical Models
96 jmlr-2006-Worst-Case Analysis of Selective Sampling for Linear Classification
97 jmlr-2006- (Special Topic on Machine Learning for Computer Security)