jmlr 2009 knowledge graph, by maker-knowledge-mining


similar papers computed by tfidf, lsi, and lda models
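
The similarity lists on this page can be reproduced with standard topic-model tooling. A minimal sketch, assuming a gensim pipeline (the page does not document the actual one); the three toy abstracts stand in for the tokenized abstracts of the 100 papers below:

```python
from gensim import corpora, models, similarities

abstracts = [
    "importance estimation covariate shift adaptation outlier detection",
    "collaborative filtering spectral regularization matrix completion",
    "false discovery rate adaptive multiple testing p-values",
]
texts = [doc.lower().split() for doc in abstracts]
dictionary = corpora.Dictionary(texts)
bow = [dictionary.doc2bow(t) for t in texts]

tfidf = models.TfidfModel(bow)                 # the "tfidf model"
index = similarities.MatrixSimilarity(tfidf[bow], num_features=len(dictionary))

# Rank all papers by cosine similarity to paper 0; the lsi and lda lists would
# swap in models.LsiModel / models.LdaModel before building the index.
print(sorted(enumerate(index[tfidf[bow[0]]]), key=lambda x: -x[1]))
```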


papers list:

1 jmlr-2009-A Least-squares Approach to Direct Importance Estimation

Author: Takafumi Kanamori, Shohei Hido, Masashi Sugiyama

Abstract: We address the problem of estimating the ratio of two probability density functions, which is often referred to as the importance. The importance values can be used for various succeeding tasks such as covariate shift adaptation or outlier detection. In this paper, we propose a new importance estimation method that has a closed-form solution; the leave-one-out cross-validation score can also be computed analytically. Therefore, the proposed method is computationally highly efficient and simple to implement. We also elucidate theoretical properties of the proposed method such as the convergence rate and approximation error bounds. Numerical experiments show that the proposed method is comparable to the best existing method in accuracy, while it is computationally more efficient than competing approaches. Keywords: importance sampling, covariate shift adaptation, novelty detection, regularization path, leave-one-out cross validation
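
The closed-form solution the abstract refers to reduces to a ridge-regularized linear system over a set of basis functions. A minimal numpy sketch in the spirit of the unconstrained variant; the Gaussian bandwidth, ridge parameter, and number of centers below are illustrative assumptions (the paper selects such quantities by cross-validation):

```python
import numpy as np

def gaussian_design(x, centers, sigma):
    # Phi[i, l] = exp(-||x_i - c_l||^2 / (2 sigma^2))
    d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def ulsif(x_nu, x_de, sigma=1.0, lam=0.1, n_centers=100):
    rng = np.random.default_rng(0)
    centers = x_nu[rng.choice(len(x_nu), min(n_centers, len(x_nu)), replace=False)]
    Phi_de = gaussian_design(x_de, centers, sigma)
    Phi_nu = gaussian_design(x_nu, centers, sigma)
    H = Phi_de.T @ Phi_de / len(x_de)          # empirical E_de[phi phi^T]
    h = Phi_nu.mean(axis=0)                    # empirical E_nu[phi]
    alpha = np.linalg.solve(H + lam * np.eye(len(h)), h)
    alpha = np.maximum(alpha, 0.0)             # keep the ratio estimate nonnegative
    return lambda x: gaussian_design(x, centers, sigma) @ alpha

# w_hat = ulsif(x_numerator, x_denominator); w_hat(x) estimates p_nu(x)/p_de(x)
```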

2 jmlr-2009-A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization

Author: Jacob Abernethy, Francis Bach, Theodoros Evgeniou, Jean-Philippe Vert

Abstract: We present a general approach for collaborative filtering (CF) using spectral regularization to learn linear operators mapping a set of “users” to a set of possibly desired “objects”. In particular, several recent low-rank type matrix-completion methods for CF are shown to be special cases of our proposed framework. Unlike existing regularization-based CF, our approach can be used to incorporate additional information such as attributes of the users/objects—a feature currently lacking in existing regularization-based CF approaches—using popular and well-known kernel methods. We provide novel representer theorems that we use to develop new estimation methods. We then provide learning algorithms based on low-rank decompositions and test them on a standard CF data set. The experiments indicate the advantages of generalizing the existing regularization-based CF methods to incorporate related information about users and objects. Finally, we show that certain multi-task learning methods can also be seen as special cases of our proposed approach. Keywords: collaborative filtering, matrix completion, kernel methods, spectral regularization
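
One concrete low-rank special case covered by such spectral-regularization frameworks is nuclear-norm penalized matrix completion. A minimal illustrative sketch, not the paper's estimation method: iterated SVD soft-thresholding that agrees with the observed entries:

```python
import numpy as np

def complete(R, mask, tau=1.0, iters=100):
    """R: user-object rating matrix; mask: boolean, True where R is observed."""
    X = np.zeros_like(R, dtype=float)
    for _ in range(iters):
        # fill unobserved entries with the current estimate, keep observed ones
        filled = np.where(mask, R, X)
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        s = np.maximum(s - tau, 0.0)           # shrink the spectrum (nuclear norm)
        X = (U * s) @ Vt
    return X
```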

3 jmlr-2009-A Parameter-Free Classification Method for Large Scale Learning

Author: Marc Boullé

Abstract: With the rapid growth of computer storage capacities, available data and demand for scoring models both follow an increasing trend, sharper than that of the processing power. However, the main limitation to the widespread adoption of data mining solutions is the non-increasing availability of skilled data analysts, who play a key role in data preparation and model selection. In this paper, we present a parameter-free scalable classification method, which is a step towards fully automatic data mining. The method is based on Bayes optimal univariate conditional density estimators, naive Bayes classification enhanced with a Bayesian variable selection scheme, and averaging of models using a logarithmic smoothing of the posterior distribution. We focus on the complexity of the algorithms and show how they can cope with data sets that are far larger than the available central memory. We finally report results on the Large Scale Learning challenge, where our method obtains state of the art performance within practicable computation time. Keywords: large scale learning, naive Bayes, Bayesianism, model selection, model averaging

4 jmlr-2009-A Survey of Accuracy Evaluation Metrics of Recommendation Tasks

Author: Asela Gunawardana, Guy Shani

Abstract: Recommender systems are now popular both commercially and in the research community, where many algorithms have been suggested for providing recommendations. These algorithms typically perform differently in various domains and tasks. Therefore, it is important from the research perspective, as well as from a practical view, to be able to decide on an algorithm that matches the domain and the task of interest. The standard way to make such decisions is by comparing a number of algorithms offline using some evaluation metric. Indeed, many evaluation metrics have been suggested for comparing recommendation algorithms. The decision on the proper evaluation metric is often critical, as each metric may favor a different algorithm. In this paper we review the proper construction of offline experiments for deciding on the most appropriate algorithm. We discuss three important tasks of recommender systems, and classify a set of appropriate well known evaluation metrics for each task. We demonstrate how using an improper evaluation metric can lead to the selection of an improper algorithm for the task of interest. We also discuss other important considerations when designing offline experiments. Keywords: recommender systems, collaborative filtering, statistical analysis, comparative studies
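
For the top-N recommendation task, two of the standard offline metrics such a survey classifies are precision and recall at a cutoff k. A minimal sketch; the function names are ours:

```python
def precision_at_k(recommended, relevant, k):
    """recommended: ranked list of item ids; relevant: set of truly relevant ids."""
    top = recommended[:k]
    return len(set(top) & set(relevant)) / k

def recall_at_k(recommended, relevant, k):
    top = recommended[:k]
    return len(set(top) & set(relevant)) / len(relevant) if relevant else 0.0

# precision_at_k(["a", "b", "c"], {"b", "d"}, k=3) -> 1/3
```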

5 jmlr-2009-Adaptive False Discovery Rate Control under Independence and Dependence

Author: Gilles Blanchard, Étienne Roquain

Abstract: In the context of multiple hypothesis testing, the proportion π0 of true null hypotheses in the pool of hypotheses to test often plays a crucial role, although it is generally unknown a priori. A testing procedure using an implicit or explicit estimate of this quantity in order to improve its efficiency is called adaptive. In this paper, we focus on the issue of false discovery rate (FDR) control and we present new adaptive multiple testing procedures with control of the FDR. In the first part, assuming independence of the p-values, we present two new procedures and give a unified review of other existing adaptive procedures that have provably controlled FDR. We report extensive simulation results comparing these procedures and testing their robustness when the independence assumption is violated. The new proposed procedures appear competitive with existing ones. The overall best, though, is reported to be Storey’s estimator, albeit for a specific parameter setting that does not appear to have been considered before. In the second part, we propose adaptive versions of step-up procedures that have provably controlled FDR under positive dependence and unspecified dependence of the p-values, respectively. In the latter case, while simulations only show an improvement over non-adaptive procedures in limited situations, these are to our knowledge among the first theoretically founded adaptive multiple testing procedures that control the FDR when the p-values are not independent. Keywords: multiple testing, false discovery rate, adaptive procedure, positive regression dependence, p-values
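
A minimal numpy sketch of one adaptive procedure of the kind studied here: the Benjamini-Hochberg step-up run at level alpha divided by an estimate of π0, using a Storey-type plug-in (the threshold lam and the +1 correction in this sketch are our assumptions, not the paper's exact recipe):

```python
import numpy as np

def storey_pi0(p, lam=0.5):
    # estimate the proportion of true nulls from p-values above lam
    return min(1.0, (np.sum(p > lam) + 1) / (len(p) * (1 - lam)))

def adaptive_bh(p, alpha=0.05, lam=0.5):
    p = np.asarray(p)
    m = len(p)
    alpha_eff = alpha / storey_pi0(p, lam)     # adaptive step: inflate the level
    order = np.argsort(p)
    thresh = alpha_eff * np.arange(1, m + 1) / m
    below = p[order] <= thresh
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    rejected = np.zeros(m, dtype=bool)
    rejected[order[:k]] = True                 # reject the k smallest p-values
    return rejected
```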

6 jmlr-2009-An Algorithm for Reading Dependencies from the Minimal Undirected Independence Map of a Graphoid that Satisfies Weak Transitivity

Author: Jose M. Peña, Roland Nilsson, Johan Björkegren, Jesper Tegnér

Abstract: We present a sound and complete graphical criterion for reading dependencies from the minimal undirected independence map G of a graphoid M that satisfies weak transitivity. Here, complete means that it is able to read all the dependencies in M that can be derived by applying the graphoid properties and weak transitivity to the dependencies used in the construction of G and the independencies obtained from G by vertex separation. We argue that assuming weak transitivity is not too restrictive. As an intermediate step in the derivation of the graphical criterion, we prove that for any undirected graph G there exists a strictly positive discrete probability distribution with the prescribed sample spaces that is faithful to G. We also report an algorithm that implements the graphical criterion and whose running time is at most O(n²(e + n)) for n nodes and e edges. Finally, we illustrate how the graphical criterion can be used within bioinformatics to identify biologically meaningful gene dependencies. Keywords: graphical models, vertex separation, graphoids, weak transitivity, bioinformatics

7 jmlr-2009-An Analysis of Convex Relaxations for MAP Estimation of Discrete MRFs

Author: M. Pawan Kumar, Vladimir Kolmogorov, Philip H.S. Torr

Abstract: The problem of obtaining the maximum a posteriori estimate of a general discrete Markov random field (i.e., a Markov random field defined using a discrete set of labels) is known to be NP-hard. However, due to its central importance in many applications, several approximation algorithms have been proposed in the literature. In this paper, we present an analysis of three such algorithms based on convex relaxations: (i) LP-S: the linear programming (LP) relaxation proposed by Schlesinger (1976) for a special case and independently in Chekuri et al. (2001), Koster et al. (1998), and Wainwright et al. (2005) for the general case; (ii) QP-RL: the quadratic programming (QP) relaxation of Ravikumar and Lafferty (2006); and (iii) SOCP-MS: the second order cone programming (SOCP) relaxation first proposed by Muramatsu and Suzuki (2003) for two label problems and later extended by Kumar et al. (2006) for a general label set. We show that the SOCP-MS and the QP-RL relaxations are equivalent. Furthermore, we prove that despite the flexibility in the form of the constraints/objective function offered by QP and SOCP, the LP-S relaxation strictly dominates (i.e., provides a better approximation than) QP-RL and SOCP-MS. We generalize these results by defining a large class of SOCP (and equivalent QP) relaxations which is dominated by the LP-S relaxation. Based on these results we propose some novel SOCP relaxations which define constraints using random variables that form cycles or cliques in the graphical model representation of the random field. Using some examples we show that the new SOCP relaxations strictly dominate the previous approaches. Keywords: probabilistic models, MAP estimation, discrete MRF, convex relaxations, linear programming, second-order cone programming, quadratic programming, dominating relaxations

8 jmlr-2009-An Anticorrelation Kernel for Subsystem Training in Multiple Classifier Systems

Author: Luciana Ferrer, Kemal Sönmez, Elizabeth Shriberg

Abstract: We present a method for training support vector machine (SVM)-based classification systems for combination with other classification systems designed for the same task. Ideally, a new system should be designed such that, when combined with existing systems, the resulting performance is optimized. We present a simple model for this problem and use the understanding gained from this analysis to propose a method to achieve better combination performance when training SVM systems. We include a regularization term in the SVM objective function that aims to reduce the average class-conditional covariance between the resulting scores and the scores produced by the existing systems, introducing a trade-off between such covariance and the system’s individual performance. That is, the new system “takes one for the team”, falling somewhat short of its best possible performance in order to increase the diversity of the ensemble. We report results on the NIST 2005 and 2006 speaker recognition evaluations (SREs) for a variety of subsystems. We show a gain of 19% on the equal error rate (EER) of a combination of four systems when applying the proposed method with respect to the performance obtained when the four systems are trained independently of each other. Keywords: system combination, ensemble diversity, multiple classifier systems, support vector machines, speaker recognition, kernel methods

9 jmlr-2009-Analysis of Perceptron-Based Active Learning

Author: Sanjoy Dasgupta, Adam Tauman Kalai, Claire Monteleoni

Abstract: We start by showing that in an active learning setting, the Perceptron algorithm needs Ω(1/ε²) labels to learn linear separators within generalization error ε. We then present a simple active learning algorithm for this problem, which combines a modification of the Perceptron update with an adaptive filtering rule for deciding which points to query. For data distributed uniformly over the unit sphere, we show that our algorithm reaches generalization error ε after asking for just Õ(d log(1/ε)) labels. This exponential improvement over the usual sample complexity of supervised learning had previously been demonstrated only for the computationally more complex query-by-committee algorithm. Keywords: active learning, perceptron, label complexity bounds, online learning
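
A minimal sketch of the algorithmic idea described above: a modified Perceptron update combined with an adaptive filtering rule that queries labels only near the current hyperplane. The constants and the halving schedule are illustrative assumptions, not the paper's tuned values:

```python
import numpy as np

def active_perceptron(stream, oracle, s=1.0, R=30):
    """stream: iterable of unit vectors x; oracle(x) returns the true label +-1."""
    w = None
    since_mistake = 0
    for x in stream:
        if w is None:
            w = oracle(x) * x            # initialize from the first labeled point
            continue
        margin = w @ x
        if abs(margin) > s:              # confident region: do not spend a label
            continue
        y = oracle(x)                    # query the label
        if np.sign(margin) != y:         # modified Perceptron update on mistakes:
            w = w - 2 * margin * x       # a reflection that keeps ||w|| constant
            since_mistake = 0
        else:
            since_mistake += 1
            if since_mistake >= R:       # no mistakes for a while: shrink the
                s /= 2.0                 # query region (adaptive filtering rule)
                since_mistake = 0
    return w
```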

10 jmlr-2009-Application of Non Parametric Empirical Bayes Estimation to High Dimensional Classification

Author: Eitan Greenshtein, Junyong Park

Abstract: We consider the problem of classification using a high dimensional feature space. In a paper by Bickel and Levina (2004), it is recommended to use naive-Bayes classifiers, that is, to treat the features as if they are statistically independent. Consider now a sparse setup, where only a few of the features are informative for classification. Fan and Fan (2008) suggested a variable selection and classification method, called FAIR. The FAIR method improves the design of naive-Bayes classifiers in sparse setups. The improvement is due to reducing the noise in estimating the features’ means; only the means of a few selected variables need to be estimated. We also consider the design of naive Bayes classifiers. We show that a good alternative to variable selection is estimation of the means through a certain nonparametric empirical Bayes procedure. In sparse setups the empirical Bayes procedure implicitly performs an efficient variable selection. It also adapts very well to non sparse setups, and has the advantage of making use of the information from many “weakly informative” variables, which variable-selection classification procedures give up on using. We compare our method with FAIR and other classification methods in simulations for sparse and non sparse setups, and in real data examples involving classification of normal versus malignant tissues based on microarray data. Keywords: non parametric empirical Bayes, high dimension, classification
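
One classical way to realize nonparametric empirical Bayes estimation of many means at once is Tweedie's formula with a kernel density estimate. A minimal sketch under that assumption; the paper's exact procedure and all constants here are not taken from the source:

```python
import numpy as np

def npeb_means(z, sigma=1.0, bw=0.3):
    """z: 1-D vector of observed feature means, modeled as z_i ~ N(mu_i, sigma^2).
    Tweedie: E[mu | z] = z + sigma^2 * f'(z) / f(z), with f estimated by a KDE
    (kernel normalization constants cancel in the ratio f'/f)."""
    diff = z[:, None] - z[None, :]
    k = np.exp(-diff ** 2 / (2 * bw ** 2))        # Gaussian kernel matrix
    f = k.mean(axis=1)                            # density estimate at each z_i
    fprime = (-diff / bw ** 2 * k).mean(axis=1)   # derivative of the KDE
    return z + sigma ** 2 * fprime / f            # shrinks uninformative means
```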

11 jmlr-2009-Bayesian Network Structure Learning by Recursive Autonomy Identification

Author: Raanan Yehezkel, Boaz Lerner

Abstract: We propose the recursive autonomy identification (RAI) algorithm for constraint-based (CB) Bayesian network structure learning. The RAI algorithm learns the structure by sequential application of conditional independence (CI) tests, edge direction and structure decomposition into autonomous sub-structures. The sequence of operations is performed recursively for each autonomous substructure while simultaneously increasing the order of the CI test. While other CB algorithms d-separate structures and then direct the resulted undirected graph, the RAI algorithm combines the two processes from the outset and along the procedure. By this means and due to structure decomposition, learning a structure using RAI requires a smaller number of CI tests of high orders. This reduces the complexity and run-time of the algorithm and increases the accuracy by diminishing the curse-of-dimensionality. When the RAI algorithm learned structures from databases representing synthetic problems, known networks and natural problems, it demonstrated superiority with respect to computational complexity, run-time, structural correctness and classification accuracy over the PC, Three Phase Dependency Analysis, Optimal Reinsertion, greedy search, Greedy Equivalence Search, Sparse Candidate, and Max-Min Hill-Climbing algorithms. Keywords: Bayesian networks, constraint-based structure learning

12 jmlr-2009-Bi-Level Path Following for Cross Validated Solution of Kernel Quantile Regression

Author: Saharon Rosset

Abstract: We show how to follow the path of cross validated solutions to families of regularized optimization problems, defined by a combination of a parameterized loss function and a regularization term. A primary example is kernel quantile regression, where the parameter of the loss function is the quantile being estimated. Even though the bi-level optimization problem we encounter for every quantile is non-convex, the manner in which the optimal cross-validated solution evolves with the parameter of the loss function allows tracking of this solution. We prove this property, construct the resulting algorithm, and demonstrate it on real and artificial data. This algorithm allows us to efficiently solve the whole family of bi-level problems. We show how it can be extended to cover other modeling problems, like support vector regression, and alternative in-sample model selection approaches.
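
The parameterized loss in the kernel quantile regression example is the pinball loss, whose parameter τ is the quantile being estimated. A minimal sketch:

```python
import numpy as np

def pinball_loss(y, f, tau):
    """tau in (0, 1): the quantile being estimated."""
    r = y - f
    return np.where(r >= 0, tau * r, (tau - 1) * r)

# At the minimizer of E[pinball_loss(Y, f, tau)], f is the tau-quantile of Y;
# e.g., tau = 0.5 recovers the conditional median (absolute loss up to scale).
```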

13 jmlr-2009-Bounded Kernel-Based Online Learning

Author: Francesco Orabona, Joseph Keshet, Barbara Caputo

Abstract: A common problem of kernel-based online algorithms, such as the kernel-based Perceptron algorithm, is the amount of memory required to store the online hypothesis, which may increase without bound as the algorithm progresses. Furthermore, the computational load of such algorithms grows linearly with the amount of memory used to store the hypothesis. To attack these problems, most previous work has focused on discarding some of the instances, in order to keep the memory bounded. In this paper we present a new algorithm, in which the instances are not discarded, but are instead projected onto the space spanned by the previous online hypothesis. We call this algorithm Projectron. While the memory size of the Projectron solution cannot be predicted before training, we prove that its solution is guaranteed to be bounded. We derive a relative mistake bound for the proposed algorithm, and deduce from it a slightly different algorithm which outperforms the Perceptron. We call this second algorithm Projectron++. We show that this algorithm can be extended to handle the multiclass and the structured output settings, resulting, as far as we know, in the first online bounded algorithm that can learn complex classification tasks. The method of bounding the hypothesis representation can be applied to any conservative online algorithm and to other online algorithms, as is demonstrated for ALMA2. Experimental results on various data sets show the empirical advantage of our technique compared to various bounded online algorithms, both in terms of memory and accuracy. Keywords: online learning, kernel methods, support vector machines, bounded support set
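
A minimal sketch of the projection step that keeps the support set bounded: on a mistake, try to express the new instance in the span of the current support set, and only enlarge the set when the projection error exceeds a threshold η. This is a simplified reading of the paper, with a Gaussian kernel and an incremental block inverse assumed:

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    return np.exp(-gamma * np.sum((a - b) ** 2))

class ProjectronSketch:
    def __init__(self, eta=0.1, gamma=1.0):
        self.S, self.alpha = [], []          # support vectors and coefficients
        self.Kinv = None                     # inverse kernel matrix of S
        self.eta, self.gamma = eta, gamma

    def predict(self, x):
        return sum(a * rbf(s, x, self.gamma) for s, a in zip(self.S, self.alpha))

    def update(self, x, y):
        if y * self.predict(x) > 0:
            return                           # no mistake, no update
        if not self.S:
            self.S, self.alpha = [x], [float(y)]
            self.Kinv = np.array([[1.0 / rbf(x, x, self.gamma)]])
            return
        k = np.array([rbf(s, x, self.gamma) for s in self.S])
        d = self.Kinv @ k                    # projection coefficients
        err2 = rbf(x, x, self.gamma) - k @ d # squared projection error
        if err2 <= self.eta ** 2:
            # project: absorb the update into the existing support vectors
            self.alpha = [a + y * di for a, di in zip(self.alpha, d)]
        else:
            # grow the support set; update Kinv by block (Schur) inversion
            n = len(self.S)
            Kinv = np.zeros((n + 1, n + 1))
            Kinv[:n, :n] = self.Kinv + np.outer(d, d) / err2
            Kinv[:n, n] = Kinv[n, :n] = -d / err2
            Kinv[n, n] = 1.0 / err2
            self.S.append(x); self.alpha.append(float(y)); self.Kinv = Kinv
```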

14 jmlr-2009-CarpeDiem: Optimizing the Viterbi Algorithm and Applications to Supervised Sequential Learning

Author: Roberto Esposito, Daniele P. Radicioni

Abstract: The growth of information available to learning systems and the increasing complexity of learning tasks determine the need for devising algorithms that scale well with respect to all learning parameters. In the context of supervised sequential learning, the Viterbi algorithm plays a fundamental role, by allowing the evaluation of the best (most probable) sequence of labels with a time complexity linear in the number of time events, and quadratic in the number of labels. In this paper we propose CarpeDiem, a novel algorithm allowing the evaluation of the best possible sequence of labels with a sub-quadratic time complexity. We provide theoretical grounding together with solid empirical results supporting two chief facts: CarpeDiem always finds the optimal solution, requiring in most cases only a small fraction of the time taken by the Viterbi algorithm; meanwhile, CarpeDiem is never asymptotically worse than the Viterbi algorithm, thus confirming it as a sound replacement. Keywords: Viterbi algorithm, sequence labeling, conditional models, classifiers optimization, exact inference
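
For reference, the baseline that CarpeDiem improves on: the standard Viterbi recursion, quadratic in the number of labels K. A minimal log-space numpy sketch:

```python
import numpy as np

def viterbi(log_init, log_trans, log_emit):
    """log_init: (K,), log_trans: (K, K), log_emit: (T, K) scores.
    Returns the most probable label sequence in O(T * K^2) time."""
    T, K = log_emit.shape
    delta = log_init + log_emit[0]
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans              # (prev, next) pairs
        back[t] = np.argmax(scores, axis=0)              # best predecessor
        delta = scores[back[t], np.arange(K)] + log_emit[t]
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):                        # backtrace
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```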

15 jmlr-2009-Cautious Collective Classification

Author: Luke K. McDowell, Kalyan Moy Gupta, David W. Aha

Abstract: Many collective classification (CC) algorithms have been shown to increase accuracy when instances are interrelated. However, CC algorithms must be carefully applied because their use of estimated labels can in some cases decrease accuracy. In this article, we show that managing this label uncertainty through cautious algorithmic behavior is essential to achieving maximal, robust performance. First, we describe cautious inference and explain how four well-known families of CC algorithms can be parameterized to use varying degrees of such caution. Second, we introduce cautious learning and show how it can be used to improve the performance of almost any CC algorithm, with or without cautious inference. We then evaluate cautious inference and learning for the four collective inference families, with three local classifiers and a range of both synthetic and real-world data. We find that cautious learning and cautious inference typically outperform less cautious approaches. In addition, we identify the data characteristics that predict more substantial performance differences. Our results reveal that the degree of caution used usually has a larger impact on performance than the choice of the underlying inference algorithm. Together, these results identify the most appropriate CC algorithms to use for particular task characteristics and explain multiple conflicting findings from prior CC research. Keywords: collective inference, statistical relational learning, approximate probabilistic inference, networked data, cautious inference
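
A minimal sketch of cautious inference in the iterative-classification family: each round commits only the most confident predicted labels as relational features for the next round. The linear commitment schedule is an illustrative assumption, and local_clf is assumed to have been trained on attribute features concatenated with neighbor-label counts:

```python
import numpy as np

def cautious_ica(local_clf, features, neighbors, n_classes, iters=10):
    """features: (n, d) attributes; neighbors[i]: list of linked node indices."""
    n = len(features)
    rel = np.zeros((n, n_classes))                 # relational features start empty
    proba = local_clf.predict_proba(np.hstack([features, rel]))
    labels = np.argmax(proba, axis=1)
    for it in range(1, iters + 1):
        conf = np.max(proba, axis=1)
        keep = int(n * it / iters)                 # commit more labels each round
        committed = set(np.argsort(-conf)[:keep].tolist())
        rel[:] = 0
        for i in range(n):
            for j in neighbors[i]:
                if j in committed:                 # cautious: only count neighbor
                    rel[i, labels[j]] += 1         # labels we are confident about
        proba = local_clf.predict_proba(np.hstack([features, rel]))
        labels = np.argmax(proba, axis=1)
    return labels
```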

16 jmlr-2009-Classification with Gaussians and Convex Loss

Author: Dao-Hong Xiang, Ding-Xuan Zhou

Abstract: This paper considers binary classification algorithms generated from Tikhonov regularization schemes associated with general convex loss functions and varying Gaussian kernels. Our main goal is to provide fast convergence rates for the excess misclassification error. Allowing varying Gaussian kernels in the algorithms improves learning rates measured by regularization error and sample error. Special structures of Gaussian kernels enable us to construct, by a nice approximation scheme with a Fourier analysis technique, uniformly bounded regularizing functions achieving polynomial decays of the regularization error under a Sobolev smoothness condition. The sample error is estimated by using a projection operator and a tight bound for the covering numbers of reproducing kernel Hilbert spaces generated by Gaussian kernels. The convexity of the general loss function plays a very important role in our analysis. Keywords: reproducing kernel Hilbert space, binary classification, general convex loss, varying Gaussian kernels, covering number, approximation

17 jmlr-2009-Computing Maximum Likelihood Estimates in Recursive Linear Models with Correlated Errors

Author: Mathias Drton, Michael Eichler, Thomas S. Richardson

Abstract: In recursive linear models, the multivariate normal joint distribution of all variables exhibits a dependence structure induced by a recursive (or acyclic) system of linear structural equations. These linear models have a long tradition and appear in seemingly unrelated regressions, structural equation modelling, and approaches to causal inference. They are also related to Gaussian graphical models via a classical representation known as a path diagram. Despite the models’ long history, a number of problems remain open. In this paper, we address the problem of computing maximum likelihood estimates in the subclass of ‘bow-free’ recursive linear models. The term ‘bow-free’ refers to the condition that the errors for variables i and j be uncorrelated if variable i occurs in the structural equation for variable j. We introduce a new algorithm, termed Residual Iterative Conditional Fitting (RICF), that can be implemented using only least squares computations. In contrast to existing algorithms, RICF has clear convergence properties and yields exact maximum likelihood estimates after the first iteration whenever the MLE is available in closed form. Keywords: linear regression, maximum likelihood estimation, path diagram, structural equation model, recursive semi-Markov model, residual iterative conditional fitting

18 jmlr-2009-Consistency and Localizability

Author: Alon Zakai, Ya'acov Ritov

Abstract: We show that all consistent learning methods—that is, that asymptotically achieve the lowest possible expected loss for any distribution on (X,Y)—are necessarily localizable, by which we mean that they do not significantly change their response at a particular point when we show them only the part of the training set that is close to that point. This is true in particular for methods that appear to be defined in a non-local manner, such as support vector machines in classification and least-squares estimators in regression. Aside from showing that consistency implies a specific form of localizability, we also show that consistency is logically equivalent to the combination of two properties: (1) a form of localizability, and (2) that the method’s global mean (over the entire X distribution) correctly estimates the true mean. Consistency can therefore be seen as comprised of two aspects, one local and one global. Keywords: consistency, local learning, regression, classification

19 jmlr-2009-Controlling the False Discovery Rate of the Association Causality Structure Learned with the PC Algorithm    (Special Topic on Mining and Learning with Graphs and Relations)

Author: Junning Li, Z. Jane Wang

Abstract: In real world applications, graphical statistical models are not only a tool for operations such as classification or prediction, but usually the network structures of the models themselves are also of great interest (e.g., in modeling brain connectivity). The false discovery rate (FDR), the expected ratio of falsely claimed connections to all those claimed, is often a reasonable error-rate criterion in these applications. However, current learning algorithms for graphical models have not been adequately adapted to the concerns of the FDR. The traditional practice of controlling the type I error rate and the type II error rate under a conventional level does not necessarily keep the FDR low, especially in the case of sparse networks. In this paper, we propose embedding an FDR-control procedure into the PC algorithm to curb the FDR of the skeleton of the learned graph. We prove that the proposed method can control the FDR under a user-specified level at the limit of large sample sizes. In the cases of moderate sample size (about several hundred), empirical experiments show that the method is still able to control the FDR under the user-specified level, and a heuristic modification of the method is able to control the FDR more accurately around the user-specified level. The proposed method is applicable to any models for which statistical tests of conditional independence are available, such as discrete models and Gaussian models. Keywords: Bayesian networks, false discovery rate, PC algorithm, directed acyclic graph, skeleton

20 jmlr-2009-DL-Learner: Learning Concepts in Description Logics

Author: Jens Lehmann

Abstract: In this paper, we introduce DL-Learner, a framework for learning in description logics and OWL. OWL is the official W3C standard ontology language for the Semantic Web. Concepts in this language can be learned for constructing and maintaining OWL ontologies or for solving problems similar to those in Inductive Logic Programming. DL-Learner includes several learning algorithms, support for different OWL formats, reasoner interfaces, and learning problems. It is a cross-platform framework implemented in Java. The framework allows easy programmatic access and provides a command line interface, a graphical interface as well as a WSDL-based web service. Keywords: concept learning, description logics, OWL, classification, open-source

21 jmlr-2009-Data-driven Calibration of Penalties for Least-Squares Regression

22 jmlr-2009-Deterministic Error Analysis of Support Vector Regression and Related Regularized Kernel Methods

23 jmlr-2009-Discriminative Learning Under Covariate Shift

24 jmlr-2009-Distance Metric Learning for Large Margin Nearest Neighbor Classification

25 jmlr-2009-Distributed Algorithms for Topic Models

26 jmlr-2009-Dlib-ml: A Machine Learning Toolkit    (Machine Learning Open Source Software Paper)

27 jmlr-2009-Efficient Online and Batch Learning Using Forward Backward Splitting

28 jmlr-2009-Entropy Inference and the James-Stein Estimator, with Application to Nonlinear Gene Association Networks

29 jmlr-2009-Estimating Labels from Label Proportions

30 jmlr-2009-Estimation of Sparse Binary Pairwise Markov Networks using Pseudo-likelihoods

31 jmlr-2009-Evolutionary Model Type Selection for Global Surrogate Modeling

32 jmlr-2009-Exploiting Product Distributions to Identify Relevant Variables of Correlation Immune Functions

33 jmlr-2009-Exploring Strategies for Training Deep Neural Networks

34 jmlr-2009-Fast Approximate kNN Graph Construction for High Dimensional Data via Recursive Lanczos Bisection

35 jmlr-2009-Feature Selection with Ensembles, Artificial Variables, and Redundancy Elimination    (Special Topic on Model Selection)

36 jmlr-2009-Fourier Theoretic Probabilistic Inference over Permutations

37 jmlr-2009-Generalization Bounds for Ranking Algorithms via Algorithmic Stability

38 jmlr-2009-Hash Kernels for Structured Data

39 jmlr-2009-Hybrid MPI/OpenMP Parallel Linear Support Vector Machine Training

40 jmlr-2009-Identification of Recurrent Neural Networks by Bayesian Interrogation Techniques

41 jmlr-2009-Improving the Reliability of Causal Discovery from Small Data Sets Using Argumentation    (Special Topic on Causality)

42 jmlr-2009-Incorporating Functional Knowledge in Neural Networks

43 jmlr-2009-Java-ML: A Machine Learning Library    (Machine Learning Open Source Software Paper)

44 jmlr-2009-Learning Acyclic Probabilistic Circuits Using Test Paths

45 jmlr-2009-Learning Approximate Sequential Patterns for Classification

46 jmlr-2009-Learning Halfspaces with Malicious Noise

47 jmlr-2009-Learning Linear Ranking Functions for Beam Search with Application to Planning

48 jmlr-2009-Learning Nondeterministic Classifiers

49 jmlr-2009-Learning Permutations with Exponential Weights

50 jmlr-2009-Learning When Concepts Abound

51 jmlr-2009-Low-Rank Kernel Learning with Bregman Matrix Divergences

52 jmlr-2009-Margin-based Ranking and an Equivalence between AdaBoost and RankBoost

53 jmlr-2009-Marginal Likelihood Integrals for Mixtures of Independence Models

54 jmlr-2009-Markov Properties for Linear Causal Models with Correlated Errors    (Special Topic on Causality)

55 jmlr-2009-Maximum Entropy Discrimination Markov Networks

56 jmlr-2009-Model Monitor (M2): Evaluating, Comparing, and Monitoring Models    (Machine Learning Open Source Software Paper)

57 jmlr-2009-Multi-task Reinforcement Learning in Partially Observable Stochastic Environments

58 jmlr-2009-NEUROSVM: An Architecture to Reduce the Effect of the Choice of Kernel on the Performance of SVM

59 jmlr-2009-Nearest Neighbor Clustering: A Baseline Method for Consistent Clustering with Arbitrary Objective Functions

60 jmlr-2009-Nieme: Large-Scale Energy-Based Models    (Machine Learning Open Source Software Paper)

61 jmlr-2009-Nonextensive Information Theoretic Kernels on Measures

62 jmlr-2009-Nonlinear Models Using Dirichlet Process Mixtures

63 jmlr-2009-On Efficient Large Margin Semisupervised Learning: Method and Theory

64 jmlr-2009-On The Power of Membership Queries in Agnostic Learning

65 jmlr-2009-On Uniform Deviations of General Empirical Risks with Unboundedness, Dependence, and High Dimensionality

66 jmlr-2009-On the Consistency of Feature Selection using Greedy Least Squares Regression

67 jmlr-2009-Online Learning with Sample Path Constraints

68 jmlr-2009-Online Learning with Samples Drawn from Non-identical Distributions

69 jmlr-2009-Optimized Cutting Plane Algorithm for Large-Scale Risk Minimization

70 jmlr-2009-Particle Swarm Model Selection    (Special Topic on Model Selection)

71 jmlr-2009-Perturbation Corrections in Approximate Inference: Mixture Modelling Applications

72 jmlr-2009-Polynomial-Delay Enumeration of Monotonic Graph Classes

73 jmlr-2009-Prediction With Expert Advice For The Brier Game

74 jmlr-2009-Properties of Monotonic Effects on Directed Acyclic Graphs

75 jmlr-2009-Provably Efficient Learning with Typed Parametric Models

76 jmlr-2009-Python Environment for Bayesian Learning: Inferring the Structure of Bayesian Networks from Knowledge and Data    (Machine Learning Open Source Software Paper)

77 jmlr-2009-RL-Glue: Language-Independent Software for Reinforcement-Learning Experiments    (Machine Learning Open Source Software Paper)

78 jmlr-2009-Refinement of Reproducing Kernels

79 jmlr-2009-Reinforcement Learning in Finite MDPs: PAC Analysis

80 jmlr-2009-Reproducing Kernel Banach Spaces for Machine Learning

81 jmlr-2009-Robust Process Discovery with Artificial Negative Events    (Special Topic on Mining and Learning with Graphs and Relations)

82 jmlr-2009-Robustness and Regularization of Support Vector Machines

83 jmlr-2009-SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent

84 jmlr-2009-Scalable Collaborative Filtering Approaches for Large Recommender Systems    (Special Topic on Mining and Learning with Graphs and Relations)

85 jmlr-2009-Settable Systems: An Extension of Pearl's Causal Model with Optimization, Equilibrium, and Learning

86 jmlr-2009-Similarity-based Classification: Concepts and Algorithms

87 jmlr-2009-Sparse Online Learning via Truncated Gradient

88 jmlr-2009-Stable and Efficient Gaussian Process Calculations

89 jmlr-2009-Strong Limit Theorems for the Bayesian Scoring Criterion in Bayesian Networks

90 jmlr-2009-Structure Spaces

91 jmlr-2009-Subgroup Analysis via Recursive Partitioning

92 jmlr-2009-Supervised Descriptive Rule Discovery: A Unifying Survey of Contrast Set, Emerging Pattern and Subgroup Mining

93 jmlr-2009-The Hidden Life of Latent Variables: Bayesian Learning with Mixed Graph Models

94 jmlr-2009-The Nonparanormal: Semiparametric Estimation of High Dimensional Undirected Graphs

95 jmlr-2009-The P-Norm Push: A Simple Convex Ranking Algorithm that Concentrates at the Top of the List

96 jmlr-2009-Transfer Learning for Reinforcement Learning Domains: A Survey

97 jmlr-2009-Ultrahigh Dimensional Feature Selection: Beyond The Linear Model

98 jmlr-2009-Universal Kernel-Based Learning with Applications to Regular Languages    (Special Topic on Mining and Learning with Graphs and Relations)

99 jmlr-2009-Using Local Dependencies within Batches to Improve Large Margin Classifiers

100 jmlr-2009-When Is There a Representer Theorem? Vector Versus Matrix Regularizers