jmlr jmlr2013 knowledge-graph by maker-knowledge-mining
1 jmlr-2013-AC++Template-Based Reinforcement Learning Library: Fitting the Code to the Mathematics
Author: Hervé Frezza-Buet, Matthieu Geist
Abstract: This paper introduces the rllib as an original C++ template-based library oriented toward value function estimation. Generic programming is promoted here as a way of having a good fit between the mathematics of reinforcement learning and their implementation in a library. The main concepts of rllib are presented, as well as a short example. Keywords: reinforcement learning, C++, generic programming 1. C++ Genericity for Fitting the Mathematics of Reinforcement Learning Reinforcement learning (RL) is a field of machine learning that benefits from a rigorous mathematical formalism, as shown for example by Bertsekas (1995). Although this formalism is well accepted in the field, its translation into efficient computer science tools has surprisingly not led to any standard yet, as mentioned by Kovacs and Egginton (2011). The claim of this paper is that genericity enables a natural expression of the mathematics of RL. The rllib (2011) library implements this idea in the C++ language, where genericity relies on templates. Templates automate the re-writing of some generic code involving user types, offering a strong type checking at compile time that improves the code safety. Using the rllib templates requires that the user-defined types fit some documented concepts. For example, some class C defining an agent should be designed so that C::state type is the type used for the states, C::action type is the type used for the actions, and the method C::action type policy(const C::state type& s) const is implemented, in order to compute the action to be performed in a given state. This concept definition specifies what is required for an agent mathematically. Note that C does not need to inherit from any kind of abstract rl::Agent class to be used by the rllib tools. It can be directly provided as a type argument to any rllib template requiring an argument fitting the concept of an agent, so that the re-written code actually compiles. 2. A Short Example Let us consider the following toy-example. The state space contains values from 0 to 9, and actions consist in increasing or decreasing the value. When the value gets out of bounds, a reward is returned ∗. Also at UMI 2958 Georgia Tech / CNRS, 2-3, rue Marconi, 57070 Metz, France. c 2013 Herv´ Frezza-Buet and Matthieu Geist. e F REZZA -B UET AND G EIST (-1 for bound 0, 1 for bound 9). Otherwise, a null reward is returned. Let us define this problem and run Sarsa. First, a simulator class fitting the concept Simulator described in the documentation is needed. c l a s s Sim { / / Our s i m u l a t o r c l a s s . No i n h e r i t a n c e r e q u i r e d . private : i n t c u r r e n t ; double r ; public : typedef int phase type ; typedef int observation type ; t y p e d e f enum { up , down} a c t i o n t y p e ; t y p e d e f d o u b l e r e w a r d t y p e ; Sim ( v o i d ) : c u r r e n t ( 0 ) , r ( 0 ) {} v o i d s e t P h a s e ( c o n s t p h a s e t y p e &s; ) { c u r r e n t = s %10;} c o n s t o b s e r v a t i o n t y p e& s e n s e ( v o i d ) c o n s t { r e t u r n c u r r e n t ; } reward type reward ( void ) const { return r ;} v o i d t i m e S t e p ( c o n s t a c t i o n t y p e &a; ) { i f ( a == up ) c u r r e n t ++; e l s e c u r r e n t −−; i f ( c u r r e n t < 0 ) r =−1; e l s e i f ( c u r r e n t > 9 ) r = 1 ; e l s e r = 0 ; i f ( r ! = 0 ) throw r l : : e x c e p t i o n : : T e r m i n a l ( ” Out o f r a n g e ” ) ; } }; Following the concept requirements, the class Sim naturally implements a sensor method sense that provides an observation from the current phase of the controlled dynamical system, and a method timeStep that computes a transition consecutive to some action. Note the use of exceptions for terminal states. For the sake of simplicity in further code, the following is added. typedef typedef typedef typedef typedef Sim : : p h a s e t y p e Sim : : a c t i o n t y p e r l : : I t e r a t o r t y p e d e f r l : : a g e n t : : o n l i n e : : E p s i l o n G r e e d y Critic ; ArgmaxCritic ; TestAgent ; LearnAgent ; The rllib expresses that Sarsa provides a critic, offering a Q-function. As actions are discrete, the best action (i.e., argmaxa∈A Q(s, a)) can be found by considering all the actions sequentially. This is what ArgmaxCritic offers thanks to the action enumerator Aenum, in order to define greedy and ε-greedy agents. The main function then only consists in running episodes with the appropriate agents. i n t main ( i n t a r g c , char ∗ a r g v [ ] ) { Sim simulator ; Transition transition ; ArgmaxCritic c r i t i c ; LearnAgent learner ( critic ); TestAgent tester ( critic ); A a; S s; int episode , length , s t e p =0; // // // // // // // T h i s i s what t h e a g e n t c o n t r o l s . T h i s i s some s , a , r , s ’ , a ’ d a t a . T h i s c o m p u t e s Q and argmax a Q( s , a ) . SARSA u s e s t h i s a g e n t t o l e a r n t h e p o l i c y . This behaves according to the c r i t i c . Some a c t i o n . Some s t a t e . f o r ( e p i s o d e = 0 ; e p i s o d e < 1 0 0 0 0 ; ++ e p i s o d e ) { / / L e a r n i n g p h a s e s i m u l a t o r . setPhase ( rand ()%10); r l : : episode : : sa : : r u n a n d l e a r n ( simulator , l e a r n e r , t r a n s i t i o n , 0 , length ) ; } try { / / T e s t phase simulator . setPhase (0); while ( true ) { s = simulator . sense ( ) ; a = t e s t e r . policy ( s ) ; s t e p ++; s i m u l a t o r . t i m e S t e p ( a ) ; } } c a t c h ( r l : : e x c e p t i o n : : T e r m i n a l e ) { s t d : : c o u t << s t e p << ” s t e p s . ” << s t d : : e n d l ; } return 0 ; / / t h e message p r i n t e d i s ‘ ‘10 s t e p s . ’ ’ } 3. Features of the Library Using the library requires to define the features that are specific to the problem (the simulator and the Q-function architecture in our example) from scratch, but with the help of concepts. Then, the specific features can be handled by generic code provided by the library to implement RL techniques with value function estimation. 627 F REZZA -B UET AND G EIST Currently, Q-learing, Sarsa, KTD-Q, LSTD, and policy iteration are available, as well as a multi-layer perceptron architecture. Moreover, some benchmark problems (i.e., simulators) are also provided: the mountain car, the cliff walking, the inverted pendulum and the Boyan chain. Extending the library with new algorithms is allowed, since it consists in defining new templates. This is a bit more technical than only using the existing algorithms, but the structure of existing concepts helps, since it reflects the mathematics of RL. For example, concepts like Feature, for linear approaches mainly (i.e., Q(s, a) = θT ϕ(s, a)) and Architecture (i.e., Q(s, a) = fθ (s, a) for more general approximation) orient the design toward functional approaches of RL. The algorithms implemented so far rely on the GNU Scientific Library (see GSL, 2011) for linear algebra computation, so the GPL licence of GSL propagates to the rllib. 4. Conclusion The rllib relies only on the C++ standard and the availability of the GSL on the system. It offers state-action function approximation tools for applying RL to real problems, as well as a design that fits the mathematics. The latter allows for extensions, but is also compliant with pedagogical purpose. The design of the rllib aims at allowing the user to build (using C++ programming) its own experiment, using several algorithms, several agents, on-line or batch learning, and so on. Actually, the difficult part of RL is the algorithms themselves, not the script-like part of the experiment where things are put together (see the main function in our example). With a framework, in the sense of Kovacs and Egginton (2011), the experiment is not directly accessible to the user programs, since it is handled by some libraries in order to offer graphical interface or analyzing tools. The user code is then called by the framework when required. We advocate that allowing the user to call the rllib functionality at his/her convenience provides an open and extensible access to RL for students, researchers and engineers. Last, the rllib fits the requirements expressed by Kovacs and Egginton (2011, Section 4.3): support of good scientific research, formulation compliant with the domain, allowing for any kind of agents and any kind of approximators, interoperability of components (the Q function of the example can be used for different algorithms and agents), maximization of run-time speed (use of C++ and templates that inline massively the code), open source, etc. Extensions of rllib can be considered, for example for handling POMDPs, and contributions of users are expected. The use of templates is unfortunately unfamiliar to many programmers, but the effort is worth it, since it brings the code at the level of the mathematical formalism, increasing readability (by a rational use of typedefs) and reducing bugs. Even if the approach is dramatically different from existing frameworks, wrappings with frameworks can be considered in further development. References Dimitri P. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, 3rd (20052007) edition, 1995. GSL, 2011. http://http://www.gnu.org/software/gsl. Tim Kovacs and Robert Egginton. On the analysis and design of software for reinforcement learning, with a survey of existing systems. Machine Learning, 84:7–49, 2011. rllib, 2011. http://ims.metz.supelec.fr/spip.php?article122. 628
Author: Daniil Ryabko, Jérémie Mary
Abstract: A metric between time-series distributions is proposed that can be evaluated using binary classification methods, which were originally developed to work on i.i.d. data. It is shown how this metric can be used for solving statistical problems that are seemingly unrelated to classification and concern highly dependent time series. Specifically, the problems of time-series clustering, homogeneity testing and the three-sample problem are addressed. Universal consistency of the resulting algorithms is proven under most general assumptions. The theoretical results are illustrated with experiments on synthetic and real-world data. Keywords: time series, reductions, stationary ergodic, clustering, metrics between probability distributions
3 jmlr-2013-A Framework for Evaluating Approximation Methods for Gaussian Process Regression
Author: Krzysztof Chalupka, Christopher K. I. Williams, Iain Murray
Abstract: Gaussian process (GP) predictors are an important component of many Bayesian approaches to machine learning. However, even a straightforward implementation of Gaussian process regression (GPR) requires O(n2 ) space and O(n3 ) time for a data set of n examples. Several approximation methods have been proposed, but there is a lack of understanding of the relative merits of the different approximations, and in what situations they are most useful. We recommend assessing the quality of the predictions obtained as a function of the compute time taken, and comparing to standard baselines (e.g., Subset of Data and FITC). We empirically investigate four different approximation algorithms on four different prediction problems, and make our code available to encourage future comparisons. Keywords: Gaussian process regression, subset of data, FITC, local GP
4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
Author: Tony Cai, Wen-Xin Zhou
Abstract: We consider in this paper the problem of noisy 1-bit matrix completion under a general non-uniform sampling distribution using the max-norm as a convex relaxation for the rank. A max-norm constrained maximum likelihood estimate is introduced and studied. The rate of convergence for the estimate is obtained. Information-theoretical methods are used to establish a minimax lower bound under the general sampling model. The minimax upper and lower bounds together yield the optimal rate of convergence for the Frobenius norm loss. Computational algorithms and numerical performance are also discussed. Keywords: 1-bit matrix completion, low-rank matrix, max-norm, trace-norm, constrained optimization, maximum likelihood estimate, optimal rate of convergence
5 jmlr-2013-A Near-Optimal Algorithm for Differentially-Private Principal Components
Author: Kamalika Chaudhuri, Anand D. Sarwate, Kaushik Sinha
Abstract: The principal components analysis (PCA) algorithm is a standard tool for identifying good lowdimensional approximations to high-dimensional data. Many data sets of interest contain private or sensitive information about individuals. Algorithms which operate on such data should be sensitive to the privacy risks in publishing their outputs. Differential privacy is a framework for developing tradeoffs between privacy and the utility of these outputs. In this paper we investigate the theory and empirical performance of differentially private approximations to PCA and propose a new method which explicitly optimizes the utility of the output. We show that the sample complexity of the proposed method differs from the existing procedure in the scaling with the data dimension, and that our method is nearly optimal in terms of this scaling. We furthermore illustrate our results, showing that on real data there is a large performance gap between the existing method and our method. Keywords: differential privacy, principal components analysis, dimension reduction
6 jmlr-2013-A Plug-in Approach to Neyman-Pearson Classification
Author: Xin Tong
Abstract: The Neyman-Pearson (NP) paradigm in binary classification treats type I and type II errors with different priorities. It seeks classifiers that minimize type II error, subject to a type I error constraint under a user specified level α. In this paper, plug-in classifiers are developed under the NP paradigm. Based on the fundamental Neyman-Pearson Lemma, we propose two related plug-in classifiers which amount to thresholding respectively the class conditional density ratio and the regression function. These two classifiers handle different sampling schemes. This work focuses on theoretical properties of the proposed classifiers; in particular, we derive oracle inequalities that can be viewed as finite sample versions of risk bounds. NP classification can be used to address anomaly detection problems, where asymmetry in errors is an intrinsic property. As opposed to a common practice in anomaly detection that consists of thresholding normal class density, our approach does not assume a specific form for anomaly distributions. Such consideration is particularly necessary when the anomaly class density is far from uniformly distributed. Keywords: plug-in approach, Neyman-Pearson paradigm, nonparametric statistics, oracle inequality, anomaly detection
7 jmlr-2013-A Risk Comparison of Ordinary Least Squares vs Ridge Regression
Author: Paramveer S. Dhillon, Dean P. Foster, Sham M. Kakade, Lyle H. Ungar
Abstract: We compare the risk of ridge regression to a simple variant of ordinary least squares, in which one simply projects the data onto a finite dimensional subspace (as specified by a principal component analysis) and then performs an ordinary (un-regularized) least squares regression in this subspace. This note shows that the risk of this ordinary least squares method (PCA-OLS) is within a constant factor (namely 4) of the risk of ridge regression (RR). Keywords: risk inflation, ridge regression, pca
8 jmlr-2013-A Theory of Multiclass Boosting
Author: Indraneel Mukherjee, Robert E. Schapire
Abstract: Boosting combines weak classifiers to form highly accurate predictors. Although the case of binary classification is well understood, in the multiclass setting, the “correct” requirements on the weak classifier, or the notion of the most efficient boosting algorithms are missing. In this paper, we create a broad and general framework, within which we make precise and identify the optimal requirements on the weak-classifier, as well as design the most effective, in a certain sense, boosting algorithms that assume such requirements. Keywords: multiclass, boosting, weak learning condition, drifting games
9 jmlr-2013-A Widely Applicable Bayesian Information Criterion
Author: Sumio Watanabe
Abstract: A statistical model or a learning machine is called regular if the map taking a parameter to a probability distribution is one-to-one and if its Fisher information matrix is always positive definite. If otherwise, it is called singular. In regular statistical models, the Bayes free energy, which is defined by the minus logarithm of Bayes marginal likelihood, can be asymptotically approximated by the Schwarz Bayes information criterion (BIC), whereas in singular models such approximation does not hold. Recently, it was proved that the Bayes free energy of a singular model is asymptotically given by a generalized formula using a birational invariant, the real log canonical threshold (RLCT), instead of half the number of parameters in BIC. Theoretical values of RLCTs in several statistical models are now being discovered based on algebraic geometrical methodology. However, it has been difficult to estimate the Bayes free energy using only training samples, because an RLCT depends on an unknown true distribution. In the present paper, we define a widely applicable Bayesian information criterion (WBIC) by the average log likelihood function over the posterior distribution with the inverse temperature 1/ log n, where n is the number of training samples. We mathematically prove that WBIC has the same asymptotic expansion as the Bayes free energy, even if a statistical model is singular for or unrealizable by a statistical model. Since WBIC can be numerically calculated without any information about a true distribution, it is a generalized version of BIC onto singular statistical models. Keywords: Bayes marginal likelihood, widely applicable Bayes information criterion
10 jmlr-2013-Algorithms and Hardness Results for Parallel Large Margin Learning
Author: Philip M. Long, Rocco A. Servedio
Abstract: We consider the problem of learning an unknown large-margin halfspace in the context of parallel computation, giving both positive and negative results. As our main positive result, we give a parallel algorithm for learning a large-margin halfspace, based on an algorithm of Nesterov’s that performs gradient descent with a momentum term. We show that this algorithm can learn an unknown γ-margin halfspace over n dimensions using ˜ n · poly(1/γ) processors and running in time O(1/γ) + O(log n). In contrast, naive parallel algorithms that learn a γ-margin halfspace in time that depends polylogarithmically on n have an inverse quadratic running time dependence on the margin parameter γ. Our negative result deals with boosting, which is a standard approach to learning large-margin halfspaces. We prove that in the original PAC framework, in which a weak learning algorithm is provided as an oracle that is called by the booster, boosting cannot be parallelized. More precisely, we show that, if the algorithm is allowed to call the weak learner multiple times in parallel within a single boosting stage, this ability does not reduce the overall number of successive stages of boosting needed for learning by even a single stage. Our proof is information-theoretic and does not rely on unproven assumptions. Keywords: PAC learning, parallel learning algorithms, halfspace learning, linear classifiers
11 jmlr-2013-Algorithms for Discovery of Multiple Markov Boundaries
Author: Alexander Statnikov, Nikita I. Lytkin, Jan Lemeire, Constantin F. Aliferis
Abstract: Algorithms for Markov boundary discovery from data constitute an important recent development in machine learning, primarily because they offer a principled solution to the variable/feature selection problem and give insight on local causal structure. Over the last decade many sound algorithms have been proposed to identify a single Markov boundary of the response variable. Even though faithful distributions and, more broadly, distributions that satisfy the intersection property always have a single Markov boundary, other distributions/data sets may have multiple Markov boundaries of the response variable. The latter distributions/data sets are common in practical data-analytic applications, and there are several reasons why it is important to induce multiple Markov boundaries from such data. However, there are currently no sound and efficient algorithms that can accomplish this task. This paper describes a family of algorithms TIE* that can discover all Markov boundaries in a distribution. The broad applicability as well as efficiency of the new algorithmic family is demonstrated in an extensive benchmarking study that involved comparison with 26 state-of-the-art algorithms/variants in 15 data sets from a diversity of application domains. Keywords: Markov boundary discovery, variable/feature selection, information equivalence, violations of faithfulness
12 jmlr-2013-Alleviating Naive Bayes Attribute Independence Assumption by Attribute Weighting
Author: Nayyar A. Zaidi, Jesús Cerquides, Mark J. Carman, Geoffrey I. Webb
Abstract: Despite the simplicity of the Naive Bayes classifier, it has continued to perform well against more sophisticated newcomers and has remained, therefore, of great interest to the machine learning community. Of numerous approaches to refining the naive Bayes classifier, attribute weighting has received less attention than it warrants. Most approaches, perhaps influenced by attribute weighting in other machine learning algorithms, use weighting to place more emphasis on highly predictive attributes than those that are less predictive. In this paper, we argue that for naive Bayes attribute weighting should instead be used to alleviate the conditional independence assumption. Based on this premise, we propose a weighted naive Bayes algorithm, called WANBIA, that selects weights to minimize either the negative conditional log likelihood or the mean squared error objective functions. We perform extensive evaluations and find that WANBIA is a competitive alternative to state of the art classifiers like Random Forest, Logistic Regression and A1DE. Keywords: classification, naive Bayes, attribute independence assumption, weighted naive Bayes classification
13 jmlr-2013-Approximating the Permanent with Fractional Belief Propagation
Author: Michael Chertkov, Adam B. Yedidia
Abstract: We discuss schemes for exact and approximate computations of permanents, and compare them with each other. Specifically, we analyze the belief propagation (BP) approach and its fractional belief propagation (FBP) generalization for computing the permanent of a non-negative matrix. Known bounds and Conjectures are verified in experiments, and some new theoretical relations, bounds and Conjectures are proposed. The fractional free energy (FFE) function is parameterized by a scalar parameter γ ∈ [−1; 1], where γ = −1 corresponds to the BP limit and γ = 1 corresponds to the exclusion principle (but ignoring perfect matching constraints) mean-field (MF) limit. FFE shows monotonicity and continuity with respect to γ. For every non-negative matrix, we define its special value γ∗ ∈ [−1; 0] to be the γ for which the minimum of the γ-parameterized FFE function is equal to the permanent of the matrix, where the lower and upper bounds of the γ-interval corresponds to respective bounds for the permanent. Our experimental analysis suggests that the distribution of γ∗ varies for different ensembles but γ∗ always lies within the [−1; −1/2] interval. Moreover, for all ensembles considered, the behavior of γ∗ is highly distinctive, offering an empirical practical guidance for estimating permanents of non-negative matrices via the FFE approach. Keywords: permanent, graphical models, belief propagation, exact and approximate algorithms, learning flows
Author: Pierre Neuvial
Abstract: The False Discovery Rate (FDR) is a commonly used type I error rate in multiple testing problems. It is defined as the expected False Discovery Proportion (FDP), that is, the expected fraction of false positives among rejected hypotheses. When the hypotheses are independent, the BenjaminiHochberg procedure achieves FDR control at any pre-specified level. By construction, FDR control offers no guarantee in terms of power, or type II error. A number of alternative procedures have been developed, including plug-in procedures that aim at gaining power by incorporating an estimate of the proportion of true null hypotheses. In this paper, we study the asymptotic behavior of a class of plug-in procedures based on kernel estimators of the density of the p-values, as the number m of tested hypotheses grows to infinity. In a setting where the hypotheses tested are independent, we prove that these procedures are asymptotically more powerful in two respects: (i) a tighter asymptotic FDR control for any target FDR level and (ii) a broader range of target levels yielding positive asymptotic power. We also show that this increased asymptotic power comes at the price of slower, non-parametric convergence rates for the FDP. These rates are of the form m−k/(2k+1) , where k is determined by the regularity of the density of the p-value distribution, or, equivalently, of the test statistics distribution. These results are applied to one- and two-sided tests statistics for Gaussian and Laplace location models, and for the Student model. Keywords: multiple testing, false discovery rate, Benjamini Hochberg’s procedure, power, criticality, plug-in procedures, adaptive control, test statistics distribution, convergence rates, kernel estimators
15 jmlr-2013-Bayesian Canonical Correlation Analysis
Author: Arto Klami, Seppo Virtanen, Samuel Kaski
Abstract: Canonical correlation analysis (CCA) is a classical method for seeking correlations between two multivariate data sets. During the last ten years, it has received more and more attention in the machine learning community in the form of novel computational formulations and a plethora of applications. We review recent developments in Bayesian models and inference methods for CCA which are attractive for their potential in hierarchical extensions and for coping with the combination of large dimensionalities and small sample sizes. The existing methods have not been particularly successful in fulfilling the promise yet; we introduce a novel efficient solution that imposes group-wise sparsity to estimate the posterior of an extended model which not only extracts the statistical dependencies (correlations) between data sets but also decomposes the data into shared and data set-specific components. In statistics literature the model is known as inter-battery factor analysis (IBFA), for which we now provide a Bayesian treatment. Keywords: Bayesian modeling, canonical correlation analysis, group-wise sparsity, inter-battery factor analysis, variational Bayesian approximation
16 jmlr-2013-Bayesian Nonparametric Hidden Semi-Markov Models
Author: Matthew J. Johnson, Alan S. Willsky
Abstract: There is much interest in the Hierarchical Dirichlet Process Hidden Markov Model (HDP-HMM) as a natural Bayesian nonparametric extension of the ubiquitous Hidden Markov Model for learning from sequential and time-series data. However, in many settings the HDP-HMM’s strict Markovian constraints are undesirable, particularly if we wish to learn or encode non-geometric state durations. We can extend the HDP-HMM to capture such structure by drawing upon explicit-duration semi-Markov modeling, which has been developed mainly in the parametric non-Bayesian setting, to allow construction of highly interpretable models that admit natural prior information on state durations. In this paper we introduce the explicit-duration Hierarchical Dirichlet Process Hidden semiMarkov Model (HDP-HSMM) and develop sampling algorithms for efficient posterior inference. The methods we introduce also provide new methods for sampling inference in the finite Bayesian HSMM. Our modular Gibbs sampling methods can be embedded in samplers for larger hierarchical Bayesian models, adding semi-Markov chain modeling as another tool in the Bayesian inference toolbox. We demonstrate the utility of the HDP-HSMM and our inference methods on both synthetic and real experiments. Keywords: Bayesian nonparametrics, time series, semi-Markov, sampling algorithms, Hierarchical Dirichlet Process Hidden Markov Model
Author: Nima Noorshams, Martin J. Wainwright
Abstract: The sum-product or belief propagation (BP) algorithm is a widely used message-passing technique for computing approximate marginals in graphical models. We introduce a new technique, called stochastic orthogonal series message-passing (SOSMP), for computing the BP fixed point in models with continuous random variables. It is based on a deterministic approximation of the messages via orthogonal series basis expansion, and a stochastic estimation of the basis coefficients via Monte Carlo techniques and damped updates. We prove that the SOSMP iterates converge to a δ-neighborhood of the unique BP fixed point for any tree-structured graph, and for any graphs with cycles in which the BP updates satisfy a contractivity condition. In addition, we demonstrate how to choose the number of basis coefficients as a function of the desired approximation accuracy δ and smoothness of the compatibility functions. We illustrate our theory with both simulated examples and in application to optical flow estimation. Keywords: graphical models, sum-product for continuous state spaces, low-complexity belief propagation, stochastic approximation, Monte Carlo methods, orthogonal basis expansion
Author: Ming-Jie Zhao, Narayanan Edakunni, Adam Pocock, Gavin Brown
Abstract: Fano’s inequality lower bounds the probability of transmission error through a communication channel. Applied to classification problems, it provides a lower bound on the Bayes error rate and motivates the widely used Infomax principle. In modern machine learning, we are often interested in more than just the error rate. In medical diagnosis, different errors incur different cost; hence, the overall risk is cost-sensitive. Two other popular criteria are balanced error rate (BER) and F-score. In this work, we focus on the two-class problem and use a general definition of conditional entropy (including Shannon’s as a special case) to derive upper/lower bounds on the optimal F-score, BER and cost-sensitive risk, extending Fano’s result. As a consequence, we show that Infomax is not suitable for optimizing F-score or cost-sensitive risk, in that it can potentially lead to low F-score and high risk. For cost-sensitive risk, we propose a new conditional entropy formulation which avoids this inconsistency. In addition, we consider the common practice of using a threshold on the posterior probability to tune performance of a classifier. As is widely known, a threshold of 0.5, where the posteriors cross, minimizes error rate—we derive similar optimal thresholds for F-score and BER. Keywords: balanced error rate, F-score (Fβ -measure), cost-sensitive risk, conditional entropy, lower/upper bound
19 jmlr-2013-BudgetedSVM: A Toolbox for Scalable SVM Approximations
Author: Nemanja Djuric, Liang Lan, Slobodan Vucetic, Zhuang Wang
Abstract: We present BudgetedSVM, an open-source C++ toolbox comprising highly-optimized implementations of recently proposed algorithms for scalable training of Support Vector Machine (SVM) approximators: Adaptive Multi-hyperplane Machines, Low-rank Linearization SVM, and Budgeted Stochastic Gradient Descent. BudgetedSVM trains models with accuracy comparable to LibSVM in time comparable to LibLinear, solving non-linear problems with millions of high-dimensional examples within minutes on a regular computer. We provide command-line and Matlab interfaces to BudgetedSVM, an efficient API for handling large-scale, high-dimensional data sets, as well as detailed documentation to help developers use and further extend the toolbox. Keywords: non-linear classification, large-scale learning, SVM, machine learning toolbox
20 jmlr-2013-CODA: High Dimensional Copula Discriminant Analysis
Author: Fang Han, Tuo Zhao, Han Liu
Abstract: We propose a high dimensional classification method, named the Copula Discriminant Analysis (CODA). The CODA generalizes the normal-based linear discriminant analysis to the larger Gaussian Copula models (or the nonparanormal) as proposed by Liu et al. (2009). To simultaneously achieve estimation efficiency and robustness, the nonparametric rank-based methods including the Spearman’s rho and Kendall’s tau are exploited in estimating the covariance matrix. In high dimensional settings, we prove that the sparsity pattern of the discriminant features can be consistently recovered with the parametric rate, and the expected misclassification error is consistent to the Bayes risk. Our theory is backed up by careful numerical experiments, which show that the extra flexibility gained by the CODA method incurs little efficiency loss even when the data are truly Gaussian. These results suggest that the CODA method can be an alternative choice besides the normal-based high dimensional linear discriminant analysis. Keywords: high dimensional statistics, sparse nonlinear discriminant analysis, Gaussian copula, nonparanormal distribution, rank-based statistics
21 jmlr-2013-Classifier Selection using the Predicate Depth
22 jmlr-2013-Classifying With Confidence From Incomplete Information
23 jmlr-2013-Cluster Analysis: Unsupervised Learning via Supervised Learning with a Non-convex Penalty
25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization
26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems
27 jmlr-2013-Consistent Selection of Tuning Parameters via Variable Selection Stability
28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning
29 jmlr-2013-Convex and Scalable Weakly Labeled SVMs
30 jmlr-2013-Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising
31 jmlr-2013-Derivative Estimation with Local Polynomial Fitting
32 jmlr-2013-Differential Privacy for Functions and Functional Data
33 jmlr-2013-Dimension Independent Similarity Computation
34 jmlr-2013-Distance Preserving Embeddings for General n-Dimensional Manifolds
35 jmlr-2013-Distribution-Dependent Sample Complexity of Large Margin Learning
36 jmlr-2013-Distributions of Angles in Random Packing on Spheres
37 jmlr-2013-Divvy: Fast and Intuitive Exploratory Data Analysis
39 jmlr-2013-Efficient Active Learning of Halfspaces: An Aggressive Approach
40 jmlr-2013-Efficient Program Synthesis Using Constraint Satisfaction in Inductive Logic Programming
41 jmlr-2013-Experiment Selection for Causal Discovery
42 jmlr-2013-Fast Generalized Subset Scan for Anomalous Pattern Detection
43 jmlr-2013-Fast MCMC Sampling for Markov Jump Processes and Extensions
44 jmlr-2013-Finding Optimal Bayesian Networks Using Precedence Constraints
45 jmlr-2013-GPstuff: Bayesian Modeling with Gaussian Processes
46 jmlr-2013-GURLS: A Least Squares Library for Supervised Learning
47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference
49 jmlr-2013-Global Analytic Solution of Fully-observed Variational Bayesian Matrix Factorization
50 jmlr-2013-Greedy Feature Selection for Subspace Clustering
51 jmlr-2013-Greedy Sparsity-Constrained Optimization
53 jmlr-2013-Improving CUR Matrix Decomposition and the Nystrom Approximation via Adaptive Sampling
54 jmlr-2013-JKernelMachines: A Simple Framework for Kernel Machines
55 jmlr-2013-Joint Harmonic Functions and Their Supervised Connections
56 jmlr-2013-Keep It Simple And Sparse: Real-Time Action Recognition
57 jmlr-2013-Kernel Bayes' Rule: Bayesian Inference with Positive Definite Kernels
58 jmlr-2013-Language-Motivated Approaches to Action Recognition
59 jmlr-2013-Large-scale SVD and Manifold Learning
60 jmlr-2013-Learning Bilinear Model for Matching Queries and Documents
61 jmlr-2013-Learning Theory Analysis for Association Rules and Sequential Event Prediction
62 jmlr-2013-Learning Theory Approach to Minimum Error Entropy Criterion
63 jmlr-2013-Learning Trees from Strings: A Strong Learning Algorithm for some Context-Free Grammars
64 jmlr-2013-Lovasz theta function, SVMs and Finding Dense Subgraphs
67 jmlr-2013-MLPACK: A Scalable C++ Machine Learning Library
68 jmlr-2013-Machine Learning with Operational Costs
69 jmlr-2013-Manifold Regularization and Semi-supervised Learning: Some Theoretical Analyses
70 jmlr-2013-Maximum Volume Clustering: A New Discriminative Clustering Approach
71 jmlr-2013-Message-Passing Algorithms for Quadratic Minimization
72 jmlr-2013-Multi-Stage Multi-Task Feature Learning
73 jmlr-2013-Multicategory Large-Margin Unified Machines
74 jmlr-2013-Multivariate Convex Regression with Adaptive Partitioning
76 jmlr-2013-Nonparametric Sparsity and Regularization
77 jmlr-2013-On the Convergence of Maximum Variance Unfolding
78 jmlr-2013-On the Learnability of Shuffle Ideals
79 jmlr-2013-On the Mutual Nearest Neighbors Estimate in Regression
80 jmlr-2013-One-shot Learning Gesture Recognition from RGB-D Data Using Bag of Features
82 jmlr-2013-Optimally Fuzzy Temporal Memory
83 jmlr-2013-Orange: Data Mining Toolbox in Python
84 jmlr-2013-PC Algorithm for Nonparanormal Graphical Models
85 jmlr-2013-Pairwise Likelihood Ratios for Estimation of Non-Gaussian Structural Equation Models
86 jmlr-2013-Parallel Vector Field Embedding
87 jmlr-2013-Performance Bounds for λ Policy Iteration and Application to the Game of Tetris
88 jmlr-2013-Perturbative Corrections for Approximate Inference in Gaussian Latent Variable Models
89 jmlr-2013-QuantMiner for Mining Quantitative Association Rules
90 jmlr-2013-Quasi-Newton Method: A New Direction
91 jmlr-2013-Query Induction with Schema-Guided Pruning Strategies
92 jmlr-2013-Random Spanning Trees and the Prediction of Weighted Graphs
93 jmlr-2013-Random Walk Kernels and Learning Curves for Gaussian Process Regression on Random Graphs
94 jmlr-2013-Ranked Bandits in Metric Spaces: Learning Diverse Rankings over Large Document Collections
96 jmlr-2013-Regularization-Free Principal Curve Estimation
97 jmlr-2013-Risk Bounds of Learning Processes for Lévy Processes
98 jmlr-2013-Segregating Event Streams and Noise with a Markov Renewal Process Model
99 jmlr-2013-Semi-Supervised Learning Using Greedy Max-Cut
100 jmlr-2013-Similarity-based Clustering by Left-Stochastic Matrix Factorization
101 jmlr-2013-Sparse Activity and Sparse Connectivity in Supervised Learning
102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso
104 jmlr-2013-Sparse Single-Index Model
105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression
106 jmlr-2013-Stationary-Sparse Causality Network Learning
107 jmlr-2013-Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization
108 jmlr-2013-Stochastic Variational Inference
109 jmlr-2013-Stress Functions for Nonlinear Dimension Reduction, Proximity Analysis, and Graph Drawing
111 jmlr-2013-Supervised Feature Selection in Graphs with Path Coding Penalties and Network Flows
112 jmlr-2013-Tapkee: An Efficient Dimension Reduction Library
113 jmlr-2013-The CAM Software for Nonnegative Blind Source Separation in R-Java
114 jmlr-2013-The Rate of Convergence of AdaBoost
115 jmlr-2013-Training Energy-Based Models for Time-Series Imputation
116 jmlr-2013-Truncated Power Method for Sparse Eigenvalue Problems
117 jmlr-2013-Universal Consistency of Localized Versions of Regularized Kernel Methods
118 jmlr-2013-Using Symmetry and Evolutionary Search to Minimize Sorting Networks
119 jmlr-2013-Variable Selection in High-Dimension with Random Designs and Orthogonal Matching Pursuit
120 jmlr-2013-Variational Algorithms for Marginal MAP
121 jmlr-2013-Variational Inference in Nonconjugate Models