jmlr jmlr2013 knowledge-graph by maker-knowledge-mining

jmlr 2013 knowledge graph

papers list:

1 jmlr-2013-AC++Template-Based Reinforcement Learning Library: Fitting the Code to the Mathematics

Author: Hervé Frezza-Buet, Matthieu Geist

Abstract: This paper introduces the rllib as an original C++ template-based library oriented toward value function estimation. Generic programming is promoted here as a way of having a good ﬁt between the mathematics of reinforcement learning and their implementation in a library. The main concepts of rllib are presented, as well as a short example. Keywords: reinforcement learning, C++, generic programming 1. C++ Genericity for Fitting the Mathematics of Reinforcement Learning Reinforcement learning (RL) is a ﬁeld of machine learning that beneﬁts from a rigorous mathematical formalism, as shown for example by Bertsekas (1995). Although this formalism is well accepted in the ﬁeld, its translation into efﬁcient computer science tools has surprisingly not led to any standard yet, as mentioned by Kovacs and Egginton (2011). The claim of this paper is that genericity enables a natural expression of the mathematics of RL. The rllib (2011) library implements this idea in the C++ language, where genericity relies on templates. Templates automate the re-writing of some generic code involving user types, offering a strong type checking at compile time that improves the code safety. Using the rllib templates requires that the user-deﬁned types ﬁt some documented concepts. For example, some class C deﬁning an agent should be designed so that C::state type is the type used for the states, C::action type is the type used for the actions, and the method C::action type policy(const C::state type& s) const is implemented, in order to compute the action to be performed in a given state. This concept deﬁnition speciﬁes what is required for an agent mathematically. Note that C does not need to inherit from any kind of abstract rl::Agent class to be used by the rllib tools. It can be directly provided as a type argument to any rllib template requiring an argument ﬁtting the concept of an agent, so that the re-written code actually compiles. 2. A Short Example Let us consider the following toy-example. The state space contains values from 0 to 9, and actions consist in increasing or decreasing the value. When the value gets out of bounds, a reward is returned ∗. Also at UMI 2958 Georgia Tech / CNRS, 2-3, rue Marconi, 57070 Metz, France. c 2013 Herv´ Frezza-Buet and Matthieu Geist. e F REZZA -B UET AND G EIST (-1 for bound 0, 1 for bound 9). Otherwise, a null reward is returned. Let us deﬁne this problem and run Sarsa. First, a simulator class ﬁtting the concept Simulator described in the documentation is needed. c l a s s Sim { / / Our s i m u l a t o r c l a s s . No i n h e r i t a n c e r e q u i r e d . private : i n t c u r r e n t ; double r ; public : typedef int phase type ; typedef int observation type ; t y p e d e f enum { up , down} a c t i o n t y p e ; t y p e d e f d o u b l e r e w a r d t y p e ; Sim ( v o i d ) : c u r r e n t ( 0 ) , r ( 0 ) {} v o i d s e t P h a s e ( c o n s t p h a s e t y p e &s; ) { c u r r e n t = s %10;} c o n s t o b s e r v a t i o n t y p e& s e n s e ( v o i d ) c o n s t { r e t u r n c u r r e n t ; } reward type reward ( void ) const { return r ;} v o i d t i m e S t e p ( c o n s t a c t i o n t y p e &a; ) { i f ( a == up ) c u r r e n t ++; e l s e c u r r e n t −−; i f ( c u r r e n t < 0 ) r =−1; e l s e i f ( c u r r e n t > 9 ) r = 1 ; e l s e r = 0 ; i f ( r ! = 0 ) throw r l : : e x c e p t i o n : : T e r m i n a l ( ” Out o f r a n g e ” ) ; } }; Following the concept requirements, the class Sim naturally implements a sensor method sense that provides an observation from the current phase of the controlled dynamical system, and a method timeStep that computes a transition consecutive to some action. Note the use of exceptions for terminal states. For the sake of simplicity in further code, the following is added. typedef typedef typedef typedef typedef Sim : : p h a s e t y p e Sim : : a c t i o n t y p e r l : : I t e r a t o r t y p e d e f r l : : a g e n t : : o n l i n e : : E p s i l o n G r e e d y Critic ; ArgmaxCritic ; TestAgent ; LearnAgent ; The rllib expresses that Sarsa provides a critic, offering a Q-function. As actions are discrete, the best action (i.e., argmaxa∈A Q(s, a)) can be found by considering all the actions sequentially. This is what ArgmaxCritic offers thanks to the action enumerator Aenum, in order to deﬁne greedy and ε-greedy agents. The main function then only consists in running episodes with the appropriate agents. i n t main ( i n t a r g c , char ∗ a r g v [ ] ) { Sim simulator ; Transition transition ; ArgmaxCritic c r i t i c ; LearnAgent learner ( critic ); TestAgent tester ( critic ); A a; S s; int episode , length , s t e p =0; // // // // // // // T h i s i s what t h e a g e n t c o n t r o l s . T h i s i s some s , a , r , s ’ , a ’ d a t a . T h i s c o m p u t e s Q and argmax a Q( s , a ) . SARSA u s e s t h i s a g e n t t o l e a r n t h e p o l i c y . This behaves according to the c r i t i c . Some a c t i o n . Some s t a t e . f o r ( e p i s o d e = 0 ; e p i s o d e < 1 0 0 0 0 ; ++ e p i s o d e ) { / / L e a r n i n g p h a s e s i m u l a t o r . setPhase ( rand ()%10); r l : : episode : : sa : : r u n a n d l e a r n ( simulator , l e a r n e r , t r a n s i t i o n , 0 , length ) ; } try { / / T e s t phase simulator . setPhase (0); while ( true ) { s = simulator . sense ( ) ; a = t e s t e r . policy ( s ) ; s t e p ++; s i m u l a t o r . t i m e S t e p ( a ) ; } } c a t c h ( r l : : e x c e p t i o n : : T e r m i n a l e ) { s t d : : c o u t << s t e p << ” s t e p s . ” << s t d : : e n d l ; } return 0 ; / / t h e message p r i n t e d i s ‘ ‘10 s t e p s . ’ ’ } 3. Features of the Library Using the library requires to deﬁne the features that are speciﬁc to the problem (the simulator and the Q-function architecture in our example) from scratch, but with the help of concepts. Then, the speciﬁc features can be handled by generic code provided by the library to implement RL techniques with value function estimation. 627 F REZZA -B UET AND G EIST Currently, Q-learing, Sarsa, KTD-Q, LSTD, and policy iteration are available, as well as a multi-layer perceptron architecture. Moreover, some benchmark problems (i.e., simulators) are also provided: the mountain car, the cliff walking, the inverted pendulum and the Boyan chain. Extending the library with new algorithms is allowed, since it consists in deﬁning new templates. This is a bit more technical than only using the existing algorithms, but the structure of existing concepts helps, since it reﬂects the mathematics of RL. For example, concepts like Feature, for linear approaches mainly (i.e., Q(s, a) = θT ϕ(s, a)) and Architecture (i.e., Q(s, a) = fθ (s, a) for more general approximation) orient the design toward functional approaches of RL. The algorithms implemented so far rely on the GNU Scientiﬁc Library (see GSL, 2011) for linear algebra computation, so the GPL licence of GSL propagates to the rllib. 4. Conclusion The rllib relies only on the C++ standard and the availability of the GSL on the system. It offers state-action function approximation tools for applying RL to real problems, as well as a design that ﬁts the mathematics. The latter allows for extensions, but is also compliant with pedagogical purpose. The design of the rllib aims at allowing the user to build (using C++ programming) its own experiment, using several algorithms, several agents, on-line or batch learning, and so on. Actually, the difﬁcult part of RL is the algorithms themselves, not the script-like part of the experiment where things are put together (see the main function in our example). With a framework, in the sense of Kovacs and Egginton (2011), the experiment is not directly accessible to the user programs, since it is handled by some libraries in order to offer graphical interface or analyzing tools. The user code is then called by the framework when required. We advocate that allowing the user to call the rllib functionality at his/her convenience provides an open and extensible access to RL for students, researchers and engineers. Last, the rllib ﬁts the requirements expressed by Kovacs and Egginton (2011, Section 4.3): support of good scientiﬁc research, formulation compliant with the domain, allowing for any kind of agents and any kind of approximators, interoperability of components (the Q function of the example can be used for different algorithms and agents), maximization of run-time speed (use of C++ and templates that inline massively the code), open source, etc. Extensions of rllib can be considered, for example for handling POMDPs, and contributions of users are expected. The use of templates is unfortunately unfamiliar to many programmers, but the effort is worth it, since it brings the code at the level of the mathematical formalism, increasing readability (by a rational use of typedefs) and reducing bugs. Even if the approach is dramatically different from existing frameworks, wrappings with frameworks can be considered in further development. References Dimitri P. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientiﬁc, 3rd (20052007) edition, 1995. GSL, 2011. http://http://www.gnu.org/software/gsl. Tim Kovacs and Robert Egginton. On the analysis and design of software for reinforcement learning, with a survey of existing systems. Machine Learning, 84:7–49, 2011. rllib, 2011. http://ims.metz.supelec.fr/spip.php?article122. 628

2 jmlr-2013-A Binary-Classification-Based Metric between Time-Series Distributions and Its Use in Statistical and Learning Problems

Author: Daniil Ryabko, Jérémie Mary

Abstract: A metric between time-series distributions is proposed that can be evaluated using binary classiﬁcation methods, which were originally developed to work on i.i.d. data. It is shown how this metric can be used for solving statistical problems that are seemingly unrelated to classiﬁcation and concern highly dependent time series. Speciﬁcally, the problems of time-series clustering, homogeneity testing and the three-sample problem are addressed. Universal consistency of the resulting algorithms is proven under most general assumptions. The theoretical results are illustrated with experiments on synthetic and real-world data. Keywords: time series, reductions, stationary ergodic, clustering, metrics between probability distributions

3 jmlr-2013-A Framework for Evaluating Approximation Methods for Gaussian Process Regression

Author: Krzysztof Chalupka, Christopher K. I. Williams, Iain Murray

Abstract: Gaussian process (GP) predictors are an important component of many Bayesian approaches to machine learning. However, even a straightforward implementation of Gaussian process regression (GPR) requires O(n2 ) space and O(n3 ) time for a data set of n examples. Several approximation methods have been proposed, but there is a lack of understanding of the relative merits of the different approximations, and in what situations they are most useful. We recommend assessing the quality of the predictions obtained as a function of the compute time taken, and comparing to standard baselines (e.g., Subset of Data and FITC). We empirically investigate four different approximation algorithms on four different prediction problems, and make our code available to encourage future comparisons. Keywords: Gaussian process regression, subset of data, FITC, local GP

4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion

Author: Tony Cai, Wen-Xin Zhou

Abstract: We consider in this paper the problem of noisy 1-bit matrix completion under a general non-uniform sampling distribution using the max-norm as a convex relaxation for the rank. A max-norm constrained maximum likelihood estimate is introduced and studied. The rate of convergence for the estimate is obtained. Information-theoretical methods are used to establish a minimax lower bound under the general sampling model. The minimax upper and lower bounds together yield the optimal rate of convergence for the Frobenius norm loss. Computational algorithms and numerical performance are also discussed. Keywords: 1-bit matrix completion, low-rank matrix, max-norm, trace-norm, constrained optimization, maximum likelihood estimate, optimal rate of convergence

5 jmlr-2013-A Near-Optimal Algorithm for Differentially-Private Principal Components

Author: Kamalika Chaudhuri, Anand D. Sarwate, Kaushik Sinha

Abstract: The principal components analysis (PCA) algorithm is a standard tool for identifying good lowdimensional approximations to high-dimensional data. Many data sets of interest contain private or sensitive information about individuals. Algorithms which operate on such data should be sensitive to the privacy risks in publishing their outputs. Differential privacy is a framework for developing tradeoffs between privacy and the utility of these outputs. In this paper we investigate the theory and empirical performance of differentially private approximations to PCA and propose a new method which explicitly optimizes the utility of the output. We show that the sample complexity of the proposed method differs from the existing procedure in the scaling with the data dimension, and that our method is nearly optimal in terms of this scaling. We furthermore illustrate our results, showing that on real data there is a large performance gap between the existing method and our method. Keywords: differential privacy, principal components analysis, dimension reduction

6 jmlr-2013-A Plug-in Approach to Neyman-Pearson Classification

Author: Xin Tong

Abstract: The Neyman-Pearson (NP) paradigm in binary classiﬁcation treats type I and type II errors with different priorities. It seeks classiﬁers that minimize type II error, subject to a type I error constraint under a user speciﬁed level α. In this paper, plug-in classiﬁers are developed under the NP paradigm. Based on the fundamental Neyman-Pearson Lemma, we propose two related plug-in classiﬁers which amount to thresholding respectively the class conditional density ratio and the regression function. These two classiﬁers handle different sampling schemes. This work focuses on theoretical properties of the proposed classiﬁers; in particular, we derive oracle inequalities that can be viewed as ﬁnite sample versions of risk bounds. NP classiﬁcation can be used to address anomaly detection problems, where asymmetry in errors is an intrinsic property. As opposed to a common practice in anomaly detection that consists of thresholding normal class density, our approach does not assume a speciﬁc form for anomaly distributions. Such consideration is particularly necessary when the anomaly class density is far from uniformly distributed. Keywords: plug-in approach, Neyman-Pearson paradigm, nonparametric statistics, oracle inequality, anomaly detection

7 jmlr-2013-A Risk Comparison of Ordinary Least Squares vs Ridge Regression

Author: Paramveer S. Dhillon, Dean P. Foster, Sham M. Kakade, Lyle H. Ungar

Abstract: We compare the risk of ridge regression to a simple variant of ordinary least squares, in which one simply projects the data onto a ﬁnite dimensional subspace (as speciﬁed by a principal component analysis) and then performs an ordinary (un-regularized) least squares regression in this subspace. This note shows that the risk of this ordinary least squares method (PCA-OLS) is within a constant factor (namely 4) of the risk of ridge regression (RR). Keywords: risk inﬂation, ridge regression, pca

8 jmlr-2013-A Theory of Multiclass Boosting

Author: Indraneel Mukherjee, Robert E. Schapire

Abstract: Boosting combines weak classiﬁers to form highly accurate predictors. Although the case of binary classiﬁcation is well understood, in the multiclass setting, the “correct” requirements on the weak classiﬁer, or the notion of the most efﬁcient boosting algorithms are missing. In this paper, we create a broad and general framework, within which we make precise and identify the optimal requirements on the weak-classiﬁer, as well as design the most effective, in a certain sense, boosting algorithms that assume such requirements. Keywords: multiclass, boosting, weak learning condition, drifting games

9 jmlr-2013-A Widely Applicable Bayesian Information Criterion

Author: Sumio Watanabe

Abstract: A statistical model or a learning machine is called regular if the map taking a parameter to a probability distribution is one-to-one and if its Fisher information matrix is always positive deﬁnite. If otherwise, it is called singular. In regular statistical models, the Bayes free energy, which is deﬁned by the minus logarithm of Bayes marginal likelihood, can be asymptotically approximated by the Schwarz Bayes information criterion (BIC), whereas in singular models such approximation does not hold. Recently, it was proved that the Bayes free energy of a singular model is asymptotically given by a generalized formula using a birational invariant, the real log canonical threshold (RLCT), instead of half the number of parameters in BIC. Theoretical values of RLCTs in several statistical models are now being discovered based on algebraic geometrical methodology. However, it has been difﬁcult to estimate the Bayes free energy using only training samples, because an RLCT depends on an unknown true distribution. In the present paper, we deﬁne a widely applicable Bayesian information criterion (WBIC) by the average log likelihood function over the posterior distribution with the inverse temperature 1/ log n, where n is the number of training samples. We mathematically prove that WBIC has the same asymptotic expansion as the Bayes free energy, even if a statistical model is singular for or unrealizable by a statistical model. Since WBIC can be numerically calculated without any information about a true distribution, it is a generalized version of BIC onto singular statistical models. Keywords: Bayes marginal likelihood, widely applicable Bayes information criterion

10 jmlr-2013-Algorithms and Hardness Results for Parallel Large Margin Learning

Author: Philip M. Long, Rocco A. Servedio

Abstract: We consider the problem of learning an unknown large-margin halfspace in the context of parallel computation, giving both positive and negative results. As our main positive result, we give a parallel algorithm for learning a large-margin halfspace, based on an algorithm of Nesterov’s that performs gradient descent with a momentum term. We show that this algorithm can learn an unknown γ-margin halfspace over n dimensions using ˜ n · poly(1/γ) processors and running in time O(1/γ) + O(log n). In contrast, naive parallel algorithms that learn a γ-margin halfspace in time that depends polylogarithmically on n have an inverse quadratic running time dependence on the margin parameter γ. Our negative result deals with boosting, which is a standard approach to learning large-margin halfspaces. We prove that in the original PAC framework, in which a weak learning algorithm is provided as an oracle that is called by the booster, boosting cannot be parallelized. More precisely, we show that, if the algorithm is allowed to call the weak learner multiple times in parallel within a single boosting stage, this ability does not reduce the overall number of successive stages of boosting needed for learning by even a single stage. Our proof is information-theoretic and does not rely on unproven assumptions. Keywords: PAC learning, parallel learning algorithms, halfspace learning, linear classiﬁers

11 jmlr-2013-Algorithms for Discovery of Multiple Markov Boundaries

Author: Alexander Statnikov, Nikita I. Lytkin, Jan Lemeire, Constantin F. Aliferis

Abstract: Algorithms for Markov boundary discovery from data constitute an important recent development in machine learning, primarily because they offer a principled solution to the variable/feature selection problem and give insight on local causal structure. Over the last decade many sound algorithms have been proposed to identify a single Markov boundary of the response variable. Even though faithful distributions and, more broadly, distributions that satisfy the intersection property always have a single Markov boundary, other distributions/data sets may have multiple Markov boundaries of the response variable. The latter distributions/data sets are common in practical data-analytic applications, and there are several reasons why it is important to induce multiple Markov boundaries from such data. However, there are currently no sound and efﬁcient algorithms that can accomplish this task. This paper describes a family of algorithms TIE* that can discover all Markov boundaries in a distribution. The broad applicability as well as efﬁciency of the new algorithmic family is demonstrated in an extensive benchmarking study that involved comparison with 26 state-of-the-art algorithms/variants in 15 data sets from a diversity of application domains. Keywords: Markov boundary discovery, variable/feature selection, information equivalence, violations of faithfulness

12 jmlr-2013-Alleviating Naive Bayes Attribute Independence Assumption by Attribute Weighting

Author: Nayyar A. Zaidi, Jesús Cerquides, Mark J. Carman, Geoffrey I. Webb

Abstract: Despite the simplicity of the Naive Bayes classiﬁer, it has continued to perform well against more sophisticated newcomers and has remained, therefore, of great interest to the machine learning community. Of numerous approaches to reﬁning the naive Bayes classiﬁer, attribute weighting has received less attention than it warrants. Most approaches, perhaps inﬂuenced by attribute weighting in other machine learning algorithms, use weighting to place more emphasis on highly predictive attributes than those that are less predictive. In this paper, we argue that for naive Bayes attribute weighting should instead be used to alleviate the conditional independence assumption. Based on this premise, we propose a weighted naive Bayes algorithm, called WANBIA, that selects weights to minimize either the negative conditional log likelihood or the mean squared error objective functions. We perform extensive evaluations and ﬁnd that WANBIA is a competitive alternative to state of the art classiﬁers like Random Forest, Logistic Regression and A1DE. Keywords: classiﬁcation, naive Bayes, attribute independence assumption, weighted naive Bayes classiﬁcation

13 jmlr-2013-Approximating the Permanent with Fractional Belief Propagation

Author: Michael Chertkov, Adam B. Yedidia

Abstract: We discuss schemes for exact and approximate computations of permanents, and compare them with each other. Speciﬁcally, we analyze the belief propagation (BP) approach and its fractional belief propagation (FBP) generalization for computing the permanent of a non-negative matrix. Known bounds and Conjectures are veriﬁed in experiments, and some new theoretical relations, bounds and Conjectures are proposed. The fractional free energy (FFE) function is parameterized by a scalar parameter γ ∈ [−1; 1], where γ = −1 corresponds to the BP limit and γ = 1 corresponds to the exclusion principle (but ignoring perfect matching constraints) mean-ﬁeld (MF) limit. FFE shows monotonicity and continuity with respect to γ. For every non-negative matrix, we deﬁne its special value γ∗ ∈ [−1; 0] to be the γ for which the minimum of the γ-parameterized FFE function is equal to the permanent of the matrix, where the lower and upper bounds of the γ-interval corresponds to respective bounds for the permanent. Our experimental analysis suggests that the distribution of γ∗ varies for different ensembles but γ∗ always lies within the [−1; −1/2] interval. Moreover, for all ensembles considered, the behavior of γ∗ is highly distinctive, offering an empirical practical guidance for estimating permanents of non-negative matrices via the FFE approach. Keywords: permanent, graphical models, belief propagation, exact and approximate algorithms, learning ﬂows

14 jmlr-2013-Asymptotic Results on Adaptive False Discovery Rate Controlling Procedures Based on Kernel Estimators

Author: Pierre Neuvial

Abstract: The False Discovery Rate (FDR) is a commonly used type I error rate in multiple testing problems. It is deﬁned as the expected False Discovery Proportion (FDP), that is, the expected fraction of false positives among rejected hypotheses. When the hypotheses are independent, the BenjaminiHochberg procedure achieves FDR control at any pre-speciﬁed level. By construction, FDR control offers no guarantee in terms of power, or type II error. A number of alternative procedures have been developed, including plug-in procedures that aim at gaining power by incorporating an estimate of the proportion of true null hypotheses. In this paper, we study the asymptotic behavior of a class of plug-in procedures based on kernel estimators of the density of the p-values, as the number m of tested hypotheses grows to inﬁnity. In a setting where the hypotheses tested are independent, we prove that these procedures are asymptotically more powerful in two respects: (i) a tighter asymptotic FDR control for any target FDR level and (ii) a broader range of target levels yielding positive asymptotic power. We also show that this increased asymptotic power comes at the price of slower, non-parametric convergence rates for the FDP. These rates are of the form m−k/(2k+1) , where k is determined by the regularity of the density of the p-value distribution, or, equivalently, of the test statistics distribution. These results are applied to one- and two-sided tests statistics for Gaussian and Laplace location models, and for the Student model. Keywords: multiple testing, false discovery rate, Benjamini Hochberg’s procedure, power, criticality, plug-in procedures, adaptive control, test statistics distribution, convergence rates, kernel estimators

15 jmlr-2013-Bayesian Canonical Correlation Analysis

Author: Arto Klami, Seppo Virtanen, Samuel Kaski

Abstract: Canonical correlation analysis (CCA) is a classical method for seeking correlations between two multivariate data sets. During the last ten years, it has received more and more attention in the machine learning community in the form of novel computational formulations and a plethora of applications. We review recent developments in Bayesian models and inference methods for CCA which are attractive for their potential in hierarchical extensions and for coping with the combination of large dimensionalities and small sample sizes. The existing methods have not been particularly successful in fulﬁlling the promise yet; we introduce a novel efﬁcient solution that imposes group-wise sparsity to estimate the posterior of an extended model which not only extracts the statistical dependencies (correlations) between data sets but also decomposes the data into shared and data set-speciﬁc components. In statistics literature the model is known as inter-battery factor analysis (IBFA), for which we now provide a Bayesian treatment. Keywords: Bayesian modeling, canonical correlation analysis, group-wise sparsity, inter-battery factor analysis, variational Bayesian approximation

16 jmlr-2013-Bayesian Nonparametric Hidden Semi-Markov Models

Author: Matthew J. Johnson, Alan S. Willsky

Abstract: There is much interest in the Hierarchical Dirichlet Process Hidden Markov Model (HDP-HMM) as a natural Bayesian nonparametric extension of the ubiquitous Hidden Markov Model for learning from sequential and time-series data. However, in many settings the HDP-HMM’s strict Markovian constraints are undesirable, particularly if we wish to learn or encode non-geometric state durations. We can extend the HDP-HMM to capture such structure by drawing upon explicit-duration semi-Markov modeling, which has been developed mainly in the parametric non-Bayesian setting, to allow construction of highly interpretable models that admit natural prior information on state durations. In this paper we introduce the explicit-duration Hierarchical Dirichlet Process Hidden semiMarkov Model (HDP-HSMM) and develop sampling algorithms for efﬁcient posterior inference. The methods we introduce also provide new methods for sampling inference in the ﬁnite Bayesian HSMM. Our modular Gibbs sampling methods can be embedded in samplers for larger hierarchical Bayesian models, adding semi-Markov chain modeling as another tool in the Bayesian inference toolbox. We demonstrate the utility of the HDP-HSMM and our inference methods on both synthetic and real experiments. Keywords: Bayesian nonparametrics, time series, semi-Markov, sampling algorithms, Hierarchical Dirichlet Process Hidden Markov Model

17 jmlr-2013-Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees

Author: Nima Noorshams, Martin J. Wainwright

Abstract: The sum-product or belief propagation (BP) algorithm is a widely used message-passing technique for computing approximate marginals in graphical models. We introduce a new technique, called stochastic orthogonal series message-passing (SOSMP), for computing the BP ﬁxed point in models with continuous random variables. It is based on a deterministic approximation of the messages via orthogonal series basis expansion, and a stochastic estimation of the basis coefﬁcients via Monte Carlo techniques and damped updates. We prove that the SOSMP iterates converge to a δ-neighborhood of the unique BP ﬁxed point for any tree-structured graph, and for any graphs with cycles in which the BP updates satisfy a contractivity condition. In addition, we demonstrate how to choose the number of basis coefﬁcients as a function of the desired approximation accuracy δ and smoothness of the compatibility functions. We illustrate our theory with both simulated examples and in application to optical ﬂow estimation. Keywords: graphical models, sum-product for continuous state spaces, low-complexity belief propagation, stochastic approximation, Monte Carlo methods, orthogonal basis expansion

18 jmlr-2013-Beyond Fano's Inequality: Bounds on the Optimal F-Score, BER, and Cost-Sensitive Risk and Their Implications

Author: Ming-Jie Zhao, Narayanan Edakunni, Adam Pocock, Gavin Brown

Abstract: Fano’s inequality lower bounds the probability of transmission error through a communication channel. Applied to classiﬁcation problems, it provides a lower bound on the Bayes error rate and motivates the widely used Infomax principle. In modern machine learning, we are often interested in more than just the error rate. In medical diagnosis, different errors incur different cost; hence, the overall risk is cost-sensitive. Two other popular criteria are balanced error rate (BER) and F-score. In this work, we focus on the two-class problem and use a general deﬁnition of conditional entropy (including Shannon’s as a special case) to derive upper/lower bounds on the optimal F-score, BER and cost-sensitive risk, extending Fano’s result. As a consequence, we show that Infomax is not suitable for optimizing F-score or cost-sensitive risk, in that it can potentially lead to low F-score and high risk. For cost-sensitive risk, we propose a new conditional entropy formulation which avoids this inconsistency. In addition, we consider the common practice of using a threshold on the posterior probability to tune performance of a classiﬁer. As is widely known, a threshold of 0.5, where the posteriors cross, minimizes error rate—we derive similar optimal thresholds for F-score and BER. Keywords: balanced error rate, F-score (Fβ -measure), cost-sensitive risk, conditional entropy, lower/upper bound

19 jmlr-2013-BudgetedSVM: A Toolbox for Scalable SVM Approximations

Author: Nemanja Djuric, Liang Lan, Slobodan Vucetic, Zhuang Wang

Abstract: We present BudgetedSVM, an open-source C++ toolbox comprising highly-optimized implementations of recently proposed algorithms for scalable training of Support Vector Machine (SVM) approximators: Adaptive Multi-hyperplane Machines, Low-rank Linearization SVM, and Budgeted Stochastic Gradient Descent. BudgetedSVM trains models with accuracy comparable to LibSVM in time comparable to LibLinear, solving non-linear problems with millions of high-dimensional examples within minutes on a regular computer. We provide command-line and Matlab interfaces to BudgetedSVM, an efﬁcient API for handling large-scale, high-dimensional data sets, as well as detailed documentation to help developers use and further extend the toolbox. Keywords: non-linear classiﬁcation, large-scale learning, SVM, machine learning toolbox

20 jmlr-2013-CODA: High Dimensional Copula Discriminant Analysis

Author: Fang Han, Tuo Zhao, Han Liu

Abstract: We propose a high dimensional classiﬁcation method, named the Copula Discriminant Analysis (CODA). The CODA generalizes the normal-based linear discriminant analysis to the larger Gaussian Copula models (or the nonparanormal) as proposed by Liu et al. (2009). To simultaneously achieve estimation efﬁciency and robustness, the nonparametric rank-based methods including the Spearman’s rho and Kendall’s tau are exploited in estimating the covariance matrix. In high dimensional settings, we prove that the sparsity pattern of the discriminant features can be consistently recovered with the parametric rate, and the expected misclassiﬁcation error is consistent to the Bayes risk. Our theory is backed up by careful numerical experiments, which show that the extra ﬂexibility gained by the CODA method incurs little efﬁciency loss even when the data are truly Gaussian. These results suggest that the CODA method can be an alternative choice besides the normal-based high dimensional linear discriminant analysis. Keywords: high dimensional statistics, sparse nonlinear discriminant analysis, Gaussian copula, nonparanormal distribution, rank-based statistics

21 jmlr-2013-Classifier Selection using the Predicate Depth

22 jmlr-2013-Classifying With Confidence From Incomplete Information

23 jmlr-2013-Cluster Analysis: Unsupervised Learning via Supervised Learning with a Non-convex Penalty

24 jmlr-2013-Comment on "Robustness and Regularization of Support Vector Machines" by H. Xu et al. (Journal of Machine Learning Research, vol. 10, pp. 1485-1510, 2009)

25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization

26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems

27 jmlr-2013-Consistent Selection of Tuning Parameters via Variable Selection Stability

28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning

29 jmlr-2013-Convex and Scalable Weakly Labeled SVMs

30 jmlr-2013-Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising

31 jmlr-2013-Derivative Estimation with Local Polynomial Fitting

32 jmlr-2013-Differential Privacy for Functions and Functional Data