jmlr jmlr2013 knowledge-graph by maker-knowledge-mining

jmlr 2013 knowledge graph


similar papers computed by tfidf model


similar papers computed by lsi model


similar papers computed by lda model


papers list:

1 jmlr-2013-A C++ Template-Based Reinforcement Learning Library: Fitting the Code to the Mathematics

Author: Hervé Frezza-Buet, Matthieu Geist

Abstract: This paper introduces the rllib as an original C++ template-based library oriented toward value function estimation. Generic programming is promoted here as a way of having a good fit between the mathematics of reinforcement learning and their implementation in a library. The main concepts of rllib are presented, as well as a short example. Keywords: reinforcement learning, C++, generic programming

1. C++ Genericity for Fitting the Mathematics of Reinforcement Learning

Reinforcement learning (RL) is a field of machine learning that benefits from a rigorous mathematical formalism, as shown for example by Bertsekas (1995). Although this formalism is well accepted in the field, its translation into efficient computer science tools has surprisingly not led to any standard yet, as mentioned by Kovacs and Egginton (2011). The claim of this paper is that genericity enables a natural expression of the mathematics of RL. The rllib (2011) library implements this idea in the C++ language, where genericity relies on templates. Templates automate the re-writing of some generic code involving user types, offering strong type checking at compile time that improves code safety. Using the rllib templates requires that the user-defined types fit some documented concepts. For example, some class C defining an agent should be designed so that C::state_type is the type used for the states, C::action_type is the type used for the actions, and the method C::action_type policy(const C::state_type& s) const is implemented, in order to compute the action to be performed in a given state. This concept definition specifies what is required of an agent mathematically. Note that C does not need to inherit from any kind of abstract rl::Agent class to be used by the rllib tools. It can be directly provided as a type argument to any rllib template requiring an argument fitting the concept of an agent, so that the re-written code actually compiles.

2. A Short Example

Let us consider the following toy example. The state space contains values from 0 to 9, and actions consist in increasing or decreasing the value. When the value gets out of bounds, a reward is returned (-1 for bound 0, 1 for bound 9). Otherwise, a null reward is returned. Let us define this problem and run Sarsa. First, a simulator class fitting the concept Simulator described in the documentation is needed.

    class Sim {  // Our simulator class. No inheritance required.
    private:
      int current;
      double r;
    public:
      typedef int    phase_type;
      typedef int    observation_type;
      typedef enum {up, down} action_type;
      typedef double reward_type;

      Sim(void) : current(0), r(0) {}
      void setPhase(const phase_type& s) { current = s % 10; }
      const observation_type& sense(void) const { return current; }
      reward_type reward(void) const { return r; }
      void timeStep(const action_type& a) {
        if (a == up) current++; else current--;
        if (current < 0) r = -1; else if (current > 9) r = 1; else r = 0;
        if (r != 0) throw rl::exception::Terminal("Out of range");
      }
    };

Following the concept requirements, the class Sim naturally implements a sensor method sense that provides an observation from the current phase of the controlled dynamical system, and a method timeStep that computes a transition consecutive to some action. Note the use of exceptions for terminal states. For the sake of simplicity in further code, a series of typedefs is added: S and A for Sim::phase_type and Sim::action_type, a Transition type, and Critic, ArgmaxCritic, LearnAgent and TestAgent types built from rllib templates such as rl::Iterator and rl::agent::online::EpsilonGreedy. The rllib expresses that Sarsa provides a critic, offering a Q-function. As actions are discrete, the best action (i.e., argmax_{a∈A} Q(s, a)) can be found by considering all the actions sequentially. This is what ArgmaxCritic offers thanks to the action enumerator Aenum, in order to define greedy and ε-greedy agents. The main function then only consists in running episodes with the appropriate agents.

    int main(int argc, char* argv[]) {
      Sim          simulator;        // This is what the agent controls.
      Transition   transition;       // This is some s, a, r, s', a' data.
      ArgmaxCritic critic;           // This computes Q and argmax_a Q(s, a).
      LearnAgent   learner(critic);  // SARSA uses this agent to learn the policy.
      TestAgent    tester(critic);   // This behaves according to the critic.
      A            a;                // Some action.
      S            s;                // Some state.
      int          episode, length, step = 0;

      for (episode = 0; episode < 10000; ++episode) {  // Learning phase
        simulator.setPhase(rand() % 10);
        rl::episode::sa::run_and_learn(simulator, learner, transition, 0, length);
      }
      try {  // Test phase
        simulator.setPhase(0);
        while (true) {
          s = simulator.sense();
          a = tester.policy(s);
          step++;
          simulator.timeStep(a);
        }
      } catch (rl::exception::Terminal e) {
        std::cout << step << " steps." << std::endl;
      }
      return 0;  // the message printed is "10 steps."
    }

3. Features of the Library

Using the library requires defining the features that are specific to the problem (the simulator and the Q-function architecture in our example) from scratch, but with the help of concepts. Then, the specific features can be handled by generic code provided by the library to implement RL techniques with value function estimation. Currently, Q-learning, Sarsa, KTD-Q, LSTD, and policy iteration are available, as well as a multi-layer perceptron architecture. Moreover, some benchmark problems (i.e., simulators) are also provided: the mountain car, the cliff walking, the inverted pendulum and the Boyan chain. Extending the library with new algorithms is allowed, since it consists in defining new templates. This is a bit more technical than only using the existing algorithms, but the structure of existing concepts helps, since it reflects the mathematics of RL. For example, concepts like Feature, mainly for linear approaches (i.e., Q(s, a) = θ^T φ(s, a)), and Architecture (i.e., Q(s, a) = f_θ(s, a) for more general approximation) orient the design toward functional approaches to RL. The algorithms implemented so far rely on the GNU Scientific Library (see GSL, 2011) for linear algebra computation, so the GPL licence of GSL propagates to the rllib.

4. Conclusion

The rllib relies only on the C++ standard and the availability of the GSL on the system. It offers state-action function approximation tools for applying RL to real problems, as well as a design that fits the mathematics. The latter allows for extensions, but is also compliant with pedagogical purposes. The design of the rllib aims at allowing the user to build (using C++ programming) his or her own experiment, using several algorithms, several agents, on-line or batch learning, and so on. Actually, the difficult part of RL is the algorithms themselves, not the script-like part of the experiment where things are put together (see the main function in our example). With a framework, in the sense of Kovacs and Egginton (2011), the experiment is not directly accessible to the user programs, since it is handled by some libraries in order to offer graphical interfaces or analysis tools. The user code is then called by the framework when required. We advocate that allowing the user to call the rllib functionality at his/her convenience provides an open and extensible access to RL for students, researchers and engineers. Last, the rllib fits the requirements expressed by Kovacs and Egginton (2011, Section 4.3): support of good scientific research, formulation compliant with the domain, allowing for any kind of agents and any kind of approximators, interoperability of components (the Q-function of the example can be used with different algorithms and agents), maximization of run-time speed (use of C++ and templates that massively inline the code), open source, etc. Extensions of rllib can be considered, for example for handling POMDPs, and contributions from users are expected. The use of templates is unfortunately unfamiliar to many programmers, but the effort is worth it, since it brings the code to the level of the mathematical formalism, increasing readability (by a rational use of typedefs) and reducing bugs. Even if the approach is dramatically different from existing frameworks, wrappings with frameworks can be considered in further developments.

References

Dimitri P. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, 3rd (2005-2007) edition, 1995.

GSL, 2011. http://www.gnu.org/software/gsl.

Tim Kovacs and Robert Egginton. On the analysis and design of software for reinforcement learning, with a survey of existing systems. Machine Learning, 84:7-49, 2011.

rllib, 2011. http://ims.metz.supelec.fr/spip.php?article122.

2 jmlr-2013-A Binary-Classification-Based Metric between Time-Series Distributions and Its Use in Statistical and Learning Problems

Author: Daniil Ryabko, Jérémie Mary

Abstract: A metric between time-series distributions is proposed that can be evaluated using binary classification methods, which were originally developed to work on i.i.d. data. It is shown how this metric can be used for solving statistical problems that are seemingly unrelated to classification and concern highly dependent time series. Specifically, the problems of time-series clustering, homogeneity testing and the three-sample problem are addressed. Universal consistency of the resulting algorithms is proven under most general assumptions. The theoretical results are illustrated with experiments on synthetic and real-world data. Keywords: time series, reductions, stationary ergodic, clustering, metrics between probability distributions
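As background for the reduction described above, here is the basic link between binary classification performance and a distance between two distributions ρ1 and ρ2 in the i.i.d., equal-prior case (the paper's metric for dependent time series is built on a more elaborate construction; this block only illustrates why classification error carries metric information):

    \min_{g} \; \tfrac{1}{2}\Pr_{X \sim \rho_1}\{g(X) = 2\} + \tfrac{1}{2}\Pr_{X \sim \rho_2}\{g(X) = 1\}
        \;=\; \tfrac{1}{2}\bigl(1 - d_{\mathrm{TV}}(\rho_1, \rho_2)\bigr),

so an empirical classification error close to 1/2 indicates nearly indistinguishable distributions, while an error close to 0 indicates well-separated ones.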

3 jmlr-2013-A Framework for Evaluating Approximation Methods for Gaussian Process Regression

Author: Krzysztof Chalupka, Christopher K. I. Williams, Iain Murray

Abstract: Gaussian process (GP) predictors are an important component of many Bayesian approaches to machine learning. However, even a straightforward implementation of Gaussian process regression (GPR) requires O(n^2) space and O(n^3) time for a data set of n examples. Several approximation methods have been proposed, but there is a lack of understanding of the relative merits of the different approximations, and in what situations they are most useful. We recommend assessing the quality of the predictions obtained as a function of the compute time taken, and comparing to standard baselines (e.g., Subset of Data and FITC). We empirically investigate four different approximation algorithms on four different prediction problems, and make our code available to encourage future comparisons. Keywords: Gaussian process regression, subset of data, FITC, local GP
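The Subset of Data baseline mentioned above simply runs exact GP regression on a random subset of m << n points, so its cost is governed by m rather than n. A minimal numpy sketch of that baseline, assuming a squared-exponential kernel with hypothetical hyperparameters (lengthscale, signal and noise variances); it illustrates the idea and is not the authors' code:

    import numpy as np

    def rbf_kernel(A, B, lengthscale=1.0, signal_var=1.0):
        # Squared-exponential (RBF) kernel matrix between rows of A and rows of B.
        d2 = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
        return signal_var * np.exp(-0.5 * d2 / lengthscale**2)

    def sod_gp_predict(X, y, X_star, m=256, noise_var=0.1, seed=0):
        """Subset-of-Data GP regression: exact GP on m randomly chosen training points."""
        rng = np.random.default_rng(seed)
        idx = rng.choice(len(X), size=min(m, len(X)), replace=False)
        Xs, ys = X[idx], y[idx]
        K = rbf_kernel(Xs, Xs) + noise_var * np.eye(len(idx))   # O(m^3) instead of O(n^3)
        L = np.linalg.cholesky(K)
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, ys))    # K^{-1} y via triangular solves
        Ks = rbf_kernel(X_star, Xs)
        mean = Ks @ alpha
        v = np.linalg.solve(L, Ks.T)
        var = rbf_kernel(X_star, X_star).diagonal() - np.sum(v**2, axis=0) + noise_var
        return mean, var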

4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion

Author: Tony Cai, Wen-Xin Zhou

Abstract: We consider in this paper the problem of noisy 1-bit matrix completion under a general non-uniform sampling distribution using the max-norm as a convex relaxation for the rank. A max-norm constrained maximum likelihood estimate is introduced and studied. The rate of convergence for the estimate is obtained. Information-theoretical methods are used to establish a minimax lower bound under the general sampling model. The minimax upper and lower bounds together yield the optimal rate of convergence for the Frobenius norm loss. Computational algorithms and numerical performance are also discussed. Keywords: 1-bit matrix completion, low-rank matrix, max-norm, trace-norm, constrained optimization, maximum likelihood estimate, optimal rate of convergence
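For concreteness, a schematic form of the 1-bit observation model and of the max-norm constrained maximum likelihood estimate described in the abstract (the link function f, e.g. a logistic or probit function, and the max-norm radius R are placeholders, not values taken from the paper): entries are observed on a sampled index set Ω with Y_{ij} = +1 with probability f(M_{ij}) and Y_{ij} = -1 otherwise, and the estimator solves

    \hat{M} \;=\; \arg\max_{\|M\|_{\max} \le R} \;\sum_{(i,j) \in \Omega}
        \Bigl[\, \mathbf{1}\{Y_{ij}=1\}\,\log f(M_{ij})
               + \mathbf{1}\{Y_{ij}=-1\}\,\log\bigl(1 - f(M_{ij})\bigr) \,\Bigr].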

5 jmlr-2013-A Near-Optimal Algorithm for Differentially-Private Principal Components

Author: Kamalika Chaudhuri, Anand D. Sarwate, Kaushik Sinha

Abstract: The principal components analysis (PCA) algorithm is a standard tool for identifying good low-dimensional approximations to high-dimensional data. Many data sets of interest contain private or sensitive information about individuals. Algorithms which operate on such data should be sensitive to the privacy risks in publishing their outputs. Differential privacy is a framework for developing tradeoffs between privacy and the utility of these outputs. In this paper we investigate the theory and empirical performance of differentially private approximations to PCA and propose a new method which explicitly optimizes the utility of the output. We show that the sample complexity of the proposed method differs from the existing procedure in the scaling with the data dimension, and that our method is nearly optimal in terms of this scaling. We furthermore illustrate our results, showing that on real data there is a large performance gap between the existing method and our method. Keywords: differential privacy, principal components analysis, dimension reduction
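As background for the privacy framework the abstract refers to, recall the standard definition of ε-differential privacy: a randomized algorithm A is ε-differentially private if, for every pair of data sets D and D' differing in a single individual's record and every measurable output set S,

    \Pr[\,A(D) \in S\,] \;\le\; e^{\varepsilon}\,\Pr[\,A(D') \in S\,].

A differentially private PCA therefore has to randomize the reported subspace so that any single record has only a bounded multiplicative effect on the output distribution.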

6 jmlr-2013-A Plug-in Approach to Neyman-Pearson Classification

Author: Xin Tong

Abstract: The Neyman-Pearson (NP) paradigm in binary classification treats type I and type II errors with different priorities. It seeks classifiers that minimize type II error, subject to a type I error constraint under a user-specified level α. In this paper, plug-in classifiers are developed under the NP paradigm. Based on the fundamental Neyman-Pearson Lemma, we propose two related plug-in classifiers which amount to thresholding respectively the class conditional density ratio and the regression function. These two classifiers handle different sampling schemes. This work focuses on theoretical properties of the proposed classifiers; in particular, we derive oracle inequalities that can be viewed as finite sample versions of risk bounds. NP classification can be used to address anomaly detection problems, where asymmetry in errors is an intrinsic property. As opposed to a common practice in anomaly detection that consists of thresholding the normal class density, our approach does not assume a specific form for anomaly distributions. Such consideration is particularly necessary when the anomaly class density is far from uniformly distributed. Keywords: plug-in approach, Neyman-Pearson paradigm, nonparametric statistics, oracle inequality, anomaly detection
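In symbols, with class-conditional densities p_0 (normal class) and p_1 (anomaly class), the NP paradigm and the Neyman-Pearson Lemma behind the first plug-in classifier read as follows (the threshold t_α is chosen so that the type I error constraint is met; the plug-in versions replace the unknown densities by estimates):

    \min_{\phi}\; \Pr\bigl(\phi(X) = 0 \mid Y = 1\bigr)
        \quad \text{subject to} \quad \Pr\bigl(\phi(X) = 1 \mid Y = 0\bigr) \le \alpha,
    \qquad
    \phi^{*}(x) \;=\; \mathbf{1}\Bigl\{\, \frac{p_1(x)}{p_0(x)} \ge t_\alpha \,\Bigr\}.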

7 jmlr-2013-A Risk Comparison of Ordinary Least Squares vs Ridge Regression

Author: Paramveer S. Dhillon, Dean P. Foster, Sham M. Kakade, Lyle H. Ungar

Abstract: We compare the risk of ridge regression to a simple variant of ordinary least squares, in which one simply projects the data onto a finite dimensional subspace (as specified by a principal component analysis) and then performs an ordinary (un-regularized) least squares regression in this subspace. This note shows that the risk of this ordinary least squares method (PCA-OLS) is within a constant factor (namely 4) of the risk of ridge regression (RR). Keywords: risk inflation, ridge regression, pca
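A minimal numpy sketch of the two estimators being compared: PCA-OLS (project onto the top-k principal components, then run unregularized least squares) and ridge regression. The values of k and λ are illustrative placeholders, and this is not the authors' experimental code:

    import numpy as np

    def ridge_fit(X, y, lam=1.0):
        # Ridge regression coefficients: (X^T X + lam * I)^{-1} X^T y.
        d = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

    def pca_ols_fit(X, y, k=10):
        # PCA-OLS: project onto the top-k principal directions, then plain OLS there.
        # Assumes the columns of X have been centered.
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        V_k = Vt[:k].T                       # top-k principal directions (d x k)
        Z = X @ V_k                          # data in the k-dimensional subspace
        gamma, *_ = np.linalg.lstsq(Z, y, rcond=None)
        return V_k @ gamma                   # coefficients mapped back to the original space

    # Example usage on a train/test split:
    # y_hat_ridge  = X_test @ ridge_fit(X_train, y_train, lam=1.0)
    # y_hat_pcaols = X_test @ pca_ols_fit(X_train, y_train, k=10)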

8 jmlr-2013-A Theory of Multiclass Boosting

Author: Indraneel Mukherjee, Robert E. Schapire

Abstract: Boosting combines weak classifiers to form highly accurate predictors. Although the case of binary classification is well understood, in the multiclass setting, the “correct” requirements on the weak classifier, or the notion of the most efficient boosting algorithms are missing. In this paper, we create a broad and general framework, within which we make precise and identify the optimal requirements on the weak-classifier, as well as design the most effective, in a certain sense, boosting algorithms that assume such requirements. Keywords: multiclass, boosting, weak learning condition, drifting games

9 jmlr-2013-A Widely Applicable Bayesian Information Criterion

Author: Sumio Watanabe

Abstract: A statistical model or a learning machine is called regular if the map taking a parameter to a probability distribution is one-to-one and if its Fisher information matrix is always positive definite. If otherwise, it is called singular. In regular statistical models, the Bayes free energy, which is defined by the minus logarithm of Bayes marginal likelihood, can be asymptotically approximated by the Schwarz Bayes information criterion (BIC), whereas in singular models such approximation does not hold. Recently, it was proved that the Bayes free energy of a singular model is asymptotically given by a generalized formula using a birational invariant, the real log canonical threshold (RLCT), instead of half the number of parameters in BIC. Theoretical values of RLCTs in several statistical models are now being discovered based on algebraic geometrical methodology. However, it has been difficult to estimate the Bayes free energy using only training samples, because an RLCT depends on an unknown true distribution. In the present paper, we define a widely applicable Bayesian information criterion (WBIC) by the average log likelihood function over the posterior distribution with the inverse temperature 1/ log n, where n is the number of training samples. We mathematically prove that WBIC has the same asymptotic expansion as the Bayes free energy, even if a statistical model is singular for or unrealizable by a true distribution. Since WBIC can be numerically calculated without any information about a true distribution, it is a generalized version of BIC onto singular statistical models. Keywords: Bayes marginal likelihood, widely applicable Bayes information criterion
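In formulas, with training samples X_1, ..., X_n, prior φ(w) and inverse temperature β = 1/log n, WBIC is the posterior average of the log likelihood at that temperature (this is a transcription of the definition stated in the abstract, not an additional result):

    \mathrm{WBIC} \;=\; \mathbb{E}_{w}^{\beta}\Bigl[\, \sum_{i=1}^{n} \log p(X_i \mid w) \,\Bigr],
    \qquad
    p_{\beta}(w) \;\propto\; \varphi(w)\,\prod_{i=1}^{n} p(X_i \mid w)^{\beta},
    \qquad
    \beta = \frac{1}{\log n},

where the expectation is taken under the tempered posterior p_β. The claim is that this quantity shares the asymptotic expansion of the Bayes free energy (minus the log marginal likelihood), including in singular models.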

10 jmlr-2013-Algorithms and Hardness Results for Parallel Large Margin Learning

Author: Philip M. Long, Rocco A. Servedio

Abstract: We consider the problem of learning an unknown large-margin halfspace in the context of parallel computation, giving both positive and negative results. As our main positive result, we give a parallel algorithm for learning a large-margin halfspace, based on an algorithm of Nesterov’s that performs gradient descent with a momentum term. We show that this algorithm can learn an unknown γ-margin halfspace over n dimensions using n · poly(1/γ) processors and running in time Õ(1/γ) + O(log n). In contrast, naive parallel algorithms that learn a γ-margin halfspace in time that depends polylogarithmically on n have an inverse quadratic running time dependence on the margin parameter γ. Our negative result deals with boosting, which is a standard approach to learning large-margin halfspaces. We prove that in the original PAC framework, in which a weak learning algorithm is provided as an oracle that is called by the booster, boosting cannot be parallelized. More precisely, we show that, if the algorithm is allowed to call the weak learner multiple times in parallel within a single boosting stage, this ability does not reduce the overall number of successive stages of boosting needed for learning by even a single stage. Our proof is information-theoretic and does not rely on unproven assumptions. Keywords: PAC learning, parallel learning algorithms, halfspace learning, linear classifiers

11 jmlr-2013-Algorithms for Discovery of Multiple Markov Boundaries

Author: Alexander Statnikov, Nikita I. Lytkin, Jan Lemeire, Constantin F. Aliferis

Abstract: Algorithms for Markov boundary discovery from data constitute an important recent development in machine learning, primarily because they offer a principled solution to the variable/feature selection problem and give insight on local causal structure. Over the last decade many sound algorithms have been proposed to identify a single Markov boundary of the response variable. Even though faithful distributions and, more broadly, distributions that satisfy the intersection property always have a single Markov boundary, other distributions/data sets may have multiple Markov boundaries of the response variable. The latter distributions/data sets are common in practical data-analytic applications, and there are several reasons why it is important to induce multiple Markov boundaries from such data. However, there are currently no sound and efficient algorithms that can accomplish this task. This paper describes a family of algorithms TIE* that can discover all Markov boundaries in a distribution. The broad applicability as well as efficiency of the new algorithmic family is demonstrated in an extensive benchmarking study that involved comparison with 26 state-of-the-art algorithms/variants in 15 data sets from a diversity of application domains. Keywords: Markov boundary discovery, variable/feature selection, information equivalence, violations of faithfulness

12 jmlr-2013-Alleviating Naive Bayes Attribute Independence Assumption by Attribute Weighting

Author: Nayyar A. Zaidi, Jesús Cerquides, Mark J. Carman, Geoffrey I. Webb

Abstract: Despite the simplicity of the Naive Bayes classifier, it has continued to perform well against more sophisticated newcomers and has remained, therefore, of great interest to the machine learning community. Of numerous approaches to refining the naive Bayes classifier, attribute weighting has received less attention than it warrants. Most approaches, perhaps influenced by attribute weighting in other machine learning algorithms, use weighting to place more emphasis on highly predictive attributes than those that are less predictive. In this paper, we argue that for naive Bayes attribute weighting should instead be used to alleviate the conditional independence assumption. Based on this premise, we propose a weighted naive Bayes algorithm, called WANBIA, that selects weights to minimize either the negative conditional log likelihood or the mean squared error objective functions. We perform extensive evaluations and find that WANBIA is a competitive alternative to state of the art classifiers like Random Forest, Logistic Regression and A1DE. Keywords: classification, naive Bayes, attribute independence assumption, weighted naive Bayes classification
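The weighted naive Bayes model underlying this description raises each attribute's class-conditional probability to an attribute-specific power, so that setting all weights to 1 recovers standard naive Bayes while smaller weights soften the independence assumption. A schematic form consistent with the abstract (the weights w_j are the quantities WANBIA learns by minimizing the negative conditional log-likelihood or the mean squared error):

    P(y \mid x_1, \dots, x_d) \;\propto\; P(y)\,\prod_{j=1}^{d} P(x_j \mid y)^{\,w_j}.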

13 jmlr-2013-Approximating the Permanent with Fractional Belief Propagation

Author: Michael Chertkov, Adam B. Yedidia

Abstract: We discuss schemes for exact and approximate computations of permanents, and compare them with each other. Specifically, we analyze the belief propagation (BP) approach and its fractional belief propagation (FBP) generalization for computing the permanent of a non-negative matrix. Known bounds and conjectures are verified in experiments, and some new theoretical relations, bounds and conjectures are proposed. The fractional free energy (FFE) function is parameterized by a scalar parameter γ ∈ [−1; 1], where γ = −1 corresponds to the BP limit and γ = 1 corresponds to the exclusion principle (but ignoring perfect matching constraints) mean-field (MF) limit. FFE shows monotonicity and continuity with respect to γ. For every non-negative matrix, we define its special value γ∗ ∈ [−1; 0] to be the γ for which the minimum of the γ-parameterized FFE function is equal to the permanent of the matrix, where the lower and upper bounds of the γ-interval correspond to respective bounds for the permanent. Our experimental analysis suggests that the distribution of γ∗ varies for different ensembles but γ∗ always lies within the [−1; −1/2] interval. Moreover, for all ensembles considered, the behavior of γ∗ is highly distinctive, offering an empirical practical guidance for estimating permanents of non-negative matrices via the FFE approach. Keywords: permanent, graphical models, belief propagation, exact and approximate algorithms, learning flows

14 jmlr-2013-Asymptotic Results on Adaptive False Discovery Rate Controlling Procedures Based on Kernel Estimators

Author: Pierre Neuvial

Abstract: The False Discovery Rate (FDR) is a commonly used type I error rate in multiple testing problems. It is defined as the expected False Discovery Proportion (FDP), that is, the expected fraction of false positives among rejected hypotheses. When the hypotheses are independent, the Benjamini-Hochberg procedure achieves FDR control at any pre-specified level. By construction, FDR control offers no guarantee in terms of power, or type II error. A number of alternative procedures have been developed, including plug-in procedures that aim at gaining power by incorporating an estimate of the proportion of true null hypotheses. In this paper, we study the asymptotic behavior of a class of plug-in procedures based on kernel estimators of the density of the p-values, as the number m of tested hypotheses grows to infinity. In a setting where the hypotheses tested are independent, we prove that these procedures are asymptotically more powerful in two respects: (i) a tighter asymptotic FDR control for any target FDR level and (ii) a broader range of target levels yielding positive asymptotic power. We also show that this increased asymptotic power comes at the price of slower, non-parametric convergence rates for the FDP. These rates are of the form m^{-k/(2k+1)}, where k is determined by the regularity of the density of the p-value distribution, or, equivalently, of the test statistics distribution. These results are applied to one- and two-sided test statistics for Gaussian and Laplace location models, and for the Student model. Keywords: multiple testing, false discovery rate, Benjamini-Hochberg's procedure, power, criticality, plug-in procedures, adaptive control, test statistics distribution, convergence rates, kernel estimators
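As a reference point for the procedures being analyzed, a short numpy sketch of the Benjamini-Hochberg step-up procedure and its plug-in (adaptive) variant. Here pi0_hat is an externally supplied estimate of the proportion of true null hypotheses; in the paper it comes from kernel estimators of the p-value density, whereas below it is just a parameter:

    import numpy as np

    def benjamini_hochberg(pvals, alpha=0.05, pi0_hat=1.0):
        """BH step-up at level alpha; pi0_hat < 1 gives the plug-in (adaptive) variant."""
        p = np.asarray(pvals, dtype=float)
        m = p.size
        order = np.argsort(p)
        # Compare the i-th smallest p-value against i * alpha / (m * pi0_hat).
        thresholds = (np.arange(1, m + 1) / m) * (alpha / pi0_hat)
        below = p[order] <= thresholds
        rejected = np.zeros(m, dtype=bool)
        if below.any():
            k = np.max(np.nonzero(below)[0])   # largest rank meeting its threshold
            rejected[order[:k + 1]] = True     # reject all hypotheses up to that rank
        return rejected

    # Example: rejected = benjamini_hochberg(pvals, alpha=0.05, pi0_hat=0.8)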

15 jmlr-2013-Bayesian Canonical Correlation Analysis

Author: Arto Klami, Seppo Virtanen, Samuel Kaski

Abstract: Canonical correlation analysis (CCA) is a classical method for seeking correlations between two multivariate data sets. During the last ten years, it has received more and more attention in the machine learning community in the form of novel computational formulations and a plethora of applications. We review recent developments in Bayesian models and inference methods for CCA which are attractive for their potential in hierarchical extensions and for coping with the combination of large dimensionalities and small sample sizes. The existing methods have not been particularly successful in fulfilling the promise yet; we introduce a novel efficient solution that imposes group-wise sparsity to estimate the posterior of an extended model which not only extracts the statistical dependencies (correlations) between data sets but also decomposes the data into shared and data set-specific components. In statistics literature the model is known as inter-battery factor analysis (IBFA), for which we now provide a Bayesian treatment. Keywords: Bayesian modeling, canonical correlation analysis, group-wise sparsity, inter-battery factor analysis, variational Bayesian approximation

16 jmlr-2013-Bayesian Nonparametric Hidden Semi-Markov Models

Author: Matthew J. Johnson, Alan S. Willsky

Abstract: There is much interest in the Hierarchical Dirichlet Process Hidden Markov Model (HDP-HMM) as a natural Bayesian nonparametric extension of the ubiquitous Hidden Markov Model for learning from sequential and time-series data. However, in many settings the HDP-HMM’s strict Markovian constraints are undesirable, particularly if we wish to learn or encode non-geometric state durations. We can extend the HDP-HMM to capture such structure by drawing upon explicit-duration semi-Markov modeling, which has been developed mainly in the parametric non-Bayesian setting, to allow construction of highly interpretable models that admit natural prior information on state durations. In this paper we introduce the explicit-duration Hierarchical Dirichlet Process Hidden semi-Markov Model (HDP-HSMM) and develop sampling algorithms for efficient posterior inference. The methods we introduce also provide new methods for sampling inference in the finite Bayesian HSMM. Our modular Gibbs sampling methods can be embedded in samplers for larger hierarchical Bayesian models, adding semi-Markov chain modeling as another tool in the Bayesian inference toolbox. We demonstrate the utility of the HDP-HSMM and our inference methods on both synthetic and real experiments. Keywords: Bayesian nonparametrics, time series, semi-Markov, sampling algorithms, Hierarchical Dirichlet Process Hidden Markov Model

17 jmlr-2013-Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees

Author: Nima Noorshams, Martin J. Wainwright

Abstract: The sum-product or belief propagation (BP) algorithm is a widely used message-passing technique for computing approximate marginals in graphical models. We introduce a new technique, called stochastic orthogonal series message-passing (SOSMP), for computing the BP fixed point in models with continuous random variables. It is based on a deterministic approximation of the messages via orthogonal series basis expansion, and a stochastic estimation of the basis coefficients via Monte Carlo techniques and damped updates. We prove that the SOSMP iterates converge to a δ-neighborhood of the unique BP fixed point for any tree-structured graph, and for any graphs with cycles in which the BP updates satisfy a contractivity condition. In addition, we demonstrate how to choose the number of basis coefficients as a function of the desired approximation accuracy δ and smoothness of the compatibility functions. We illustrate our theory with both simulated examples and in application to optical flow estimation. Keywords: graphical models, sum-product for continuous state spaces, low-complexity belief propagation, stochastic approximation, Monte Carlo methods, orthogonal basis expansion
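The update described above can be written schematically as a damped stochastic-approximation step on the vector of basis coefficients of each message (r basis functions, step sizes η_t). This is a paraphrase of the description in the abstract, not the paper's exact recursion:

    a_{u \to v}^{\,t+1} \;=\; (1 - \eta_t)\, a_{u \to v}^{\,t} \;+\; \eta_t\, \widehat{F}_{u \to v}\bigl(a^{\,t}\bigr),

where a_{u→v} ∈ R^r collects the first r orthogonal-series coefficients of the message from node u to node v, and \widehat{F}_{u→v} is an unbiased Monte Carlo estimate of the BP update projected onto those r basis functions.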

18 jmlr-2013-Beyond Fano's Inequality: Bounds on the Optimal F-Score, BER, and Cost-Sensitive Risk and Their Implications

Author: Ming-Jie Zhao, Narayanan Edakunni, Adam Pocock, Gavin Brown

Abstract: Fano’s inequality lower bounds the probability of transmission error through a communication channel. Applied to classification problems, it provides a lower bound on the Bayes error rate and motivates the widely used Infomax principle. In modern machine learning, we are often interested in more than just the error rate. In medical diagnosis, different errors incur different cost; hence, the overall risk is cost-sensitive. Two other popular criteria are balanced error rate (BER) and F-score. In this work, we focus on the two-class problem and use a general definition of conditional entropy (including Shannon’s as a special case) to derive upper/lower bounds on the optimal F-score, BER and cost-sensitive risk, extending Fano’s result. As a consequence, we show that Infomax is not suitable for optimizing F-score or cost-sensitive risk, in that it can potentially lead to low F-score and high risk. For cost-sensitive risk, we propose a new conditional entropy formulation which avoids this inconsistency. In addition, we consider the common practice of using a threshold on the posterior probability to tune performance of a classifier. As is widely known, a threshold of 0.5, where the posteriors cross, minimizes error rate—we derive similar optimal thresholds for F-score and BER. Keywords: balanced error rate, F-score (Fβ -measure), cost-sensitive risk, conditional entropy, lower/upper bound
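For the cost-sensitive case mentioned at the end of the abstract, the standard optimal threshold on the posterior η(x) = P(Y = 1 | x) follows from comparing expected costs; c_FP and c_FN below denote the costs of a false positive and a false negative. With c_FP = c_FN this reduces to the familiar 0.5 threshold cited in the abstract, while the paper derives the analogous thresholds for F-score and BER:

    \text{predict } 1 \;\iff\; (1 - \eta(x))\,c_{FP} \;\le\; \eta(x)\,c_{FN}
    \;\iff\; \eta(x) \;\ge\; \frac{c_{FP}}{c_{FP} + c_{FN}}.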

19 jmlr-2013-BudgetedSVM: A Toolbox for Scalable SVM Approximations

Author: Nemanja Djuric, Liang Lan, Slobodan Vucetic, Zhuang Wang

Abstract: We present BudgetedSVM, an open-source C++ toolbox comprising highly-optimized implementations of recently proposed algorithms for scalable training of Support Vector Machine (SVM) approximators: Adaptive Multi-hyperplane Machines, Low-rank Linearization SVM, and Budgeted Stochastic Gradient Descent. BudgetedSVM trains models with accuracy comparable to LibSVM in time comparable to LibLinear, solving non-linear problems with millions of high-dimensional examples within minutes on a regular computer. We provide command-line and Matlab interfaces to BudgetedSVM, an efficient API for handling large-scale, high-dimensional data sets, as well as detailed documentation to help developers use and further extend the toolbox. Keywords: non-linear classification, large-scale learning, SVM, machine learning toolbox

20 jmlr-2013-CODA: High Dimensional Copula Discriminant Analysis

Author: Fang Han, Tuo Zhao, Han Liu

Abstract: We propose a high dimensional classification method, named the Copula Discriminant Analysis (CODA). The CODA generalizes the normal-based linear discriminant analysis to the larger Gaussian Copula models (or the nonparanormal) as proposed by Liu et al. (2009). To simultaneously achieve estimation efficiency and robustness, the nonparametric rank-based methods including the Spearman’s rho and Kendall’s tau are exploited in estimating the covariance matrix. In high dimensional settings, we prove that the sparsity pattern of the discriminant features can be consistently recovered with the parametric rate, and the expected misclassification error is consistent to the Bayes risk. Our theory is backed up by careful numerical experiments, which show that the extra flexibility gained by the CODA method incurs little efficiency loss even when the data are truly Gaussian. These results suggest that the CODA method can be an alternative choice besides the normal-based high dimensional linear discriminant analysis. Keywords: high dimensional statistics, sparse nonlinear discriminant analysis, Gaussian copula, nonparanormal distribution, rank-based statistics
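The rank-based covariance estimation referred to above typically relies on the standard Gaussian-copula transforms of Kendall's tau and Spearman's rho into correlations. A sketch of those usual transforms (the paper gives the precise estimators and their theory):

    \widehat{\Sigma}^{\,\tau}_{jk} \;=\; \sin\!\Bigl(\frac{\pi}{2}\,\widehat{\tau}_{jk}\Bigr),
    \qquad
    \widehat{\Sigma}^{\,\rho}_{jk} \;=\; 2\,\sin\!\Bigl(\frac{\pi}{6}\,\widehat{\rho}_{jk}\Bigr),

where \widehat{\tau}_{jk} and \widehat{\rho}_{jk} are the sample Kendall and Spearman rank correlations between variables j and k.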

21 jmlr-2013-Classifier Selection using the Predicate Depth

22 jmlr-2013-Classifying With Confidence From Incomplete Information

23 jmlr-2013-Cluster Analysis: Unsupervised Learning via Supervised Learning with a Non-convex Penalty

24 jmlr-2013-Comment on "Robustness and Regularization of Support Vector Machines" by H. Xu et al. (Journal of Machine Learning Research, vol. 10, pp. 1485-1510, 2009)

25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization

26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems

27 jmlr-2013-Consistent Selection of Tuning Parameters via Variable Selection Stability

28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning

29 jmlr-2013-Convex and Scalable Weakly Labeled SVMs

30 jmlr-2013-Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising

31 jmlr-2013-Derivative Estimation with Local Polynomial Fitting

32 jmlr-2013-Differential Privacy for Functions and Functional Data

33 jmlr-2013-Dimension Independent Similarity Computation

34 jmlr-2013-Distance Preserving Embeddings for General n-Dimensional Manifolds

35 jmlr-2013-Distribution-Dependent Sample Complexity of Large Margin Learning

36 jmlr-2013-Distributions of Angles in Random Packing on Spheres

37 jmlr-2013-Divvy: Fast and Intuitive Exploratory Data Analysis

38 jmlr-2013-Dynamic Affine-Invariant Shape-Appearance Handshape Features and Classification in Sign Language Videos

39 jmlr-2013-Efficient Active Learning of Halfspaces: An Aggressive Approach

40 jmlr-2013-Efficient Program Synthesis Using Constraint Satisfaction in Inductive Logic Programming

41 jmlr-2013-Experiment Selection for Causal Discovery

42 jmlr-2013-Fast Generalized Subset Scan for Anomalous Pattern Detection

43 jmlr-2013-Fast MCMC Sampling for Markov Jump Processes and Extensions

44 jmlr-2013-Finding Optimal Bayesian Networks Using Precedence Constraints

45 jmlr-2013-GPstuff: Bayesian Modeling with Gaussian Processes

46 jmlr-2013-GURLS: A Least Squares Library for Supervised Learning

47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference

48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation

49 jmlr-2013-Global Analytic Solution of Fully-observed Variational Bayesian Matrix Factorization

50 jmlr-2013-Greedy Feature Selection for Subspace Clustering

51 jmlr-2013-Greedy Sparsity-Constrained Optimization

52 jmlr-2013-How to Solve Classification and Regression Problems on High-Dimensional Data with a Supervised Extension of Slow Feature Analysis

53 jmlr-2013-Improving CUR Matrix Decomposition and the Nystrom Approximation via Adaptive Sampling

54 jmlr-2013-JKernelMachines: A Simple Framework for Kernel Machines

55 jmlr-2013-Joint Harmonic Functions and Their Supervised Connections

56 jmlr-2013-Keep It Simple And Sparse: Real-Time Action Recognition

57 jmlr-2013-Kernel Bayes' Rule: Bayesian Inference with Positive Definite Kernels

58 jmlr-2013-Language-Motivated Approaches to Action Recognition

59 jmlr-2013-Large-scale SVD and Manifold Learning

60 jmlr-2013-Learning Bilinear Model for Matching Queries and Documents

61 jmlr-2013-Learning Theory Analysis for Association Rules and Sequential Event Prediction

62 jmlr-2013-Learning Theory Approach to Minimum Error Entropy Criterion

63 jmlr-2013-Learning Trees from Strings: A Strong Learning Algorithm for some Context-Free Grammars

64 jmlr-2013-Lovasz theta function, SVMs and Finding Dense Subgraphs

65 jmlr-2013-Lower Bounds and Selectivity of Weak-Consistent Policies in Stochastic Multi-Armed Bandit Problem

66 jmlr-2013-MAGIC Summoning: Towards Automatic Suggesting and Testing of Gestures With Low Probability of False Positives During Use

67 jmlr-2013-MLPACK: A Scalable C++ Machine Learning Library

68 jmlr-2013-Machine Learning with Operational Costs

69 jmlr-2013-Manifold Regularization and Semi-supervised Learning: Some Theoretical Analyses

70 jmlr-2013-Maximum Volume Clustering: A New Discriminative Clustering Approach

71 jmlr-2013-Message-Passing Algorithms for Quadratic Minimization

72 jmlr-2013-Multi-Stage Multi-Task Feature Learning

73 jmlr-2013-Multicategory Large-Margin Unified Machines

74 jmlr-2013-Multivariate Convex Regression with Adaptive Partitioning

75 jmlr-2013-Nested Expectation Propagation for Gaussian Process Classification with a Multinomial Probit Likelihood

76 jmlr-2013-Nonparametric Sparsity and Regularization

77 jmlr-2013-On the Convergence of Maximum Variance Unfolding

78 jmlr-2013-On the Learnability of Shuffle Ideals

79 jmlr-2013-On the Mutual Nearest Neighbors Estimate in Regression

80 jmlr-2013-One-shot Learning Gesture Recognition from RGB-D Data Using Bag of Features

81 jmlr-2013-Optimal Discovery with Probabilistic Expert Advice: Finite Time Analysis and Macroscopic Optimality

82 jmlr-2013-Optimally Fuzzy Temporal Memory

83 jmlr-2013-Orange: Data Mining Toolbox in Python

84 jmlr-2013-PC Algorithm for Nonparanormal Graphical Models

85 jmlr-2013-Pairwise Likelihood Ratios for Estimation of Non-Gaussian Structural Equation Models

86 jmlr-2013-Parallel Vector Field Embedding

87 jmlr-2013-Performance Bounds for λ Policy Iteration and Application to the Game of Tetris

88 jmlr-2013-Perturbative Corrections for Approximate Inference in Gaussian Latent Variable Models

89 jmlr-2013-QuantMiner for Mining Quantitative Association Rules

90 jmlr-2013-Quasi-Newton Method: A New Direction

91 jmlr-2013-Query Induction with Schema-Guided Pruning Strategies

92 jmlr-2013-Random Spanning Trees and the Prediction of Weighted Graphs

93 jmlr-2013-Random Walk Kernels and Learning Curves for Gaussian Process Regression on Random Graphs

94 jmlr-2013-Ranked Bandits in Metric Spaces: Learning Diverse Rankings over Large Document Collections

95 jmlr-2013-Ranking Forests

96 jmlr-2013-Regularization-Free Principal Curve Estimation

97 jmlr-2013-Risk Bounds of Learning Processes for Lévy Processes

98 jmlr-2013-Segregating Event Streams and Noise with a Markov Renewal Process Model

99 jmlr-2013-Semi-Supervised Learning Using Greedy Max-Cut

100 jmlr-2013-Similarity-based Clustering by Left-Stochastic Matrix Factorization

101 jmlr-2013-Sparse Activity and Sparse Connectivity in Supervised Learning

102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso

103 jmlr-2013-Sparse Robust Estimation and Kalman Smoothing with Nonsmooth Log-Concave Densities: Modeling, Computation, and Theory

104 jmlr-2013-Sparse Single-Index Model

105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression

106 jmlr-2013-Stationary-Sparse Causality Network Learning

107 jmlr-2013-Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization

108 jmlr-2013-Stochastic Variational Inference

109 jmlr-2013-Stress Functions for Nonlinear Dimension Reduction, Proximity Analysis, and Graph Drawing

110 jmlr-2013-Sub-Local Constraint-Based Learning of Bayesian Networks Using A Joint Dependence Criterion

111 jmlr-2013-Supervised Feature Selection in Graphs with Path Coding Penalties and Network Flows

112 jmlr-2013-Tapkee: An Efficient Dimension Reduction Library

113 jmlr-2013-The CAM Software for Nonnegative Blind Source Separation in R-Java

114 jmlr-2013-The Rate of Convergence of AdaBoost

115 jmlr-2013-Training Energy-Based Models for Time-Series Imputation

116 jmlr-2013-Truncated Power Method for Sparse Eigenvalue Problems

117 jmlr-2013-Universal Consistency of Localized Versions of Regularized Kernel Methods

118 jmlr-2013-Using Symmetry and Evolutionary Search to Minimize Sorting Networks

119 jmlr-2013-Variable Selection in High-Dimension with Random Designs and Orthogonal Matching Pursuit

120 jmlr-2013-Variational Algorithms for Marginal MAP

121 jmlr-2013-Variational Inference in Nonconjugate Models