jmlr jmlr2013 jmlr2013-20 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Fang Han, Tuo Zhao, Han Liu
Abstract: We propose a high dimensional classification method, named the Copula Discriminant Analysis (CODA). The CODA generalizes the normal-based linear discriminant analysis to the larger Gaussian Copula models (or the nonparanormal) as proposed by Liu et al. (2009). To simultaneously achieve estimation efficiency and robustness, the nonparametric rank-based methods including the Spearman’s rho and Kendall’s tau are exploited in estimating the covariance matrix. In high dimensional settings, we prove that the sparsity pattern of the discriminant features can be consistently recovered with the parametric rate, and the expected misclassification error is consistent to the Bayes risk. Our theory is backed up by careful numerical experiments, which show that the extra flexibility gained by the CODA method incurs little efficiency loss even when the data are truly Gaussian. These results suggest that the CODA method can be an alternative choice besides the normal-based high dimensional linear discriminant analysis. Keywords: high dimensional statistics, sparse nonlinear discriminant analysis, Gaussian copula, nonparanormal distribution, rank-based statistics
Reference: text
sentIndex sentText sentNum sentScore
1 Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, USA. Editor: Tong Zhang. Abstract: We propose a high dimensional classification method, named the Copula Discriminant Analysis (CODA). [sent-4, score-0.153]
2 The CODA generalizes the normal-based linear discriminant analysis to the larger Gaussian Copula models (or the nonparanormal) as proposed by Liu et al. [sent-5, score-0.319]
3 To simultaneously achieve estimation efficiency and robustness, the nonparametric rank-based methods including the Spearman’s rho and Kendall’s tau are exploited in estimating the covariance matrix. [sent-7, score-0.431]
4 In high dimensional settings, we prove that the sparsity pattern of the discriminant features can be consistently recovered with the parametric rate, and the expected misclassification error is consistent to the Bayes risk. [sent-8, score-0.48]
5 These results suggest that the CODA method can be an alternative choice besides the normal-based high dimensional linear discriminant analysis. [sent-10, score-0.427]
6 Keywords: high dimensional statistics, sparse nonlinear discriminant analysis, Gaussian copula, nonparanormal distribution, rank-based statistics 1. [sent-11, score-0.772]
7 Introduction. High dimensional classification is of great interest to both computer scientists and statisticians. [sent-12, score-0.106]
8 Bickel and Levina (2004) show that the classical low dimensional normal-based linear discriminant analysis (LDA) is asymptotically equivalent to random guessing when the dimension d increases fast relative to the sample size n, even if the Gaussian assumption is correct. [sent-13, score-0.398]
9 To handle this problem, a sparsity condition is commonly added, resulting in many follow-up works in recent years. [sent-14, score-0.032]
10 A variety of methods in sparse linear discriminant analysis, including the nearest shrunken centroids (Tibshirani et al. [sent-15, score-0.353]
11 , 2002; Wang and Zhu, 2007) and feature annealed independence rules (Fan and Fan, 2008), are based on a working independence assumption. [sent-16, score-0.119]
12 Recently, numerous alternative approaches have been proposed by taking more complex covariance matrix structures into consideration (Fan et al. [sent-17, score-0.11]
13 It is well known that the Bayes rule classifies a new data point x to the second class if and only if log ψ1 (x) − log ψ0 (x) + log(π1 /π0 ) > 0. [sent-28, score-0.032]
14 It is well known that for any linear discriminant rule with respect to w ∈ R^d,
g_w(X) := I((X − µ_a)^T w > 0),   (2)
the corresponding misclassification error is
C(g_w) = 1 − Φ( w^T µ_d / √(w^T Σ w) ),   (3)
where Φ(·) is the cumulative distribution function of the standard Gaussian. [sent-30, score-0.54]
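As a quick, hypothetical illustration of Equation (3) (not code from the paper), the sketch below evaluates the misclassification error of a linear rule for a given direction w, shared covariance Σ, and mean-difference vector µ_d, and compares the Bayes direction Σ^{-1} µ_d with a working-independence direction.

```python
import numpy as np
from scipy.stats import norm

def lda_misclassification_error(w, mu_d, Sigma):
    """Evaluate Equation (3): C(g_w) = 1 - Phi(w' mu_d / sqrt(w' Sigma w))."""
    w = np.asarray(w, dtype=float)
    margin = w @ mu_d / np.sqrt(w @ Sigma @ w)
    return 1.0 - norm.cdf(margin)

# Toy comparison: the Bayes direction Sigma^{-1} mu_d attains the smallest error.
Sigma = np.array([[1.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 1.0]])
mu_d = np.array([1.0, 0.5, 0.0])
w_bayes = np.linalg.solve(Sigma, mu_d)       # Bayes direction
w_indep = mu_d / np.diag(Sigma)              # working-independence direction
print(lda_misclassification_error(w_bayes, mu_d, Sigma))
print(lda_misclassification_error(w_indep, mu_d, Sigma))
```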
15 In exploring discriminant rules with a similar form as Equation (2), both Tibshirani et al. [sent-32, score-0.319]
16 (2002) and Fan and Fan (2008) assume a working independence structure for Σ. [sent-33, score-0.058]
17 Fan et al. (2010) propose the Regularized Optimal Affine Discriminant (ROAD) approach. [sent-36, score-0.037]
18 Let Σ̂ and µ̂_d be consistent estimators of Σ and µ_d. [sent-37, score-0.056]
19 The ROAD estimator minimizes w^T Σ̂ w over directions satisfying the normalization w^T µ̂_d = 1, with an ℓ1 constraint to induce sparsity. Later, Cai and Liu (2012) propose another version of the sparse LDA, which tries to make w close to the Bayes rule's linear term Σ^{-1} µ_d in the ℓ∞ norm (detailed definitions are provided in the next section), that is,
min_w ||w||_1 subject to ||Σ̂ w − µ̂_d||_∞ ≤ λ_n.   (4) [sent-39, score-0.064]
20 Equation (4) turns out to be a linear programming problem highly related to the Dantzig selector (Candes and Tao, 2007; Yuan, 2010; Cai et al. [sent-40, score-0.027]
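Equation (4) can indeed be cast as a standard linear program by splitting w into its positive and negative parts. The sketch below is only an illustrative reformulation (not the authors' implementation); Sigma_hat, mu_d_hat and lam stand in for the estimates and the tuning parameter λ_n.

```python
import numpy as np
from scipy.optimize import linprog

def sparse_lda_lp(Sigma_hat, mu_d_hat, lam):
    """Solve min ||w||_1 subject to ||Sigma_hat @ w - mu_d_hat||_inf <= lam (Equation (4)).

    Write w = u - v with u, v >= 0, so ||w||_1 = sum(u) + sum(v) at the optimum,
    and the sup-norm constraint becomes two blocks of linear inequalities.
    """
    d = len(mu_d_hat)
    c = np.ones(2 * d)                               # objective: sum(u) + sum(v)
    A_ub = np.block([[Sigma_hat, -Sigma_hat],        #  Sigma (u - v) - mu_d <= lam
                     [-Sigma_hat, Sigma_hat]])       # -Sigma (u - v) + mu_d <= lam
    b_ub = np.concatenate([lam + mu_d_hat, lam - mu_d_hat])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    u, v = res.x[:d], res.x[d:]
    return u - v
```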
21 (2012) propose another version of the sparse linear discriminant analysis based on an equivalent least square formulation of the LDA. [sent-43, score-0.356]
22 In brief, to avoid the "curse of dimensionality", an ℓ1 penalty is added in all three methods to encourage a sparsity pattern of w, and hence nice theoretical properties can be obtained under certain regularity conditions. [sent-45, score-0.063]
23 There are three issues with regard to high dimensional linear discriminant analysis: (1) How to estimate Σ and Σ−1 accurately and efficiently (Rothman et al. [sent-47, score-0.398]
24 , 2010); (2) How to incorporate the covariance estimator into classification (Fan et al. [sent-51, score-0.151]
25 In this paper, we propose a high dimensional classification method, named the Copula Discriminant Analysis (CODA), which addresses all the above three questions. [sent-55, score-0.153]
26 To handle non-Gaussian data, we extend the underlying conditional distributions of (X|Y = 0) and (X|Y = 1) from Gaussian to the larger nonparanormal family (Liu et al. [sent-56, score-0.374]
27 A random vector X = (X_1, ..., X_d)^T belongs to a nonparanormal family if and only if there exists a set of univariate strictly increasing functions {f_j}_{j=1}^d such that (f_1(X_1), ..., f_d(X_d))^T follows a multivariate Gaussian distribution. [sent-61, score-0.409]
28 To estimate Σ and Σ^{-1} robustly and efficiently, instead of estimating the transformation functions {f_j}_{j=1}^d as Liu et al. [sent-65, score-0.122]
29 (2009) did, we exploit the nonparametric rank-based correlation coefficient estimators, including Spearman's rho and Kendall's tau, which are invariant to the strictly increasing functions f_j. [sent-66, score-0.303]
30 They have been shown to enjoy the optimal parametric rate in estimating the correlation matrix (Liu et al. [sent-67, score-0.18]
31 To incorporate the estimated covariance matrix into high dimensional classification, we show that the ROAD (Fan et al. [sent-70, score-0.224]
32 Unlike the parametric cases, one new challenge for the CODA is that the rank-based covariance matrix estimator may not be positive semidefinite, which makes the objective function nonconvex. [sent-73, score-0.197]
33 To solve this problem, we first project the estimated covariance matrix onto the cone of positive semidefinite matrices (using the elementwise sup-norm). [sent-74, score-0.165]
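The paper's projection is taken with respect to the elementwise sup-norm; as a rough, illustrative stand-in, the sketch below uses the simpler Frobenius-norm projection onto the PSD cone obtained by clipping negative eigenvalues.

```python
import numpy as np

def project_to_psd(S, eps=1e-8):
    """Nearest (in Frobenius norm) positive semidefinite matrix to the symmetric matrix S."""
    S = (S + S.T) / 2.0                          # symmetrize against numerical noise
    eigval, eigvec = np.linalg.eigh(S)
    eigval = np.clip(eigval, eps, None)          # clip negative eigenvalues
    return (eigvec * eigval) @ eigvec.T
```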
34 Finally, to show that the expected misclassification error is consistent to the Bayes risk, we quantify the difference between the CODA classifier g_npn and the Bayes rule g*. [sent-76, score-0.12]
35 To this end, we measure the convergence rate of the estimated transformation functions {f̂_j}_{j=1}^d to the true transformation functions {f_j}_{j=1}^d. [sent-77, score-0.122]
36 Under certain regularity conditions, we show that sup |f̂_j − f_j| = O_P(n^{−γ/2}), ∀ j ∈ {1, 2, ..., d}. [sent-78, score-0.031]
37 In the next section, we briefly review the nonparanormal estimators (Liu et al. [sent-89, score-0.43]
38 We give a theoretical analysis of the CODA estimator in Section 4, with more detailed proofs collected in the appendix. [sent-92, score-0.037]
39 Let v = (v_1, ..., v_d)^T, and let M_{−j,−k} be the matrix with M's j-th row and k-th column removed, M_{j,−k} be M's j-th row with the k-th column removed, and M_{−j,k} be M's k-th column with the j-th row removed. [sent-107, score-0.099]
40 Moreover, v's subvector with entries indexed by I is denoted by v_I, M's submatrix with rows indexed by I and columns indexed by J is denoted by M_{IJ}, M's submatrix with all rows and the columns indexed by J is denoted by M_{·J}, and M's submatrix with both rows and columns indexed by J is denoted by M_J. [sent-108, score-0.788]
41 For 0 < q < ∞, we define ||v||_0 := card(support(v)), ||v||_q := (∑_{i=1}^d |v_i|^q)^{1/q}, and ||v||_∞ := max_{1≤i≤d} |v_i|. [sent-109, score-0.029]
42 We define the matrix ℓ_max norm as the elementwise maximum value, ||M||_max := max_{i,j} |M_{ij}|, and the ℓ_∞ norm as ||M||_∞ := max_i ∑_j |M_{ij}|. [sent-110, score-0.169]
43 We define the matrix operator norm as ||M||op := λmax (M). [sent-112, score-0.058]
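As a small illustrative sketch (not from the paper), these vector and matrix norms can be computed directly:

```python
import numpy as np

def vector_norms(v, q=2):
    v = np.asarray(v, dtype=float)
    l0 = np.count_nonzero(v)                     # ||v||_0 = card(support(v))
    lq = np.sum(np.abs(v) ** q) ** (1.0 / q)     # ||v||_q
    linf = np.max(np.abs(v))                     # ||v||_inf
    return l0, lq, linf

def matrix_norms(M):
    M = np.asarray(M, dtype=float)
    l_max = np.max(np.abs(M))                    # ||M||_max, elementwise maximum
    l_inf = np.max(np.sum(np.abs(M), axis=1))    # ||M||_inf, maximum absolute row sum
    l_op = np.max(np.linalg.eigvalsh((M + M.T) / 2.0))  # ||M||_op = lambda_max (symmetric M)
    return l_max, l_inf, l_op
```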
44 A random vector X = (X_1, ..., X_d)^T is said to follow a nonparanormal distribution if and only if there exists a set of univariate strictly increasing transformations f = {f_j}_{j=1}^d such that f(X) = (f_1(X_1), ..., f_d(X_d))^T =: Z is multivariate Gaussian. [sent-117, score-0.409]
45 Here µ = (µ_1, ..., µ_d)^T, Σ = [Σ_{jk}], Ω = Σ^{-1}, and Σ^0 = [Σ^0_{jk}] are the mean, covariance, concentration and correlation matrices of the Gaussian distribution Z. [sent-123, score-0.179]
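To make the definition concrete, the sketch below draws nonparanormal data by pushing a latent Gaussian vector through strictly increasing maps; the particular transformations are made-up examples, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 3
Sigma0 = np.array([[1.0, 0.5, 0.2],
                   [0.5, 1.0, 0.4],
                   [0.2, 0.4, 1.0]])

# Strictly increasing maps g_j = f_j^{-1}; any monotone choice yields a nonparanormal.
transforms = [np.exp, np.cbrt, lambda z: z + 0.3 * z ** 3]

Z = rng.multivariate_normal(np.zeros(d), Sigma0, size=500)     # latent Gaussian
X = np.column_stack([g(Z[:, j]) for j, g in enumerate(transforms)])
# X is nonparanormal: applying f_j = g_j^{-1} coordinate-wise recovers the Gaussian Z.
```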
46 (2009) prove that the nonparanormal is highly related to the Gaussian Copula (Clemen and Reilly, 1999; Klaassen and Wellner, 1997). [sent-129, score-0.374]
47 (2009) suggest a normal-score based correlation coefficient matrix to estimate Σ^0. [sent-132, score-0.132]
48 Define F̂_j(t; x_1, ..., x_n) := T_{δ_n}(F_j(t; x_1, ..., x_n))   (5) to be the winsorized empirical cumulative distribution function of X_j, where T_{δ_n} truncates its argument to the interval [δ_n, 1 − δ_n]. [sent-143, score-0.085]
49 In particular, the empirical cumulative distribution function is F_j(t; x_1, ..., x_n) := (1/n) ∑_{i=1}^n I(x_{ij} ≤ t). [sent-145, score-0.054]
50 Let Φ^{-1}(·) be the quantile function of the standard Gaussian. We define f̂_j(t) = Φ^{-1}(F̂_j(t)), and the corresponding sample correlation estimator R^{ns} = [R^{ns}_{jk}] to be
R^{ns}_{jk} := ( (1/n) ∑_{i=1}^n f̂_j(x_{ij}) f̂_k(x_{ik}) ) / √( (1/n) ∑_{i=1}^n f̂_j^2(x_{ij}) · (1/n) ∑_{i=1}^n f̂_k^2(x_{ik}) ). Liu et al. [sent-152, score-0.323]
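A direct implementation of the winsorized normal-score estimator R^{ns} might look as follows; this is only an illustrative sketch, and the winsorization level delta_n is left as a user-supplied parameter rather than the specific choice of Liu et al. (2009).

```python
import numpy as np
from scipy.stats import norm, rankdata

def normal_score_correlation(X, delta_n):
    """Winsorized normal-score correlation estimator R^ns (cf. Equation (5) and the display above).

    X is an (n, d) data matrix; delta_n in (0, 1/2) is the winsorization level.
    """
    n, _ = X.shape
    F_hat = rankdata(X, axis=0) / n                    # empirical CDF evaluated at the data
    F_win = np.clip(F_hat, delta_n, 1.0 - delta_n)     # winsorize into [delta_n, 1 - delta_n]
    f_hat = norm.ppf(F_win)                            # normal scores Phi^{-1}(F)
    second_moments = (f_hat.T @ f_hat) / n             # uncentered, as in the display above
    scale = np.sqrt(np.diag(second_moments))
    return second_moments / np.outer(scale, scale)
```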
51 (2009) suggest a specific choice of the truncation level δ_n and establish a rate of convergence for ||R^{ns} − Σ^0||_max under it. [sent-153, score-0.029]
52 (2012) propose a different approach for estimating the correlations, called the Nonparanormal SKEPTIC. [sent-156, score-0.064]
53 The Nonparanormal SKEPTIC exploits the Spearman’s rho and Kendall’s tau to directly estimate the unknown correlation matrix. [sent-157, score-0.397]
54 Specifically, let r_{ij} be the rank of x_{ij} among x_{1j}, ..., x_{nj}, and let r̄_j := (1/n) ∑_{i=1}^n r_{ij}. [sent-158, score-0.091]
55 We consider the following two statistics:
(Spearman's rho)   ρ̂_{jk} = ∑_{i=1}^n (r_{ij} − r̄_j)(r_{ik} − r̄_k) / √( ∑_{i=1}^n (r_{ij} − r̄_j)^2 · ∑_{i=1}^n (r_{ik} − r̄_k)^2 ),
(Kendall's tau)    τ̂_{jk} = 2 / (n(n − 1)) ∑_{i<i′} sign( (x_{ij} − x_{i′j})(x_{ik} − x_{i′k}) ). [sent-162, score-0.243]
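Both rank statistics are available in scipy, and the standard sine transforms sin(πτ/2) and 2 sin(πρ/6) map them to the latent Gaussian correlation under the nonparanormal; the sketch below (illustrative, not the paper's code) assembles the resulting correlation matrix estimate, which may still require a positive semidefinite projection as discussed earlier.

```python
import numpy as np
from scipy.stats import kendalltau, rankdata

def rank_correlation(X, method="kendall"):
    """Rank-based estimate of the latent correlation matrix Sigma^0.

    Uses the transforms sin(pi/2 * tau) for Kendall's tau and 2*sin(pi/6 * rho)
    for Spearman's rho; the result is symmetric but not necessarily PSD.
    """
    _, d = X.shape
    if method == "spearman":
        ranks = rankdata(X, axis=0)
        rho = np.corrcoef(ranks, rowvar=False)          # Spearman's rho = Pearson corr of ranks
        R = 2.0 * np.sin(np.pi / 6.0 * rho)
    else:
        R = np.eye(d)
        for j in range(d):
            for k in range(j + 1, d):
                tau, _ = kendalltau(X[:, j], X[:, k])
                R[j, k] = R[k, j] = np.sin(np.pi / 2.0 * tau)
    np.fill_diagonal(R, 1.0)
    return R
```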
wordName wordTfidf (topN-words)
[('coda', 0.412), ('nonparanormal', 0.374), ('discriminant', 0.319), ('copula', 0.236), ('liu', 0.182), ('rho', 0.175), ('rns', 0.175), ('fj', 0.15), ('tau', 0.15), ('fan', 0.135), ('gw', 0.135), ('mai', 0.131), ('kendall', 0.124), ('spearman', 0.124), ('xik', 0.112), ('han', 0.11), ('jk', 0.107), ('xd', 0.088), ('gnpn', 0.088), ('iscriminant', 0.088), ('jeon', 0.088), ('opula', 0.088), ('shao', 0.088), ('tuo', 0.088), ('wt', 0.084), ('cai', 0.081), ('indexed', 0.08), ('covariance', 0.079), ('dimensional', 0.079), ('road', 0.078), ('correlation', 0.072), ('vd', 0.068), ('hao', 0.067), ('rd', 0.063), ('ri', 0.062), ('baltimore', 0.062), ('transformation', 0.061), ('tibshirani', 0.061), ('fd', 0.058), ('johns', 0.058), ('misclassi', 0.057), ('submatrix', 0.057), ('bayes', 0.057), ('estimators', 0.056), ('elementwise', 0.055), ('op', 0.055), ('fang', 0.055), ('cumulative', 0.054), ('hopkins', 0.052), ('nalysis', 0.052), ('parametric', 0.05), ('princeton', 0.048), ('iu', 0.047), ('lda', 0.047), ('md', 0.044), ('classi', 0.039), ('zhao', 0.038), ('levina', 0.037), ('npn', 0.037), ('reilly', 0.037), ('rik', 0.037), ('rothman', 0.037), ('wellner', 0.037), ('xid', 0.037), ('estimator', 0.037), ('named', 0.037), ('propose', 0.037), ('incorporate', 0.035), ('univariate', 0.035), ('rows', 0.034), ('card', 0.034), ('centroids', 0.034), ('robustly', 0.034), ('witten', 0.034), ('sparsity', 0.032), ('semide', 0.032), ('rule', 0.032), ('xn', 0.031), ('gaussian', 0.031), ('regularity', 0.031), ('annealed', 0.031), ('mj', 0.031), ('scheinberg', 0.031), ('subvector', 0.031), ('matrix', 0.031), ('vi', 0.03), ('independence', 0.03), ('mi', 0.03), ('max', 0.029), ('estimations', 0.029), ('xi', 0.029), ('suggest', 0.029), ('var', 0.029), ('working', 0.028), ('removed', 0.028), ('columns', 0.028), ('estimating', 0.027), ('scientists', 0.027), ('selector', 0.027), ('norm', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 20 jmlr-2013-CODA: High Dimensional Copula Discriminant Analysis
Author: Fang Han, Tuo Zhao, Han Liu
Abstract: We propose a high dimensional classification method, named the Copula Discriminant Analysis (CODA). The CODA generalizes the normal-based linear discriminant analysis to the larger Gaussian Copula models (or the nonparanormal) as proposed by Liu et al. (2009). To simultaneously achieve estimation efficiency and robustness, the nonparametric rank-based methods including the Spearman’s rho and Kendall’s tau are exploited in estimating the covariance matrix. In high dimensional settings, we prove that the sparsity pattern of the discriminant features can be consistently recovered with the parametric rate, and the expected misclassification error is consistent to the Bayes risk. Our theory is backed up by careful numerical experiments, which show that the extra flexibility gained by the CODA method incurs little efficiency loss even when the data are truly Gaussian. These results suggest that the CODA method can be an alternative choice besides the normal-based high dimensional linear discriminant analysis. Keywords: high dimensional statistics, sparse nonlinear discriminant analysis, Gaussian copula, nonparanormal distribution, rank-based statistics
2 0.19606315 84 jmlr-2013-PC Algorithm for Nonparanormal Graphical Models
Author: Naftali Harris, Mathias Drton
Abstract: The PC algorithm uses conditional independence tests for model selection in graphical modeling with acyclic directed graphs. In Gaussian models, tests of conditional independence are typically based on Pearson correlations, and high-dimensional consistency results have been obtained for the PC algorithm in this setting. Analyzing the error propagation from marginal to partial correlations, we prove that high-dimensional consistency carries over to a broader class of Gaussian copula or nonparanormal models when using rank-based measures of correlation. For graph sequences with bounded degree, our consistency result is as strong as prior Gaussian results. In simulations, the ‘Rank PC’ algorithm works as well as the ‘Pearson PC’ algorithm for normal data and considerably better for non-normal data, all the while incurring a negligible increase of computation time. While our interest is in the PC algorithm, the presented analysis of error propagation could be applied to other algorithms that test the vanishing of low-order partial correlations. Keywords: Gaussian copula, graphical model, model selection, multivariate normal distribution, nonparanormal distribution
3 0.079401754 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso
Author: Tingni Sun, Cun-Hui Zhang
Abstract: We propose a new method of learning a sparse nonnegative-definite target matrix. Our primary example of the target matrix is the inverse of a population covariance or correlation matrix. The algorithm first estimates each column of the target matrix by the scaled Lasso and then adjusts the matrix estimator to be symmetric. The penalty level of the scaled Lasso for each column is completely determined by data via convex minimization, without using cross-validation. We prove that this scaled Lasso method guarantees the fastest proven rate of convergence in the spectrum norm under conditions of weaker form than those in the existing analyses of other ℓ1 regularized algorithms, and has faster guaranteed rate of convergence when the ratio of the ℓ1 and spectrum norms of the target inverse matrix diverges to infinity. A simulation study demonstrates the computational feasibility and superb performance of the proposed method. Our analysis also provides new performance bounds for the Lasso and scaled Lasso to guarantee higher concentration of the error at a smaller threshold level than previous analyses, and to allow the use of the union bound in column-by-column applications of the scaled Lasso without an adjustment of the penalty level. In addition, the least squares estimation after the scaled Lasso selection is considered and proven to guarantee performance bounds similar to that of the scaled Lasso. Keywords: precision matrix, concentration matrix, inverse matrix, graphical model, scaled Lasso, linear regression, spectrum norm
4 0.063187093 36 jmlr-2013-Distributions of Angles in Random Packing on Spheres
Author: Tony Cai, Jianqing Fan, Tiefeng Jiang
Abstract: This paper studies the asymptotic behaviors of the pairwise angles among n randomly and uniformly distributed unit vectors in R p as the number of points n → ∞, while the dimension p is either fixed or growing with n. For both settings, we derive the limiting empirical distribution of the random angles and the limiting distributions of the extreme angles. The results reveal interesting differences in the two settings and provide a precise characterization of the folklore that “all high-dimensional random vectors are almost always nearly orthogonal to each other”. Applications to statistics and machine learning and connections with some open problems in physics and mathematics are also discussed. Keywords: random angle, uniform distribution on sphere, empirical law, maximum of random variables, minimum of random variables, extreme-value distribution, packing on sphere
5 0.061193813 32 jmlr-2013-Differential Privacy for Functions and Functional Data
Author: Rob Hall, Alessandro Rinaldo, Larry Wasserman
Abstract: Differential privacy is a rigorous cryptographically-motivated characterization of data privacy which may be applied when releasing summaries of a database. Previous work has focused mainly on methods for which the output is a finite dimensional vector, or an element of some discrete set. We develop methods for releasing functions while preserving differential privacy. Specifically, we show that adding an appropriate Gaussian process to the function of interest yields differential privacy. When the functions lie in the reproducing kernel Hilbert space (RKHS) generated by the covariance kernel of the Gaussian process, then the correct noise level is established by measuring the “sensitivity” of the function in the RKHS norm. As examples we consider kernel density estimation, kernel support vector machines, and functions in RKHSs. Keywords: differential privacy, density estimation, Gaussian processes, reproducing kernel Hilbert space
6 0.046304107 47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference
7 0.045848977 101 jmlr-2013-Sparse Activity and Sparse Connectivity in Supervised Learning
8 0.041307107 22 jmlr-2013-Classifying With Confidence From Incomplete Information
9 0.040986307 90 jmlr-2013-Quasi-Newton Method: A New Direction
10 0.037571859 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
11 0.035933349 73 jmlr-2013-Multicategory Large-Margin Unified Machines
12 0.033934847 119 jmlr-2013-Variable Selection in High-Dimension with Random Designs and Orthogonal Matching Pursuit
13 0.033579431 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning
14 0.03329904 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems
15 0.032035973 95 jmlr-2013-Ranking Forests
16 0.029853133 35 jmlr-2013-Distribution-Dependent Sample Complexity of Large Margin Learning
17 0.02831633 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression
18 0.028173886 31 jmlr-2013-Derivative Estimation with Local Polynomial Fitting
19 0.028063724 6 jmlr-2013-A Plug-in Approach to Neyman-Pearson Classification
20 0.028010409 116 jmlr-2013-Truncated Power Method for Sparse Eigenvalue Problems
topicId topicWeight
[(0, -0.166), (1, 0.02), (2, 0.046), (3, 0.043), (4, 0.133), (5, -0.017), (6, 0.029), (7, 0.118), (8, 0.121), (9, 0.079), (10, -0.067), (11, 0.002), (12, -0.007), (13, -0.191), (14, -0.161), (15, -0.048), (16, 0.276), (17, -0.124), (18, -0.132), (19, -0.173), (20, 0.038), (21, 0.063), (22, 0.158), (23, -0.024), (24, 0.142), (25, -0.047), (26, -0.206), (27, -0.108), (28, -0.169), (29, 0.106), (30, 0.106), (31, -0.105), (32, -0.084), (33, -0.094), (34, -0.055), (35, 0.19), (36, -0.114), (37, 0.011), (38, 0.109), (39, -0.133), (40, -0.103), (41, 0.036), (42, 0.046), (43, -0.148), (44, -0.036), (45, -0.213), (46, 0.026), (47, 0.015), (48, 0.003), (49, -0.067)]
simIndex simValue paperId paperTitle
same-paper 1 0.94492054 20 jmlr-2013-CODA: High Dimensional Copula Discriminant Analysis
Author: Fang Han, Tuo Zhao, Han Liu
Abstract: We propose a high dimensional classification method, named the Copula Discriminant Analysis (CODA). The CODA generalizes the normal-based linear discriminant analysis to the larger Gaussian Copula models (or the nonparanormal) as proposed by Liu et al. (2009). To simultaneously achieve estimation efficiency and robustness, the nonparametric rank-based methods including the Spearman’s rho and Kendall’s tau are exploited in estimating the covariance matrix. In high dimensional settings, we prove that the sparsity pattern of the discriminant features can be consistently recovered with the parametric rate, and the expected misclassification error is consistent to the Bayes risk. Our theory is backed up by careful numerical experiments, which show that the extra flexibility gained by the CODA method incurs little efficiency loss even when the data are truly Gaussian. These results suggest that the CODA method can be an alternative choice besides the normal-based high dimensional linear discriminant analysis. Keywords: high dimensional statistics, sparse nonlinear discriminant analysis, Gaussian copula, nonparanormal distribution, rank-based statistics
2 0.60594171 84 jmlr-2013-PC Algorithm for Nonparanormal Graphical Models
Author: Naftali Harris, Mathias Drton
Abstract: The PC algorithm uses conditional independence tests for model selection in graphical modeling with acyclic directed graphs. In Gaussian models, tests of conditional independence are typically based on Pearson correlations, and high-dimensional consistency results have been obtained for the PC algorithm in this setting. Analyzing the error propagation from marginal to partial correlations, we prove that high-dimensional consistency carries over to a broader class of Gaussian copula or nonparanormal models when using rank-based measures of correlation. For graph sequences with bounded degree, our consistency result is as strong as prior Gaussian results. In simulations, the ‘Rank PC’ algorithm works as well as the ‘Pearson PC’ algorithm for normal data and considerably better for non-normal data, all the while incurring a negligible increase of computation time. While our interest is in the PC algorithm, the presented analysis of error propagation could be applied to other algorithms that test the vanishing of low-order partial correlations. Keywords: Gaussian copula, graphical model, model selection, multivariate normal distribution, nonparanormal distribution
3 0.33546937 36 jmlr-2013-Distributions of Angles in Random Packing on Spheres
Author: Tony Cai, Jianqing Fan, Tiefeng Jiang
Abstract: This paper studies the asymptotic behaviors of the pairwise angles among n randomly and uniformly distributed unit vectors in R p as the number of points n → ∞, while the dimension p is either fixed or growing with n. For both settings, we derive the limiting empirical distribution of the random angles and the limiting distributions of the extreme angles. The results reveal interesting differences in the two settings and provide a precise characterization of the folklore that “all high-dimensional random vectors are almost always nearly orthogonal to each other”. Applications to statistics and machine learning and connections with some open problems in physics and mathematics are also discussed. Keywords: random angle, uniform distribution on sphere, empirical law, maximum of random variables, minimum of random variables, extreme-value distribution, packing on sphere
4 0.27585796 73 jmlr-2013-Multicategory Large-Margin Unified Machines
Author: Chong Zhang, Yufeng Liu
Abstract: Hard and soft classifiers are two important groups of techniques for classification problems. Logistic regression and Support Vector Machines are typical examples of soft and hard classifiers respectively. The essential difference between these two groups is whether one needs to estimate the class conditional probability for the classification task or not. In particular, soft classifiers predict the label based on the obtained class conditional probabilities, while hard classifiers bypass the estimation of probabilities and focus on the decision boundary. In practice, for the goal of accurate classification, it is unclear which one to use in a given situation. To tackle this problem, the Largemargin Unified Machine (LUM) was recently proposed as a unified family to embrace both groups. The LUM family enables one to study the behavior change from soft to hard binary classifiers. For multicategory cases, however, the concept of soft and hard classification becomes less clear. In that case, class probability estimation becomes more involved as it requires estimation of a probability vector. In this paper, we propose a new Multicategory LUM (MLUM) framework to investigate the behavior of soft versus hard classification under multicategory settings. Our theoretical and numerical results help to shed some light on the nature of multicategory classification and its transition behavior from soft to hard classifiers. The numerical results suggest that the proposed tuned MLUM yields very competitive performance. Keywords: hard classification, large-margin, soft classification, support vector machine
5 0.26795682 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso
Author: Tingni Sun, Cun-Hui Zhang
Abstract: We propose a new method of learning a sparse nonnegative-definite target matrix. Our primary example of the target matrix is the inverse of a population covariance or correlation matrix. The algorithm first estimates each column of the target matrix by the scaled Lasso and then adjusts the matrix estimator to be symmetric. The penalty level of the scaled Lasso for each column is completely determined by data via convex minimization, without using cross-validation. We prove that this scaled Lasso method guarantees the fastest proven rate of convergence in the spectrum norm under conditions of weaker form than those in the existing analyses of other ℓ1 regularized algorithms, and has faster guaranteed rate of convergence when the ratio of the ℓ1 and spectrum norms of the target inverse matrix diverges to infinity. A simulation study demonstrates the computational feasibility and superb performance of the proposed method. Our analysis also provides new performance bounds for the Lasso and scaled Lasso to guarantee higher concentration of the error at a smaller threshold level than previous analyses, and to allow the use of the union bound in column-by-column applications of the scaled Lasso without an adjustment of the penalty level. In addition, the least squares estimation after the scaled Lasso selection is considered and proven to guarantee performance bounds similar to that of the scaled Lasso. Keywords: precision matrix, concentration matrix, inverse matrix, graphical model, scaled Lasso, linear regression, spectrum norm
6 0.25176013 22 jmlr-2013-Classifying With Confidence From Incomplete Information
7 0.23030518 101 jmlr-2013-Sparse Activity and Sparse Connectivity in Supervised Learning
8 0.21461174 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
9 0.20877922 90 jmlr-2013-Quasi-Newton Method: A New Direction
10 0.18082906 47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference
11 0.17997003 32 jmlr-2013-Differential Privacy for Functions and Functional Data
12 0.17097168 113 jmlr-2013-The CAM Software for Nonnegative Blind Source Separation in R-Java
13 0.15600753 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems
14 0.15062556 88 jmlr-2013-Perturbative Corrections for Approximate Inference in Gaussian Latent Variable Models
15 0.14242151 51 jmlr-2013-Greedy Sparsity-Constrained Optimization
16 0.14046714 74 jmlr-2013-Multivariate Convex Regression with Adaptive Partitioning
17 0.14015126 60 jmlr-2013-Learning Bilinear Model for Matching Queries and Documents
18 0.13887799 35 jmlr-2013-Distribution-Dependent Sample Complexity of Large Margin Learning
19 0.13792148 5 jmlr-2013-A Near-Optimal Algorithm for Differentially-Private Principal Components
20 0.13203348 34 jmlr-2013-Distance Preserving Embeddings for General n-Dimensional Manifolds
topicId topicWeight
[(0, 0.018), (5, 0.18), (6, 0.023), (10, 0.046), (20, 0.019), (23, 0.025), (30, 0.412), (68, 0.017), (70, 0.026), (75, 0.061), (85, 0.037), (87, 0.031)]
simIndex simValue paperId paperTitle
same-paper 1 0.69599473 20 jmlr-2013-CODA: High Dimensional Copula Discriminant Analysis
Author: Fang Han, Tuo Zhao, Han Liu
Abstract: We propose a high dimensional classification method, named the Copula Discriminant Analysis (CODA). The CODA generalizes the normal-based linear discriminant analysis to the larger Gaussian Copula models (or the nonparanormal) as proposed by Liu et al. (2009). To simultaneously achieve estimation efficiency and robustness, the nonparametric rank-based methods including the Spearman’s rho and Kendall’s tau are exploited in estimating the covariance matrix. In high dimensional settings, we prove that the sparsity pattern of the discriminant features can be consistently recovered with the parametric rate, and the expected misclassification error is consistent to the Bayes risk. Our theory is backed up by careful numerical experiments, which show that the extra flexibility gained by the CODA method incurs little efficiency loss even when the data are truly Gaussian. These results suggest that the CODA method can be an alternative choice besides the normal-based high dimensional linear discriminant analysis. Keywords: high dimensional statistics, sparse nonlinear discriminant analysis, Gaussian copula, nonparanormal distribution, rank-based statistics
2 0.43323275 15 jmlr-2013-Bayesian Canonical Correlation Analysis
Author: Arto Klami, Seppo Virtanen, Samuel Kaski
Abstract: Canonical correlation analysis (CCA) is a classical method for seeking correlations between two multivariate data sets. During the last ten years, it has received more and more attention in the machine learning community in the form of novel computational formulations and a plethora of applications. We review recent developments in Bayesian models and inference methods for CCA which are attractive for their potential in hierarchical extensions and for coping with the combination of large dimensionalities and small sample sizes. The existing methods have not been particularly successful in fulfilling the promise yet; we introduce a novel efficient solution that imposes group-wise sparsity to estimate the posterior of an extended model which not only extracts the statistical dependencies (correlations) between data sets but also decomposes the data into shared and data set-specific components. In statistics literature the model is known as inter-battery factor analysis (IBFA), for which we now provide a Bayesian treatment. Keywords: Bayesian modeling, canonical correlation analysis, group-wise sparsity, inter-battery factor analysis, variational Bayesian approximation
3 0.4331775 114 jmlr-2013-The Rate of Convergence of AdaBoost
Author: Indraneel Mukherjee, Cynthia Rudin, Robert E. Schapire
Abstract: The AdaBoost algorithm was designed to combine many “weak” hypotheses that perform slightly better than random guessing into a “strong” hypothesis that has very low error. We study the rate at which AdaBoost iteratively converges to the minimum of the “exponential loss.” Unlike previous work, our proofs do not require a weak-learning assumption, nor do they require that minimizers of the exponential loss are finite. Our first result shows that the exponential loss of AdaBoost’s computed parameter vector will be at most ε more than that of any parameter vector of ℓ1 -norm bounded by B in a number of rounds that is at most a polynomial in B and 1/ε. We also provide lower bounds showing that a polynomial dependence is necessary. Our second result is that within C/ε iterations, AdaBoost achieves a value of the exponential loss that is at most ε more than the best possible value, where C depends on the data set. We show that this dependence of the rate on ε is optimal up to constant factors, that is, at least Ω(1/ε) rounds are necessary to achieve within ε of the optimal exponential loss. Keywords: AdaBoost, optimization, coordinate descent, convergence rate
4 0.43296573 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization
Author: Yuchen Zhang, John C. Duchi, Martin J. Wainwright
Abstract: We analyze two communication-efficient algorithms for distributed optimization in statistical settings involving large-scale data sets. The first algorithm is a standard averaging method that distributes the N data samples evenly to m machines, performs separate minimization on each subset, and then averages the estimates. We provide a sharp analysis of this average mixture algorithm, showing that under a reasonable set of conditions, the combined parameter achieves √ mean-squared error (MSE) that decays as O (N −1 + (N/m)−2 ). Whenever m ≤ N, this guarantee matches the best possible rate achievable by a centralized algorithm having access to all N samples. The second algorithm is a novel method, based on an appropriate form of bootstrap subsampling. Requiring only a single round of communication, it has mean-squared error that decays as O (N −1 + (N/m)−3 ), and so is more robust to the amount of parallelization. In addition, we show that a stochastic gradient-based method attains mean-squared error decaying as O (N −1 + (N/m)−3/2 ), easing computation at the expense of a potentially slower MSE rate. We also provide an experimental evaluation of our methods, investigating their performance both on simulated data and on a large-scale regression problem from the internet search domain. In particular, we show that our methods can be used to efficiently solve an advertisement prediction problem from the Chinese SoSo Search Engine, which involves logistic regression with N ≈ 2.4 × 108 samples and d ≈ 740,000 covariates. Keywords: distributed learning, stochastic optimization, averaging, subsampling
5 0.43103731 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
Author: Tony Cai, Wen-Xin Zhou
Abstract: We consider in this paper the problem of noisy 1-bit matrix completion under a general non-uniform sampling distribution using the max-norm as a convex relaxation for the rank. A max-norm constrained maximum likelihood estimate is introduced and studied. The rate of convergence for the estimate is obtained. Information-theoretical methods are used to establish a minimax lower bound under the general sampling model. The minimax upper and lower bounds together yield the optimal rate of convergence for the Frobenius norm loss. Computational algorithms and numerical performance are also discussed. Keywords: 1-bit matrix completion, low-rank matrix, max-norm, trace-norm, constrained optimization, maximum likelihood estimate, optimal rate of convergence
6 0.42753601 73 jmlr-2013-Multicategory Large-Margin Unified Machines
7 0.42619675 10 jmlr-2013-Algorithms and Hardness Results for Parallel Large Margin Learning
8 0.42611304 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems
9 0.42401114 84 jmlr-2013-PC Algorithm for Nonparanormal Graphical Models
10 0.42322218 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression
11 0.42204034 9 jmlr-2013-A Widely Applicable Bayesian Information Criterion
12 0.42147228 87 jmlr-2013-Performance Bounds for λ Policy Iteration and Application to the Game of Tetris
13 0.42033654 17 jmlr-2013-Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees
14 0.41997072 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso
15 0.4199267 39 jmlr-2013-Efficient Active Learning of Halfspaces: An Aggressive Approach
17 0.41889557 8 jmlr-2013-A Theory of Multiclass Boosting
18 0.41768619 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation
19 0.41678655 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning
20 0.41658583 68 jmlr-2013-Machine Learning with Operational Costs