jmlr jmlr2013 jmlr2013-7 knowledge-graph by maker-knowledge-mining

7 jmlr-2013-A Risk Comparison of Ordinary Least Squares vs Ridge Regression


Source: pdf

Author: Paramveer S. Dhillon, Dean P. Foster, Sham M. Kakade, Lyle H. Ungar

Abstract: We compare the risk of ridge regression to a simple variant of ordinary least squares, in which one simply projects the data onto a finite dimensional subspace (as specified by a principal component analysis) and then performs an ordinary (un-regularized) least squares regression in this subspace. This note shows that the risk of this ordinary least squares method (PCA-OLS) is within a constant factor (namely 4) of the risk of ridge regression (RR). Keywords: risk inflation, ridge regression, pca

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Journal of Machine Learning Research 14 (2013) 1505-1511; Submitted 5/12; Revised 3/13; Published 6/13. A Risk Comparison of Ordinary Least Squares vs Ridge Regression. Paramveer S. Dhillon, Dean P. Foster, Sham M. Kakade, Lyle H. Ungar. [sent-1, score-0.062]

2 This note shows that the risk of this ordinary least squares method (PCA-OLS) is within a constant factor (namely 4) of the risk of ridge regression (RR). [sent-13, score-1.725]

3 Suppose that: $Y = X\beta + \varepsilon$, where $\varepsilon$ is independent noise in each coordinate, with the variance of $\varepsilon_i$ being $\sigma^2$. [sent-17, score-0.026]

4 The expected loss of a vector $\beta$ is: $L(\beta) = \frac{1}{n} E_Y[\|Y - X\beta\|^2]$. Let $\hat{\beta}$ be an estimator of $\beta$ (constructed with a sample Y). [sent-19, score-0.18]

5 We have that the risk (i.e., expected excess loss) is: [sent-25, score-0.5]

6 $\mathrm{Risk}(\hat{\beta}) := E_{\hat{\beta}}[L(\hat{\beta}) - L(\beta)] = E_{\hat{\beta}}\,\|\hat{\beta} - \beta\|_\Sigma^2$, where $\|x\|_\Sigma^2 = x^\top \Sigma x$ and where the expectation is with respect to the randomness in Y. [sent-27, score-0.031]

7 We show that a simple variant of ordinary (un-regularized) least squares always compares favorably to ridge regression (as measured by the risk). [sent-28, score-0.784]

8 This observation is based on the following bias-variance decomposition: $\mathrm{Risk}(\hat{\beta}) = \underbrace{E\,\|\hat{\beta} - \bar{\beta}\|_\Sigma^2}_{\text{Variance}} + \underbrace{\|\bar{\beta} - \beta\|_\Sigma^2}_{\text{Prediction Bias}}$ (1), where $\bar{\beta} = E[\hat{\beta}]$. [sent-29, score-0.063]
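The decomposition (1) follows in one line: expand the Σ-norm around $\bar\beta$ and note that $E[\hat\beta - \bar\beta] = 0$ kills the cross term. A short sketch of that step, kept in the note's own notation:

```latex
\mathrm{Risk}(\hat\beta)
  = E\,\bigl\|(\hat\beta-\bar\beta)+(\bar\beta-\beta)\bigr\|_\Sigma^2
  = E\,\|\hat\beta-\bar\beta\|_\Sigma^2
    + 2\,E[\hat\beta-\bar\beta]^\top \Sigma\,(\bar\beta-\beta)
    + \|\bar\beta-\beta\|_\Sigma^2 ,
\qquad E[\hat\beta-\bar\beta]=0 .
```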

9 1. The Risk of Ridge Regression (RR). Ridge regression, or Tikhonov regularization (Tikhonov, 1963), penalizes the ℓ2 norm of a parameter vector β and “shrinks” it towards zero, penalizing large values more. [sent-31, score-0.119]

10 The estimator is: $\hat{\beta}_\lambda = \arg\min_{\beta}\,\{\tfrac{1}{n}\|Y - X\beta\|^2 + \lambda\|\beta\|^2\}$. [sent-32, score-0.09]

11 Note that $\hat{\beta}_0 = \hat{\beta}_{\lambda=0} = \arg\min_{\beta}\,\{\tfrac{1}{n}\|Y - X\beta\|^2\}$ is the ordinary least squares estimator. [sent-34, score-0.283]

12 Without loss of generality, rotate X such that: $\Sigma = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_p)$. [sent-35, score-0.024]

13 To see the nature of this shrinkage, observe that: $[\hat{\beta}_\lambda]_j := \frac{\lambda_j}{\lambda_j + \lambda}\,[\hat{\beta}_0]_j$, where $\hat{\beta}_0$ is the ordinary least squares estimator. [sent-39, score-0.299]
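The shrinkage factor can be read off the first-order condition of the ridge objective; a brief sketch, assuming the fixed-design covariance $\Sigma = X^\top X/n$ is invertible so that $\hat\beta_0$ exists:

```latex
\Bigl(\tfrac{1}{n}X^\top X + \lambda I\Bigr)\hat\beta_\lambda
  = \tfrac{1}{n}X^\top Y
  = \Bigl(\tfrac{1}{n}X^\top X\Bigr)\hat\beta_0
\;\Longrightarrow\;
(\Sigma + \lambda I)\,\hat\beta_\lambda = \Sigma\,\hat\beta_0
\;\Longrightarrow\;
[\hat\beta_\lambda]_j = \frac{\lambda_j}{\lambda_j+\lambda}\,[\hat\beta_0]_j
\quad\text{in the rotated coordinates where } \Sigma=\mathrm{diag}(\lambda_1,\ldots,\lambda_p).
```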

14 Ordinary Least Squares with PCA (PCA-OLS). Now let us construct a simple estimator based on λ. [sent-43, score-0.09]

15 Note that we are in the rotated coordinate system where Σ is equal to $\mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_p)$. [sent-44, score-0.056]

16 Consider the following ordinary least squares estimator on the “top” PCA subspace — it uses the least squares estimate on coordinate j if $\lambda_j \ge \lambda$ and 0 otherwise: $[\hat{\beta}_{\mathrm{PCA},\lambda}]_j = \begin{cases}[\hat{\beta}_0]_j & \text{if } \lambda_j \ge \lambda\\ 0 & \text{otherwise.}\end{cases}$ [sent-48, score-0.551]

17 The following claim shows that this estimator compares favorably to the ridge estimator for every λ, no matter how λ is chosen, for example by cross validation or any other strategy. [sent-49, score-0.589]
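To make the contrast concrete, here is a minimal NumPy sketch (not the authors' code) of both estimators in the PCA-rotated basis. It assumes X has full column rank so the rotated OLS solve is well defined, and it takes Σ = X^T X / n as the covariance being diagonalized; the function name is ours.

```python
import numpy as np

def ridge_and_pca_ols(X, Y, lam):
    """Illustrative sketch of the two estimators compared in the note.

    Works in the PCA-rotated coordinate system where X^T X / n is
    diagonal with entries lambda_j (assumes full column rank)."""
    n = X.shape[0]
    evals, V = np.linalg.eigh(X.T @ X / n)        # eigenvalues lambda_j and eigenvectors
    Z = X @ V                                     # rotated design matrix
    beta_ols = np.linalg.solve(Z.T @ Z, Z.T @ Y)  # OLS in the rotated coordinates
    # Ridge shrinks each OLS coordinate by lambda_j / (lambda_j + lambda).
    beta_ridge = (evals / (evals + lam)) * beta_ols
    # PCA-OLS keeps the OLS estimate where lambda_j >= lambda and zeroes the rest.
    beta_pca = np.where(evals >= lam, beta_ols, 0.0)
    # Map both back to the original coordinates.
    return V @ beta_ridge, V @ beta_pca
```

The only difference between the two is soft shrinkage by λ_j/(λ_j + λ) versus a hard cut at λ, which is exactly what the factor-4 comparison below quantifies.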

18 Proof. Using the bias-variance decomposition of the risk we can write the risk as: $\mathrm{Risk}(\hat{\beta}_{\mathrm{PCA},\lambda}) = \frac{\sigma^2}{n}\sum_{j} \mathbb{1}_{\lambda_j \ge \lambda} + \sum_{j:\lambda_j < \lambda} \lambda_j \beta_j^2$. [sent-52, score-1.09]

19 The first term represents the variance and the second the bias. [sent-53, score-0.026]

20 We now show that the jth term in the expression for the PCA risk is within a factor of 4 of the jth term of the ridge regression risk. [sent-55, score-1.07]

21 First, let’s consider the case when $\lambda_j \ge \lambda$; then the ratio of the jth terms is: $\dfrac{\frac{\sigma^2}{n}}{\left(\frac{\lambda_j}{\lambda_j+\lambda}\right)^2 \frac{\sigma^2}{n} + \lambda_j\left(\frac{\lambda}{\lambda_j+\lambda}\right)^2 \beta_j^2} \le \dfrac{\frac{\sigma^2}{n}}{\left(\frac{\lambda_j}{\lambda_j+\lambda}\right)^2 \frac{\sigma^2}{n}} = \left(1 + \frac{\lambda}{\lambda_j}\right)^2 \le 4$. [sent-56, score-0.218]

22 Similarly, if $\lambda_j < \lambda$, the ratio of the jth terms is: $\dfrac{\lambda_j \beta_j^2}{\left(\frac{\lambda_j}{\lambda_j+\lambda}\right)^2 \frac{\sigma^2}{n} + \lambda_j\left(\frac{\lambda}{\lambda_j+\lambda}\right)^2 \beta_j^2} \le \dfrac{\lambda_j \beta_j^2}{\lambda_j\left(\frac{\lambda}{\lambda_j+\lambda}\right)^2 \beta_j^2} = \left(1 + \frac{\lambda_j}{\lambda}\right)^2 \le 4$. [sent-57, score-0.218]
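A quick numerical sanity check of the two bounds above; it assumes the per-coordinate risk expressions just derived and fixes σ²/n = 1 for concreteness (a sketch, not part of the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2_over_n = 1.0  # fix sigma^2 / n = 1 for concreteness

for _ in range(100_000):
    lam_j, lam, beta_j = rng.uniform(0.01, 10.0, size=3)
    # jth term of the ridge regression risk (variance + bias, as in the proof).
    ridge_j = (lam_j / (lam_j + lam))**2 * sigma2_over_n \
              + lam_j * (lam / (lam_j + lam))**2 * beta_j**2
    # jth term of the PCA-OLS risk: variance if the coordinate is kept, bias if dropped.
    pca_j = sigma2_over_n if lam_j >= lam else lam_j * beta_j**2
    assert pca_j <= 4.0 * ridge_j + 1e-12
print("per-coordinate ratio never exceeded 4")
```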

23 Since each term is within a factor of 4, the proof is complete. [sent-58, score-0.036]

24 It is worth noting that the converse is not true and the ridge regression estimator (RR) can be arbitrarily worse than the PCA-OLS estimator. [sent-59, score-0.592]

25 An example which shows that the left hand inequality is tight is given in the Appendix. [sent-60, score-0.012]

26 Risk Inflation has also been used as a criterion for evaluating feature selection procedures (Foster and George, 1994). [sent-62, score-0.014]

27 Experiments. First, we generated synthetic data with p = 100 and varying values of n ∈ {20, 50, 80, 110}. [sent-64, score-0.021]

28 As can be seen, the risk ratio of PCA-OLS to ridge regression (RR) is never worse than 4, as dictated by Theorem 2, and is often less than 1. [sent-74, score-1.116]
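The paper's exact data-generating process is not recoverable from these excerpts; the sketch below simply evaluates the closed-form risk expressions from the proof on a hypothetical decaying spectrum with p = 100, which is enough to see how the ratio behaves as λ varies (the function name, spectrum, and coefficients are illustrative assumptions, not the paper's setup):

```python
import numpy as np

def risk_ratio(evals, beta, lam, sigma2_over_n=1.0):
    """Risk(PCA-OLS) / Risk(RR) from the per-coordinate expressions in the proof."""
    ridge = np.sum((evals / (evals + lam))**2 * sigma2_over_n
                   + evals * (lam / (evals + lam))**2 * beta**2)
    pca = np.sum(np.where(evals >= lam, sigma2_over_n, evals * beta**2))
    return pca / ridge

p = 100
rng = np.random.default_rng(1)
evals = np.sort(rng.exponential(1.0, size=p))[::-1]  # hypothetical decaying eigenvalues
beta = rng.normal(0.0, 1.0, size=p)                  # hypothetical true coefficients
for lam in [0.01, 0.1, 1.0, 10.0]:
    print(f"lambda={lam:5.2f}  ratio={risk_ratio(evals, beta, lam):.3f}")  # always <= 4
```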

29 Next, we chose two real-world data sets, namely USPS (n=1500, p=241) and BCI (n=400, p=117). [sent-75, score-0.042]

30 Since we do not know the true model for these data sets, we used all the n observations to fit an OLS regression and used it as an estimate of the true parameter β. [sent-76, score-0.075]

31 This is a reasonable approximation to the true parameter as we estimate the ridge regression (RR) and PCA-OLS models on a small subset of these observations. [sent-77, score-0.406]

32 Next we choose a random subset of the observations, namely 0.8 × p, to fit the ridge regression (RR) and PCA-OLS models. [sent-78, score-0.033] [sent-81, score-0.406]

34 As can be seen, the risk ratio of PCA-OLS to ridge regression (RR) is again within a factor of 4 and often PCA-OLS is better, that is, the ratio < 1. [sent-83, score-1.25]

35 Conclusion. We showed that the risk inflation of a particular ordinary least squares estimator (on the “top” PCA subspace) is within a factor of 4 of that of the ridge estimator. [sent-85, score-1.24]

36 It turns out the converse is not true — this PCA estimator may be arbitrarily better than the ridge one. [sent-86, score-0.491]

37 Figure 1: Plots showing the risk ratio as a function of λ, the regularization parameter, and n, for the synthetic data set. [sent-115, score-0.716]

38 The error bars correspond to one standard deviation for 100 such random trials. [sent-117, score-0.021]

39 Figure 2: Plots showing the risk ratio as a function of λ, the regularization parameter, and n, for two real-world data sets (BCI and USPS, top to bottom). [sent-145, score-0.719]

40 The risk for RR can be arbitrarily worse than the PCA-OLS estimator. [sent-147, score-0.554]

41 Choose $\Sigma = \mathrm{diag}(1+\alpha, 1, \ldots, 1)$ for some α > 0 and also choose $\beta = [2+\alpha, 0, \ldots, 0]$. [sent-155, score-0.015]

42 Then, using Lemma 1, we get the risk of the RR estimator as $\mathrm{Risk}(\hat{\beta}_\lambda) = \left(\frac{1+\alpha}{1+\alpha+\lambda}\right)^2 + (p-1)\left(\frac{1}{1+\lambda}\right)^2 + \left(\frac{\lambda}{1+\alpha+\lambda}\right)^2 (2+\alpha)^2 (1+\alpha)$. [sent-160, score-0.59]

43 … λ̂ such that p − 1 = (1 + α). The PCA-OLS risk (from Theorem 2) is: $\mathrm{Risk}(\hat{\beta}_{\mathrm{PCA},\lambda}) = \sum_{j} \mathbb{1}_{\lambda_j \ge \lambda} + \sum_{j:\lambda_j < \lambda} \lambda_j \beta_j^2$. [sent-165, score-0.5]

44 Considering λ ∈ (1, 1 + α), the first term will contribute 1 to the risk and everything else will be 0. [sent-166, score-0.537]

45 So the risk of PCA-OLS is 1 and the risk ratio is $\frac{\mathrm{Risk}(\hat{\beta}_{\mathrm{PCA},\lambda})}{\mathrm{Risk}(\hat{\beta}_\lambda)} \le \frac{1}{1+\alpha}$. [sent-167, score-1.154]

46 Now, for large α, the risk ratio ≈ 0. [sent-168, score-0.654]
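The direction of the gap is easy to reproduce numerically from the per-coordinate ridge risk used in the proof. The sketch below fixes a hypothetical p = 1000 (the excerpts do not recover the paper's exact choice of p as a function of α) and minimizes the RR risk over a grid of λ; the ratio of the PCA-OLS risk, which equals 1 for λ ∈ (1, 1 + α), to even the best RR risk shrinks as α grows:

```python
import numpy as np

def rr_risk(evals, beta, lam, sigma2_over_n=1.0):
    # Per-coordinate ridge regression risk, summed (as in the proof of Theorem 2).
    return np.sum((evals / (evals + lam))**2 * sigma2_over_n
                  + evals * (lam / (evals + lam))**2 * beta**2)

p = 1000  # hypothetical; the paper ties p to alpha, which is not recoverable here
for alpha in [1.0, 10.0, 100.0]:
    evals = np.ones(p); evals[0] = 1.0 + alpha   # Sigma = diag(1+alpha, 1, ..., 1)
    beta = np.zeros(p); beta[0] = 2.0 + alpha    # beta  = [2+alpha, 0, ..., 0]
    grid = np.linspace(1e-3, 10.0 * alpha, 5000)
    best_rr = min(rr_risk(evals, beta, lam) for lam in grid)
    print(f"alpha={alpha:6.1f}  PCA-OLS/RR risk ratio <= {1.0 / best_rr:.4f}")
```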

47 Solution of incorrectly formulated problems and the regularization method. [sent-179, score-0.047]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('risk', 0.5), ('lambda', 0.36), ('ridge', 0.331), ('pca', 0.328), ('rr', 0.202), ('ordinary', 0.169), ('ratio', 0.154), ('foster', 0.15), ('ation', 0.138), ('akade', 0.117), ('hillon', 0.117), ('idge', 0.117), ('ngar', 0.117), ('oster', 0.117), ('rdinary', 0.117), ('quares', 0.1), ('squares', 0.099), ('estimator', 0.09), ('east', 0.09), ('isk', 0.09), ('upenn', 0.083), ('dhillon', 0.083), ('cis', 0.078), ('lyle', 0.078), ('paramveer', 0.078), ('ungar', 0.078), ('egression', 0.078), ('omparison', 0.078), ('regression', 0.075), ('dean', 0.067), ('ols', 0.067), ('philadelphia', 0.066), ('jth', 0.064), ('ey', 0.063), ('pennsylvania', 0.063), ('vs', 0.062), ('sham', 0.06), ('tikhonov', 0.06), ('bci', 0.055), ('wharton', 0.055), ('favorably', 0.052), ('kakade', 0.049), ('diag', 0.047), ('pa', 0.045), ('usps', 0.044), ('converse', 0.042), ('bias', 0.037), ('argmin', 0.034), ('subspace', 0.032), ('coordinate', 0.032), ('dictated', 0.03), ('soviet', 0.03), ('regularization', 0.028), ('memorial', 0.028), ('mv', 0.028), ('arbitrarily', 0.028), ('decomposition', 0.027), ('compares', 0.026), ('variance', 0.026), ('worse', 0.026), ('gabor', 0.024), ('george', 0.024), ('rotate', 0.024), ('rotated', 0.024), ('world', 0.024), ('factor', 0.023), ('penalizing', 0.023), ('yi', 0.023), ('everything', 0.022), ('shrinks', 0.022), ('synthetic', 0.021), ('bars', 0.021), ('penalizes', 0.021), ('lugosi', 0.02), ('math', 0.02), ('iii', 0.02), ('incorrectly', 0.019), ('namely', 0.018), ('drive', 0.018), ('excess', 0.017), ('xi', 0.017), ('plots', 0.017), ('variant', 0.017), ('top', 0.017), ('projects', 0.017), ('lemma', 0.016), ('microsoft', 0.016), ('shrinkage', 0.016), ('least', 0.015), ('choose', 0.015), ('contribute', 0.015), ('criterion', 0.014), ('randomness', 0.014), ('convenience', 0.013), ('within', 0.013), ('showing', 0.013), ('usa', 0.013), ('denoting', 0.013), ('var', 0.013), ('tight', 0.012)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999976 7 jmlr-2013-A Risk Comparison of Ordinary Least Squares vs Ridge Regression

Author: Paramveer S. Dhillon, Dean P. Foster, Sham M. Kakade, Lyle H. Ungar

Abstract: We compare the risk of ridge regression to a simple variant of ordinary least squares, in which one simply projects the data onto a finite dimensional subspace (as specified by a principal component analysis) and then performs an ordinary (un-regularized) least squares regression in this subspace. This note shows that the risk of this ordinary least squares method (PCA-OLS) is within a constant factor (namely 4) of the risk of ridge regression (RR). Keywords: risk inflation, ridge regression, pca

2 0.089557193 97 jmlr-2013-Risk Bounds of Learning Processes for Lévy Processes

Author: Chao Zhang, Dacheng Tao

Abstract: Lévy processes refer to a class of stochastic processes, for example, Poisson processes and Brownian motions, and play an important role in stochastic processes and machine learning. Therefore, it is essential to study risk bounds of the learning process for time-dependent samples drawn from a Lévy process (or briefly called learning process for Lévy process). It is noteworthy that samples in this learning process are not independently and identically distributed (i.i.d.). Therefore, results in traditional statistical learning theory are not applicable (or at least cannot be applied directly), because they are obtained under the sample-i.i.d. assumption. In this paper, we study risk bounds of the learning process for time-dependent samples drawn from a Lévy process, and then analyze the asymptotical behavior of the learning process. In particular, we first develop the deviation inequalities and the symmetrization inequality for the learning process. By using the resultant inequalities, we then obtain the risk bounds based on the covering number. Finally, based on the resulting risk bounds, we study the asymptotic convergence and the rate of convergence of the learning process for Lévy process. Meanwhile, we also give a comparison to the related results under the sample-i.i.d. assumption. Keywords: Lévy process, risk bound, deviation inequality, symmetrization inequality, statistical learning theory, time-dependent

3 0.051714253 116 jmlr-2013-Truncated Power Method for Sparse Eigenvalue Problems

Author: Xiao-Tong Yuan, Tong Zhang

Abstract: This paper considers the sparse eigenvalue problem, which is to extract dominant (largest) sparse eigenvectors with at most k non-zero components. We propose a simple yet effective solution called truncated power method that can approximately solve the underlying nonconvex optimization problem. A strong sparse recovery result is proved for the truncated power method, and this theory is our key motivation for developing the new algorithm. The proposed method is tested on applications such as sparse principal component analysis and the densest k-subgraph problem. Extensive experiments on several synthetic and real-world data sets demonstrate the competitive empirical performance of our method. Keywords: sparse eigenvalue, power method, sparse principal component analysis, densest k-subgraph

4 0.050006192 5 jmlr-2013-A Near-Optimal Algorithm for Differentially-Private Principal Components

Author: Kamalika Chaudhuri, Anand D. Sarwate, Kaushik Sinha

Abstract: The principal components analysis (PCA) algorithm is a standard tool for identifying good low-dimensional approximations to high-dimensional data. Many data sets of interest contain private or sensitive information about individuals. Algorithms which operate on such data should be sensitive to the privacy risks in publishing their outputs. Differential privacy is a framework for developing tradeoffs between privacy and the utility of these outputs. In this paper we investigate the theory and empirical performance of differentially private approximations to PCA and propose a new method which explicitly optimizes the utility of the output. We show that the sample complexity of the proposed method differs from the existing procedure in the scaling with the data dimension, and that our method is nearly optimal in terms of this scaling. We furthermore illustrate our results, showing that on real data there is a large performance gap between the existing method and our method. Keywords: differential privacy, principal components analysis, dimension reduction

5 0.043437172 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization

Author: Yuchen Zhang, John C. Duchi, Martin J. Wainwright

Abstract: We analyze two communication-efficient algorithms for distributed optimization in statistical settings involving large-scale data sets. The first algorithm is a standard averaging method that distributes the N data samples evenly to m machines, performs separate minimization on each subset, and then averages the estimates. We provide a sharp analysis of this average mixture algorithm, showing that under a reasonable set of conditions, the combined parameter achieves mean-squared error (MSE) that decays as O(N^{-1} + (N/m)^{-2}). Whenever m ≤ √N, this guarantee matches the best possible rate achievable by a centralized algorithm having access to all N samples. The second algorithm is a novel method, based on an appropriate form of bootstrap subsampling. Requiring only a single round of communication, it has mean-squared error that decays as O(N^{-1} + (N/m)^{-3}), and so is more robust to the amount of parallelization. In addition, we show that a stochastic gradient-based method attains mean-squared error decaying as O(N^{-1} + (N/m)^{-3/2}), easing computation at the expense of a potentially slower MSE rate. We also provide an experimental evaluation of our methods, investigating their performance both on simulated data and on a large-scale regression problem from the internet search domain. In particular, we show that our methods can be used to efficiently solve an advertisement prediction problem from the Chinese SoSo Search Engine, which involves logistic regression with N ≈ 2.4 × 10^8 samples and d ≈ 740,000 covariates. Keywords: distributed learning, stochastic optimization, averaging, subsampling

6 0.03589581 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion

7 0.035707217 18 jmlr-2013-Beyond Fano's Inequality: Bounds on the Optimal F-Score, BER, and Cost-Sensitive Risk and Their Implications

8 0.033737238 74 jmlr-2013-Multivariate Convex Regression with Adaptive Partitioning

9 0.032741252 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso

10 0.029274926 31 jmlr-2013-Derivative Estimation with Local Polynomial Fitting

11 0.028839402 6 jmlr-2013-A Plug-in Approach to Neyman-Pearson Classification

12 0.027413255 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression

13 0.026279254 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems

14 0.025243679 106 jmlr-2013-Stationary-Sparse Causality Network Learning

15 0.025029171 69 jmlr-2013-Manifold Regularization and Semi-supervised Learning: Some Theoretical Analyses

16 0.023494843 76 jmlr-2013-Nonparametric Sparsity and Regularization

17 0.023275435 73 jmlr-2013-Multicategory Large-Margin Unified Machines

18 0.022541357 8 jmlr-2013-A Theory of Multiclass Boosting

19 0.021985676 96 jmlr-2013-Regularization-Free Principal Curve Estimation

20 0.021970836 68 jmlr-2013-Machine Learning with Operational Costs


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.105), (1, 0.05), (2, 0.027), (3, 0.013), (4, 0.02), (5, -0.075), (6, 0.006), (7, 0.049), (8, 0.096), (9, -0.031), (10, 0.064), (11, -0.0), (12, -0.025), (13, -0.023), (14, -0.04), (15, 0.013), (16, -0.112), (17, -0.21), (18, 0.044), (19, 0.235), (20, -0.06), (21, -0.096), (22, 0.185), (23, -0.096), (24, -0.017), (25, 0.244), (26, -0.208), (27, 0.123), (28, 0.202), (29, -0.049), (30, -0.026), (31, 0.049), (32, -0.111), (33, -0.121), (34, -0.055), (35, -0.167), (36, 0.16), (37, -0.029), (38, -0.122), (39, -0.126), (40, -0.043), (41, -0.035), (42, -0.097), (43, -0.081), (44, 0.003), (45, -0.047), (46, -0.12), (47, 0.015), (48, 0.113), (49, -0.052)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.98845273 7 jmlr-2013-A Risk Comparison of Ordinary Least Squares vs Ridge Regression

Author: Paramveer S. Dhillon, Dean P. Foster, Sham M. Kakade, Lyle H. Ungar

Abstract: We compare the risk of ridge regression to a simple variant of ordinary least squares, in which one simply projects the data onto a finite dimensional subspace (as specified by a principal component analysis) and then performs an ordinary (un-regularized) least squares regression in this subspace. This note shows that the risk of this ordinary least squares method (PCA-OLS) is within a constant factor (namely 4) of the risk of ridge regression (RR). Keywords: risk inflation, ridge regression, pca

2 0.67109782 97 jmlr-2013-Risk Bounds of Learning Processes for Lévy Processes

Author: Chao Zhang, Dacheng Tao

Abstract: Lévy processes refer to a class of stochastic processes, for example, Poisson processes and Brownian motions, and play an important role in stochastic processes and machine learning. Therefore, it is essential to study risk bounds of the learning process for time-dependent samples drawn from a Lévy process (or briefly called learning process for Lévy process). It is noteworthy that samples in this learning process are not independently and identically distributed (i.i.d.). Therefore, results in traditional statistical learning theory are not applicable (or at least cannot be applied directly), because they are obtained under the sample-i.i.d. assumption. In this paper, we study risk bounds of the learning process for time-dependent samples drawn from a Lévy process, and then analyze the asymptotical behavior of the learning process. In particular, we first develop the deviation inequalities and the symmetrization inequality for the learning process. By using the resultant inequalities, we then obtain the risk bounds based on the covering number. Finally, based on the resulting risk bounds, we study the asymptotic convergence and the rate of convergence of the learning process for Lévy process. Meanwhile, we also give a comparison to the related results under the sample-i.i.d. assumption. Keywords: Lévy process, risk bound, deviation inequality, symmetrization inequality, statistical learning theory, time-dependent

3 0.37770545 18 jmlr-2013-Beyond Fano's Inequality: Bounds on the Optimal F-Score, BER, and Cost-Sensitive Risk and Their Implications

Author: Ming-Jie Zhao, Narayanan Edakunni, Adam Pocock, Gavin Brown

Abstract: Fano’s inequality lower bounds the probability of transmission error through a communication channel. Applied to classification problems, it provides a lower bound on the Bayes error rate and motivates the widely used Infomax principle. In modern machine learning, we are often interested in more than just the error rate. In medical diagnosis, different errors incur different cost; hence, the overall risk is cost-sensitive. Two other popular criteria are balanced error rate (BER) and F-score. In this work, we focus on the two-class problem and use a general definition of conditional entropy (including Shannon’s as a special case) to derive upper/lower bounds on the optimal F-score, BER and cost-sensitive risk, extending Fano’s result. As a consequence, we show that Infomax is not suitable for optimizing F-score or cost-sensitive risk, in that it can potentially lead to low F-score and high risk. For cost-sensitive risk, we propose a new conditional entropy formulation which avoids this inconsistency. In addition, we consider the common practice of using a threshold on the posterior probability to tune performance of a classifier. As is widely known, a threshold of 0.5, where the posteriors cross, minimizes error rate—we derive similar optimal thresholds for F-score and BER. Keywords: balanced error rate, F-score (Fβ -measure), cost-sensitive risk, conditional entropy, lower/upper bound

4 0.32104361 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization

Author: Yuchen Zhang, John C. Duchi, Martin J. Wainwright

Abstract: We analyze two communication-efficient algorithms for distributed optimization in statistical settings involving large-scale data sets. The first algorithm is a standard averaging method that distributes the N data samples evenly to m machines, performs separate minimization on each subset, and then averages the estimates. We provide a sharp analysis of this average mixture algorithm, showing that under a reasonable set of conditions, the combined parameter achieves mean-squared error (MSE) that decays as O(N^{-1} + (N/m)^{-2}). Whenever m ≤ √N, this guarantee matches the best possible rate achievable by a centralized algorithm having access to all N samples. The second algorithm is a novel method, based on an appropriate form of bootstrap subsampling. Requiring only a single round of communication, it has mean-squared error that decays as O(N^{-1} + (N/m)^{-3}), and so is more robust to the amount of parallelization. In addition, we show that a stochastic gradient-based method attains mean-squared error decaying as O(N^{-1} + (N/m)^{-3/2}), easing computation at the expense of a potentially slower MSE rate. We also provide an experimental evaluation of our methods, investigating their performance both on simulated data and on a large-scale regression problem from the internet search domain. In particular, we show that our methods can be used to efficiently solve an advertisement prediction problem from the Chinese SoSo Search Engine, which involves logistic regression with N ≈ 2.4 × 10^8 samples and d ≈ 740,000 covariates. Keywords: distributed learning, stochastic optimization, averaging, subsampling

5 0.31659746 116 jmlr-2013-Truncated Power Method for Sparse Eigenvalue Problems

Author: Xiao-Tong Yuan, Tong Zhang

Abstract: This paper considers the sparse eigenvalue problem, which is to extract dominant (largest) sparse eigenvectors with at most k non-zero components. We propose a simple yet effective solution called truncated power method that can approximately solve the underlying nonconvex optimization problem. A strong sparse recovery result is proved for the truncated power method, and this theory is our key motivation for developing the new algorithm. The proposed method is tested on applications such as sparse principal component analysis and the densest k-subgraph problem. Extensive experiments on several synthetic and real-world data sets demonstrate the competitive empirical performance of our method. Keywords: sparse eigenvalue, power method, sparse principal component analysis, densest k-subgraph

6 0.29910085 74 jmlr-2013-Multivariate Convex Regression with Adaptive Partitioning

7 0.20090355 67 jmlr-2013-MLPACK: A Scalable C++ Machine Learning Library

8 0.19110879 55 jmlr-2013-Joint Harmonic Functions and Their Supervised Connections

9 0.18880811 96 jmlr-2013-Regularization-Free Principal Curve Estimation

10 0.16915351 5 jmlr-2013-A Near-Optimal Algorithm for Differentially-Private Principal Components

11 0.15840131 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion

12 0.15310529 68 jmlr-2013-Machine Learning with Operational Costs

13 0.14633895 73 jmlr-2013-Multicategory Large-Margin Unified Machines

14 0.14575624 106 jmlr-2013-Stationary-Sparse Causality Network Learning

15 0.13814364 46 jmlr-2013-GURLS: A Least Squares Library for Supervised Learning

16 0.13360474 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression

17 0.13081014 50 jmlr-2013-Greedy Feature Selection for Subspace Clustering

18 0.12952207 40 jmlr-2013-Efficient Program Synthesis Using Constraint Satisfaction in Inductive Logic Programming

19 0.12410644 14 jmlr-2013-Asymptotic Results on Adaptive False Discovery Rate Controlling Procedures Based on Kernel Estimators

20 0.11932548 85 jmlr-2013-Pairwise Likelihood Ratios for Estimation of Non-Gaussian Structural Equation Models


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.062), (6, 0.014), (10, 0.034), (70, 0.012), (75, 0.013), (87, 0.718)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92742604 7 jmlr-2013-A Risk Comparison of Ordinary Least Squares vs Ridge Regression

Author: Paramveer S. Dhillon, Dean P. Foster, Sham M. Kakade, Lyle H. Ungar

Abstract: We compare the risk of ridge regression to a simple variant of ordinary least squares, in which one simply projects the data onto a finite dimensional subspace (as specified by a principal component analysis) and then performs an ordinary (un-regularized) least squares regression in this subspace. This note shows that the risk of this ordinary least squares method (PCA-OLS) is within a constant factor (namely 4) of the risk of ridge regression (RR). Keywords: risk inflation, ridge regression, pca

2 0.91139102 112 jmlr-2013-Tapkee: An Efficient Dimension Reduction Library

Author: Sergey Lisitsyn, Christian Widmer, Fernando J. Iglesias Garcia

Abstract: We present Tapkee, a C++ template library that provides efficient implementations of more than 20 widely used dimensionality reduction techniques ranging from Locally Linear Embedding (Roweis and Saul, 2000) and Isomap (de Silva and Tenenbaum, 2002) to the recently introduced BarnesHut-SNE (van der Maaten, 2013). Our library was designed with a focus on performance and flexibility. For performance, we combine efficient multi-core algorithms, modern data structures and state-of-the-art low-level libraries. To achieve flexibility, we designed a clean interface for applying methods to user data and provide a callback API that facilitates integration with the library. The library is freely available as open-source software and is distributed under the permissive BSD 3-clause license. We encourage the integration of Tapkee into other open-source toolboxes and libraries. For example, Tapkee has been integrated into the codebase of the Shogun toolbox (Sonnenburg et al., 2010), giving us access to a rich set of kernels, distance measures and bindings to common programming languages including Python, Octave, Matlab, R, Java, C#, Ruby, Perl and Lua. Source code, examples and documentation are available at http://tapkee.lisitsyn.me. Keywords: dimensionality reduction, machine learning, C++, open source software

3 0.69219673 116 jmlr-2013-Truncated Power Method for Sparse Eigenvalue Problems

Author: Xiao-Tong Yuan, Tong Zhang

Abstract: This paper considers the sparse eigenvalue problem, which is to extract dominant (largest) sparse eigenvectors with at most k non-zero components. We propose a simple yet effective solution called truncated power method that can approximately solve the underlying nonconvex optimization problem. A strong sparse recovery result is proved for the truncated power method, and this theory is our key motivation for developing the new algorithm. The proposed method is tested on applications such as sparse principal component analysis and the densest k-subgraph problem. Extensive experiments on several synthetic and real-world data sets demonstrate the competitive empirical performance of our method. Keywords: sparse eigenvalue, power method, sparse principal component analysis, densest k-subgraph

4 0.33677596 37 jmlr-2013-Divvy: Fast and Intuitive Exploratory Data Analysis

Author: Joshua M. Lewis, Virginia R. de Sa, Laurens van der Maaten

Abstract: Divvy is an application for applying unsupervised machine learning techniques (clustering and dimensionality reduction) to the data analysis process. Divvy provides a novel UI that allows researchers to tighten the action-perception loop of changing algorithm parameters and seeing a visualization of the result. Machine learning researchers can use Divvy to publish easy to use reference implementations of their algorithms, which helps the machine learning field have a greater impact on research practices elsewhere. Keywords: clustering, dimensionality reduction, open source software, human computer interaction, data visualization

5 0.31194308 46 jmlr-2013-GURLS: A Least Squares Library for Supervised Learning

Author: Andrea Tacchetti, Pavan K. Mallapragada, Matteo Santoro, Lorenzo Rosasco

Abstract: We present GURLS, a least squares, modular, easy-to-extend software library for efficient supervised learning. GURLS is targeted to machine learning practitioners, as well as non-specialists. It offers a number of state-of-the-art training strategies for medium and large-scale learning, and routines for efficient model selection. The library is particularly well suited for multi-output problems (multi-category/multi-label). GURLS is currently available in two independent implementations: Matlab and C++. It takes advantage of the favorable properties of the regularized least squares algorithm to exploit advanced tools in linear algebra. Routines to handle computations with very large matrices by means of memory-mapped storage and distributed task execution are available. The package is distributed under the BSD license and is available for download at https://github.com/LCSL/GURLS. Keywords: regularized least squares, big data, linear algebra

6 0.28448084 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression

7 0.27782306 59 jmlr-2013-Large-scale SVD and Manifold Learning

8 0.25894687 74 jmlr-2013-Multivariate Convex Regression with Adaptive Partitioning

9 0.25015914 31 jmlr-2013-Derivative Estimation with Local Polynomial Fitting

10 0.24104594 50 jmlr-2013-Greedy Feature Selection for Subspace Clustering

11 0.23898865 86 jmlr-2013-Parallel Vector Field Embedding

12 0.23438236 53 jmlr-2013-Improving CUR Matrix Decomposition and the Nystrom Approximation via Adaptive Sampling

13 0.2268731 52 jmlr-2013-How to Solve Classification and Regression Problems on High-Dimensional Data with a Supervised Extension of Slow Feature Analysis

14 0.22630247 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso

15 0.21544015 5 jmlr-2013-A Near-Optimal Algorithm for Differentially-Private Principal Components

16 0.21446367 27 jmlr-2013-Consistent Selection of Tuning Parameters via Variable Selection Stability

17 0.21123956 1 jmlr-2013-AC++Template-Based Reinforcement Learning Library: Fitting the Code to the Mathematics

18 0.20040868 83 jmlr-2013-Orange: Data Mining Toolbox in Python

19 0.19959798 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization

20 0.19410932 54 jmlr-2013-JKernelMachines: A Simple Framework for Kernel Machines