jmlr jmlr2013 jmlr2013-116 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Xiao-Tong Yuan, Tong Zhang
Abstract: This paper considers the sparse eigenvalue problem, which is to extract dominant (largest) sparse eigenvectors with at most k non-zero components. We propose a simple yet effective solution called truncated power method that can approximately solve the underlying nonconvex optimization problem. A strong sparse recovery result is proved for the truncated power method, and this theory is our key motivation for developing the new algorithm. The proposed method is tested on applications such as sparse principal component analysis and the densest k-subgraph problem. Extensive experiments on several synthetic and real-world data sets demonstrate the competitive empirical performance of our method. Keywords: sparse eigenvalue, power method, sparse principal component analysis, densest k-subgraph
Reference: text
sentIndex sentText sentNum sentScore
1 Department of Statistics, Rutgers University, New Jersey, 08816, USA. Editor: Hui Zou. Abstract: This paper considers the sparse eigenvalue problem, which is to extract dominant (largest) sparse eigenvectors with at most k non-zero components. [sent-4, score-0.383]
2 A strong sparse recovery result is proved for the truncated power method, and this theory is our key motivation for developing the new algorithm. [sent-6, score-0.24]
3 The proposed method is tested on applications such as sparse principal component analysis and the densest k-subgraph problem. [sent-7, score-0.492]
4 Keywords: sparse eigenvalue, power method, sparse principal component analysis, densest k-subgraph. [sent-9, score-0.656]
5 In this context, the problem (1) is also referred to as sparse principal component analysis (sparse PCA). [sent-14, score-0.187]
6 More recently, Paul and Johnstone (2012) studied an extension called the multiple spike model; they proposed an augmented sparse PCA method for estimating each of the leading eigenvectors and investigated the rate of convergence of their procedure in the high-dimensional setting. [sent-35, score-0.207]
7 In another recent work that is independent of ours, Ma (2013) analyzed an iterative thresholding method for recovering the sparse principal subspace. [sent-36, score-0.217]
8 (2012) analyzed the optimal convergence rate of sparse PCA and introduced an adaptive procedure for estimating the principal subspace. [sent-40, score-0.187]
9 We show that if the true matrix Ā has a sparse (or approximately sparse) dominant eigenvector x̄, then under appropriate assumptions, this algorithm can recover x̄ when the spectral norm of sparse submatrices of the perturbation E is small. [sent-43, score-0.507]
10 We have applied the proposed method to sparse PCA and to the densest k-subgraph finding problem (with proper modification). [sent-46, score-0.423]
11 Extensive experiments on synthetic and real-world large-scale data sets demonstrate both the competitive sparse recovery performance and the computational efficiency of our method. [sent-47, score-0.148]
12 It is worth mentioning that the truncated power method developed in this paper can also be applied to the smallest k-sparse eigenvalue problem given by λmin(A, k) = min_{x ∈ R^p} x⊤Ax, subject to ‖x‖ = 1, ‖x‖_0 ≤ k, which also has many applications in machine learning. [sent-48, score-0.184]
13 We denote by A_k any k × k principal submatrix of A and by A_F the principal submatrix of A with rows and columns indexed in the set F. [sent-54, score-0.178]
14 §4 evaluates the relevance of our theoretical prediction and the practical performance of the proposed algorithm in applications to sparse PCA and the densest k-subgraph finding problem. [sent-66, score-0.423]
15 1 Algorithm. Therefore, in order to solve the sparse eigenvalue problem (1) more efficiently, we consider an iterative procedure based on the standard power method for eigenvalue problems, while maintaining the desired sparsity for the intermediate solutions. [sent-75, score-0.203]
16 The resulting vector is then normalized to unit length, which becomes xt . [sent-81, score-0.214]
17 Algorithm 1: Truncated Power (TPower) Method. Input: matrix A ∈ S^p, initial vector x0 ∈ R^p. Output: xt. Parameters: cardinality k ∈ {1, . . . , p}. [sent-89, score-0.356]
18 until convergence. Remark 2: Similar to the behavior of the traditional power method, if A ∈ S^p_+, then TPower tries to find the (sparse) eigenvector of A corresponding to the largest eigenvalue. [sent-98, score-0.166]
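The algorithm box above is only partially captured by the extraction. As a concrete illustration, the following is a minimal NumPy sketch of the truncated power iteration as described in the text (power step, truncation to the k largest-magnitude entries, renormalization). It is not the authors' reference implementation; the convergence tolerance and iteration cap are arbitrary choices, and the helper names (truncate, tpower) are introduced here for convenience and reused in later sketches.

```python
import numpy as np

def truncate(x, k):
    """Keep the k entries of x that are largest in absolute value; zero out the rest."""
    y = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    y[idx] = x[idx]
    return y

def tpower(A, k, x0, max_iter=200, tol=1e-8):
    """Truncated power method: approximately maximize x'Ax subject to ||x|| = 1, ||x||_0 <= k."""
    x = truncate(np.asarray(x0, dtype=float), k)
    x /= np.linalg.norm(x)
    for _ in range(max_iter):
        y = A @ x                      # standard power step
        y = truncate(y, k)             # enforce the cardinality constraint
        y /= np.linalg.norm(y)         # normalize to unit length
        if min(np.linalg.norm(y - x), np.linalg.norm(y + x)) < tol:
            return y
        x = y
    return x
```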
19 Proposition 3: If all 2k × 2k principal submatrices A_{2k} of A are positive semidefinite, then the sequence {Q(xt)}_{t≥1} is monotonically increasing, where xt is obtained from the TPower algorithm. [sent-105, score-0.303]
20 Proof: Observe that the iterate xt in TPower solves the following constrained linear optimization problem: xt = arg max_x L(x; xt−1), where L(x; xt−1) := 2⟨Axt−1, x − xt−1⟩. [sent-106, score-0.428]
21 Since ‖xt − xt−1‖_0 ≤ 2k and each 2k × 2k principal submatrix of A is positive semidefinite, we have (xt − xt−1)⊤A(xt − xt−1) ≥ 0. [sent-108, score-0.303]
22 By the definition of xt as the maximizer of L(x; xt−1) over x (subject to ‖x‖ = 1 and ‖x‖_0 ≤ k), we have L(xt; xt−1) ≥ L(xt−1; xt−1) = 0. [sent-110, score-0.214]
23 We assume that the noise matrix E is a dense p × p matrix such that its sparse submatrices have small spectral norm ρ(E, s) (see (3)) for s of the same order as k. [sent-114, score-0.199]
24 The main advantage of the sparse eigenvalue formulation (1) over the standard eigenvalue formulation is that the estimation error of its optimal solution depends on ρ(E, s) with a small s = O(k), rather than on ρ(E). [sent-118, score-0.254]
25 The purpose of this section is to show that if the matrix Ā has a unique sparse (or approximately sparse) dominant eigenvector, then under suitable conditions, TPower can (approximately) recover this eigenvector from the noisy observation A. [sent-121, score-0.296]
26 The dominant eigenvalue λ̄ of Ā is assumed to be non-degenerate, with a gap ∆λ = λ̄ − max_{j>1} |λ_j(Ā)|. Moreover, assume that the eigenvector x̄ corresponding to the dominant eigenvalue λ̄ is sparse with cardinality k̄ = ‖x̄‖_0. [sent-123, score-0.421]
27 Note that in the extreme case of s = p, this result follows directly from the standard eigenvalue perturbation analysis (which does not require Assumption 1). [sent-125, score-0.161]
28 The final error bound is a direct generalization of the standard matrix perturbation result, which depends on the full matrix perturbation error ρ(E). [sent-127, score-0.246]
29 If ‖x̄′ − x̄‖ is sufficiently small, then x̄′ is the dominant eigenvector of a symmetric matrix Ā′ that is close to Ā; hence the theorem can be applied with the decomposition A = Ā′ + E′ where E′ = E + Ā − Ā′. [sent-140, score-0.153]
30 It follows that under appropriate conditions, as long as we can find an initial x0 such that |x0⊤x̄| ≥ c(ρ(E, s) + (k̄/k)^{1/2}) for some constant c, then 1 − |xt⊤x̄| converges geometrically until ‖xt − x̄‖ = O(ρ(E, s)). [sent-145, score-0.214]
31 This result is similar to the standard eigenvector perturbation result stated in Lemma 10 of Appendix A, except that we replace the spectral error ρ(E) of the full matrix by ρ(E, s), which can be significantly smaller when s ≪ p. [sent-146, score-0.198]
32 To our knowledge, this is the first sparse recovery result for the sparse eigenvalue problem in a relatively general setting. [sent-147, score-0.33]
33 We may run TPower with an appropriate initial vector to obtain an approximate solution xt of error ‖xt − x̄‖ = O(√(k̄ ln p / n)). [sent-160, score-0.428]
34 Based on a similar spike covariance model, Ma (2013) independently presented and analyzed an iterative thresholding method for recovering sparse orthogonal principal components, using ideas related to what we present in this paper. [sent-165, score-0.321]
35 With such a k, ρ(E, s) may be relatively large and hence the theorem indicates that xt may not converge to x̄ accurately. [sent-167, score-0.214]
36 Since the resulting |xt⊤x̄| can be much larger than |x0⊤x̄|, we may reduce k and rerun the algorithm with a k-sparse truncation of xt as the initial vector, together with the reduced k. [sent-170, score-0.253]
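A minimal sketch of the warm-start strategy just described, reusing the hypothetical tpower and truncate helpers from the earlier sketch: run with a generous cardinality first, then truncate the current iterate to a smaller k and rerun. The particular schedule below is only an illustrative assumption.

```python
import numpy as np

def tpower_warm_start(A, k_schedule, x0):
    """Run TPower over a decreasing sequence of cardinalities, e.g. [4*k, 2*k, k],
    using the (re-truncated) previous solution to initialize the next run."""
    x = np.asarray(x0, dtype=float)
    for k in k_schedule:
        x = tpower(A, k, truncate(x, k))
    return x
```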
37 In this stage, TPower reduces to the classic power method which outputs the dominant eigenvector x of A. [sent-174, score-0.169]
38 We then illustrate the effectiveness of the TPower method when applied to sparse principal component analysis (sparse PCA) (in §4.2) [sent-184, score-0.187]
39 and the densest k-subgraph (DkS) finding problem (in §4.3). [sent-185, score-0.305]
40 By definition, δ(s) is an increasing function with respect to the perturbation error ρ(E, s) and a decreasing function with respect to the gap ∆λ between the largest eigenvalue and the remaining eigenvalues. [sent-193, score-0.187]
41 We will verify the results of Theorem 4 by applying TPower to the following single spike model with true covariance Ā = βx̄x̄⊤ + I_{p×p} and empirical covariance A = (1/n) ∑_{i=1}^n x_i x_i⊤, where x_i ∼ N(0, Ā). [sent-194, score-0.15]
42 For the true covariance matrix Ā, its dominant eigenvector is x̄ with eigenvalue β + 1, and its eigenvalue gap is ∆λ = β. [sent-195, score-0.267]
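The single spike experiment can be reproduced schematically as below. The dimensions, spike strength β, sample size, and the initialization heuristic (the coordinate with the largest diagonal entry of A) are placeholder choices for illustration, not necessarily those used in the paper; tpower refers to the hypothetical sketch given earlier.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, k_bar, beta = 1000, 200, 10, 2.0

x_bar = np.zeros(p)                      # sparse unit-norm dominant eigenvector
x_bar[:k_bar] = 1.0 / np.sqrt(k_bar)

# Samples x_i ~ N(0, beta * x_bar x_bar' + I): a spike along x_bar plus isotropic noise.
X = np.sqrt(beta) * rng.standard_normal((n, 1)) * x_bar + rng.standard_normal((n, p))
A = X.T @ X / n                          # empirical covariance

x0 = np.zeros(p)
x0[np.argmax(np.diag(A))] = 1.0          # simple initialization heuristic
x_hat = tpower(A, k=2 * k_bar, x0=x0)
print(abs(x_hat @ x_bar))                # near 1 when the sparse eigenvector is recovered
```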
43 For each empirical covariance matrix A, we also generate an independent empirical covariance matrix Â_val to select k from the candidate set K = {5, 10, 15, . . .}. [sent-213, score-0.152]
44 Sparse principal component analysis (sparse PCA) is an extension of PCA that aims at finding sparse vectors (loading vectors) capturing the maximum amount of variance in the data. [sent-246, score-0.187]
45 Another popular technique for sparse PCA is regularized sparse learning. [sent-252, score-0.236]
46 Journée et al. (2010) studied a generalized power method to solve sparse PCA with a certain dual reformulation of the problem. [sent-261, score-0.19]
47 Given a sample covariance matrix Σ ∈ S^p_+ (or equivalently a centered data matrix D ∈ R^{n×p} with n rows of p-dimensional observation vectors such that Σ = D⊤D) and the target cardinality k, following the literature (Moghaddam et al. [sent-264, score-0.188]
48 To find the top m rather than the top one sparse loading vectors, a common approach in the literature (d’Aspremont et al. [sent-269, score-0.148]
49 , 2006; Mackey, 2008) is to use the iterative deflation method for PCA: subsequent sparse loading vectors can be obtained by recursively removing the contribution of the previously found loading vectors from the covariance matrix. [sent-271, score-0.224]
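As a sketch of the deflation loop just described, one can repeatedly run TPower and project the found direction out of the covariance before extracting the next loading vector. Orthogonal projection deflation is used here for concreteness; it is one of several deflation schemes discussed by Mackey (2008), and tpower again refers to the earlier hypothetical sketch.

```python
import numpy as np

def sparse_pca_deflation(Sigma, k, m):
    """Extract m sparse loading vectors: TPower followed by projection deflation."""
    S = Sigma.copy()
    loadings = []
    for _ in range(m):
        x0 = np.zeros(S.shape[0])
        x0[np.argmax(np.diag(S))] = 1.0          # initialize on the largest-variance coordinate
        v = tpower(S, k, x0)
        loadings.append(v)
        P = np.eye(S.shape[0]) - np.outer(v, v)  # project out the direction just found
        S = P @ S @ P
    return np.column_stack(loadings)
```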
50 Given zt, the update of xt is a hard-thresholding operation which selects those entries in D⊤zt = D⊤Dxt−1/‖Dxt−1‖ with squared values greater than γ and then normalizes the vector after truncation. [sent-288, score-0.233]
51 To update xt, TPower selects the top k entries of D⊤Dxt−1 and then normalizes the truncated vector. [sent-291, score-0.283]
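When the centered data matrix D is available, the TPower update just described can be carried out without ever forming the p × p covariance Σ = D⊤D. The following is an illustrative matrix-free version of a single update; truncate is the hypothetical helper from the earlier sketch.

```python
import numpy as np

def tpower_step_from_data(D, x, k):
    """One TPower update using the data matrix directly: D'(Dx) equals (D'D)x = Sigma x."""
    y = D.T @ (D @ x)                  # covariance-vector product without forming Sigma
    y = truncate(y, k)                 # keep the top-k entries in absolute value
    return y / np.linalg.norm(y)
```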
52 Our method is also related to the Iterative Thresholding Sparse PCA (ITSPCA) method (Ma, 2013) which concentrates on recovering a sparse subspace of dimension m under the spike model. [sent-296, score-0.206]
53 First, the truncation strategy is different: we truncate the vector by preserving the top k largest absolute entries and setting the remaining entries to zeros, while ITSPCA truncates the vector by setting entries below a fixed threshold to zeros. [sent-299, score-0.174]
54 Second, the analysis is different: TPower is analyzed using matrix perturbation theory and thus is deterministic, while the analysis of ITSPCA focuses on the convergence rate under the stochastic multiple spike model. [sent-300, score-0.181]
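The first difference noted above, the truncation strategy, can be made concrete by writing the two operators side by side. This is an illustrative sketch: the first operator mirrors the truncate helper used earlier, and the threshold γ of the second variant is a free parameter.

```python
import numpy as np

def truncate_top_k(x, k):
    """TPower-style truncation: keep the k entries largest in absolute value."""
    y = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    y[idx] = x[idx]
    return y

def truncate_threshold(x, gamma):
    """Fixed-threshold truncation (as described for ITSPCA): zero entries with x_i^2 <= gamma."""
    return np.where(x ** 2 > gamma, x, 0.0)
```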
55 Both TPower and PathSPCA output sparse solutions with exact cardinality k. [sent-305, score-0.23]
56 2 Results on Toy Data Set. To illustrate the sparse recovery performance of TPower, we apply the algorithm to a synthetic data set drawn from a sparse PCA model. [sent-308, score-0.266]
57 We follow the same procedure proposed by Shen and Huang (2008) to generate random data with a covariance matrix having sparse eigenvectors. [sent-309, score-0.194]
58 To this end, a covariance matrix is first synthesized through the eigenvalue decomposition Σ = VDV⊤, where the first m columns of V ∈ R^{p×p} are pre-specified sparse orthogonal unit vectors. [sent-310, score-0.262]
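A schematic version of this synthetic construction is given below; the supports, eigenvalues, dimensions, and sample size are placeholders rather than the exact settings of Shen and Huang (2008).

```python
import numpy as np

rng = np.random.default_rng(1)
p, m, n = 500, 2, 50

# Pre-specified sparse orthogonal unit vectors (disjoint supports) as leading eigenvectors.
v1 = np.zeros(p); v1[:10] = 1.0 / np.sqrt(10)
v2 = np.zeros(p); v2[10:20] = 1.0 / np.sqrt(10)

# Complete to an orthonormal basis; QR keeps the first two (already orthonormal) columns up to sign.
V, _ = np.linalg.qr(np.column_stack([v1, v2, rng.standard_normal((p, p - m))]))

d = np.concatenate([[200.0, 100.0], np.ones(p - m)])     # decaying eigenvalues
Sigma = V @ np.diag(d) @ V.T                             # covariance with sparse leading eigenvectors

X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)  # one synthetic data matrix
```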
59 We generate 500 data matrices and employ the TPower method to compute two unit-norm sparse loading vectors u1 , u2 ∈ R500 , which are hopefully close to v1 and v2 . [sent-325, score-0.148]
60 The potential reason is that SPCA is initialized with the ordinary principal components, which for many random data matrices are far away from the true sparse solution. [sent-345, score-0.187]
61 Traditional PCA always fails to recover the sparse PC loadings on this data set. [sent-346, score-0.175]
62 The success of TPower and the failure of traditional PCA can be well explained by our sparse recovery result in Theorem 4 (for TPower) in comparison to the traditional eigenvector perturbation theory in Lemma 10 (for traditional PCA), which we have already discussed in §3. [sent-347, score-0.369]
63 However, the success of other methods suggests that it might be possible to prove sparse recovery results similar to Theorem 4 for some of these alternative algorithms. [sent-348, score-0.144]
64 Following these previous studies, we also consider computing the first six sparse PCs of the data. [sent-417, score-0.151]
65 Table 3 lists the six PCs extracted by TPower with cardinality setting 6-2-1-2-1-1. [sent-422, score-0.145]
66 Table 3: The six PCs extracted by TPower on the PitProps data set with cardinality setting 6-2-1-2-1-1. [sent-461, score-0.173]
67 3 Densest k-Subgraph Finding. As another concrete application, we show that, with proper modification, TPower can be applied to the densest k-subgraph finding problem. [sent-501, score-0.305]
68 Given an undirected graph G = (V, E), |V | = n, and integer 1 ≤ k ≤ n, the densest k-subgraph (DkS) problem is to find a set of k vertices with maximum average degree in the subgraph induced by this set. [sent-502, score-0.355]
69 until convergence. Remark 7: By relaxing the constraint π ∈ {0, 1}^n to ‖π‖ = √k, we may convert the densest k-subgraph problem (8) to the standard sparse eigenvalue problem (1) (up to a scaling) and then directly apply TPower (in Algorithm 1) for the solution. [sent-549, score-0.491]
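Remark 7 suggests a simple relaxed variant: run TPower on the adjacency matrix itself and read the k vertices off the support of the returned vector. The sketch below implements that relaxation (not the binary-iterate TPower-DkS procedure, whose exact update is not reproduced in this excerpt); W is assumed to be a symmetric 0/1 adjacency matrix, and tpower is the hypothetical helper from the earlier sketch.

```python
import numpy as np

def dks_via_relaxed_tpower(W, k):
    """Approximate densest k-subgraph via the relaxation in Remark 7."""
    x0 = np.zeros(W.shape[0])
    x0[np.argsort(W.sum(axis=1))[-k:]] = 1.0                # start from the k highest-degree vertices
    x0 /= np.linalg.norm(x0)
    x = tpower(W, k, x0)
    vertices = np.flatnonzero(x)                            # support of the k-sparse solution
    avg_degree = W[np.ix_(vertices, vertices)].sum() / k    # average degree of the induced subgraph
    return vertices, avg_degree
```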
70 Figure 3: Identifying the densest k-subgraph on four web graphs; each panel plots subgraph density against cardinality k (panel (d): hollywood-2009). [sent-625, score-0.324]
71 Figures 4(d)-4(f) illustrate the densest k-subgraph with k = 30 output by the three algorithms. [sent-635, score-0.305]
72 Visual inspection shows that the subgraph recovered by TPower-DkS is the densest among the three. [sent-640, score-0.333]
73 After discovering the densest k-subgraph, we can eliminate its nodes and edges from the graph and then apply the algorithms to the reduced graph to search for the next densest subgraph. [sent-641, score-0.61]
74 This sequential procedure can be repeated to find multiple densest k-subgraphs. [sent-642, score-0.305]
75 Figures 4(g)-4(i) illustrate the six densest 30-subgraphs sequentially estimated by the three considered algorithms. [sent-643, score-0.338]
76 Conclusion. The sparse eigenvalue problem has been widely studied in machine learning, with applications such as sparse PCA. [sent-650, score-0.304]
77 TPower is a truncated power iteration method that approximately solves the nonconvex sparse eigenvalue problem. [sent-651, score-0.282]
78 Our analysis shows that when the underlying matrix has sparse eigenvectors, under proper conditions TPower can approximately recover the true sparse solution. [sent-652, score-0.291]
79 The theoretical benefit of this method is that, with appropriate initialization, the reconstruction quality depends on the restricted matrix perturbation error at a size s that is comparable to the sparsity k̄, instead of on the full matrix dimension p. [sent-653, score-0.174]
80 We have applied TPower to two concrete applications: sparse PCA and the densest k-subgraph finding problem. [sent-656, score-0.423]
81 To summarize, simply combining power iteration with hard-thresholding truncation provides an accurate and scalable computational method for the sparse eigenvalue problem. [sent-658, score-0.271]
82 Middle row: The densest 30-subgraph discovered by the three considered algorithms. [sent-671, score-0.305]
83 Bottom row: The six densest 30-subgraphs sequentially discovered by the three considered algorithms. [sent-672, score-0.338]
84 Proof of Theorem 4. Our proof employs several technical tools, including the perturbation theory of the symmetric eigenvalue problem (Lemma 9 and Lemma 10), the convergence analysis of the traditional power method (Lemma 11), and the error analysis of the hard-thresholding operation (Lemma 12). [sent-675, score-0.226]
85 We state the following standard result from the perturbation theory of the symmetric eigenvalue problem (see, e.g., [sent-676, score-0.161]
86 If ρ(E, s) ≤ ∆λ/2, then the ratio of the second largest (in absolute value) to the largest eigenvalue of the submatrix A_F is no more than γ(s). [sent-681, score-0.15]
87 Now let x(F), the largest eigenvector of A_F, be αx̄ + βx̄′, where α² + β² = 1, with eigenvalue λ′ ≥ λ̄ − ρ(E, s). [sent-685, score-0.169]
88 Lemma 11: Let y be the eigenvector with the largest (in absolute value) eigenvalue of a symmetric matrix A, and let γ < 1 be the ratio of the second largest to the largest eigenvalue in absolute value. [sent-693, score-0.319]
89 Next is our main lemma, which says that each step of the sparse power method improves the eigenvector estimate. [sent-717, score-0.239]
90 From Lemma 13 we obtain that √(1 − |xt⊤x̄|) ≤ √(1 − |x̂t⊤x̄|) ≤ µ√(1 − |xt−1⊤x̄|) + 10δ(s) ≤ √(1 − |xt−1⊤x̄|), where the first inequality follows from |xt⊤x̄| = |x̂t⊤x̄|/‖x̂t‖ ≥ |x̂t⊤x̄|. [sent-745, score-0.214]
91 High-dimensional analysis of semidefinite relaxation for sparse principal components. [sent-778, score-0.187]
92 A deterministic algorithm for the densest k-subgraph problem using linear programming. [sent-789, score-0.305]
93 A direct formulation for sparse PCA using semidefinite programming. [sent-822, score-0.234]
94 Finding densest k-subgraph via 1-mean clustering and low-dimension approximation. [sent-865, score-0.305]
95 On the distribution of the largest eigenvalue in principal components analysis. [sent-870, score-0.163]
96 Generalized power method for sparse principal component analysis. [sent-884, score-0.233]
97 A constant approximation algorithm for the densest k-subgraph problem on chordal graphs. [sent-906, score-0.333]
98 Augmented sparse principal component analysis for high dimensional data. [sent-933, score-0.187]
99 Consistency of sparse PCA in high dimension, low sample size contexts. [sent-953, score-0.234]
100 A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. [sent-976, score-0.217]
wordName wordTfidf (topN-words)
[('tpower', 0.694), ('densest', 0.305), ('xt', 0.214), ('gpower', 0.204), ('dks', 0.157), ('pathspca', 0.148), ('runcated', 0.12), ('aspremont', 0.119), ('sparse', 0.118), ('pca', 0.116), ('cardinality', 0.112), ('igenvalue', 0.103), ('uan', 0.103), ('ower', 0.1), ('ethod', 0.093), ('perturbation', 0.093), ('journ', 0.083), ('pmd', 0.083), ('roblems', 0.08), ('eigenvector', 0.075), ('ospca', 0.074), ('spca', 0.074), ('parse', 0.07), ('principal', 0.069), ('eigenvalue', 0.068), ('dxt', 0.063), ('hang', 0.058), ('spike', 0.058), ('moghaddam', 0.056), ('af', 0.053), ('cpu', 0.052), ('truncate', 0.052), ('truncated', 0.05), ('dominant', 0.048), ('feige', 0.048), ('semide', 0.047), ('itspca', 0.046), ('pcs', 0.046), ('covariance', 0.046), ('power', 0.046), ('supp', 0.043), ('zou', 0.042), ('shen', 0.04), ('ravi', 0.04), ('truncation', 0.039), ('amini', 0.037), ('ax', 0.034), ('ft', 0.034), ('greedy', 0.033), ('six', 0.033), ('loadings', 0.032), ('eigenvectors', 0.031), ('matrix', 0.03), ('recovering', 0.03), ('loading', 0.03), ('witten', 0.029), ('cardinalities', 0.029), ('cities', 0.029), ('initialization', 0.028), ('ip', 0.028), ('subgraph', 0.028), ('chordal', 0.028), ('colon', 0.028), ('diaknot', 0.028), ('pitprops', 0.028), ('ringb', 0.028), ('largest', 0.026), ('reformulation', 0.026), ('pdf', 0.026), ('recovery', 0.026), ('curves', 0.026), ('recover', 0.025), ('esults', 0.025), ('jiang', 0.025), ('lymphoma', 0.024), ('travel', 0.024), ('axt', 0.024), ('subgraphs', 0.023), ('viewing', 0.022), ('vertices', 0.022), ('zx', 0.021), ('dense', 0.021), ('lemma', 0.021), ('sparsity', 0.021), ('subject', 0.02), ('submatrix', 0.02), ('please', 0.02), ('ation', 0.02), ('traditional', 0.019), ('city', 0.019), ('web', 0.019), ('entries', 0.019), ('xx', 0.019), ('airline', 0.019), ('alizadeh', 0.019), ('ardinality', 0.019), ('asahiro', 0.019), ('aval', 0.019), ('billionnet', 0.019), ('dspca', 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 116 jmlr-2013-Truncated Power Method for Sparse Eigenvalue Problems
Author: Xiao-Tong Yuan, Tong Zhang
Abstract: This paper considers the sparse eigenvalue problem, which is to extract dominant (largest) sparse eigenvectors with at most k non-zero components. We propose a simple yet effective solution called truncated power method that can approximately solve the underlying nonconvex optimization problem. A strong sparse recovery result is proved for the truncated power method, and this theory is our key motivation for developing the new algorithm. The proposed method is tested on applications such as sparse principal component analysis and the densest k-subgraph problem. Extensive experiments on several synthetic and real-world data sets demonstrate the competitive empirical performance of our method. Keywords: sparse eigenvalue, power method, sparse principal component analysis, densest k-subgraph
2 0.096695639 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression
Author: Sébastien Gerchinovitz
Abstract: We consider the problem of online linear regression on arbitrary deterministic sequences when the ambient dimension d can be much larger than the number of time rounds T . We introduce the notion of sparsity regret bound, which is a deterministic online counterpart of recent risk bounds derived in the stochastic setting under a sparsity scenario. We prove such regret bounds for an online-learning algorithm called SeqSEW and based on exponential weighting and data-driven truncation. In a second part we apply a parameter-free version of this algorithm to the stochastic setting (regression model with random design). This yields risk bounds of the same flavor as in Dalalyan and Tsybakov (2012a) but which solve two questions left open therein. In particular our risk bounds are adaptive (up to a logarithmic factor) to the unknown variance of the noise if the latter is Gaussian. We also address the regression model with fixed design. Keywords: sparsity, online linear regression, individual sequences, adaptive regret bounds
3 0.053793032 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso
Author: Tingni Sun, Cun-Hui Zhang
Abstract: We propose a new method of learning a sparse nonnegative-definite target matrix. Our primary example of the target matrix is the inverse of a population covariance or correlation matrix. The algorithm first estimates each column of the target matrix by the scaled Lasso and then adjusts the matrix estimator to be symmetric. The penalty level of the scaled Lasso for each column is completely determined by data via convex minimization, without using cross-validation. We prove that this scaled Lasso method guarantees the fastest proven rate of convergence in the spectrum norm under conditions of weaker form than those in the existing analyses of other ℓ1 regularized algorithms, and has faster guaranteed rate of convergence when the ratio of the ℓ1 and spectrum norms of the target inverse matrix diverges to infinity. A simulation study demonstrates the computational feasibility and superb performance of the proposed method. Our analysis also provides new performance bounds for the Lasso and scaled Lasso to guarantee higher concentration of the error at a smaller threshold level than previous analyses, and to allow the use of the union bound in column-by-column applications of the scaled Lasso without an adjustment of the penalty level. In addition, the least squares estimation after the scaled Lasso selection is considered and proven to guarantee performance bounds similar to that of the scaled Lasso. Keywords: precision matrix, concentration matrix, inverse matrix, graphical model, scaled Lasso, linear regression, spectrum norm
4 0.051714253 7 jmlr-2013-A Risk Comparison of Ordinary Least Squares vs Ridge Regression
Author: Paramveer S. Dhillon, Dean P. Foster, Sham M. Kakade, Lyle H. Ungar
Abstract: We compare the risk of ridge regression to a simple variant of ordinary least squares, in which one simply projects the data onto a finite dimensional subspace (as specified by a principal component analysis) and then performs an ordinary (un-regularized) least squares regression in this subspace. This note shows that the risk of this ordinary least squares method (PCA-OLS) is within a constant factor (namely 4) of the risk of ridge regression (RR). Keywords: risk inflation, ridge regression, pca
5 0.0511446 96 jmlr-2013-Regularization-Free Principal Curve Estimation
Author: Samuel Gerber, Ross Whitaker
Abstract: Principal curves and manifolds provide a framework to formulate manifold learning within a statistical context. Principal curves define the notion of a curve passing through the middle of a distribution. While the intuition is clear, the formal definition leads to some technical and practical difficulties. In particular, principal curves are saddle points of the mean-squared projection distance, which poses severe challenges for estimation and model selection. This paper demonstrates that the difficulties in model selection associated with the saddle point property of principal curves are intrinsically tied to the minimization of the mean-squared projection distance. We introduce a new objective function, facilitated through a modification of the principal curve estimation approach, for which all critical points are principal curves and minima. Thus, the new formulation removes the fundamental issue for model selection in principal curve estimation. A gradient-descent-based estimator demonstrates the effectiveness of the new formulation for controlling model complexity on numerical experiments with synthetic and real data. Keywords: principal curve, manifold estimation, unsupervised learning, model complexity, model selection
6 0.048611209 64 jmlr-2013-Lovasz theta function, SVMs and Finding Dense Subgraphs
7 0.040768053 111 jmlr-2013-Supervised Feature Selection in Graphs with Path Coding Penalties and Network Flows
8 0.038797561 101 jmlr-2013-Sparse Activity and Sparse Connectivity in Supervised Learning
9 0.036664441 106 jmlr-2013-Stationary-Sparse Causality Network Learning
10 0.036463432 71 jmlr-2013-Message-Passing Algorithms for Quadratic Minimization
11 0.035559252 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning
12 0.035505306 5 jmlr-2013-A Near-Optimal Algorithm for Differentially-Private Principal Components
13 0.035198051 70 jmlr-2013-Maximum Volume Clustering: A New Discriminative Clustering Approach
14 0.032713018 104 jmlr-2013-Sparse Single-Index Model
15 0.032592922 99 jmlr-2013-Semi-Supervised Learning Using Greedy Max-Cut
16 0.032315955 16 jmlr-2013-Bayesian Nonparametric Hidden Semi-Markov Models
17 0.031980772 50 jmlr-2013-Greedy Feature Selection for Subspace Clustering
18 0.030971572 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
19 0.030850342 36 jmlr-2013-Distributions of Angles in Random Packing on Spheres
20 0.029888334 93 jmlr-2013-Random Walk Kernels and Learning Curves for Gaussian Process Regression on Random Graphs
topicId topicWeight
[(0, -0.159), (1, 0.051), (2, 0.011), (3, 0.034), (4, 0.063), (5, -0.039), (6, 0.012), (7, -0.018), (8, 0.091), (9, -0.005), (10, 0.061), (11, 0.025), (12, -0.132), (13, 0.033), (14, 0.098), (15, 0.042), (16, -0.018), (17, -0.19), (18, 0.085), (19, 0.01), (20, -0.029), (21, -0.037), (22, 0.005), (23, -0.124), (24, -0.091), (25, -0.003), (26, -0.165), (27, 0.08), (28, -0.037), (29, 0.172), (30, 0.088), (31, 0.083), (32, 0.014), (33, -0.243), (34, 0.164), (35, -0.111), (36, -0.08), (37, -0.098), (38, -0.162), (39, 0.205), (40, 0.034), (41, 0.162), (42, -0.15), (43, -0.097), (44, 0.102), (45, -0.004), (46, 0.065), (47, -0.119), (48, 0.011), (49, -0.028)]
simIndex simValue paperId paperTitle
same-paper 1 0.92650187 116 jmlr-2013-Truncated Power Method for Sparse Eigenvalue Problems
Author: Xiao-Tong Yuan, Tong Zhang
Abstract: This paper considers the sparse eigenvalue problem, which is to extract dominant (largest) sparse eigenvectors with at most k non-zero components. We propose a simple yet effective solution called truncated power method that can approximately solve the underlying nonconvex optimization problem. A strong sparse recovery result is proved for the truncated power method, and this theory is our key motivation for developing the new algorithm. The proposed method is tested on applications such as sparse principal component analysis and the densest k-subgraph problem. Extensive experiments on several synthetic and real-world data sets demonstrate the competitive empirical performance of our method. Keywords: sparse eigenvalue, power method, sparse principal component analysis, densest k-subgraph
2 0.58908087 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression
Author: Sébastien Gerchinovitz
Abstract: We consider the problem of online linear regression on arbitrary deterministic sequences when the ambient dimension d can be much larger than the number of time rounds T . We introduce the notion of sparsity regret bound, which is a deterministic online counterpart of recent risk bounds derived in the stochastic setting under a sparsity scenario. We prove such regret bounds for an online-learning algorithm called SeqSEW and based on exponential weighting and data-driven truncation. In a second part we apply a parameter-free version of this algorithm to the stochastic setting (regression model with random design). This yields risk bounds of the same flavor as in Dalalyan and Tsybakov (2012a) but which solve two questions left open therein. In particular our risk bounds are adaptive (up to a logarithmic factor) to the unknown variance of the noise if the latter is Gaussian. We also address the regression model with fixed design. Keywords: sparsity, online linear regression, individual sequences, adaptive regret bounds
3 0.32552877 106 jmlr-2013-Stationary-Sparse Causality Network Learning
Author: Yuejia He, Yiyuan She, Dapeng Wu
Abstract: Recently, researchers have proposed penalized maximum likelihood to identify network topology underlying a dynamical system modeled by multivariate time series. The time series of interest are assumed to be stationary, but this restriction is never taken into consideration by existing estimation methods. Moreover, practical problems of interest may have ultra-high dimensionality and obvious node collinearity. In addition, none of the available algorithms provides a probabilistic measure of the uncertainty for the obtained network topology which is informative in reliable network identification. The main purpose of this paper is to tackle these challenging issues. We propose the S2 learning framework, which stands for stationary-sparse network learning. We propose a novel algorithm referred to as the Berhu iterative sparsity pursuit with stationarity (BISPS), where the Berhu regularization can improve the Lasso in detection and estimation. The algorithm is extremely easy to implement, efficient in computation and has a theoretical guarantee to converge to a global optimum. We also incorporate a screening technique into BISPS to tackle ultra-high dimensional problems and enhance computational efficiency. Furthermore, a stationary bootstrap technique is applied to provide connection occurring frequency for reliable topology learning. Experiments show that our method can achieve stationary and sparse causality network learning and is scalable for high-dimensional problems. Keywords: stationarity, sparsity, Berhu, screening, bootstrap
4 0.32511199 7 jmlr-2013-A Risk Comparison of Ordinary Least Squares vs Ridge Regression
Author: Paramveer S. Dhillon, Dean P. Foster, Sham M. Kakade, Lyle H. Ungar
Abstract: We compare the risk of ridge regression to a simple variant of ordinary least squares, in which one simply projects the data onto a finite dimensional subspace (as specified by a principal component analysis) and then performs an ordinary (un-regularized) least squares regression in this subspace. This note shows that the risk of this ordinary least squares method (PCA-OLS) is within a constant factor (namely 4) of the risk of ridge regression (RR). Keywords: risk inflation, ridge regression, pca
5 0.32167429 96 jmlr-2013-Regularization-Free Principal Curve Estimation
Author: Samuel Gerber, Ross Whitaker
Abstract: Principal curves and manifolds provide a framework to formulate manifold learning within a statistical context. Principal curves define the notion of a curve passing through the middle of a distribution. While the intuition is clear, the formal definition leads to some technical and practical difficulties. In particular, principal curves are saddle points of the mean-squared projection distance, which poses severe challenges for estimation and model selection. This paper demonstrates that the difficulties in model selection associated with the saddle point property of principal curves are intrinsically tied to the minimization of the mean-squared projection distance. We introduce a new objective function, facilitated through a modification of the principal curve estimation approach, for which all critical points are principal curves and minima. Thus, the new formulation removes the fundamental issue for model selection in principal curve estimation. A gradient-descent-based estimator demonstrates the effectiveness of the new formulation for controlling model complexity on numerical experiments with synthetic and real data. Keywords: principal curve, manifold estimation, unsupervised learning, model complexity, model selection
6 0.27869096 38 jmlr-2013-Dynamic Affine-Invariant Shape-Appearance Handshape Features and Classification in Sign Language Videos
7 0.25558308 50 jmlr-2013-Greedy Feature Selection for Subspace Clustering
8 0.22949685 16 jmlr-2013-Bayesian Nonparametric Hidden Semi-Markov Models
9 0.22163028 64 jmlr-2013-Lovasz theta function, SVMs and Finding Dense Subgraphs
10 0.20842864 15 jmlr-2013-Bayesian Canonical Correlation Analysis
11 0.20490263 85 jmlr-2013-Pairwise Likelihood Ratios for Estimation of Non-Gaussian Structural Equation Models
12 0.1990418 74 jmlr-2013-Multivariate Convex Regression with Adaptive Partitioning
13 0.1952067 51 jmlr-2013-Greedy Sparsity-Constrained Optimization
14 0.18956348 5 jmlr-2013-A Near-Optimal Algorithm for Differentially-Private Principal Components
15 0.18577655 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso
16 0.18566893 40 jmlr-2013-Efficient Program Synthesis Using Constraint Satisfaction in Inductive Logic Programming
17 0.18462645 118 jmlr-2013-Using Symmetry and Evolutionary Search to Minimize Sorting Networks
18 0.18295595 111 jmlr-2013-Supervised Feature Selection in Graphs with Path Coding Penalties and Network Flows
19 0.17608571 93 jmlr-2013-Random Walk Kernels and Learning Curves for Gaussian Process Regression on Random Graphs
20 0.1757967 71 jmlr-2013-Message-Passing Algorithms for Quadratic Minimization
topicId topicWeight
[(0, 0.016), (5, 0.102), (6, 0.046), (10, 0.07), (20, 0.025), (23, 0.031), (68, 0.032), (70, 0.021), (75, 0.034), (85, 0.019), (87, 0.498), (89, 0.011)]
simIndex simValue paperId paperTitle
1 0.96703815 112 jmlr-2013-Tapkee: An Efficient Dimension Reduction Library
Author: Sergey Lisitsyn, Christian Widmer, Fernando J. Iglesias Garcia
Abstract: We present Tapkee, a C++ template library that provides efficient implementations of more than 20 widely used dimensionality reduction techniques ranging from Locally Linear Embedding (Roweis and Saul, 2000) and Isomap (de Silva and Tenenbaum, 2002) to the recently introduced BarnesHut-SNE (van der Maaten, 2013). Our library was designed with a focus on performance and flexibility. For performance, we combine efficient multi-core algorithms, modern data structures and state-of-the-art low-level libraries. To achieve flexibility, we designed a clean interface for applying methods to user data and provide a callback API that facilitates integration with the library. The library is freely available as open-source software and is distributed under the permissive BSD 3-clause license. We encourage the integration of Tapkee into other open-source toolboxes and libraries. For example, Tapkee has been integrated into the codebase of the Shogun toolbox (Sonnenburg et al., 2010), giving us access to a rich set of kernels, distance measures and bindings to common programming languages including Python, Octave, Matlab, R, Java, C#, Ruby, Perl and Lua. Source code, examples and documentation are available at http://tapkee.lisitsyn.me. Keywords: dimensionality reduction, machine learning, C++, open source software
2 0.96673018 7 jmlr-2013-A Risk Comparison of Ordinary Least Squares vs Ridge Regression
Author: Paramveer S. Dhillon, Dean P. Foster, Sham M. Kakade, Lyle H. Ungar
Abstract: We compare the risk of ridge regression to a simple variant of ordinary least squares, in which one simply projects the data onto a finite dimensional subspace (as specified by a principal component analysis) and then performs an ordinary (un-regularized) least squares regression in this subspace. This note shows that the risk of this ordinary least squares method (PCA-OLS) is within a constant factor (namely 4) of the risk of ridge regression (RR). Keywords: risk inflation, ridge regression, pca
same-paper 3 0.80641156 116 jmlr-2013-Truncated Power Method for Sparse Eigenvalue Problems
Author: Xiao-Tong Yuan, Tong Zhang
Abstract: This paper considers the sparse eigenvalue problem, which is to extract dominant (largest) sparse eigenvectors with at most k non-zero components. We propose a simple yet effective solution called truncated power method that can approximately solve the underlying nonconvex optimization problem. A strong sparse recovery result is proved for the truncated power method, and this theory is our key motivation for developing the new algorithm. The proposed method is tested on applications such as sparse principal component analysis and the densest k-subgraph problem. Extensive experiments on several synthetic and real-world data sets demonstrate the competitive empirical performance of our method. Keywords: sparse eigenvalue, power method, sparse principal component analysis, densest k-subgraph
4 0.47507441 37 jmlr-2013-Divvy: Fast and Intuitive Exploratory Data Analysis
Author: Joshua M. Lewis, Virginia R. de Sa, Laurens van der Maaten
Abstract: Divvy is an application for applying unsupervised machine learning techniques (clustering and dimensionality reduction) to the data analysis process. Divvy provides a novel UI that allows researchers to tighten the action-perception loop of changing algorithm parameters and seeing a visualization of the result. Machine learning researchers can use Divvy to publish easy to use reference implementations of their algorithms, which helps the machine learning field have a greater impact on research practices elsewhere. Keywords: clustering, dimensionality reduction, open source software, human computer interaction, data visualization
5 0.46712387 46 jmlr-2013-GURLS: A Least Squares Library for Supervised Learning
Author: Andrea Tacchetti, Pavan K. Mallapragada, Matteo Santoro, Lorenzo Rosasco
Abstract: We present GURLS, a least squares, modular, easy-to-extend software library for efficient supervised learning. GURLS is targeted to machine learning practitioners, as well as non-specialists. It offers a number state-of-the-art training strategies for medium and large-scale learning, and routines for efficient model selection. The library is particularly well suited for multi-output problems (multi-category/multi-label). GURLS is currently available in two independent implementations: Matlab and C++. It takes advantage of the favorable properties of regularized least squares algorithm to exploit advanced tools in linear algebra. Routines to handle computations with very large matrices by means of memory-mapped storage and distributed task execution are available. The package is distributed under the BSD license and is available for download at https://github.com/LCSL/GURLS. Keywords: regularized least squares, big data, linear algebra 1. Introduction and Design Supervised learning has become a fundamental tool for the design of intelligent systems and the analysis of high dimensional data. Key to this success has been the availability of efficient, easy-touse software packages. New data collection technologies make it easy to gather high dimensional, multi-output data sets of increasing size. This trend calls for new software solutions for the automatic training, tuning and testing of supervised learning methods. These observations motivated the design of GURLS (Grand Unified Regularized Least Squares). The package was developed to pursue the following goals: Speed: Fast training/testing procedures for learning problems with potentially large/huge number of points, features and especially outputs (e.g., classes). Memory: Flexible data management to work with large data sets by means of memory-mapped storage. Performance: ∗. Also in the Laboratory for Computational and Statistical Learning, Istituto Italiano di Tecnologia and Massachusetts Institute of Technology c 2013 Andrea Tacchetti, Pavan K. Mallapragada, Matteo Santoro and Lorenzo Rosasco. TACCHETTI , M ALLAPRAGADA , S ANTORO AND ROSASCO State of the art results in high-dimensional multi-output problems. Usability and modularity: Easy to use and to expand. GURLS is based on Regularized Least Squares (RLS) and takes advantage of all the favorable properties of these methods (Rifkin et al., 2003). Since the algorithm reduces to solving a linear system, GURLS is set up to exploit the powerful tools, and recent advances, of linear algebra (including randomized solver, first order methods, etc.). Second, it makes use of RLS properties which are particularly suited for high dimensional learning. For example: (1) RLS has natural primal and dual formulation (hence having complexity which is the smallest between number of examples and features); (2) efficient parameter selection (closed form expression of the leave one out error and efficient computations of regularization path); (3) natural and efficient extension to multiple outputs. Specific attention has been devoted to handle large high dimensional data sets. We rely on data structures that can be serialized using memory-mapped files, and on a distributed task manager to perform a number of key steps (such as matrix multiplication) without loading the whole data set in memory. Efforts were devoted to to provide a lean API and an exhaustive documentation. GURLS has been deployed and tested successfully on Linux, MacOS and Windows. 
The library is distributed under the simplified BSD license, and can be downloaded from https://github.com/LCSL/GURLS. 2. Description of the Library The library comprises four main modules. GURLS and bGURLS—both implemented in Matlab— are aimed at solving learning problems with small/medium and large-scale data sets respectively. GURLS++ and bGURLS++ are their C++ counterparts. The Matlab and C++ versions share the same design, but the C++ modules have significant improvements, which make them faster and more flexible. The specification of the desired machine learning experiment in the library is straightforward. Basically, it is a formal description of a pipeline, that is, an ordered sequence of steps. Each step identifies an actual learning task, and belongs to a predefined category. The core of the library is a method (a class in the C++ implementation) called GURLScore, which is responsible for processing the sequence of tasks in the proper order and for linking the output of the former task to the input of the subsequent one. A key role is played by the additional “options” structure, referred to as OPT. OPT is used to store all configuration parameters required to customize the behavior of individual tasks in the pipeline. Tasks receive configuration parameters from OPT in read-only mode and—upon termination—the results are appended to the structure by GURLScore in order to make them available to subsequent tasks. This allows the user to skip the execution of some tasks in a pipeline, by simply inserting the desired results directly into the options structure. Currently, we identify six different task categories: data set splitting, kernel computation, model selection, training, evaluation and testing and performance assessment and analysis. Tasks belonging to the same category may be interchanged with each other. 2.1 Learning From Large Data Sets Two modules in GURLS have been specifically designed to deal with big data scenarios. The approach we adopted is mainly based on a memory-mapped abstraction of matrix and vector data structures, and on a distributed computation of a number of standard problems in linear algebra. For learning on big data, we decided to focus specifically on those situations where one seeks a linear model on a large set of (possibly non linear) features. A more accurate specification of what “large” means in GURLS is related to the number of features d and the number of training 3202 GURLS: A L EAST S QUARES L IBRARY FOR S UPERVISED L EARNING data set optdigit landast pendigit letter isolet # of samples 3800 4400 7400 10000 6200 # of classes 10 6 10 26 26 # of variables 64 36 16 16 600 Table 1: Data sets description. examples n: we require it must be possible to store a min(d, n) × min(d, n) matrix in memory. In practice, this roughly means we can train models with up-to 25k features on machines with 8Gb of RAM, and up-to 50k features on machines with 36Gb of RAM. We do not require the data matrix itself to be stored in memory: within GURLS it is possible to manage an arbitrarily large set of training examples. We distinguish two different scenarios. Data sets that can fully reside in RAM without any memory mapping techniques—such as swapping—are considered to be small/medium. Larger data sets are considered to be “big” and learning must be performed using either bGURLS or bGURLS++ . These two modules include all the design patterns described above, and have been complemented with additional big data and distributed computation capabilities. 
Big data support is obtained using a data structure called bigarray, which allows to handle data matrices as large as the space available on the hard drive: we store the entire data set on disk and load only small chunks in memory when required. There are some differences between the Matlab and C++ implementations. bGURLS relies on a simple, ad hoc interface, called GURLS Distributed Manager (GDM), to distribute matrix-matrix multiplications, thus allowing users to perform the important task of kernel matrix computation on a distributed network of computing nodes. After this step, the subsequent tasks behave as in GURLS. bGURLS++ (currently in active development) offers more interesting features because it is based on the MPI libraries. Therefore, it allows for a full distribution within every single task of the pipeline. All the processes read the input data from a shared filesystem over the network and then start executing the same pipeline. During execution, each process’ task communicates with the corresponding ones. Every process maintains its local copy of the options. Once the same task is completed by all processes, the local copies of the options are synchronized. This architecture allows for the creation of hybrid pipelines comprising serial one-process-based tasks from GURLS++ . 3. Experiments We decided to focus the experimental analysis in the paper to the assessment of GURLS’ performance both in terms of accuracy and time. In our experiments we considered 5 popular data sets, briefly described in Table 1. Experiments were run on a Intel Xeon 5140 @ 2.33GHz processor with 8GB of RAM, and running Ubuntu 8.10 Server (64 bit). optdigit accuracy (%) GURLS (linear primal) GURLS (linear dual) LS-SVM linear GURLS (500 random features) GURLS (1000 random features) GURLS (Gaussian kernel) LS-SVM (Gaussian kernel) time (s) landsat accuracy (%) time (s) pendigit accuracy (%) time (s) 92.3 92.3 92.3 96.8 97.5 98.3 98.3 0.49 726 7190 25.6 207 13500 26100 63.68 66.3 64.6 63.5 63.5 90.4 90.51 0.22 1148 6526 28.0 187 20796 18430 82.24 82.46 82.3 96.7 95.8 98.4 98.36 0.23 5590 46240 31.6 199 100600 120170 Table 2: Comparison between GURLS and LS-SVM. 3203 TACCHETTI , M ALLAPRAGADA , S ANTORO AND ROSASCO Performance (%) 1 0.95 0.9 0.85 isolet(∇) letter(×) 0.8 pendigit(∆) 0.75 landsat(♦) optdigit(◦) 0.7 LIBSVM:rbf 0.65 GURLS++:rbf GURLS:randomfeatures-1000 0.6 GURLS:randomfeatures-500 0.55 0.5 0 10 GURLS:rbf 1 10 2 10 3 10 4 Time (s) 10 Figure 1: Prediction accuracy vs. computing time. The color represents the training method and the library used. In blue: the Matlab implementation of RLS with RBF kernel, in red: its C++ counterpart. In dark red: results of LIBSVM with RBF kernel. In yellow and green: results obtained using a linear kernel on 500 and 1000 random features respectively. We set up different pipelines and compared the performance to SVM, for which we used the python modular interface to LIBSVM (Chang and Lin, 2011). Automatic selection of the optimal regularization parameter is implemented identically in all experiments: (i) split the data; (ii) define a set of regularization parameter on a regular grid; (iii) perform hold-out validation. The variance of the Gaussian kernel has been fixed by looking at the statistics of the pairwise distances among training examples. The prediction accuracy of GURLS and GURLS++ is identical—as expected—but the implementation in C++ is significantly faster. The prediction accuracy of standard RLS-based methods is in many cases higher than SVM. 
Exploiting the primal formulation of RLS, we further ran experiments with the random features approximation (Rahimi and Recht, 2008). As show in Figure 1, the performance of this method is comparable to that of SVM at a much lower computational cost in the majority of the tested data sets. We further compared GURLS with another available least squares based toolbox: the LS-SVM toolbox (Suykens et al., 2001), which includes routines for parameter selection such as coupled simulated annealing and line/grid search. The goal of this experiment is to benchmark the performance of the parameter selection with random data splitting included in GURLS. For a fair comparison, we considered only the Matlab implementation of GURLS. Results are reported in Table 2. As expected, using the linear kernel with the primal formulation—not available in LS-SVM—is the fastest approach since it leverages the lower dimensionality of the input space. When the Gaussian kernel is used, GURLS and LS-SVM have comparable computing time and classification performance. Note, however, that in GURLS the number of parameter in the grid search is fixed to 400, while in LS-SVM it may vary and is limited to 70. The interesting results obtained with the random features implementation in GURLS, make it an interesting choice in many applications. Finally, all GURLS pipelines, in their Matlab implementation, are faster than LS-SVM and further improvements can be achieved with GURLS++ . Acknowledgments We thank Tomaso Poggio, Zak Stone, Nicolas Pinto, Hristo S. Paskov and CBCL for comments and insights. 3204 GURLS: A L EAST S QUARES L IBRARY FOR S UPERVISED L EARNING References C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at http://www. csie.ntu.edu.tw/˜cjlin/libsvm. A. Rahimi and B. Recht. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In Advances in Neural Information Processing Systems, volume 21, pages 1313–1320, 2008. R. Rifkin, G. Yeo, and T. Poggio. Regularized least-squares classification. Nato Science Series Sub Series III Computer and Systems Sciences, 190:131–154, 2003. J. Suykens, T. V. Gestel, J. D. Brabanter, B. D. Moor, and J. Vandewalle. Least Squares Support Vector Machines. World Scientific, 2001. ISBN 981-238-151-1. 3205
6 0.43791413 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression
7 0.43500453 59 jmlr-2013-Large-scale SVD and Manifold Learning
8 0.41814467 74 jmlr-2013-Multivariate Convex Regression with Adaptive Partitioning
9 0.40310484 50 jmlr-2013-Greedy Feature Selection for Subspace Clustering
10 0.40141279 31 jmlr-2013-Derivative Estimation with Local Polynomial Fitting
11 0.39545244 86 jmlr-2013-Parallel Vector Field Embedding
13 0.38388103 53 jmlr-2013-Improving CUR Matrix Decomposition and the Nystrom Approximation via Adaptive Sampling
14 0.38333049 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso
15 0.37818274 5 jmlr-2013-A Near-Optimal Algorithm for Differentially-Private Principal Components
16 0.37470764 27 jmlr-2013-Consistent Selection of Tuning Parameters via Variable Selection Stability
17 0.3664996 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization
18 0.3495743 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning
19 0.34942791 77 jmlr-2013-On the Convergence of Maximum Variance Unfolding
20 0.34799832 3 jmlr-2013-A Framework for Evaluating Approximation Methods for Gaussian Process Regression