jmlr jmlr2008 jmlr2008-75 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Alexandre d'Aspremont, Francis Bach, Laurent El Ghaoui
Abstract: Given a sample covariance matrix, we examine the problem of maximizing the variance explained by a linear combination of the input variables while constraining the number of nonzero coefficients in this combination. This is known as sparse principal component analysis and has a wide array of applications in machine learning and engineering. We formulate a new semidefinite relaxation to this problem and derive a greedy algorithm that computes a full set of good solutions for all target numbers of nonzero coefficients, with total complexity O(n^3), where n is the number of variables. We then use the same relaxation to derive sufficient conditions for global optimality of a solution, which can be tested in O(n^3) per pattern. We discuss applications in subset selection and sparse recovery and show on artificial examples and biological data that our algorithm does provide globally optimal solutions in many cases. Keywords: PCA, subset selection, sparse eigenvalues, sparse recovery, lasso
Reference: text
sentIndex sentText sentNum sentScore
1 We formulate a new semidefinite relaxation to this problem and derive a greedy algorithm that computes a full set of good solutions for all target numbers of nonzero coefficients, with total complexity O(n^3), where n is the number of variables. [sent-10, score-0.359]
2 We then use the same relaxation to derive sufficient conditions for global optimality of a solution, which can be tested in O(n^3) per pattern. [sent-11, score-0.291]
3 We discuss applications in subset selection and sparse recovery and show on artificial examples and biological data that our algorithm does provide globally optimal solutions in many cases. [sent-12, score-0.233]
4 Keywords: PCA, subset selection, sparse eigenvalues, sparse recovery, lasso 1. [sent-13, score-0.373]
5 Our aim here is to efficiently derive sparse principal components, that is, a set of sparse vectors that explain a maximum amount of variance. [sent-27, score-0.292]
6 (2006b) show that the subset selection problem for ordinary least squares, which is NP-hard (Natarajan, 1995), can be reduced to a sparse generalized eigenvalue problem, of which sparse PCA is a particular instance. [sent-32, score-0.343]
7 (2007b) derived an ℓ1-based semidefinite relaxation for the sparse PCA problem (1) with a complexity of O(n^4 √(log n)) for a given ρ. [sent-47, score-0.238]
8 (2006a) used greedy search and branch-and-bound methods to solve small instances of problem (1) exactly and get good solutions for larger ones. [sent-49, score-0.216]
9 Each step of this greedy algorithm has complexity O(n^3), leading to a total complexity of O(n^4) for a full set of solutions. [sent-50, score-0.2]
10 We first derive a greedy algorithm for computing a full set of good solutions (one for each target sparsity between 1 and n) at a total numerical cost of O(n^3), based on the convexity of the largest eigenvalue of a symmetric matrix. [sent-54, score-0.375]
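To make the O(n^3) path construction concrete, here is a minimal sketch of an approximate greedy forward selection in NumPy. It assumes the covariance is given through a square-root factor A with Σ = A^T A (columns a_i); the function and variable names are ours, not the paper's, and the sketch illustrates the idea rather than reproducing the authors' implementation.

```python
import numpy as np

def approximate_greedy_path(A, power_iters=10):
    """Return nested index sets I_1 ⊂ I_2 ⊂ ... ⊂ I_n for the sparse PCA path.

    The leading eigenvector of sum_{i in I} a_i a_i^T is tracked with a few
    warm-started power iterations, so each step costs roughly O(n * |I|).
    """
    n = A.shape[1]
    diag = np.sum(A * A, axis=0)                  # Sigma_ii = a_i^T a_i
    I = [int(np.argmax(diag))]                    # start from the largest variance
    x = A[:, I[0]] / np.linalg.norm(A[:, I[0]])   # leading eigenvector for |I| = 1
    path = [list(I)]
    for _ in range(n - 1):
        # approximate gain of adding variable j: (a_j^T x)^2
        scores = (A.T @ x) ** 2
        scores[I] = -np.inf
        I.append(int(np.argmax(scores)))
        path.append(list(I))
        AI = A[:, I]
        for _ in range(power_iters):              # warm-started power iterations
            x = AI @ (AI.T @ x)
            x /= np.linalg.norm(x)
    return path
```

Only a few cheap matrix-vector products are needed per cardinality, which is what keeps the total cost of the full path near O(n^3).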
11 , 2007a), providing new and simpler conditions for optimality and describing applications to subset selection and sparse recovery. [sent-59, score-0.301]
12 Sufficient conditions based on sparse eigenvalues (also called restricted isometry constants in Candès and Tao 2005) guarantee consistent variable selection (in the LASSO case) or sparse recovery (in the decoding problem). [sent-63, score-0.546]
13 The results we derive here produce upper bounds on sparse extremal eigenvalues and can thus be used to prove consistency in LASSO estimation, prove perfect recovery in sparse recovery problems, or prove that a particular solution of the subset selection problem is optimal. [sent-64, score-0.659]
14 Of course, our conditions are only sufficient, not necessary and the duality bounds we produce on sparse extremal eigenvalues cannot always be tight, but we observe that the duality gap is often small. [sent-65, score-0.577]
15 In Section 4 we then formulate a convex relaxation for the sparse PCA problem, which we use in Section 5 to derive tractable sufficient conditions for the global optimality of a particular sparsity pattern. [sent-69, score-0.542]
16 In Section 6 we detail applications to subset selection, sparse recovery and variable selection. [sent-70, score-0.187]
17 For β ∈ R, we write β_+ = max{β, 0} and for X ∈ S_n (the set of symmetric matrices of size n × n) with eigenvalues λ_i, Tr(X)_+ = ∑_{i=1}^n max{λ_i, 0}. [sent-74, score-0.2]
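For concreteness, the Tr(X)_+ notation above can be evaluated as follows; this is a small sketch with names of our own choosing, not anything from the paper's code.

```python
import numpy as np

def trace_plus(X):
    """Sum of the positive eigenvalues of a symmetric matrix X, i.e., Tr(X)_+."""
    lam = np.linalg.eigvalsh((X + X.T) / 2)   # symmetrize for numerical safety
    return float(np.sum(np.maximum(lam, 0.0)))
```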
18 We consider the following sparse PCA problem: φ(ρ) ≡ max_{‖z‖≤1} z^T Σ z − ρ Card(z) (2) in the variable z ∈ R^n, where ρ > 0 is a parameter controlling sparsity. [sent-79, score-0.187]
19 We then have: φ(ρ) = max_{u∈{0,1}^n} λ_max(A diag(u) A^T) − ρ 1^T u = max_{u∈{0,1}^n} max_{‖x‖=1} x^T A diag(u) A^T x − ρ 1^T u = max_{‖x‖=1} max_{u∈{0,1}^n} ∑_{i=1}^n u_i ((a_i^T x)^2 − ρ). [sent-100, score-0.315]
20 Note that if Σ_ii = a_i^T a_i < ρ, we must have (a_i^T x)^2 ≤ ‖a_i‖^2 ‖x‖^2 < ρ, hence variable i will never be part of the optimal subset and we can remove it. [sent-103, score-0.206]
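The screening rule and the inner maximization over u ∈ {0,1}^n are easy to make explicit: for a fixed unit vector x, the optimal u keeps exactly the indices with (a_i^T x)^2 ≥ ρ. The sketch below assumes the same factorization Σ = A^T A used above; the function names are ours.

```python
import numpy as np

def screen_variables(A, rho):
    """Keep only variables with Sigma_ii >= rho; the others can never be selected."""
    diag = np.sum(A * A, axis=0)              # Sigma_ii = a_i^T a_i
    return np.where(diag >= rho)[0]

def penalized_variance(A, x, rho):
    """Inner objective sum_i max((a_i^T x)^2 - rho, 0) for a unit vector x."""
    contrib = (A.T @ x) ** 2 - rho
    return float(np.sum(np.maximum(contrib, 0.0)))
```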
21 Greedy Solutions In this section, we focus on finding a good solution to problem (2) using greedy methods. [sent-105, score-0.203]
22 Finally, our first contribution in this section is to derive an approximate greedy algorithm that computes a full set of (approximate) solutions for problem (2), with total complexity O(n^3). [sent-108, score-0.216]
23 This works intuitively because the diagonal is a rough proxy for the eigenvalues: the Schur-Horn theorem states that the diagonal of a matrix is majorized by its eigenvalues (Horn and Johnson, 1985); sorting costs O(n log n). [sent-111, score-0.242]
24 Another quick solution is to compute the leading eigenvector of Σ and form a sparse vector by thresholding to zero the coefficients whose magnitude is smaller than a certain level. [sent-112, score-0.32]
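Both quick baselines are essentially one-liners; a sketch (with our own function names) is given below for reference.

```python
import numpy as np

def pattern_by_sorting(Sigma, k):
    """Keep the k variables with the largest diagonal entries of Sigma."""
    return np.argsort(-np.diag(Sigma))[:k]

def pattern_by_thresholding(Sigma, k):
    """Keep the k largest-magnitude entries of the leading eigenvector of Sigma."""
    w, V = np.linalg.eigh(Sigma)
    v = V[:, -1]
    return np.argsort(-np.abs(v))[:k]
```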
25 The matrices Σ_{I_k,I_k} and ∑_{i∈I_k} a_i a_i^T have the same eigenvalues and their eigenvectors are transformed into each other through the matrix A, that is, if z is an eigenvector of Σ_{I_k,I_k}, then A_{I_k} z / ‖A_{I_k} z‖ is an eigenvector of A_{I_k} A_{I_k}^T. [sent-133, score-0.507]
26 4 Computational Complexity The complexity of computing a greedy regularization path using the classic greedy algorithm in Section 3. [sent-154, score-0.34]
27 We thus get: (a_i^T X a_i − ρ)_+ = Tr((a_i^T x x^T a_i − ρ) xx^T)_+ = Tr(x (x^T a_i a_i^T x − ρ) x^T)_+ = Tr(X^{1/2} a_i a_i^T X^{1/2} − ρX)_+ = Tr(X^{1/2} (a_i a_i^T − ρI) X^{1/2})_+. [sent-189, score-0.309]
28 Note that because B_i has at most one nonnegative eigenvalue, we can replace Tr(X^{1/2} a_i a_i^T X^{1/2} − ρX)_+ by λ_max(X^{1/2} a_i a_i^T X^{1/2} − ρX)_+ in the above program. [sent-195, score-0.206]
29 Note that we always have ψ(ρ) ≥ φ(ρ) and when the solution to the above semidefinite program has rank one, ψ(ρ) = φ(ρ) and the semidefinite relaxation (6) is tight. [sent-200, score-0.239]
30 This simple fact allows us to derive sufficient global optimality conditions for the original sparse PCA problem. [sent-201, score-0.301]
31 Optimality Conditions In this section, we derive necessary and sufficient conditions to test the optimality of solutions to the relaxations obtained in Section 3, as well as sufficient conditions for the tightness of the semidefinite relaxation in (6). [sent-203, score-0.426]
32 1 Dual Problem and Optimality Conditions We first derive the dual problem to (6) as well as the Karush-Kuhn-Tucker (KKT) optimality conditions: Lemma 2 Let A ∈ Rn×n , ρ ≥ 0 and denote by a1 , . [sent-205, score-0.217]
33 These necessary and sufficient conditions for the optimality of X = xx^T for the convex relaxation then provide sufficient conditions for global optimality for the non-convex problem (2). [sent-230, score-0.531]
34 Furthermore, x must be a leading eigenvector of both ∑_{i∈I} a_i a_i^T and ∑_{i=1}^n Y_i. [sent-243, score-0.235]
35 Finally, we recall from Section 2 that: ∑_{i∈I} ((a_i^T x)^2 − ρ) = max_{‖x‖=1} max_{u∈{0,1}^n} ∑_{i=1}^n u_i ((a_i^T x)^2 − ρ) = max_{u∈{0,1}^n} λ_max(A diag(u) A^T) − ρ 1^T u, hence x must also be a leading eigenvector of ∑_{i∈I} a_i a_i^T. [sent-248, score-0.424]
36 In practice, we are only given a sparsity pattern I (using the results of Section 3 for example) rather than the vector x, but Lemma 3 also shows that given I, we can get the vector x as the leading eigenvector of ∑_{i∈I} a_i a_i^T. [sent-251, score-0.299]
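Recovering the candidate vector from a pattern is therefore a single small eigenvalue problem; a sketch (assuming Σ = A^T A as before, names ours) follows, using the A_{I} mapping described earlier.

```python
import numpy as np

def x_from_pattern(A, I):
    """Leading eigenvector of sum_{i in I} a_i a_i^T, via the small |I| x |I| problem."""
    AI = A[:, list(I)]
    w, V = np.linalg.eigh(AI.T @ AI)          # eigenvectors of Sigma_{I,I}
    x = AI @ V[:, -1]                         # map through A_I as in Section 3
    return x / np.linalg.norm(x)
```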
37 The original optimality conditions in (3) are highly degenerate in Yi and this result refines these optimality conditions by invoking the local structure. [sent-270, score-0.354]
38 The local optimality analysis in proposition 4 gives more specific constraints on the dual variables Yi . [sent-271, score-0.251]
39 If (a_i^T x)^2 < ρ and ‖x‖ = 1, an optimal solution of the semidefinite program: minimize Tr Y_i subject to Y_i ⪰ B_i − B_i xx^T B_i / (x^T B_i x), x^T Y_i x = 0, Y_i ⪰ 0, is given by: Y_i = max{0, ρ (a_i^T a_i − ρ) / (ρ − (a_i^T x)^2)} · (I − xx^T) a_i a_i^T (I − xx^T) / ‖(I − xx^T) a_i‖^2. [sent-281, score-0.232]
40 Proof Let us write M_i = B_i − B_i xx^T B_i / (x^T B_i x); we first compute: a_i^T M_i a_i = (a_i^T a_i − ρ) a_i^T a_i − (a_i^T a_i a_i^T x − ρ a_i^T x)^2 / ((a_i^T x)^2 − ρ) = ρ (a_i^T a_i − ρ)(a_i^T a_i − (a_i^T x)^2) / (ρ − (a_i^T x)^2). [sent-282, score-0.651]
41 By construction, we have Y_i ⪰ 0 and Y_i x = 0, and a short calculation shows that: a_i^T Y_i a_i = ρ (a_i^T a_i − ρ)(a_i^T a_i − (a_i^T x)^2) / (ρ − (a_i^T x)^2) = a_i^T M_i a_i. [sent-286, score-0.309]
42 The dual of the original semidefinite program can be written: maximize Tr(P_i M_i) subject to I − P_i + ν xx^T ⪰ 0, P_i ⪰ 0, and the KKT optimality conditions for this problem are: Y_i (I − P_i + ν xx^T) = 0, P_i (Y_i − M_i) = 0, I − P_i + ν xx^T ⪰ 0, P_i ⪰ 0, Y_i ⪰ 0, Y_i ⪰ M_i, Y_i xx^T = 0, for i ∈ I^c. [sent-289, score-0.29]
43 Because all contributions of x are zero, Tr Y_i (Y_i − M_i) is proportional to Tr a_i a_i^T (Y_i − M_i), which is equal to zero, and Y_i in (10) satisfies the KKT optimality conditions. [sent-291, score-0.24]
44 We summarize the results of this section in the theorem below, which provides sufficient optimality conditions on a sparsity pattern I. [sent-292, score-0.241]
45 Proof Following Proposition 4 and Lemma 5, the matrices Y_i are dual optimal solutions corresponding to the primal optimal solution X = xx^T in (5). [sent-298, score-0.193]
46 Because the primal solution has rank one, the semidefinite relaxation (6) is tight so the pattern I is optimal for (2) and Section 2 shows that z is a globally optimal solution to (2) with ρ = ρ∗ . [sent-299, score-0.239]
47 As we will show below, when the dual variables Y_i, i ∈ I^c, are defined as in (10), the duality gap in (2) is a convex function of ρ, hence, given a sparsity pattern I, we can efficiently search for the best possible ρ (which must belong to an interval) by performing a few binary search iterations. [sent-302, score-0.339]
48 Given a sparsity pattern I, setting x to be the leading eigenvector of ∑_{i∈I} a_i a_i^T, with the dual variables Y_i for i ∈ I^c defined as in (10) and Y_i = B_i xx^T B_i / (x^T B_i x) when i ∈ I, [sent-307, score-0.349]
49 the duality gap in (2), which is given by: gap(ρ) ≡ λ_max(∑_{i=1}^n Y_i) − ∑_{i∈I} ((a_i^T x)^2 − ρ), is a convex function of ρ when max_{i∈I^c} (a_i^T x)^2 < ρ < min_{i∈I} (a_i^T x)^2. [sent-308, score-0.195]
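To make the construction concrete, the sketch below assembles the dual variables Y_i from the two formulas above ((10) for i ∈ I^c and B_i xx^T B_i / (x^T B_i x) for i ∈ I) and evaluates gap(ρ). It reuses x_from_pattern from the earlier sketch, all names are ours, and it assumes ρ lies strictly inside the interval above so that no denominator vanishes.

```python
import numpy as np

def duality_gap(A, I, x, rho):
    """Evaluate gap(rho) = lambda_max(sum_i Y_i) - sum_{i in I}((a_i^T x)^2 - rho)."""
    n = A.shape[1]
    m = A.shape[0]
    Iset = set(int(i) for i in I)
    P = np.eye(m) - np.outer(x, x)            # projector onto the complement of x
    Ysum = np.zeros((m, m))
    for i in range(n):
        a = A[:, i]
        ax = float(a @ x)
        if i in Iset:
            B = np.outer(a, a) - rho * np.eye(m)      # B_i = a_i a_i^T - rho I
            Bx = B @ x
            Ysum += np.outer(Bx, Bx) / float(x @ Bx)  # B_i x x^T B_i / (x^T B_i x)
        else:
            Pa = P @ a
            denom = float(Pa @ Pa)                     # ||(I - xx^T) a_i||^2
            if denom > 0:
                coef = max(0.0, rho * (float(a @ a) - rho) / (rho - ax ** 2))
                Ysum += coef * np.outer(Pa, Pa) / denom
    lam = np.linalg.eigvalsh((Ysum + Ysum.T) / 2)[-1]
    primal = sum(float(A[:, i] @ x) ** 2 - rho for i in Iset)
    return lam - primal
```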
50 For i ∈ I^c, we can write: ρ (a_i^T a_i − ρ) / (ρ − (a_i^T x)^2) = −ρ + (a_i^T a_i − (a_i^T x)^2) (1 + (a_i^T x)^2 / (ρ − (a_i^T x)^2)), hence max{0, ρ (a_i^T a_i − ρ) / (ρ − (a_i^T x)^2)} is also a convex function of ρ. [sent-311, score-0.372]
51 This means that: u^T Y_i u = max{0, ρ (a_i^T a_i − ρ) / (ρ − (a_i^T x)^2)} · (u^T a_i − (x^T u)(x^T a_i))^2 / ‖(I − xx^T) a_i‖^2 is convex in ρ when i ∈ I^c. [sent-312, score-0.435]
52 We conclude that ∑_{i=1}^n u^T Y_i u is convex, hence: gap(ρ) = max_{‖u‖=1} ∑_{i=1}^n u^T Y_i u − ∑_{i∈I} ((a_i^T x)^2 − ρ) is also convex in ρ as a pointwise maximum of convex functions of ρ. [sent-313, score-0.189]
53 We first compute x as a leading eigenvector of ∑_{i∈I} a_i a_i^T. [sent-316, score-0.235]
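Putting the pieces together, here is a hedged sketch of the certification loop: compute x from I, then exploit the convexity of gap(ρ) on the admissible interval with a simple ternary search (the binary search mentioned above would work equally well). The tolerance, iteration count, and names are our own choices, and the routine reuses x_from_pattern and duality_gap from the earlier sketches.

```python
import numpy as np

def certify_pattern(A, I, tol=1e-6, iters=40):
    """Return (is_certified, best_rho) for a candidate sparsity pattern I."""
    x = x_from_pattern(A, I)
    proj = (A.T @ x) ** 2
    mask = np.zeros(A.shape[1], dtype=bool)
    mask[list(I)] = True
    if mask.all() or not mask.any():          # need a nonempty strict subset
        return False, None
    lo, hi = float(proj[~mask].max()), float(proj[mask].min())
    if lo >= hi:                              # admissible interval is empty
        return False, None
    for _ in range(iters):                    # ternary search on a convex function
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if duality_gap(A, I, x, m1) < duality_gap(A, I, x, m2):
            hi = m2
        else:
            lo = m1
    rho = (lo + hi) / 2
    return duality_gap(A, I, x, rho) <= tol, rho
```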
54 5 Solution Improvements and Randomization When these conditions are not satisfied, the relaxation (6) has an optimal solution with rank strictly larger than one, hence is not tight. [sent-335, score-0.246]
55 This is equivalent to the optimality of u for the sparse PCA problem with matrix X^T y y^T X − s_0 X^T X, which can be checked using the sparse PCA optimality conditions derived in the previous sections. [sent-352, score-0.607]
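As a hedged illustration of this reduction, the sketch below forms the matrix X^T y y^T X − s_0 X^T X and shifts it so that the sparse PCA machinery above, which expects a factorization Σ = A^T A, can be reused at a fixed cardinality. Here s_0 is the threshold from the subset selection argument and must be supplied by the user (its definition is not restated in this excerpt); the shift by αI and all names are our own choices, not the paper's.

```python
import numpy as np

def subset_selection_matrix(X, y, s0, alpha=None):
    """Build the (shifted) matrix used in the subset selection optimality check."""
    v = X.T @ y
    M = np.outer(v, v) - s0 * (X.T @ X)       # X^T y y^T X - s0 X^T X
    M = (M + M.T) / 2
    if alpha is None:
        alpha = max(0.0, -np.linalg.eigvalsh(M).min()) + 1e-6
    Mshift = M + alpha * np.eye(M.shape[0])   # positive definite version
    L = np.linalg.cholesky(Mshift)            # Mshift = A^T A with A = L^T
    return Mshift, L.T, alpha
```

For a fixed cardinality, adding αI only shifts every restricted eigenvalue by α, so the optimal pattern at that cardinality is unchanged by the shift.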
56 Note that unlike in the sparse PCA case, this convex relaxation does not immediately give a simple bound on the optimal value of the subset selection problem. [sent-353, score-0.301]
57 This is to be contrasted with the sufficient conditions derived for particular algorithms, such as the LASSO (Yuan and Lin, 2007; Zhao and Yu, 2006) or backward greedy selection (Couvreur and Bresler, 2000). [sent-356, score-0.267]
58 Our key observation here is that the restricted isometry constant δ_S in condition (13) can be computed by solving the following sparse maximum eigenvalue problem: (1 + δ_S) ≤ max x^T F^T F x subject to Card(x) ≤ S, ‖x‖ = 1, [sent-367, score-0.293]
59 in the variable x ∈ R^m, and another sparse maximum eigenvalue problem on αI − F^T F with α sufficiently large, with δ_S computed from the tightest one. [sent-370, score-0.219]
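For small S, the bound can be checked directly; the sketch below computes the restricted extremal eigenvalues of F^T F by exhaustive enumeration of supports, which is only feasible for small problems (as in the experiments later in the paper). The function name and the symmetric definition of δ_S used here are ours.

```python
import numpy as np
from itertools import combinations

def restricted_isometry_constant(F, S):
    """Exact delta_S of F by brute force over all supports of size S (small S only)."""
    G = F.T @ F
    lo, hi = np.inf, -np.inf
    for idx in combinations(range(G.shape[0]), S):
        w = np.linalg.eigvalsh(G[np.ix_(idx, idx)])
        lo, hi = min(lo, w[0]), max(hi, w[-1])
    return max(hi - 1.0, 1.0 - lo)
```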
60 , exponentially small probability of failure), whenever they are satisfied, the tractable optimality conditions and upper bounds we obtain in Section 5 for sparse PCA problems allow us to prove, deterministically, that a finite dimensional matrix satisfies the restricted isometry condition in (13). [sent-374, score-0.455]
61 Note that Candès and Tao (2005) provide a slightly weaker condition than (13) based on restricted orthogonality conditions, and extending the results on sparse PCA to these conditions would increase the maximum S for which perfect recovery holds. [sent-375, score-0.301]
62 (2007b) do provide very tight upper bounds on sparse eigenvalues of random matrices but solving these semidefinite programs for very large scale instances remains a significant challenge. [sent-378, score-0.314]
63 3 was 3 seconds versus 37 seconds for the full greedy solution in Section 3. [sent-389, score-0.236]
64 We can also see that both sorting and thresholding ROC curves are dominated by the greedy algorithms. [sent-392, score-0.243]
65 Figure 1: ROC curves for sorting, thresholding, and fully greedy solutions (Section 3. [sent-407, score-0.216]
66 We then plot the variance versus cardinality tradeoff curves for various values of the signal-to-noise ratio. [sent-410, score-0.192]
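Each point of such a tradeoff curve is simply the largest eigenvalue of a principal submatrix of Σ; a short sketch (our names, with Σ = A^T A as before) is given below.

```python
import numpy as np

def tradeoff_curve(A, path):
    """For each pattern I on the greedy path, return (|I|, variance explained)."""
    curve = []
    for I in path:
        AI = A[:, list(I)]
        lam = np.linalg.eigvalsh(AI.T @ AI)[-1]   # largest eigenvalue of Sigma_{I,I}
        curve.append((len(I), float(lam)))
    return curve
```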
67 (2007b) to find better solutions where the greedy codes have failed to obtain globally optimal solutions. [sent-415, score-0.216]
68 In Figure 2 we plot the variance versus cardinality tradeoff curve for σ = 10. [sent-424, score-0.192]
69 We plot greedy variances (solid line), dual upper bounds from Section 5. [sent-425, score-0.318]
70 We consider data sets generated from a sparse linear regression problem and study optimality for the subset selection problem, given the exact cardinality of the generating vector. [sent-429, score-0.355]
71 We plot greedy variances (solid line), dual upper bounds from Section 5. [sent-435, score-0.318]
72 the optimality of solutions obtained from various algorithms such as the Lasso, forward greedy or backward greedy algorithms. [sent-438, score-0.58]
73 We perform 50 simulations with 1000 samples and varying noise and compute the average frequency of optimal subset selection for the Lasso and the backward greedy algorithm, together with the frequency of provable optimality (i. [sent-440, score-0.364]
74 We can see that the backward greedy algorithm exhibits good performance (even in the Lasso-inconsistent case) and that our sufficient optimality condition is satisfied as long as there is not too much noise. [sent-443, score-0.398]
75 The plot on the left shows the results when the Lasso consistency condition is satisfied, while the plot on the right shows the mean squared errors when the consistency condition is not satisfied. [sent-445, score-0.196]
76 The two sets of figures do show that the LASSO is consistent only when the consistency condition is satisfied, while the backward greedy algorithm finds the correct pattern if the noise is small enough (Couvreur and Bresler, 2000) even in the LASSO inconsistent case. [sent-446, score-0.292]
77 2, we compute the upper and lower bounds on sparse eigenvalues produced using various algorithms. [sent-449, score-0.314]
78 We plot the probability of achieved (dotted line) and provable (solid line) optimality versus noise for greedy selection against Lasso (large dots), for the subset selection problem on a noisy sparse vector. [sent-465, score-0.497]
79 We plot the maximum sparse eigenvalue versus cardinality, obtained using exhaustive search (solid line), the approximate greedy (dotted line) and fully greedy (dashed line) algorithms. [sent-500, score-0.656]
80 We also plot the upper bounds obtained by minimizing the gap of a rank one solution (squares), by solving the semidefinite relaxation explicitly (stars) and by solving the DSPCA dual (diamonds). [sent-501, score-0.426]
81 where we pick F to be normally distributed and small enough so that computing sparse eigenvalues by exhaustive search is numerically feasible. [sent-504, score-0.31]
82 We plot the maximum sparse eigenvalue versus cardinality, obtained using exhaustive search (solid line), the approximate greedy (dotted line) and fully greedy (dashed line) algorithms. [sent-505, score-0.656]
83 We also plot the upper bounds obtained by minimizing the gap of a rank one solution (squares), by solving the semidefinite relaxation explicitly (stars) and by solving the DSPCA dual (diamonds). [sent-506, score-0.426]
84 Note however, that the duality gap between the semidefinite relaxations and the optimal solution is very small in both cases, while our bounds based on greedy solutions are not as good. [sent-513, score-0.471]
85 (2007b) could provide very tight upper bounds on sparse eigenvalues of random matrices. [sent-515, score-0.314]
86 We plot the variance versus cardinality tradeoff curve in Figure 6, together with the dual upper bounds from Section 5. [sent-522, score-0.307]
87 In Table 1, we also compare the 20 most important genes selected by the second sparse PCA factor on the colon cancer data set, with the top 10 genes selected by the RankGene software by Su et al. [sent-526, score-0.252]
88 We observe that 6 genes (out of an original 4027 genes) were both in the top 20 sparse PCA genes and in the top 10 Rankgene genes. [sent-528, score-0.208]
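The overlap count reported here is a simple set intersection; a minimal sketch is given below, with placeholder inputs standing in for the sparse PCA loadings and the RankGene ranking (both assumed to be indexed over the same gene ordering).

```python
import numpy as np

def top_gene_overlap(loadings, ranked_genes, k=20, top=10):
    """Genes that are both in the top-k sparse PCA loadings and the top ranked list."""
    top_pca = set(np.argsort(-np.abs(loadings))[:k].tolist())
    top_ranked = set(ranked_genes[:top])      # indices, best first
    return sorted(top_pca & top_ranked)
```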
89 Table 1: 6 genes (out of 4027) that were both in the top 20 sparse PCA genes and in the top 10 Rankgene genes. [sent-539, score-0.208]
90 Figure 6: Variance (solid lines) versus cardinality tradeoff curve for two gene expression data sets, lymphoma (top) and colon cancer (bottom), together with dual upper bounds from Section 5. [sent-544, score-0.418]
91 Conclusion We have presented a new convex relaxation of sparse principal component analysis, and derived tractable sufficient conditions for optimality. [sent-548, score-0.385]
92 These conditions go together with efficient greedy algorithms that provide candidate solutions, many of which turn out to be optimal in practice. [sent-549, score-0.21]
93 Having n matrix variables of dimension n, the problem is of course extremely large and finding numerical algorithms to directly optimize these relaxation bounds would be an important extension of this work. [sent-552, score-0.194]
94 The proposition follows by considering a circle around λ_0 that is small enough to exclude the other eigenvalues of N, and applying the Cauchy residue formula several times. [sent-567, score-0.22]
95 so that X(t) = U(t)U(t)^T with U(t) = ((1 − t)^{1/2} x, t^{1/2} Y^{1/2}), which implies that the nonzero eigenvalues of X(t)^{1/2} B X(t)^{1/2} are the same as the nonzero eigenvalues of U(t)^T B U(t). [sent-572, score-0.368]
96 The matrix M(0) has a single (and simple) nonzero eigenvalue, which is equal to λ_0 = x^T B x, with eigenvector U = (1, 0)^T. [sent-575, score-0.271]
97 Proposition 9 can be applied to the two eigenvalues of M(0): there is one eigenvalue of M(t) around x^T B x, while the n remaining ones are around zero. [sent-577, score-0.25]
98 On the optimality of the backward greedy algorithm for the subset selection problem. [sent-628, score-0.364]
99 Spectral bounds for sparse PCA: Exact and greedy algorithms. [sent-694, score-0.329]
100 Low rank approximations with sparse factors I: basic algorithms and error analysis. [sent-760, score-0.183]
wordName wordTfidf (topN-words)
[('xxt', 0.443), ('bi', 0.256), ('bx', 0.242), ('tr', 0.185), ('greedy', 0.17), ('eigenvalues', 0.155), ('xt', 0.143), ('ik', 0.14), ('optimality', 0.137), ('semide', 0.136), ('rincipal', 0.132), ('lasso', 0.125), ('sparse', 0.124), ('card', 0.123), ('haoui', 0.121), ('spremont', 0.121), ('pca', 0.121), ('relaxation', 0.114), ('yi', 0.114), ('pi', 0.114), ('uu', 0.114), ('olutions', 0.112), ('nalysis', 0.112), ('omponent', 0.112), ('sn', 0.107), ('ai', 0.103), ('eigenvector', 0.102), ('ptimal', 0.1), ('aspremont', 0.1), ('bach', 0.097), ('eigenvalue', 0.095), ('cardinality', 0.094), ('dspca', 0.088), ('psfrag', 0.084), ('rn', 0.08), ('dual', 0.08), ('diag', 0.078), ('tao', 0.077), ('parse', 0.076), ('cand', 0.075), ('ty', 0.075), ('gap', 0.072), ('replacements', 0.067), ('moghaddam', 0.065), ('sparsity', 0.064), ('convex', 0.063), ('recovery', 0.063), ('max', 0.063), ('duality', 0.06), ('rank', 0.059), ('ut', 0.057), ('backward', 0.057), ('yyt', 0.056), ('rankgene', 0.055), ('relaxations', 0.055), ('tb', 0.055), ('kkt', 0.05), ('mi', 0.048), ('zt', 0.048), ('solutions', 0.046), ('matrix', 0.045), ('principal', 0.044), ('colon', 0.044), ('jolliffe', 0.044), ('sorting', 0.042), ('genes', 0.042), ('conditions', 0.04), ('fi', 0.04), ('zk', 0.04), ('isometry', 0.04), ('fx', 0.038), ('xai', 0.037), ('sdp', 0.036), ('dotted', 0.035), ('bounds', 0.035), ('proposition', 0.034), ('condition', 0.034), ('program', 0.033), ('xx', 0.033), ('tru', 0.033), ('bxi', 0.033), ('bxxt', 0.033), ('couvreur', 0.033), ('lymphoma', 0.033), ('nonconvex', 0.033), ('tryi', 0.033), ('solution', 0.033), ('plot', 0.033), ('versus', 0.033), ('tradeoff', 0.032), ('line', 0.032), ('solid', 0.032), ('thresholding', 0.031), ('exhaustive', 0.031), ('consistency', 0.031), ('loadings', 0.031), ('extremal', 0.031), ('residue', 0.031), ('leading', 0.03), ('non', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 75 jmlr-2008-Optimal Solutions for Sparse Principal Component Analysis
Author: Alexandre d'Aspremont, Francis Bach, Laurent El Ghaoui
Abstract: Given a sample covariance matrix, we examine the problem of maximizing the variance explained by a linear combination of the input variables while constraining the number of nonzero coefficients in this combination. This is known as sparse principal component analysis and has a wide array of applications in machine learning and engineering. We formulate a new semidefinite relaxation to this problem and derive a greedy algorithm that computes a full set of good solutions for all target numbers of non zero coefficients, with total complexity O(n3 ), where n is the number of variables. We then use the same relaxation to derive sufficient conditions for global optimality of a solution, which can be tested in O(n3 ) per pattern. We discuss applications in subset selection and sparse recovery and show on artificial examples and biological data that our algorithm does provide globally optimal solutions in many cases. Keywords: PCA, subset selection, sparse eigenvalues, sparse recovery, lasso
2 0.20539866 62 jmlr-2008-Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data
Author: Onureena Banerjee, Laurent El Ghaoui, Alexandre d'Aspremont
Abstract: We consider the problem of estimating the parameters of a Gaussian or binary distribution in such a way that the resulting undirected graphical model is sparse. Our approach is to solve a maximum likelihood problem with an added 1 -norm penalty term. The problem as formulated is convex but the memory requirements and complexity of existing interior point methods are prohibitive for problems with more than tens of nodes. We present two new algorithms for solving problems with at least a thousand nodes in the Gaussian case. Our first algorithm uses block coordinate descent, and can be interpreted as recursive 1 -norm penalized regression. Our second algorithm, based on Nesterov’s first order method, yields a complexity estimate with a better dependence on problem size than existing interior point methods. Using a log determinant relaxation of the log partition function (Wainwright and Jordan, 2006), we show that these same algorithms can be used to solve an approximate sparse maximum likelihood problem for the binary case. We test our algorithms on synthetic data, as well as on gene expression and senate voting records data. Keywords: model selection, maximum likelihood estimation, convex optimization, Gaussian graphical model, binary data
3 0.13337222 78 jmlr-2008-Randomized Online PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension
Author: Manfred K. Warmuth, Dima Kuzmin
Abstract: We design an online algorithm for Principal Component Analysis. In each trial the current instance is centered and projected into a probabilistically chosen low dimensional subspace. The regret of our online algorithm, that is, the total expected quadratic compression loss of the online algorithm minus the total quadratic compression loss of the batch algorithm, is bounded by a term whose dependence on the dimension of the instances is only logarithmic. We first develop our methodology in the expert setting of online learning by giving an algorithm for learning as well as the best subset of experts of a certain size. This algorithm is then lifted to the matrix setting where the subsets of experts correspond to subspaces. The algorithm represents the uncertainty over the best subspace as a density matrix whose eigenvalues are bounded. The running time is O(n2 ) per trial, where n is the dimension of the instances. Keywords: principal component analysis, online learning, density matrix, expert setting, quantum Bayes rule
4 0.11801022 17 jmlr-2008-Automatic PCA Dimension Selection for High Dimensional Data and Small Sample Sizes
Author: David C. Hoyle
Abstract: Bayesian inference from high-dimensional data involves the integration over a large number of model parameters. Accurate evaluation of such high-dimensional integrals raises a unique set of issues. These issues are illustrated using the exemplar of model selection for principal component analysis (PCA). A Bayesian model selection criterion, based on a Laplace approximation to the model evidence for determining the number of signal principal components present in a data set, has previously been show to perform well on various test data sets. Using simulated data we show that for d-dimensional data and small sample sizes, N, the accuracy of this model selection method is strongly affected by increasing values of d. By taking proper account of the contribution to the evidence from the large number of model parameters we show that model selection accuracy is substantially improved. The accuracy of the improved model evidence is studied in the asymptotic limit d → ∞ at fixed ratio α = N/d, with α < 1. In this limit, model selection based upon the improved model evidence agrees with a frequentist hypothesis testing approach. Keywords: PCA, Bayesian model selection, random matrix theory, high dimensional inference
5 0.11086307 26 jmlr-2008-Consistency of Trace Norm Minimization
Author: Francis R. Bach
Abstract: Regularization by the sum of singular values, also referred to as the trace norm, is a popular technique for estimating low rank rectangular matrices. In this paper, we extend some of the consistency results of the Lasso to provide necessary and sufficient conditions for rank consistency of trace norm minimization with the square loss. We also provide an adaptive version that is rank consistent even when the necessary condition for the non adaptive version is not fulfilled. Keywords: convex optimization, singular value decomposition, trace norm, consistency
6 0.10623007 70 jmlr-2008-On Relevant Dimensions in Kernel Feature Spaces
7 0.10106616 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model
8 0.076862782 12 jmlr-2008-Algorithms for Sparse Linear Classifiers in the Massive Data Setting
9 0.073779486 5 jmlr-2008-A New Algorithm for Estimating the Effective Dimension-Reduction Subspace
10 0.06848871 67 jmlr-2008-Near-Optimal Sensor Placements in Gaussian Processes: Theory, Efficient Algorithms and Empirical Studies
11 0.066013388 27 jmlr-2008-Consistency of the Group Lasso and Multiple Kernel Learning
12 0.060681768 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods
13 0.05953449 30 jmlr-2008-Discriminative Learning of Max-Sum Classifiers
14 0.059210543 34 jmlr-2008-Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks
15 0.059107691 9 jmlr-2008-Active Learning by Spherical Subdivision
16 0.056097798 83 jmlr-2008-Robust Submodular Observation Selection
17 0.054628808 84 jmlr-2008-Search for Additive Nonlinear Time Series Causal Models
18 0.053931624 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
19 0.053907942 86 jmlr-2008-SimpleMKL
20 0.051908199 69 jmlr-2008-Non-Parametric Modeling of Partially Ranked Data
topicId topicWeight
[(0, 0.272), (1, -0.104), (2, -0.225), (3, 0.028), (4, 0.118), (5, 0.043), (6, -0.178), (7, 0.16), (8, -0.334), (9, -0.132), (10, -0.029), (11, -0.108), (12, -0.092), (13, 0.113), (14, 0.118), (15, -0.042), (16, 0.033), (17, -0.029), (18, -0.084), (19, 0.057), (20, 0.075), (21, 0.034), (22, 0.022), (23, 0.038), (24, -0.041), (25, 0.021), (26, 0.12), (27, -0.062), (28, 0.009), (29, 0.042), (30, 0.181), (31, -0.034), (32, 0.098), (33, 0.064), (34, -0.053), (35, -0.105), (36, 0.031), (37, -0.069), (38, -0.01), (39, -0.021), (40, 0.029), (41, 0.065), (42, -0.095), (43, 0.013), (44, 0.056), (45, -0.053), (46, 0.025), (47, -0.042), (48, 0.002), (49, -0.134)]
simIndex simValue paperId paperTitle
same-paper 1 0.97014147 75 jmlr-2008-Optimal Solutions for Sparse Principal Component Analysis
Author: Alexandre d'Aspremont, Francis Bach, Laurent El Ghaoui
Abstract: Given a sample covariance matrix, we examine the problem of maximizing the variance explained by a linear combination of the input variables while constraining the number of nonzero coefficients in this combination. This is known as sparse principal component analysis and has a wide array of applications in machine learning and engineering. We formulate a new semidefinite relaxation to this problem and derive a greedy algorithm that computes a full set of good solutions for all target numbers of non zero coefficients, with total complexity O(n3 ), where n is the number of variables. We then use the same relaxation to derive sufficient conditions for global optimality of a solution, which can be tested in O(n3 ) per pattern. We discuss applications in subset selection and sparse recovery and show on artificial examples and biological data that our algorithm does provide globally optimal solutions in many cases. Keywords: PCA, subset selection, sparse eigenvalues, sparse recovery, lasso
2 0.78857803 62 jmlr-2008-Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data
Author: Onureena Banerjee, Laurent El Ghaoui, Alexandre d'Aspremont
Abstract: We consider the problem of estimating the parameters of a Gaussian or binary distribution in such a way that the resulting undirected graphical model is sparse. Our approach is to solve a maximum likelihood problem with an added 1 -norm penalty term. The problem as formulated is convex but the memory requirements and complexity of existing interior point methods are prohibitive for problems with more than tens of nodes. We present two new algorithms for solving problems with at least a thousand nodes in the Gaussian case. Our first algorithm uses block coordinate descent, and can be interpreted as recursive 1 -norm penalized regression. Our second algorithm, based on Nesterov’s first order method, yields a complexity estimate with a better dependence on problem size than existing interior point methods. Using a log determinant relaxation of the log partition function (Wainwright and Jordan, 2006), we show that these same algorithms can be used to solve an approximate sparse maximum likelihood problem for the binary case. We test our algorithms on synthetic data, as well as on gene expression and senate voting records data. Keywords: model selection, maximum likelihood estimation, convex optimization, Gaussian graphical model, binary data
3 0.51632023 26 jmlr-2008-Consistency of Trace Norm Minimization
Author: Francis R. Bach
Abstract: Regularization by the sum of singular values, also referred to as the trace norm, is a popular technique for estimating low rank rectangular matrices. In this paper, we extend some of the consistency results of the Lasso to provide necessary and sufficient conditions for rank consistency of trace norm minimization with the square loss. We also provide an adaptive version that is rank consistent even when the necessary condition for the non adaptive version is not fulfilled. Keywords: convex optimization, singular value decomposition, trace norm, consistency
4 0.4150081 78 jmlr-2008-Randomized Online PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension
Author: Manfred K. Warmuth, Dima Kuzmin
Abstract: We design an online algorithm for Principal Component Analysis. In each trial the current instance is centered and projected into a probabilistically chosen low dimensional subspace. The regret of our online algorithm, that is, the total expected quadratic compression loss of the online algorithm minus the total quadratic compression loss of the batch algorithm, is bounded by a term whose dependence on the dimension of the instances is only logarithmic. We first develop our methodology in the expert setting of online learning by giving an algorithm for learning as well as the best subset of experts of a certain size. This algorithm is then lifted to the matrix setting where the subsets of experts correspond to subspaces. The algorithm represents the uncertainty over the best subspace as a density matrix whose eigenvalues are bounded. The running time is O(n2 ) per trial, where n is the dimension of the instances. Keywords: principal component analysis, online learning, density matrix, expert setting, quantum Bayes rule
5 0.4092474 17 jmlr-2008-Automatic PCA Dimension Selection for High Dimensional Data and Small Sample Sizes
Author: David C. Hoyle
Abstract: Bayesian inference from high-dimensional data involves the integration over a large number of model parameters. Accurate evaluation of such high-dimensional integrals raises a unique set of issues. These issues are illustrated using the exemplar of model selection for principal component analysis (PCA). A Bayesian model selection criterion, based on a Laplace approximation to the model evidence for determining the number of signal principal components present in a data set, has previously been show to perform well on various test data sets. Using simulated data we show that for d-dimensional data and small sample sizes, N, the accuracy of this model selection method is strongly affected by increasing values of d. By taking proper account of the contribution to the evidence from the large number of model parameters we show that model selection accuracy is substantially improved. The accuracy of the improved model evidence is studied in the asymptotic limit d → ∞ at fixed ratio α = N/d, with α < 1. In this limit, model selection based upon the improved model evidence agrees with a frequentist hypothesis testing approach. Keywords: PCA, Bayesian model selection, random matrix theory, high dimensional inference
6 0.35588309 5 jmlr-2008-A New Algorithm for Estimating the Effective Dimension-Reduction Subspace
7 0.3487446 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model
8 0.33269051 12 jmlr-2008-Algorithms for Sparse Linear Classifiers in the Massive Data Setting
9 0.27606624 67 jmlr-2008-Near-Optimal Sensor Placements in Gaussian Processes: Theory, Efficient Algorithms and Empirical Studies
10 0.26513964 70 jmlr-2008-On Relevant Dimensions in Kernel Feature Spaces
11 0.25509977 69 jmlr-2008-Non-Parametric Modeling of Partially Ranked Data
12 0.24814448 30 jmlr-2008-Discriminative Learning of Max-Sum Classifiers
13 0.247604 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods
14 0.24638817 83 jmlr-2008-Robust Submodular Observation Selection
15 0.24127315 9 jmlr-2008-Active Learning by Spherical Subdivision
16 0.2364268 84 jmlr-2008-Search for Additive Nonlinear Time Series Causal Models
17 0.23139806 74 jmlr-2008-Online Learning of Complex Prediction Problems Using Simultaneous Projections
18 0.22580229 27 jmlr-2008-Consistency of the Group Lasso and Multiple Kernel Learning
19 0.21454617 23 jmlr-2008-Comments on the Complete Characterization of a Family of Solutions to a GeneralizedFisherCriterion
20 0.21445522 56 jmlr-2008-Magic Moments for Structured Output Prediction
topicId topicWeight
[(0, 0.017), (5, 0.011), (31, 0.012), (40, 0.031), (54, 0.02), (58, 0.552), (61, 0.016), (66, 0.13), (76, 0.016), (88, 0.039), (92, 0.027), (94, 0.03), (99, 0.015)]
simIndex simValue paperId paperTitle
same-paper 1 0.96517658 75 jmlr-2008-Optimal Solutions for Sparse Principal Component Analysis
Author: Alexandre d'Aspremont, Francis Bach, Laurent El Ghaoui
Abstract: Given a sample covariance matrix, we examine the problem of maximizing the variance explained by a linear combination of the input variables while constraining the number of nonzero coefficients in this combination. This is known as sparse principal component analysis and has a wide array of applications in machine learning and engineering. We formulate a new semidefinite relaxation to this problem and derive a greedy algorithm that computes a full set of good solutions for all target numbers of non zero coefficients, with total complexity O(n3 ), where n is the number of variables. We then use the same relaxation to derive sufficient conditions for global optimality of a solution, which can be tested in O(n3 ) per pattern. We discuss applications in subset selection and sparse recovery and show on artificial examples and biological data that our algorithm does provide globally optimal solutions in many cases. Keywords: PCA, subset selection, sparse eigenvalues, sparse recovery, lasso
2 0.95670831 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers
Author: Bo Jiang, Xuegong Zhang, Tianxi Cai
Abstract: Support vector machine (SVM) is one of the most popular and promising classification algorithms. After a classification rule is constructed via the SVM, it is essential to evaluate its prediction accuracy. In this paper, we develop procedures for obtaining both point and interval estimators for the prediction error. Under mild regularity conditions, we derive the consistency and asymptotic normality of the prediction error estimators for SVM with finite-dimensional kernels. A perturbationresampling procedure is proposed to obtain interval estimates for the prediction error in practice. With numerical studies on simulated data and a benchmark repository, we recommend the use of interval estimates centered at the cross-validated point estimates for the prediction error. Further applications of the proposed procedure in model evaluation and feature selection are illustrated with two examples. Keywords: k-fold cross-validation, model evaluation, perturbation-resampling, prediction errors, support vector machine
3 0.57793361 26 jmlr-2008-Consistency of Trace Norm Minimization
Author: Francis R. Bach
Abstract: Regularization by the sum of singular values, also referred to as the trace norm, is a popular technique for estimating low rank rectangular matrices. In this paper, we extend some of the consistency results of the Lasso to provide necessary and sufficient conditions for rank consistency of trace norm minimization with the square loss. We also provide an adaptive version that is rank consistent even when the necessary condition for the non adaptive version is not fulfilled. Keywords: convex optimization, singular value decomposition, trace norm, consistency
4 0.52024066 27 jmlr-2008-Consistency of the Group Lasso and Multiple Kernel Learning
Author: Francis R. Bach
Abstract: We consider the least-square regression problem with regularization by a block 1 -norm, that is, a sum of Euclidean norms over spaces of dimensions larger than one. This problem, referred to as the group Lasso, extends the usual regularization by the 1 -norm where all spaces have dimension one, where it is commonly referred to as the Lasso. In this paper, we study the asymptotic group selection consistency of the group Lasso. We derive necessary and sufficient conditions for the consistency of group Lasso under practical assumptions, such as model misspecification. When the linear predictors and Euclidean norms are replaced by functions and reproducing kernel Hilbert norms, the problem is usually referred to as multiple kernel learning and is commonly used for learning from heterogeneous data sources and for non linear variable selection. Using tools from functional analysis, and in particular covariance operators, we extend the consistency results to this infinite dimensional case and also propose an adaptive scheme to obtain a consistent model estimate, even when the necessary condition required for the non adaptive scheme is not satisfied. Keywords: sparsity, regularization, consistency, convex optimization, covariance operators
5 0.51339096 17 jmlr-2008-Automatic PCA Dimension Selection for High Dimensional Data and Small Sample Sizes
Author: David C. Hoyle
Abstract: Bayesian inference from high-dimensional data involves the integration over a large number of model parameters. Accurate evaluation of such high-dimensional integrals raises a unique set of issues. These issues are illustrated using the exemplar of model selection for principal component analysis (PCA). A Bayesian model selection criterion, based on a Laplace approximation to the model evidence for determining the number of signal principal components present in a data set, has previously been show to perform well on various test data sets. Using simulated data we show that for d-dimensional data and small sample sizes, N, the accuracy of this model selection method is strongly affected by increasing values of d. By taking proper account of the contribution to the evidence from the large number of model parameters we show that model selection accuracy is substantially improved. The accuracy of the improved model evidence is studied in the asymptotic limit d → ∞ at fixed ratio α = N/d, with α < 1. In this limit, model selection based upon the improved model evidence agrees with a frequentist hypothesis testing approach. Keywords: PCA, Bayesian model selection, random matrix theory, high dimensional inference
6 0.49322438 23 jmlr-2008-Comments on the Complete Characterization of a Family of Solutions to a GeneralizedFisherCriterion
7 0.46925202 84 jmlr-2008-Search for Additive Nonlinear Time Series Causal Models
8 0.46851295 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model
9 0.46754134 86 jmlr-2008-SimpleMKL
10 0.45071074 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming (Special Topic on Model Selection)
11 0.44669008 56 jmlr-2008-Magic Moments for Structured Output Prediction
12 0.44330761 64 jmlr-2008-Model Selection in Kernel Based Regression using the Influence Function (Special Topic on Model Selection)
13 0.44198906 67 jmlr-2008-Near-Optimal Sensor Placements in Gaussian Processes: Theory, Efficient Algorithms and Empirical Studies
14 0.43873918 83 jmlr-2008-Robust Submodular Observation Selection
15 0.43390322 14 jmlr-2008-An Extension on "Statistical Comparisons of Classifiers over Multiple Data Sets" for all Pairwise Comparisons
16 0.42874527 57 jmlr-2008-Manifold Learning: The Price of Normalization
17 0.42570215 7 jmlr-2008-A Tutorial on Conformal Prediction
18 0.42290115 3 jmlr-2008-A Moment Bound for Multi-hinge Classifiers
19 0.42180952 80 jmlr-2008-Ranking Individuals by Group Comparisons
20 0.42147273 16 jmlr-2008-Approximations for Binary Gaussian Process Classification