jmlr jmlr2010 jmlr2010-43 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Michel Journée, Yurii Nesterov, Peter Richtárik, Rodolphe Sepulchre
Abstract: In this paper we develop a new approach to sparse principal component analysis (sparse PCA). We propose two single-unit and two block optimization formulations of the sparse PCA problem, aimed at extracting a single sparse dominant principal component of a data matrix, or more components at once, respectively. While the initial formulations involve nonconvex functions, and are therefore computationally intractable, we rewrite them into the form of an optimization program involving maximization of a convex function on a compact set. The dimension of the search space is decreased enormously if the data matrix has many more columns (variables) than rows. We then propose and analyze a simple gradient method suited for the task. It appears that our algorithm has best convergence properties in the case when either the objective function or the feasible set are strongly convex, which is the case with our single-unit formulations and can be enforced in the block case. Finally, we demonstrate numerically on a set of random and gene expression test problems that our approach outperforms existing algorithms both in quality of the obtained solution and in computational speed. Keywords: sparse PCA, power method, gradient ascent, strongly convex sets, block algorithms
Reference: text
sentIndex sentText sentNum sentScore
1 Department of Electrical Engineering and Computer Science, University of Liège, B-4000 Liège, Belgium. Editor: Aapo Hyvärinen. Abstract: In this paper we develop a new approach to sparse principal component analysis (sparse PCA). [sent-13, score-0.122]
2 We propose two single-unit and two block optimization formulations of the sparse PCA problem, aimed at extracting a single sparse dominant principal component of a data matrix, or more components at once, respectively. [sent-14, score-0.372]
3 It appears that our algorithm has best convergence properties in the case when either the objective function or the feasible set are strongly convex, which is the case with our single-unit formulations and can be enforced in the block case. [sent-18, score-0.159]
4 Keywords: sparse PCA, power method, gradient ascent, strongly convex sets, block algorithms. [sent-20, score-0.232]
5 Principal components are, in general, combinations of all the input variables, that is, the loading vector z∗ is not expected to have many zero coefficients. [sent-29, score-0.122]
6 The objective of sparse principal component analysis (sparse PCA) is to find a reasonable trade-off between these conflicting goals. [sent-35, score-0.122]
7 For about a decade, sparse PCA has been a topic of active research. [sent-38, score-0.108]
8 For example, Jolliffe (1995) considers using various rotation techniques to find sparse loading vectors in the subspace identified by PCA. [sent-40, score-0.177]
9 These methods usually cast the sparse PCA problem in the form of an optimization program, aiming at maximizing explained variance penalized for the number of non-zero loadings. [sent-43, score-0.139]
10 d'Aspremont et al. (2007) in their DSPCA algorithm exploit convex optimization tools to solve a convex relaxation of the sparse PCA problem. [sent-49, score-0.15]
11 Therefore, block approaches for sparse PCA are expected to be more efficient on ill-posed problems. [sent-63, score-0.16]
12 We propose two single-unit (Sections 2.1 and 2.2) and two block (Sections 2.3 and 2.4) formulations of sparse PCA, aimed at extracting m sparse principal components, with m = 1 in the former case and p ≥ m > 1 in the latter. [sent-68, score-0.202]
13 Although we assume direct access to the data matrix A, these formulations also hold when only the covariance matrix Σ is available, provided that a factorization of the form $\Sigma = A^T A$ is identified (e.g., ...). [sent-70, score-0.11]
14 While achieving a balance between explained variance and sparsity that is the same as or superior to that of existing methods, ... [sent-85, score-0.116]
15 For a self-adjoint positive definite linear operator G : E → E* we define a pair of norms on E and E* as follows: $\|x\| \stackrel{\mathrm{def}}{=} \langle Gx, x\rangle^{1/2}$, $x \in E$, and $\|s\|_* \stackrel{\mathrm{def}}{=} \langle s, G^{-1}s\rangle^{1/2}$, $s \in E^*$ (2). [sent-113, score-0.142]
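A tiny numerical sketch of this pair of norms (illustrative only; here G is any symmetric positive definite matrix standing in for the operator, and the function name is mine):

```python
import numpy as np

def primal_dual_norms(G, x, s):
    """||x|| = <Gx, x>^(1/2) on E and ||s||_* = <s, G^{-1} s>^(1/2) on E*."""
    primal = np.sqrt(x @ (G @ x))
    dual = np.sqrt(s @ np.linalg.solve(G, s))
    return primal, dual
```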
16 Some Formulations of the Sparse PCA Problem In this section we propose four formulations of the sparse PCA problem, all in the form of the general optimization framework (P). [sent-134, score-0.122]
17 The first two deal with the single-unit sparse PCA problem and the remaining two are their generalizations to the block case. [sent-135, score-0.16]
18 2.1 Single-unit Sparse PCA via ℓ1-Penalty. Let us consider the optimization problem $\phi_{\ell_1}(\gamma) \stackrel{\mathrm{def}}{=} \max_{z \in B^n} \sqrt{z^T \Sigma z} - \gamma \|z\|_1$ (3), where $B^n$ denotes the unit Euclidean ball, with sparsity-controlling parameter γ ≥ 0 and sample covariance matrix $\Sigma = A^T A$. [sent-137, score-0.14]
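The following is a minimal numpy sketch of a single-unit iteration in the spirit of this ℓ1 formulation (a thresholded power-type update; the function name, default parameters and stopping rule are illustrative assumptions, not the paper's code):

```python
import numpy as np

def gpower_l1_single_unit(A, gamma, n_iter=200, tol=1e-6):
    """Sketch of a single-unit l1-penalized sparse PCA iteration.

    A     : (p, n) data matrix whose columns a_i are the variables.
    gamma : sparsity-controlling parameter, 0 <= gamma < max_i ||a_i||_2.
    Returns a unit-norm sparse loading vector z of length n.
    """
    p, n = A.shape
    # Initialize x parallel to the column of A with the largest norm.
    i_star = np.argmax(np.linalg.norm(A, axis=0))
    x = A[:, i_star] / np.linalg.norm(A[:, i_star])
    for _ in range(n_iter):
        t = A.T @ x                                           # a_i^T x for all i
        w = np.sign(t) * np.maximum(np.abs(t) - gamma, 0.0)   # soft threshold
        x_new = A @ w
        nrm = np.linalg.norm(x_new)
        if nrm == 0.0:                                        # gamma too large: zero solution
            return np.zeros(n)
        x_new /= nrm
        if np.linalg.norm(x_new - x) < tol:
            x = x_new
            break
        x = x_new
    # Recover the sparse loading vector from the final x.
    t = A.T @ x
    z = np.sign(t) * np.maximum(np.abs(t) - gamma, 0.0)
    z_norm = np.linalg.norm(z)
    return z / z_norm if z_norm > 0 else z
```

A call such as gpower_l1_single_unit(A, gamma=0.5 * np.linalg.norm(A, axis=0).max()) would target a moderately sparse loading vector under these assumptions.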
19 Indeed, since $\max_{z\neq 0} \frac{\|Az\|_2}{\|z\|_1} = \max_{z\neq 0} \frac{\|\sum_i z_i a_i\|_2}{\|z\|_1} \le \max_{z\neq 0} \frac{\sum_i |z_i|\,\|a_i\|_2}{\sum_i |z_i|} = \max_i \|a_i\|_2 = \|a_{i^*}\|_2$, we get $\|Az\|_2 - \gamma\|z\|_1 < 0$ for all nonzero vectors z whenever γ is chosen to be strictly bigger than $\|a_{i^*}\|_2$. [sent-143, score-0.31]
20 Due to these considerations, we will consider the solution of (3) to be a sparse principal component of A. [sent-148, score-0.122]
21 Note that it is possible to say something about the sparsity of the solution even without the knowledge of $x^*$: $\gamma \ge \|a_i\|_2 \;\Rightarrow\; z_i^*(\gamma) = 0$, for $i = 1, \dots, n$. [sent-166, score-0.119]
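As a tiny illustrative check of this a priori rule (a sketch; the helper name is mine):

```python
import numpy as np

def guaranteed_zero_loadings(A, gamma):
    """Indices i with gamma >= ||a_i||_2, whose loadings are forced to zero
    regardless of the optimal x* (the a priori sparsity rule sketched above)."""
    col_norms = np.linalg.norm(A, axis=0)
    return np.where(gamma >= col_norms)[0]
```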
22 d'Aspremont et al. (2008) consider the formulation $\phi_{\ell_0}(\gamma) \stackrel{\mathrm{def}}{=} \max_{z \in B^n} z^T \Sigma z - \gamma \|z\|_0$, which directly penalizes the number of nonzero components (cardinality) of the vector z. [sent-172, score-0.123]
23 2.3 Block Sparse PCA via ℓ1-Penalty. Consider the following block generalization of (5): $\phi_{\ell_1,m}(\gamma) \stackrel{\mathrm{def}}{=} \max_{X \in S_m^p,\ Z \in [S^n]^m} \mathrm{Tr}(X^T AZN) - \sum_{j=1}^m \gamma_j \sum_{i=1}^n |z_{ij}|$ (15), where the m-dimensional vector $\gamma = [\gamma_1, \dots, \gamma_m]^T$ ... [sent-195, score-0.178]
24 Most existing algorithms for computing several sparse principal components, for example, Zou et al. [sent-207, score-0.122]
25 Sparsity. A solution $X^*$ of (16) again defines the sparsity pattern of the matrix $Z^*$: the entry $z_{ij}^*$ is active if $\mu_j |a_i^T x_j^*| > \gamma_j$, and equal to zero otherwise. [sent-217, score-0.111]
26 The columns of the solution Z ∗ of (15) are thus the m dominant right singular vectors of A, that is, the PCA loading vectors. [sent-227, score-0.161]
27 However, as it will be briefly illustrated in the forthcoming numerical experiments (Section 5), having distinct elements on the diagonal of N pushes towards sparse loading vectors that are more orthogonal. [sent-231, score-0.177]
28 2.4 Block Sparse PCA via Cardinality Penalty. The single-unit cardinality-penalized case can also be naturally extended to the block case: $\phi_{\ell_0,m}(\gamma) \stackrel{\mathrm{def}}{=} \max_{X \in S_m^p,\ Z \in [S^n]^m} \mathrm{Tr}(\mathrm{Diag}(X^T AZN)^2) - \sum_{j=1}^m \gamma_j \|z_j\|_0$ (18), where the sparsity-inducing vector $\gamma = [\gamma_1, \dots, \gamma_m]^T$ ... [sent-233, score-0.235]
29 Proof. From convexity of f we immediately get $f(x_{k+1}) \ge f(x_k) + \langle f'(x_k),\, x_{k+1} - x_k\rangle = f(x_k) + \Delta(x_k)$, and therefore $f(x_{k+1}) \ge f(x_k)$ for all k. [sent-270, score-0.139]
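For intuition, here is a minimal sketch of such a scheme when the feasible set is the unit Euclidean sphere, in which case maximizing the linearization amounts to normalizing the gradient (a generic illustration under that assumption, not the paper's code):

```python
import numpy as np

def maximize_convex_on_sphere(f, grad_f, x0, n_iter=100, tol=1e-8):
    """Gradient scheme for maximizing a convex function f over the unit sphere.

    At each step the linearization f(x_k) + <f'(x_k), y - x_k> is maximized
    over the unit ball, which gives y = f'(x_k)/||f'(x_k)||; by convexity the
    objective value is monotonically non-decreasing (cf. the proof above).
    """
    x = x0 / np.linalg.norm(x0)
    for _ in range(n_iter):
        g = grad_f(x)
        x_new = g / np.linalg.norm(g)
        if f(x_new) - f(x) < tol:
            return x_new
        x = x_new
    return x

# Example: for f(x) = x^T Sigma x (convex), the maximizer over the sphere is the
# dominant eigenvector, and the scheme reduces to the classical power method.
```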
30 It can be shown (see Appendix A) that level sets of strongly convex functions with Lipschitz continuous gradient are again strongly convex. [sent-286, score-0.109]
31 Note that in the case of the two formulations (8) and (13) of the sparse PCA problem, the feasible set Q is the unit Euclidean sphere. [sent-290, score-0.143]
32 Since the convex hull of the unit sphere is the unit ball, which is a strongly convex set, the feasible set of our sparse PCA formulations satisfies Assumption 3. [sent-291, score-0.291]
33 Proposition 3. If f is convex, then for any two subsequent iterates $x_k, x_{k+1}$ of Algorithm 1, $\Delta(x_k) \ge \frac{\sigma_Q}{2}\,\|f'(x_k)\|_*\,\|x_{k+1} - x_k\|^2$. [sent-297, score-0.24]
34 We will use this inequality with $y = y_\alpha \stackrel{\mathrm{def}}{=} x_k + \alpha(x_{k+1} - x_k) + \frac{\sigma_Q}{2}\,\alpha(1-\alpha)\,\|x_{k+1} - x_k\|^2\, u$, $\alpha \in [0,1]$, for a suitably chosen unit-norm direction u with $\langle f'(x_k), u\rangle = \|f'(x_k)\|_*$. In view of (25), $y_\alpha \in \mathrm{Conv}(Q)$, and therefore $0 \ge \langle f'(x_k),\, y_\alpha - x_{k+1}\rangle = (1-\alpha)\langle f'(x_k),\, x_k - x_{k+1}\rangle + \frac{\sigma_Q}{2}\,\alpha(1-\alpha)\,\|x_{k+1} - x_k\|^2\,\langle f'(x_k), u\rangle$. [sent-301, score-0.551]
35 Since α is an arbitrary value from [0, 1], the result follows. [sent-302, score-0.12]
36 If $\{x_k\}$ is the sequence of points generated by Algorithm 1, then $\sum_{k=0}^{\infty} \|x_{k+1} - x_k\|^2 \le \frac{2(f^* - f(x_0))}{\sigma_Q \delta_f + \sigma_f}$ (26). [sent-306, score-0.12]
37 Proof. Since f is convex, Proposition 3 gives $f(x_{k+1}) - f(x_k) \ge \Delta(x_k) + \frac{\sigma_f}{2}\|x_{k+1} - x_k\|^2 \ge \frac{1}{2}(\sigma_Q \delta_f + \sigma_f)\|x_{k+1} - x_k\|^2$. [sent-307, score-0.24]
38 The example above illustrates an easy "trick" to turn a convex objective function into a strongly convex one: one simply adds to the original objective function a strongly convex function that is constant on the boundary of the feasible set. [sent-333, score-0.214]
39 It can then be further deduced that this set is not strongly convex ($\sigma_Q = 0$) and, as a consequence, Theorem 4 is meaningful only if f is strongly convex ($\sigma_f > 0$). [sent-347, score-0.144]
40 Then $\max_{X \in S_m^p} \langle C, X\rangle = \max_{X \in S_m^p} \langle V\Sigma W^T, X\rangle = \max_{X \in S_m^p} \langle \Sigma, V^T X W\rangle = \max_{Z} \langle \Sigma, Z\rangle = \max_{Z} \sum_{i=1}^m \sigma_i(C)\, z_{ii} \le \sum_{i=1}^m \sigma_i(C)$. [sent-363, score-0.135]
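A small numerical check of this chain of (in)equalities (a sketch: it verifies that the polar factor of C attains the sum of singular values for the trace inner product; the sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
p, m = 8, 3
C = rng.standard_normal((p, m))

U, s, Vt = np.linalg.svd(C, full_matrices=False)
X_star = U @ Vt                      # polar factor of C, a point on the Stiefel manifold
value = np.trace(X_star.T @ C)       # <C, X*>
assert np.allclose(value, s.sum())   # equals the sum of singular values of C
```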
41 Note that the block sparse PCA formulations (16) and (20) conform to this setting. [sent-369, score-0.202]
42 Algorithms for Sparse PCA The solutions of the sparse PCA formulations of Section 2 provide locally optimal patterns of zeros and nonzeros for a vector z ∈ S n (in the single-unit case) or a matrix Z ∈ [S n ]m (in the block case). [sent-377, score-0.228]
43 An algorithm for sparse PCA thus combines a method that identifies a "good" pattern of sparsity with a method that fills the active entries. [sent-379, score-0.165]
44 In the sequel, we discuss the general block sparse PCA problem. [sent-380, score-0.16]
45 1 Methods for Pattern-finding The application of our general method (Algorithm 1) to the four sparse PCA formulations of Section 2, that is, (8), (13), (16) and (20), leads to Algorithms 2, 3, 4 and 5 below, that provide a locally optimal pattern of sparsity for a matrix Z ∈ [S n ]m . [sent-383, score-0.205]
46 This pattern is defined as a binary matrix $P \in \{0,1\}^{n \times m}$ such that $p_{ij} = 1$ if the loading $z_{ij}$ is active and $p_{ij} = 0$ otherwise. [sent-384, score-0.203]
47 In the case of the single-unit algorithms, such an initial iterate $x \in S^p$ is chosen parallel to the column of A with the largest norm, that is, $x = a_{i^*}/\|a_{i^*}\|_2$, where $i^* = \arg\max_i \|a_i\|_2$. [sent-389, score-0.108]
48 Let us finally mention that the input matrix A of these algorithms can be the data matrix itself as well as any matrix such that the factorization $\Sigma = A^T A$ of the covariance matrix holds. [sent-407, score-0.12]
49 Problem (32) assigns the active part of the loading vectors Z to maximize the variance explained by the resulting components. [sent-428, score-0.184]
50 for j = 1, ..., m do: $x_j \leftarrow \sum_{i=1}^n \mu_j^2\,[\mathrm{sign}((\mu_j a_i^T x_j)^2 - \gamma_j)]_+\,(a_i^T x_j)\, a_i$; $X \leftarrow \mathrm{Polar}(X)$; until a stopping criterion is satisfied. Construct the matrix $P \in \{0,1\}^{n \times m}$ such that $p_{ij} = 1$ if $(\mu_j a_i^T x_j)^2 > \gamma_j$ and $p_{ij} = 0$ otherwise. [sent-442, score-0.124]
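A numpy sketch of this block ℓ0-type sweep as reconstructed above (hedged: Polar denotes the U factor of the polar decomposition, computed here via the thin SVD; the function names and per-column loop are illustrative):

```python
import numpy as np

def polar(M):
    """U factor of the polar decomposition of M (via the thin SVD)."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

def gpower_l0_block_step(A, X, mu, gamma):
    """One sweep of the block l0 update sketched above.

    A : (p, n) data matrix with columns a_i; X : (p, m) iterate with
    orthonormal columns; mu : (m,) positive weights; gamma : (m,) parameters.
    """
    X_new = np.empty_like(X)
    m = X.shape[1]
    for j in range(m):
        t = A.T @ X[:, j]                                 # a_i^T x_j for all i
        keep = np.maximum(np.sign((mu[j] * t) ** 2 - gamma[j]), 0.0)  # [sign(.)]_+
        X_new[:, j] = A @ (mu[j] ** 2 * keep * t)
    return polar(X_new)

def sparsity_pattern(A, X, mu, gamma):
    """Binary pattern P with p_ij = 1 iff (mu_j a_i^T x_j)^2 > gamma_j."""
    T = A.T @ X                                           # (n, m) of a_i^T x_j
    return ((mu * T) ** 2 > gamma).astype(int)
```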
51 Lemma 9. For a fixed $Z \in [S^n]^m$, a solution $X^*$ of $\max_{X \in S_m^p} \mathrm{Tr}(X^T AZN)$ is provided by the U factor of the polar decomposition of the product AZN. [sent-445, score-0.108]
52 Lemma 10. The solution $Z^* \stackrel{\mathrm{def}}{=} \arg\max_{Z \in [S^n]^m,\ Z_{P'} = 0} \mathrm{Tr}(X^T AZN)$ (34) is, at any point $X \in S_m^p$, defined by the two conditions $Z_P^* = (A^T X N D)_P$ and $Z_{P'}^* = 0$, where D is a positive diagonal matrix that normalizes each column of $Z^*$ to unit norm, that is, $D = \mathrm{Diag}(N X^T A A^T X N)^{-1/2}$. [sent-447, score-0.118]
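A sketch that combines Lemma 9 and Lemma 10 as an alternating update for a fixed sparsity pattern P (this is one reading of the two conditions above; the function name, the feasible starting point and the small numerical guards are mine):

```python
import numpy as np

def fill_active_entries(A, N, P, n_iter=100):
    """Alternate the X-update (Lemma 9) and Z-update (Lemma 10) for a fixed
    binary pattern P, maximizing Tr(X^T A Z N) with Z restricted to P.

    A : (p, n) data matrix, N : (m, m) positive diagonal, P : (n, m) binary.
    """
    Z = P / np.maximum(np.linalg.norm(P, axis=0), 1e-12)    # feasible start on P
    for _ in range(n_iter):
        # Lemma 9: X is the U factor of the polar decomposition of A Z N.
        U, _, Vt = np.linalg.svd(A @ Z @ N, full_matrices=False)
        X = U @ Vt
        # Lemma 10: Z_P = (A^T X N D)_P, Z_{P'} = 0, with D normalizing columns.
        M = (A.T @ X @ N) * P                               # zero outside the pattern
        D = 1.0 / np.maximum(np.linalg.norm(M, axis=0), 1e-12)
        Z = M * D                                           # unit-norm columns on P
    return X, Z
```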
53 In fact, since the cardinality penalty only depends on the sparsity pattern P and not on the actual values assigned to ZP , a solution (X ∗ , Z ∗ ) of Algorithms 3 or 5 is also a local maximizer of (32) for the resulting pattern P. [sent-459, score-0.178]
54 3 Sparse PCA Algorithms To sum up, in this paper we propose four sparse PCA algorithms, each combining a method to identify a “good” sparsity pattern with a method to fill the active entries of the m loading vectors. [sent-463, score-0.262]
55 4 Deflation Scheme For the sake of completeness, we recall a classical deflation process for computing m sparse principal components with a single-unit algorithm (d’Aspremont et al. [sent-468, score-0.147]
56 Let z ∈ Rn be a unit-norm sparse loading vector of the data A. [sent-470, score-0.177]
57 Subsequent directions can be sequentially obtained by computing a dominant sparse component of the residual matrix $A - xz^T$, where $x = Az$ is the vector that solves $\min_x \|A - xz^T\|_F$. [sent-471, score-0.129]
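A minimal sketch of this deflation loop (illustrative; sparse_pc stands for any single-unit routine returning a unit-norm sparse loading vector, such as the single-unit sketch given earlier):

```python
import numpy as np

def deflation_scheme(A, m, sparse_pc):
    """Compute m sparse loading vectors with a single-unit routine and the
    residual-matrix deflation A <- A - x z^T, where x = A z minimizes
    ||A - x z^T||_F for the current unit-norm loading z."""
    A_res = A.copy()
    loadings = []
    for _ in range(m):
        z = sparse_pc(A_res)          # unit-norm sparse loading vector
        x = A_res @ z                 # least-squares factor for ||A_res - x z^T||_F
        A_res = A_res - np.outer(x, z)
        loadings.append(z)
    return np.column_stack(loadings)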
58 This method solves a convex relaxation of the sparse PCA problem and has a large per-iteration computational complexity of $O(n^3)$ compared to the other methods. [sent-500, score-0.115]
59 Given a data matrix A ∈ R p×n , the considered sparse PCA algorithms provide m unit-norm sparse loading vectors stored in the matrix Z ∈ [S n ]m . [sent-516, score-0.309]
60 When z corresponds to the first principal loading vector, the variance is Var(z) = σmax (A)2 . [sent-520, score-0.166]
61 Hence, summing up the variance explained individually by each of the components overestimates the variance explained simultaneously by all the components. [sent-522, score-0.143]
62 The adjusted variance of the m components Y = AZ is defined as $\mathrm{AdjVar}\, Z = \mathrm{Tr}(R^2)$, where Y = QR is the QR decomposition of the component sample matrix Y ($Q \in S_m^p$ and R is an m × m upper triangular matrix). [sent-525, score-0.12]
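A short sketch of this adjusted-variance computation (hedged; note that for an upper-triangular R, Tr(R²) reduces to the sum of squared diagonal entries):

```python
import numpy as np

def adjusted_variance(A, Z):
    """Adjusted variance of the components Y = A Z via the QR decomposition Y = QR."""
    Y = A @ Z
    _, R = np.linalg.qr(Y)            # R is m x m upper triangular
    return np.trace(R @ R)            # = sum_j R_jj^2 for upper-triangular R
```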
63 1 Random Data Drawn from a Sparse PCA Model In this section, we follow the procedure proposed by Shen and Huang (2008) to generate random data with a covariance matrix having sparse eigenvectors. [sent-527, score-0.122]
64 To this end, a covariance matrix is first synthesized through the eigenvalue decomposition $\Sigma = V D V^T$, where the first m columns of $V \in \mathbb{R}^{n \times n}$ are pre-specified sparse orthonormal vectors. [sent-528, score-0.203]
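A sketch of such a data-generating procedure (illustrative: the completion of V to a full orthonormal basis and the Gaussian sampling step are implementation assumptions, and eigvals must be a length-n vector of positive eigenvalues, typically decreasing):

```python
import numpy as np

def synthesize_sparse_pca_data(p, n, V_sparse, eigvals, seed=0):
    """Draw A in R^{p x n} with rows from N(0, Sigma), Sigma = V D V^T, where the
    first m columns of V are the pre-specified sparse orthonormal vectors."""
    rng = np.random.default_rng(seed)
    m = V_sparse.shape[1]
    # Complete V_sparse (assumed to have orthonormal columns) to a basis of R^n.
    V, _ = np.linalg.qr(np.hstack([V_sparse, rng.standard_normal((n, n - m))]))
    Sigma = V @ np.diag(eigvals) @ V.T
    # Sample the rows of A i.i.d. from N(0, Sigma) via a Cholesky factor.
    L = np.linalg.cholesky(Sigma + 1e-10 * np.eye(n))
    A = rng.standard_normal((p, n)) @ L.T
    return A, V
```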
65 We generate 500 data matrices A ∈ R p×n and employ the four GPower algorithms as well as Greedy to compute two unit-norm sparse loading vectors z1 , z2 ∈ R500 , which are hoped to be close to v1 and v2 . [sent-540, score-0.177]
66 Note that Greedy requires the targeted cardinalities to be specified as an input, that is, ten nonzero entries for both loading vectors. [sent-552, score-0.136]
67 The chances of recovery of the sparse model underlying the data are rather good, and some versions of the algorithms successfully recover the sparse model even when the parameters γ are chosen at random. [sent-557, score-0.16]
68 Note that while the latter method requires the exact knowledge of the cardinality of each component, the GPower algorithms find the sparse model that fits the data best without this information. [sent-559, score-0.164]
69 Looking at the values reported in Table 5, we observe that the block GPower algorithms are more likely to obtain loading vectors that are “more orthogonal” when using parameters µ j which are distinct. [sent-577, score-0.177]
70 Table 5: Average of the quantities $|z_1^T z_2|$, $|v_1^T z_1|$, $|v_2^T z_2|$ and proportion of successful identifications of the two dominant sparse eigenvectors of Σ by extracting two sparse principal components from 500 data matrices. [sent-630, score-0.268]
71 The Greedy algorithm requires prior knowledge of the cardinality of each component, while the GPower algorithms are very likely to identify the underlying sparse model without this information. [sent-631, score-0.119]
72 Trade-off Curves. Let us first compare the single-unit algorithms, which provide a unit-norm sparse loading vector $z \in \mathbb{R}^n$. [sent-639, score-0.177]
73 We first plot the variance explained by the extracted component against the cardinality of the resulting loading vector z. [sent-640, score-0.24]
74 For each algorithm, the sparsity-inducing parameter is incrementally increased to obtain loading vectors z with a cardinality that decreases from n to 1. [sent-641, score-0.181]
75 If one, however, post-processes the active part of z according to (33), as we do in GPowerℓ1 , all sparse PCA methods reach the same performance. [sent-645, score-0.108]
76 The vertical axis is the ratio Var(zsPCA )/ Var(zPCA ), where the loading vector zsPCA is computed by sparse PCA and zPCA is the first principal loading vector. [sent-656, score-0.316]
77 Controlling Sparsity with γ. Among the considered methods, the greedy approach is the only one to directly control the cardinality of the solution, that is, the desired cardinality is an input of the algorithm. [sent-661, score-0.224]
78 In Figure 2, we plot the average relationship between the parameter γ and the resulting cardinality of the loading vector z for the two algorithms GPowerℓ1 and GPowerℓ0 . [sent-664, score-0.181]
79 Figure 2: Dependence of cardinality on the value of the sparsity-inducing parameter γ (horizontal axis: normalized sparsity-inducing parameter). [sent-688, score-0.141]
80 One immediately notices that the greedy method slows down significantly as cardinality increases, whereas the speed of the other considered algorithms does not depend on cardinality. [sent-698, score-0.14]
81 Figure 3: The computational complexity of Greedy grows significantly with the cardinality of the resulting loading vector (vertical axis: computational time [sec]; horizontal axis: cardinality; curves: Greedy versus GPowerℓ0, GPowerℓ1, SPCA, rSVDℓ0 and rSVDℓ1). [sent-700, score-0.181]
82 Different Convergence Mechanisms. Figure 4 illustrates how the trade-off between explained variance and sparsity evolves during computation for the two methods GPowerℓ1 and rSVDℓ1. [sent-717, score-0.116]
83 In the second run, this parameter is set to a positive value and the method works to rapidly decrease cardinality at the expense of only a modest decrease in explained variance. [sent-775, score-0.116]
84 Figure 4: Evolution of explained variance (left) and cardinality (right) over time for the methods GPowerℓ1 and rSVDℓ1 run on a test problem of size p = 250 and n = 2500. [sent-798, score-0.143]
85 Again, these last two methods can be improved by postprocessing the resulting loading vectors with Algorithm 6, as is done for GPowerℓ1,m. [sent-805, score-0.126]
86 Following these previous studies, we use the GPower algorithms to compute six sparse principal components of the data. [sent-860, score-0.147]
87 In Table 9, we provide the total cardinality and the proportion of adjusted variance explained by six components computed with SPCA, rSVDℓ1 , Greedy as well as our GPower algorithms. [sent-895, score-0.186]
88 Table 9 illustrates that better patterns can be identified with the GPower algorithms, that is, patterns that explain more variance with the same cardinality (and sometimes even with a smaller one). [sent-903, score-0.111]
89 For GPowerℓ1, one defines the upper bounds $\bar{\gamma}_j = \max_i \|a_i^{(j)}\|_2$, where $A^{(j)}$ is the residual data matrix after j − 1 deflation steps. [sent-964, score-0.178]
90 For GPowerℓ1,m, the upper bounds are $\bar{\gamma}_j = \mu_j \max_i \|a_i\|_2$. [sent-965, score-0.152]
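A one-line sketch of these normalizing upper bounds (illustrative helper name; choosing γ_j at or above this bound corresponds to the regime where the j-th loading vector is forced to zero, as in the single-unit argument earlier):

```python
import numpy as np

def gamma_upper_bounds(A, mu):
    """Upper bounds gamma_bar_j = mu_j * max_i ||a_i||_2 used to normalize
    the sparsity-inducing parameters of the block l1 method."""
    return mu * np.linalg.norm(A, axis=0).max()
```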
91 Trade-off Curves. Figure 6 plots the proportion of adjusted variance versus the cardinality for the "Vijver" data set. [sent-1015, score-0.129]
92 Figure 6: Trade-off curves between explained variance and cardinality for the "Vijver" data set (horizontal axis: cardinality, on the order of 10^4). [sent-1028, score-0.143]
93 The vertical axis is the ratio $\mathrm{AdjVar}(Z_{\mathrm{sPCA}})/\mathrm{AdjVar}(Z_{\mathrm{PCA}})$, where the loading vectors $Z_{\mathrm{sPCA}}$ are computed by sparse PCA and $Z_{\mathrm{PCA}}$ are the first m principal loading vectors. [sent-1029, score-0.316]
94 The PEI is the fraction of these 536 sets presenting a statistically significant overlap with the genes inferred from the sparse principal components. [sent-1040, score-0.144]
95 Conclusion. We have proposed two single-unit and two block formulations of the sparse PCA problem and constructed reformulations with several favorable properties. [sent-1114, score-0.227]
96 This structure allows for the design and iteration-complexity analysis of a simple gradient scheme which, applied to our sparse PCA setting, results in four new algorithms for computing sparse principal components of a matrix $A \in \mathbb{R}^{p \times n}$. [sent-1116, score-0.253]
97 Second, our algorithms appear to be faster if either the objective function or the feasible set is strongly convex, which holds in the single-unit case and can be enforced in the block case. [sent-1117, score-0.117]
98 Finally, in the case of the biological data, the components obtained by our block algorithms deliver the richest biological interpretation as compared to the components extracted by the other methods. [sent-1121, score-0.166]
99 Then for any ω > 0, the set $Q_\omega \stackrel{\mathrm{def}}{=} \{x \mid f(x) \le \omega\}$ is strongly convex with convexity parameter $\sigma_{Q_\omega} = \sigma_f / \dots$ [sent-1135, score-0.162]
100 Spectral bounds for sparse PCA: Exact and greedy algorithms. [sent-1244, score-0.136]
wordName wordTfidf (topN-words)
[('gpower', 0.771), ('rsvd', 0.276), ('pca', 0.222), ('sm', 0.129), ('xk', 0.12), ('spca', 0.116), ('arik', 0.113), ('epulchre', 0.113), ('icht', 0.113), ('ourn', 0.113), ('loading', 0.097), ('esterov', 0.096), ('maxi', 0.09), ('ower', 0.087), ('cardinality', 0.084), ('zp', 0.082), ('block', 0.08), ('sparse', 0.08), ('ethod', 0.075), ('eneralized', 0.075), ('def', 0.071), ('polar', 0.064), ('conv', 0.063), ('aspremont', 0.062), ('ai', 0.062), ('sparsity', 0.057), ('greedy', 0.056), ('parse', 0.052), ('stiefel', 0.05), ('ation', 0.048), ('vijver', 0.048), ('azn', 0.044), ('az', 0.043), ('principal', 0.042), ('formulations', 0.042), ('cardinalities', 0.039), ('naderi', 0.038), ('strongly', 0.037), ('tr', 0.037), ('convex', 0.035), ('explained', 0.032), ('aat', 0.031), ('moghaddam', 0.031), ('parsity', 0.031), ('pei', 0.031), ('shen', 0.031), ('lf', 0.031), ('zou', 0.029), ('postprocessing', 0.029), ('active', 0.028), ('huang', 0.027), ('variance', 0.027), ('max', 0.027), ('diag', 0.026), ('matrix', 0.026), ('eformulation', 0.025), ('journ', 0.025), ('reformulations', 0.025), ('teschendorff', 0.025), ('xzt', 0.025), ('zpca', 0.025), ('zspca', 0.025), ('components', 0.025), ('breast', 0.023), ('im', 0.023), ('dominant', 0.023), ('orthonormal', 0.023), ('zt', 0.022), ('gene', 0.022), ('genes', 0.022), ('columns', 0.022), ('unit', 0.021), ('yurii', 0.021), ('vt', 0.021), ('reformulation', 0.02), ('sphere', 0.02), ('sign', 0.02), ('table', 0.02), ('jolliffe', 0.02), ('eigenvalue', 0.019), ('convexity', 0.019), ('singular', 0.019), ('penalty', 0.019), ('iterate', 0.019), ('adjvar', 0.019), ('belgian', 0.019), ('pitprops', 0.019), ('richt', 0.019), ('rik', 0.019), ('sepulchre', 0.019), ('nesterov', 0.019), ('rs', 0.019), ('proportion', 0.018), ('pi', 0.018), ('biological', 0.018), ('maximizer', 0.018), ('ct', 0.017), ('decomposition', 0.017), ('covariance', 0.016), ('zi', 0.016)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 43 jmlr-2010-Generalized Power Method for Sparse Principal Component Analysis
Author: Michel Journée, Yurii Nesterov, Peter Richtárik, Rodolphe Sepulchre
Abstract: In this paper we develop a new approach to sparse principal component analysis (sparse PCA). We propose two single-unit and two block optimization formulations of the sparse PCA problem, aimed at extracting a single sparse dominant principal component of a data matrix, or more components at once, respectively. While the initial formulations involve nonconvex functions, and are therefore computationally intractable, we rewrite them into the form of an optimization program involving maximization of a convex function on a compact set. The dimension of the search space is decreased enormously if the data matrix has many more columns (variables) than rows. We then propose and analyze a simple gradient method suited for the task. It appears that our algorithm has best convergence properties in the case when either the objective function or the feasible set are strongly convex, which is the case with our single-unit formulations and can be enforced in the block case. Finally, we demonstrate numerically on a set of random and gene expression test problems that our approach outperforms existing algorithms both in quality of the obtained solution and in computational speed. Keywords: sparse PCA, power method, gradient ascent, strongly convex sets, block algorithms
2 0.085470304 92 jmlr-2010-Practical Approaches to Principal Component Analysis in the Presence of Missing Values
Author: Alexander Ilin, Tapani Raiko
Abstract: Principal component analysis (PCA) is a classical data analysis technique that finds linear transformations of data that retain the maximal amount of variance. We study a case where some of the data values are missing, and show that this problem has many features which are usually associated with nonlinear models, such as overfitting and bad locally optimal solutions. A probabilistic formulation of PCA provides a good foundation for handling missing values, and we provide formulas for doing that. In case of high dimensional and very sparse data, overfitting becomes a severe problem and traditional algorithms for PCA are very slow. We introduce a novel fast algorithm and extend it to variational Bayesian learning. Different versions of PCA are compared in artificial experiments, demonstrating the effects of regularization and modeling of posterior variance. The scalability of the proposed algorithm is demonstrated by applying it to the Netflix problem. Keywords: principal component analysis, missing values, overfitting, regularization, variational Bayes
3 0.083720185 17 jmlr-2010-Bayesian Learning in Sparse Graphical Factor Models via Variational Mean-Field Annealing
Author: Ryo Yoshida, Mike West
Abstract: We describe a class of sparse latent factor models, called graphical factor models (GFMs), and relevant sparse learning algorithms for posterior mode estimation. Linear, Gaussian GFMs have sparse, orthogonal factor loadings matrices, that, in addition to sparsity of the implied covariance matrices, also induce conditional independence structures via zeros in the implied precision matrices. We describe the models and their use for robust estimation of sparse latent factor structure and data/signal reconstruction. We develop computational algorithms for model exploration and posterior mode search, addressing the hard combinatorial optimization involved in the search over a huge space of potential sparse configurations. A mean-field variational technique coupled with annealing is developed to successively generate “artificial” posterior distributions that, at the limiting temperature in the annealing schedule, define required posterior modes in the GFM parameter space. Several detailed empirical studies and comparisons to related approaches are discussed, including analyses of handwritten digit image and cancer gene expression data. Keywords: annealing, graphical factor models, variational mean-field method, MAP estimation, sparse factor analysis, gene expression profiling
4 0.070443749 87 jmlr-2010-Online Learning for Matrix Factorization and Sparse Coding
Author: Julien Mairal, Francis Bach, Jean Ponce, Guillermo Sapiro
Abstract: Sparse coding—that is, modelling data vectors as sparse linear combinations of basis elements—is widely used in machine learning, neuroscience, signal processing, and statistics. This paper focuses on the large-scale matrix factorization problem that consists of learning the basis set in order to adapt it to specific data. Variations of this problem include dictionary learning in signal processing, non-negative matrix factorization and sparse principal component analysis. In this paper, we propose to address these tasks with a new online optimization algorithm, based on stochastic approximations, which scales up gracefully to large data sets with millions of training samples, and extends naturally to various matrix factorization formulations, making it suitable for a wide range of learning problems. A proof of convergence is presented, along with experiments with natural images and genomic data demonstrating that it leads to state-of-the-art performance in terms of speed and optimization for both small and large data sets. Keywords: basis pursuit, dictionary learning, matrix factorization, online learning, sparse coding, sparse principal component analysis, stochastic approximations, stochastic optimization, nonnegative matrix factorization
5 0.068397343 85 jmlr-2010-On the Foundations of Noise-free Selective Classification
Author: Ran El-Yaniv, Yair Wiener
Abstract: We consider selective classification, a term we adopt here to refer to ‘classification with a reject option.’ The essence in selective classification is to trade-off classifier coverage for higher accuracy. We term this trade-off the risk-coverage (RC) trade-off. Our main objective is to characterize this trade-off and to construct algorithms that can optimally or near optimally achieve the best possible trade-offs in a controlled manner. For noise-free models we present in this paper a thorough analysis of selective classification including characterizations of RC trade-offs in various interesting settings. Keywords: classification with a reject option, selective classification, perfect learning, high performance classification, risk-coverage trade-off
6 0.048421625 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming
7 0.046763245 12 jmlr-2010-Analysis of Multi-stage Convex Relaxation for Sparse Regularization
8 0.043184489 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes
9 0.040650263 3 jmlr-2010-A Fast Hybrid Algorithm for Large-Scalel1-Regularized Logistic Regression
10 0.040591545 15 jmlr-2010-Approximate Tree Kernels
11 0.036564238 45 jmlr-2010-High-dimensional Variable Selection with Sparse Random Projections: Measurement Sparsity and Statistical Efficiency
12 0.035978604 72 jmlr-2010-Matrix Completion from Noisy Entries
13 0.035533719 84 jmlr-2010-On Spectral Learning
14 0.03367668 99 jmlr-2010-Restricted Eigenvalue Properties for Correlated Gaussian Designs
15 0.031519141 6 jmlr-2010-A Rotation Test to Verify Latent Structure
16 0.031499885 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions
17 0.028096674 105 jmlr-2010-Spectral Regularization Algorithms for Learning Large Incomplete Matrices
18 0.027660772 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide
19 0.025536576 19 jmlr-2010-Characterization, Stability and Convergence of Hierarchical Clustering Methods
20 0.025446141 42 jmlr-2010-Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data
topicId topicWeight
[(0, -0.133), (1, -0.054), (2, 0.052), (3, 0.005), (4, -0.028), (5, -0.017), (6, 0.069), (7, -0.086), (8, -0.02), (9, -0.099), (10, -0.025), (11, 0.138), (12, 0.025), (13, 0.076), (14, 0.011), (15, 0.048), (16, 0.005), (17, 0.034), (18, -0.091), (19, -0.08), (20, -0.151), (21, -0.07), (22, -0.162), (23, -0.013), (24, -0.004), (25, -0.034), (26, -0.045), (27, -0.117), (28, -0.043), (29, 0.086), (30, 0.021), (31, -0.451), (32, 0.082), (33, 0.174), (34, 0.16), (35, 0.018), (36, 0.029), (37, -0.046), (38, -0.269), (39, 0.101), (40, -0.038), (41, 0.087), (42, -0.06), (43, -0.02), (44, 0.127), (45, 0.071), (46, 0.034), (47, -0.007), (48, 0.004), (49, 0.206)]
simIndex simValue paperId paperTitle
same-paper 1 0.93238372 43 jmlr-2010-Generalized Power Method for Sparse Principal Component Analysis
Author: Michel Journée, Yurii Nesterov, Peter Richtárik, Rodolphe Sepulchre
Abstract: In this paper we develop a new approach to sparse principal component analysis (sparse PCA). We propose two single-unit and two block optimization formulations of the sparse PCA problem, aimed at extracting a single sparse dominant principal component of a data matrix, or more components at once, respectively. While the initial formulations involve nonconvex functions, and are therefore computationally intractable, we rewrite them into the form of an optimization program involving maximization of a convex function on a compact set. The dimension of the search space is decreased enormously if the data matrix has many more columns (variables) than rows. We then propose and analyze a simple gradient method suited for the task. It appears that our algorithm has best convergence properties in the case when either the objective function or the feasible set are strongly convex, which is the case with our single-unit formulations and can be enforced in the block case. Finally, we demonstrate numerically on a set of random and gene expression test problems that our approach outperforms existing algorithms both in quality of the obtained solution and in computational speed. Keywords: sparse PCA, power method, gradient ascent, strongly convex sets, block algorithms
2 0.54325306 17 jmlr-2010-Bayesian Learning in Sparse Graphical Factor Models via Variational Mean-Field Annealing
Author: Ryo Yoshida, Mike West
Abstract: We describe a class of sparse latent factor models, called graphical factor models (GFMs), and relevant sparse learning algorithms for posterior mode estimation. Linear, Gaussian GFMs have sparse, orthogonal factor loadings matrices, that, in addition to sparsity of the implied covariance matrices, also induce conditional independence structures via zeros in the implied precision matrices. We describe the models and their use for robust estimation of sparse latent factor structure and data/signal reconstruction. We develop computational algorithms for model exploration and posterior mode search, addressing the hard combinatorial optimization involved in the search over a huge space of potential sparse configurations. A mean-field variational technique coupled with annealing is developed to successively generate “artificial” posterior distributions that, at the limiting temperature in the annealing schedule, define required posterior modes in the GFM parameter space. Several detailed empirical studies and comparisons to related approaches are discussed, including analyses of handwritten digit image and cancer gene expression data. Keywords: annealing, graphical factor models, variational mean-field method, MAP estimation, sparse factor analysis, gene expression profiling
3 0.39978996 92 jmlr-2010-Practical Approaches to Principal Component Analysis in the Presence of Missing Values
Author: Alexander Ilin, Tapani Raiko
Abstract: Principal component analysis (PCA) is a classical data analysis technique that finds linear transformations of data that retain the maximal amount of variance. We study a case where some of the data values are missing, and show that this problem has many features which are usually associated with nonlinear models, such as overfitting and bad locally optimal solutions. A probabilistic formulation of PCA provides a good foundation for handling missing values, and we provide formulas for doing that. In case of high dimensional and very sparse data, overfitting becomes a severe problem and traditional algorithms for PCA are very slow. We introduce a novel fast algorithm and extend it to variational Bayesian learning. Different versions of PCA are compared in artificial experiments, demonstrating the effects of regularization and modeling of posterior variance. The scalability of the proposed algorithm is demonstrated by applying it to the Netflix problem. Keywords: principal component analysis, missing values, overfitting, regularization, variational Bayes
4 0.34264618 87 jmlr-2010-Online Learning for Matrix Factorization and Sparse Coding
Author: Julien Mairal, Francis Bach, Jean Ponce, Guillermo Sapiro
Abstract: Sparse coding—that is, modelling data vectors as sparse linear combinations of basis elements—is widely used in machine learning, neuroscience, signal processing, and statistics. This paper focuses on the large-scale matrix factorization problem that consists of learning the basis set in order to adapt it to specific data. Variations of this problem include dictionary learning in signal processing, non-negative matrix factorization and sparse principal component analysis. In this paper, we propose to address these tasks with a new online optimization algorithm, based on stochastic approximations, which scales up gracefully to large data sets with millions of training samples, and extends naturally to various matrix factorization formulations, making it suitable for a wide range of learning problems. A proof of convergence is presented, along with experiments with natural images and genomic data demonstrating that it leads to state-of-the-art performance in terms of speed and optimization for both small and large data sets. Keywords: basis pursuit, dictionary learning, matrix factorization, online learning, sparse coding, sparse principal component analysis, stochastic approximations, stochastic optimization, nonnegative matrix factorization
5 0.32039207 85 jmlr-2010-On the Foundations of Noise-free Selective Classification
Author: Ran El-Yaniv, Yair Wiener
Abstract: We consider selective classification, a term we adopt here to refer to ‘classification with a reject option.’ The essence in selective classification is to trade-off classifier coverage for higher accuracy. We term this trade-off the risk-coverage (RC) trade-off. Our main objective is to characterize this trade-off and to construct algorithms that can optimally or near optimally achieve the best possible trade-offs in a controlled manner. For noise-free models we present in this paper a thorough analysis of selective classification including characterizations of RC trade-offs in various interesting settings. Keywords: classification with a reject option, selective classification, perfect learning, high performance classification, risk-coverage trade-off
6 0.28760371 3 jmlr-2010-A Fast Hybrid Algorithm for Large-Scalel1-Regularized Logistic Regression
7 0.26421422 12 jmlr-2010-Analysis of Multi-stage Convex Relaxation for Sparse Regularization
8 0.21835804 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes
9 0.17475648 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming
10 0.16748397 54 jmlr-2010-Information Retrieval Perspective to Nonlinear Dimensionality Reduction for Data Visualization
11 0.15342511 105 jmlr-2010-Spectral Regularization Algorithms for Learning Large Incomplete Matrices
12 0.1495481 72 jmlr-2010-Matrix Completion from Noisy Entries
13 0.13445605 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions
14 0.13098301 88 jmlr-2010-Optimal Search on Clustered Structural Constraint for Learning Bayesian Network Structure
15 0.12794182 6 jmlr-2010-A Rotation Test to Verify Latent Structure
16 0.12014581 15 jmlr-2010-Approximate Tree Kernels
17 0.11526842 98 jmlr-2010-Regularized Discriminant Analysis, Ridge Regression and Beyond
18 0.11485709 65 jmlr-2010-Learning Translation Invariant Kernels for Classification
19 0.11425714 26 jmlr-2010-Consensus-Based Distributed Support Vector Machines
20 0.11333385 99 jmlr-2010-Restricted Eigenvalue Properties for Correlated Gaussian Designs
topicId topicWeight
[(3, 0.02), (4, 0.015), (8, 0.027), (15, 0.015), (21, 0.029), (24, 0.011), (32, 0.072), (36, 0.026), (37, 0.052), (69, 0.356), (75, 0.177), (81, 0.019), (85, 0.058), (96, 0.013)]
simIndex simValue paperId paperTitle
same-paper 1 0.68637246 43 jmlr-2010-Generalized Power Method for Sparse Principal Component Analysis
Author: Michel Journée, Yurii Nesterov, Peter Richtárik, Rodolphe Sepulchre
Abstract: In this paper we develop a new approach to sparse principal component analysis (sparse PCA). We propose two single-unit and two block optimization formulations of the sparse PCA problem, aimed at extracting a single sparse dominant principal component of a data matrix, or more components at once, respectively. While the initial formulations involve nonconvex functions, and are therefore computationally intractable, we rewrite them into the form of an optimization program involving maximization of a convex function on a compact set. The dimension of the search space is decreased enormously if the data matrix has many more columns (variables) than rows. We then propose and analyze a simple gradient method suited for the task. It appears that our algorithm has best convergence properties in the case when either the objective function or the feasible set are strongly convex, which is the case with our single-unit formulations and can be enforced in the block case. Finally, we demonstrate numerically on a set of random and gene expression test problems that our approach outperforms existing algorithms both in quality of the obtained solution and in computational speed. Keywords: sparse PCA, power method, gradient ascent, strongly convex sets, block algorithms
2 0.50148886 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming
Author: Ming Yuan
Abstract: This paper considers the problem of estimating a high dimensional inverse covariance matrix that can be well approximated by “sparse” matrices. Taking advantage of the connection between multivariate linear regression and entries of the inverse covariance matrix, we propose an estimating procedure that can effectively exploit such “sparsity”. The proposed method can be computed using linear programming and therefore has the potential to be used in very high dimensional problems. Oracle inequalities are established for the estimation error in terms of several operator norms, showing that the method is adaptive to different types of sparsity of the problem. Keywords: covariance selection, Dantzig selector, Gaussian graphical model, inverse covariance matrix, Lasso, linear programming, oracle inequality, sparsity
3 0.4963271 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions
Author: Shiliang Sun, John Shawe-Taylor
Abstract: In this paper, we propose a general framework for sparse semi-supervised learning, which concerns using a small portion of unlabeled data and a few labeled data to represent target functions and thus has the merit of accelerating function evaluations when predicting the output of a new example. This framework makes use of Fenchel-Legendre conjugates to rewrite a convex insensitive loss involving a regularization with unlabeled data, and is applicable to a family of semi-supervised learning methods such as multi-view co-regularized least squares and single-view Laplacian support vector machines (SVMs). As an instantiation of this framework, we propose sparse multi-view SVMs which use a squared ε-insensitive loss. The resultant optimization is an inf-sup problem and the optimal solutions have arguably saddle-point properties. We present a globally optimal iterative algorithm to optimize the problem. We give the margin bound on the generalization error of the sparse multi-view SVMs, and derive the empirical Rademacher complexity for the induced function class. Experiments on artificial and real-world data show their effectiveness. We further give a sequential training approach to show their possibility and potential for uses in large-scale problems and provide encouraging experimental results indicating the efficacy of the margin bound and empirical Rademacher complexity on characterizing the roles of unlabeled data for semi-supervised learning. Keywords: semi-supervised learning, Fenchel-Legendre conjugate, representer theorem, multiview regularization, support vector machine, statistical learning theory
4 0.4952952 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization
Author: Pannagadatta K. Shivaswamy, Tony Jebara
Abstract: Leading classification methods such as support vector machines (SVMs) and their counterparts achieve strong generalization performance by maximizing the margin of separation between data classes. While the maximum margin approach has achieved promising performance, this article identifies its sensitivity to affine transformations of the data and to directions with large data spread. Maximum margin solutions may be misled by the spread of data and preferentially separate classes along large spread directions. This article corrects these weaknesses by measuring margin not in the absolute sense but rather only relative to the spread of data in any projection direction. Maximum relative margin corresponds to a data-dependent regularization on the classification function while maximum absolute margin corresponds to an ℓ2 norm constraint on the classification function. Interestingly, the proposed improvements only require simple extensions to existing maximum margin formulations and preserve the computational efficiency of SVMs. Through the maximization of relative margin, surprising performance gains are achieved on real-world problems such as digit, text classification and on several other benchmark data sets. In addition, risk bounds are derived for the new formulation based on Rademacher averages. Keywords: support vector machines, kernel methods, large margin, Rademacher complexity
5 0.49462706 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes
Author: Jitkomut Songsiri, Lieven Vandenberghe
Abstract: An algorithm is presented for topology selection in graphical models of autoregressive Gaussian time series. The graph topology of the model represents the sparsity pattern of the inverse spectrum of the time series and characterizes conditional independence relations between the variables. The method proposed in the paper is based on an ℓ1 -type nonsmooth regularization of the conditional maximum likelihood estimation problem. We show that this reduces to a convex optimization problem and describe a large-scale algorithm that solves the dual problem via the gradient projection method. Results of experiments with randomly generated and real data sets are also included. Keywords: graphical models, time series, topology selection, convex optimization
6 0.49302837 87 jmlr-2010-Online Learning for Matrix Factorization and Sparse Coding
7 0.49289629 105 jmlr-2010-Spectral Regularization Algorithms for Learning Large Incomplete Matrices
9 0.49051431 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond
10 0.48976773 92 jmlr-2010-Practical Approaches to Principal Component Analysis in the Presence of Missing Values
11 0.48967269 63 jmlr-2010-Learning Instance-Specific Predictive Models
12 0.48889244 66 jmlr-2010-Linear Algorithms for Online Multitask Classification
13 0.48684707 17 jmlr-2010-Bayesian Learning in Sparse Graphical Factor Models via Variational Mean-Field Annealing
14 0.48621073 5 jmlr-2010-A Quasi-Newton Approach to Nonsmooth Convex Optimization Problems in Machine Learning
15 0.48614347 40 jmlr-2010-Fast and Scalable Local Kernel Machines
16 0.48601916 59 jmlr-2010-Large Scale Online Learning of Image Similarity Through Ranking
17 0.48551169 109 jmlr-2010-Stochastic Composite Likelihood
18 0.48338541 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data
19 0.48264608 1 jmlr-2010-A Comparison of Optimization Methods and Software for Large-scale L1-regularized Linear Classification
20 0.4815923 69 jmlr-2010-Lp-Nested Symmetric Distributions