jmlr jmlr2011 jmlr2011-37 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jérémie Bigot, Rolando J. Biscay, Jean-Michel Loubes, Lillian Muñiz-Alvarez
Abstract: In this paper, we consider the Group Lasso estimator of the covariance matrix of a stochastic process corrupted by an additive noise. We propose to estimate the covariance matrix in a highdimensional setting under the assumption that the process has a sparse representation in a large dictionary of basis functions. Using a matrix regression model, we propose a new methodology for high-dimensional covariance matrix estimation based on empirical contrast regularization by a group Lasso penalty. Using such a penalty, the method selects a sparse set of basis functions in the dictionary used to approximate the process, leading to an approximation of the covariance matrix into a low dimensional space. Consistency of the estimator is studied in Frobenius and operator norms and an application to sparse PCA is proposed. Keywords: group Lasso, ℓ1 penalty, high-dimensional covariance estimation, basis expansion, sparsity, oracle inequality, sparse PCA
Reference: text
sentIndex sentText sentNum sentScore
1 Municipio Plaza de la Revolución, Universidad de La Habana, Habana, Cuba Editor: Tong Zhang Abstract In this paper, we consider the Group Lasso estimator of the covariance matrix of a stochastic process corrupted by an additive noise. [sent-14, score-0.267]
2 We propose to estimate the covariance matrix in a highdimensional setting under the assumption that the process has a sparse representation in a large dictionary of basis functions. [sent-15, score-0.322]
3 Using a matrix regression model, we propose a new methodology for high-dimensional covariance matrix estimation based on empirical contrast regularization by a group Lasso penalty. [sent-16, score-0.264]
4 Using such a penalty, the method selects a sparse set of basis functions in the dictionary used to approximate the process, leading to an approximation of the covariance matrix into a low dimensional space. [sent-17, score-0.331]
5 Consistency of the estimator is studied in Frobenius and operator norms and an application to sparse PCA is proposed. [sent-18, score-0.191]
6 Keywords: group Lasso, ℓ1 penalty, high-dimensional covariance estimation, basis expansion, sparsity, oracle inequality, sparse PCA 1. [sent-19, score-0.245]
7 Based on the noisy observations (1), an important problem in statistics is to construct an estimator of the covariance matrix Σ = E[XX⊤] of the process X at the design points, where X = (X (t1 ) , . [sent-40, score-0.29]
8 (2010), using N independent copies of the process X, we have proposed to construct an estimator of the covariance matrix Σ by expanding the process X into a dictionary of basis functions. [sent-47, score-0.439]
9 This new approach to covariance estimation is well adapted to the case of low-dimensional covariance estimation when the number of replicates N of the process is larger than the number of observation points n. [sent-50, score-0.243]
10 Then, when n/N → c > 0 as n, N → +∞, neither the eigenvalues nor the eigenvectors of the sample covariance matrix S are consistent estimators of the eigenvalues and eigenvectors of Σ (see Johnstone, 2001). [sent-71, score-0.358]
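To make this concrete, here is a minimal simulation sketch (an illustration written for this summary, not code from the paper): with Σ = I_n, the eigenvalues of S spread well away from 1 as soon as n/N is not small.

```python
# Illustration (not from the paper): eigenvalues of the sample covariance S
# spread out when n/N is bounded away from 0, even though the true Sigma = I_n.
import numpy as np

rng = np.random.default_rng(0)
for n, N in [(20, 200), (200, 200), (400, 200)]:
    X = rng.standard_normal((N, n))      # N i.i.d. copies of X ~ N(0, I_n)
    S = X.T @ X / N                      # sample covariance matrix
    ev = np.linalg.eigvalsh(S)
    print(f"n={n:3d}, N={N}, n/N={n/N:.1f}: eigenvalues of S in "
          f"[{ev.min():.2f}, {ev.max():.2f}] (all true eigenvalues equal 1)")
```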
11 To achieve consistency, recently developed methods for high-dimensional covariance estimation impose sparsity restrictions on the matrix Σ. [sent-73, score-0.201]
12 Thresholding the components of the empirical covariance matrix has also been proposed by El Karoui (2008) and the consistency of such estimates is studied using 3188 G ROUP L ASSO E STIMATION OF H IGH - DIMENSIONAL C OVARIANCE M ATRICES tools from random matrix theory. [sent-77, score-0.208]
13 Another approach is to impose sparsity on the eigenvectors of the covariance matrix which leads to sparse PCA. [sent-84, score-0.276]
14 (2008) study properties of sparse principal components by convex programming, while Johnstone and Lu (2009) propose a PCA regularization by expanding the empirical eigenvectors in a sparse basis and then applying a thresholding step. [sent-87, score-0.213]
15 In this paper, we propose to estimate Σ in a high-dimensional setting by using the assumption that the process X has a sparse representation in a large dictionary of basis functions. [sent-88, score-0.185]
16 (2010), we propose a new methodology for high-dimensional covariance matrix estimation based on empirical contrast regularization by a group Lasso penalty. [sent-90, score-0.214]
17 Using such a penalty, the method selects a sparse set of basis functions in the dictionary used to approximate the process X. [sent-91, score-0.185]
18 In Section 2, we describe a matrix regression model for covariance estimation, and we define our estimator by group Lasso regularization. [sent-96, score-0.298]
19 Various results existing in matrix theory show that convergence in operator norm implies convergence of the eigenvectors and eigenvalues (for example through the use of the sin(θ) theorems in Davis and Kahan, 1970). [sent-99, score-0.233]
20 Model and Definition of the Estimator To impose sparsity restrictions on the covariance matrix Σ, our approach is based on an approximation of the process in a finite dictionary of (not necessarily orthogonal) basis functions gm : T → R for m = 1, . [sent-104, score-0.432]
(7) Thus Ψλ ∈ R^{M×M} can be interpreted as a group Lasso estimator of Σ in the following matrix regression model: S = Σ + U + W ≈ GΨ∗G⊤ + U + W , (8) where U ∈ R^{n×n} is a centered error matrix given by U = (1/N) ∑_{i=1}^{N} Ui and W = (1/N) ∑_{i=1}^{N} Wi . [sent-146, score-0.261]
The matrix W is due to the noise terms E1 , . . . , EN in model (1), while the term U = S − Σ represents the difference between the (unobserved) sample covariance matrix S and the matrix Σ that we want to estimate. [sent-151, score-0.187]
23 In the orthogonal case, the following proposition shows that the group Lasso estimator Ψλ defined by (7) consists in thresholding the columns/rows of Y whose ℓ2 norm is too small, and in multiplying the other columns/rows by weights between 0 and 1. [sent-155, score-0.307]
Then, the group Lasso estimator Ψλ defined by (7) is the n × n symmetric matrix whose entries are given by (Ψλ)mk = 0 if (∑_{j=1}^{M} Y²_jk)^{1/2} ≤ λγk , and (Ψλ)mk = Ymk ( 1 − λγk / (∑_{j=1}^{M} Y²_jk)^{1/2} ) if (∑_{j=1}^{M} Y²_jk)^{1/2} > λγk , for 1 ≤ k, m ≤ M. [sent-159, score-0.337]
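The sketch below implements this closed-form column-wise rule in the orthonormal case (M = n, G⊤G = I_n); the data, λ and γk values are illustrative assumptions for this summary, not the paper's recommended tuning.

```python
# Sketch of the closed-form group Lasso estimator of Proposition 1 (orthonormal case).
# Data-generating model, lam and gamma below are illustrative assumptions.
import numpy as np

def group_lasso_orthonormal(S, G, lam, gamma):
    """Column-wise soft thresholding of Y = G^T S G (literal form of the displayed rule)."""
    Y = G.T @ S @ G
    Psi = np.zeros_like(Y)
    for k in range(Y.shape[1]):
        col_norm = np.linalg.norm(Y[:, k])            # l2 norm of column k of Y
        if col_norm > lam * gamma[k]:
            Psi[:, k] = Y[:, k] * (1.0 - lam * gamma[k] / col_norm)
    return Psi

rng = np.random.default_rng(1)
n, N = 8, 50
X = 0.1 * rng.standard_normal((N, n))                 # small noise on every coordinate
X[:, :2] += rng.standard_normal((N, 2))               # strong signal on the first 2 coordinates
S = X.T @ X / N
G = np.eye(n)                                         # trivially orthonormal design (M = n)
Psi_hat = group_lasso_orthonormal(S, G, lam=0.1, gamma=2.0 * np.ones(n))
Sigma_hat = G @ Psi_hat @ G.T                         # corresponding covariance estimate
print("columns of Psi_hat set to zero:", int(np.sum(~Psi_hat.any(axis=0))))
```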
25 For a symmetric p × p matrix A with real entries, ρmin (A) denotes the smallest eigenvalue of A, and ρmax (A) denotes the largest eigenvalue of A. [sent-164, score-0.298]
26 (2009) on the consistency of Lasso estimators in the standard nonparametric regression model using a large dictionary of basis functions. [sent-190, score-0.184]
(2009), a general condition called the restricted eigenvalue assumption is introduced to control the minimal eigenvalues of the Gram matrix associated to the dictionary over sets of sparse vectors. [sent-192, score-0.318]
28 In the next sections, oracle inequalities for the group Lasso estimator will be derived under the following assumption on X: Assumption 2 The random vector X = (X (t1 ) , . [sent-212, score-0.207]
29 Assumption 2 will be used to control the deviation in operator norm between the sample covariance matrix S and the true covariance matrix Σ in the sense of the following proposition whose proof follows from Theorem 2. [sent-231, score-0.409]
Actually, it is well known that the sample covariance S is a bad estimator of Σ in a high-dimensional setting, and that without any further restriction on the structure of the covariance matrix Σ, S cannot be a consistent estimator. [sent-251, score-0.331]
Hence, inequality (14) shows that, under an assumption of low-dimensionality of the process X, the deviation in operator norm between S and Σ depends on the ratio 1/N and not on n/N, and thus the quality of S as an estimator of Σ is much better in such settings. [sent-257, score-0.248]
32 More generally, another assumption of low-dimensionality for the process X is to suppose that it has a sparse representation in a dictionary of basis functions, which may also improve the quality of S as an estimator of Σ. [sent-258, score-0.32]
To see this, consider the simplest case X = X 0 , where the process X 0 has a sparse representation in the basis (gm)_{1≤m≤M} given by X 0 (t) = ∑_{m∈J∗} a_m^{1/2} gm (t), t ∈ T, (15) where J∗ ⊂ {1, . [sent-259, score-0.203]
Assume that X satisfies Assumption 2 and that the matrix G⊤_{J∗} G_{J∗} is invertible, where G_{J∗} denotes the n × |J∗| matrix obtained by removing the columns of G whose indices are not in J∗. [sent-265, score-0.182]
Note that in the case of an orthonormal design (M = n and G⊤ G = In ), then ρmax(G⊤_{J∗} G_{J∗}) = ρmin(G⊤_{J∗} G_{J∗}) = 1 for any J∗, and thus the gain in operator norm between S and Σ clearly depends on the size of s∗ compared to n/N. [sent-271, score-0.186]
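The following small simulation (an illustration written for this summary, with Gaussian coefficients and an orthonormal dictionary as assumptions) shows the same phenomenon numerically: the operator-norm error of S is governed by s∗ rather than by n.

```python
# Illustration: when X loads only on a small active set J*, the sample covariance S
# of N copies is already accurate in operator norm, because the effective dimension
# is s* = |J*| rather than n. Gaussian coefficients and orthonormal G are assumptions.
import numpy as np

rng = np.random.default_rng(2)
n, N = 200, 50
G = np.linalg.qr(rng.standard_normal((n, n)))[0]      # orthonormal dictionary (M = n)

for s_star in [2, 10, 100]:
    J = np.arange(s_star)                             # active set J*
    A = rng.standard_normal((N, s_star))              # random coefficients on J*
    X = A @ G[:, J].T                                 # N copies of X = G_{J*} a_i
    Sigma = G[:, J] @ G[:, J].T                       # true covariance E[X X^T]
    S = X.T @ X / N
    rel_err = np.linalg.norm(S - Sigma, 2) / np.linalg.norm(Sigma, 2)
    print(f"s* = {s_star:3d}: relative operator-norm error of S = {rel_err:.2f}")
```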
The following theorem provides an oracle inequality for the group Lasso estimator Σλ = GΨλ G⊤ . [sent-276, score-0.236]
37 Suppose that Assumption 1 holds with c0 = 3 + 4/ε, and that the covariance matrix Σnoise = E (W1 ) of the noise is positive-definite. [sent-279, score-0.259]
Consider the group Lasso estimator Σλ defined by (5) with the choices γk = 2 ‖Gk‖_{ℓ2} ρmax(GG⊤), and λ = ‖Σnoise‖2 ( 1 + √(n/N) + √(2δ log M / N) )² for some constant δ > 1. [sent-280, score-0.181]
39 As an example, suppose that X = X 0 , where the process X 0 has a sparse representation in the basis (gm )1≤m≤M given by X 0 (t) = ∑ m∈J ∗ am gm (t), t ∈ T, where J ∗ ⊂ {1, . [sent-284, score-0.231]
The second term (1/n)‖S − Σ‖²_F in (18) is a variance term, as the empirical covariance matrix S is an unbiased estimator of Σ. [sent-289, score-0.244]
Therefore, if M(Ψ) ≤ s with sparsity s much smaller than n, then the variance of the group Lasso estimator Σλ is smaller than the variance of S. [sent-294, score-0.202]
42 This shows some of the improvements achieved by regularization (7) of the empirical covariance matrix S with a group Lasso penalty. [sent-295, score-0.191]
43 An important assumption of Theorem 8 is that the covariance matrix of the noise Σnoise = E (W1 ) is positive definite. [sent-296, score-0.259]
3 An Oracle Inequality for the Operator Norm The “normalized” Frobenius norm (1/n)‖Σλ − Σ‖²_F, that is, the average of the eigenvalues of (Σλ − Σ)², can be viewed as a reasonable proxy for the operator norm ‖Σλ − Σ‖²_2 (the maximum eigenvalue of (Σλ − Σ)²). [sent-308, score-0.258]
45 It is thus expected that the results of Theorem 8 imply that the group Lasso estimator Σλ is a good estimator of Σ in operator norm. [sent-309, score-0.318]
Let us recall that controlling the operator norm makes it possible to study the convergence of the eigenvectors and eigenvalues of Σλ by controlling the angles between the eigenspaces of a population and a sample covariance matrix through the use of the sin(θ) theorems in Davis and Kahan (1970). [sent-310, score-0.32]
Indeed, the constant C1 in (20) depends on the a priori unknown sparsity s∗ and on the amplitude of the noise in the matrix regression model (8), measured in particular by ‖Σnoise‖2. [sent-334, score-0.357]
The results of Theorem 10 indicate that if the ℓ2-norms of the columns Ψ∗_k of Ψ∗ for k ∈ J∗ are sufficiently large with respect to the level of noise in the matrix regression model (8) and the sparsity s∗, then Jˆ is a consistent estimate of the active set of variables. [sent-344, score-0.287]
First note that the above theorem gives a deviation in operator norm between ΣJˆ and the matrix ΣJ∗ defined in (24), which is not equal to the true covariance Σ of X at the design points. [sent-368, score-0.27]
50 Indeed, even if we know the true sparsity set J ∗ , the additive noise in the measurements in model (1) complicates the estimation of Σ in operator norm. [sent-369, score-0.236]
However, although ΣJ∗ ≠ Σ, they can have the same eigenvectors if the structure of the additive noise matrix term in (24) is not too complex. [sent-370, score-0.236]
52 Therefore, the eigenvectors of ΣJˆ can be used as estimators of the eigenvectors of Σ which is suitable for the sparse PCA application described in the next section on numerical experiments. [sent-373, score-0.197]
As a matter of fact, for covariance estimation in our setting, the group structure makes it possible to impose a constraint on the number of nonzero columns of the matrix Ψ rather than on the individual entries of the matrix Ψ. [sent-396, score-0.315]
54 This corresponds to the natural assumption of obtaining a sparse representation of the process X(t) in the basis given by the functions gm ’s and replacing its dimension by its sparsity. [sent-397, score-0.203]
This procedure leads to the following Lasso estimator of the covariance matrix Σ: ΣL = GΨL G⊤ ∈ Rn×n . [sent-399, score-0.244]
In the orthogonal case (that is, M = n and G⊤ G = In ), this gives rise to the estimator ΨL obtained by individually soft-thresholding each entry Ymk of the matrix Y = G⊤ SG with the thresholds λγmk . [sent-400, score-0.218]
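A minimal sketch of this entrywise soft-thresholding rule is given below; the values of λ and γmk = 2 are used only for illustration, following Proposition 13 as an assumption.

```python
# Sketch of the standard Lasso estimator in the orthogonal case (M = n, G^T G = I_n):
# each entry of Y = G^T S G is soft-thresholded individually with threshold lam*gamma_mk.
import numpy as np

def lasso_orthonormal(S, G, lam, gamma_mk):
    Y = G.T @ S @ G
    return np.sign(Y) * np.maximum(np.abs(Y) - lam * gamma_mk, 0.0)  # entrywise soft thresholding

rng = np.random.default_rng(3)
n, N = 8, 50
X = rng.standard_normal((N, n))
S = X.T @ X / N
G = np.eye(n)
Psi_L = lasso_orthonormal(S, G, lam=0.1, gamma_mk=2.0 * np.ones((n, n)))
Sigma_L = G @ Psi_L @ G.T
print("nonzero entries of Psi_L:", int(np.count_nonzero(Psi_L)), "out of", n * n)
```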
Proposition 13 (see below) allows a simple comparison of the statistical performance of the group Lasso estimator Σλ with that of the standard Lasso estimator ΣL in terms of upper bounds for the Frobenius norm. [sent-401, score-0.268]
58 Proposition 13 Assume that X satisfies model (25) and that the covariance matrix Σnoise = E (W1 ) of the noise is positive-definite. [sent-407, score-0.259]
Consider the group Lasso estimator Σλ and the standard Lasso estimator ΣL with the choices γk = 2, γmk = 2, and λ = ‖Σnoise‖2 ( 2 + √(2δ log M / N) )² for some constant δ > 1. [sent-408, score-0.288]
60 This comes from the fact that the sparsity prior of the Group Lasso is on the number of vanishing columns of the matrix Ψ, while the sparsity prior of the standard Lasso only controls the number of non-zero entries of Ψ. [sent-412, score-0.183]
61 Numerical Experiments and an Application to Sparse PCA In this section we present some simulated examples to illustrate the practical behaviour of the covariance matrix estimator by group Lasso regularization proposed in this paper. [sent-423, score-0.298]
62 In the numerical experiments, we use the explicit estimator described in Proposition 1 in the case M = n and an orthogonal design matrix G, and also the estimator proposed in the more general situation when n < M. [sent-425, score-0.31]
Note that the largest eigenvalue of Σ is γ² ‖F‖²_{ℓ2} with corresponding eigenvector F . [sent-453, score-0.321]
We suppose that the signal f has some sparse representation in a large dictionary of basis functions of size M, given by {gm , m = 1, . [sent-454, score-0.237]
Under such an assumption and since γ1 > γ2 , the largest eigenvalue of Σ is γ1² with corresponding eigenvector F1 , and the second largest eigenvalue of Σ is γ2² with corresponding eigenvector F2 . [sent-467, score-0.642]
We suppose that the signals f1 and f2 have some sparse representations in a large dictionary of basis functions of size M, given by f1 (t) = ∑_{m=1}^{M} β1_m gm (t), and f2 (t) = ∑_{m=1}^{M} β2_m gm (t). [sent-468, score-0.419]
In models (27) and (28), we aim at estimating either F or F1 , F2 by the eigenvectors corresponding to the largest eigenvalues of the matrix ΣJˆ defined in (23), in a high-dimensional setting with n > N and by using different types of dictionaries. [sent-470, score-0.193]
Although the matrices ΣJ ∗ and Σ may have different eigenvectors (depending on the design points and chosen dictionary), the examples below show that the eigenvectors of ΣJˆ can be used as estimators of the eigenvectors of Σ. [sent-472, score-0.25]
69 The estimator ΣJˆ of the covariance matrix Σ is computed as follows. [sent-473, score-0.244]
70 Once the dictionary has been chosen, we compute the covariance group Lasso (CGL) estimator Σλ = GΨλ G⊤ , where Ψλ is defined in (7). [sent-474, score-0.333]
71 For this, we need to have an idea of the true sparsity s∗ , since Jˆ defined in (20) depends on s∗ and also on unknown upper bounds on the level of noise in the matrix regression model (8) . [sent-481, score-0.213]
To overcome this drawback in our case, we can define the final covariance group Lasso (FCGL) estimator as the matrix ΣJˆ = GJˆ ΨJˆ G⊤_Jˆ , (29) with Jˆ = Jˆε = { k : ‖Ψk‖_{ℓ2} > ε }, where ε is a positive constant. [sent-485, score-0.298]
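A hedged end-to-end sketch of this sparse PCA procedure (CGL estimate, thresholding of the column norms to form Ĵ, FCGL matrix, then its leading eigenvectors) follows; the dictionary, the test signal and every tuning constant are illustrative assumptions, not the settings used in the experiments.

```python
# Sketch of the FCGL / sparse PCA pipeline: group Lasso estimate of Psi, selection of
# J_hat by thresholding column l2 norms at eps, and leading eigenvectors of Sigma_Jhat.
# Dictionary, signal, lam, gamma and eps below are illustrative assumptions.
import numpy as np

def fcgl(S, G, lam, gamma, eps):
    Y = G.T @ S @ G
    Psi = np.zeros_like(Y)
    for k in range(Y.shape[1]):
        nk = np.linalg.norm(Y[:, k])
        if nk > lam * gamma[k]:
            Psi[:, k] = Y[:, k] * (1.0 - lam * gamma[k] / nk)
    Psi = 0.5 * (Psi + Psi.T)                               # symmetrize (assumption)
    J_hat = np.where(np.linalg.norm(Psi, axis=0) > eps)[0]  # the set J_hat of (29)
    G_J = G[:, J_hat]
    Sigma_J = G_J @ Psi[np.ix_(J_hat, J_hat)] @ G_J.T
    return Sigma_J, J_hat

rng = np.random.default_rng(4)
n, N = 64, 20
t = np.arange(n)
f = np.sin(2 * np.pi * t / n)                               # signal with a simple structure
X = rng.standard_normal((N, 1)) @ f[None, :] + 0.1 * rng.standard_normal((N, n))
S = X.T @ X / N
G = np.eye(n)                                               # identity dictionary keeps the sketch short
Sigma_J, J_hat = fcgl(S, G, lam=0.05, gamma=2.0 * np.ones(n), eps=1e-3)
w, V = np.linalg.eigh(Sigma_J)
corr = abs(V[:, -1] @ f) / np.linalg.norm(f)                # leading eigenvector vs the signal
print("|J_hat| =", len(J_hat), "; correlation with f =", round(corr, 2))
```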
We also compute the empirical average of the operator norm of the estimator ΣJˆ with respect to the matrix ΣJ∗ , defined by EAON∗ = (1/P) ∑_{p=1}^{P} ‖Σ(p)Jˆ − ΣJ∗‖2 . [sent-489, score-0.247]
74 It can be observed in Figures 1(a) and 1(b) that, as expected in this high dimensional setting (N < n), the empirical eigenvector of S associated to its largest empirical eigenvalue does not lead to a consistent estimator of F . [sent-508, score-0.481]
75 In Figures 1(c) and 1(d), we display the eigenvector associated to the largest eigenvalue of Σλ as an estimator of F . [sent-510, score-0.449]
76 Figures 1(e) and 1(f) illustrate the very good performance of the eigenvector associated to the largest eigenvalue of the matrix ΣJˆ as an estimator of F . [sent-513, score-0.499]
77 Signal HeaviSine - (a) Eigenvector associated to the largest eigenvalue of S, (c) Eigenvector associated to the largest eigenvalue of Σλ , (e) Eigenvector associated to the largest eigenvalue of ΣJ . [sent-606, score-0.51]
78 Signal Blocks - (b) Eigenvector associated to the largest eigenvalue of S, (d) Eigenvector associated to the largest eigenvalue of Σλ , (f) Eigenvector associated to the largest eigenvalue of ΣJ . [sent-607, score-0.51]
79 Obviously, the signal f1 has a sparse representation in a Haar basis while the signal f2 has a sparse representation in a Fourier basis. [sent-705, score-0.205]
More precisely, we construct an n × n orthogonal matrix G1 using the Haar basis and an n × n orthogonal matrix G2 using a Fourier basis (cosine and sine at various frequencies) at the design points. [sent-707, score-0.255]
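One possible construction of such orthogonal matrices is sketched below (a recursive Haar matrix and a real cosine/sine matrix at the design points); this implementation is an assumption for this summary, not the authors' code.

```python
# Sketch: build an n x n orthogonal Haar matrix G1 and an n x n orthogonal Fourier
# (cosine/sine) matrix G2, and concatenate them into the n x 2n dictionary G = [G1 G2].
import numpy as np

def haar_matrix(n):
    """Orthonormal Haar matrix (rows = Haar functions); n must be a power of 2."""
    if n == 1:
        return np.array([[1.0]])
    H = haar_matrix(n // 2)
    top = np.kron(H, [1.0, 1.0])                    # scaling (averaging) part
    bottom = np.kron(np.eye(n // 2), [1.0, -1.0])   # detail (difference) part
    M = np.vstack([top, bottom])
    return M / np.linalg.norm(M, axis=1, keepdims=True)

def fourier_matrix(n):
    """Real orthonormal matrix of cosines and sines at the design points (n even)."""
    t = np.arange(n)
    cols = [np.ones(n) / np.sqrt(n)]
    for k in range(1, n // 2):
        cols.append(np.cos(2 * np.pi * k * t / n) * np.sqrt(2.0 / n))
        cols.append(np.sin(2 * np.pi * k * t / n) * np.sqrt(2.0 / n))
    cols.append(np.cos(np.pi * t) / np.sqrt(n))     # Nyquist frequency column
    return np.column_stack(cols)

n = 32
G1 = haar_matrix(n).T              # columns = Haar basis vectors at the design points
G2 = fourier_matrix(n)
G = np.hstack([G1, G2])            # n x 2n dictionary, used when n < M
print(np.allclose(G1.T @ G1, np.eye(n)), np.allclose(G2.T @ G2, np.eye(n)))
```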
81 In Figures 2(c) and 2(d), we display the eigenvector associated to the largest eigenvalue of Σλ as an estimator of F1 , and the eigenvector associated to the second largest eigenvalue of Σλ as an estimator of F2 . [sent-711, score-0.898]
82 Figures 2(e) and 2(f) illustrate the very good performance of the eigenvectors associated to the largest eigenvalue and second largest eigenvalue of the matrix ΣJˆ as estimators of F1 and F2 . [sent-714, score-0.468]
It can be observed in Figures 5(a) and 5(c) that, as expected in this high dimensional setting (N < n), the empirical eigenvectors of S associated to its largest empirical eigenvalue are noisy versions of F . [sent-739, score-0.374]
84 In Figures 5(c) and 5(d) is shown the eigenvector associated to the largest eigenvalue of Σλ as an estimator of F . [sent-742, score-0.449]
Again, the eigenvector associated to the largest eigenvalue of the matrix ΣJ defined in (29) is a much better estimator of F . [sent-744, score-0.449]
86 (a) Signal F1 , (c) Signal F1 and eigenvector associated to the largest eigenvalue of Σλ , (e) Signal F1 and eigenvector associated to the largest eigenvalue of ΣJ with G = [G1 G2 ]. [sent-863, score-0.684]
87 (b) Signal F2 , (d) Signal F2 and eigenvector associated to the second largest eigenvalue of Σλ , (f) Signal F2 and eigenvector associated to the second largest eigenvalue of ΣJ with G = [G1 G2 ]. [sent-864, score-0.684]
88 (a) Signal F1 and Eigenvector associated to the largest eigenvalue of ΣJ with G = G1 , (b) Signal F2 and Eigenvector associated to the second largest eigenvalue of ΣJ with G = G1 . [sent-898, score-0.34]
89 (a) Signal F1 and Eigenvector associated to the largest eigenvalue of ΣJ with G = G2 , (b) Signal F2 and Eigenvector associated to the second largest eigenvalue of ΣJ with G = G2 . [sent-934, score-0.34]
90 Signal HeaviSine - (a) Eigenvector associated to the largest eigenvalue of S, (c) Eigenvector associated to the largest eigenvalue of Σλ , (e) Eigenvector associated to the largest eigenvalue of ΣJ . [sent-1040, score-0.51]
91 Signal Blocks - (b) Eigenvector associated to the largest eigenvalue of S, (d) Eigenvector associated to the largest eigenvalue of Σλ , (f) Eigenvector associated to the largest eigenvalue of ΣJ . [sent-1041, score-0.51]
denoted by vec (A), obtained by stacking the columns of the matrix A on top of one another. [sent-1054, score-0.231]
In what follows, we repeatedly use the fact that the Frobenius norm is invariant under the vec operation, meaning that ‖A‖²_F = ‖vec (A)‖²_{ℓ2} , (30) and the properties vec (ABC) = (C⊤ ⊗ A) vec (B) , (31) and (A ⊗ B)(C ⊗ D) = AC ⊗ BD, (32) provided the above matrix products are compatible. [sent-1084, score-0.61]
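These identities are easy to check numerically; the short sketch below (an illustration only) verifies (30), (31) and (32) on random matrices of compatible sizes.

```python
# Numerical check of identities (30)-(32): Frobenius/vec isometry, vec(ABC) = (C^T kron A) vec(B),
# and the mixed-product property of the Kronecker product.
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((3, 4))
C = rng.standard_normal((4, 5))

def vec(M):
    return M.reshape(-1, order="F")    # stack the columns of M on top of one another

print(np.isclose(np.linalg.norm(A, "fro") ** 2, vec(A) @ vec(A)))             # (30)
print(np.allclose(vec(A @ B @ C), np.kron(C.T, A) @ vec(B)))                  # (31)
P, Q = rng.standard_normal((2, 3)), rng.standard_normal((4, 5))
R, T = rng.standard_normal((3, 2)), rng.standard_normal((5, 4))
print(np.allclose(np.kron(P, Q) @ np.kron(R, T), np.kron(P @ R, Q @ T)))      # (32)
```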
Therefore, ΨJ∗ is a sample covariance matrix of size s∗ × s∗ and we can control its deviation in operator norm from Ψ∗_{J∗} by using Proposition 6. [sent-1125, score-0.214]
Hence, on this event, ‖Ψk‖_{ℓ2} > 2 C1(n, M, N, s∗, G, Σnoise) − C1(n, M, N, s∗, G, Σnoise) = C1(n, M, N, s∗, G, Σnoise). This completes the proof of Theorem 10. [sent-1252, score-0.244]
Recall that ΨJ∗ is built from the vectors G⊤_{J∗} Xi for i = 1, . . . , N. Therefore, ΨJ∗ is a sample covariance matrix of size s∗ × s∗ and we can control its deviation in operator norm from ΛJ∗ by using Proposition 6. [sent-1262, score-0.227]
97 Consistency of the group lasso and multiple kernel learning. [sent-1284, score-0.22]
98 Operator norm consistent estimation of large-dimensional sparse covariance matrices. [sent-1348, score-0.184]
99 High dimensional covariance matrix estimation using a factor model. [sent-1353, score-0.192]
100 On the asymptotic properties of the group lasso estimator for linear models. [sent-1412, score-0.327]
wordName wordTfidf (topN-words)
[('gj', 0.559), ('eaon', 0.364), ('eafn', 0.178), ('eigenvector', 0.172), ('lasso', 0.166), ('heavisine', 0.161), ('igot', 0.161), ('iscay', 0.161), ('niz', 0.161), ('oubes', 0.161), ('lvarez', 0.137), ('atrices', 0.137), ('ovariance', 0.137), ('vec', 0.13), ('asso', 0.123), ('roup', 0.123), ('noise', 0.122), ('stimation', 0.113), ('estimator', 0.107), ('igh', 0.105), ('gm', 0.103), ('eigenvalue', 0.099), ('covariance', 0.087), ('dictionary', 0.085), ('lounici', 0.077), ('bigot', 0.076), ('aj', 0.071), ('ymk', 0.068), ('eigenvectors', 0.064), ('mk', 0.063), ('blocks', 0.055), ('group', 0.054), ('haar', 0.053), ('columns', 0.051), ('operator', 0.05), ('matrix', 0.05), ('largest', 0.05), ('gk', 0.049), ('sm', 0.048), ('bickel', 0.047), ('signal', 0.047), ('tn', 0.046), ('proposition', 0.045), ('basis', 0.043), ('toulouse', 0.043), ('cgl', 0.042), ('event', 0.042), ('sparsity', 0.041), ('norm', 0.04), ('thresholding', 0.038), ('frobenius', 0.038), ('max', 0.038), ('tr', 0.037), ('remark', 0.037), ('orthonormal', 0.037), ('min', 0.036), ('rm', 0.035), ('pca', 0.035), ('estimators', 0.035), ('rn', 0.035), ('antoniadis', 0.034), ('biscay', 0.034), ('elizaveta', 0.034), ('fcgl', 0.034), ('loubes', 0.034), ('rolando', 0.034), ('symmlet', 0.034), ('sparse', 0.034), ('levina', 0.032), ('ei', 0.032), ('dimensional', 0.032), ('indices', 0.031), ('gg', 0.03), ('eigenvalues', 0.029), ('suppose', 0.028), ('inequality', 0.028), ('oracle', 0.027), ('bd', 0.027), ('aa', 0.026), ('lilian', 0.025), ('orlicz', 0.025), ('wavelet', 0.023), ('orthogonal', 0.023), ('process', 0.023), ('design', 0.023), ('estimation', 0.023), ('signals', 0.023), ('amplitude', 0.022), ('figures', 0.022), ('karim', 0.022), ('consistency', 0.021), ('associated', 0.021), ('xi', 0.021), ('alexandre', 0.021), ('copies', 0.021), ('fourier', 0.021), ('log', 0.02), ('rs', 0.02), ('theorem', 0.02), ('inequalities', 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices
Author: Jérémie Bigot, Rolando J. Biscay, Jean-Michel Loubes, Lillian Muñiz-Alvarez
Abstract: In this paper, we consider the Group Lasso estimator of the covariance matrix of a stochastic process corrupted by an additive noise. We propose to estimate the covariance matrix in a highdimensional setting under the assumption that the process has a sparse representation in a large dictionary of basis functions. Using a matrix regression model, we propose a new methodology for high-dimensional covariance matrix estimation based on empirical contrast regularization by a group Lasso penalty. Using such a penalty, the method selects a sparse set of basis functions in the dictionary used to approximate the process, leading to an approximation of the covariance matrix into a low dimensional space. Consistency of the estimator is studied in Frobenius and operator norms and an application to sparse PCA is proposed. Keywords: group Lasso, ℓ1 penalty, high-dimensional covariance estimation, basis expansion, sparsity, oracle inequality, sparse PCA
2 0.19339783 88 jmlr-2011-Structured Variable Selection with Sparsity-Inducing Norms
Author: Rodolphe Jenatton, Jean-Yves Audibert, Francis Bach
Abstract: We consider the empirical risk minimization problem for linear supervised learning, with regularization by structured sparsity-inducing norms. These are defined as sums of Euclidean norms on certain subsets of variables, extending the usual ℓ1 -norm and the group ℓ1 -norm by allowing the subsets to overlap. This leads to a specific set of allowed nonzero patterns for the solutions of such problems. We first explore the relationship between the groups defining the norm and the resulting nonzero patterns, providing both forward and backward algorithms to go back and forth from groups to patterns. This allows the design of norms adapted to specific prior knowledge expressed in terms of nonzero patterns. We also present an efficient active set algorithm, and analyze the consistency of variable selection for least-squares linear regression in low and high-dimensional settings. Keywords: sparsity, consistency, variable selection, convex optimization, active set algorithm
3 0.16823754 39 jmlr-2011-High-dimensional Covariance Estimation Based On Gaussian Graphical Models
Author: Shuheng Zhou, Philipp Rütimann, Min Xu, Peter Bühlmann
Abstract: Undirected graphs are often used to describe high dimensional distributions. Under sparsity conditions, the graph can be estimated using ℓ1 -penalization methods. We propose and study the following method. We combine a multiple regression approach with ideas of thresholding and refitting: first we infer a sparse undirected graphical model structure via thresholding of each among many ℓ1 -norm penalized regression functions; we then estimate the covariance matrix and its inverse using the maximum likelihood estimator. We show that under suitable conditions, this approach yields consistent estimation in terms of graphical structure and fast convergence rates with respect to the operator and Frobenius norm for the covariance matrix and its inverse. We also derive an explicit bound for the Kullback Leibler divergence. Keywords: graphical model selection, covariance estimation, Lasso, nodewise regression, thresholding c 2011 Shuheng Zhou, Philipp R¨ timann, Min Xu and Peter B¨ hlmann. u u ¨ ¨ Z HOU , R UTIMANN , X U AND B UHLMANN
4 0.16496071 97 jmlr-2011-Union Support Recovery in Multi-task Learning
Author: Mladen Kolar, John Lafferty, Larry Wasserman
Abstract: We sharply characterize the performance of different penalization schemes for the problem of selecting the relevant variables in the multi-task setting. Previous work focuses on the regression problem where conditions on the design matrix complicate the analysis. A clearer and simpler picture emerges by studying the Normal means model. This model, often used in the field of statistics, is a simplified model that provides a laboratory for studying complex procedures. Keywords: high-dimensional inference, multi-task learning, sparsity, normal means, minimax estimation
5 0.083491087 79 jmlr-2011-Proximal Methods for Hierarchical Sparse Coding
Author: Rodolphe Jenatton, Julien Mairal, Guillaume Obozinski, Francis Bach
Abstract: Sparse coding consists in representing signals as sparse linear combinations of atoms selected from a dictionary. We consider an extension of this framework where the atoms are further assumed to be embedded in a tree. This is achieved using a recently introduced tree-structured sparse regularization norm, which has proven useful in several applications. This norm leads to regularized problems that are difficult to optimize, and in this paper, we propose efficient algorithms for solving them. More precisely, we show that the proximal operator associated with this norm is computable exactly via a dual approach that can be viewed as the composition of elementary proximal operators. Our procedure has a complexity linear, or close to linear, in the number of atoms, and allows the use of accelerated gradient techniques to solve the tree-structured sparse approximation problem at the same computational cost as traditional ones using the ℓ1 -norm. Our method is efficient and scales gracefully to millions of variables, which we illustrate in two types of applications: first, we consider fixed hierarchical dictionaries of wavelets to denoise natural images. Then, we apply our optimization tools in the context of dictionary learning, where learned dictionary elements naturally self-organize in a prespecified arborescent structure, leading to better performance in reconstruction of natural image patches. When applied to text documents, our method learns hierarchies of topics, thus providing a competitive alternative to probabilistic topic models. Keywords: Proximal methods, dictionary learning, structured sparsity, matrix factorization
6 0.074360654 91 jmlr-2011-The Sample Complexity of Dictionary Learning
7 0.073623545 64 jmlr-2011-Minimum Description Length Penalization for Group and Multi-Task Sparse Learning
8 0.071738638 59 jmlr-2011-Learning with Structured Sparsity
9 0.060105175 40 jmlr-2011-Hyper-Sparse Optimal Aggregation
10 0.057927124 3 jmlr-2011-A Cure for Variance Inflation in High Dimensional Kernel Principal Component Analysis
11 0.057584289 20 jmlr-2011-Convex and Network Flow Optimization for Structured Sparsity
12 0.045553029 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels
13 0.044679288 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates
14 0.043865021 60 jmlr-2011-Locally Defined Principal Curves and Surfaces
15 0.042870343 18 jmlr-2011-Convergence Rates of Efficient Global Optimization Algorithms
16 0.036913268 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes
17 0.036577825 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning
18 0.035853438 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach
19 0.035631094 35 jmlr-2011-Forest Density Estimation
20 0.033924118 90 jmlr-2011-The Indian Buffet Process: An Introduction and Review
topicId topicWeight
[(0, 0.21), (1, -0.004), (2, 0.27), (3, 0.27), (4, -0.009), (5, -0.022), (6, -0.033), (7, -0.06), (8, -0.052), (9, 0.012), (10, 0.261), (11, -0.097), (12, 0.094), (13, -0.292), (14, -0.041), (15, 0.059), (16, 0.144), (17, -0.035), (18, -0.054), (19, -0.011), (20, 0.098), (21, 0.004), (22, 0.024), (23, 0.022), (24, -0.12), (25, -0.068), (26, 0.076), (27, 0.107), (28, 0.039), (29, 0.078), (30, -0.011), (31, 0.031), (32, 0.055), (33, -0.153), (34, -0.051), (35, 0.029), (36, 0.142), (37, -0.04), (38, 0.17), (39, -0.019), (40, 0.074), (41, -0.02), (42, 0.028), (43, 0.02), (44, 0.124), (45, -0.003), (46, 0.034), (47, 0.023), (48, 0.05), (49, 0.133)]
simIndex simValue paperId paperTitle
same-paper 1 0.94542021 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices
Author: Jérémie Bigot, Rolando J. Biscay, Jean-Michel Loubes, Lillian Muñiz-Alvarez
Abstract: In this paper, we consider the Group Lasso estimator of the covariance matrix of a stochastic process corrupted by an additive noise. We propose to estimate the covariance matrix in a highdimensional setting under the assumption that the process has a sparse representation in a large dictionary of basis functions. Using a matrix regression model, we propose a new methodology for high-dimensional covariance matrix estimation based on empirical contrast regularization by a group Lasso penalty. Using such a penalty, the method selects a sparse set of basis functions in the dictionary used to approximate the process, leading to an approximation of the covariance matrix into a low dimensional space. Consistency of the estimator is studied in Frobenius and operator norms and an application to sparse PCA is proposed. Keywords: group Lasso, ℓ1 penalty, high-dimensional covariance estimation, basis expansion, sparsity, oracle inequality, sparse PCA
2 0.70022982 88 jmlr-2011-Structured Variable Selection with Sparsity-Inducing Norms
Author: Rodolphe Jenatton, Jean-Yves Audibert, Francis Bach
Abstract: We consider the empirical risk minimization problem for linear supervised learning, with regularization by structured sparsity-inducing norms. These are defined as sums of Euclidean norms on certain subsets of variables, extending the usual ℓ1 -norm and the group ℓ1 -norm by allowing the subsets to overlap. This leads to a specific set of allowed nonzero patterns for the solutions of such problems. We first explore the relationship between the groups defining the norm and the resulting nonzero patterns, providing both forward and backward algorithms to go back and forth from groups to patterns. This allows the design of norms adapted to specific prior knowledge expressed in terms of nonzero patterns. We also present an efficient active set algorithm, and analyze the consistency of variable selection for least-squares linear regression in low and high-dimensional settings. Keywords: sparsity, consistency, variable selection, convex optimization, active set algorithm
3 0.66163957 39 jmlr-2011-High-dimensional Covariance Estimation Based On Gaussian Graphical Models
Author: Shuheng Zhou, Philipp Rütimann, Min Xu, Peter Bühlmann
Abstract: Undirected graphs are often used to describe high dimensional distributions. Under sparsity conditions, the graph can be estimated using ℓ1 -penalization methods. We propose and study the following method. We combine a multiple regression approach with ideas of thresholding and refitting: first we infer a sparse undirected graphical model structure via thresholding of each among many ℓ1 -norm penalized regression functions; we then estimate the covariance matrix and its inverse using the maximum likelihood estimator. We show that under suitable conditions, this approach yields consistent estimation in terms of graphical structure and fast convergence rates with respect to the operator and Frobenius norm for the covariance matrix and its inverse. We also derive an explicit bound for the Kullback Leibler divergence. Keywords: graphical model selection, covariance estimation, Lasso, nodewise regression, thresholding c 2011 Shuheng Zhou, Philipp R¨ timann, Min Xu and Peter B¨ hlmann. u u ¨ ¨ Z HOU , R UTIMANN , X U AND B UHLMANN
4 0.51621836 97 jmlr-2011-Union Support Recovery in Multi-task Learning
Author: Mladen Kolar, John Lafferty, Larry Wasserman
Abstract: We sharply characterize the performance of different penalization schemes for the problem of selecting the relevant variables in the multi-task setting. Previous work focuses on the regression problem where conditions on the design matrix complicate the analysis. A clearer and simpler picture emerges by studying the Normal means model. This model, often used in the field of statistics, is a simplified model that provides a laboratory for studying complex procedures. Keywords: high-dimensional inference, multi-task learning, sparsity, normal means, minimax estimation
5 0.29146633 40 jmlr-2011-Hyper-Sparse Optimal Aggregation
Author: Stéphane Gaïffas, Guillaume Lecué
Abstract: Given a finite set F of functions and a learning sample, the aim of an aggregation procedure is to have a risk as close as possible to risk of the best function in F. Up to now, optimal aggregation procedures are convex combinations of every elements of F. In this paper, we prove that optimal aggregation procedures combining only two functions in F exist. Such algorithms are of particular interest when F contains many irrelevant functions that should not appear in the aggregation procedure. Since selectors are suboptimal aggregation procedures, this proves that two is the minimal number of elements of F required for the construction of an optimal aggregation procedure in every situations. Then, we perform a numerical study for the problem of selection of the regularization parameters of the Lasso and the Elastic-net estimators. We compare on simulated examples our aggregation algorithms to aggregation with exponential weights, to Mallow’s Cp and to cross-validation selection procedures. Keywords: aggregation, exact oracle inequality, empirical risk minimization, empirical process theory, sparsity, Lasso, Lars
6 0.278346 59 jmlr-2011-Learning with Structured Sparsity
7 0.25044259 3 jmlr-2011-A Cure for Variance Inflation in High Dimensional Kernel Principal Component Analysis
8 0.2476387 79 jmlr-2011-Proximal Methods for Hierarchical Sparse Coding
9 0.24592042 64 jmlr-2011-Minimum Description Length Penalization for Group and Multi-Task Sparse Learning
10 0.24427703 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes
11 0.21535246 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels
12 0.21527576 20 jmlr-2011-Convex and Network Flow Optimization for Structured Sparsity
13 0.20123689 18 jmlr-2011-Convergence Rates of Efficient Global Optimization Algorithms
14 0.18839799 6 jmlr-2011-A Simpler Approach to Matrix Completion
15 0.18088982 91 jmlr-2011-The Sample Complexity of Dictionary Learning
16 0.17501554 29 jmlr-2011-Efficient Learning with Partially Observed Attributes
17 0.16880396 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms
18 0.16722526 19 jmlr-2011-Convergence of Distributed Asynchronous Learning Vector Quantization Algorithms
19 0.16683935 35 jmlr-2011-Forest Density Estimation
20 0.16026407 58 jmlr-2011-Learning from Partial Labels
topicId topicWeight
[(4, 0.097), (9, 0.06), (10, 0.024), (24, 0.042), (31, 0.071), (32, 0.012), (33, 0.351), (41, 0.022), (60, 0.017), (65, 0.015), (73, 0.028), (78, 0.15), (90, 0.011)]
simIndex simValue paperId paperTitle
same-paper 1 0.71888506 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices
Author: Jérémie Bigot, Rolando J. Biscay, Jean-Michel Loubes, Lillian Muñiz-Alvarez
Abstract: In this paper, we consider the Group Lasso estimator of the covariance matrix of a stochastic process corrupted by an additive noise. We propose to estimate the covariance matrix in a highdimensional setting under the assumption that the process has a sparse representation in a large dictionary of basis functions. Using a matrix regression model, we propose a new methodology for high-dimensional covariance matrix estimation based on empirical contrast regularization by a group Lasso penalty. Using such a penalty, the method selects a sparse set of basis functions in the dictionary used to approximate the process, leading to an approximation of the covariance matrix into a low dimensional space. Consistency of the estimator is studied in Frobenius and operator norms and an application to sparse PCA is proposed. Keywords: group Lasso, ℓ1 penalty, high-dimensional covariance estimation, basis expansion, sparsity, oracle inequality, sparse PCA
2 0.48397741 91 jmlr-2011-The Sample Complexity of Dictionary Learning
Author: Daniel Vainsencher, Shie Mannor, Alfred M. Bruckstein
Abstract: A large set of signals can sometimes be described sparsely using a dictionary, that is, every element can be represented as a linear combination of few elements from the dictionary. Algorithms for various signal processing applications, including classification, denoising and signal separation, learn a dictionary from a given set of signals to be represented. Can we expect that the error in representing by such a dictionary a previously unseen signal from the same source will be of similar magnitude as those for the given examples? We assume signals are generated from a fixed distribution, and study these questions from a statistical learning theory perspective. We develop generalization bounds on the quality of the learned dictionary for two types of constraints on the coefficient selection, as measured by the expected L2 error in representation when the dictionary is used. For the case of l1 regularized coefficient selection we provide a generalnp ln(mλ)/m , where n is the dimension, p is the number of ization bound of the order of O elements in the dictionary, λ is a bound on the l1 norm of the coefficient vector and m is the number of samples, which complements existing results. For the case of representing a new signal as a combination of at most k dictionary elements, we provide a bound of the order O( np ln(mk)/m) under an assumption on the closeness to orthogonality of the dictionary (low Babel function). We further show that this assumption holds for most dictionaries in high dimensions in a strong probabilistic sense. Our results also include bounds that converge as 1/m, not previously known for this problem. We provide similar results in a general setting using kernels with weak smoothness requirements. Keywords: dictionary learning, generalization bound, sparse representation
3 0.47300628 59 jmlr-2011-Learning with Structured Sparsity
Author: Junzhou Huang, Tong Zhang, Dimitris Metaxas
Abstract: This paper investigates a learning formulation called structured sparsity, which is a natural extension of the standard sparsity concept in statistical learning and compressive sensing. By allowing arbitrary structures on the feature set, this concept generalizes the group sparsity idea that has become popular in recent years. A general theory is developed for learning with structured sparsity, based on the notion of coding complexity associated with the structure. It is shown that if the coding complexity of the target signal is small, then one can achieve improved performance by using coding complexity regularization methods, which generalize the standard sparse regularization. Moreover, a structured greedy algorithm is proposed to efficiently solve the structured sparsity problem. It is shown that the greedy algorithm approximately solves the coding complexity optimization problem under appropriate conditions. Experiments are included to demonstrate the advantage of structured sparsity over standard sparsity on some real applications. Keywords: structured sparsity, standard sparsity, group sparsity, tree sparsity, graph sparsity, sparse learning, feature selection, compressive sensing
4 0.46474448 22 jmlr-2011-Differentially Private Empirical Risk Minimization
Author: Kamalika Chaudhuri, Claire Monteleoni, Anand D. Sarwate
Abstract: Privacy-preserving machine learning algorithms are crucial for the increasingly common setting in which personal data, such as medical or financial records, are analyzed. We provide general techniques to produce privacy-preserving approximations of classifiers learned via (regularized) empirical risk minimization (ERM). These algorithms are private under the ε-differential privacy definition due to Dwork et al. (2006). First we apply the output perturbation ideas of Dwork et al. (2006), to ERM classification. Then we propose a new method, objective perturbation, for privacy-preserving machine learning algorithm design. This method entails perturbing the objective function before optimizing over classifiers. If the loss and regularizer satisfy certain convexity and differentiability criteria, we prove theoretical results showing that our algorithms preserve privacy, and provide generalization bounds for linear and nonlinear kernels. We further present a privacypreserving technique for tuning the parameters in general machine learning algorithms, thereby providing end-to-end privacy guarantees for the training process. We apply these results to produce privacy-preserving analogues of regularized logistic regression and support vector machines. We obtain encouraging results from evaluating their performance on real demographic and benchmark data sets. Our results show that both theoretically and empirically, objective perturbation is superior to the previous state-of-the-art, output perturbation, in managing the inherent tradeoff between privacy and learning performance. Keywords: privacy, classification, optimization, empirical risk minimization, support vector machines, logistic regression
5 0.46360335 40 jmlr-2011-Hyper-Sparse Optimal Aggregation
Author: Stéphane Gaïffas, Guillaume Lecué
Abstract: Given a finite set F of functions and a learning sample, the aim of an aggregation procedure is to have a risk as close as possible to risk of the best function in F. Up to now, optimal aggregation procedures are convex combinations of every elements of F. In this paper, we prove that optimal aggregation procedures combining only two functions in F exist. Such algorithms are of particular interest when F contains many irrelevant functions that should not appear in the aggregation procedure. Since selectors are suboptimal aggregation procedures, this proves that two is the minimal number of elements of F required for the construction of an optimal aggregation procedure in every situations. Then, we perform a numerical study for the problem of selection of the regularization parameters of the Lasso and the Elastic-net estimators. We compare on simulated examples our aggregation algorithms to aggregation with exponential weights, to Mallow’s Cp and to cross-validation selection procedures. Keywords: aggregation, exact oracle inequality, empirical risk minimization, empirical process theory, sparsity, Lasso, Lars
6 0.46329436 29 jmlr-2011-Efficient Learning with Partially Observed Attributes
7 0.46201748 104 jmlr-2011-X-Armed Bandits
9 0.45704612 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints
10 0.45660594 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates
11 0.45628333 87 jmlr-2011-Stochastic Methods forl1-regularized Loss Minimization
12 0.45557335 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets
13 0.45527431 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
14 0.45357671 81 jmlr-2011-Robust Approximate Bilinear Programming for Value Function Approximation
15 0.45104557 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models
16 0.44995493 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning
17 0.44992751 71 jmlr-2011-On Equivalence Relationships Between Classification and Ranking Algorithms
18 0.44951347 19 jmlr-2011-Convergence of Distributed Asynchronous Learning Vector Quantization Algorithms
19 0.44934791 39 jmlr-2011-High-dimensional Covariance Estimation Based On Gaussian Graphical Models
20 0.44878495 88 jmlr-2011-Structured Variable Selection with Sparsity-Inducing Norms