jmlr jmlr2013 jmlr2013-102 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Tingni Sun, Cun-Hui Zhang
Abstract: We propose a new method of learning a sparse nonnegative-definite target matrix. Our primary example of the target matrix is the inverse of a population covariance or correlation matrix. The algorithm first estimates each column of the target matrix by the scaled Lasso and then adjusts the matrix estimator to be symmetric. The penalty level of the scaled Lasso for each column is completely determined by data via convex minimization, without using cross-validation. We prove that this scaled Lasso method guarantees the fastest proven rate of convergence in the spectrum norm under conditions of weaker form than those in the existing analyses of other ℓ1 regularized algorithms, and has faster guaranteed rate of convergence when the ratio of the ℓ1 and spectrum norms of the target inverse matrix diverges to infinity. A simulation study demonstrates the computational feasibility and superb performance of the proposed method. Our analysis also provides new performance bounds for the Lasso and scaled Lasso to guarantee higher concentration of the error at a smaller threshold level than previous analyses, and to allow the use of the union bound in column-by-column applications of the scaled Lasso without an adjustment of the penalty level. In addition, the least squares estimation after the scaled Lasso selection is considered and proven to guarantee performance bounds similar to that of the scaled Lasso. Keywords: precision matrix, concentration matrix, inverse matrix, graphical model, scaled Lasso, linear regression, spectrum norm
Reference: text
sentIndex sentText sentNum sentScore
1 The algorithm first estimates each column of the target matrix by the scaled Lasso and then adjusts the matrix estimator to be symmetric. [sent-7, score-0.563]
2 The penalty level of the scaled Lasso for each column is completely determined by data via convex minimization, without using cross-validation. [sent-8, score-0.566]
3 In addition, the least squares estimation after the scaled Lasso selection is considered and proven to guarantee performance bounds similar to that of the scaled Lasso. [sent-12, score-0.6]
4 Keywords: precision matrix, concentration matrix, inverse matrix, graphical model, scaled Lasso, linear regression, spectrum norm 1. [sent-13, score-0.783]
5 The inverse covariance matrix is also called precision matrix or concentration matrix. [sent-16, score-0.438]
6 In such cases, the sample covariance matrix is always singular and a certain type of sparsity condition is typically imposed for proper estimation of the precision matrix and for theoretical investigation of the problem. [sent-18, score-0.471]
7 (2008) provides the convergence rate {((p + s)/n) log p}^{1/2} in the Frobenius norm and {(s/n) log p}^{1/2} in the spectrum norm, where s is the number of nonzero off-diagonal entries in the precision matrix. [sent-30, score-0.624]
8 Since the spectrum norm can be controlled via the Frobenius norm, this provides a sufficient condition (s/n) log p → 0 for the convergence to the unknown precision matrix under the spectrum norm. [sent-34, score-0.76]
9 A potentially faster rate d√((log p)/n) can be achieved by ℓ1 regularized estimation of individual columns of the precision matrix, where d, the matrix degree, is the largest number of nonzero entries in a column. [sent-36, score-0.323]
10 When the ℓ1 operator norm of the precision matrix is bounded, this method achieves the convergence rate d√((log p)/n) in ℓq matrix operator norms. [sent-38, score-0.413]
11 In this paper, we propose to apply the scaled Lasso (Sun and Zhang, 2012) column-by-column to estimate a precision matrix in the high dimensional setting. [sent-46, score-0.462]
12 Based on the connection of precision matrix estimation to linear regression, we construct a column estimator with the scaled Lasso, a joint estimator for the regression coefficients and noise level. [sent-47, score-0.812]
13 Since we only need a sample covariance matrix as input, this estimator could be extended to generate an approximate inverse of a nonnegative-definite data matrix in a more general setting. [sent-48, score-0.344]
14 For each column, the penalty level of the scaled Lasso is determined by data via convex minimization, without using cross-validation. [sent-50, score-0.533]
15 Theorem 1 Let Θ and Ω be the scaled Lasso estimators defined in (4), (7) and (9) below with penalty level λ0 = A√(4(log p)/n), A > 1, based on n iid observations from N(0, Σ∗ ). [sent-57, score-0.637]
16 Then, ‖Ω − Ω∗‖2 = OP(1) d√((log p)/n) = o(1), where ‖·‖2 is the spectrum norm (the ℓ2 matrix operator norm). [sent-59, score-0.326]
17 Theorem 1 provides a simple boundedness condition on the spectrum norm of Ω∗ for the convergence of Ω in spectrum norm with sample size n ≫ d² log p. [sent-61, score-0.71]
18 The boundedness condition on the spectrum norm of (diagΣ∗ )1/2 Θ∗ (diagΣ∗ )1/2 and the diagonal of Θ∗ is weaker than the boundedness of the ℓ1 operator norm assumed in Yuan (2010) and Cai et al. [sent-63, score-0.54]
19 When the ratio of the ℓ1 operator norm and spectrum norm of the precision matrix diverges to infinity, the proposed estimator has a faster proven convergence rate. [sent-65, score-0.665]
20 This sharper result is a direct consequence of the faster convergence rate of the scaled Lasso estimator of the noise level in linear regression. [sent-66, score-0.488]
21 An important advantage of the scaled Lasso is that the penalty level is automatically set to achieve the optimal convergence rate in the regression model for the estimation of each column of the inverse matrix. [sent-69, score-0.734]
22 This raises the possibility for the scaled Lasso to outperform methods using a single unscaled penalty level for the estimation of all columns such as the GLasso and CLIME. [sent-70, score-0.618]
23 Another contribution of this paper is to study the scaled Lasso at a smaller penalty level than those based on ℓ∞ bounds of the noise. [sent-72, score-0.569]
24 For A ≈ 1 and ε = p^{o(1)}, this penalty level is comparable to the universal penalty level √((2/n) log p). [sent-74, score-0.66]
25 However, ε = o(1/p), or equivalently λ0 ≈ √((4/n) log p), is required if the union bound is used to simultaneously control the error of p applications of the scaled Lasso in the estimation of individual columns of a precision matrix. [sent-75, score-0.577]
26 We close this gap by providing a theory based on a sparse ℓ2 measure of the noise, corresponding to a penalty level satisfying P{N(0, 1/n) > λ0 /A} = k/p with A > 1 and a potentially large k. [sent-77, score-0.349]
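Spelling out the relation between this quantile condition and the level quoted in the following sentences (a standard Gaussian-tail approximation added here for clarity, not a quotation from the paper):
$$\frac{\lambda_0}{A} \;=\; n^{-1/2}\,\Phi^{-1}\!\Big(1-\frac{k}{p}\Big) \;\approx\; \sqrt{\frac{2}{n}\,\log\frac{p}{k}},$$
so that λ0 ≈ A√((2/n) log(p/k)), which is the penalty level referred to below.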
27 This penalty level provides a faster convergence rate than the universal penalty level in linear regression when log(p/k) ≈ log(p/‖β‖0) ≪ log p. [sent-78, score-0.722]
28 Moreover, the new analysis provides a higher concentration of the error so that the same penalty level λ0 ≈ √((2/n) log(p/k)) can be used to simultaneously control the estimation error in p applications of the scaled Lasso for the estimation of a precision matrix. [sent-79, score-0.839]
29 In Section 2, we present the scaled Lasso method for the estimation of the inverse of a nonnegative-definite matrix. [sent-81, score-0.342]
30 In Section 4, we provide a theory for the Lasso and its scaled version with higher proven concentration at a smaller, practical penalty level. [sent-83, score-0.516]
31 In Section 7, we discuss the benefits of using the scaled penalty levels for the estimation of different columns of the precision matrix, compared with an optimal fixed penalty level for all columns. [sent-86, score-0.958]
32 In this section, we describe the relationship between positive-definite matrix inversion and linear regression and propose an estimator for Θ∗ via scaled Lasso, a joint convex minimization for the estimation of regression coefficients and noise level. [sent-102, score-0.607]
33 A solution to resolve these two issues is the scaled Lasso (Sun and Zhang, 2012): {β∗,j, σj} = arg min_{b,σ} { bᵀΣb/(2σ) + σ/2 + λ0 ∑_{k=1}^{p} Σ_{kk}^{1/2} |b_k| : b_j = −1 } (4), with λ0 ≈ √((2/n) log p). [sent-116, score-0.356]
34 An iterative algorithm has been provided in Sun and Zhang (2012) to compute the scaled Lasso estimator (4). [sent-123, score-0.354]
35 Based on the Lasso path β∗,j(λ), the scaled Lasso estimator {β∗,j, σj} is computed iteratively by σj² ← β∗,jᵀ Σ β∗,j, λ ← σj λ0, β∗,j ← β∗,j(λ). [sent-129, score-0.354]
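To make the iteration above concrete, here is a minimal per-column sketch in Python, assuming the columns of X have been standardized to unit sample variance so that the weighted penalty in (4) reduces to a plain Lasso penalty; the function name, starting value and stopping rule are illustrative choices, not the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import Lasso

def scaled_lasso_column(X, j, lam0, n_iter=50, tol=1e-6):
    """Scaled Lasso regression of column j of X on the remaining columns.

    Alternates the two updates quoted above: a Lasso fit at penalty
    sigma * lam0, then the noise-level update sigma^2 = ||x_j - Z b||^2 / n.
    """
    n, p = X.shape
    y = X[:, j]
    Z = np.delete(X, j, axis=1)
    sigma = np.std(y) + 1e-12          # crude initial noise level
    beta = np.zeros(p - 1)
    for _ in range(n_iter):
        # sklearn's Lasso minimizes ||y - Zb||^2/(2n) + alpha*||b||_1, which is
        # the sigma-profiled step of (4) with alpha = sigma * lam0 when the
        # columns of X are standardized (so the weights Sigma_kk^{1/2} are 1).
        fit = Lasso(alpha=sigma * lam0, fit_intercept=False, max_iter=10000)
        fit.fit(Z, y)
        beta = fit.coef_
        sigma_new = np.linalg.norm(y - Z @ beta) / np.sqrt(n)
        if abs(sigma_new - sigma) <= tol * sigma:
            sigma = sigma_new
            break
        sigma = sigma_new
    return beta, sigma
```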
36 We then simply take advantage of the relationship (2) and compute the coefficients and noise levels by the scaled Lasso for each column, with diag Θ = diag(σ_j^{-2}, j = 1, . . . , p). [sent-131, score-0.316]
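A sketch of how the p column fits could be assembled into the matrix estimate and then symmetrized; the symmetrization shown here, keeping for each pair the entry of smaller magnitude, is one common rule and is an assumption, not necessarily the exact rule in (9).

```python
import numpy as np

def precision_from_columns(betas, sigmas):
    """Assemble a precision-matrix estimate from p column regressions.

    betas[:, j] holds the coefficients of the j-th scaled Lasso regression with
    the convention betas[j, j] = -1; sigmas[j] is the estimated noise level.
    """
    p = len(sigmas)
    Theta = np.zeros((p, p))
    for j in range(p):
        Theta[:, j] = -betas[:, j] / sigmas[j] ** 2   # column j of the preliminary estimate
        Theta[j, j] = 1.0 / sigmas[j] ** 2            # diagonal entry sigma_j^{-2}
    # symmetrize: for each (j, k) keep the entry of smaller magnitude
    use_transpose = np.abs(Theta.T) < np.abs(Theta)
    return np.where(use_transpose, Theta.T, Theta)
```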
37 The estimator Ω here is a result of normalizing the precision matrix estimator by the population variances. [sent-143, score-0.467]
38 Alternatively, we may estimate the inverse correlation matrix by using the sample correlation matrix R = (diag Σ)^{-1/2} Σ (diag Σ)^{-1/2} = D^{-1/2} Σ D^{-1/2} as the data matrix. [sent-144, score-0.341]
39 Thus, in this scaled Lasso approach, the estimator based on the normalized data matrix is exactly the same as the one based on the original data matrix followed by a normalization step. [sent-155, score-0.498]
40 The scaled Lasso methodology is scale-free in the noise level, and as a result, the estimator for inverse correlation matrix is also scale free in diagonal normalization. [sent-156, score-0.577]
41 A nice property of symmetric matrices is that the spectrum norm is bounded by the ℓ1 matrix norm. [sent-164, score-0.326]
42 The ℓ1 matrix norm can be expressed more explicitly as the maximum ℓ1 norm of the columns, while the ℓ∞ matrix norm is the maximum ℓ1 norm of the rows. [sent-165, score-0.556]
43 Hence, for any symmetric matrix, the ℓ1 matrix norm is equivalent to the ℓ∞ matrix norm, and the spectrum norm can be bounded by either of them. [sent-166, score-0.501]
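A quick numerical check of these norm relations (illustrative only, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50))
A = (A + A.T) / 2.0                        # a symmetric matrix
spectrum = np.linalg.norm(A, 2)            # spectrum norm (largest singular value)
l1_norm = np.max(np.abs(A).sum(axis=0))    # matrix l1 norm = maximum column l1 norm
linf_norm = np.max(np.abs(A).sum(axis=1))  # matrix l-infinity norm = maximum row l1 norm
assert np.isclose(l1_norm, linf_norm)      # equal for symmetric matrices
assert spectrum <= l1_norm + 1e-9          # spectrum norm bounded by the l1 norm
```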
44 Since our estimators and target matrices are all symmetric, the error bound based on the spectrum norm could be studied by bounding the ℓ1 error as typically done in the existing literature. [sent-167, score-0.349]
45 Then (7) translates the resulting estimators of (6) to column estimators and thus a preliminary matrix estimator is constructed. [sent-171, score-0.331]
46 Let Θ∗ = (Σ∗ )−1 be the precision matrix as the inverse of the population covariance matrix. [sent-175, score-0.367]
47 Let Θ and Ω be their scaled Lasso estimators defined in (4), (7) and (9) with a penalty level λ0 = A√(4(log p)/n), A > 1. [sent-197, score-0.596]
48 pdf), we are able to demonstrate good numerical performance of the scaled Lasso estimator with the universal penalty level λ0 = √(2(log p)/n), compared with some existing methods, but not the larger penalty level λ0 > √(4(log p)/n) in Theorems 1 and 2. [sent-207, score-0.912]
49 We are able to provide an affirmative answer in this version of the paper by proving a higher concentration of the error of the scaled Lasso at a smaller penalty level as follows. [sent-209, score-0.591]
50 Our new analysis of the scaled Lasso allows a threshold level λ∗,0 = L_{n−3/2}(k/p) (16) with k ≍ s log(p/s), where s = 1 + max_j s∗,j. [sent-213, score-0.421]
51 Theorem 3 Let {Σ, Σ∗, Θ∗, Ω∗} be matrices as in Theorem 2, and Θ and Ω be the scaled Lasso estimators with a penalty level λ0 ≥ Aλ∗,0 where λ∗,0 = L_{n−3/2}(k/p). [sent-215, score-0.596]
52 The condition max_j s∗,j λ0 ≤ c0 on (10), which controls the capped ℓ1 sparsity of the inverse correlation matrix, weakens the ℓ0 sparsity condition d√((log p)/n) → 0. [sent-218, score-0.353]
53 In comparison, the spectrum norm condition is not only weaker than the ℓ1 operator norm condition, but also more natural for the convergence in spectrum norm. [sent-225, score-0.553]
54 Our sharper theoretical results are consequences of using the scaled Lasso estimator (4) and its fast convergence rate in linear regression. [sent-226, score-0.384]
55 In Sun and Zhang (2012), a convergence rate of order s∗ (log p)/n was established for the scaled Lasso estimation of the noise level, compared with an oracle noise level as the moment estimator based on the noise vector. [sent-227, score-0.656]
56 In the context of the column-by-column application of the scaled Lasso for precision matrix estimation, the results in Sun and Zhang (2012) can be written as |σ∗_j/σ_j − 1| ≤ C1 s∗,j λ0² and ∑_{k≠j} Σ_{kk}^{1/2} |β̂_{k,j} − β_{k,j}| / Θ∗_{jj} ≤ C2 s∗,j λ0, (17) where σ∗_j = ‖Xβ∗,j‖2/√n. [sent-228, score-0.462]
57 While the results in Sun and Zhang (2012) require a penalty level A√((2/n) log(p²)) to allow simultaneous application of (17) for all j ≤ p via the union bound in proving Theorem 2, Theorem 3 allows a smaller penalty level λ0 = Aλ∗,0 = A L_{n−3/2}(k/p) with A > 1 and a potentially large k ≍ s log(p/s). [sent-236, score-0.558]
58 Linear Regression Revisited This section provides certain new error bounds for the Lasso and scaled Lasso in the linear regression model. [sent-239, score-0.322]
59 Let λuniv = √((2/n) log p) be the universal penalty level (Donoho and Johnstone, 1994). [sent-243, score-0.381]
60 For the estimation of β and variable selection, existing theoretical results with p ≫ n typically require a penalty level λ = Aσλuniv , with A > 1, to guarantee rate optimality of regularized estimators. [sent-244, score-0.365]
61 When the (scaled) Lasso is simultaneously applied to p subproblems as in the case of matrix estimation, the new oracle inequalities allow the use of the union bound to uniformly control the estimation error in subproblems at the same penalty level. [sent-255, score-0.417]
62 To bound the effect of the noise when λ < ‖Xᵀε/n‖∞, we use a certain sparse ℓ2 norm to control the excess of Xᵀε/n over a threshold level λ∗. [sent-260, score-0.338]
63 This leads to the definition of M∗_pred = sup{ uᵀΣu/(A²(m1 + |S|)) : u ∈ U(Σ, S, B; A, A1, m, m1) } (23) as a constant factor for the prediction error of the Lasso and M∗_q = sup{ ‖u‖_q/(A(m1 + |S|)^{1/q}) : u ∈ U(Σ, S, B; A, A1, m, m1) } (24) for the ℓq estimation error of the Lasso. [sent-280, score-0.315]
64 The following theorem provides analytic error bounds for the Lasso prediction and estimation under the sparse ℓ2 norm condition (21) on the noise. [sent-281, score-0.336]
65 In the case of Gaussian error, (21) allows a fixed threshold level λ∗ = σ√((2/n) log(p/m)) to uniformly control the error of p applications of the Lasso for the estimation of a precision matrix. [sent-283, score-0.315]
66 Let A > 1, β̂ = β̂(λ) be the Lasso estimator with penalty level λ ≥ Aλ∗, h = β̂ − β, and m1 = s∗ − |S|. [sent-286, score-0.379]
67 The purpose of including a choice B in (21) is to achieve bounded {M∗_pred, M∗_1} in the presence of some highly correlated design vectors outside S ∪ B when Σ_{S∪B,(S∪B)^c} is small. [sent-297, score-0.518]
68 2 Scaled Lasso with Smaller Penalty: Analytical Bounds The scaled Lasso estimator is defined as {β, σ} = arg min_{b,σ} { ‖y − Xb‖₂²/(2nσ) + λ0‖b‖₁ + σ/2 }, (26) where λ0 > 0 is a scale-free penalty level. [sent-306, score-0.558]
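For intuition (a standard computation, not a quotation from the paper), minimizing (26) over σ for fixed b gives σ = ‖y − Xb‖₂/√n, so the profiled criterion coincides with the square-root Lasso objective:
$$\min_{\sigma>0}\Big\{\frac{\|y-Xb\|_2^2}{2n\sigma}+\frac{\sigma}{2}\Big\}=\frac{\|y-Xb\|_2}{\sqrt n},
\qquad\text{so (26) is equivalent to }\min_{b}\Big\{\frac{\|y-Xb\|_2}{\sqrt n}+\lambda_0\|b\|_1\Big\}.$$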
69 A scaled version of (19) is |S| + ∑_{j∉S} |β_j|/(σ∗λ∗,0) = s∗,0 ≤ s∗, (27) where σ∗ = ‖ε‖₂/√n is an oracle estimate of the noise level and λ∗,0 > 0 is a scaled threshold level. [sent-308, score-0.714]
70 Let {β, σ} be the scaled Lasso estimator in (26), φ1 = 1/√(1 + η0), φ2 = 1/√(1 − η0), and σ∗ = ‖ε‖₂/√n. [sent-315, score-0.354]
71 Theorem 8 gives bounds of the form ‖Xh‖₂²/n < (1 + ε0)² M∗_pred s∗ (σφ2 λ0)² together with corresponding ℓq estimation error bounds. Compared with Theorem 6, Theorem 8 requires nearly identical conditions on the design X, the noise and penalty level under proper scale. [sent-323, score-0.567]
72 Probabilistic upper bounds for the noise and consequences of their combination with Theorems 6 and 8 are discussed in Section 4. [sent-326, score-0.324]
73 Existing analyses of the Lasso and Dantzig selector can be used to find upper bounds for {M∗_pred, M∗_q, M∗_σ} via the sparse eigenvalues (Candès and Tao, 2005, 2007; Zhang and Huang, 2008; Zhang, 2009; Cai et al. [sent-329, score-0.653]
74 The following lemma provides some simple bounds used in our analysis of the scaled Lasso estimation of the precision matrix. [sent-335, score-0.512]
75 Lemma 9 provides simple upper bounds, stated as (33) and (34), for M∗_pred, M∗_pred + M∗_1 and M∗_σ in terms of c∗, A, A1 and s∗, with (34) holding when, in addition, B = {1, . . . , p}. [sent-340, score-0.518]
76 The main condition of Lemma 9, c∗ ≤ inf{ uᵀΣu/‖u_{S∪B}‖₂² : u ∈ U(Σ, S, B; A, A1, m, m1) }, (35) can be viewed as a restricted eigenvalue condition (Bickel et al. [sent-344, score-0.386]
77 For p applications of the scaled Lasso in the estimation of a precision matrix, this also provides a more practical penalty level compared with A′ Ln(ε/p²) ≈ A′√((4/n) log(p/ε^{1/2})), A′ > 1 and ε ≪ 1, based on existing results and Theorem 11. [sent-404, score-0.725]
78 Estimation after Model Selection We have presented theoretical properties of the scaled Lasso for linear regression and precision matrix estimation. [sent-421, score-0.494]
79 In this section, we extend the theory to smaller threshold level and to the estimation of precision matrix. [sent-424, score-0.315]
80 Here κ∗(m∗, S; Σ) is defined through a maximum over index sets J with J ∩ S = ∅, |J| ≤ m∗ and unit vectors uJ with ‖uJ‖₂ = 1. It is proved in Sun and Zhang (2012) that {β, σ} satisfies prediction and estimation error bounds of the same order as those for the scaled Lasso (26) under some extra conditions on κ∗(m∗, S; Σ). [sent-427, score-0.346]
81 Theorem 15 Let (β, σ) be the scaled Lasso estimator in (26) and (β^{LSE}, σ^{LSE}) be the least squares estimator (42) in the selected model S = supp(β). [sent-431, score-0.898]
82 Theorem 15 asserts that when k ∨ m∗ ≍ s∗, the least squares estimator {β^{LSE}, σ^{LSE}} after the scaled Lasso selection enjoys estimation and prediction properties comparable to those of the scaled Lasso: λ0^{-2} { |σ^{LSE}/σ∗ − 1| + ‖β^{LSE} − β‖₂² + ‖Xβ^{LSE} − Xβ‖₂²/n } + ‖β^{LSE}‖₀ = OP(1) s∗. (44) [sent-436, score-0.695]
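A minimal sketch of the refit step (selecting the support as the exactly nonzero coefficients and using an ordinary least squares solve are implementation choices, not taken from the paper):

```python
import numpy as np

def least_squares_after_selection(X, y, beta_hat):
    """Ordinary least squares refit on the support selected by the scaled Lasso."""
    n = len(y)
    S = np.flatnonzero(beta_hat)               # selected model: supp(beta_hat)
    beta_refit = np.zeros_like(beta_hat, dtype=float)
    if S.size > 0:
        coef, *_ = np.linalg.lstsq(X[:, S], y, rcond=None)
        beta_refit[S] = coef
    sigma_refit = np.linalg.norm(y - X @ beta_refit) / np.sqrt(n)
    return beta_refit, sigma_refit
```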
83 Corollary 16 Under an additional O(1) spectrum norm condition on the population correlation matrix R∗, Theorems 2 and 3 are applicable to the least squares estimators Θ^{LSE} and Ω^{LSE}, where Ω∗ = (R∗)^{-1}, with possibly different numerical constants. [sent-443, score-0.32]
84 In addition to the proposed estimators (7) and (9) based on the scaled Lasso (4) and the least squares estimation after the scaled Lasso (45), the graphical Lasso and CLIME are considered. [sent-446, score-0.441]
85 The scaled Lasso estimators are computed based on the training sample alone with penalty level λ0 = A Ln(k/p), where A = √2 and k is the solution of k = L1⁴(k/p) + 2L1²(k/p). [sent-469, score-0.596]
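A sketch of this tuning rule, assuming L_n(t) = n^{-1/2}Φ^{-1}(1 − t) (consistent with the quantile condition stated earlier) and using a damped fixed-point iteration whose convergence for typical (n, p) is an assumption of this sketch, not a result from the paper:

```python
import numpy as np
from scipy.stats import norm

def L(t, n=1.0):
    """Quantile level with P{ N(0, 1/n) > L_n(t) } = t."""
    return norm.ppf(1.0 - t) / np.sqrt(n)

def penalty_level(p, n, A=np.sqrt(2.0), n_iter=200):
    """Solve k = L_1(k/p)^4 + 2*L_1(k/p)^2 and return lambda_0 = A * L_n(k/p)."""
    k = 1.0
    for _ in range(n_iter):
        k_new = L(k / p) ** 4 + 2.0 * L(k / p) ** 2
        k = 0.5 * (k + k_new)              # damping for a stable fixed point
    return A * L(k / p, n), k

lam0, k = penalty_level(p=1000, n=200)
print(k, lam0, np.sqrt(2.0 * np.log(1000) / 200))   # compare with the universal level
```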
86 The estimation error is measured by three matrix norms: the spectrum norm, the matrix ℓ1 norm and the Frobenius norm. [sent-475, score-0.454]
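The three losses can be computed directly; a small helper for completeness (names are illustrative):

```python
import numpy as np

def error_norms(Theta_hat, Theta_star):
    """Spectrum, matrix l1 and Frobenius norms of the estimation error."""
    E = Theta_hat - Theta_star
    spectrum = np.linalg.norm(E, 2)            # largest singular value
    matrix_l1 = np.max(np.abs(E).sum(axis=0))  # maximum column l1 norm
    frobenius = np.linalg.norm(E, 'fro')
    return spectrum, matrix_l1, frobenius
```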
87 The least squares estimator after the scaled Lasso selection outperforms all estimators by a large margin in the spectrum and Frobenius losses in Models 1 and 3, but in general underperforms in the ℓ1 operator norm and in Model 2. [sent-478, score-0.671]
88 Both the scaled Lasso and the CLIME result from sparse linear regression solutions. [sent-481, score-0.327]
89 A main advantage of the scaled Lasso over the CLIME is adaptive choice of the penalty level for the estimation of each column of the precision matrix. [sent-482, score-0.758]
90 Discussion Since the scaled Lasso chooses penalty levels adaptively in the estimation of each column of the precision matrix, it is expected to outperform methods using a fixed penalty level for all columns in the presence of heterogeneity of the diagonal of the precision matrix. [sent-487, score-1.155]
91 (46) The CLIME is a symmetrization of this estimator Θ(λ) with fixed penalty level for all columns. [sent-492, score-0.418]
92 In the following example, the scaled Lasso estimator has a faster convergence rate than (46). [sent-493, score-0.384]
93 Table 1: Estimation errors under various matrix norms of scaled Lasso, GLasso and CLIME for three models. [sent-904, score-0.326]
94 (i) Let Θ be the scaled Lasso estimator of Θ∗ = (Σ∗)−1 with penalty level λ0 = A√((4/n) log p), A > 1, as in Theorem 2. [sent-911, score-0.735]
95 Thus, the order of the ℓ1 and spectrum norms of the error of (46) for the best data dependent penalty level λ is larger than that of the scaled Lasso by a factor √m. [sent-914, score-0.684]
96 This and the definition of {M∗_pred, M∗_1} yield (32) via an upper bound on ξM∗_pred + M∗_1 in terms of (1 ∨ (m/s∗))(2ξ + ξ1)²/4 and ξc∗(1 − |S|/s∗)/A²; moreover, (52) gives a lower bound on uᵀΣu in terms of c∗‖u_{S∪B}‖₂². [sent-960, score-0.562]
97 The above inequalities and (55) yield |J| ≤ κ∗(m∗, S) M∗_pred s∗ λ² / (λ − σ∗λ∗,0 − σ∗ξ1(A − 1)λ∗,0)₊² < κ∗(m∗, S) M∗_pred s∗ / (1 − 1/A − ξ1(1 − 1/A))₊² ≤ m∗. [sent-1039, score-0.549]
98 Then, P{ |R_{jk} − R∗_{jk}| > x(1 − (R∗_{jk})²) } ≤ 2P{ |tn| > n^{1/2} x }, where tn has the t-distribution with n degrees of freedom. [sent-1054, score-0.399]
99 Thus, z_{jk} = { nΣ_{kk} / ((1 − (R∗_{jk})²) Σ∗_{jj}) }^{1/2} ( Σ_{jk}/Σ_{kk} − Σ∗_{jk}/Σ∗_{kk} ) is a N(0, 1) variable independent of Σ_{kk}. [sent-1061, score-1.064]
100 Consequently, n^{1/2} |R_{jk} − R∗_{jk}| / (1 − (R∗_{jk})²)^{1/2} = |z_{jk} + z_{kj}| / ( Σ_{jj}^{1/2}/Σ∗_{jj}^{1/2} + Σ_{kk}^{1/2}/Σ∗_{kk}^{1/2} ) ≤ |t_{jk}| ∨ |t_{kj}|, with t_{jk} = z_{jk} Σ∗_{kk}^{1/2}/Σ_{kk}^{1/2} ∼ tn. [sent-1062, score-1.267]
wordName wordTfidf (topN-words)
[('lasso', 0.444), ('pred', 0.259), ('scaled', 0.254), ('clime', 0.232), ('penalty', 0.204), ('glasso', 0.185), ('ln', 0.182), ('asso', 0.174), ('nversion', 0.174), ('spectrum', 0.151), ('caled', 0.149), ('precision', 0.136), ('jk', 0.133), ('kk', 0.132), ('ut', 0.117), ('diag', 0.117), ('slasso', 0.116), ('atrix', 0.116), ('ub', 0.109), ('na', 0.108), ('norm', 0.103), ('log', 0.102), ('estimator', 0.1), ('hang', 0.095), ('parse', 0.094), ('mq', 0.091), ('xh', 0.089), ('un', 0.076), ('level', 0.075), ('zhang', 0.074), ('univ', 0.074), ('sun', 0.073), ('matrix', 0.072), ('estimators', 0.063), ('ye', 0.063), ('dantzig', 0.062), ('xb', 0.062), ('cai', 0.059), ('population', 0.059), ('concentration', 0.058), ('selector', 0.058), ('estimation', 0.056), ('boundedness', 0.055), ('theorem', 0.055), ('proposition', 0.055), ('oracle', 0.054), ('frobenius', 0.05), ('inverse', 0.05), ('covariance', 0.05), ('zb', 0.049), ('threshold', 0.048), ('mlse', 0.046), ('capped', 0.045), ('condition', 0.045), ('correlation', 0.044), ('max', 0.044), ('supp', 0.042), ('excess', 0.042), ('sparse', 0.041), ('ps', 0.041), ('iid', 0.041), ('yuan', 0.04), ('sparsity', 0.04), ('uj', 0.04), ('theorems', 0.039), ('symmetrization', 0.039), ('bc', 0.038), ('ii', 0.038), ('inf', 0.037), ('cand', 0.036), ('jj', 0.036), ('bounds', 0.036), ('aln', 0.035), ('hsc', 0.035), ('mln', 0.035), ('tingni', 0.035), ('geer', 0.033), ('column', 0.033), ('inversion', 0.032), ('regression', 0.032), ('target', 0.032), ('invertibility', 0.031), ('asserts', 0.031), ('graphical', 0.031), ('inequalities', 0.031), ('bickel', 0.03), ('suppose', 0.03), ('let', 0.03), ('rothman', 0.03), ('lse', 0.03), ('raskutti', 0.03), ('lemma', 0.03), ('rate', 0.03), ('columns', 0.029), ('noise', 0.029), ('alt', 0.029), ('op', 0.029), ('sm', 0.029), ('satisfying', 0.029), ('diagonal', 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso
Author: Tingni Sun, Cun-Hui Zhang
Abstract: We propose a new method of learning a sparse nonnegative-definite target matrix. Our primary example of the target matrix is the inverse of a population covariance or correlation matrix. The algorithm first estimates each column of the target matrix by the scaled Lasso and then adjusts the matrix estimator to be symmetric. The penalty level of the scaled Lasso for each column is completely determined by data via convex minimization, without using cross-validation. We prove that this scaled Lasso method guarantees the fastest proven rate of convergence in the spectrum norm under conditions of weaker form than those in the existing analyses of other ℓ1 regularized algorithms, and has faster guaranteed rate of convergence when the ratio of the ℓ1 and spectrum norms of the target inverse matrix diverges to infinity. A simulation study demonstrates the computational feasibility and superb performance of the proposed method. Our analysis also provides new performance bounds for the Lasso and scaled Lasso to guarantee higher concentration of the error at a smaller threshold level than previous analyses, and to allow the use of the union bound in column-by-column applications of the scaled Lasso without an adjustment of the penalty level. In addition, the least squares estimation after the scaled Lasso selection is considered and proven to guarantee performance bounds similar to that of the scaled Lasso. Keywords: precision matrix, concentration matrix, inverse matrix, graphical model, scaled Lasso, linear regression, spectrum norm
2 0.18443657 27 jmlr-2013-Consistent Selection of Tuning Parameters via Variable Selection Stability
Author: Wei Sun, Junhui Wang, Yixin Fang
Abstract: Penalized regression models are popularly used in high-dimensional data analysis to conduct variable selection and model fitting simultaneously. Whereas success has been widely reported in literature, their performances largely depend on the tuning parameters that balance the trade-off between model fitting and model sparsity. Existing tuning criteria mainly follow the route of minimizing the estimated prediction error or maximizing the posterior model probability, such as cross validation, AIC and BIC. This article introduces a general tuning parameter selection criterion based on variable selection stability. The key idea is to select the tuning parameters so that the resultant penalized regression model is stable in variable selection. The asymptotic selection consistency is established for both fixed and diverging dimensions. Its effectiveness is also demonstrated in a variety of simulated examples as well as an application to the prostate cancer data. Keywords: kappa coefficient, penalized regression, selection consistency, stability, tuning
3 0.11222787 114 jmlr-2013-The Rate of Convergence of AdaBoost
Author: Indraneel Mukherjee, Cynthia Rudin, Robert E. Schapire
Abstract: The AdaBoost algorithm was designed to combine many “weak” hypotheses that perform slightly better than random guessing into a “strong” hypothesis that has very low error. We study the rate at which AdaBoost iteratively converges to the minimum of the “exponential loss.” Unlike previous work, our proofs do not require a weak-learning assumption, nor do they require that minimizers of the exponential loss are finite. Our first result shows that the exponential loss of AdaBoost’s computed parameter vector will be at most ε more than that of any parameter vector of ℓ1 -norm bounded by B in a number of rounds that is at most a polynomial in B and 1/ε. We also provide lower bounds showing that a polynomial dependence is necessary. Our second result is that within C/ε iterations, AdaBoost achieves a value of the exponential loss that is at most ε more than the best possible value, where C depends on the data set. We show that this dependence of the rate on ε is optimal up to constant factors, that is, at least Ω(1/ε) rounds are necessary to achieve within ε of the optimal exponential loss. Keywords: AdaBoost, optimization, coordinate descent, convergence rate
4 0.10124438 104 jmlr-2013-Sparse Single-Index Model
Author: Pierre Alquier, Gérard Biau
Abstract: Let (X,Y ) be a random pair taking values in R p × R. In the so-called single-index model, one has Y = f ⋆ (θ⋆T X) +W , where f ⋆ is an unknown univariate measurable function, θ⋆ is an unknown vector in Rd , and W denotes a random noise satisfying E[W |X] = 0. The single-index model is known to offer a flexible way to model a variety of high-dimensional real-world phenomena. However, despite its relative simplicity, this dimension reduction scheme is faced with severe complications as soon as the underlying dimension becomes larger than the number of observations (“p larger than n” paradigm). To circumvent this difficulty, we consider the single-index model estimation problem from a sparsity perspective using a PAC-Bayesian approach. On the theoretical side, we offer a sharp oracle inequality, which is more powerful than the best known oracle inequalities for other common procedures of single-index recovery. The proposed method is implemented by means of the reversible jump Markov chain Monte Carlo technique and its performance is compared with that of standard procedures. Keywords: single-index model, sparsity, regression estimation, PAC-Bayesian, oracle inequality, reversible jump Markov chain Monte Carlo method
5 0.095658153 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
Author: Tony Cai, Wen-Xin Zhou
Abstract: We consider in this paper the problem of noisy 1-bit matrix completion under a general non-uniform sampling distribution using the max-norm as a convex relaxation for the rank. A max-norm constrained maximum likelihood estimate is introduced and studied. The rate of convergence for the estimate is obtained. Information-theoretical methods are used to establish a minimax lower bound under the general sampling model. The minimax upper and lower bounds together yield the optimal rate of convergence for the Frobenius norm loss. Computational algorithms and numerical performance are also discussed. Keywords: 1-bit matrix completion, low-rank matrix, max-norm, trace-norm, constrained optimization, maximum likelihood estimate, optimal rate of convergence
6 0.095488034 106 jmlr-2013-Stationary-Sparse Causality Network Learning
7 0.093896009 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning
8 0.09196604 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression
9 0.09080667 111 jmlr-2013-Supervised Feature Selection in Graphs with Path Coding Penalties and Network Flows
10 0.090095922 119 jmlr-2013-Variable Selection in High-Dimension with Random Designs and Orthogonal Matching Pursuit
11 0.085158519 35 jmlr-2013-Distribution-Dependent Sample Complexity of Large Margin Learning
12 0.083615817 101 jmlr-2013-Sparse Activity and Sparse Connectivity in Supervised Learning
13 0.079401754 20 jmlr-2013-CODA: High Dimensional Copula Discriminant Analysis
14 0.079078652 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation
15 0.071651272 97 jmlr-2013-Risk Bounds of Learning Processes for Lévy Processes
16 0.06882266 36 jmlr-2013-Distributions of Angles in Random Packing on Spheres
17 0.066890836 76 jmlr-2013-Nonparametric Sparsity and Regularization
18 0.059612632 103 jmlr-2013-Sparse Robust Estimation and Kalman Smoothing with Nonsmooth Log-Concave Densities: Modeling, Computation, and Theory
19 0.057109442 6 jmlr-2013-A Plug-in Approach to Neyman-Pearson Classification
20 0.053793032 116 jmlr-2013-Truncated Power Method for Sparse Eigenvalue Problems
topicId topicWeight
[(0, -0.281), (1, 0.078), (2, 0.109), (3, 0.146), (4, 0.187), (5, -0.191), (6, 0.049), (7, -0.099), (8, 0.3), (9, 0.049), (10, -0.012), (11, -0.043), (12, -0.159), (13, 0.121), (14, 0.013), (15, -0.134), (16, 0.054), (17, 0.072), (18, 0.101), (19, -0.091), (20, 0.041), (21, -0.003), (22, 0.165), (23, -0.023), (24, 0.035), (25, -0.06), (26, -0.037), (27, -0.08), (28, -0.007), (29, -0.087), (30, -0.075), (31, 0.05), (32, -0.018), (33, 0.068), (34, -0.0), (35, 0.003), (36, -0.013), (37, 0.075), (38, -0.015), (39, -0.014), (40, -0.019), (41, 0.037), (42, 0.036), (43, 0.001), (44, 0.009), (45, 0.033), (46, 0.009), (47, 0.045), (48, -0.029), (49, 0.004)]
simIndex simValue paperId paperTitle
same-paper 1 0.95761782 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso
Author: Tingni Sun, Cun-Hui Zhang
Abstract: We propose a new method of learning a sparse nonnegative-definite target matrix. Our primary example of the target matrix is the inverse of a population covariance or correlation matrix. The algorithm first estimates each column of the target matrix by the scaled Lasso and then adjusts the matrix estimator to be symmetric. The penalty level of the scaled Lasso for each column is completely determined by data via convex minimization, without using cross-validation. We prove that this scaled Lasso method guarantees the fastest proven rate of convergence in the spectrum norm under conditions of weaker form than those in the existing analyses of other ℓ1 regularized algorithms, and has faster guaranteed rate of convergence when the ratio of the ℓ1 and spectrum norms of the target inverse matrix diverges to infinity. A simulation study demonstrates the computational feasibility and superb performance of the proposed method. Our analysis also provides new performance bounds for the Lasso and scaled Lasso to guarantee higher concentration of the error at a smaller threshold level than previous analyses, and to allow the use of the union bound in column-by-column applications of the scaled Lasso without an adjustment of the penalty level. In addition, the least squares estimation after the scaled Lasso selection is considered and proven to guarantee performance bounds similar to that of the scaled Lasso. Keywords: precision matrix, concentration matrix, inverse matrix, graphical model, scaled Lasso, linear regression, spectrum norm
2 0.7657631 27 jmlr-2013-Consistent Selection of Tuning Parameters via Variable Selection Stability
Author: Wei Sun, Junhui Wang, Yixin Fang
Abstract: Penalized regression models are popularly used in high-dimensional data analysis to conduct variable selection and model fitting simultaneously. Whereas success has been widely reported in literature, their performances largely depend on the tuning parameters that balance the trade-off between model fitting and model sparsity. Existing tuning criteria mainly follow the route of minimizing the estimated prediction error or maximizing the posterior model probability, such as cross validation, AIC and BIC. This article introduces a general tuning parameter selection criterion based on variable selection stability. The key idea is to select the tuning parameters so that the resultant penalized regression model is stable in variable selection. The asymptotic selection consistency is established for both fixed and diverging dimensions. Its effectiveness is also demonstrated in a variety of simulated examples as well as an application to the prostate cancer data. Keywords: kappa coefficient, penalized regression, selection consistency, stability, tuning
3 0.60783958 106 jmlr-2013-Stationary-Sparse Causality Network Learning
Author: Yuejia He, Yiyuan She, Dapeng Wu
Abstract: Recently, researchers have proposed penalized maximum likelihood to identify network topology underlying a dynamical system modeled by multivariate time series. The time series of interest are assumed to be stationary, but this restriction is never taken into consideration by existing estimation methods. Moreover, practical problems of interest may have ultra-high dimensionality and obvious node collinearity. In addition, none of the available algorithms provides a probabilistic measure of the uncertainty for the obtained network topology which is informative in reliable network identification. The main purpose of this paper is to tackle these challenging issues. We propose the S2 learning framework, which stands for stationary-sparse network learning. We propose a novel algorithm referred to as the Berhu iterative sparsity pursuit with stationarity (BISPS), where the Berhu regularization can improve the Lasso in detection and estimation. The algorithm is extremely easy to implement, efficient in computation and has a theoretical guarantee to converge to a global optimum. We also incorporate a screening technique into BISPS to tackle ultra-high dimensional problems and enhance computational efficiency. Furthermore, a stationary bootstrap technique is applied to provide connection occurring frequency for reliable topology learning. Experiments show that our method can achieve stationary and sparse causality network learning and is scalable for high-dimensional problems. Keywords: stationarity, sparsity, Berhu, screening, bootstrap
4 0.54635251 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning
Author: Pinghua Gong, Jieping Ye, Changshui Zhang
Abstract: Multi-task sparse feature learning aims to improve the generalization performance by exploiting the shared features among tasks. It has been successfully applied to many applications including computer vision and biomedical informatics. Most of the existing multi-task sparse feature learning algorithms are formulated as a convex sparse regularization problem, which is usually suboptimal, due to its looseness for approximating an ℓ0 -type regularizer. In this paper, we propose a non-convex formulation for multi-task sparse feature learning based on a novel non-convex regularizer. To solve the non-convex optimization problem, we propose a Multi-Stage Multi-Task Feature Learning (MSMTFL) algorithm; we also provide intuitive interpretations, detailed convergence and reproducibility analysis for the proposed algorithm. Moreover, we present a detailed theoretical analysis showing that MSMTFL achieves a better parameter estimation error bound than the convex formulation. Empirical studies on both synthetic and real-world data sets demonstrate the effectiveness of MSMTFL in comparison with the state of the art multi-task sparse feature learning algorithms. Keywords: multi-task learning, multi-stage, non-convex, sparse learning
5 0.52546299 104 jmlr-2013-Sparse Single-Index Model
Author: Pierre Alquier, Gérard Biau
Abstract: Let (X,Y ) be a random pair taking values in R p × R. In the so-called single-index model, one has Y = f ⋆ (θ⋆T X) +W , where f ⋆ is an unknown univariate measurable function, θ⋆ is an unknown vector in Rd , and W denotes a random noise satisfying E[W |X] = 0. The single-index model is known to offer a flexible way to model a variety of high-dimensional real-world phenomena. However, despite its relative simplicity, this dimension reduction scheme is faced with severe complications as soon as the underlying dimension becomes larger than the number of observations (“p larger than n” paradigm). To circumvent this difficulty, we consider the single-index model estimation problem from a sparsity perspective using a PAC-Bayesian approach. On the theoretical side, we offer a sharp oracle inequality, which is more powerful than the best known oracle inequalities for other common procedures of single-index recovery. The proposed method is implemented by means of the reversible jump Markov chain Monte Carlo technique and its performance is compared with that of standard procedures. Keywords: single-index model, sparsity, regression estimation, PAC-Bayesian, oracle inequality, reversible jump Markov chain Monte Carlo method
6 0.48071149 119 jmlr-2013-Variable Selection in High-Dimension with Random Designs and Orthogonal Matching Pursuit
7 0.46381864 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
8 0.45415246 35 jmlr-2013-Distribution-Dependent Sample Complexity of Large Margin Learning
9 0.4211185 101 jmlr-2013-Sparse Activity and Sparse Connectivity in Supervised Learning
10 0.41473851 9 jmlr-2013-A Widely Applicable Bayesian Information Criterion
11 0.41443411 51 jmlr-2013-Greedy Sparsity-Constrained Optimization
12 0.41361505 114 jmlr-2013-The Rate of Convergence of AdaBoost
13 0.40691224 76 jmlr-2013-Nonparametric Sparsity and Regularization
14 0.39728299 36 jmlr-2013-Distributions of Angles in Random Packing on Spheres
15 0.3928906 20 jmlr-2013-CODA: High Dimensional Copula Discriminant Analysis
16 0.3788029 23 jmlr-2013-Cluster Analysis: Unsupervised Learning via Supervised Learning with a Non-convex Penalty
17 0.3767978 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation
18 0.37485066 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression
19 0.36899623 111 jmlr-2013-Supervised Feature Selection in Graphs with Path Coding Penalties and Network Flows
20 0.35535991 97 jmlr-2013-Risk Bounds of Learning Processes for Lévy Processes
topicId topicWeight
[(0, 0.03), (5, 0.129), (6, 0.033), (10, 0.064), (14, 0.011), (20, 0.021), (23, 0.052), (51, 0.316), (68, 0.043), (70, 0.046), (75, 0.029), (85, 0.07), (87, 0.046)]
simIndex simValue paperId paperTitle
same-paper 1 0.71907306 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso
Author: Tingni Sun, Cun-Hui Zhang
Abstract: We propose a new method of learning a sparse nonnegative-definite target matrix. Our primary example of the target matrix is the inverse of a population covariance or correlation matrix. The algorithm first estimates each column of the target matrix by the scaled Lasso and then adjusts the matrix estimator to be symmetric. The penalty level of the scaled Lasso for each column is completely determined by data via convex minimization, without using cross-validation. We prove that this scaled Lasso method guarantees the fastest proven rate of convergence in the spectrum norm under conditions of weaker form than those in the existing analyses of other ℓ1 regularized algorithms, and has faster guaranteed rate of convergence when the ratio of the ℓ1 and spectrum norms of the target inverse matrix diverges to infinity. A simulation study demonstrates the computational feasibility and superb performance of the proposed method. Our analysis also provides new performance bounds for the Lasso and scaled Lasso to guarantee higher concentration of the error at a smaller threshold level than previous analyses, and to allow the use of the union bound in column-by-column applications of the scaled Lasso without an adjustment of the penalty level. In addition, the least squares estimation after the scaled Lasso selection is considered and proven to guarantee performance bounds similar to that of the scaled Lasso. Keywords: precision matrix, concentration matrix, inverse matrix, graphical model, scaled Lasso, linear regression, spectrum norm
2 0.49367994 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression
Author: Sébastien Gerchinovitz
Abstract: We consider the problem of online linear regression on arbitrary deterministic sequences when the ambient dimension d can be much larger than the number of time rounds T . We introduce the notion of sparsity regret bound, which is a deterministic online counterpart of recent risk bounds derived in the stochastic setting under a sparsity scenario. We prove such regret bounds for an online-learning algorithm called SeqSEW and based on exponential weighting and data-driven truncation. In a second part we apply a parameter-free version of this algorithm to the stochastic setting (regression model with random design). This yields risk bounds of the same flavor as in Dalalyan and Tsybakov (2012a) but which solve two questions left open therein. In particular our risk bounds are adaptive (up to a logarithmic factor) to the unknown variance of the noise if the latter is Gaussian. We also address the regression model with fixed design. Keywords: sparsity, online linear regression, individual sequences, adaptive regret bounds
3 0.48364568 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization
Author: Yuchen Zhang, John C. Duchi, Martin J. Wainwright
Abstract: We analyze two communication-efficient algorithms for distributed optimization in statistical settings involving large-scale data sets. The first algorithm is a standard averaging method that distributes the N data samples evenly to m machines, performs separate minimization on each subset, and then averages the estimates. We provide a sharp analysis of this average mixture algorithm, showing that under a reasonable set of conditions, the combined parameter achieves √ mean-squared error (MSE) that decays as O (N −1 + (N/m)−2 ). Whenever m ≤ N, this guarantee matches the best possible rate achievable by a centralized algorithm having access to all N samples. The second algorithm is a novel method, based on an appropriate form of bootstrap subsampling. Requiring only a single round of communication, it has mean-squared error that decays as O (N −1 + (N/m)−3 ), and so is more robust to the amount of parallelization. In addition, we show that a stochastic gradient-based method attains mean-squared error decaying as O (N −1 + (N/m)−3/2 ), easing computation at the expense of a potentially slower MSE rate. We also provide an experimental evaluation of our methods, investigating their performance both on simulated data and on a large-scale regression problem from the internet search domain. In particular, we show that our methods can be used to efficiently solve an advertisement prediction problem from the Chinese SoSo Search Engine, which involves logistic regression with N ≈ 2.4 × 108 samples and d ≈ 740,000 covariates. Keywords: distributed learning, stochastic optimization, averaging, subsampling
4 0.47334471 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems
Author: Takafumi Kanamori, Akiko Takeda, Taiji Suzuki
Abstract: There are two main approaches to binary classiÄ?Ĺš cation problems: the loss function approach and the uncertainty set approach. The loss function approach is widely used in real-world data analysis. Statistical decision theory has been used to elucidate its properties such as statistical consistency. Conditional probabilities can also be estimated by using the minimum solution of the loss function. In the uncertainty set approach, an uncertainty set is deÄ?Ĺš ned for each binary label from training samples. The best separating hyperplane between the two uncertainty sets is used as the decision function. Although the uncertainty set approach provides an intuitive understanding of learning algorithms, its statistical properties have not been sufÄ?Ĺš ciently studied. In this paper, we show that the uncertainty set is deeply connected with the convex conjugate of a loss function. On the basis of the conjugate relation, we propose a way of revising the uncertainty set approach so that it will have good statistical properties such as statistical consistency. We also introduce statistical models corresponding to uncertainty sets in order to estimate conditional probabilities. Finally, we present numerical experiments, verifying that the learning with revised uncertainty sets improves the prediction accuracy. Keywords: loss function, uncertainty set, convex conjugate, consistency
5 0.4710364 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
Author: Tony Cai, Wen-Xin Zhou
Abstract: We consider in this paper the problem of noisy 1-bit matrix completion under a general non-uniform sampling distribution using the max-norm as a convex relaxation for the rank. A max-norm constrained maximum likelihood estimate is introduced and studied. The rate of convergence for the estimate is obtained. Information-theoretical methods are used to establish a minimax lower bound under the general sampling model. The minimax upper and lower bounds together yield the optimal rate of convergence for the Frobenius norm loss. Computational algorithms and numerical performance are also discussed. Keywords: 1-bit matrix completion, low-rank matrix, max-norm, trace-norm, constrained optimization, maximum likelihood estimate, optimal rate of convergence
6 0.46895033 39 jmlr-2013-Efficient Active Learning of Halfspaces: An Aggressive Approach
7 0.46810952 14 jmlr-2013-Asymptotic Results on Adaptive False Discovery Rate Controlling Procedures Based on Kernel Estimators
9 0.46453857 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning
10 0.46202314 9 jmlr-2013-A Widely Applicable Bayesian Information Criterion
11 0.46154198 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation
12 0.45976633 50 jmlr-2013-Greedy Feature Selection for Subspace Clustering
14 0.45911607 10 jmlr-2013-Algorithms and Hardness Results for Parallel Large Margin Learning
15 0.45768228 68 jmlr-2013-Machine Learning with Operational Costs
16 0.45736286 28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning
17 0.45624328 5 jmlr-2013-A Near-Optimal Algorithm for Differentially-Private Principal Components
18 0.45618832 17 jmlr-2013-Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees
19 0.45538864 18 jmlr-2013-Beyond Fano's Inequality: Bounds on the Optimal F-Score, BER, and Cost-Sensitive Risk and Their Implications
20 0.45463797 107 jmlr-2013-Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization