jmlr jmlr2011 jmlr2011-39 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Shuheng Zhou, Philipp Rütimann, Min Xu, Peter Bühlmann
Abstract: Undirected graphs are often used to describe high dimensional distributions. Under sparsity conditions, the graph can be estimated using ℓ1-penalization methods. We propose and study the following method. We combine a multiple regression approach with ideas of thresholding and refitting: first we infer a sparse undirected graphical model structure via thresholding of each among many ℓ1-norm penalized regression functions; we then estimate the covariance matrix and its inverse using the maximum likelihood estimator. We show that under suitable conditions, this approach yields consistent estimation in terms of graphical structure and fast convergence rates with respect to the operator and Frobenius norm for the covariance matrix and its inverse. We also derive an explicit bound for the Kullback-Leibler divergence. Keywords: graphical model selection, covariance estimation, Lasso, nodewise regression, thresholding. © 2011 Shuheng Zhou, Philipp Rütimann, Min Xu and Peter Bühlmann.
Reference: text
sentIndex sentText sentNum sentScore
1 We show that under suitable conditions, this approach yields consistent estimation in terms of graphical structure and fast convergence rates with respect to the operator and Frobenius norm for the covariance matrix and its inverse. [sent-14, score-0.209]
2 Keywords: graphical model selection, covariance estimation, Lasso, nodewise regression, thresholding. © 2011 Shuheng Zhou, Philipp Rütimann, Min Xu and Peter Bühlmann. [sent-16, score-0.334]
3 Introduction There has been a lot of recent activity on the estimation of high-dimensional covariance and inverse covariance matrices, where the dimension p of the matrix may greatly exceed the sample size n. [sent-18, score-0.253]
4 We focus here on the latter class with permutation invariant estimation and we aim for an estimator which is accurate for both the covariance matrix Σ and its inverse, the precision matrix Σ−1 . [sent-26, score-0.234]
5 The GLasso approach is simple to use, at least when relying on publicly available software such as the glasso package in R. [sent-30, score-0.309]
6 We follow the nodewise regression approach from Meinshausen and Bühlmann [2006], but we make use of recent results for variable selection in linear models assuming the much weaker restricted eigenvalue condition [Bickel et al. [sent-51, score-0.209]
7 In some sense, the novelty of our theory beyond Zhou [2010a] is the analysis of covariance and inverse covariance estimation and of risk consistency based on an estimated sparse graph, as mentioned above. [sent-53, score-0.302]
8 Our regression and thresholding results build upon analysis of the thresholded Lasso estimator as studied in Zhou [2010a]. [sent-54, score-0.212]
9 Another relation can be made to the method by Rütimann and Bühlmann [2009] for covariance and inverse covariance estimation based on a directed acyclic graph. [sent-63, score-0.297]
10 For example, if all non-zero off-diagonal elements θ_{0,ij} of the ith row are larger in absolute value than λ√θ_{0,ii}, the value s^i_{0,n} coincides with the node degree s^i. [sent-134, score-0.248]
11 We first estimate the undirected graph with edge set E_0 as in (6), and we then use the maximum likelihood estimator based on the estimate E_n; that is, the non-zero elements of Θ_n correspond to the estimated edges in E_n. [sent-139, score-0.19]
12 Inferring the edge set E_0 can be based on the following approach, as proposed and theoretically justified in Meinshausen and Bühlmann [2006]: perform p regressions using the Lasso to obtain p vectors of regression coefficients β^1, . [sent-140, score-0.197]
13 Consider the Lasso for each of the nodewise regressions: β^i_init = argmin_{β^i} ∑_{r=1}^n ( X_i^{(r)} − ∑_{j≠i} β^i_j X_j^{(r)} )² + λ_n ∑_{j≠i} |β^i_j|, for i = 1, . [sent-150, score-0.301]
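A minimal sketch of this nodewise Lasso step, using scikit-learn (an assumption of this illustration, not the authors' implementation; the function name nodewise_lasso and the tuning value are likewise illustrative). Note that scikit-learn's Lasso rescales the squared error by 1/(2n), so its alpha corresponds to λ_n/(2n) in (10).

```python
# Sketch of the nodewise Lasso regressions in (10); scikit-learn is an assumption here.
import numpy as np
from sklearn.linear_model import Lasso

def nodewise_lasso(X, lam_n):
    """Regress each X_i on the remaining variables; return B with B[i, j] = beta^i_{j,init}."""
    n, p = X.shape
    B = np.zeros((p, p))
    for i in range(p):
        others = [j for j in range(p) if j != i]
        # scikit-learn minimizes (1/(2n))||y - Xw||^2 + alpha ||w||_1,
        # so alpha = lam_n / (2n) matches the unnormalized objective in (10).
        fit = Lasso(alpha=lam_n / (2.0 * n), fit_intercept=False).fit(X[:, others], X[:, i])
        B[i, others] = fit.coef_
    return B

rng = np.random.default_rng(0)
n, p = 200, 30
X = rng.standard_normal((n, p))
lam_n = 2.0 * n * np.sqrt(2.0 * np.log(p) / n)   # heuristic of the order suggested by the theory
B_init = nodewise_lasso(X, lam_n)
```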
14 The use of thresholding has clear benefits from a theoretical point of view: the number of false positive selections may be much larger without thresholding (when tuned for good prediction). [sent-157, score-0.186]
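Continuing the sketch above, the thresholding step and the construction of the estimated edge set can be written as follows (the threshold t_0 = f_0·λ and the OR rule for symmetrizing are illustrative choices of this sketch; an AND rule would replace | with &).

```python
# Sketch: keep coefficients with |beta^i_{j,init}| >= tau, then symmetrize to get an edge set.
import numpy as np

def edge_set_from_coefficients(B_init, tau):
    p = B_init.shape[0]
    selected = np.abs(B_init) >= tau      # I^i = {j : |beta^i_{j,init}| >= tau}
    adjacency = selected | selected.T     # include edge (i, j) if either regression selects it
    np.fill_diagonal(adjacency, False)
    return adjacency

tau = 0.5 * np.sqrt(2.0 * np.log(p) / n)  # illustrative t_0 = f_0 * lambda with f_0 = 0.5
E_hat = edge_set_from_coefficients(B_init, tau)
```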
15 The estimator for the concentration matrix, in view of (1), is: Θ_n(E) = argmin_{Θ ∈ M_{p,E}} ( tr(ΘΓ_n) − log|Θ| ), where M_{p,E} = { Θ ∈ R^{p×p} : Θ ≻ 0 and θ_{ij} = 0 for all (i,j) ∉ E with i ≠ j } (13) defines the constrained set of positive definite Θ. [sent-165, score-0.217]
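The refitting step (13) is a convex problem; one hedged way to solve it is with cvxpy (an assumption of this sketch — the paper does not prescribe a solver), constraining the off-diagonal entries outside the estimated edge set to zero.

```python
# Sketch of the constrained MLE in (13), reusing X and E_hat from the snippets above.
import numpy as np
import cvxpy as cp

def constrained_mle(Gamma_n, adjacency):
    p = Gamma_n.shape[0]
    Theta = cp.Variable((p, p), symmetric=True)
    constraints = [Theta >> 1e-8 * np.eye(p)]           # positive definiteness
    for i in range(p):
        for j in range(i + 1, p):
            if not adjacency[i, j]:
                constraints.append(Theta[i, j] == 0)    # theta_ij = 0 for (i, j) not in E
    objective = cp.Minimize(cp.trace(Theta @ Gamma_n) - cp.log_det(Theta))
    cp.Problem(objective, constraints).solve()
    return Theta.value

Gamma_n = np.corrcoef(X, rowvar=False)   # sample correlation matrix used as input (illustrative)
Theta_hat = constrained_mle(Gamma_n, E_hat)
Sigma_hat = np.linalg.inv(Theta_hat)
```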
16 If the edge set E is sparse, having relatively few edges only, the estimator in (13) is already sufficiently regularized by the constraints; hence, no additional penalization is used at this stage. [sent-175, score-0.214]
17 Given (8), (7), and Θ_0, define C²_diag := min{ max_{i=1,...,p} θ²_{0,ii}, (max_i s^i_{0,n}/S_{0,n}) · ‖diag(Θ_0)‖²_2 }. [sent-209, score-0.314]
18 We note that convergence rates for the estimated covariance matrix and for predictive risk depend on the rate in Frobenius norm of the estimated inverse covariance matrix. [sent-224, score-0.279]
19 In this case, we have ‖Θ_n − Θ_0‖_F = O_P( √((p + S_{0,n}) log max(n,p)/n) ), ‖Σ_n − Σ_0‖_F = O_P( √((p + S_{0,n}) log max(n,p)/n) ), and R(Θ_n) − R(Θ_0) = O_P( (p + S_{0,n}) log max(n,p)/n ). [sent-262, score-0.216]
20 It is also of interest to understand the bias of the estimator caused by using the estimated edge set En instead of the true edge set E0 . [sent-266, score-0.203]
21 Although we do not intend to make precise the exact conditions and choices of tuning parameters in regression and thresholding needed to achieve (21), we state Theorem 5 for the case that (21) holds together with the following condition: the number of false positives is bounded as |E \ E_0| = O(S). [sent-283, score-0.183]
22 Suppose that on some event E with P(E) ≥ 1 − d/p², for a small constant d, we obtain an edge set E such that E_0 ⊆ E and |E \ E_0| = O(S). [sent-286, score-0.215]
23 First, we note that to obtain exact edge recovery, E = E_0, with high probability, we again need sufficiently large non-zero edge weights and some restricted eigenvalue (RE) conditions on the covariance matrix as defined in Section A, even for the multi-stage procedure. [sent-295, score-0.227]
24 [2009], where the second-stage estimator β corresponding to (11) is obtained with nodewise regressions using the adaptive Lasso [Zou, 2006] rather than thresholding as in the present work, in order to recover the edge set E_0 with high probability under an assumption which is stronger than (A0). [sent-297, score-0.41]
25 We first prove an oracle result on the nodewise regressions in Theorem 15. [sent-324, score-0.33]
26 , p, for thresholding our initial regression coefficients β^i_init. [sent-330, score-0.228]
27 Hence the entries in the covariance matrix Σn for the chosen set of edges in E and the diagonal entries are set to their corresponding values in Sn . [sent-357, score-0.228]
28 Indeed, we can derive the above relationships via the Lagrangian form, where we add Lagrange constants γ_jk for edges in D: ℓ_C(Θ) = log|Θ| − tr(S_n Θ) − ∑_{(j,k)∈D} γ_jk θ_jk. (26) [sent-358, score-0.419]
29 Now the gradient equation of (26) is Θ^{-1} − S_n − Γ = 0, where Γ is a matrix of Lagrange parameters such that γ_jk may be non-zero for (j,k) ∈ D and γ_jk = 0 otherwise. [sent-359, score-0.23]
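A quick numerical sanity check of these stationarity relations, reusing the hypothetical constrained_mle output from the earlier sketch: the inverse of the fitted concentration matrix should agree with the input matrix on the diagonal and on the selected edges, while the fitted concentration matrix is (numerically) zero elsewhere.

```python
# Sketch: verify the gradient/KKT relations of the constrained MLE numerically.
import numpy as np

p = Theta_hat.shape[0]
Sigma_fit = np.linalg.inv(Theta_hat)
keep = E_hat | np.eye(p, dtype=bool)                    # selected edges plus the diagonal
print("max |Sigma_fit - Gamma_n| on E and diagonal:", np.abs(Sigma_fit - Gamma_n)[keep].max())
print("max |Theta_hat| outside E (should be ~0):   ", np.abs(Theta_hat)[~keep].max())
```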
30 Subject to the constraint that Ω ∈ S_n as defined in (28), we write the maximum likelihood estimate for Ω_0 as Ω_n(E) := arg min_{Ω ∈ S_n} R_n(Ω) = arg min_{Ω ∈ S^p_{++} ∩ S^p_E} ( tr(ΩΓ_n) − log|Ω| ), (29) which yields the following relationships regarding Ω_n(E), its inverse Ψ_n = (Ω_n(E))^{-1}, and Γ_n. [sent-390, score-0.261]
31 Let E be some event such that P(E) ≥ 1 − d/p² for a small constant d. [sent-402, score-0.332]
32 (30) Let Ω_n(E) be as defined in (29). Suppose the sample size satisfies, for C_3 ≥ 4√(5/3): n > (144σ⁴_max/M²_low) (4C_3 + 13M_upp/(12σ²_min))² max{ 2|E| log max(n,p), C²_bias 2S_{0,n} log p }. [sent-408, score-0.415]
33 (31) Then with probability ≥ 1 − (d+1)/p², we have, for M = (9σ⁴_max/(2k²)) · (4C_3 + 13M_upp/(12σ²_min)): ‖Ω_n(E) − Ω_0‖_F ≤ (M+1) max{ √(2|E| log max(n,p)/n), C_bias √(2S_{0,n} log(p)/n) }. [sent-409, score-0.343]
34 We compare our estimation method with the GLasso, the Space method and a simplified Gelato estimator without thresholding for inferring the conditional independence graph. [sent-416, score-0.188]
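For the GLasso baseline in such a comparison, one option (an assumption of this sketch — the paper uses the R package glasso, and the Space method has no scikit-learn counterpart) is scikit-learn's graphical lasso with cross-validated regularization, reusing X and E_hat from the snippets above.

```python
# Sketch: GLasso baseline via scikit-learn, compared with the Gelato edge set.
import numpy as np
from sklearn.covariance import GraphicalLassoCV

glasso = GraphicalLassoCV().fit(X)
Theta_glasso = glasso.precision_
edges_glasso = (np.abs(Theta_glasso) > 1e-8) & ~np.eye(X.shape[1], dtype=bool)
print("GLasso edges:", int(edges_glasso.sum()) // 2, "| Gelato edges:", int(E_hat.sum()) // 2)
```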
35 For the Space method, estimation of Θ_0 is done via maximum likelihood as in (13), based on the edge set E_n^(Space) from the estimated sparse partial correlation matrix. [sent-425, score-0.181]
36 Our Gelato method also addresses the bias problem inherent in the GLasso estimator since we no longer shrink the entries in the covariance matrix corresponding to the selected edge set En in the maximum likelihood estimate, as shown in Section 3. [sent-609, score-0.289]
37 Consider nodewise regressions as in (2), where we are given vectors of parameters {β^i_j, j = 1, . [sent-633, score-0.201]
38 With respect to the degree of node i, for each i we define s^i_0 ≤ s^i ≤ s as the smallest integer such that ∑_{j=1, j≠i}^p min( (β^i_j)², λ² Var(V_i) ) ≤ s^i_0 λ² Var(V_i), where λ = √(2 log p / n), (35) and where s^i is as defined in (7). [sent-640, score-0.692]
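The "essential sparsity" s^i_0 in (35) is a simple population quantity; the following sketch computes it from a given coefficient vector β^i (excluding the own index) and Var(V_i). The function name and interface are illustrative, not from the paper.

```python
# Sketch: essential sparsity s^i_0 of display (35).
import numpy as np

def essential_sparsity(beta_i, var_vi, n, p):
    lam = np.sqrt(2.0 * np.log(p) / n)
    total = np.sum(np.minimum(beta_i ** 2, lam ** 2 * var_vi))
    # smallest integer s with total <= s * lam^2 * var_vi
    return int(np.ceil(total / (lam ** 2 * var_vi)))
```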
39 For s^i_0 as in (35), define s_0 := max_{i=1,...,p} s^i_0 ≤ s and S_{0,n} := ∑_{i=1}^p s^i_0. [sent-643, score-0.343]
40 For an integer m ≤ p, we define the smallest and largest m-sparse eigenvalues of Σ_0 as follows: ρ_min(m) := min_{t≠0; m-sparse} ‖Σ_0^{1/2} t‖²_2 / ‖t‖²_2 and ρ_max(m) := max_{t≠0; m-sparse} ‖Σ_0^{1/2} t‖²_2 / ‖t‖²_2. [sent-651, score-0.207]
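For small p, the m-sparse eigenvalues can be computed by brute force: over every support of size m, the restricted Rayleigh quotient ranges over the eigenvalues of the corresponding principal submatrix. The sketch below (names illustrative) only makes the definition concrete; it is exponential in m.

```python
# Sketch: brute-force m-sparse eigenvalues rho_min(m), rho_max(m) of Sigma0 (small p only).
import numpy as np
from itertools import combinations

def sparse_eigenvalues(Sigma0, m):
    p = Sigma0.shape[0]
    rho_min, rho_max = np.inf, 0.0
    for support in combinations(range(p), m):
        w = np.linalg.eigvalsh(Sigma0[np.ix_(support, support)])
        rho_min, rho_max = min(rho_min, w[0]), max(rho_max, w[-1])
    return rho_min, rho_max
```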
41 In fact it holds that |θ_{0,ij}| < λ√θ_{0,ii}, which is equivalent to |β^i_j| < λσ_{V_i}, for all j ≥ s^i_0 + 1 + I_{i ≤ s^i_0+1}, (38) where I{·} is the indicator function, if we order the regression coefficients in decreasing absolute value. [sent-671, score-0.185]
42 The following event R provides an upper bound on K(s_0, k_0, X) for a given k_0 > 0 when Σ_0 satisfies the RE(s_0, k_0, Σ_0) condition: R(θ) := { X : RE(s_0, k_0, X) holds with 0 < K(s_0, k_0, X) ≤ K(s_0, k_0, Σ_0) }. [sent-704, score-0.209]
43 Then, for a random design X as generated by (15), we have P(X) := P( R(θ) ∩ F(θ) ∩ M(θ) ) ≥ 1 − 3 exp(−cθ²n/α⁴), as long as the sample size satisfies n > max{ (80sα⁴/θ²) log(5ep/(sθ)), (9c′α⁴/θ²) max{ 36K²(s_0, 4, Σ_0) f(s_0), log p } }. [sent-718, score-0.334]
44 For some a ≥ 6, let event C_a := { max_{j,k} |Z_jk| < (1+a) √(2 log p / n) }. [sent-740, score-0.423]
45 Consider, for some constant C_3 > 4√(5/3), the event X_0 := { max_{j,k} |∆_jk| < C_3 σ_j σ_k √(log max{p,n}/n) < 1/2 }. [sent-745, score-0.267]
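The events C_a and X_0 only require entrywise concentration of the sample (correlation) matrix, which is easy to probe empirically; the following Monte Carlo sketch (constants and sizes are illustrative, not the paper's) checks how often the maximal entrywise deviation exceeds a threshold of order √(log max(n,p)/n).

```python
# Sketch: Monte Carlo check of the entrywise concentration behind X_0.
import numpy as np

rng = np.random.default_rng(1)
n_sim, p_sim = 500, 50
C3 = 4.0 * np.sqrt(5.0 / 3.0)
Sigma0 = np.eye(p_sim)
threshold = C3 * np.sqrt(np.log(max(n_sim, p_sim)) / n_sim)
reps, exceed = 200, 0
for _ in range(reps):
    X_sim = rng.standard_normal((n_sim, p_sim))      # data with covariance Sigma0 = I
    S = X_sim.T @ X_sim / n_sim
    exceed += np.abs(S - Sigma0).max() > threshold
print("empirical exceedance frequency:", exceed / reps)
```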
46 It is also clear that event Ca is equivalent to the event to be defined in (44). [sent-747, score-0.312]
47 Lemma 12 also justifies the choice of λn in nodewise regressions (cf. [sent-748, score-0.201]
48 It is clear that event (44) is the same as event Ca . [sent-777, score-0.312]
49 By the union bound and by taking τ = C_2 √(log p / n) in (45), with σ_jk = 0, ∀ j ≠ k, where √2(1+a) ≥ C_2 > 2√(10/3) for a ≥ 6. [sent-779, score-0.199]
50 1 − P(C_a) = P( max_{j,k} |Z_jk| ≥ √2(1+a) √(log p / n) ) ≤ P( max_{j,k} |Z_jk| ≥ C_2 √(log p / n) ) ≤ (p² − p) exp(−3C²_2 log p / 10) < p² exp(−3C²_2 log p / 10) = p^{−3C²_2/10 + 2} < 1/p², where we apply Lemma 14 with ρ_jk = 0, ∀ j, k = 1, . [sent-780, score-0.978]
51 In order to bound the probability of event X_0, we first state the following bound for the non-diagonal entries of Σ_0, which follows immediately from Lemma 14 by plugging in σ²_i = σ_{0,ii} = 1, ∀ i = 1, . [sent-785, score-0.246]
52 , p, and using the fact that |σ_{0,jk}| = |ρ_jk σ_j σ_k| ≤ 1, ∀ j ≠ k, where ρ_jk is the correlation coefficient between variables X_j and X_k. [sent-788, score-0.333]
53 Then P( |∆_jk| > τ ) ≤ exp( −3nτ² / (10(1 + σ²_{0,jk})) ) ≤ exp( −3nτ²/20 ) for 0 ≤ τ ≤ Ψ_jk. [sent-790, score-0.3]
54 Bounds for Nodewise Regressions. In Theorem 15 and Lemma 16 we let s^i_0 be as in (35) and T^i_0 denote the locations of the s^i_0 largest coefficients of β^i in absolute value. [sent-798, score-0.248]
55 Consider the nodewise regressions in (10), where for each i we regress X_i onto the other variables {X_k; k ≠ i} following (2), where V_i ∼ N(0, Var(V_i)) is independent of X_j, ∀ j ≠ i, as in (3). [sent-812, score-0.201]
56 Let h^i := β^i_init − β^i. On C_a ∩ X, where X := R(θ) ∩ F(θ) ∩ M(θ), we have ‖β^i_init − β^i‖_2 ≤ λ √(s^i_0) d_0 √(2D²_0 + 2D²_1 + 2), ‖h^i_{T^i_{01}}‖_2 ≤ D_0 d_0 λ √(s^i_0), and ‖h^i_{T^{i,c}_0}‖_1 = ‖β^i_{init,T^{i,c}_0}‖_1 ≤ D_1 d_0 λ s^i_0, (48) where D_0, D_1 are defined in (82) and (83) respectively. [sent-816, score-0.448]
57 It is also clear that on C_a ∩ X, event T^i_a ∩ X holds for each regression when we invoke Theorem 33, with Y := X_i and X := X_{·\i}, for i = 1, . [sent-825, score-0.217]
58 Let β^i_init be an optimal solution to (10) with λ_n = d_0 λ, where d_0 = c_0 (1+θ)² ρ_max(s) ρ_max(3s_0); suppose for each regression we apply the same thresholding rule to obtain a subset I^i as follows: I^i = { j : j ≠ i, |β^i_{j,init}| ≥ t_0 = f_0 λ }, and D^i := {1, . [sent-840, score-0.193]
59 Then we have, on event C_a ∩ X: |I^i| ≤ s^i_0 (1 + D_1/D_4) and |I^i ∪ S^i| ≤ s^i + (D_1/D_4) s^i_0, and ‖β^i_D‖_2 ≤ d_0 λ √(s^i_0) √(1 + (D_0 + D_4)²), (49) where D is understood to be D^i. [sent-847, score-0.528]
60 Then on event C_a ∩ X, for Θ̃_0 as in (24) and E as in (23), we have, for S_{0,n} as in (36) and Θ_0 = (θ_{0,ij}): |E| ≤ (1 + D_1/D_4) S_{0,n}, where |E \ E_0| ≤ (D_1/D_4) S_{0,n}, (50) and ‖Θ̃_{0,D}‖_F := ‖Θ̃_0 − Θ_0‖_F ≤ √( min{ S_{0,n} max_{i=1,...,p} θ²_{0,ii}, s_0 ‖diag(Θ_0)‖²_2 } ) · √(1 + (D_0 + D_4)²) d_0 λ ≤ √(S_{0,n} (1 + (D_0 + D_4)²)) C_diag d_0 λ. [sent-852, score-0.251]
61 We have by (48): |I^i ∩ T^{i,c}_0| ≤ ‖β^i_{init,T^{i,c}_0}‖_1 / (f_0 λ) ≤ D_1 d_0 λ s^i_0 / (D_4 d_0 λ) = D_1 s^i_0 / D_4, (52) where D_1 is understood to be the same constant that appears in (48). [sent-872, score-0.248]
62 Suppose the sample size satisfies, for C_3 ≥ 4√(5/3): n > (106/k²) (4C_3 + 32/(31c²))² max{ 2|E| log max(n,p), C²_bias 2S_{0,n} log p }. [sent-890, score-0.239]
63 (55) Then on event E ∩ X_0, we have, for M = (9/(2k²)) · (4C_3 + 32/(31c²)): ‖Θ_n(E) − Θ_0‖_F ≤ (M+1) max{ √(2|E| log max(n,p)/n), C_bias √(2S_{0,n} log(p)/n) }. [sent-891, score-0.323]
64 We now state bounds on the rate of convergence, in Frobenius norm, of the covariance matrix estimator and on the KL divergence. [sent-897, score-0.175]
65 Suppose the sample size satisfies, for C_3 ≥ 4√(5/3) and C_bias, M as defined in Theorem 19: n > (106/(c²k⁴)) (4C_3 + 32/(31c²))² max{ 2|E| log max(p,n), C²_bias 2S_{0,n} log p }. [sent-904, score-0.239]
66 (57) Then on event E ∩ X_0, we have φ_min(Θ_n(E)) > c/2 > 0 and, for Σ_n(E) = (Θ_n(E))^{-1}: ‖Σ_n(E) − Σ_0‖_F ≤ (2(M+1)/c²) max{ √(2|E| log max(n,p)/n), C_bias √(2S_{0,n} log(p)/n) }. Theorem 21 Suppose all conditions, events, and bounds on |E| and ‖Θ̃_{0,D}‖_F . [sent-905, score-0.353]
67 Then on event E ∩ X_0, we have, for R(Θ_n(E)) − R(Θ_0) ≥ 0: R(Θ_n(E)) − R(Θ_0) ≤ M(C_3 + 1/8) max{ 2|E| log max(n,p)/n, C²_bias 2S_{0,n} log(p)/n }. [sent-909, score-0.323]
68 In view of Corollary 17, we have on E := X ∩ C_a, for C_diag as in (17): |E| ≤ (1 + D_1/D_4) S_{0,n} ≤ 2S_{0,n} for D_4 ≥ D_1, and ‖Θ̃_{0,D}‖_F := ‖Θ̃_0 − Θ_0‖_F ≤ C_bias √(2S_{0,n} log(p)/n) ≤ c/32, where C²_bias := min{ max_{i=1,...,p} θ²_{0,ii}, . [sent-912, score-0.176]
69 Then we have on event E: |E| ≤ (1 + D_1) S_{0,n} and |E \ E_0| ≤ D_1 S_{0,n} by (50); and on event E ∩ X_0, for C′_bias := C_diag d_0 √(1 + (D_0 + 1)²): ‖Θ_n(E) − Θ_0‖_F ≤ M max{ √(2(1 + D_1) S_{0,n} log max(n,p)/n), C′_bias √(2S_{0,n} log p / n) }. [sent-923, score-0.551]
70 Proof of Theorem 19. Suppose event E holds throughout this proof. [sent-926, score-0.182]
71 Define, for R_n as in expression (53): Q(Θ) := R_n(Θ) − R_n(Θ_0) = tr(ΘΓ_n) − log|Θ| − tr(Θ_0 Γ_n) + log|Θ_0| = tr( (Θ − Θ_0)(Γ_n − Σ_0) ) − ( log|Θ| − log|Θ_0| ) + tr( (Θ − Θ_0) Σ_0 ). [sent-934, score-0.514]
72 (62) r_n := max{ √(2|E| log max(n,p)/n), C_bias √(2S_{0,n} log p / n) }. Define, for ∆ ∈ U_n(Θ_0): G(∆) := Q(Θ_0 + ∆) = tr(∆(Γ_n − Σ_0)) − ( log|Θ_0 + ∆| − log|Θ_0| ) + tr(∆Σ_0). [sent-939, score-0.235]
73 (64) Conditioned on event X_0, by (70) and (55), max_{j,k} |Γ_{n,jk} − σ_{0,jk}| ≤ 4C_3 √(log max(n,p)/n) =: δ_n. [sent-958, score-0.228]
74 Thus on event E ∩ X_0, we have |tr( ∆(Γ_n − Σ_0) )| ≤ δ_n |offd(∆)|_1, where |offd(∆)|_1 ≤ √(‖offd(∆)‖_0) ‖offd(∆)‖_F ≤ √(2|E|) ‖∆‖_F, and hence tr( ∆(Γ_n − Σ_0) ) ≥ −4C_3 √(log max(n,p)/n) √(2|E|) ‖∆‖_F ≥ −4C_3 r_n ‖∆‖_F. [sent-959, score-0.889]
75 First we note that for ∆ ∈ T_n, we have on event E: ‖∆‖_2 ≤ ‖∆‖_F = M r_n < 7/(16k), (66) given (55): n > ((16/7) · (9/(2k)))² (4C_3 + 32/(31c²))² max{ 2|E| log max(n,p), C²_bias 2S_{0,n} log p }. [sent-961, score-0.323]
76 By Proposition 26 and the fact that G(∆) ≤ G(0) = 0 on event E, we have the following: on event E, if G(∆) > 0 for all ∆ ∈ T_n, then ‖∆‖_F < M r_n, given that ∆ ∈ U_n(Θ_0) \ (T_n ∪ W_n). [sent-970, score-0.312]
77 Remark 27. Clearly we have on event E, by (64): ‖(Θ̃_0)^{-1} − Σ_0‖_F ≤ ‖Θ̃_{0,D}‖_F / ( φ_min(Θ̃_0) φ_min(Θ_0) ) ≤ 32 C_bias √(2S_{0,n} log p / n). [sent-976, score-0.228]
78 Now suppose n > ((16/(7c)) · (9/(2k)))² (4C_3 + 32/(31c²))² max{ 2|E| log max(n,p), C²_bias 2S_{0,n} log p }, which clearly holds given (57). [sent-979, score-0.302]
79 Then in addition to the bound in (66), on event E ∩ X_0, we have M r_n < 7c/16, (68) for r_n as in (62). [sent-980, score-0.251]
80 Then, by Theorem 19, for the same M as therein, on event E ∩ X_0, we have ‖Θ_n(E) − Θ_0‖_F ≤ (M+1) max{ √(2|E| log max(n,p)/n), C_bias √(2S_{0,n} log(p)/n) }, given that the sample bound in (55) is clearly satisfied. [sent-981, score-0.35]
81 We now proceed to bound ‖Σ_n − Σ_0‖_F. First note that by (68), we have on event E ∩ X_0, for M > 7: φ_min(Θ_n(E)) ≥ φ_min(Θ_0) − ‖Θ_n − Θ_0‖_2 ≥ φ_min(Θ_0) − ‖Θ_n − Θ_0‖_F, given (56). [sent-982, score-0.183]
82 Now clearly on event E ∩ X_0, (58) holds by (56), and ‖Σ_n(E) − Σ_0‖_F ≤ ‖Θ_n(E) − Θ_0‖_F / ( φ_min(Θ_n(E)) φ_min(Θ_0) ) < (2/c²) ‖Θ_n(E) − Θ_0‖_F. [sent-984, score-0.182]
83 Lemma 28. On event E, we have, for C_bias, Θ̃_0, Θ_0 as in Theorem 19: 0 ≤ R(Θ̃_0) − R(Θ_0) ≤ (32/(31c))² C²_bias 2S_{0,n} log p / (2n) ≤ (32/(31c))² · r²_n/2 ≤ M r²_n/8, for r_n as in (62), where the last inequality holds given that M ≥ (9/2)(4C_3 + 32/(31c²)). [sent-990, score-0.39]
84 log|Θ_0 + ∆_0| − log|Θ_0| = tr(Σ_0 ∆_0) + ∫_0^1 (1 − v) (d²/dv²) log|Θ_0 + v∆_0| dv. We now obtain an upper bound on B ≥ 0. [sent-1001, score-0.243]
85 Now, conditioned on event E ∩ X_0, following the same arguments around (65), we have |tr( ∆(S_n − Σ_0) )| ≤ δ_n |offd(∆)|_1 ≤ δ_n √(2|E|) ‖offd(∆)‖_F ≤ M r_n C_3 √(2|E| log max(n,p)/n) ≤ M C_3 r²_n, where ‖offd(∆)‖_0 ≤ 2|E| by definition, and r_n is as defined in (62). [sent-1008, score-0.513]
86 On event X_0, the following holds for τ = C_3 √(log max{p,n}/n): | ‖X_i‖²_2 / (σ²_i n) − 1 | ≤ τ and | (1/n) ⟨X_i/σ_i, X_j/σ_j⟩ − ρ_{0,ij} | ≤ τ. [sent-1018, score-0.254]
87 k 2S0,n log(p)/n < √ 13 144σ2 max 4C3 + 12c2 σ2 min ≤ kc2 σ2 cσ2 k min min ≤ min , 48C3 σ2 13σ2 (48c2 σ2 C3 + 13)σ2 max max max min ≤ c/(13σ²_max). [sent-1025, score-0.785]
88 Suppose event E holds throughout this proof. [sent-1026, score-0.277]
89 Define, for R_n as in expression (27): Q(Ω) := R_n(Ω) − R_n(Ω_0) = tr(ΩΓ_n) − log|Ω| − tr(Ω_0 Γ_n) + log|Ω_0| = tr( (Ω − Ω_0)(Γ_n − Ψ_0) ) − ( log|Ω| − log|Ω_0| ) + tr( (Ω − Ω_0) Ψ_0 ). [sent-1036, score-0.514]
90 Indeed, by definition of ∆ ∈ T_n, we have φ_min(Ω_0 + ∆) > 0 on event E; thus φ_min(Ω_0 + (1+ε)∆) ≥ φ_min(Ω_0 + ∆) − ε‖∆‖_2 > 0, and φ_min(Ω_0 − ε∆) ≥ φ_min(Ω_0) − ε‖∆‖_2 > 12σ²_min c/13 − ε‖∆‖_2 > 0, for ε > 0 sufficiently small. [sent-1050, score-0.237]
91 First we note that for ∆ ∈ T_n, we have on event E: ‖∆‖_2 ≤ ‖∆‖_F = M r_n < 3σ²_max/(8k). [sent-1057, score-0.251]
92 (76) given (31): n > ((8k/(3σ²_max)) · (9σ⁴_max/(2k²)))² (4C_3 + 13/(12σ²_min c²))² max{ 2|E| log max(n,p), C²_bias 2S_{0,n} log p }. [sent-1058, score-0.239]
93 We have, by (72) and (34), following Rothman et al. [sent-1059, score-0.176]
94 [2008] (see page 502, proof of Theorem 1 therein): on event E, A ≥ ‖∆‖²_F / ( 2(φ_max(Ω_0) + ‖∆‖_2)² ) > ‖∆‖²_F / ( 2σ⁴_max (1/k + c/13 + 3/(8k))² ) > ‖∆‖²_F · 2k²/(9σ⁴_max). (77) [sent-1060, score-0.251]
95 Now on event E ∩ X_0, for all ∆ ∈ T_n, we have by (74), (77), (76), and (75): G(∆) > ‖∆‖²_F · 2k²/(9σ⁴_max) − (4C_3 + 13/(12σ²_min c²)) r_n ‖∆‖_F = ‖∆‖²_F · (2k²/(9σ⁴_max)) ( 1 − (9σ⁴_max/(2k²)) (4C_3 + 13/(12σ²_min c²)) / M ). [sent-1061, score-0.495]
96 Hence we have G(∆) > 0 for M large enough; M ≥ (9σ⁴_max/(2k²)) (4C_3 + 13/(12σ²_min c²)) suffices. [sent-1062, score-0.406]
97 We can now apply the union bound to obtain: P( max_{1≤j≤p} |Y_j| ≥ t ) ≤ p · ( c_1σ / (t√(πn)) ) · exp( −nt²/(2c²_1σ²) ) = exp( −nt²/(2c²_1σ²) + log p + log( c_1σ/(t√(πn)) ) ). [sent-1101, score-0.25]
98 By choosing t = c_1σ(1+a)√(2 log p / n), the right-hand side is bounded by ( √(π log p) · p^a )^{-1} for a ≥ 0. [sent-1102, score-0.216]
99 High dimensional sparse covariance estimation via directed acyclic graphs. [sent-1258, score-0.213]
100 High dimensional inverse covariance matrix estimation via linear programming. [sent-1315, score-0.219]
wordName wordTfidf (topN-words)
[('gelato', 0.372), ('glasso', 0.309), ('cbias', 0.236), ('uhlmann', 0.236), ('utimann', 0.236), ('hou', 0.201), ('mrn', 0.2), ('ovariance', 0.193), ('stimation', 0.166), ('event', 0.156), ('tr', 0.149), ('igh', 0.148), ('zhou', 0.138), ('nodewise', 0.136), ('se', 0.128), ('offd', 0.127), ('si', 0.124), ('tn', 0.106), ('jk', 0.1), ('init', 0.1), ('sn', 0.097), ('max', 0.095), ('thresholding', 0.093), ('rothman', 0.09), ('frobenius', 0.09), ('lasso', 0.084), ('cdiag', 0.082), ('min', 0.081), ('covariance', 0.079), ('vi', 0.074), ('log', 0.072), ('diag', 0.071), ('en', 0.071), ('ar', 0.069), ('dv', 0.069), ('op', 0.069), ('rn', 0.068), ('regressions', 0.065), ('leibler', 0.062), ('theorem', 0.06), ('edge', 0.059), ('concentration', 0.058), ('estimator', 0.057), ('kullback', 0.055), ('un', 0.053), ('ij', 0.053), ('stands', 0.052), ('sparse', 0.051), ('meinshausen', 0.048), ('edges', 0.047), ('bickel', 0.046), ('dashdotted', 0.045), ('dashes', 0.045), ('mlow', 0.045), ('dimensional', 0.045), ('ta', 0.045), ('vec', 0.045), ('var', 0.044), ('peng', 0.042), ('levina', 0.042), ('cand', 0.039), ('isoprenoid', 0.039), ('lam', 0.039), ('hlmann', 0.038), ('estimation', 0.038), ('suppose', 0.037), ('remark', 0.037), ('entries', 0.036), ('mupp', 0.036), ('timann', 0.036), ('lemma', 0.036), ('norm', 0.036), ('regression', 0.035), ('triangles', 0.035), ('stand', 0.033), ('correlation', 0.033), ('eigenvalues', 0.031), ('rudelson', 0.031), ('breast', 0.03), ('bounds', 0.03), ('matrix', 0.03), ('mle', 0.03), ('oracle', 0.029), ('tuning', 0.029), ('fan', 0.029), ('risk', 0.028), ('nt', 0.028), ('geer', 0.028), ('bias', 0.028), ('proposition', 0.027), ('undirected', 0.027), ('inverse', 0.027), ('scad', 0.027), ('shuheng', 0.027), ('thresholded', 0.027), ('bound', 0.027), ('graphical', 0.026), ('circles', 0.026), ('sparsity', 0.026), ('holds', 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999952 39 jmlr-2011-High-dimensional Covariance Estimation Based On Gaussian Graphical Models
Author: Shuheng Zhou, Philipp Rütimann, Min Xu, Peter Bühlmann
Abstract: Undirected graphs are often used to describe high dimensional distributions. Under sparsity conditions, the graph can be estimated using ℓ1-penalization methods. We propose and study the following method. We combine a multiple regression approach with ideas of thresholding and refitting: first we infer a sparse undirected graphical model structure via thresholding of each among many ℓ1-norm penalized regression functions; we then estimate the covariance matrix and its inverse using the maximum likelihood estimator. We show that under suitable conditions, this approach yields consistent estimation in terms of graphical structure and fast convergence rates with respect to the operator and Frobenius norm for the covariance matrix and its inverse. We also derive an explicit bound for the Kullback-Leibler divergence. Keywords: graphical model selection, covariance estimation, Lasso, nodewise regression, thresholding. © 2011 Shuheng Zhou, Philipp Rütimann, Min Xu and Peter Bühlmann.
2 0.16823754 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices
Author: Jérémie Bigot, Rolando J. Biscay, Jean-Michel Loubes, Lillian Muñiz-Alvarez
Abstract: In this paper, we consider the Group Lasso estimator of the covariance matrix of a stochastic process corrupted by an additive noise. We propose to estimate the covariance matrix in a highdimensional setting under the assumption that the process has a sparse representation in a large dictionary of basis functions. Using a matrix regression model, we propose a new methodology for high-dimensional covariance matrix estimation based on empirical contrast regularization by a group Lasso penalty. Using such a penalty, the method selects a sparse set of basis functions in the dictionary used to approximate the process, leading to an approximation of the covariance matrix into a low dimensional space. Consistency of the estimator is studied in Frobenius and operator norms and an application to sparse PCA is proposed. Keywords: group Lasso, ℓ1 penalty, high-dimensional covariance estimation, basis expansion, sparsity, oracle inequality, sparse PCA
3 0.12173611 35 jmlr-2011-Forest Density Estimation
Author: Han Liu, Min Xu, Haijie Gu, Anupam Gupta, John Lafferty, Larry Wasserman
Abstract: We study graph estimation and density estimation in high dimensions, using a family of density estimators based on forest structured undirected graphical models. For density estimation, we do not assume the true distribution corresponds to a forest; rather, we form kernel density estimates of the bivariate and univariate marginals, and apply Kruskal’s algorithm to estimate the optimal forest on held out data. We prove an oracle inequality on the excess risk of the resulting estimator relative to the risk of the best forest. For graph estimation, we consider the problem of estimating forests with restricted tree sizes. We prove that finding a maximum weight spanning forest with restricted tree size is NP-hard, and develop an approximation algorithm for this problem. Viewing the tree size as a complexity parameter, we then select a forest using data splitting, and prove bounds on excess risk and structure selection consistency of the procedure. Experiments with simulated data and microarray data indicate that the methods are a practical alternative to Gaussian graphical models. Keywords: kernel density estimation, forest structured Markov network, high dimensional inference, risk consistency, structure selection consistency
4 0.10552069 97 jmlr-2011-Union Support Recovery in Multi-task Learning
Author: Mladen Kolar, John Lafferty, Larry Wasserman
Abstract: We sharply characterize the performance of different penalization schemes for the problem of selecting the relevant variables in the multi-task setting. Previous work focuses on the regression problem where conditions on the design matrix complicate the analysis. A clearer and simpler picture emerges by studying the Normal means model. This model, often used in the field of statistics, is a simplified model that provides a laboratory for studying complex procedures. Keywords: high-dimensional inference, multi-task learning, sparsity, normal means, minimax estimation
5 0.078224659 18 jmlr-2011-Convergence Rates of Efficient Global Optimization Algorithms
Author: Adam D. Bull
Abstract: In the efficient global optimization problem, we minimize an unknown function f , using as few observations f (x) as possible. It can be considered a continuum-armed-bandit problem, with noiseless data, and simple regret. Expected-improvement algorithms are perhaps the most popular methods for solving the problem; in this paper, we provide theoretical results on their asymptotic behaviour. Implementing these algorithms requires a choice of Gaussian-process prior, which determines an associated space of functions, its reproducing-kernel Hilbert space (RKHS). When the prior is fixed, expected improvement is known to converge on the minimum of any function in its RKHS. We provide convergence rates for this procedure, optimal for functions of low smoothness, and describe a modified algorithm attaining optimal rates for smoother functions. In practice, however, priors are typically estimated sequentially from the data. For standard estimators, we show this procedure may never find the minimum of f . We then propose alternative estimators, chosen to minimize the constants in the rate of convergence, and show these estimators retain the convergence rates of a fixed prior. Keywords: convergence rates, efficient global optimization, expected improvement, continuumarmed bandit, Bayesian optimization
6 0.077475414 30 jmlr-2011-Efficient Structure Learning of Bayesian Networks using Constraints
7 0.077464037 101 jmlr-2011-Variable Sparsity Kernel Learning
8 0.068998657 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels
9 0.06347008 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates
10 0.058326326 8 jmlr-2011-Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
11 0.058086541 6 jmlr-2011-A Simpler Approach to Matrix Completion
12 0.054538984 44 jmlr-2011-Information Rates of Nonparametric Gaussian Process Methods
13 0.045826349 59 jmlr-2011-Learning with Structured Sparsity
14 0.044128891 55 jmlr-2011-Learning Multi-modal Similarity
15 0.044013713 91 jmlr-2011-The Sample Complexity of Dictionary Learning
16 0.043125536 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms
17 0.042388756 54 jmlr-2011-Learning Latent Tree Graphical Models
18 0.041526105 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints
19 0.03853951 45 jmlr-2011-Internal Regret with Partial Monitoring: Calibration-Based Optimal Algorithms
20 0.038450576 40 jmlr-2011-Hyper-Sparse Optimal Aggregation
topicId topicWeight
[(0, 0.243), (1, -0.015), (2, 0.127), (3, 0.105), (4, 0.114), (5, 0.006), (6, -0.019), (7, -0.018), (8, -0.059), (9, 0.177), (10, 0.153), (11, -0.139), (12, -0.05), (13, -0.264), (14, -0.034), (15, 0.096), (16, 0.117), (17, -0.009), (18, -0.057), (19, 0.065), (20, 0.093), (21, -0.026), (22, -0.067), (23, -0.055), (24, -0.029), (25, -0.028), (26, 0.149), (27, 0.208), (28, 0.126), (29, 0.208), (30, 0.119), (31, -0.075), (32, -0.064), (33, -0.108), (34, 0.077), (35, -0.053), (36, -0.074), (37, -0.124), (38, 0.069), (39, -0.015), (40, -0.045), (41, 0.032), (42, -0.026), (43, -0.048), (44, 0.051), (45, -0.025), (46, 0.102), (47, 0.024), (48, 0.036), (49, 0.115)]
simIndex simValue paperId paperTitle
same-paper 1 0.94124472 39 jmlr-2011-High-dimensional Covariance Estimation Based On Gaussian Graphical Models
Author: Shuheng Zhou, Philipp Rütimann, Min Xu, Peter Bühlmann
Abstract: Undirected graphs are often used to describe high dimensional distributions. Under sparsity conditions, the graph can be estimated using ℓ1-penalization methods. We propose and study the following method. We combine a multiple regression approach with ideas of thresholding and refitting: first we infer a sparse undirected graphical model structure via thresholding of each among many ℓ1-norm penalized regression functions; we then estimate the covariance matrix and its inverse using the maximum likelihood estimator. We show that under suitable conditions, this approach yields consistent estimation in terms of graphical structure and fast convergence rates with respect to the operator and Frobenius norm for the covariance matrix and its inverse. We also derive an explicit bound for the Kullback-Leibler divergence. Keywords: graphical model selection, covariance estimation, Lasso, nodewise regression, thresholding. © 2011 Shuheng Zhou, Philipp Rütimann, Min Xu and Peter Bühlmann.
2 0.73796773 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices
Author: Jérémie Bigot, Rolando J. Biscay, Jean-Michel Loubes, Lillian Muñiz-Alvarez
Abstract: In this paper, we consider the Group Lasso estimator of the covariance matrix of a stochastic process corrupted by an additive noise. We propose to estimate the covariance matrix in a highdimensional setting under the assumption that the process has a sparse representation in a large dictionary of basis functions. Using a matrix regression model, we propose a new methodology for high-dimensional covariance matrix estimation based on empirical contrast regularization by a group Lasso penalty. Using such a penalty, the method selects a sparse set of basis functions in the dictionary used to approximate the process, leading to an approximation of the covariance matrix into a low dimensional space. Consistency of the estimator is studied in Frobenius and operator norms and an application to sparse PCA is proposed. Keywords: group Lasso, ℓ1 penalty, high-dimensional covariance estimation, basis expansion, sparsity, oracle inequality, sparse PCA
3 0.44997975 35 jmlr-2011-Forest Density Estimation
Author: Han Liu, Min Xu, Haijie Gu, Anupam Gupta, John Lafferty, Larry Wasserman
Abstract: We study graph estimation and density estimation in high dimensions, using a family of density estimators based on forest structured undirected graphical models. For density estimation, we do not assume the true distribution corresponds to a forest; rather, we form kernel density estimates of the bivariate and univariate marginals, and apply Kruskal’s algorithm to estimate the optimal forest on held out data. We prove an oracle inequality on the excess risk of the resulting estimator relative to the risk of the best forest. For graph estimation, we consider the problem of estimating forests with restricted tree sizes. We prove that finding a maximum weight spanning forest with restricted tree size is NP-hard, and develop an approximation algorithm for this problem. Viewing the tree size as a complexity parameter, we then select a forest using data splitting, and prove bounds on excess risk and structure selection consistency of the procedure. Experiments with simulated data and microarray data indicate that the methods are a practical alternative to Gaussian graphical models. Keywords: kernel density estimation, forest structured Markov network, high dimensional inference, risk consistency, structure selection consistency
4 0.41258797 97 jmlr-2011-Union Support Recovery in Multi-task Learning
Author: Mladen Kolar, John Lafferty, Larry Wasserman
Abstract: We sharply characterize the performance of different penalization schemes for the problem of selecting the relevant variables in the multi-task setting. Previous work focuses on the regression problem where conditions on the design matrix complicate the analysis. A clearer and simpler picture emerges by studying the Normal means model. This model, often used in the field of statistics, is a simplified model that provides a laboratory for studying complex procedures. Keywords: high-dimensional inference, multi-task learning, sparsity, normal means, minimax estimation
5 0.39485914 6 jmlr-2011-A Simpler Approach to Matrix Completion
Author: Benjamin Recht
Abstract: This paper provides the best bounds to date on the number of randomly sampled entries required to reconstruct an unknown low-rank matrix. These results improve on prior work by Candès and Recht (2009), Candès and Tao (2009), and Keshavan et al. (2009). The reconstruction is accomplished by minimizing the nuclear norm, or sum of the singular values, of the hidden matrix subject to agreement with the provided entries. If the underlying matrix satisfies a certain incoherence condition, then the number of entries required is equal to a quadratic logarithmic factor times the number of parameters in the singular value decomposition. The proof of this assertion is short, self contained, and uses very elementary analysis. The novel techniques herein are based on recent work in quantum information theory. Keywords: matrix completion, low-rank matrices, convex optimization, nuclear norm minimization, random matrices, operator Chernoff bound, compressed sensing
6 0.35944673 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels
7 0.34982789 18 jmlr-2011-Convergence Rates of Efficient Global Optimization Algorithms
8 0.33041245 101 jmlr-2011-Variable Sparsity Kernel Learning
9 0.31982216 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates
10 0.30736315 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints
11 0.28471074 30 jmlr-2011-Efficient Structure Learning of Bayesian Networks using Constraints
12 0.28158659 49 jmlr-2011-Kernel Regression in the Presence of Correlated Errors
13 0.2691972 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms
14 0.25800514 88 jmlr-2011-Structured Variable Selection with Sparsity-Inducing Norms
15 0.25013101 40 jmlr-2011-Hyper-Sparse Optimal Aggregation
16 0.2309064 91 jmlr-2011-The Sample Complexity of Dictionary Learning
17 0.22127575 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing
18 0.21662244 58 jmlr-2011-Learning from Partial Labels
19 0.21504016 45 jmlr-2011-Internal Regret with Partial Monitoring: Calibration-Based Optimal Algorithms
20 0.21385077 56 jmlr-2011-Learning Transformation Models for Ranking and Survival Analysis
topicId topicWeight
[(4, 0.067), (9, 0.024), (10, 0.016), (24, 0.019), (31, 0.066), (32, 0.011), (41, 0.025), (60, 0.01), (73, 0.019), (78, 0.637), (90, 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 0.99000144 39 jmlr-2011-High-dimensional Covariance Estimation Based On Gaussian Graphical Models
Author: Shuheng Zhou, Philipp Rütimann, Min Xu, Peter Bühlmann
Abstract: Undirected graphs are often used to describe high dimensional distributions. Under sparsity conditions, the graph can be estimated using ℓ1-penalization methods. We propose and study the following method. We combine a multiple regression approach with ideas of thresholding and refitting: first we infer a sparse undirected graphical model structure via thresholding of each among many ℓ1-norm penalized regression functions; we then estimate the covariance matrix and its inverse using the maximum likelihood estimator. We show that under suitable conditions, this approach yields consistent estimation in terms of graphical structure and fast convergence rates with respect to the operator and Frobenius norm for the covariance matrix and its inverse. We also derive an explicit bound for the Kullback-Leibler divergence. Keywords: graphical model selection, covariance estimation, Lasso, nodewise regression, thresholding. © 2011 Shuheng Zhou, Philipp Rütimann, Min Xu and Peter Bühlmann.
2 0.98673624 33 jmlr-2011-Exploiting Best-Match Equations for Efficient Reinforcement Learning
Author: Harm van Seijen, Shimon Whiteson, Hado van Hasselt, Marco Wiering
Abstract: This article presents and evaluates best-match learning, a new approach to reinforcement learning that trades off the sample efficiency of model-based methods with the space efficiency of modelfree methods. Best-match learning works by approximating the solution to a set of best-match equations, which combine a sparse model with a model-free Q-value function constructed from samples not used by the model. We prove that, unlike regular sparse model-based methods, bestmatch learning is guaranteed to converge to the optimal Q-values in the tabular case. Empirical results demonstrate that best-match learning can substantially outperform regular sparse modelbased methods, as well as several model-free methods that strive to improve the sample efficiency of temporal-difference methods. In addition, we demonstrate that best-match learning can be successfully combined with function approximation. Keywords: reinforcement learning, on-line learning, temporal-difference methods, function approximation, data reuse
3 0.97568518 87 jmlr-2011-Stochastic Methods forl1-regularized Loss Minimization
Author: Shai Shalev-Shwartz, Ambuj Tewari
Abstract: We describe and analyze two stochastic methods for ℓ1 regularized loss minimization problems, such as the Lasso. The first method updates the weight of a single feature at each iteration while the second method updates the entire weight vector but only uses a single training example at each iteration. In both methods, the choice of feature or example is uniformly at random. Our theoretical runtime analysis suggests that the stochastic methods should outperform state-of-the-art deterministic approaches, including their deterministic counterparts, when the size of the problem is large. We demonstrate the advantage of stochastic methods by experimenting with synthetic and natural data sets.1 Keywords: L1 regularization, optimization, coordinate descent, mirror descent, sparsity
4 0.94329512 22 jmlr-2011-Differentially Private Empirical Risk Minimization
Author: Kamalika Chaudhuri, Claire Monteleoni, Anand D. Sarwate
Abstract: Privacy-preserving machine learning algorithms are crucial for the increasingly common setting in which personal data, such as medical or financial records, are analyzed. We provide general techniques to produce privacy-preserving approximations of classifiers learned via (regularized) empirical risk minimization (ERM). These algorithms are private under the ε-differential privacy definition due to Dwork et al. (2006). First we apply the output perturbation ideas of Dwork et al. (2006), to ERM classification. Then we propose a new method, objective perturbation, for privacy-preserving machine learning algorithm design. This method entails perturbing the objective function before optimizing over classifiers. If the loss and regularizer satisfy certain convexity and differentiability criteria, we prove theoretical results showing that our algorithms preserve privacy, and provide generalization bounds for linear and nonlinear kernels. We further present a privacypreserving technique for tuning the parameters in general machine learning algorithms, thereby providing end-to-end privacy guarantees for the training process. We apply these results to produce privacy-preserving analogues of regularized logistic regression and support vector machines. We obtain encouraging results from evaluating their performance on real demographic and benchmark data sets. Our results show that both theoretically and empirically, objective perturbation is superior to the previous state-of-the-art, output perturbation, in managing the inherent tradeoff between privacy and learning performance. Keywords: privacy, classification, optimization, empirical risk minimization, support vector machines, logistic regression
5 0.85149878 71 jmlr-2011-On Equivalence Relationships Between Classification and Ranking Algorithms
Author: Şeyda Ertekin, Cynthia Rudin
Abstract: We demonstrate that there are machine learning algorithms that can achieve success for two separate tasks simultaneously, namely the tasks of classification and bipartite ranking. This means that advantages gained from solving one task can be carried over to the other task, such as the ability to obtain conditional density estimates, and an order-of-magnitude reduction in computational time for training the algorithm. It also means that some algorithms are robust to the choice of evaluation metric used; they can theoretically perform well when performance is measured either by a misclassification error or by a statistic of the ROC curve (such as the area under the curve). Specifically, we provide such an equivalence relationship between a generalization of Freund et al.’s RankBoost algorithm, called the “P-Norm Push,” and a particular cost-sensitive classification algorithm that generalizes AdaBoost, which we call “P-Classification.” We discuss and validate the potential benefits of this equivalence relationship, and perform controlled experiments to understand P-Classification’s empirical performance. There is no established equivalence relationship for logistic regression and its ranking counterpart, so we introduce a logistic-regression-style algorithm that aims in between classification and ranking, and has promising experimental performance with respect to both tasks. Keywords: supervised classification, bipartite ranking, area under the curve, rank statistics, boosting, logistic regression
6 0.79892188 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices
7 0.79281592 81 jmlr-2011-Robust Approximate Bilinear Programming for Value Function Approximation
8 0.77895862 40 jmlr-2011-Hyper-Sparse Optimal Aggregation
9 0.76109779 89 jmlr-2011-Super-Linear Convergence of Dual Augmented Lagrangian Algorithm for Sparsity Regularized Estimation
10 0.7584399 8 jmlr-2011-Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
11 0.73921835 29 jmlr-2011-Efficient Learning with Partially Observed Attributes
12 0.73122525 19 jmlr-2011-Convergence of Distributed Asynchronous Learning Vector Quantization Algorithms
13 0.70535767 36 jmlr-2011-Generalized TD Learning
14 0.70433074 1 jmlr-2011-A Bayesian Approach for Learning and Planning in Partially Observable Markov Decision Processes
15 0.70408368 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models
16 0.70349312 104 jmlr-2011-X-Armed Bandits
17 0.70076716 91 jmlr-2011-The Sample Complexity of Dictionary Learning
18 0.6963076 72 jmlr-2011-On the Relation between Realizable and Nonrealizable Cases of the Sequence Prediction Problem
19 0.69286543 73 jmlr-2011-Online Learning in Case of Unbounded Losses Using Follow the Perturbed Leader Algorithm
20 0.69082332 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning