jmlr jmlr2008 jmlr2008-26 knowledge-graph by maker-knowledge-mining

26 jmlr-2008-Consistency of Trace Norm Minimization


Source: pdf

Author: Francis R. Bach

Abstract: Regularization by the sum of singular values, also referred to as the trace norm, is a popular technique for estimating low rank rectangular matrices. In this paper, we extend some of the consistency results of the Lasso to provide necessary and sufficient conditions for rank consistency of trace norm minimization with the square loss. We also provide an adaptive version that is rank consistent even when the necessary condition for the non adaptive version is not fulfilled. Keywords: convex optimization, singular value decomposition, trace norm, consistency

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 In this paper, we extend some of the consistency results of the Lasso to provide necessary and sufficient conditions for rank consistency of trace norm minimization with the square loss. [sent-5, score-0.446]

2 We also provide an adaptive version that is rank consistent even when the necessary condition for the non adaptive version is not fulfilled. [sent-6, score-0.379]

3 Keywords: convex optimization, singular value decomposition, trace norm, consistency 1. [sent-7, score-0.409]

4 When learning on rectangular matrices, the rank is a natural extension of the cardinality, and the sum of singular values, also known as the trace norm or the nuclear norm, is the natural extension of the ℓ1-norm; indeed, as the ℓ1-norm is the convex envelope of the ℓ0-norm on the unit ball (i. [sent-13, score-0.553]
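
As a minimal numerical illustration of this definition (ours, not from the paper): the trace norm is simply the ℓ1-norm of the singular value spectrum, and can be computed from an SVD. The helper name below is an assumption of the sketch.

```python
import numpy as np

def trace_norm(W):
    """Sum of the singular values of W (also called the nuclear norm)."""
    return np.linalg.svd(W, compute_uv=False).sum()

# A rank-2 matrix in R^{5x4}
rng = np.random.default_rng(0)
W = rng.standard_normal((5, 2)) @ rng.standard_normal((2, 4))

s = np.linalg.svd(W, compute_uv=False)
print(np.linalg.matrix_rank(W))                      # 2
print(np.isclose(trace_norm(W), np.abs(s).sum()))    # True: trace norm = l1-norm of the spectrum
print(np.count_nonzero(s > 1e-10))                   # 2: rank = l0 "norm" of the spectrum
```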

5 , the largest lower bounding convex function) (Boyd and Vandenberghe, 2003), the trace norm is the convex envelope of the rank over the unit ball of the spectral norm (Fazel et al. [sent-15, score-0.411]

6 In this paper, we consider the rank consistency of trace norm regularization with the square loss, that is, if the data were actually generated by a low-rank matrix, will the matrix and its rank be consistently estimated? [sent-25, score-0.583]
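
The estimator considered throughout is trace-norm-penalized least squares over matrices. As a hedged sketch of how such an estimate could be computed in practice (this is not an algorithm from the paper, which focuses on consistency rather than optimization), one can run proximal gradient descent, whose proximal step is singular value soft-thresholding; all function names and the step-size rule below are ours.

```python
import numpy as np

def svt(W, tau):
    """Prox of tau * trace norm: soft-threshold the singular values of W."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def trace_norm_least_squares(M, z, lam, n_iter=500):
    """Minimize (1/2n) sum_i (z_i - tr(W' M_i))^2 + lam * ||W||_* by proximal gradient.

    M has shape (n, p, q); z has shape (n,)."""
    n, p, q = M.shape
    W = np.zeros((p, q))
    # crude step size: inverse of the Lipschitz constant of the smooth part
    step = n / np.linalg.norm(M.reshape(n, -1), 2) ** 2
    for _ in range(n_iter):
        resid = np.einsum('ipq,pq->i', M, W) - z      # predictions minus observations
        grad = np.einsum('i,ipq->pq', resid, M) / n   # gradient of the quadratic loss
        W = svt(W - step * grad, step * lam)          # proximal (thresholding) step
    return W
```

A larger lam drives more singular values of the returned estimate to zero, which is the mechanism behind the rank selection analyzed in the paper.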

7 In Section 4, we provide necessary and sufficient conditions for the rank consistency that are extensions of corresponding results for the Lasso (Yuan and Lin, 2007; Zhao and Yu, 2006; Zou, 2006) and the group Lasso (Bach, 2008). [sent-26, score-0.238]

8 In Appendix A and B, we review and derive relevant tools and results regarding perturbation of singular values as well as the trace norm. [sent-42, score-0.336]

9 (1) must lead to diagonal solutions (indeed the minimum trace norm matrix with fixed diagonal is the corresponding diagonal matrix, which is a consequence of Lemma 20 and Proposition 21) and for a diagonal matrix the trace norm is simply the ℓ1-norm of the diagonal. [sent-63, score-0.462]
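
A one-line check (ours) of the reduction mentioned here: the singular values of a diagonal matrix are the absolute values of its diagonal entries, so its trace norm equals the ℓ1-norm of the diagonal, recovering the Lasso penalty.

```python
import numpy as np

w = np.array([1.5, -0.3, 0.0, 2.0])
s = np.linalg.svd(np.diag(w), compute_uv=False)
print(np.isclose(s.sum(), np.abs(w).sum()))   # True: ||Diag(w)||_* = ||w||_1
```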

10 Similarly, the optimal W must share the same block-diagonal form, and its singular values are exactly the norms of each block, that is, the trace norm is indeed the sum of the norms of each group. [sent-76, score-0.48]

11 Note that the Lasso and group Lasso can be seen as special cases where the singular vectors are fixed. [sent-79, score-0.279]

12 However, the main difficulty in analyzing trace norm regularization, as well as the main reason for its use, is that singular vectors are not fixed and those can often be seen as implicit features learned by the estimation procedure (Srebro et al. [sent-80, score-0.414]

13 , n we have a linear prediction model, where the loading matrix W is non trivial and rank-deficient, the goal being to estimate this rank (as well as the matrix itself). [sent-118, score-0.279]

14 We let W = U Diag(s)V⊤ denote its singular value decomposition, with U ∈ ℝ^{p×r} and V ∈ ℝ^{q×r}, where r ∈ (0, min{p, q}) denotes the rank of W. [sent-119, score-0.392]

15 , ny } is sampled uniformly, then if nx , ny and n tend to infinity, then (A2) and (A3) are satisfied with Σmm = Σyy ⊗ Σxx and ζn = n^{−1/2} + nx^{−1/2} + ny^{−1/2}. [sent-156, score-0.396]

16 (1), that we will constantly use in the paper: Proposition 3 The matrix W with singular value decomposition W = U Diag(s)V⊤ (with strictly positive singular values s) is optimal for the problem in Eq. [sent-159, score-0.564]

17 This implies notably that W and Σ̂mmW − Σ̂Mz have simultaneous singular value decompositions, and the largest singular values are less than λn, and exactly equal to λn for the corresponding strictly positive singular values of W. [sent-161, score-0.794]

18 We also consider the rank consistency, that is, we want that P(rank(Ŵ) ≠ rank(W)) tends to zero as n tends to infinity. [sent-166, score-0.315]

19 1); c) if λn tends to zero exactly at rate n^{−1/2}, then the estimator is consistent with error Op(n^{−1/2}) but the probability of estimating the correct rank is converging to a limit in (0, 1) (see Section 4. [sent-169, score-0.288]

20 2); d) if λn tends to zero more slowly than n^{−1/2}, then the estimate is consistent with error Op(λn) and its rank consistency depends on specific consistency conditions detailed in Section 4. [sent-170, score-0.412]

21 We now consider the corresponding rank consistency results, when λn goes to zero faster than n^{−1/2}. [sent-190, score-0.533]

22 3) states that for such a regularization parameter, the solution has rank strictly greater than r with probability tending to one and can thus not be rank consistent: Proposition 7 Assume (A1-3). [sent-192, score-0.397]

23 If n^{1/2}λn tends to a limit λ0 > 0, then n^{1/2}(Ŵ − W) converges in distribution to the unique global minimizer of min_{∆ ∈ ℝ^{p×q}} ½ vec(∆)⊤ Σmm vec(∆) − tr ∆⊤A + λ0 [tr U⊤∆V + ‖U⊥⊤∆V⊥‖∗], where vec(A) ∈ ℝ^{pq} is normally distributed with mean zero and covariance matrix σ²Σmm. [sent-200, score-0.265]

24 If n^{1/2}λn tends to a limit λ0 > 0, then the probability that the rank of Ŵ is different from the rank of W is converging to P(‖Λ − λ0^{−1}Θ‖₂ ⩾ 1) ∈ (0, 1) where Λ ∈ ℝ^{(p−r)×(q−r)} is defined in Eq. [sent-202, score-0.376]

25 3) and Θ ∈ ℝ^{(p−r)×(q−r)} has a normal distribution with mean zero and covariance matrix σ²[(V⊥ ⊗ U⊥)⊤ Σmm^{−1} (V⊥ ⊗ U⊥)]^{−1}. [sent-204, score-0.398]

26 The previous proposition ensures that the estimate Ŵ cannot be rank consistent with this decay of the regularization parameter. [sent-205, score-0.282]

27 The condition U⊥⊤∆V⊥ = 0 is thus necessary for rank consistency when λn n^{1/2} tends to infinity while λn tends to zero. [sent-221, score-0.364]

28 Lemma 11 Assume Σmm is invertible, and W = U Diag(s)V⊤ is the singular value decomposition of W. [sent-223, score-0.28]

29 Then the unique global minimizer of ½ vec(∆)⊤ Σmm vec(∆) + tr U⊤∆V + ‖U⊥⊤∆V⊥‖∗ satisfies U⊥⊤∆V⊥ = 0 if and only if ‖[(V⊥ ⊗ U⊥)⊤ Σmm^{−1} (V⊥ ⊗ U⊥)]^{−1} (V⊥ ⊗ U⊥)⊤ Σmm^{−1} (V ⊗ U) vec(I)‖₂ ⩽ 1. [sent-224, score-0.625]

30 This leads to consider the matrix Λ ∈ ℝ^{(p−r)×(q−r)} defined as vec(Λ) = [(V⊥ ⊗ U⊥)⊤ Σmm^{−1} (V⊥ ⊗ U⊥)]^{−1} (V⊥ ⊗ U⊥)⊤ Σmm^{−1} (V ⊗ U) vec(I), (3) and the two weak and strict consistency conditions: ‖Λ‖₂ ⩽ 1, (4) and ‖Λ‖₂ < 1. [sent-225, score-0.698]
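
A small numerical sketch (ours, not from the paper) of how the condition defined in Eq. (3) could be evaluated for a given Σmm and population matrix W: build U, V, U⊥, V⊥ from the SVD of W, solve the linear system defining vec(Λ), and compare the spectral norm of Λ to 1. The column-major vec convention and all names below are assumptions of this sketch.

```python
import numpy as np

def trace_norm_consistency_condition(Sigma_mm, W, r):
    """Return ||Lambda||_2 for Lambda defined as in Eq. (3); weak condition: <= 1."""
    p, q = W.shape
    U_full, _, Vt_full = np.linalg.svd(W)
    U, V = U_full[:, :r], Vt_full[:r, :].T
    U_perp, V_perp = U_full[:, r:], Vt_full[r:, :].T

    Sigma_inv = np.linalg.inv(Sigma_mm)       # Sigma_mm is (pq) x (pq)
    K_perp = np.kron(V_perp, U_perp)          # column-major vec: vec(A X B') = (B kron A) vec(X)
    K = np.kron(V, U)

    A = K_perp.T @ Sigma_inv @ K_perp
    b = K_perp.T @ Sigma_inv @ K @ np.eye(r).ravel(order='F')   # (V kron U) vec(I)
    Lam = np.linalg.solve(A, b).reshape((p - r, q - r), order='F')
    return np.linalg.norm(Lam, 2)
```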

31 (5) is sufficient for rank consistency when n^{1/2}λn tends to infinity, while the condition Eq. [sent-228, score-0.288]

32 (4) is necessary for the existence of a sequence λn such that the estimate is both consistent and rank consistent (which is a stronger result than restricting λn to be tending to zero slower than n^{−1/2}). [sent-229, score-0.261]

33 This is due to the fact that the first r singular vectors U and V of W + λn ∆ are not fixed; indeed, the r first singular vectors (i. [sent-247, score-0.506]

34 This is to be contrasted with the adaptive version where asymptotically the first order expansion has constant singular vectors (see Section 5). [sent-250, score-0.352]

35 Finally, in this paper, we have only proved whether the probability of correct rank selection tends to zero or one. [sent-251, score-0.239]

36 In this situation, the singular values of the diagonal matrix W = Diag(w) are the norms of the diagonal blocks, while the left singular vectors are equal to the normalized versions of the block (the signs for the Lasso). [sent-262, score-0.57]

37 We can then compute the singular value decomposition in closed form as U = (Diag(wi/‖wi‖)_{i⩽r}; 0), V = (I; 0) and s = (‖wj‖)_{j⩽r}. [sent-286, score-0.28]

38 We can put these singular vectors into Eq. [sent-288, score-0.253]

39 Thus, for the group Lasso, we finally obtain: Λ = Diag[((Σxx^{−1})_{Jc Jc})^{−1} (Σxx^{−1})_{Jc,J} ηJ], so that ‖Λ‖₂ = ‖Diag[(Σxx)_{Jc J} (Σxx)_{J,J}^{−1} ηJ]‖₂ by the partitioned matrices inversion lemma, = max_{i∈Jc} ‖Σ_{xi xJ} Σ_{xJ xJ}^{−1} ηJ‖₂. [sent-293, score-0.261]
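
A short sketch (ours) of how this group-Lasso form of the condition could be checked directly from the blocks of Σxx; here groups is a list of index arrays, active lists the groups in J, and ηJ is assumed to stack the normalized directions wj/‖wj‖ over the active groups.

```python
import numpy as np

def group_lasso_condition(Sigma_xx, groups, w, active):
    """max over inactive groups i of || Sigma_{x_i x_J} Sigma_{x_J x_J}^{-1} eta_J ||_2."""
    J = np.concatenate([groups[j] for j in active])
    eta_J = np.concatenate([w[groups[j]] / np.linalg.norm(w[groups[j]]) for j in active])
    S_JJ_inv = np.linalg.inv(Sigma_xx[np.ix_(J, J)])
    vals = [np.linalg.norm(Sigma_xx[np.ix_(groups[i], J)] @ S_JJ_inv @ eta_J)
            for i in range(len(groups)) if i not in active]
    return max(vals)   # weak consistency condition: this value <= 1
```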

40 Moreover the condition ‖Λ‖₂ ⩽ 1 is exactly the one for the group Lasso (Bach, 2008), where the pattern consistency is replaced by the consistency for the number of non zero groups. [sent-298, score-0.253]

41 More precisely, we consider the least-square estimate vec(ŴLS) = Σ̂mm^{−1} vec(Σ̂Mz). [sent-305, score-0.297]

42 We have the following well-known result for least-square regression: Lemma 14 Assume (A1-3). [sent-306, score-0.297]

43 We consider the singular value decomposition of ŴLS = ULS Diag(sLS)VLS⊤, where sLS ⩾ 0. [sent-308, score-0.577]

44 With probability tending to one, min{p, q} singular values are strictly positive (i. [sent-309, score-0.332]

45 We complete the singular values sLS ∈ ℝ^{min{p,q}} by n^{−1/2} to reach dimensions p and q (we keep the same notation for both dimensions for simplicity). [sent-313, score-0.253]

46 If γ ∈ (0, 1], n^{1/2}λn tends to 0 and λn n^{1/2+γ/2} tends to infinity, then any global minimizer ŴA of (1/2n) ∑_{i=1}^{n} (zi − tr W⊤Mi)² + λn ‖AWB‖∗ is consistent and rank consistent. [sent-317, score-0.396]
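
A hedged sketch (ours) of one way to form the adaptive weight matrices A and B from the least-squares estimate ŴLS described above, with weights scaling as s^{−γ} and small singular values floored at n^{−1/2}; the exact construction in the paper may differ in details, so treat the function as illustrative only.

```python
import numpy as np

def adaptive_weights(W_ls, n, gamma=0.5):
    """Build A (p x p) and B (q x q) so that the adaptive penalty is lam * ||A W B||_*."""
    p, q = W_ls.shape
    U, s, Vt = np.linalg.svd(W_ls, full_matrices=True)
    s = np.maximum(s, n ** -0.5)                      # numerical floor, echoing the completion by n^{-1/2}
    s_p = np.concatenate([s, np.full(p - s.size, n ** -0.5)])
    s_q = np.concatenate([s, np.full(q - s.size, n ** -0.5)])
    A = U @ np.diag(s_p ** -gamma) @ U.T              # heavier penalty on low-signal left directions
    B = Vt.T @ np.diag(s_q ** -gamma) @ Vt            # heavier penalty on low-signal right directions
    return A, B
```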

47 3, we illustrate the previous theorem on synthetic examples. [sent-321, score-0.253]

48 In particular, we exhibit some singular behavior for the limiting case γ = 1. [sent-322, score-0.253]

49 Because the dual norm of the trace norm is the spectral norm (see Appendix B), the dual is easily obtained as (8): max_{‖V‖₂ ⩽ 1} −½ vec(Q − λV)⊤ Σ^{−1} vec(Q − λV). [sent-333, score-0.35]
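
A quick numerical check (ours) of this duality: for any matrix A, the rank-one matrix u1 v1⊤ built from its leading singular pair has trace norm exactly 1 and achieves an inner product with A equal to the spectral norm ‖A‖₂.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 4))
U, s, Vt = np.linalg.svd(A)

M = np.outer(U[:, 0], Vt[0, :])            # rank one, ||M||_* = 1
print(np.isclose(np.sum(M * A), s[0]))     # <M, A> = ||A||_2  -> True
print(np.isclose(np.linalg.svd(M, compute_uv=False).sum(), 1.0))   # True
```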

50 , that depends only on singular values of V, equal to B(V) = ∑_{i=1}^{min{p,q}} b(si(V)) where b(s) = (1 + s) log(1 + s) + (1 − s) log(1 − s) if |s| ⩽ 1 and +∞ otherwise (si(V) denotes the i-th largest singular value of V). [sent-349, score-0.506]

51 Derivatives of spectral functions Note that derivatives of spectral functions of the form B(W) = ∑_{i=1}^{min{p,q}} b(si(W)), where b is an even twice differentiable function such that b(0) = b′(0) = 0, are easily calculated as follows; Let U Diag(s)V⊤ be the singular value decomposition of W. [sent-358, score-0.346]

52 We then have the following Taylor expansion (Lewis and Sendov, 2002): B(W + ∆) = B(W) + tr ∆⊤ U Diag(b′(si)) V⊤ + ½ ∑_{i=1}^{p} ∑_{j=1}^{q} [(b′(si) − b′(sj)) / (si − sj)] (ui⊤∆vj)², where the vector of singular values is completed by zeros, and si ≠ sj. [sent-359, score-0.331]

[Figure 2: Examples of paths of singular values versus −log(λ) for consistent (‖Λ‖₂ < 1, top) and inconsistent (‖Λ‖₂ > 1, bottom) rank selection: regular trace norm penalization (left) and adaptive penalization with γ = 1/2 (center) and γ = 1 (right). Estimated singular values are plotted in plain, while population singular values are dotted.]

60 In Figure 2, we plot regularization paths for n = 10³, by showing the singular values of Ŵ compared to the singular values of W, in two particular situations (Eq. [sent-407, score-0.546]

61 (5) satisfied and not satisfied), for the regular trace norm regularization and the adaptive versions, with γ = 1/2 and γ = 1. [sent-409, score-0.323]

62 Note that in the consistent case (top), the singular values and their cardinalities are well jointly estimated, both for the non adaptive version (as predicted by Theorem 12) and the adaptive 1. [sent-410, score-0.493]

63 However in the inconsistent case, the non adaptive regularization scheme (bottom left) cannot achieve regular consistency together with rank consistency (Theorem 13), while the adaptive schemes can. [sent-416, score-0.584]

64 Note the particular behavior of the limiting case γ = 1, which still achieves both consistencies but with a singular behavior for large λ. [sent-417, score-0.253]

65 However, for the adaptive versions, it does, still with a somewhat singular behavior of the limiting case γ = 1. [sent-425, score-0.331]

66 Conclusion We have presented an analysis of the rank consistency for the penalization by the trace norm, and derived general necessary and sufficient conditions. [sent-430, score-0.295]

67 Tools for Analysis of Singular Value Decomposition In this appendix, we review and derive precise results regarding singular value decompositions. [sent-436, score-0.253]

[Figure residue: panels of P(correct rank) and log(RMS) versus −log(λ) for consistent and inconsistent cases, with non adaptive and adaptive (γ = 1/2, γ = 1) penalizations. Figure 4: Synthetic example where consistency condition in Eq.]

71 Note that when a singular value si is simple, that is, does not coalesce with any other singular values, then the vectors ui and vi are uniquely defined up to simultaneous sign flips, that is, only the matrix ui vi⊤ is unique. [sent-481, score-0.576]

72 However, when some singular values coalesce, then the corresponding singular vectors are defined up to a rotation, and thus in general care must be taken and considering isolated singular vectors should be avoided (Stewart and Sun, 1990). [sent-482, score-0.759]

73 All tools presented in this appendix are robust to the particular choice of the singular vectors. [sent-483, score-0.253]

74 1 Jordan-Wielandt Matrix We use the fact that singular values of W can be obtained from the eigenvalues of the Jordan-Wielandt matrix W̄ = [0, W; W⊤, 0] ∈ ℝ^{(p+q)×(p+q)} (Stewart and Sun, 1990). [sent-485, score-0.278]

75 , r, where si are the (strictly positive) singular values of W, with eigenvectors (1/√2)[ui; vi] and (1/√2)[ui; −vi], where ui, vi are the left and right associated singular vectors. [sent-489, score-0.545]
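
A short check (ours) of this correspondence: the nonzero eigenvalues of the Jordan-Wielandt matrix come in pairs ±si, and the rest are zero.

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.standard_normal((5, 3)) @ rng.standard_normal((3, 4))   # a 5x4 matrix of rank 3
p, q = W.shape
W_bar = np.block([[np.zeros((p, p)), W], [W.T, np.zeros((q, q))]])

eig = np.sort(np.linalg.eigvalsh(W_bar))
s = np.linalg.svd(W, compute_uv=False)
expected = np.sort(np.concatenate([s, -s, np.zeros(p + q - 2 * s.size)]))
print(np.allclose(eig, expected))   # True: eigenvalues of W_bar are +/- the singular values of W (plus zeros)
```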

76 We let s1 and sr denote the largest and smallest strictly positive singular values of W; if ‖∆‖₂ < sr/2, then W + ∆ has r singular values strictly greater than sr/2 and the remaining ones are strictly less than sr/2 (Stewart and Sun, 1990). [sent-501, score-0.416]

77 Proposition 17 Assume W has rank r and ‖∆‖₂ < sr/4 where sr is the smallest positive singular value of W. [sent-508, score-0.656]

78 The variations of Π(W̄) translate immediately into variations of the singular projections UU⊤ and VV⊤. [sent-512, score-0.253]

79 Similarly, when restricted to the small singular values, the first order expansion is (I − UU⊤)∆(I − VV⊤), with error term bounded in spectral norm by (4/sr)‖∆‖₂². [sent-514, score-0.364]

80 Those results lead to the following proposition that gives a local sufficient condition for rank(W + ∆) > rank(W): Proposition 18 Assume W has rank r < min{p, q} with ordered singular value decomposition W = U Diag(s)V⊤. [sent-515, score-0.495]

81 Proof The trace norm ‖W + ∆‖∗ may be divided into the sum of the r largest and the sum of the remaining singular values. [sent-530, score-0.462]

82 For the first r singular values, we need to upper-bound the second derivative of the sum of the r largest eigenvalues of W̄ + ∆̄ with strictly positive eigengap, which leads to the given bound by using the same Cauchy residue technique described in Appendix A. [sent-532, score-0.313]

83 We can thus compute the squared Frobenius norm: ‖(1/n) ∑_{k=1}^{n} vec(M̂k) vec(M̂k)⊤ − Σ̃yy ⊗ Σ̃xx‖²_F = (1/n²) tr[ Diag(vec(S − n/(nx ny))) (Ỹ⊤Ỹ ⊗ X̃⊤X̃) Diag(vec(S − n/(nx ny))) (Ỹ⊤Ỹ ⊗ X̃⊤X̃) ] = (1/n²) ∑_{i,j,i′,j′} (S_{ij} − n/(nx ny)) (Ỹ⊤Ỹ ⊗ X̃⊤X̃)_{ij,i′j′} (S_{i′j′} − n/(nx ny)) (Ỹ⊤Ỹ ⊗ X̃⊤X̃)_{i′j′,ij}. [sent-539, score-0.291]

84 We have, by properties of sampling without replacement (Hoeffding, 1963): E[(S_{ij} − n/(nx ny))(S_{i′j′} − n/(nx ny))] = (n/(nx ny))(1 − n/(nx ny)) if (i, j) = (i′, j′), and E[(S_{ij} − n/(nx ny))(S_{i′j′} − n/(nx ny))] = −(n/(nx ny))(1 − n/(nx ny)) / (nx ny − 1) if (i, j) ≠ (i′, j′). [sent-540, score-0.512]

85 This implies E(‖(1/n) ∑_{k=1}^{n} vec(M̂k) vec(M̂k)⊤ − Σ̃yy ⊗ Σ̃xx‖²_F | X̃, Ỹ) ⩽ (1/(nx ny n)) ∑_{i,j} (Ỹ⊤Ỹ ⊗ X̃⊤X̃)²_{ij,ij} − (1/((nx ny − 1) nx ny n)) ∑_{(i,j)≠(i′,j′)} (Ỹ⊤Ỹ ⊗ X̃⊤X̃)²_{ij,i′j′} ⩽ (1/(nx ny n)) ∑_{i,j} ‖x̃i‖₂⁴ ‖ỹj‖₂⁴. [sent-541, score-0.626]

86 2 Proof of Proposition 4 We may first restrict the minimization to the ball {W : ‖W‖∗ ⩽ ‖Σ̂mm^{−1} Σ̂Mz‖∗} because the optimum. [sent-556, score-0.297]

87 By Proposition 18 in Appendix B, if (4n^{−1/2}/sr) ‖∆̂‖₂² < ‖U⊥⊤∆̂V⊥‖₂, then rank(Ŵ) > r. [sent-562, score-0.297]

88 By the dominated convergence theorem, f(C) converges to one when C → ∞. [sent-564, score-0.327]

89 By the asymptotic normality result, ∃n0 > 0 such that ∀n > n0, P((4C0^{−1/2}/sr) ‖∆̂‖₂² < ‖U⊥⊤∆̂V⊥‖₂) > f(C0) − ε/2 > 1 − ε, which concludes the proof, because P((4n^{−1/2}/sr) ‖∆̂‖₂² < ‖U⊥⊤∆̂V⊥‖₂) ⩾ P((4C0^{−1/2}/sr) ‖∆̂‖₂² < ‖U⊥⊤∆̂V⊥‖₂) as soon as n > C0. [sent-566, score-0.416]

90 We consider the following events: E0 = {rank(Ŵ) = r}, E1 = {n^{−1/2} ‖∆̂‖₂ < sr/2}, E2 = {(4n^{−1/2}/sr) ‖∆̂‖₂² < ‖U⊥⊤∆̂V⊥‖₂}. [sent-574, score-0.264]

91 For any A, when η tends to zero, the indicator function 1_{‖U⊥⊤∆(A)V⊥‖₂ ⩽ η} converges to 1_{‖U⊥⊤∆(A)V⊥‖₂ = 0}, which is equal to 1_{‖Λ(A)‖₂ ⩽ λ0}, where vec(Λ(A)) = [(V⊥ ⊗ U⊥)⊤ Σmm^{−1} (V⊥ ⊗ U⊥)]^{−1} (V⊥ ⊗ U⊥)⊤ Σmm^{−1} ((V ⊗ U) vec(I) − vec(A)). [sent-579, score-0.403]

92 By the dominated convergence theorem, P(‖U⊥⊤∆(A)V⊥‖₂ ⩽ η) converges to P(‖Λ(A)‖₂ ⩽ λ0), which is the proposed limit. [sent-580, score-0.327]

93 A sufficient condition for rank consistency is the following: we let Ŵ = USV⊤ denote the singular value decomposition of Ŵ and we let Uo and Vo denote the singular vectors corresponding to all but the r largest singular values. [sent-586, score-0.998]

94 Since we have simultaneous singular value decompositions, a sufficient condition is that rank(Ŵ) ⩾ r and ‖Uo⊤(Σ̂mm(Ŵ − W) − Σ̂Mε)Vo‖₂ < λn(1 − η). [sent-587, score-0.55]

95 7 Proof of Proposition 11 The optimal ∆ ∈ ℝ^{p×q} should be such that U⊥⊤∆V⊥ has low rank, where U⊥ ∈ ℝ^{p×(p−r)} and V⊥ ∈ ℝ^{q×(q−r)} are orthogonal complements of the singular vectors U and V. [sent-599, score-0.276]

96 We can solve explicitly for ∆ and Λ which leads to vec(Λ) = [(V⊥ ⊗ U⊥)⊤ Σmm^{−1} (V⊥ ⊗ U⊥)]^{−1} (V⊥ ⊗ U⊥)⊤ Σmm^{−1} (V ⊗ U) vec(I), and vec(∆) = −Σmm^{−1} vec(UV⊤ − U⊥ΛV⊥⊤). [sent-602, score-0.594]

97 From the regular consistency, the rank of Ŵ is, with probability tending to one, larger than r (because the rank is a lower semi-continuous function). [sent-614, score-0.366]

98 We let Ŵ = USV⊤ denote the singular value decomposition of Ŵ and we let Uo and Vo denote the singular vectors corresponding to all but the r largest singular values. [sent-616, score-0.786]

99 Since we have simultaneous singular value decompositions, we simply need to show that ‖Uo⊤(Σ̂mm(Ŵ − W) − Σ̂Mε)Vo‖₂ < λn with probability tending to one. [sent-617, score-0.297]

100 10 Proof of Theorem 15 We let ULS^r and VLS^r denote the first r columns of ULS and VLS, and ULS^o and VLS^o the remaining columns; we also denote by s^r the corresponding first r singular values and by s^o the remaining singular values. [sent-633, score-0.638]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('vec', 0.688), ('mm', 0.297), ('singular', 0.253), ('uls', 0.225), ('diag', 0.153), ('rank', 0.139), ('sr', 0.132), ('lasso', 0.125), ('xx', 0.104), ('nx', 0.102), ('orm', 0.095), ('trw', 0.095), ('bach', 0.092), ('yy', 0.09), ('trace', 0.083), ('inimization', 0.081), ('norm', 0.078), ('adaptive', 0.078), ('tends', 0.076), ('proposition', 0.076), ('uo', 0.075), ('consistency', 0.073), ('onsistency', 0.073), ('ls', 0.07), ('vo', 0.069), ('mz', 0.068), ('sls', 0.068), ('race', 0.066), ('ny', 0.064), ('awb', 0.061), ('vls', 0.061), ('mi', 0.06), ('directional', 0.058), ('non', 0.057), ('pq', 0.05), ('uu', 0.05), ('mk', 0.049), ('abernethy', 0.048), ('tending', 0.044), ('regular', 0.044), ('inconsistent', 0.042), ('srebro', 0.041), ('tru', 0.041), ('regularization', 0.04), ('si', 0.039), ('int', 0.038), ('yuan', 0.038), ('strictly', 0.035), ('rq', 0.035), ('tr', 0.035), ('subdifferential', 0.035), ('fazel', 0.034), ('recht', 0.034), ('norms', 0.033), ('rr', 0.033), ('collaborative', 0.033), ('spectral', 0.033), ('minimizer', 0.031), ('matrix', 0.031), ('zou', 0.031), ('uv', 0.031), ('invertibility', 0.031), ('converges', 0.03), ('vv', 0.028), ('martingale', 0.027), ('ro', 0.027), ('uouo', 0.027), ('vovo', 0.027), ('invertible', 0.027), ('decomposition', 0.027), ('consistent', 0.027), ('matrices', 0.027), ('group', 0.026), ('log', 0.026), ('jk', 0.025), ('eigenvalues', 0.025), ('lim', 0.024), ('zero', 0.024), ('projector', 0.023), ('supn', 0.023), ('wls', 0.023), ('orthogonal', 0.023), ('rms', 0.023), ('covariance', 0.023), ('normal', 0.023), ('converging', 0.022), ('boyd', 0.022), ('asymptotically', 0.021), ('stewart', 0.021), ('loading', 0.021), ('jc', 0.021), ('borwein', 0.02), ('eigengap', 0.02), ('oo', 0.02), ('rennie', 0.02), ('trv', 0.02), ('usv', 0.02), ('zhao', 0.02), ('vandenberghe', 0.02), ('concludes', 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999833 26 jmlr-2008-Consistency of Trace Norm Minimization

Author: Francis R. Bach

Abstract: Regularization by the sum of singular values, also referred to as the trace norm, is a popular technique for estimating low rank rectangular matrices. In this paper, we extend some of the consistency results of the Lasso to provide necessary and sufficient conditions for rank consistency of trace norm minimization with the square loss. We also provide an adaptive version that is rank consistent even when the necessary condition for the non adaptive version is not fulfilled. Keywords: convex optimization, singular value decomposition, trace norm, consistency

2 0.11238918 27 jmlr-2008-Consistency of the Group Lasso and Multiple Kernel Learning

Author: Francis R. Bach

Abstract: We consider the least-square regression problem with regularization by a block ℓ1-norm, that is, a sum of Euclidean norms over spaces of dimensions larger than one. This problem, referred to as the group Lasso, extends the usual regularization by the ℓ1-norm where all spaces have dimension one, where it is commonly referred to as the Lasso. In this paper, we study the asymptotic group selection consistency of the group Lasso. We derive necessary and sufficient conditions for the consistency of group Lasso under practical assumptions, such as model misspecification. When the linear predictors and Euclidean norms are replaced by functions and reproducing kernel Hilbert norms, the problem is usually referred to as multiple kernel learning and is commonly used for learning from heterogeneous data sources and for non linear variable selection. Using tools from functional analysis, and in particular covariance operators, we extend the consistency results to this infinite dimensional case and also propose an adaptive scheme to obtain a consistent model estimate, even when the necessary condition required for the non adaptive scheme is not satisfied. Keywords: sparsity, regularization, consistency, convex optimization, covariance operators

3 0.11086307 75 jmlr-2008-Optimal Solutions for Sparse Principal Component Analysis

Author: Alexandre d'Aspremont, Francis Bach, Laurent El Ghaoui

Abstract: Given a sample covariance matrix, we examine the problem of maximizing the variance explained by a linear combination of the input variables while constraining the number of nonzero coefficients in this combination. This is known as sparse principal component analysis and has a wide array of applications in machine learning and engineering. We formulate a new semidefinite relaxation to this problem and derive a greedy algorithm that computes a full set of good solutions for all target numbers of non zero coefficients, with total complexity O(n3 ), where n is the number of variables. We then use the same relaxation to derive sufficient conditions for global optimality of a solution, which can be tested in O(n3 ) per pattern. We discuss applications in subset selection and sparse recovery and show on artificial examples and biological data that our algorithm does provide globally optimal solutions in many cases. Keywords: PCA, subset selection, sparse eigenvalues, sparse recovery, lasso

4 0.085994981 62 jmlr-2008-Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data

Author: Onureena Banerjee, Laurent El Ghaoui, Alexandre d'Aspremont

Abstract: We consider the problem of estimating the parameters of a Gaussian or binary distribution in such a way that the resulting undirected graphical model is sparse. Our approach is to solve a maximum likelihood problem with an added ℓ1-norm penalty term. The problem as formulated is convex but the memory requirements and complexity of existing interior point methods are prohibitive for problems with more than tens of nodes. We present two new algorithms for solving problems with at least a thousand nodes in the Gaussian case. Our first algorithm uses block coordinate descent, and can be interpreted as recursive ℓ1-norm penalized regression. Our second algorithm, based on Nesterov’s first order method, yields a complexity estimate with a better dependence on problem size than existing interior point methods. Using a log determinant relaxation of the log partition function (Wainwright and Jordan, 2006), we show that these same algorithms can be used to solve an approximate sparse maximum likelihood problem for the binary case. We test our algorithms on synthetic data, as well as on gene expression and senate voting records data. Keywords: model selection, maximum likelihood estimation, convex optimization, Gaussian graphical model, binary data

5 0.077270955 56 jmlr-2008-Magic Moments for Structured Output Prediction

Author: Elisa Ricci, Tijl De Bie, Nello Cristianini

Abstract: Most approaches to structured output prediction rely on a hypothesis space of prediction functions that compute their output by maximizing a linear scoring function. In this paper we present two novel learning algorithms for this hypothesis class, and a statistical analysis of their performance. The methods rely on efficiently computing the first two moments of the scoring function over the output space, and using them to create convex objective functions for training. We report extensive experimental results for sequence alignment, named entity recognition, and RNA secondary structure prediction. Keywords: structured output prediction, discriminative learning, Z-score, discriminant analysis, PAC bound

6 0.064420253 5 jmlr-2008-A New Algorithm for Estimating the Effective Dimension-Reduction Subspace

7 0.043950912 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods

8 0.043907661 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers

9 0.039017823 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers

10 0.037482589 17 jmlr-2008-Automatic PCA Dimension Selection for High Dimensional Data and Small Sample Sizes

11 0.033745389 92 jmlr-2008-Universal Multi-Task Kernels

12 0.033164449 86 jmlr-2008-SimpleMKL

13 0.030969987 78 jmlr-2008-Randomized Online PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension

14 0.029521996 12 jmlr-2008-Algorithms for Sparse Linear Classifiers in the Massive Data Setting

15 0.029405743 70 jmlr-2008-On Relevant Dimensions in Kernel Feature Spaces

16 0.027595421 1 jmlr-2008-A Bahadur Representation of the Linear Support Vector Machine

17 0.027104951 3 jmlr-2008-A Moment Bound for Multi-hinge Classifiers

18 0.025897954 34 jmlr-2008-Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks

19 0.025714546 11 jmlr-2008-Aggregation of SVM Classifiers Using Sobolev Spaces

20 0.025308874 69 jmlr-2008-Non-Parametric Modeling of Partially Ranked Data


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.147), (1, -0.063), (2, -0.095), (3, 0.015), (4, 0.121), (5, -0.033), (6, -0.059), (7, 0.085), (8, -0.163), (9, -0.103), (10, 0.011), (11, -0.11), (12, -0.109), (13, 0.183), (14, 0.222), (15, -0.203), (16, 0.009), (17, -0.063), (18, -0.291), (19, 0.025), (20, 0.098), (21, 0.031), (22, 0.086), (23, 0.002), (24, 0.037), (25, 0.029), (26, -0.052), (27, 0.152), (28, -0.114), (29, -0.013), (30, -0.067), (31, 0.155), (32, -0.061), (33, -0.033), (34, 0.085), (35, -0.004), (36, -0.096), (37, 0.061), (38, -0.04), (39, 0.049), (40, -0.044), (41, -0.069), (42, 0.072), (43, -0.072), (44, 0.025), (45, 0.027), (46, 0.044), (47, 0.044), (48, 0.066), (49, 0.058)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96877456 26 jmlr-2008-Consistency of Trace Norm Minimization

Author: Francis R. Bach

Abstract: Regularization by the sum of singular values, also referred to as the trace norm, is a popular technique for estimating low rank rectangular matrices. In this paper, we extend some of the consistency results of the Lasso to provide necessary and sufficient conditions for rank consistency of trace norm minimization with the square loss. We also provide an adaptive version that is rank consistent even when the necessary condition for the non adaptive version is not fulfilled. Keywords: convex optimization, singular value decomposition, trace norm, consistency

2 0.73689836 27 jmlr-2008-Consistency of the Group Lasso and Multiple Kernel Learning

Author: Francis R. Bach

Abstract: We consider the least-square regression problem with regularization by a block ℓ1-norm, that is, a sum of Euclidean norms over spaces of dimensions larger than one. This problem, referred to as the group Lasso, extends the usual regularization by the ℓ1-norm where all spaces have dimension one, where it is commonly referred to as the Lasso. In this paper, we study the asymptotic group selection consistency of the group Lasso. We derive necessary and sufficient conditions for the consistency of group Lasso under practical assumptions, such as model misspecification. When the linear predictors and Euclidean norms are replaced by functions and reproducing kernel Hilbert norms, the problem is usually referred to as multiple kernel learning and is commonly used for learning from heterogeneous data sources and for non linear variable selection. Using tools from functional analysis, and in particular covariance operators, we extend the consistency results to this infinite dimensional case and also propose an adaptive scheme to obtain a consistent model estimate, even when the necessary condition required for the non adaptive scheme is not satisfied. Keywords: sparsity, regularization, consistency, convex optimization, covariance operators

3 0.43408626 62 jmlr-2008-Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data

Author: Onureena Banerjee, Laurent El Ghaoui, Alexandre d'Aspremont

Abstract: We consider the problem of estimating the parameters of a Gaussian or binary distribution in such a way that the resulting undirected graphical model is sparse. Our approach is to solve a maximum likelihood problem with an added 1 -norm penalty term. The problem as formulated is convex but the memory requirements and complexity of existing interior point methods are prohibitive for problems with more than tens of nodes. We present two new algorithms for solving problems with at least a thousand nodes in the Gaussian case. Our first algorithm uses block coordinate descent, and can be interpreted as recursive 1 -norm penalized regression. Our second algorithm, based on Nesterov’s first order method, yields a complexity estimate with a better dependence on problem size than existing interior point methods. Using a log determinant relaxation of the log partition function (Wainwright and Jordan, 2006), we show that these same algorithms can be used to solve an approximate sparse maximum likelihood problem for the binary case. We test our algorithms on synthetic data, as well as on gene expression and senate voting records data. Keywords: model selection, maximum likelihood estimation, convex optimization, Gaussian graphical model, binary data

4 0.41912735 75 jmlr-2008-Optimal Solutions for Sparse Principal Component Analysis

Author: Alexandre d'Aspremont, Francis Bach, Laurent El Ghaoui

Abstract: Given a sample covariance matrix, we examine the problem of maximizing the variance explained by a linear combination of the input variables while constraining the number of nonzero coefficients in this combination. This is known as sparse principal component analysis and has a wide array of applications in machine learning and engineering. We formulate a new semidefinite relaxation to this problem and derive a greedy algorithm that computes a full set of good solutions for all target numbers of non zero coefficients, with total complexity O(n3 ), where n is the number of variables. We then use the same relaxation to derive sufficient conditions for global optimality of a solution, which can be tested in O(n3 ) per pattern. We discuss applications in subset selection and sparse recovery and show on artificial examples and biological data that our algorithm does provide globally optimal solutions in many cases. Keywords: PCA, subset selection, sparse eigenvalues, sparse recovery, lasso

5 0.34031743 56 jmlr-2008-Magic Moments for Structured Output Prediction

Author: Elisa Ricci, Tijl De Bie, Nello Cristianini

Abstract: Most approaches to structured output prediction rely on a hypothesis space of prediction functions that compute their output by maximizing a linear scoring function. In this paper we present two novel learning algorithms for this hypothesis class, and a statistical analysis of their performance. The methods rely on efficiently computing the first two moments of the scoring function over the output space, and using them to create convex objective functions for training. We report extensive experimental results for sequence alignment, named entity recognition, and RNA secondary structure prediction. Keywords: structured output prediction, discriminative learning, Z-score, discriminant analysis, PAC bound

6 0.32222921 5 jmlr-2008-A New Algorithm for Estimating the Effective Dimension-Reduction Subspace

7 0.20305113 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers

8 0.18916909 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers

9 0.17676616 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods

10 0.16832836 1 jmlr-2008-A Bahadur Representation of the Linear Support Vector Machine

11 0.14996465 3 jmlr-2008-A Moment Bound for Multi-hinge Classifiers

12 0.14439447 57 jmlr-2008-Manifold Learning: The Price of Normalization

13 0.13847901 34 jmlr-2008-Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks

14 0.13489473 12 jmlr-2008-Algorithms for Sparse Linear Classifiers in the Massive Data Setting

15 0.12859392 92 jmlr-2008-Universal Multi-Task Kernels

16 0.1230628 52 jmlr-2008-Learning from Multiple Sources

17 0.11700856 40 jmlr-2008-Graphical Methods for Efficient Likelihood Inference in Gaussian Covariance Models

18 0.11495373 79 jmlr-2008-Ranking Categorical Features Using Generalization Properties

19 0.11369272 76 jmlr-2008-Optimization Techniques for Semi-Supervised Support Vector Machines

20 0.11252119 86 jmlr-2008-SimpleMKL


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.025), (40, 0.023), (54, 0.038), (58, 0.106), (61, 0.419), (66, 0.102), (76, 0.017), (78, 0.017), (88, 0.052), (92, 0.026), (94, 0.036), (99, 0.019)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.84007162 6 jmlr-2008-A Recursive Method for Structural Learning of Directed Acyclic Graphs

Author: Xianchao Xie, Zhi Geng

Abstract: In this paper, we propose a recursive method for structural learning of directed acyclic graphs (DAGs), in which a problem of structural learning for a large DAG is first decomposed into two problems of structural learning for two small vertex subsets, each of which is then decomposed recursively into two problems of smaller subsets until none subset can be decomposed further. In our approach, search for separators of a pair of variables in a large DAG is localized to small subsets, and thus the approach can improve the efficiency of searches and the power of statistical tests for structural learning. We show how the recent advances in the learning of undirected graphical models can be employed to facilitate the decomposition. Simulations are given to demonstrate the performance of the proposed method. Keywords: Bayesian network, conditional independence, decomposition, directed acyclic graph, structural learning

same-paper 2 0.82030141 26 jmlr-2008-Consistency of Trace Norm Minimization

Author: Francis R. Bach

Abstract: Regularization by the sum of singular values, also referred to as the trace norm, is a popular technique for estimating low rank rectangular matrices. In this paper, we extend some of the consistency results of the Lasso to provide necessary and sufficient conditions for rank consistency of trace norm minimization with the square loss. We also provide an adaptive version that is rank consistent even when the necessary condition for the non adaptive version is not fulfilled. Keywords: convex optimization, singular value decomposition, trace norm, consistency

3 0.49682552 88 jmlr-2008-Structural Learning of Chain Graphs via Decomposition

Author: Zongming Ma, Xianchao Xie, Zhi Geng

Abstract: Chain graphs present a broad class of graphical models for description of conditional independence structures, including both Markov networks and Bayesian networks as special cases. In this paper, we propose a computationally feasible method for the structural learning of chain graphs based on the idea of decomposing the learning problem into a set of smaller scale problems on its decomposed subgraphs. The decomposition requires conditional independencies but does not require the separators to be complete subgraphs. Algorithms for both skeleton recovery and complex arrow orientation are presented. Simulations under a variety of settings demonstrate the competitive performance of our method, especially when the underlying graph is sparse. Keywords: chain graph, conditional independence, decomposition, graphical model, structural learning

4 0.44241995 20 jmlr-2008-Causal Reasoning with Ancestral Graphs    (Special Topic on Causality)

Author: Jiji Zhang

Abstract: Causal reasoning is primarily concerned with what would happen to a system under external interventions. In particular, we are often interested in predicting the probability distribution of some random variables that would result if some other variables were forced to take certain values. One prominent approach to tackling this problem is based on causal Bayesian networks, using directed acyclic graphs as causal diagrams to relate post-intervention probabilities to pre-intervention probabilities that are estimable from observational data. However, such causal diagrams are seldom fully testable given observational data. In consequence, many causal discovery algorithms based on data-mining can only output an equivalence class of causal diagrams (rather than a single one). This paper is concerned with causal reasoning given an equivalence class of causal diagrams, represented by a (partial) ancestral graph. We present two main results. The first result extends Pearl (1995)’s celebrated do-calculus to the context of ancestral graphs. In the second result, we focus on a key component of Pearl’s calculus—the property of invariance under interventions, and give stronger graphical conditions for this property than those implied by the first result. The second result also improves the earlier, similar results due to Spirtes et al. (1993). Keywords: ancestral graphs, causal Bayesian network, do-calculus, intervention

5 0.4354777 27 jmlr-2008-Consistency of the Group Lasso and Multiple Kernel Learning

Author: Francis R. Bach

Abstract: We consider the least-square regression problem with regularization by a block 1 -norm, that is, a sum of Euclidean norms over spaces of dimensions larger than one. This problem, referred to as the group Lasso, extends the usual regularization by the 1 -norm where all spaces have dimension one, where it is commonly referred to as the Lasso. In this paper, we study the asymptotic group selection consistency of the group Lasso. We derive necessary and sufficient conditions for the consistency of group Lasso under practical assumptions, such as model misspecification. When the linear predictors and Euclidean norms are replaced by functions and reproducing kernel Hilbert norms, the problem is usually referred to as multiple kernel learning and is commonly used for learning from heterogeneous data sources and for non linear variable selection. Using tools from functional analysis, and in particular covariance operators, we extend the consistency results to this infinite dimensional case and also propose an adaptive scheme to obtain a consistent model estimate, even when the necessary condition required for the non adaptive scheme is not satisfied. Keywords: sparsity, regularization, consistency, convex optimization, covariance operators

6 0.3934629 10 jmlr-2008-Active Learning of Causal Networks with Intervention Experiments and Optimal Designs    (Special Topic on Causality)

7 0.37063479 48 jmlr-2008-Learning Bounded Treewidth Bayesian Networks

8 0.35137075 84 jmlr-2008-Search for Additive Nonlinear Time Series Causal Models

9 0.34769717 75 jmlr-2008-Optimal Solutions for Sparse Principal Component Analysis

10 0.33656621 9 jmlr-2008-Active Learning by Spherical Subdivision

11 0.3363958 40 jmlr-2008-Graphical Methods for Efficient Likelihood Inference in Gaussian Covariance Models

12 0.32920614 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers

13 0.31904131 5 jmlr-2008-A New Algorithm for Estimating the Effective Dimension-Reduction Subspace

14 0.30782434 93 jmlr-2008-Using Markov Blankets for Causal Structure Learning    (Special Topic on Causality)

15 0.30576494 62 jmlr-2008-Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data

16 0.30364621 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model

17 0.30341396 24 jmlr-2008-Complete Identification Methods for the Causal Hierarchy    (Special Topic on Causality)

18 0.29699165 57 jmlr-2008-Manifold Learning: The Price of Normalization

19 0.29597974 56 jmlr-2008-Magic Moments for Structured Output Prediction

20 0.29562399 83 jmlr-2008-Robust Submodular Observation Selection