jmlr jmlr2008 jmlr2008-26 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Francis R. Bach
Abstract: Regularization by the sum of singular values, also referred to as the trace norm, is a popular technique for estimating low rank rectangular matrices. In this paper, we extend some of the consistency results of the Lasso to provide necessary and sufficient conditions for rank consistency of trace norm minimization with the square loss. We also provide an adaptive version that is rank consistent even when the necessary condition for the non adaptive version is not fulfilled. Keywords: convex optimization, singular value decomposition, trace norm, consistency
Reference: text
sentIndex sentText sentNum sentScore
1 In this paper, we extend some of the consistency results of the Lasso to provide necessary and sufficient conditions for rank consistency of trace norm minimization with the square loss. [sent-5, score-0.446]
2 We also provide an adaptive version that is rank consistent even when the necessary condition for the non adaptive version is not fulfilled. [sent-6, score-0.379]
3 Keywords: convex optimization, singular value decomposition, trace norm, consistency 1. [sent-7, score-0.409]
4 When learning on rectangular matrices, the rank is a natural extension of the cardinality, and the sum of singular values, also known as the trace norm or the nuclear norm, is the natural extension of the ℓ1-norm; indeed, as the ℓ1-norm is the convex envelope of the ℓ0-norm on the unit ball (i. [sent-13, score-0.553]
5 , the largest lower bounding convex function) (Boyd and Vandenberghe, 2003), the trace norm is the convex envelope of the rank over the unit ball of the spectral norm (Fazel et al. [sent-15, score-0.411]
6 In this paper, we consider the rank consistency of trace norm regularization with the square loss, that is, if the data were actually generated by a low-rank matrix, will the matrix and its rank be consistently estimated? [sent-25, score-0.583]
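A minimal sketch of how such a trace-norm regularized least-squares problem can be solved in practice, by proximal gradient descent with singular value soft-thresholding; the data layout, step size and iteration count below are illustrative assumptions, not the paper's own algorithm.

```python
import numpy as np

def prox_trace_norm(W, tau):
    """Singular value soft-thresholding: proximal operator of tau * ||.||_* at W."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def trace_norm_least_squares(M, z, lam, n_iter=500):
    """Minimize (1/2n) sum_i (z_i - tr(W^T M_i))^2 + lam * ||W||_*
    by proximal gradient descent (ISTA); M has shape (n, p, q)."""
    n, p, q = M.shape
    W = np.zeros((p, q))
    # Lipschitz constant of the quadratic part: largest eigenvalue of
    # (1/n) sum_i vec(M_i) vec(M_i)^T (assumed non-degenerate).
    Sigma = sum(np.outer(m.ravel(), m.ravel()) for m in M) / n
    L = np.linalg.eigvalsh(Sigma).max()
    for _ in range(n_iter):
        resid = z - np.einsum('ipq,pq->i', M, W)      # z_i - tr(W^T M_i)
        grad = -np.einsum('i,ipq->pq', resid, M) / n  # gradient of the quadratic part
        W = prox_trace_norm(W - grad / L, lam / L)
    return W
```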
7 In Section 4, we provide necessary and sufficient conditions for the rank consistency that are extensions of corresponding results for the Lasso (Yuan and Lin, 2007; Zhao and Yu, 2006; Zou, 2006) and the group Lasso (Bach, 2008). [sent-26, score-0.238]
8 In Appendix A and B, we review and derive relevant tools and results regarding perturbation of singular values as well as the trace norm. [sent-42, score-0.336]
9 (1) must lead to diagonal solutions (indeed the minimum trace norm matrix with fixed diagonal is the corresponding diagonal matrix, which is a consequence of Lemma 20 and Proposition 21) and for a diagonal matrix the trace norm is simply the ℓ1-norm of the diagonal. [sent-63, score-0.462]
10 Similarly, the optimal W must share the same block-diagonal form, and its singular values are exactly the norms of each block, that is, the trace norm is indeed the sum of the norms of each group. [sent-76, score-0.48]
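A quick numerical check of these two reductions (dimensions and random seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Diagonal case: the trace norm equals the l1-norm of the diagonal.
w = rng.standard_normal(5)
trace_norm = np.linalg.svd(np.diag(w), compute_uv=False).sum()
assert np.isclose(trace_norm, np.abs(w).sum())

# Block-diagonal case with vector blocks: the trace norm equals the sum
# of the block l2-norms, as in the group Lasso.
blocks = [rng.standard_normal(3), rng.standard_normal(2)]
W = np.zeros((sum(len(b) for b in blocks), len(blocks)))
row = 0
for j, b in enumerate(blocks):
    W[row:row + len(b), j] = b
    row += len(b)
trace_norm_blocks = np.linalg.svd(W, compute_uv=False).sum()
assert np.isclose(trace_norm_blocks, sum(np.linalg.norm(b) for b in blocks))
```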
11 Note that the Lasso and group Lasso can be seen as special cases where the singular vectors are fixed. [sent-79, score-0.279]
12 However, the main difficulty in analyzing trace norm regularization, as well as the main reason for its use, is that singular vectors are not fixed and those can often be seen as implicit features learned by the estimation procedure (Srebro et al. [sent-80, score-0.414]
13 , n we have a linear prediction model, where the loading matrix W is non trivial and rank-deficient, the goal being to estimate this rank (as well as the matrix itself). [sent-118, score-0.279]
14 We denote by W = U Diag(s)V⊤ its singular value decomposition, with U ∈ R^{p×r}, V ∈ R^{q×r}, where r ∈ (0, min{p, q}) denotes the rank of W. [sent-119, score-0.392]
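For concreteness, a small sketch of simulating data from this low-rank linear prediction model; the dimensions, noise level and Gaussian design are arbitrary choices, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(1)
p, q, r, n, sigma = 6, 5, 2, 2000, 0.1

# Low-rank loading matrix W = U Diag(s) V^T with rank r < min(p, q).
U, _ = np.linalg.qr(rng.standard_normal((p, r)))
V, _ = np.linalg.qr(rng.standard_normal((q, r)))
s = rng.uniform(1.0, 2.0, size=r)
W = U @ np.diag(s) @ V.T

# Observations z_i = tr(W^T M_i) + eps_i with random design matrices M_i.
M = rng.standard_normal((n, p, q))
z = np.einsum('ipq,pq->i', M, W) + sigma * rng.standard_normal(n)
```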
15 , ny } is sampled uniformly, then if nx, ny and n tend to infinity, then (A2) and (A3) are satisfied with Σmm = Σyy ⊗ Σxx and ζn = n^{−1/2} + nx^{−1/2} + ny^{−1/2}. [sent-156, score-0.396]
16 (1), that we will constantly use in the paper: Proposition 3 The matrix W with singular value decomposition W = U Diag(s)V⊤ (with positive singular values s) is optimal for the problem in Eq. [sent-159, score-0.564]
17 This implies notably that Ŵ and Σ̂mm Ŵ − Σ̂Mz have simultaneous singular value decompositions, and the largest singular values are less than λn, and exactly equal to λn for the corresponding strictly positive singular values of Ŵ. [sent-161, score-0.794]
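A sketch of how this first-order optimality condition can be checked numerically for a candidate solution; vec(.) is taken as a row-major flattening assumed consistent with how Σ̂mm and Σ̂Mz were built, and the tolerances are arbitrary.

```python
import numpy as np

def check_optimality(W_hat, Sigma_mm, Sigma_Mz_vec, lam, tol=1e-6):
    """Check that -G lies in lam times the subdifferential of the trace norm at W_hat,
    where G is the gradient of the quadratic part, reshaped as a p x q matrix."""
    p, q = W_hat.shape
    G = (Sigma_mm @ W_hat.ravel() - Sigma_Mz_vec).reshape(p, q)
    U, s, Vt = np.linalg.svd(W_hat, full_matrices=True)
    r = int((s > tol).sum())
    Ur, Vr = U[:, :r], Vt[:r, :].T
    Uo, Vo = U[:, r:], Vt[r:, :].T
    on_support = np.allclose(Ur.T @ (-G) @ Vr, lam * np.eye(r), atol=1e-5)
    cross_zero = (np.allclose(Ur.T @ (-G) @ Vo, 0, atol=1e-5)
                  and np.allclose(Uo.T @ (-G) @ Vr, 0, atol=1e-5))
    if min(p - r, q - r) > 0:
        off_support = np.linalg.norm(Uo.T @ (-G) @ Vo, 2) <= lam + 1e-5
    else:
        off_support = True
    return on_support and cross_zero and off_support
```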
18 We also consider the rank consistency, that is, we want that P(rank(Ŵ) ≠ rank(W)) tends to zero as n tends to infinity. [sent-166, score-0.315]
19 1); c) if λn tends to zero exactly at rate n^{−1/2}, then the estimator is consistent with error Op(n^{−1/2}) but the probability of estimating the correct rank is converging to a limit in (0, 1) (see Section 4. [sent-169, score-0.288]
20 2); d) if λn tends to zero more slowly than n^{−1/2}, then the estimate is consistent with error Op(λn) and its rank consistency depends on specific consistency conditions detailed in Section 4. [sent-170, score-0.412]
21 We now consider the corresponding rank consistency results, when λn goes to zero faster than n^{−1/2}. [sent-190, score-0.533]
22 3) states that for such a regularization parameter, the solution has rank strictly greater than r with probability tending to one and thus cannot be rank consistent: Proposition 7 Assume (A1-3). [sent-192, score-0.397]
23 If n^{1/2}λn tends to a limit λ0 > 0, then n^{1/2}(Ŵ − W) converges in distribution to the unique global minimizer of min_{∆ ∈ R^{p×q}} (1/2) vec(∆)⊤ Σmm vec(∆) − tr ∆⊤A + λ0 ( tr U⊤∆V + ‖U⊥⊤ ∆ V⊥‖_* ), where vec(A) ∈ R^{pq} is normally distributed with mean zero and covariance matrix σ² Σmm. [sent-200, score-0.265]
24 If n^{1/2}λn tends to a limit λ0 > 0, then the probability that the rank of Ŵ is different from the rank of W converges to P( ‖Λ − λ0^{−1} Θ‖_2 ≥ 1 ) ∈ (0, 1), where Λ ∈ R^{(p−r)×(q−r)} is defined in Eq. [sent-202, score-0.376]
25 3) and Θ ∈ R^{(p−r)×(q−r)} has a normal distribution with mean zero and covariance matrix σ² [ (V⊥ ⊗ U⊥)⊤ Σmm^{−1} (V⊥ ⊗ U⊥) ]^{−1}. [sent-204, score-0.398]
26 The previous proposition ensures that the estimate Ŵ cannot be rank consistent with this decay of the regularization parameter. [sent-205, score-0.282]
27 The condition U⊥⊤ ∆ V⊥ = 0 is thus necessary for rank consistency when λn n^{1/2} tends to infinity while λn tends to zero. [sent-221, score-0.364]
28 Lemma 11 Assume Σmm is invertible, and W = U Diag(s)V⊤ is the singular value decomposition of W. [sent-223, score-0.28]
29 Then the unique global minimizer of vec(∆)⊤ Σmm vec(∆) + tr U⊤∆V + ‖U⊥⊤ ∆ V⊥‖_* satisfies U⊥⊤ ∆ V⊥ = 0 if and only if ‖ [ (V⊥ ⊗ U⊥)⊤ Σmm^{−1} (V⊥ ⊗ U⊥) ]^{−1} (V⊥ ⊗ U⊥)⊤ Σmm^{−1} (V ⊗ U) vec(I) ‖_2 ≤ 1. [sent-224, score-0.625]
30 This leads us to consider the matrix Λ ∈ R^{(p−r)×(q−r)} defined as vec(Λ) = [ (V⊥ ⊗ U⊥)⊤ Σmm^{−1} (V⊥ ⊗ U⊥) ]^{−1} (V⊥ ⊗ U⊥)⊤ Σmm^{−1} (V ⊗ U) vec(I), (3) and the two weak and strict consistency conditions: ‖Λ‖_2 ≤ 1, (4) ‖Λ‖_2 < 1. [sent-225, score-0.698]
31 (5) is sufficient for rank consistency when n^{1/2}λn tends to infinity, while the condition Eq. [sent-228, score-0.288]
32 (4) is necessary for the existence of a sequence λn such that the estimate is both consistent and rank consistent (which is a stronger result than restricting λn to be tending to zero slower than n^{−1/2}). [sent-229, score-0.261]
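A sketch of how the matrix Λ of Eq. (3) and the conditions in Eqs. (4)-(5) can be evaluated numerically; Σmm is assumed to be given in the column-stacking vec convention that matches the Kronecker products.

```python
import numpy as np

def lambda_spectral_norm(Sigma_mm, U, V):
    """Return ||Lambda||_2 with Lambda defined as in Eq. (3); U (p x r) and V (q x r)
    span the singular subspaces of the population W."""
    p, r = U.shape
    q = V.shape[0]
    # Orthonormal bases of the orthogonal complements of the singular subspaces.
    U_perp = np.linalg.svd(np.eye(p) - U @ U.T)[0][:, :p - r]
    V_perp = np.linalg.svd(np.eye(q) - V @ V.T)[0][:, :q - r]
    Kc = np.kron(V_perp, U_perp)
    K = np.kron(V, U)
    Sinv = np.linalg.inv(Sigma_mm)
    A = Kc.T @ Sinv @ Kc
    b = Kc.T @ Sinv @ K @ np.eye(r).ravel(order='F')
    Lam = np.linalg.solve(A, b).reshape((p - r, q - r), order='F')
    return np.linalg.norm(Lam, 2)

# Weak condition (4): lambda_spectral_norm(...) <= 1; strict condition (5): strictly below 1.
```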
33 This is due to the fact that the first r singular vectors U and V of W + λn ∆ are not fixed; indeed, the r first singular vectors (i. [sent-247, score-0.506]
34 This is to be contrasted with the adaptive version where asymptotically the first order expansion has constant singular vectors (see Section 5). [sent-250, score-0.352]
35 Finally, in this paper, we have only proved whether the probability of correct rank selection tends to zero or one. [sent-251, score-0.239]
36 In this situation, the singular values of the diagonal matrix W = Diag(w) are the norms of the diagonal blocks, while the left singular vectors are equal to the normalized versions of the blocks (the signs for the Lasso). [sent-262, score-0.57]
37 We can then compute the singular value decomposition in closed form as U = ( Diag(wi/‖wi‖)_{i ≤ r} ; 0 ), V = ( I ; 0 ) and s = ( ‖wj‖ )_{j ≤ r}. [sent-286, score-0.28]
38 We can then put these singular vectors into Eq. [sent-288, score-0.253]
39 Thus, for the group Lasso, we finally obtain: Λ = Diag( ((Σxx^{−1})_{JcJc})^{−1} (Σxx^{−1})_{Jc,J} ηJ ), so that ‖Λ‖_2 = ‖ Diag( (Σxx)_{JcJ} ((Σxx)_{J,J})^{−1} ηJ ) ‖_2 by the partitioned matrix inversion lemma, = max_{i ∈ Jc} ‖ Σ_{xi xJ} (Σ_{xJ xJ})^{−1} ηJ ‖_2. [sent-293, score-0.261]
40 Moreover the condition ‖Λ‖_2 ≤ 1 is exactly the one for the group Lasso (Bach, 2008), where the pattern consistency is replaced by the consistency for the number of non-zero groups. [sent-298, score-0.253]
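A sketch of the corresponding group-Lasso computation; the grouping structure and the vector ηJ (stacked normalized active blocks) follow the reduction recalled above and should be read as assumptions about the notation.

```python
import numpy as np

def group_lasso_condition(Sigma_xx, groups, w, J):
    """max over inactive groups i of || Sigma_{x_i x_J} Sigma_{x_J x_J}^{-1} eta_J ||_2,
    where groups is a list of index arrays, J the set of active groups, and
    eta_J stacks the normalized active blocks w_j / ||w_j||."""
    idx_J = np.concatenate([groups[j] for j in J])
    eta_J = np.concatenate([w[groups[j]] / np.linalg.norm(w[groups[j]]) for j in J])
    S_JJ_inv = np.linalg.inv(Sigma_xx[np.ix_(idx_J, idx_J)])
    vals = []
    for i in range(len(groups)):
        if i in J:
            continue
        S_iJ = Sigma_xx[np.ix_(groups[i], idx_J)]
        vals.append(np.linalg.norm(S_iJ @ S_JJ_inv @ eta_J))
    return max(vals)
```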
41 More precisely, we consider the least-squares estimate vec(ŴLS) = Σ̂mm^{−1} vec(Σ̂Mz). [sent-305, score-0.297]
42 We have the following well-known result for least-squares regression: Lemma 14 Assume (A1-3). [sent-306, score-0.297]
43 We consider the singular value decomposition of ŴLS = ULS Diag(sLS) VLS⊤, where sLS ≥ 0. [sent-308, score-0.577]
44 With probability tending to one, min{p, q} singular values are strictly positive (i. [sent-309, score-0.332]
45 We complete the singular values sLS ∈ R^{min{p,q}} by n^{−1/2} to reach dimensions p and q (we keep the same notation for both dimensions for simplicity). [sent-313, score-0.253]
46 If γ ∈ (0, 1], n^{1/2}λn tends to 0 and λn n^{1/2+γ/2} tends to infinity, then any global minimizer ŴA of (1/2n) ∑_{i=1}^{n} ( zi − tr W⊤Mi )² + λn ‖A W B‖_* is consistent and rank consistent. [sent-317, score-0.349]
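A sketch of one plausible construction of the weight matrices A and B from the least-squares estimate, following the completion of the singular values by n^{−1/2} described above; the precise definition used in the paper may differ in details.

```python
import numpy as np

def adaptive_weights(W_ls, n, gamma):
    """Build square weight matrices A (p x p) and B (q x q) from the least-squares
    estimate, re-weighting directions by the completed singular values to the power -gamma."""
    p, q = W_ls.shape
    U, s, Vt = np.linalg.svd(W_ls, full_matrices=True)
    s_full = np.full(max(p, q), n ** -0.5)   # completion by n^{-1/2}
    s_full[:len(s)] = s
    A = U @ np.diag(s_full[:p] ** -gamma) @ U.T
    B = Vt.T @ np.diag(s_full[:q] ** -gamma) @ Vt
    return A, B

# The adaptive estimator then uses the penalty lam * ||A @ W @ B||_* in place of lam * ||W||_*.
```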
47 3, we illustrate the previous theorem on synthetic examples. [sent-321, score-0.253]
48 In particular, we exhibit some singular behavior for the limiting case γ = 1. [sent-322, score-0.253]
49 Because the dual norm of the trace norm is the spectral norm (see Appendix B), the dual is easily obtained as max −(1/2) vec(Q − λV)⊤ Σ̂mm^{−1} vec(Q − λV). (8) [sent-333, score-0.35]
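A small numerical illustration of this duality between the trace norm and the spectral norm (sizes and number of random trials are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.standard_normal((4, 3))
Uw, sw, Vwt = np.linalg.svd(W, full_matrices=False)
trace_norm = sw.sum()

# The maximizer of tr(W^T V) over ||V||_2 <= 1 is V = Uw Vw^T and attains the trace norm.
V_star = Uw @ Vwt
assert np.isclose(np.trace(W.T @ V_star), trace_norm)

# Random feasible points never exceed it.
for _ in range(100):
    V = rng.standard_normal((4, 3))
    V /= max(np.linalg.norm(V, 2), 1.0)   # crude projection onto the spectral-norm ball
    assert np.trace(W.T @ V) <= trace_norm + 1e-9
```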
50 , that depends only on singular values of V, equal to B(V) = ∑_{i=1}^{min{p,q}} b(si(V)) where b(s) = (1 + s) log(1 + s) + (1 − s) log(1 − s) if |s| ≤ 1 and +∞ otherwise (si(V) denotes the i-th largest singular value of V). [sent-349, score-0.506]
51 Derivatives of spectral functions Note that derivatives of spectral functions of the form B(W) = ∑_{i=1}^{min{p,q}} b(si(W)), where b is an even twice differentiable function such that b(0) = b′(0) = 0, are easily calculated as follows; let U Diag(s)V⊤ be the singular value decomposition of W. [sent-358, score-0.346]
52 We then have the following Taylor expansion (Lewis and Sendov, 2002): B(W + ∆) = B(W) + tr ∆⊤ U Diag(b′(si)) V⊤ + (1/2) ∑_{i=1}^{p} ∑_{j=1}^{q} [ (b′(si) − b′(sj)) / (si − sj) ] (ui⊤ ∆ vj)², where the vector of singular values is completed by zeros, and the ratio is understood as b″(si) when si = sj. [sent-359, score-0.331]
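A sketch checking the first-order term of this expansion against finite differences, using the function b defined above; the rescaling of W only keeps all singular values inside (−1, 1), and the tolerances are arbitrary.

```python
import numpy as np

def b(s):
    return (1 + s) * np.log(1 + s) + (1 - s) * np.log(1 - s)

def b_prime(s):
    return np.log(1 + s) - np.log(1 - s)

def B(W):
    return b(np.linalg.svd(W, compute_uv=False)).sum()

def B_grad(W):
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ np.diag(b_prime(s)) @ Vt     # U Diag(b'(s_i)) V^T

rng = np.random.default_rng(3)
W = rng.standard_normal((4, 3))
W *= 0.5 / np.linalg.norm(W, 2)             # keep all singular values inside (0, 1)

D = rng.standard_normal((4, 3))
eps = 1e-6
fd = (B(W + eps * D) - B(W - eps * D)) / (2 * eps)
assert np.isclose(fd, np.sum(B_grad(W) * D), rtol=1e-4)
```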
57 Figure 2: Examples of paths of singular values for ‖Λ‖_2 < 1 (consistent, top) and ‖Λ‖_2 > 1 (inconsistent, bottom) rank selection: regular trace norm penalization (left) and adaptive penalization with γ = 1/2 (center) and γ = 1 (right). [sent-390, score-0.253]
59 Estimated singular values are plotted in plain, while population singular values are dotted. [sent-393, score-0.506]
60 In Figure 2, we plot regularization paths for n = 10³, by showing the singular values of Ŵ compared to the singular values of W, in two particular situations (Eq. [sent-407, score-0.546]
61 (5) satisfied and not satisfied), for the regular trace norm regularization and the adaptive versions, with γ = 1/2 and γ = 1. [sent-409, score-0.323]
62 Note that in the consistent case (top), the singular values and their cardinalities are well jointly estimated, both for the non-adaptive version (as predicted by Theorem 12) and the adaptive versions. [sent-410, score-0.493]
63 However, in the inconsistent case, the non-adaptive regularization scheme (bottom left) cannot achieve regular consistency together with rank consistency (Theorem 13), while the adaptive schemes can. [sent-416, score-0.584]
64 Note the particular behavior of the limiting case γ = 1, which still achieves both consistencies but with a singular behavior for large λ. [sent-417, score-0.253]
65 However, for the adaptive versions, it does, still with a somewhat singular behavior of the limiting case γ = 1. [sent-425, score-0.331]
66 Conclusion We have presented an analysis of the rank consistency for the penalization by the trace norm, and derived general necessary and sufficient conditions. [sent-430, score-0.295]
67 Tools for Analysis of Singular Value Decomposition In this appendix, we review and derive precise results regarding singular value decompositions. [sent-436, score-0.253]
68 [Figure panels: P(correct rank) and log(RMS) plotted against −log(λ) for the consistent and inconsistent cases, with non-adaptive and adaptive (γ = 1/2, γ = 1) penalizations.]
70 Figure 4: Synthetic example where consistency condition in Eq. [sent-466, score-0.313]
71 Note that when a singular value si is simple, that is, does not coalesce with any other singular values, then the vectors ui and vi are uniquely defined up to simultaneous sign flips, that is, only the matrix ui vi⊤ is unique. [sent-481, score-0.576]
72 However, when some singular values coalesce, then the corresponding singular vectors are defined up to a rotation, and thus in general care must be taken and considering isolated singular vectors should be avoided (Stewart and Sun, 1990). [sent-482, score-0.759]
73 All tools presented in this appendix are robust to the particular choice of the singular vectors. [sent-483, score-0.253]
74 Jordan-Wielandt Matrix We use the fact that singular values of W can be obtained from the eigenvalues of the Jordan-Wielandt matrix W̄ = [0, W; W⊤, 0] ∈ R^{(p+q)×(p+q)} (Stewart and Sun, 1990). [sent-485, score-0.278]
75 , r, where si are the (strictly positive) singular values of W, with eigenvectors (1/√2)(ui; vi) and (1/√2)(ui; −vi), where ui, vi are the associated left and right singular vectors. [sent-489, score-0.545]
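A quick numerical check of this correspondence between the eigenvalues of the Jordan-Wielandt matrix and the singular values of W (sizes arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
p, q = 4, 3
W = rng.standard_normal((p, q))

# Jordan-Wielandt matrix: its nonzero eigenvalues are +/- the singular values of W.
W_bar = np.block([[np.zeros((p, p)), W], [W.T, np.zeros((q, q))]])
eigs = np.sort(np.linalg.eigvalsh(W_bar))
svals = np.linalg.svd(W, compute_uv=False)
expected = np.sort(np.concatenate([svals, -svals, np.zeros(p + q - 2 * len(svals))]))
assert np.allclose(eigs, expected)
```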
76 We denote by s1 and sr the largest and smallest strictly positive singular values of W; if ‖∆‖_2 < sr/2, then W + ∆ has r singular values strictly greater than sr/2 and the remaining ones are strictly less than sr/2 (Stewart and Sun, 1990). [sent-501, score-1.139]
77 Proposition 17 Assume W has rank r and ‖∆‖_2 < sr/4, where sr is the smallest positive singular value of W. [sent-508, score-0.656]
78 The variations of Π(W̄) translate immediately into variations of the singular projections UU⊤ and VV⊤. [sent-512, score-0.253]
79 Similarly, when restricted to the small singular values, the first order expansion is (I − UU⊤) ∆ (I − VV⊤), with error term bounded in spectral norm by (4/sr) ‖∆‖_2². [sent-514, score-0.364]
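A small numerical check of this first-order behaviour of the trailing singular values under a perturbation of a rank-r matrix; sizes and the perturbation scale are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)
p, q, r = 5, 4, 2
W = rng.standard_normal((p, r)) @ rng.standard_normal((q, r)).T   # rank-r matrix
U, s, Vt = np.linalg.svd(W, full_matrices=True)
Ur, Vr = U[:, :r], Vt[:r, :].T

Delta = 1e-4 * rng.standard_normal((p, q))
# Trailing singular values of W + Delta agree, to first order, with those of the
# perturbation projected onto the orthogonal complements of the singular subspaces.
tail = np.linalg.svd(W + Delta, compute_uv=False)[r:]
proj = (np.eye(p) - Ur @ Ur.T) @ Delta @ (np.eye(q) - Vr @ Vr.T)
tail_proj = np.linalg.svd(proj, compute_uv=False)[:len(tail)]
print(np.max(np.abs(tail - tail_proj)))   # O(||Delta||_2^2), much smaller than ||Delta||_2
```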
80 Those results lead to the following proposition that gives a local sufficient condition for rank(W + ∆) > rank(W): Proposition 18 Assume W has rank r < min{p, q} with ordered singular value decomposition W = U Diag(s)V⊤. [sent-515, score-0.495]
81 Proof The trace norm ‖W + ∆‖_* may be divided into the sum of the r largest and the sum of the remaining singular values. [sent-530, score-0.414]
82 For the first r singular values, we need to upper-bound the second derivative of the sum of the r largest eigenvalues of W̄ + ∆̄ with strictly positive eigengap, which leads to the given bound by using the same Cauchy residue technique described in Appendix A. [sent-532, score-0.313]
83 We can thus compute the squared Frobenius norm: ‖ (1/n) ∑_{k=1}^{n} vec(Mk) vec(Mk)⊤ − Σ̂yy ⊗ Σ̂xx ‖_F² = (1/n²) tr [ Diag(vec(S − n/(nx ny))) (Ỹ⊤Ỹ ⊗ X̃⊤X̃) Diag(vec(S − n/(nx ny))) (Ỹ⊤Ỹ ⊗ X̃⊤X̃) ] = (1/n²) ∑_{i,j,i′,j′} (Sij − n/(nx ny)) (Ỹ⊤Ỹ ⊗ X̃⊤X̃)_{ij,i′j′} (Si′j′ − n/(nx ny)) (Ỹ⊤Ỹ ⊗ X̃⊤X̃)_{i′j′,ij}. [sent-539, score-0.291]
84 We have, by properties of sampling without replacement (Hoeffding, 1963): E[ (Sij − n/(nx ny)) (Si′j′ − n/(nx ny)) ] = (n/(nx ny)) (1 − n/(nx ny)) if (i, j) = (i′, j′), and E[ (Sij − n/(nx ny)) (Si′j′ − n/(nx ny)) ] = −(1/(nx ny − 1)) (n/(nx ny)) (1 − n/(nx ny)) if (i, j) ≠ (i′, j′). [sent-540, score-0.512]
85 This implies E( ‖ (1/n) ∑_{k=1}^{n} vec(Mk) vec(Mk)⊤ − Σ̂yy ⊗ Σ̂xx ‖_F² | X̃, Ỹ ) = (1/(nx ny n)) ∑_{i,j} (Ỹ⊤Ỹ ⊗ X̃⊤X̃)²_{ij,ij} − (1/((nx ny − 1) nx ny n)) ∑_{(i,j)≠(i′,j′)} (Ỹ⊤Ỹ ⊗ X̃⊤X̃)²_{ij,i′j′} ≤ (1/(nx ny n)) ∑_{i,j} ‖x̃i‖_2⁴ ‖ỹj‖_2⁴. [sent-541, score-0.626]
86 2 Proof of Proposition 4 We may first restrict the minimization to the ball {W : ‖W‖_* ≤ ‖Σ̂mm^{−1} Σ̂Mz‖_*} because the optimum lies in this ball. [sent-556, score-0.297]
87 By Proposition 18 in Appendix B, if (4 n^{−1/2}/sr) ‖∆̂‖_2² < ‖U⊥⊤ ∆̂ V⊥‖_2, then rank(Ŵ) > r. [sent-562, score-0.297]
88 By the dominated convergence theorem, f(C) converges to one when C → ∞. [sent-564, score-0.327]
89 By the asymptotic normality result, there exists n0 > 0 such that for all n > n0, P( (4 C0^{−1/2}/sr) ‖∆̂‖_2² < ‖U⊥⊤ ∆̂ V⊥‖_2 ) > f(C0) − ε/2 > 1 − ε, which concludes the proof, because P( (4 n^{−1/2}/sr) ‖∆̂‖_2² < ‖U⊥⊤ ∆̂ V⊥‖_2 ) ≥ P( (4 C0^{−1/2}/sr) ‖∆̂‖_2² < ‖U⊥⊤ ∆̂ V⊥‖_2 ) as soon as n > C0. [sent-566, score-0.416]
90 We consider the following events: E0 = { rank(Ŵ) = r }, E1 = { n^{−1/2} ‖∆̂‖_2 < sr/2 }, E2 = { (4 n^{−1/2}/sr) ‖∆̂‖_2² < ‖U⊥⊤ ∆̂ V⊥‖_2 }. [sent-574, score-0.264]
91 For any A, when η tends to zero, the indicator function 1{ ‖U⊥⊤ ∆(A) V⊥‖_2 ≥ η } converges to 1{ ‖U⊥⊤ ∆(A) V⊥‖_2 ≠ 0 }, which is equal to 1{ ‖Λ(A)‖_2 ≥ λ0 }, where vec(Λ(A)) = [ (V⊥ ⊗ U⊥)⊤ Σmm^{−1} (V⊥ ⊗ U⊥) ]^{−1} (V⊥ ⊗ U⊥)⊤ Σmm^{−1} ( (V ⊗ U) vec(I) − vec(A) ). [sent-579, score-0.403]
92 By the dominated convergence theorem, P( ‖U⊥⊤ ∆(A) V⊥‖_2 ≥ η ) converges to a = P( ‖Λ(A)‖_2 ≥ λ0 ), which is the proposed limit. [sent-580, score-0.327]
93 A sufficient condition for rank consistency is the following: we denote by Ŵ = U S V⊤ the singular value decomposition of Ŵ, and by Uo and Vo the singular vectors corresponding to all but the r largest singular values. [sent-586, score-0.998]
94 Since we have simultaneous singular value decompositions, a sufficient condition is that rank(Ŵ) ≥ r and ‖ Uo⊤ ( Σ̂mm (Ŵ − W) − Σ̂Mε ) Vo ‖_2 < λn (1 − η). [sent-587, score-0.55]
95 7 Proof of Proposition 11 The optimal ∆ ∈ R^{p×q} should be such that U⊥⊤ ∆ V⊥ has low rank, where U⊥ ∈ R^{p×(p−r)} and V⊥ ∈ R^{q×(q−r)} are orthogonal complements of the singular vectors U and V. [sent-599, score-0.276]
96 We can solve explicitly for ∆ and Λ, which leads to vec(Λ) = [ (V⊥ ⊗ U⊥)⊤ Σmm^{−1} (V⊥ ⊗ U⊥) ]^{−1} (V⊥ ⊗ U⊥)⊤ Σmm^{−1} (V ⊗ U) vec(I), and vec(∆) = −Σmm^{−1} vec(U V⊤ − U⊥ Λ V⊥⊤). [sent-602, score-0.594]
97 From the regular consistency, the rank of Ŵ is, with probability tending to one, at least r (because the rank is a lower semi-continuous function). [sent-614, score-0.366]
98 We denote by Ŵ = U S V⊤ the singular value decomposition of Ŵ, and by Uo and Vo the singular vectors corresponding to all but the r largest singular values. [sent-616, score-0.786]
99 Since we have simultaneous singular value decompositions, we simply need to show that ‖ Uo⊤ ( Σ̂mm (Ŵ − W) − Σ̂Mε ) Vo ‖_2 < λn with probability tending to one. [sent-617, score-0.297]
100 10 Proof of Theorem 15 We denote by ULS^r and VLS^r the first r columns of ULS and VLS, and by ULS^o and VLS^o the remaining columns; we also denote by s^r the corresponding first r singular values and by s^o the remaining singular values. [sent-633, score-0.638]
wordName wordTfidf (topN-words)
[('vec', 0.688), ('mm', 0.297), ('singular', 0.253), ('uls', 0.225), ('diag', 0.153), ('rank', 0.139), ('sr', 0.132), ('lasso', 0.125), ('xx', 0.104), ('nx', 0.102), ('orm', 0.095), ('trw', 0.095), ('bach', 0.092), ('yy', 0.09), ('trace', 0.083), ('inimization', 0.081), ('norm', 0.078), ('adaptive', 0.078), ('tends', 0.076), ('proposition', 0.076), ('uo', 0.075), ('consistency', 0.073), ('onsistency', 0.073), ('ls', 0.07), ('vo', 0.069), ('mz', 0.068), ('sls', 0.068), ('race', 0.066), ('ny', 0.064), ('awb', 0.061), ('vls', 0.061), ('mi', 0.06), ('directional', 0.058), ('non', 0.057), ('pq', 0.05), ('uu', 0.05), ('mk', 0.049), ('abernethy', 0.048), ('tending', 0.044), ('regular', 0.044), ('inconsistent', 0.042), ('srebro', 0.041), ('tru', 0.041), ('regularization', 0.04), ('si', 0.039), ('int', 0.038), ('yuan', 0.038), ('strictly', 0.035), ('rq', 0.035), ('tr', 0.035), ('subdifferential', 0.035), ('fazel', 0.034), ('recht', 0.034), ('norms', 0.033), ('rr', 0.033), ('collaborative', 0.033), ('spectral', 0.033), ('minimizer', 0.031), ('matrix', 0.031), ('zou', 0.031), ('uv', 0.031), ('invertibility', 0.031), ('converges', 0.03), ('vv', 0.028), ('martingale', 0.027), ('ro', 0.027), ('uouo', 0.027), ('vovo', 0.027), ('invertible', 0.027), ('decomposition', 0.027), ('consistent', 0.027), ('matrices', 0.027), ('group', 0.026), ('log', 0.026), ('jk', 0.025), ('eigenvalues', 0.025), ('lim', 0.024), ('zero', 0.024), ('projector', 0.023), ('supn', 0.023), ('wls', 0.023), ('orthogonal', 0.023), ('rms', 0.023), ('covariance', 0.023), ('normal', 0.023), ('converging', 0.022), ('boyd', 0.022), ('asymptotically', 0.021), ('stewart', 0.021), ('loading', 0.021), ('jc', 0.021), ('borwein', 0.02), ('eigengap', 0.02), ('oo', 0.02), ('rennie', 0.02), ('trv', 0.02), ('usv', 0.02), ('zhao', 0.02), ('vandenberghe', 0.02), ('concludes', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999833 26 jmlr-2008-Consistency of Trace Norm Minimization
Author: Francis R. Bach
Abstract: Regularization by the sum of singular values, also referred to as the trace norm, is a popular technique for estimating low rank rectangular matrices. In this paper, we extend some of the consistency results of the Lasso to provide necessary and sufficient conditions for rank consistency of trace norm minimization with the square loss. We also provide an adaptive version that is rank consistent even when the necessary condition for the non adaptive version is not fulfilled. Keywords: convex optimization, singular value decomposition, trace norm, consistency
2 0.11238918 27 jmlr-2008-Consistency of the Group Lasso and Multiple Kernel Learning
Author: Francis R. Bach
Abstract: We consider the least-square regression problem with regularization by a block 1 -norm, that is, a sum of Euclidean norms over spaces of dimensions larger than one. This problem, referred to as the group Lasso, extends the usual regularization by the 1 -norm where all spaces have dimension one, where it is commonly referred to as the Lasso. In this paper, we study the asymptotic group selection consistency of the group Lasso. We derive necessary and sufficient conditions for the consistency of group Lasso under practical assumptions, such as model misspecification. When the linear predictors and Euclidean norms are replaced by functions and reproducing kernel Hilbert norms, the problem is usually referred to as multiple kernel learning and is commonly used for learning from heterogeneous data sources and for non linear variable selection. Using tools from functional analysis, and in particular covariance operators, we extend the consistency results to this infinite dimensional case and also propose an adaptive scheme to obtain a consistent model estimate, even when the necessary condition required for the non adaptive scheme is not satisfied. Keywords: sparsity, regularization, consistency, convex optimization, covariance operators
3 0.11086307 75 jmlr-2008-Optimal Solutions for Sparse Principal Component Analysis
Author: Alexandre d'Aspremont, Francis Bach, Laurent El Ghaoui
Abstract: Given a sample covariance matrix, we examine the problem of maximizing the variance explained by a linear combination of the input variables while constraining the number of nonzero coefficients in this combination. This is known as sparse principal component analysis and has a wide array of applications in machine learning and engineering. We formulate a new semidefinite relaxation to this problem and derive a greedy algorithm that computes a full set of good solutions for all target numbers of non zero coefficients, with total complexity O(n3 ), where n is the number of variables. We then use the same relaxation to derive sufficient conditions for global optimality of a solution, which can be tested in O(n3 ) per pattern. We discuss applications in subset selection and sparse recovery and show on artificial examples and biological data that our algorithm does provide globally optimal solutions in many cases. Keywords: PCA, subset selection, sparse eigenvalues, sparse recovery, lasso
4 0.085994981 62 jmlr-2008-Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data
Author: Onureena Banerjee, Laurent El Ghaoui, Alexandre d'Aspremont
Abstract: We consider the problem of estimating the parameters of a Gaussian or binary distribution in such a way that the resulting undirected graphical model is sparse. Our approach is to solve a maximum likelihood problem with an added 1 -norm penalty term. The problem as formulated is convex but the memory requirements and complexity of existing interior point methods are prohibitive for problems with more than tens of nodes. We present two new algorithms for solving problems with at least a thousand nodes in the Gaussian case. Our first algorithm uses block coordinate descent, and can be interpreted as recursive 1 -norm penalized regression. Our second algorithm, based on Nesterov’s first order method, yields a complexity estimate with a better dependence on problem size than existing interior point methods. Using a log determinant relaxation of the log partition function (Wainwright and Jordan, 2006), we show that these same algorithms can be used to solve an approximate sparse maximum likelihood problem for the binary case. We test our algorithms on synthetic data, as well as on gene expression and senate voting records data. Keywords: model selection, maximum likelihood estimation, convex optimization, Gaussian graphical model, binary data
5 0.077270955 56 jmlr-2008-Magic Moments for Structured Output Prediction
Author: Elisa Ricci, Tijl De Bie, Nello Cristianini
Abstract: Most approaches to structured output prediction rely on a hypothesis space of prediction functions that compute their output by maximizing a linear scoring function. In this paper we present two novel learning algorithms for this hypothesis class, and a statistical analysis of their performance. The methods rely on efficiently computing the first two moments of the scoring function over the output space, and using them to create convex objective functions for training. We report extensive experimental results for sequence alignment, named entity recognition, and RNA secondary structure prediction. Keywords: structured output prediction, discriminative learning, Z-score, discriminant analysis, PAC bound
6 0.064420253 5 jmlr-2008-A New Algorithm for Estimating the Effective Dimension-Reduction Subspace
7 0.043950912 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods
8 0.043907661 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
9 0.039017823 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers
10 0.037482589 17 jmlr-2008-Automatic PCA Dimension Selection for High Dimensional Data and Small Sample Sizes
11 0.033745389 92 jmlr-2008-Universal Multi-Task Kernels
12 0.033164449 86 jmlr-2008-SimpleMKL
13 0.030969987 78 jmlr-2008-Randomized Online PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension
14 0.029521996 12 jmlr-2008-Algorithms for Sparse Linear Classifiers in the Massive Data Setting
15 0.029405743 70 jmlr-2008-On Relevant Dimensions in Kernel Feature Spaces
16 0.027595421 1 jmlr-2008-A Bahadur Representation of the Linear Support Vector Machine
17 0.027104951 3 jmlr-2008-A Moment Bound for Multi-hinge Classifiers
18 0.025897954 34 jmlr-2008-Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks
19 0.025714546 11 jmlr-2008-Aggregation of SVM Classifiers Using Sobolev Spaces
20 0.025308874 69 jmlr-2008-Non-Parametric Modeling of Partially Ranked Data
topicId topicWeight
[(0, 0.147), (1, -0.063), (2, -0.095), (3, 0.015), (4, 0.121), (5, -0.033), (6, -0.059), (7, 0.085), (8, -0.163), (9, -0.103), (10, 0.011), (11, -0.11), (12, -0.109), (13, 0.183), (14, 0.222), (15, -0.203), (16, 0.009), (17, -0.063), (18, -0.291), (19, 0.025), (20, 0.098), (21, 0.031), (22, 0.086), (23, 0.002), (24, 0.037), (25, 0.029), (26, -0.052), (27, 0.152), (28, -0.114), (29, -0.013), (30, -0.067), (31, 0.155), (32, -0.061), (33, -0.033), (34, 0.085), (35, -0.004), (36, -0.096), (37, 0.061), (38, -0.04), (39, 0.049), (40, -0.044), (41, -0.069), (42, 0.072), (43, -0.072), (44, 0.025), (45, 0.027), (46, 0.044), (47, 0.044), (48, 0.066), (49, 0.058)]
simIndex simValue paperId paperTitle
same-paper 1 0.96877456 26 jmlr-2008-Consistency of Trace Norm Minimization
Author: Francis R. Bach
Abstract: Regularization by the sum of singular values, also referred to as the trace norm, is a popular technique for estimating low rank rectangular matrices. In this paper, we extend some of the consistency results of the Lasso to provide necessary and sufficient conditions for rank consistency of trace norm minimization with the square loss. We also provide an adaptive version that is rank consistent even when the necessary condition for the non adaptive version is not fulfilled. Keywords: convex optimization, singular value decomposition, trace norm, consistency
2 0.73689836 27 jmlr-2008-Consistency of the Group Lasso and Multiple Kernel Learning
Author: Francis R. Bach
Abstract: We consider the least-square regression problem with regularization by a block 1 -norm, that is, a sum of Euclidean norms over spaces of dimensions larger than one. This problem, referred to as the group Lasso, extends the usual regularization by the 1 -norm where all spaces have dimension one, where it is commonly referred to as the Lasso. In this paper, we study the asymptotic group selection consistency of the group Lasso. We derive necessary and sufficient conditions for the consistency of group Lasso under practical assumptions, such as model misspecification. When the linear predictors and Euclidean norms are replaced by functions and reproducing kernel Hilbert norms, the problem is usually referred to as multiple kernel learning and is commonly used for learning from heterogeneous data sources and for non linear variable selection. Using tools from functional analysis, and in particular covariance operators, we extend the consistency results to this infinite dimensional case and also propose an adaptive scheme to obtain a consistent model estimate, even when the necessary condition required for the non adaptive scheme is not satisfied. Keywords: sparsity, regularization, consistency, convex optimization, covariance operators
3 0.43408626 62 jmlr-2008-Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data
Author: Onureena Banerjee, Laurent El Ghaoui, Alexandre d'Aspremont
Abstract: We consider the problem of estimating the parameters of a Gaussian or binary distribution in such a way that the resulting undirected graphical model is sparse. Our approach is to solve a maximum likelihood problem with an added 1 -norm penalty term. The problem as formulated is convex but the memory requirements and complexity of existing interior point methods are prohibitive for problems with more than tens of nodes. We present two new algorithms for solving problems with at least a thousand nodes in the Gaussian case. Our first algorithm uses block coordinate descent, and can be interpreted as recursive 1 -norm penalized regression. Our second algorithm, based on Nesterov’s first order method, yields a complexity estimate with a better dependence on problem size than existing interior point methods. Using a log determinant relaxation of the log partition function (Wainwright and Jordan, 2006), we show that these same algorithms can be used to solve an approximate sparse maximum likelihood problem for the binary case. We test our algorithms on synthetic data, as well as on gene expression and senate voting records data. Keywords: model selection, maximum likelihood estimation, convex optimization, Gaussian graphical model, binary data
4 0.41912735 75 jmlr-2008-Optimal Solutions for Sparse Principal Component Analysis
Author: Alexandre d'Aspremont, Francis Bach, Laurent El Ghaoui
Abstract: Given a sample covariance matrix, we examine the problem of maximizing the variance explained by a linear combination of the input variables while constraining the number of nonzero coefficients in this combination. This is known as sparse principal component analysis and has a wide array of applications in machine learning and engineering. We formulate a new semidefinite relaxation to this problem and derive a greedy algorithm that computes a full set of good solutions for all target numbers of non zero coefficients, with total complexity O(n3 ), where n is the number of variables. We then use the same relaxation to derive sufficient conditions for global optimality of a solution, which can be tested in O(n3 ) per pattern. We discuss applications in subset selection and sparse recovery and show on artificial examples and biological data that our algorithm does provide globally optimal solutions in many cases. Keywords: PCA, subset selection, sparse eigenvalues, sparse recovery, lasso
5 0.34031743 56 jmlr-2008-Magic Moments for Structured Output Prediction
Author: Elisa Ricci, Tijl De Bie, Nello Cristianini
Abstract: Most approaches to structured output prediction rely on a hypothesis space of prediction functions that compute their output by maximizing a linear scoring function. In this paper we present two novel learning algorithms for this hypothesis class, and a statistical analysis of their performance. The methods rely on efficiently computing the first two moments of the scoring function over the output space, and using them to create convex objective functions for training. We report extensive experimental results for sequence alignment, named entity recognition, and RNA secondary structure prediction. Keywords: structured output prediction, discriminative learning, Z-score, discriminant analysis, PAC bound
6 0.32222921 5 jmlr-2008-A New Algorithm for Estimating the Effective Dimension-Reduction Subspace
7 0.20305113 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
8 0.18916909 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers
9 0.17676616 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods
10 0.16832836 1 jmlr-2008-A Bahadur Representation of the Linear Support Vector Machine
11 0.14996465 3 jmlr-2008-A Moment Bound for Multi-hinge Classifiers
12 0.14439447 57 jmlr-2008-Manifold Learning: The Price of Normalization
13 0.13847901 34 jmlr-2008-Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks
14 0.13489473 12 jmlr-2008-Algorithms for Sparse Linear Classifiers in the Massive Data Setting
15 0.12859392 92 jmlr-2008-Universal Multi-Task Kernels
16 0.1230628 52 jmlr-2008-Learning from Multiple Sources
17 0.11700856 40 jmlr-2008-Graphical Methods for Efficient Likelihood Inference in Gaussian Covariance Models
18 0.11495373 79 jmlr-2008-Ranking Categorical Features Using Generalization Properties
19 0.11369272 76 jmlr-2008-Optimization Techniques for Semi-Supervised Support Vector Machines
20 0.11252119 86 jmlr-2008-SimpleMKL
topicId topicWeight
[(0, 0.025), (40, 0.023), (54, 0.038), (58, 0.106), (61, 0.419), (66, 0.102), (76, 0.017), (78, 0.017), (88, 0.052), (92, 0.026), (94, 0.036), (99, 0.019)]
simIndex simValue paperId paperTitle
1 0.84007162 6 jmlr-2008-A Recursive Method for Structural Learning of Directed Acyclic Graphs
Author: Xianchao Xie, Zhi Geng
Abstract: In this paper, we propose a recursive method for structural learning of directed acyclic graphs (DAGs), in which a problem of structural learning for a large DAG is first decomposed into two problems of structural learning for two small vertex subsets, each of which is then decomposed recursively into two problems of smaller subsets until none subset can be decomposed further. In our approach, search for separators of a pair of variables in a large DAG is localized to small subsets, and thus the approach can improve the efficiency of searches and the power of statistical tests for structural learning. We show how the recent advances in the learning of undirected graphical models can be employed to facilitate the decomposition. Simulations are given to demonstrate the performance of the proposed method. Keywords: Bayesian network, conditional independence, decomposition, directed acyclic graph, structural learning
same-paper 2 0.82030141 26 jmlr-2008-Consistency of Trace Norm Minimization
Author: Francis R. Bach
Abstract: Regularization by the sum of singular values, also referred to as the trace norm, is a popular technique for estimating low rank rectangular matrices. In this paper, we extend some of the consistency results of the Lasso to provide necessary and sufficient conditions for rank consistency of trace norm minimization with the square loss. We also provide an adaptive version that is rank consistent even when the necessary condition for the non adaptive version is not fulfilled. Keywords: convex optimization, singular value decomposition, trace norm, consistency
3 0.49682552 88 jmlr-2008-Structural Learning of Chain Graphs via Decomposition
Author: Zongming Ma, Xianchao Xie, Zhi Geng
Abstract: Chain graphs present a broad class of graphical models for description of conditional independence structures, including both Markov networks and Bayesian networks as special cases. In this paper, we propose a computationally feasible method for the structural learning of chain graphs based on the idea of decomposing the learning problem into a set of smaller scale problems on its decomposed subgraphs. The decomposition requires conditional independencies but does not require the separators to be complete subgraphs. Algorithms for both skeleton recovery and complex arrow orientation are presented. Simulations under a variety of settings demonstrate the competitive performance of our method, especially when the underlying graph is sparse. Keywords: chain graph, conditional independence, decomposition, graphical model, structural learning
4 0.44241995 20 jmlr-2008-Causal Reasoning with Ancestral Graphs (Special Topic on Causality)
Author: Jiji Zhang
Abstract: Causal reasoning is primarily concerned with what would happen to a system under external interventions. In particular, we are often interested in predicting the probability distribution of some random variables that would result if some other variables were forced to take certain values. One prominent approach to tackling this problem is based on causal Bayesian networks, using directed acyclic graphs as causal diagrams to relate post-intervention probabilities to pre-intervention probabilities that are estimable from observational data. However, such causal diagrams are seldom fully testable given observational data. In consequence, many causal discovery algorithms based on data-mining can only output an equivalence class of causal diagrams (rather than a single one). This paper is concerned with causal reasoning given an equivalence class of causal diagrams, represented by a (partial) ancestral graph. We present two main results. The first result extends Pearl (1995)’s celebrated do-calculus to the context of ancestral graphs. In the second result, we focus on a key component of Pearl’s calculus—the property of invariance under interventions, and give stronger graphical conditions for this property than those implied by the first result. The second result also improves the earlier, similar results due to Spirtes et al. (1993). Keywords: ancestral graphs, causal Bayesian network, do-calculus, intervention
5 0.4354777 27 jmlr-2008-Consistency of the Group Lasso and Multiple Kernel Learning
Author: Francis R. Bach
Abstract: We consider the least-square regression problem with regularization by a block 1 -norm, that is, a sum of Euclidean norms over spaces of dimensions larger than one. This problem, referred to as the group Lasso, extends the usual regularization by the 1 -norm where all spaces have dimension one, where it is commonly referred to as the Lasso. In this paper, we study the asymptotic group selection consistency of the group Lasso. We derive necessary and sufficient conditions for the consistency of group Lasso under practical assumptions, such as model misspecification. When the linear predictors and Euclidean norms are replaced by functions and reproducing kernel Hilbert norms, the problem is usually referred to as multiple kernel learning and is commonly used for learning from heterogeneous data sources and for non linear variable selection. Using tools from functional analysis, and in particular covariance operators, we extend the consistency results to this infinite dimensional case and also propose an adaptive scheme to obtain a consistent model estimate, even when the necessary condition required for the non adaptive scheme is not satisfied. Keywords: sparsity, regularization, consistency, convex optimization, covariance operators
7 0.37063479 48 jmlr-2008-Learning Bounded Treewidth Bayesian Networks
8 0.35137075 84 jmlr-2008-Search for Additive Nonlinear Time Series Causal Models
9 0.34769717 75 jmlr-2008-Optimal Solutions for Sparse Principal Component Analysis
10 0.33656621 9 jmlr-2008-Active Learning by Spherical Subdivision
11 0.3363958 40 jmlr-2008-Graphical Methods for Efficient Likelihood Inference in Gaussian Covariance Models
12 0.32920614 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers
13 0.31904131 5 jmlr-2008-A New Algorithm for Estimating the Effective Dimension-Reduction Subspace
14 0.30782434 93 jmlr-2008-Using Markov Blankets for Causal Structure Learning (Special Topic on Causality)
15 0.30576494 62 jmlr-2008-Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data
16 0.30364621 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model
17 0.30341396 24 jmlr-2008-Complete Identification Methods for the Causal Hierarchy (Special Topic on Causality)
18 0.29699165 57 jmlr-2008-Manifold Learning: The Price of Normalization
19 0.29597974 56 jmlr-2008-Magic Moments for Structured Output Prediction
20 0.29562399 83 jmlr-2008-Robust Submodular Observation Selection