jmlr jmlr2010 jmlr2010-105 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Rahul Mazumder, Trevor Hastie, Robert Tibshirani
Abstract: We use convex relaxation techniques to provide a sequence of regularized low-rank solutions for large-scale matrix completion problems. Using the nuclear norm as a regularizer, we provide a simple and very efficient convex algorithm for minimizing the reconstruction error subject to a bound on the nuclear norm. Our algorithm SOFT-IMPUTE iteratively replaces the missing elements with those obtained from a soft-thresholded SVD. With warm starts this allows us to efficiently compute an entire regularization path of solutions on a grid of values of the regularization parameter. The computationally intensive part of our algorithm is in computing a low-rank SVD of a dense matrix. Exploiting the problem structure, we show that the task can be performed with a complexity of order linear in the matrix dimensions. Our semidefinite-programming algorithm is readily scalable to large matrices; for example SOFT-IMPUTE takes a few hours to compute low-rank approximations of a 10^6 × 10^6 incomplete matrix with 10^7 observed entries, and fits a rank-95 approximation to the full Netflix training set in 3.3 hours. Our methods achieve good training and test errors and exhibit superior timings when compared to other competitive state-of-the-art techniques. Keywords: collaborative filtering, nuclear norm, spectral regularization, Netflix prize, large scale convex optimization
Reference: text
sentIndex sentText sentNum sentScore
1 Department of Health, Research and Policy, Stanford University, Stanford, CA 94305. Editor: Tommi Jaakkola. Abstract: We use convex relaxation techniques to provide a sequence of regularized low-rank solutions for large-scale matrix completion problems. [sent-4, score-0.148]
2 Using the nuclear norm as a regularizer, we provide a simple and very efficient convex algorithm for minimizing the reconstruction error subject to a bound on the nuclear norm. [sent-5, score-0.463]
3 With warm starts this allows us to efficiently compute an entire regularization path of solutions on a grid of values of the regularization parameter. [sent-7, score-0.159]
4 Keywords: collaborative filtering, nuclear norm, spectral regularization, Netflix prize, large scale convex optimization. [sent-13, score-0.254]
5 The problem is to “complete” the matrix based on the observed entries, and has been dubbed the matrix completion problem (Candès and Recht, 2008; Candès and Tao, 2009; Rennie and Srebro, 2005). [sent-15, score-0.118]
6 In this recommender-system example, low-rank structure suggests that movies can be grouped into a small number of “genres”, with G_{ℓj} the relative score for movie j in genre ℓ. [sent-35, score-0.223]
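To make the low-rank “genres” picture concrete, here is a small synthetic example in the spirit of the model just described; the dimensions, noise level, and fraction of observed entries are illustrative choices of ours, not values taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 200, 150, 6            # viewers, movies, number of "genres" (illustrative)

A = rng.normal(size=(m, k))      # viewer-genre affinities
G = rng.normal(size=(k, n))      # genre-movie scores, the G_{lj} of the text
Z_true = A @ G                   # rank-k "complete" ratings matrix

# Observe a random 20% of the entries, with a little additive noise.
mask = rng.random((m, n)) < 0.2
X = np.where(mask, Z_true + 0.1 * rng.normal(size=(m, n)), 0.0)
print(mask.sum(), "observed entries out of", m * n)
```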
7 The rank constraint in (1) makes the problem for general Ω combinatorially hard (Srebro and Jaakkola, 2003). [sent-50, score-0.158]
8 Here ‖Z‖_* is the nuclear norm of Z, that is, the sum of the singular values of Z. [sent-53, score-0.242]
9 In many situations the nuclear norm is an effective convex relaxation to the rank constraint (Fazel, 2002; Candès and Recht, 2008; Candès and Tao, 2009; Recht et al. [sent-54, score-0.418]
10 minimize_Z (1/2) ∑_{(i,j)∈Ω} (x_{ij} − z_{ij})² + λ‖Z‖_*  (3). Here λ ≥ 0 is a regularization parameter controlling the nuclear norm of the minimizer Ẑ_λ of (3); there is a 1-1 mapping between δ ≥ 0 and λ ≥ 0 over their active domains. [sent-61, score-0.269]
11 In this paper we propose an algorithm SOFT-IMPUTE for the nuclear norm regularized least-squares problem (3) that scales to large problems with m, n ≈ 10^5–10^6 with around 10^6–10^8 or more observed entries. [sent-62, score-0.24]
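A minimal dense-matrix sketch of the SOFT-IMPUTE iteration as described in the text (fill the missing entries with the current fit, soft-threshold the SVD of the filled matrix, repeat). The function names, tolerance, and stopping rule are our own choices; a scalable implementation would use a truncated SVD on the sparse-plus-low-rank structure discussed below rather than a full dense SVD at every step.

```python
import numpy as np

def soft_svd(W, lam):
    """Soft-thresholded SVD: S_lambda(W) = U D_lambda V', D_lambda = diag((d_i - lambda)_+)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U * np.maximum(s - lam, 0.0)) @ Vt

def soft_impute(X, mask, lam, max_iter=500, tol=1e-5):
    """Sketch of SOFT-IMPUTE for min_Z 0.5*||P_Omega(X - Z)||_F^2 + lam*||Z||_*."""
    Z = np.zeros_like(X)
    for _ in range(max_iter):
        filled = np.where(mask, X, Z)          # observed entries from X, missing ones from Z
        Z_new = soft_svd(filled, lam)          # shrink the singular values of the filled matrix
        rel_change = np.linalg.norm(Z_new - Z) ** 2 / max(np.linalg.norm(Z) ** 2, 1e-12)
        Z = Z_new
        if rel_change < tol:
            break
    return Z
```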
12 The computational aspects of the algorithm are described in Section 5, and Section 6 discusses how nuclear norm regularization can be generalized to more aggressive and general types of spectral regularization. [sent-79, score-0.321]
13 In (5), Y_SP has the same sparsity structure as the observed X, and Y_LR has rank r̃ ≪ m, n, where r̃ is very close to r ≪ m, n, the rank of the estimated matrix Z (upon convergence of the algorithm). [sent-111, score-0.375]
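One way to exploit the Y = Y_SP + Y_LR structure when computing the low-rank SVD is to wrap it as a linear operator whose matrix-vector products cost O(|Ω|) + O((m + n) r̃), and hand it to an iterative SVD solver. The sketch below uses SciPy's svds in place of the PROPACK routine used by the authors, and the function name and interface are ours.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, svds

def sparse_plus_lowrank_svd(Y_sp, U, d, Vt, r):
    """Rank-r SVD of Y = Y_sp + U @ diag(d) @ Vt without ever forming the dense sum.

    Y_sp     : scipy sparse matrix with the sparsity pattern of the observed entries
    U, d, Vt : low-rank factors of the current estimate
    """
    m, n = Y_sp.shape

    def matvec(x):                                  # cost O(|Omega|) + O((m + n) r)
        return Y_sp @ x + U @ (d * (Vt @ x))

    def rmatvec(y):
        return Y_sp.T @ y + Vt.T @ (d * (U.T @ y))

    op = LinearOperator((m, n), matvec=matvec, rmatvec=rmatvec, dtype=float)
    u, s, vt = svds(op, k=r)
    order = np.argsort(s)[::-1]                     # svds returns singular values in ascending order
    return u[:, order], s[order], vt[order]
```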
14 Although the nuclear norm is motivated here as a convex relaxation to a rank constraint, we believe in many situations it will outperform the rank-restricted estimator (1). [sent-121, score-0.418]
15 The nuclear norm plays the role of the ℓ1 penalty in matrix completion, just as the rank plays the role of the ℓ0 penalty. [sent-130, score-0.3]
16 where the proof uses the sub-gradient characterization of the nuclear norm. [sent-163, score-0.206]
17 We see test and training error in the top rows as a function of the nuclear norm, obtained from a grid of values Λ. [sent-194, score-0.256]
18 Unlike generic first-order methods (Nesterov, 2003) including competitive first-order methods for nuclear norm regularized problems (Cai et al. [sent-198, score-0.24]
19 SOFT-IMPUTE produces a sequence of solutions for which the criterion decreases to the optimal solution with every iteration, and the successive iterates get closer to the optimal set of solutions of problem (10). [sent-206, score-0.121]
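The claimed monotone decrease can be checked numerically by evaluating the regularized criterion at successive iterates; the helper below (our own, written against the nuclear-norm regularized least-squares criterion quoted earlier) can be printed inside the iteration of the sketch above and should produce a non-increasing sequence.

```python
import numpy as np

def objective(X, mask, Z, lam):
    """0.5 * ||P_Omega(X - Z)||_F^2 + lam * ||Z||_* -- the criterion SOFT-IMPUTE decreases."""
    fit = 0.5 * np.sum(np.where(mask, X - Z, 0.0) ** 2)
    nuclear_norm = np.linalg.svd(Z, compute_uv=False).sum()
    return fit + lam * nuclear_norm
```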
20 Lemma 3 The nuclear norm shrinkage operator S_λ(·) satisfies the following for any W_1, W_2 (with matching dimensions): ‖S_λ(W_1) − S_λ(W_2)‖_F² ≤ ‖W_1 − W_2‖_F². In particular this implies that S_λ(W) is a continuous map in W. [sent-218, score-0.259]
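A quick numerical sanity check of the non-expansiveness stated in Lemma 3; the soft-thresholded SVD is redefined locally so the snippet is self-contained, and the matrix sizes and λ are arbitrary.

```python
import numpy as np

def soft_svd(W, lam):
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U * np.maximum(s - lam, 0.0)) @ Vt

rng = np.random.default_rng(1)
W1, W2 = rng.normal(size=(40, 30)), rng.normal(size=(40, 30))
lam = 1.5
lhs = np.linalg.norm(soft_svd(W1, lam) - soft_svd(W2, lam), "fro") ** 2
rhs = np.linalg.norm(W1 - W2, "fro") ** 2
print(lhs <= rhs)   # Lemma 3: ||S(W1) - S(W2)||_F^2 <= ||W1 - W2||_F^2
```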
21 Though we cannot prove theoretically that every iterate of the sequence Z_λ^k will be of low rank, this observation holds in practice, owing to the manner in which we trace out the entire path of solutions using warm starts. [sent-284, score-0.121]
22 Assume that at the current iterate, the matrix Z_λ^k has rank r̃. [sent-287, score-0.197]
23 Supposing we want a rank-r̃ SVD of the matrix (22), the cost will be of the order of O(|Ω| r̃) + O((m + n) r̃²) (for that iteration, that is, to obtain Z_λ^{k+1} from Z_λ^k). [sent-296, score-0.197]
24 Suppose the rank of the solution Z_λ^k is r; then, in light of our above observations, r̃ ≈ r ≪ min{m, n} and the order is O(|Ω| r) + O((m + n) r²). [sent-297, score-0.18]
25 The algorithm of Cai et al. (2008) for exact matrix completion (4) involves calculating the SVD of a sparse matrix with cost O(|Ω|). [sent-303, score-0.118]
26 In addition, since the true rank of the matrix r ≪ min{m, n}, the computational cost of evaluating the truncated SVD (with rank ≈ r) is linear in matrix dimensions. [sent-310, score-0.394]
27 Generalized Spectral Regularization: From Soft to Hard Thresholding. In Section 1 we discussed the role of the nuclear norm as a convex surrogate for the rank of a matrix, and drew the analogy with LASSO regression versus best-subset selection. [sent-327, score-0.418]
28 This best rank-k solution also solves minimize_Z (1/2)‖P_Ω(X) − P_Ω(Z)‖_F² + λ ∑_j I(γ_j(Z) > 0), where γ_j(Z) is the jth singular value of Z, for a suitable choice of λ that produces a solution with rank k. [sent-332, score-0.286]
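The ℓ0 counterpart of the soft-thresholded SVD keeps the top singular values un-shrunk and discards the rest; below is a sketch of that operator, the building block of a HARD-IMPUTE-style fit, written by us rather than taken from the authors' code. For a suitable λ, the rank penalty displayed above selects exactly this rank-k solution.

```python
import numpy as np

def hard_svd(W, k):
    """Best rank-k approximation: keep the k largest singular values un-shrunk, zero the rest."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)   # s is returned in decreasing order
    s_trunc = np.where(np.arange(s.size) < k, s, 0.0)  # l0-style truncation, no shrinkage
    return (U * s_trunc) @ Vt
```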
29 This goes beyond nuclear norm regularization by using slightly more aggressive penalties that bridge the gap between ℓ1 (nuclear norm) and ℓ0 (rank constraint). [sent-354, score-0.291]
30 The solution is given by a thresholded SVD of W, S_λ^p(W) = U D_{p,λ} V′, where D_{p,λ} is an entry-wise thresholding of the diagonal entries of the matrix D consisting of the singular values of the matrix W. [sent-385, score-0.221]
31 (2007) and Bach (2008) also consider the rank estimation problem from a theoretical standpoint. [sent-391, score-0.158]
32 Post-processing of “Selectors” and Initialization. Because the ℓ1 norm regularizes by shrinking the singular values, the number of singular values retained (through cross-validation, say) may exceed the actual rank of the matrix. [sent-393, score-0.336]
33 Here r_λ is the rank of Z_λ and Z_λ = U D_λ V′ is its SVD. [sent-402, score-0.158]
34 Rather than estimating a diagonal matrix D_α as above, one can insert a matrix M_{r_λ × r_λ} between U and V above to obtain better training error for the same rank. [sent-405, score-0.117]
35 Hence, given U, V (each of rank r_λ) from the SOFT-IMPUTE algorithm, we solve M̂ = argmin_M ‖P_Ω(X) − P_Ω(U M V′)‖²,  (26) where Ẑ_λ = U M̂ V′. [sent-406, score-0.189]
36 Criterion (26) will definitely lead to a decrease in training error relative to that attained by Ẑ = U D_λ V′ of the same rank, and is potentially an attractive proposal for the original problem (1). [sent-409, score-0.197]
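A sketch of the post-processing step (26): given the factors U, V′ retained by SOFT-IMPUTE, re-fit the middle matrix M by least squares on the observed entries. The helper below is our own; restricting M to be diagonal recovers the simpler un-shrinking of the singular values in (25).

```python
import numpy as np

def unshrink(X, mask, U, Vt):
    """Solve min_M || P_Omega(X) - P_Omega(U M Vt) ||_F^2 and return the refitted matrix."""
    ii, jj = np.nonzero(mask)                         # indices of the observed entries
    r = U.shape[1]
    # Each observed x_ij is modeled as sum_{k,l} M[k, l] * U[i, k] * Vt[l, j].
    design = (U[ii, :, None] * Vt[:, jj].T[:, None, :]).reshape(ii.size, r * r)
    m_vec, *_ = np.linalg.lstsq(design, X[ii, jj], rcond=None)
    return U @ m_vec.reshape(r, r) @ Vt
```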
37 In many simulated examples we have observed that this post-processing step gives a good estimate of the underlying true rank of the matrix (based on prediction error). [sent-413, score-0.219]
38 A reasonable prescription for warm starts is the nuclear norm solution via SOFT-IMPUTE, or the post-processed version (25). [sent-415, score-0.262]
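How the warm-started path might look in code, with the fill-and-shrink iteration redefined so the snippet runs on its own: fit the largest λ first and start each subsequent fit from the previous solution. The grid, tolerance, and iteration cap are our own choices.

```python
import numpy as np

def soft_svd(W, lam):
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U * np.maximum(s - lam, 0.0)) @ Vt

def soft_impute_path(X, mask, lams, max_iter=500, tol=1e-5):
    """Solutions over a grid of lambdas, each fit warm-started at the previous one."""
    Z, path = np.zeros_like(X), []
    for lam in sorted(lams, reverse=True):            # largest lambda first
        for _ in range(max_iter):
            Z_new = soft_svd(np.where(mask, X, Z), lam)
            done = np.linalg.norm(Z_new - Z) ** 2 <= tol * max(np.linalg.norm(Z) ** 2, 1e-12)
            Z = Z_new
            if done:
                break
        path.append((lam, Z.copy()))
    return path
```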
39 SOFT-IMPUTE performs rank reduction and shrinkage at the same time, in one smooth convex operation. [sent-431, score-0.197]
40 In either case, no matter what r′ we use, the solution matrices U and V have the same rank as Ẑ. [sent-449, score-0.199]
41 We conjecture that rank [Z(λ)] is monotone non-increasing in λ. [sent-452, score-0.158]
42 However, in practice the red squares are not known to MMMF, nor is the actual rank of the solution. [sent-458, score-0.158]
43 Criterion (28) is convex in Z for every value of λ, and it outputs the solution Ẑ in the form of its soft-thresholded SVD, implying that the “factors” U, V are already orthogonal and the rank is known. [sent-466, score-0.2]
44 MMMF has two different tuning parameters r′ and λ, both of which are related to the rank or spectral properties of the matrices U,V . [sent-469, score-0.207]
45 (2009) which establish (29) for r = min{m, n}—we give a tighter estimate of the rank k of the underlying matrices. [sent-480, score-0.158]
46 The criterion (27) can be written as min_{U,V} (1/2)‖P_Ω(X − UV^T)‖_F² + (λ/2)(‖U‖_F² + ‖V‖_F²) = min_{U,V} (1/2)‖P_Ω(X − UV^T)‖_F² + λ‖UV^T‖_* (by Lemma 6) = min_Z (1/2)‖P_Ω(X − Z)‖_F² + λ‖Z‖_*.  (30) [sent-486, score-0.115]
47 Note that if we know that the solution Z* to (28) with λ = λ* has rank r*, then Z* also solves min_{Z: rank(Z)=r*} (1/2)‖P_Ω(X − Z)‖_F² + λ‖Z‖_*. [sent-489, score-0.211]
48 We now repeat the steps (30)—(31), restricting the rank r′ of U and V to be r′ = r∗ , and the result follows. [sent-490, score-0.158]
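For reference, the variational characterization of the nuclear norm invoked here as Lemma 6, and the equivalence (30)–(31) it yields, can be restated as below (our restatement; it assumes the factors U, V have at least as many columns as the rank of the matrices involved).

```latex
\|Z\|_* \;=\; \min_{U,V:\;UV^{T}=Z}\ \tfrac12\left(\|U\|_F^{2}+\|V\|_F^{2}\right),
\qquad\text{hence}\qquad
\min_{U,V}\ \tfrac12\big\|P_{\Omega}\!\left(X-UV^{T}\right)\big\|_F^{2}
+\tfrac{\lambda}{2}\left(\|U\|_F^{2}+\|V\|_F^{2}\right)
\;=\;\min_{Z}\ \tfrac12\big\|P_{\Omega}(X-Z)\big\|_F^{2}+\lambda\|Z\|_* .
```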
49 Figures 2, 3 and 4 show training and test error for all of the algorithms mentioned above—both as a function of nuclear norm and rank—for the three problem instances. [sent-521, score-0.279]
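For concreteness, one common way to score such fits is to compare against the complete underlying matrix separately on the observed and held-out entries; the standardization below is our convention and may differ in detail from the exact definition used in the paper's figures.

```python
import numpy as np

def training_error(X, mask, Z_hat):
    """Standardized error on the observed entries Omega."""
    return np.sum(np.where(mask, X - Z_hat, 0.0) ** 2) / np.sum(np.where(mask, X, 0.0) ** 2)

def test_error(Z_true, mask, Z_hat):
    """Standardized prediction error on the held-out (unobserved) entries."""
    hold = ~mask
    return np.sum((Z_true - Z_hat)[hold] ** 2) / np.sum(Z_true[hold] ** 2)
```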
50 The performance of MMMF is displayed only in the plots with the nuclear norm along the horizontal axis, since the algorithm does not deliver a precise rank. [sent-525, score-0.24]
51 SNR, true rank and percentage of missing entries are indicated in the figures. [sent-526, score-0.218]
52 There is a unique correspondence between λ and nuclear norm. [sent-527, score-0.182]
53 The plots versus rank indicate how effective the nuclear norm is as a rank approximation—that is whether it recovers the true rank while minimizing prediction error. [sent-528, score-0.736]
54 In Figure 2 the true rank is 10, while in Figure 3 it is 6. [sent-536, score-0.158]
55 In both figures the training error for SOFT-IMPUTE (and hence MMMF) wins as a function of nuclear norm (as it must, by construction), but the more aggressive fitters SOFT-IMPUTE+ and HARD-IMPUTE have better training error as a function of rank. [sent-541, score-0.34]
56 Though the nuclear norm is often viewed as a surrogate for the rank of a matrix, we see in these examples that it can provide a superior mechanism for regularization. [sent-542, score-0.398]
57 [Figure 2: 50% missing entries with SNR = 1, true rank = 10; panels show training and test error.] [sent-545, score-0.218]
58 Both SOFT-IMPUTE and SOFT-IMPUTE+ perform well (prediction error) in the presence of noise; the latter estimates the actual rank of the matrix. [sent-584, score-0.158]
59 MMMF (with full-rank 100 factor matrices) has performance similar to SOFT-IMPUTE. [sent-585, score-0.158]
60 SVT also has poor prediction error, confirming our claim in this example that criterion (4) can result in overfitting; it recovers a matrix with high nuclear norm and rank > 60 where the true rank is only 10. [sent-587, score-0.639]
61 [Figure 3: 50% missing entries with SNR = 1, true rank = 6; panels show training and test error.] [sent-590, score-0.218]
62 Both HARD-IMPUTE and OPTSPACE have poor prediction error apart from near the true rank 6 of the matrix, where they show reasonable performance. [sent-623, score-0.201]
63 SVT has very poor prediction error; it recovers a matrix with high nuclear norm and rank > 60, where the true rank is only 6. [sent-624, score-0.617]
64 [Figure 4: 80% missing entries with SNR = 10, true rank = 5; panels show training and test error.] [sent-626, score-0.218]
65 It gets the correct rank whereas OPTSPACE slightly overestimates the rank. [sent-663, score-0.158]
66 Although the noise is low here, SVT recovers a matrix with high rank (approximately 30) and has poor prediction error as well. [sent-666, score-0.24]
67 The true underlying rank is 5, but the proportion of missing entries is much higher at eighty percent. [sent-669, score-0.218]
68 Test errors of both SOFT-IMPUTE+ and SOFT-IMPUTE are found to decrease until a large value of the nuclear norm, after which they become roughly the same, suggesting no further impact of regularization. [sent-670, score-0.258]
69 MMMF has slightly better test error than SOFT-IMPUTE around a nuclear norm of 350, while in theory they should be identical. [sent-671, score-0.261]
70 OPTSPACE performs well in this high-SNR example, achieving a sharp minimum at the true rank of the matrix. [sent-674, score-0.158]
71 For MMMF the time to train the models increases with increasing rank r′; and if the underlying matrix has rank larger than r′, the computational cost of obtaining competitive predictive accuracy will be large. [sent-694, score-0.355]
72 For SOFT-IMPUTE, “rank” denotes the rank of the recovered matrix at the optimally chosen value of λ. [sent-747, score-0.158]
73 4375 with the recovered solution having a rank of 225. [sent-756, score-0.18]
74 Comparison with Nesterov’s Accelerated Gradient Method. Ji and Ye (2009) proposed a first-order algorithm based on Nesterov’s acceleration scheme (Nesterov, 2007) for nuclear norm minimization in a generic multi-task learning problem (Argyriou et al. [sent-768, score-0.24]
75 The horizontal axis corresponds to the standardized nuclear norm, with C = max_λ ‖Ẑ_λ‖_*. [sent-840, score-0.182]
76 Shown are the times till convergence for the two algorithms over an entire grid of λ values for examples i–iv (in the last the matrix dimensions are much larger). [sent-841, score-0.112]
77 [Table fragment: columns include recovered rank, time (mins), test error and training error; the recovered ranks listed are 40*, 40*, 50*, 5, 40*, 5, 5, 20*.] [sent-861, score-0.2]
78 Recovered rank is the rank of the solution matrix Ẑ at the value of λ used in (10). [sent-887, score-0.377]
79 We applied SOFT-IMPUTE to the training data matrix and then performed a least-squares unshrinking on the singular values with the singular vectors and the training data row and column means as the bases. [sent-909, score-0.195]
80 The solution of the above problem is given by S_λ(ũ_i′W ṽ_i) = (ũ_i′W ṽ_i − λ)_+, the soft-thresholding of ũ_i′W ṽ_i (without loss of generality, we can take ũ_i′W ṽ_i to be non-negative). [sent-970, score-0.29]
81 Substituting the values obtained from (34) into (33), for i = 1, …, n, we get Q(Ũ, Ṽ) = (1/2)(‖Z‖_F² − 2 ∑_{i=1}^n (ũ_i′W ṽ_i − λ)_+ (ũ_i′W ṽ_i − λ) + ∑_{i=1}^n (ũ_i′W ṽ_i − λ)_+²). [sent-979, score-0.183]
82 Minimizing Q(Ũ, Ṽ) is thus equivalent to maximizing ∑_{i=1}^n [2(ũ_i′W ṽ_i − λ)_+ (ũ_i′W ṽ_i − λ) − (ũ_i′W ṽ_i − λ)_+²] = ∑_{i=1}^n (ũ_i′W ṽ_i − λ)_+². [sent-983, score-0.244]
83 The maximization over ‖u‖_2 ≤ 1, ‖v‖_2 ≤ 1 (with u orthogonal to û_1, …, û_{i−1} and v orthogonal to v̂_1, …, v̂_{i−1}) is solved by û_i, v̂_i, the left and right singular vectors of the matrix W corresponding to its ith largest singular value. [sent-990, score-0.243]
84 If r(λ) denotes the number of singular values of W larger than λ, then the (ũ_i, ṽ_i), i = 1, … [sent-996, score-0.121]
85 …, ṽ_{r(λ)} are the r(λ) left and right singular vectors of W corresponding to the largest singular values. [sent-1005, score-0.12]
86 Since the rank of W is r, the minimizer Ẑ of (8) is given by U D_λ V′ as in (9). [sent-1013, score-0.158]
87 The solution of the resultant univariate minimization problem will be given by S_λ^p(ũ_i′W ṽ_i) for some generalized “thresholding operator” S_λ^p(·), where S_λ^p(ũ_i′W ṽ_i) = argmin_{d̃_i ≥ 0} (1/2)(−2 d̃_i ũ_i′W ṽ_i + d̃_i²) + λ p(d̃_i). [sent-1015, score-0.236]
88 The optimization problem analogous to (35) will be minimize_{Ũ,Ṽ} (1/2)(‖Z‖_F² − 2 ∑_{i=1}^n d̂_i ũ_i′W ṽ_i + ∑_{i=1}^n d̂_i²) + λ ∑_i p(d̂_i),  (37) where d̂_i = S_λ^p(ũ_i′W ṽ_i) for all i. [sent-1016, score-0.146]
89 The solutions ũ_i, ṽ_i will correspond to the first few largest left and right singular vectors of the matrix W. [sent-1018, score-0.144]
90 The optimal singular values will correspond to the relevant shrinkage/threshold operator S_λ^p(·) applied to the singular values of W. [sent-1019, score-0.12]
91 In particular, for the indicator function p(t) = λ·1(t > 0), the solution is given by the top few singular values (un-shrunk) and the corresponding singular vectors. [sent-1020, score-0.12]
92 The sub-gradients of the nuclear norm ‖Z‖_* are given by ∂‖Z‖_* = {UV′ + ω : ω ∈ R^{m×n}, U′ω = 0, ωV = 0, ‖ω‖_2 ≤ 1},  (39) where Z = UDV′ is the SVD of Z. [sent-1026, score-0.24]
93 Proof of Lemma 5. Proof: The sub-gradients of the nuclear norm ‖Z‖_* are given by ∂‖Z‖_* = {UV′ + W : W ∈ R^{m×n}, U′W = 0, WV = 0, ‖W‖_2 ≤ 1},  (48) where Z = UDV′ is the SVD of Z. [sent-1054, score-0.24]
94 Proof: Let Z_{m×n} = L_{m×r} Σ_{r×r} R^T denote the SVD of Z, where r is the rank of the matrix Z. [sent-1083, score-0.197]
95 Suppose Z is of rank k ≤ min(m, n), and denote its SVD by Z_{m×n} = L̃_{m×k} Σ_{k×k} R̃^T. [sent-1098, score-0.158]
96 Also it is easily seen that the minimum cannot be attained for any r < k; hence the minimal rank r for which (29) holds true is r = k. [sent-1102, score-0.158]
97 A singular value thresholding algorithm for matrix completion, 2008. [sent-1182, score-0.123]
98 Interior-point method for nuclear norm approximation with application to system identfication. [sent-1284, score-0.24]
99 Fixed point and Bregman iterative methods for matrix rank minimization. [sent-1290, score-0.197]
100 Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization, 2007. [sent-1311, score-0.307]
wordName wordTfidf (topN-words)
[('mpute', 0.697), ('mmmf', 0.374), ('svd', 0.19), ('nuclear', 0.182), ('rank', 0.158), ('astie', 0.132), ('azumder', 0.132), ('ibshirani', 0.132), ('svt', 0.125), ('ompletion', 0.107), ('pectral', 0.096), ('oft', 0.095), ('mp', 0.092), ('ard', 0.083), ('pace', 0.078), ('uv', 0.078), ('srebro', 0.076), ('trace', 0.072), ('atrix', 0.07), ('egularization', 0.067), ('nesterov', 0.065), ('cand', 0.063), ('snr', 0.063), ('impute', 0.062), ('vi', 0.061), ('singular', 0.06), ('pt', 0.058), ('rennie', 0.058), ('norm', 0.058), ('esterov', 0.056), ('ix', 0.055), ('cai', 0.054), ('um', 0.052), ('ye', 0.049), ('mazumder', 0.044), ('propack', 0.044), ('keshavan', 0.044), ('net', 0.042), ('completion', 0.04), ('movies', 0.04), ('matrix', 0.039), ('warm', 0.038), ('entries', 0.037), ('ofti', 0.037), ('ud', 0.037), ('ratings', 0.036), ('lemma', 0.035), ('grid', 0.035), ('ji', 0.035), ('recht', 0.034), ('min', 0.031), ('stanford', 0.031), ('spectral', 0.03), ('old', 0.03), ('lrt', 0.029), ('zi', 0.029), ('regularization', 0.029), ('solutions', 0.028), ('rmse', 0.026), ('tao', 0.026), ('hastie', 0.025), ('genre', 0.025), ('accelerated', 0.025), ('minimize', 0.024), ('thresholding', 0.024), ('proof', 0.024), ('ui', 0.023), ('lasso', 0.023), ('missing', 0.023), ('subsequence', 0.023), ('viewer', 0.023), ('collaborative', 0.022), ('burer', 0.022), ('llt', 0.022), ('lm', 0.022), ('rrt', 0.022), ('udv', 0.022), ('zm', 0.022), ('solution', 0.022), ('criterion', 0.022), ('prediction', 0.022), ('aggressive', 0.022), ('matlab', 0.022), ('sequence', 0.021), ('fazel', 0.021), ('abernethy', 0.021), ('error', 0.021), ('penalty', 0.021), ('convex', 0.02), ('convergence', 0.02), ('timing', 0.019), ('probe', 0.019), ('shrinkage', 0.019), ('secs', 0.019), ('viewers', 0.019), ('matrices', 0.019), ('nk', 0.018), ('till', 0.018), ('training', 0.018), ('diag', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 105 jmlr-2010-Spectral Regularization Algorithms for Learning Large Incomplete Matrices
2 0.1695682 72 jmlr-2010-Matrix Completion from Noisy Entries
Author: Raghunandan H. Keshavan, Andrea Montanari, Sewoong Oh
Abstract: Given a matrix M of low-rank, we consider the problem of reconstructing it from noisy observations of a small, random subset of its entries. The problem arises in a variety of applications, from collaborative filtering (the ‘Netflix problem’) to structure-from-motion and positioning. We study a low complexity algorithm introduced by Keshavan, Montanari, and Oh (2010), based on a combination of spectral techniques and manifold optimization, that we call here O PT S PACE. We prove performance guarantees that are order-optimal in a number of circumstances. Keywords: matrix completion, low-rank matrices, spectral methods, manifold optimization
3 0.10165539 84 jmlr-2010-On Spectral Learning
Author: Andreas Argyriou, Charles A. Micchelli, Massimiliano Pontil
Abstract: In this paper, we study the problem of learning a matrix W from a set of linear measurements. Our formulation consists in solving an optimization problem which involves regularization with a spectral penalty term. That is, the penalty term is a function of the spectrum of the covariance of W . Instances of this problem in machine learning include multi-task learning, collaborative filtering and multi-view learning, among others. Our goal is to elucidate the form of the optimal solution of spectral learning. The theory of spectral learning relies on the von Neumann characterization of orthogonally invariant norms and their association with symmetric gauge functions. Using this tool we formulate a representer theorem for spectral regularization and specify it to several useful example, such as Schatten p−norms, trace norm and spectral norm, which should proved useful in applications. Keywords: kernel methods, matrix learning, minimal norm interpolation, multi-task learning, orthogonally invariant norms, regularization
4 0.049576554 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming
Author: Ming Yuan
Abstract: This paper considers the problem of estimating a high dimensional inverse covariance matrix that can be well approximated by “sparse” matrices. Taking advantage of the connection between multivariate linear regression and entries of the inverse covariance matrix, we propose an estimating procedure that can effectively exploit such “sparsity”. The proposed method can be computed using linear programming and therefore has the potential to be used in very high dimensional problems. Oracle inequalities are established for the estimation error in terms of several operator norms, showing that the method is adaptive to different types of sparsity of the problem. Keywords: covariance selection, Dantzig selector, Gaussian graphical model, inverse covariance matrix, Lasso, linear programming, oracle inequality, sparsity
5 0.048641395 12 jmlr-2010-Analysis of Multi-stage Convex Relaxation for Sparse Regularization
Author: Tong Zhang
Abstract: We consider learning formulations with non-convex objective functions that often occur in practical applications. There are two approaches to this problem: • Heuristic methods such as gradient descent that only find a local minimum. A drawback of this approach is the lack of theoretical guarantee showing that the local minimum gives a good solution. • Convex relaxation such as L1 -regularization that solves the problem under some conditions. However it often leads to a sub-optimal solution in reality. This paper tries to remedy the above gap between theory and practice. In particular, we present a multi-stage convex relaxation scheme for solving problems with non-convex objective functions. For learning formulations with sparse regularization, we analyze the behavior of a specific multistage relaxation scheme. Under appropriate conditions, we show that the local solution obtained by this procedure is superior to the global solution of the standard L1 convex relaxation for learning sparse targets. Keywords: sparsity, non-convex optimization, convex relaxation, multi-stage convex relaxation
6 0.043108333 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes
8 0.039402798 3 jmlr-2010-A Fast Hybrid Algorithm for Large-Scalel1-Regularized Logistic Regression
9 0.039382782 92 jmlr-2010-Practical Approaches to Principal Component Analysis in the Presence of Missing Values
10 0.038384907 87 jmlr-2010-Online Learning for Matrix Factorization and Sparse Coding
11 0.038299158 98 jmlr-2010-Regularized Discriminant Analysis, Ridge Regression and Beyond
12 0.030291667 99 jmlr-2010-Restricted Eigenvalue Properties for Correlated Gaussian Designs
13 0.030115254 82 jmlr-2010-On Learning with Integral Operators
14 0.029753778 101 jmlr-2010-Second-Order Bilinear Discriminant Analysis
15 0.028614908 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions
16 0.028096674 43 jmlr-2010-Generalized Power Method for Sparse Principal Component Analysis
17 0.02757882 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide
18 0.027033698 26 jmlr-2010-Consensus-Based Distributed Support Vector Machines
19 0.026718121 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond
20 0.026116638 83 jmlr-2010-On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation
topicId topicWeight
[(0, -0.145), (1, -0.09), (2, 0.091), (3, -0.009), (4, -0.09), (5, -0.043), (6, 0.022), (7, -0.097), (8, 0.027), (9, -0.086), (10, -0.084), (11, 0.121), (12, 0.032), (13, 0.046), (14, 0.084), (15, -0.108), (16, 0.06), (17, -0.1), (18, -0.143), (19, 0.089), (20, -0.141), (21, -0.019), (22, 0.292), (23, 0.062), (24, 0.231), (25, -0.319), (26, -0.049), (27, 0.045), (28, 0.009), (29, -0.049), (30, 0.031), (31, -0.005), (32, -0.065), (33, 0.036), (34, 0.156), (35, 0.058), (36, -0.177), (37, 0.013), (38, 0.094), (39, 0.13), (40, -0.021), (41, -0.126), (42, 0.01), (43, 0.06), (44, -0.075), (45, 0.019), (46, 0.04), (47, 0.015), (48, 0.031), (49, -0.026)]
simIndex simValue paperId paperTitle
same-paper 1 0.92986232 105 jmlr-2010-Spectral Regularization Algorithms for Learning Large Incomplete Matrices
2 0.76401567 72 jmlr-2010-Matrix Completion from Noisy Entries
Author: Raghunandan H. Keshavan, Andrea Montanari, Sewoong Oh
Abstract: Given a matrix M of low-rank, we consider the problem of reconstructing it from noisy observations of a small, random subset of its entries. The problem arises in a variety of applications, from collaborative filtering (the ‘Netflix problem’) to structure-from-motion and positioning. We study a low complexity algorithm introduced by Keshavan, Montanari, and Oh (2010), based on a combination of spectral techniques and manifold optimization, that we call here O PT S PACE. We prove performance guarantees that are order-optimal in a number of circumstances. Keywords: matrix completion, low-rank matrices, spectral methods, manifold optimization
3 0.59428716 84 jmlr-2010-On Spectral Learning
Author: Andreas Argyriou, Charles A. Micchelli, Massimiliano Pontil
Abstract: In this paper, we study the problem of learning a matrix W from a set of linear measurements. Our formulation consists in solving an optimization problem which involves regularization with a spectral penalty term. That is, the penalty term is a function of the spectrum of the covariance of W . Instances of this problem in machine learning include multi-task learning, collaborative filtering and multi-view learning, among others. Our goal is to elucidate the form of the optimal solution of spectral learning. The theory of spectral learning relies on the von Neumann characterization of orthogonally invariant norms and their association with symmetric gauge functions. Using this tool we formulate a representer theorem for spectral regularization and specify it to several useful example, such as Schatten p−norms, trace norm and spectral norm, which should proved useful in applications. Keywords: kernel methods, matrix learning, minimal norm interpolation, multi-task learning, orthogonally invariant norms, regularization
4 0.2395833 98 jmlr-2010-Regularized Discriminant Analysis, Ridge Regression and Beyond
Author: Zhihua Zhang, Guang Dai, Congfu Xu, Michael I. Jordan
Abstract: Fisher linear discriminant analysis (FDA) and its kernel extension—kernel discriminant analysis (KDA)—are well known methods that consider dimensionality reduction and classification jointly. While widely deployed in practical problems, there are still unresolved issues surrounding their efficient implementation and their relationship with least mean squares procedures. In this paper we address these issues within the framework of regularized estimation. Our approach leads to a flexible and efficient implementation of FDA as well as KDA. We also uncover a general relationship between regularized discriminant analysis and ridge regression. This relationship yields variations on conventional FDA based on the pseudoinverse and a direct equivalence to an ordinary least squares estimator. Keywords: Fisher discriminant analysis, reproducing kernel, generalized eigenproblems, ridge regression, singular value decomposition, eigenvalue decomposition
5 0.22845717 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming
Author: Ming Yuan
Abstract: This paper considers the problem of estimating a high dimensional inverse covariance matrix that can be well approximated by “sparse” matrices. Taking advantage of the connection between multivariate linear regression and entries of the inverse covariance matrix, we propose an estimating procedure that can effectively exploit such “sparsity”. The proposed method can be computed using linear programming and therefore has the potential to be used in very high dimensional problems. Oracle inequalities are established for the estimation error in terms of several operator norms, showing that the method is adaptive to different types of sparsity of the problem. Keywords: covariance selection, Dantzig selector, Gaussian graphical model, inverse covariance matrix, Lasso, linear programming, oracle inequality, sparsity
6 0.21404552 92 jmlr-2010-Practical Approaches to Principal Component Analysis in the Presence of Missing Values
7 0.20576225 12 jmlr-2010-Analysis of Multi-stage Convex Relaxation for Sparse Regularization
8 0.19527143 87 jmlr-2010-Online Learning for Matrix Factorization and Sparse Coding
10 0.19103289 101 jmlr-2010-Second-Order Bilinear Discriminant Analysis
11 0.18882996 43 jmlr-2010-Generalized Power Method for Sparse Principal Component Analysis
12 0.18876368 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data
13 0.18422934 3 jmlr-2010-A Fast Hybrid Algorithm for Large-Scalel1-Regularized Logistic Regression
14 0.18067835 26 jmlr-2010-Consensus-Based Distributed Support Vector Machines
15 0.17687175 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization
16 0.17169501 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes
17 0.15184423 65 jmlr-2010-Learning Translation Invariant Kernels for Classification
18 0.1476799 10 jmlr-2010-An Exponential Model for Infinite Rankings
19 0.14219593 97 jmlr-2010-Regret Bounds and Minimax Policies under Partial Monitoring
20 0.14156967 38 jmlr-2010-Expectation Truncation and the Benefits of Preselection In Training Generative Models
topicId topicWeight
[(3, 0.107), (4, 0.012), (5, 0.013), (8, 0.027), (15, 0.018), (21, 0.02), (32, 0.081), (33, 0.012), (36, 0.026), (37, 0.049), (51, 0.031), (60, 0.28), (75, 0.143), (81, 0.019), (85, 0.054), (96, 0.016)]
simIndex simValue paperId paperTitle
same-paper 1 0.70482373 105 jmlr-2010-Spectral Regularization Algorithms for Learning Large Incomplete Matrices
2 0.57929802 72 jmlr-2010-Matrix Completion from Noisy Entries
Author: Raghunandan H. Keshavan, Andrea Montanari, Sewoong Oh
Abstract: Given a matrix M of low-rank, we consider the problem of reconstructing it from noisy observations of a small, random subset of its entries. The problem arises in a variety of applications, from collaborative filtering (the ‘Netflix problem’) to structure-from-motion and positioning. We study a low complexity algorithm introduced by Keshavan, Montanari, and Oh (2010), based on a combination of spectral techniques and manifold optimization, that we call here O PT S PACE. We prove performance guarantees that are order-optimal in a number of circumstances. Keywords: matrix completion, low-rank matrices, spectral methods, manifold optimization
3 0.56197131 47 jmlr-2010-Hilbert Space Embeddings and Metrics on Probability Measures
Author: Bharath K. Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Schölkopf, Gert R.G. Lanckriet
Abstract: A Hilbert space embedding for probability measures has recently been proposed, with applications including dimensionality reduction, homogeneity testing, and independence testing. This embedding represents any probability measure as a mean element in a reproducing kernel Hilbert space (RKHS). A pseudometric on the space of probability measures can be defined as the distance between distribution embeddings: we denote this as γk, indexed by the kernel function k that defines the inner product in the RKHS. We present three theoretical properties of γk. First, we consider the question of determining the conditions on the kernel k for which γk is a metric: such k are denoted characteristic kernels. Unlike pseudometrics, a metric is zero only when two distributions coincide, thus ensuring the RKHS embedding maps all distributions uniquely (i.e., the embedding is injective). While previously published conditions may apply only in restricted circumstances (e.g., on compact domains), and are difficult to check, our conditions are straightforward and intuitive: integrally strictly positive definite kernels are characteristic. Alternatively, if a bounded continuous kernel is translation-invariant on R^d, then it is characteristic if and only if the support of its Fourier transform is the entire R^d. Second, we show that the distance between distributions under γk results from an interplay between the properties of the kernel and the distributions, by demonstrating that distributions are close in the embedding space when their differences occur at higher frequencies. Third, to understand the nature of the topology induced by γk, we relate γk to other popular metrics on probability measures, and present conditions on the kernel k und
4 0.5536111 84 jmlr-2010-On Spectral Learning
Author: Andreas Argyriou, Charles A. Micchelli, Massimiliano Pontil
Abstract: In this paper, we study the problem of learning a matrix W from a set of linear measurements. Our formulation consists in solving an optimization problem which involves regularization with a spectral penalty term. That is, the penalty term is a function of the spectrum of the covariance of W . Instances of this problem in machine learning include multi-task learning, collaborative filtering and multi-view learning, among others. Our goal is to elucidate the form of the optimal solution of spectral learning. The theory of spectral learning relies on the von Neumann characterization of orthogonally invariant norms and their association with symmetric gauge functions. Using this tool we formulate a representer theorem for spectral regularization and specify it to several useful example, such as Schatten p−norms, trace norm and spectral norm, which should proved useful in applications. Keywords: kernel methods, matrix learning, minimal norm interpolation, multi-task learning, orthogonally invariant norms, regularization
5 0.5510236 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming
Author: Ming Yuan
Abstract: This paper considers the problem of estimating a high dimensional inverse covariance matrix that can be well approximated by “sparse” matrices. Taking advantage of the connection between multivariate linear regression and entries of the inverse covariance matrix, we propose an estimating procedure that can effectively exploit such “sparsity”. The proposed method can be computed using linear programming and therefore has the potential to be used in very high dimensional problems. Oracle inequalities are established for the estimation error in terms of several operator norms, showing that the method is adaptive to different types of sparsity of the problem. Keywords: covariance selection, Dantzig selector, Gaussian graphical model, inverse covariance matrix, Lasso, linear programming, oracle inequality, sparsity
6 0.53525126 2 jmlr-2010-A Convergent Online Single Time Scale Actor Critic Algorithm
7 0.53403181 82 jmlr-2010-On Learning with Integral Operators
8 0.53398424 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond
9 0.53305221 106 jmlr-2010-Stability Bounds for Stationary φ-mixing and β-mixing Processes
10 0.53219688 97 jmlr-2010-Regret Bounds and Minimax Policies under Partial Monitoring
11 0.53113276 20 jmlr-2010-Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary β-Mixing Processes
12 0.53045702 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning
13 0.52897191 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions
14 0.52429211 102 jmlr-2010-Semi-Supervised Novelty Detection
15 0.52041155 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels
16 0.52033234 60 jmlr-2010-Learnability, Stability and Uniform Convergence
17 0.51987886 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization
18 0.51610565 69 jmlr-2010-Lp-Nested Symmetric Distributions
19 0.51441205 27 jmlr-2010-Consistent Nonparametric Tests of Independence
20 0.5144062 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data