jmlr jmlr2010 jmlr2010-72 knowledge-graph by maker-knowledge-mining

72 jmlr-2010-Matrix Completion from Noisy Entries


Source: pdf

Author: Raghunandan H. Keshavan, Andrea Montanari, Sewoong Oh

Abstract: Given a matrix M of low-rank, we consider the problem of reconstructing it from noisy observations of a small, random subset of its entries. The problem arises in a variety of applications, from collaborative filtering (the ‘Netflix problem’) to structure-from-motion and positioning. We study a low complexity algorithm introduced by Keshavan, Montanari, and Oh (2010), based on a combination of spectral techniques and manifold optimization, that we call here O PT S PACE. We prove performance guarantees that are order-optimal in a number of circumstances. Keywords: matrix completion, low-rank matrices, spectral methods, manifold optimization

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Keywords: matrix completion, low-rank matrices, spectral methods, manifold optimization 1. [sent-9, score-0.145]

2 Given a matrix M, its largest singular values—and the associated singular vectors—‘explain’ the most significant correlations in the underlying data source. [sent-11, score-0.282]

3 In many practical circumstances we have access only to a sparse subset of the entries of an m × n matrix M. [sent-14, score-0.211]

4 It has recently been discovered that, if the matrix M has rank r, and unless it is too ‘structured’, a small random subset of its entries allows one to reconstruct it exactly. [sent-15, score-0.303]

5 O PT S PACE is intrinsically of low complexity, the most complex operation being computing r singular values (and the corresponding singular vectors) of a sparse m × n matrix. [sent-22, score-0.212]

6 The guarantees of Keshavan et al. (2010) are comparable with the information theoretic lower bound: roughly nr max{r, log n} random entries are needed to reconstruct M exactly (here we assume m of order n). [sent-24, score-0.296]

7 (Model Definition) Let M be an m × n matrix of rank r, that is, M = UΣV^T. [sent-37, score-0.162]

8 We assume that each entry of M is perturbed, thus producing an ‘approximately’ low-rank matrix N, with N_ij = M_ij + Z_ij, where the matrix Z will be assumed to be ‘small’ in an appropriate sense. [sent-39, score-0.178]

9 Out of the m × n entries of N, a subset E ⊆ [m] × [n] is revealed. [sent-40, score-0.141]

10 We let N^E be the m × n matrix that contains the revealed entries of N and is filled with 0’s in the other positions: N^E_ij = N_ij if (i, j) ∈ E, and 0 otherwise. [sent-41, score-0.319]

11 Analogously, we let M^E and Z^E be the m × n matrices that contain the entries of M and Z, respectively, in the revealed positions and are filled with 0’s in the other positions. [sent-42, score-0.289]
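
A minimal NumPy sketch of this zero-filling convention; the sizes, noise level, and sampling rate below are illustrative choices, not values from the paper:

    import numpy as np

    def zero_fill(A, mask):
        # A^E: keep the revealed entries of A, put 0 in the other positions
        return np.where(mask, A, 0.0)

    rng = np.random.default_rng(0)
    m, n, r = 30, 40, 2
    M = rng.normal(size=(m, r)) @ rng.normal(size=(r, n))  # rank-r signal
    Z = 0.1 * rng.normal(size=(m, n))                       # perturbation
    N = M + Z
    mask = rng.random((m, n)) < 0.3                         # revealed set E
    NE, ME, ZE = zero_fill(N, mask), zero_fill(M, mask), zero_fill(Z, mask)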

12 The basic idea is to minimize the cost function F(X, Y), defined by F(X, Y) ≡ min_{S ∈ R^{r×r}} F(X, Y, S), where F(X, Y, S) ≡ (1/2) Σ_{(i,j)∈E} ( N_ij − (XSY^T)_ij )² (Eq. 2). [sent-47, score-0.139]
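
The inner minimization over S is an ordinary least squares problem over the revealed entries; a sketch is below (the helper name and the dense formulation are illustrative assumptions, not the paper's implementation):

    import numpy as np

    def F_cost(X, Y, N, mask):
        # F(X, Y) = min_S 0.5 * sum_{(i,j) in E} (N_ij - (X S Y^T)_ij)^2
        ii, jj = np.nonzero(mask)
        # each observed entry gives one linear equation in vec(S)
        A = np.einsum('ka,kb->kab', X[ii], Y[jj]).reshape(len(ii), -1)
        b = N[ii, jj]
        s, *_ = np.linalg.lstsq(A, b, rcond=None)
        S = s.reshape(X.shape[1], Y.shape[1])
        residual = b - A @ s
        return 0.5 * np.sum(residual ** 2), S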

13 The key insight is that the singular value decomposition (SVD) of N E provides an excellent initial guess, and that the minimum can be found with high probability by standard gradient descent after this initialization. [sent-50, score-0.188]

14 We may note here that the rank of the matrix M, if not known, can be reliably estimated from N E (Keshavan and Oh, 2009). [sent-53, score-0.162]

15 We say that a row is ‘over-represented’ if it contains more than 2|E|/m revealed entries. [sent-56, score-0.249]

16 (That is, it has more than twice the average number of revealed entries per row.) [sent-58, score-0.249]

17 The trimmed matrix Ñ^E is obtained from N^E by setting to 0 the over-represented rows and columns. [sent-60, score-0.228]
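
A sketch of the trimming step as just described (the function name is illustrative):

    import numpy as np

    def trim(NE, mask):
        # zero out rows/columns with more than twice the average number of revealed entries
        m, n = mask.shape
        num_revealed = mask.sum()
        out = NE.copy()
        out[mask.sum(axis=1) > 2 * num_revealed / m, :] = 0.0
        out[:, mask.sum(axis=0) > 2 * num_revealed / n] = 0.0
        return out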

18 Let Ñ^E = Σ_{i=1}^{min(m,n)} σ_i x_i y_i^T be the singular value decomposition of Ñ^E, with singular values σ_1 ≥ σ_2 ≥ ... [sent-62, score-0.212]

19 We then define P_r(Ñ^E) = (mn/|E|) Σ_{i=1}^r σ_i x_i y_i^T. [sent-66, score-0.218]
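
A sketch of this rank-r projection with the mn/|E| rescaling, applied to a trimmed matrix produced by a step like the one above (names are illustrative):

    import numpy as np

    def rank_r_projection(NE_trimmed, mask, r):
        # P_r: rescale by mn/|E| and keep the top-r part of the SVD
        m, n = NE_trimmed.shape
        U, s, Vt = np.linalg.svd(NE_trimmed, full_matrices=False)
        return (m * n / mask.sum()) * (U[:, :r] * s[:r]) @ Vt[:r, :]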

20 For further details on optimization by gradient descent on matrix manifolds we refer to Edelman et al. [sent-78, score-0.152]

21 Probability is taken with respect to the uniformly random subset E ⊆ [m] × [n] given |E| and (eventually) the noise matrix Z. [sent-96, score-0.162]

22 In the case when m = n, ε corresponds to the average number of revealed entries per row or column. [sent-98, score-0.249]

23 Then it is convenient to work with a model in which each entry is revealed independently with probability ε/√(mn). [sent-99, score-0.146]

24 We define Z̃^E to be the m × n matrix obtained from Z^E after the trimming step of the pseudocode above, that is, by setting to zero the over-represented rows and columns. [sent-125, score-0.257]

25 (Theorem 1.1) Let N = M + Z, where M has rank r, and assume that the subset of revealed entries E ⊆ [m] × [n] is uniformly random with size |E|. [sent-127, score-0.37]

26 Then there exist numerical constants C and C′ such that (1/√(mn)) ||M − P_r(Ñ^E)||_F ≤ C M_max ( nrα^{3/2} / |E| )^{1/2} + C′ ( n√(rα) / |E| ) ||Z^E||_2, with probability larger than 1 − 1/n^3. [sent-129, score-0.29]

27 At a high-level, projection onto rank-r matrices can be interpreted as ‘treat missing entries as zeros’. [sent-131, score-0.211]

28 This theorem shows that this approach is reasonably robust if the number of observed entries is as large as the number of degrees of freedom (which is about (m + n)r) times a large constant. [sent-132, score-0.189]

29 Let us stress that trimming is crucial for achieving this guarantee. [sent-134, score-0.184]

30 (Theorem 1.2) Let N = M + Z, where M is a (µ_0, µ_1)-incoherent matrix of rank r, and assume that the subset of revealed entries E ⊆ [m] × [n] is uniformly random with size |E|. [sent-146, score-0.44]

31 Then there exist numerical constants C and C′ such that if |E| ≥ C n√α κ² max{ µ_0 √(αr) log n ; µ_0² r² α κ⁴ ; µ_1² r² α κ⁴ }, then, with probability at least 1 − 1/n^3, (1/√(mn)) ||M − M̂||_F ≤ C′ κ ( n√(rα) / |E| ) ||Z^E||_2, provided that the right-hand side is smaller than Σ_min. [sent-149, score-0.431]

32 (Noise Models) In order to make sense of the above results, it is convenient to consider a couple of simple models for the noise matrix Z; the first is the independent entries model. [sent-153, score-0.274]

33 (Theorem 1.3) If Z is a random matrix drawn according to the independent entries model, then for any sample size |E| there is a constant C such that ||Z^E||_2 ≤ Cσ ( |E| log n / n )^{1/2}, (4) with probability at least 1 − 1/n^3. [sent-163, score-0.257]
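
A quick numerical sanity check of the scaling in Eq. (4), using Gaussian noise as an illustrative instance of the independent entries model (the printed quantities only compare orders of magnitude; the constant C of the theorem is not reproduced here):

    import numpy as np

    rng = np.random.default_rng(1)
    n, sigma, p = 500, 1.0, 0.1
    Z = sigma * rng.normal(size=(n, n))
    mask = rng.random((n, n)) < p
    ZE = np.where(mask, Z, 0.0)
    spectral_norm = np.linalg.norm(ZE, 2)                # largest singular value of Z^E
    scale = sigma * np.sqrt(mask.sum() * np.log(n) / n)  # right-hand side of Eq. (4) with C = 1
    print(spectral_norm, scale)                          # same order of magnitude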

34 Then, among other things, this result implies that for the independent entries model the right-hand side of our error estimate, Eq. [sent-169, score-0.141]

35 Our result however applies to any number of revealed entries, while the one of Achlioptas and McSherry (2007) requires |E| ≥ (8 log n)^4 n (which for n ≤ 5 · 10^8 is larger than n²). [sent-175, score-0.154]

36 Figures 1 and 2 compare the average root mean square error ||M̂ − M||_F/√(mn) for the two algorithms as a function of |E| and of the rank r, respectively. [sent-181, score-0.294]

37 Here M is a random rank-r matrix of dimension m = n = 600, generated by letting M = U V^T with U_ij, V_ij i.i.d. [sent-182, score-0.162]
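
A sketch of such a synthetic experiment, using the rank-r projection as the reconstruction; the standard normal factors and the noise level are illustrative stand-ins for the exact settings behind the figures:

    import numpy as np

    rng = np.random.default_rng(2)
    n, r = 600, 4
    M = rng.normal(size=(n, r)) @ rng.normal(size=(r, n))   # random rank-r matrix M = U V^T
    mask = rng.random((n, n)) < 120.0 / n                    # about |E|/n = 120 revealed entries per row
    NE = np.where(mask, M + 0.5 * rng.normal(size=(n, n)), 0.0)
    # rank-r projection of N^E (trimming omitted: at this density no row/column is over-represented)
    U, s, Vt = np.linalg.svd(NE, full_matrices=False)
    M_hat = (n * n / mask.sum()) * (U[:, :r] * s[:r]) @ Vt[:r, :]
    rmse = np.linalg.norm(M_hat - M, 'fro') / n              # (1/sqrt(mn)) ||M_hat - M||_F with m = n
    print(rmse)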

38 These examples are taken from Candès and Plan (2009, Figure 2). [Figure 1 legend: convex relaxation, lower bound, rank-r projection, and OptSpace after 1, 2, 3, and 10 iterations; y-axis: RMSE.] [sent-189, score-0.156]

39 Root mean square error achieved by O PT S PACE is shown as a function of the number of observed entries |E| and of the number of line minimizations. [sent-194, score-0.174]

40 The performance of nuclear norm minimization and an information theoretic lower bound are also shown. [sent-195, score-0.247]

41 Figure 2 (x-axis: rank, 1 to 10): Numerical simulation with random rank-r 600 × 600 matrices and number of observed entries |E|/n = 120. [sent-200, score-0.181]

42 The performance of nuclear norm minimization and an information theoretic lower bound are also shown. [sent-202, score-0.247]

43 Figure 3 (x-axis: iterations, 0 to 50): Numerical simulation with random rank-2 600 × 600 matrices and number of observed entries |E|/n = 80 and 160. [sent-207, score-0.181]

44 From Candès and Plan (2009, Figure 2) we took the data points for the convex relaxation approach, as well as the information theoretic lower bound described later in this section. [sent-215, score-0.174]

45 In about 10 iterations it becomes indistinguishable from the information theoretic lower bound for small ranks. [sent-217, score-0.159]

46 Two metrics, the root mean squared error (RMSE) and the fit error ||P_E(M̂ − N)||_F/√|E|, are shown as functions of the number of iterations in the manifold optimization step. [sent-219, score-0.16]

47 For a more complete numerical comparison between various algorithms for matrix completion, including different noise models, real data sets and ill conditioned matrices, we refer to Keshavan and Oh (2009). [sent-230, score-0.169]

48 As far as the error bound is concerned, Candès and Plan (2009) proved that the semidefinite programming approach returns an estimate M_SDP which satisfies (1/√(mn)) ||M_SDP − M||_F ≤ 7 √(n/|E|) ||Z^E||_F + (2/(n√α)) ||Z^E||_F. (6) [sent-234, score-0.269]

49 For instance, within the independent entries model with bounded variance σ, ||Z^E||_F = Θ(√|E|) while ||Z^E||_2 is of order √(|E|/n) (up to logarithmic terms). [sent-241, score-0.141]

50 Suppose, for simplicity, m = n and assume that an oracle provides us a linear subspace T where the correct rank r matrix M = UΣV T lies. [sent-244, score-0.162]

51 The minimum mean square error estimator is computed by projecting the revealed entries onto the subspace T , which can be done by solving a least squares problem. [sent-247, score-0.282]
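
A sketch of this oracle least squares projection, parametrizing T as {U A^T + B V^T} with the true factors U, V handed to us by the oracle; the dense formulation and all names are purely illustrative, not meant to be efficient:

    import numpy as np

    def oracle_estimate(U, V, N, mask):
        # least squares over T = {U A^T + B V^T}, fitted on the revealed entries only
        m, r = U.shape
        n = V.shape[0]
        ii, jj = np.nonzero(mask)
        DA = np.zeros((len(ii), n * r))   # coefficients of vec(A)
        DB = np.zeros((len(ii), m * r))   # coefficients of vec(B)
        for k, (i, j) in enumerate(zip(ii, jj)):
            DA[k, j * r:(j + 1) * r] = U[i]   # (U A^T)_ij = <U[i], A[j]>
            DB[k, i * r:(i + 1) * r] = V[j]   # (B V^T)_ij = <B[i], V[j]>
        theta, *_ = np.linalg.lstsq(np.hstack([DA, DB]), N[ii, jj], rcond=None)
        A = theta[:n * r].reshape(n, r)
        B = theta[n * r:].reshape(m, r)
        return U @ A.T + B @ V.T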

52 Candès and Plan (2009) analyzed the root mean squared error (1/√(mn)) ||M_Oracle − M||_F of the resulting estimator M_Oracle, expressing it in terms of ||Z^E||_F and the sample size |E|. [sent-248, score-0.261]

53 In this case the oracle estimator yields (for r = o(n)) (1/√(mn)) ||M_Oracle − M||_F ≈ σ √( 2nr / |E| ). [sent-254, score-0.218]

54 Plugging ||Z^E||_F ≈ σ√|E| into the bound (6) (with m = n), the semidefinite programming approach yields (1/√(mn)) ||M_SDP − M||_F ≲ σ ( 7√n + 2√|E|/n ). [sent-255, score-0.269]

55 Using Theorem 1.3, we deduce that O PT S PACE achieves (1/√(mn)) ||M_OptSpace − M||_F ≤ C σ √( nr / |E| ). [sent-258, score-0.261]

56 An objective function analogous to the one used in the present paper was considered early on in Srebro and Jaakkola (2003), which uses gradient descent in the factors to minimize a weighted sum of square residuals. [sent-270, score-0.146]

57 The basic relation is provided by the identity ||M||_* = (1/2) min_{M = X Y^T} { ||X||_F² + ||Y||_F² }, (7) where ||M||_* denotes the nuclear norm of M (the sum of its singular values). [sent-283, score-0.375]
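
The identity (7) is easy to verify numerically: the minimum is attained by splitting the SVD symmetrically, X = U√Σ and Y = V√Σ (a standard fact, checked here purely as an illustration):

    import numpy as np

    rng = np.random.default_rng(3)
    M = rng.normal(size=(8, 5))
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    X = U * np.sqrt(s)            # U @ diag(sqrt(s))
    Y = Vt.T * np.sqrt(s)         # V @ diag(sqrt(s))
    assert np.allclose(X @ Y.T, M)
    value = 0.5 * (np.linalg.norm(X, 'fro') ** 2 + np.linalg.norm(Y, 'fro') ** 2)
    assert np.isclose(value, s.sum())   # equals ||M||_*, the sum of the singular values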

58 8 On the Spectrum of Sparse Matrices and the Role of Trimming The trimming step of the O PT S PACE algorithm is somewhat counter-intuitive in that we seem to be wasting information. [sent-292, score-0.152]

59 One might for instance rescale the entries of such rows/columns. [sent-295, score-0.141]

60 We stick to trimming because we can prove it actually works. [sent-296, score-0.152]

61 Assume, for the sake of simplicity, that m = n, there is no noise in the revealed entries, and M is the rank one matrix with M_ij = 1 for all i and j. [sent-298, score-0.333]

62 The number of non-zero entries in a column is Binomial(n, ε/n) and is independent for different columns. [sent-303, score-0.141]

63 It is not hard to realize that the column with the largest number of entries has more than C log n/ log log n entries, with positive probability (this probability can be made as large as we want by reducing C). [sent-304, score-0.279]

64 By computing ||M^E e_(i)||, we conclude that the largest singular value of M^E is at least √( C log n / log log n ). [sent-306, score-0.244]
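
A small simulation of this effect (n and ε are arbitrary illustrative choices): the largest singular value of M^E is at least the square root of the heaviest column count, which exceeds the value ε attained by E{M^E}.

    import numpy as np

    rng = np.random.default_rng(4)
    n, eps = 2000, 3.0
    mask = rng.random((n, n)) < eps / n     # each entry revealed with probability eps/n
    ME = mask.astype(float)                 # M_ij = 1, so M^E is just the sampling pattern
    sigma_max = np.linalg.norm(ME, 2)       # largest singular value of M^E
    heaviest_column = mask.sum(axis=0).max()
    print(sigma_max, np.sqrt(heaviest_column), eps)
    # sigma_max >= sqrt(heaviest_column) > eps, whereas E{M^E} = (eps/n) * ones has top singular value eps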

65 Also, the phenomenon is more severe in real data sets than in the present model, where each entry is revealed independently. [sent-311, score-0.146]

66 This suggests that the largest singular value of the noise matrix Z^E is quite different from the largest singular value of E{Z^E}, which is ε Z_max. [sent-315, score-0.437]

67 The estimates of Theorem 1.3 (for the worst case model) simply do not hold without trimming or a similar procedure to normalize the rows/columns of N^E. [sent-318, score-0.188]

68 As explained in the introduction, the crucial idea is to consider the singular value decomposition of the trimmed matrix Ñ^E instead of the original matrix N^E. [sent-322, score-0.369]

69 Apart from a trivial rescaling, these singular values are close to the ones of the original matrix M. [sent-323, score-0.176]

70 Proof: For any matrix A, let σ_q(A) denote the qth singular value of A. [sent-325, score-0.176]

71 Lemma 2 (Keshavan, Montanari, Oh, 2009) There exists a numerical constant C such that, with probability larger than 1 − 1/n^3, (1/√(mn)) || M − (√(mn)/ε) M̃^E ||_2 ≤ C M_max √( α/ε ). [sent-328, score-0.254]

72 In particular, we will use this to bound the distance between the original matrix M = UΣV^T and the starting point of the manifold optimization M̂ = X_0 S_0 Y_0^T. [sent-347, score-0.196]

73 Given the form of the cost function in Eq. (2), it is also convenient to introduce the notations d_−(x, u) ≡ √( Σ_min² d(x, u)² + ||S − Σ||_F² ) and d_+(x, u) ≡ √( Σ_max² d(x, u)² + ||S − Σ||_F² ). [sent-356, score-0.234]

74 Then ||grad F(x)||² ≥ C_1 n ε² Σ_min⁴ [ d(x, u) − C_2 (√r Σ_max / Σ_min) ( ||Z^E||_2 / (ε Σ_min) ) ]_+², (12) for all x ∈ M(m, n) ∩ K(4µ_0) such that d(x, u) ≤ δ, with probability at least 1 − 1/n^4. [sent-371, score-0.277]

75 From Eqs. (8) and (9) we get F(x) − F(u) ≥ C_1 n ε √α Σ_min² ( d(x, u)² − δ_{0,−}² ) (14) and F(x) − F(u) ≤ C_2 n ε √α Σ_max² ( d(x, u)² + δ_{0,+}² ), (15) with C_1 and C_2 different from those in Eqs. [sent-384, score-0.139]

76 Using Eq. (13) with a large enough C, we have δ_{0,−} ≤ δ/20 and δ_{0,+} ≤ (δ/20)(Σ_min/Σ_max). [sent-416, score-0.095]

77 F(x) ≥ F(u) + C_1 n ε √α Σ_min² δ²/400. Hence, for all x_k such that d(x_k, u) ∈ [δ/10, δ], we have F(x_k) ≥ F(x_0). [sent-421, score-0.192]

78 Since the cost function is twice differentiable, and because of the above two claims, the sequence {x_k} converges to Ω = { x ∈ K(4µ_0) ∩ M(m, n) : d(x, u) ≤ δ, grad F(x) = 0 }. [sent-423, score-0.138]

79 Together with Eqs. (18) and (16), this implies (1/(n√α)) ||M − XSY^T||_F ≤ C ( √(2r) Σ_max / (ε Σ_min²) ) ||Z^E||_2, which finishes the proof of Theorem 1. [sent-427, score-0.178]

80 It is therefore sufficient to lower bound the scalar product ⟨grad F, w⟩. [sent-465, score-0.189]

81 By Keshavan et al. (2010), ⟨grad F_0(x), w⟩ ≥ C n ε √α Σ_min² d(x, u)² (23) (see Lemma 9 in the Appendix). [sent-468, score-0.277]

82 Combining Eqs. (21) and (23), this implies ⟨grad F(x), w⟩ ≥ C_1 n ε √α Σ_min² d(x, u) [ d(x, u) − C_2 ( √(2r) Σ_max / Σ_min ) ( ||Z^E||_2 / (ε Σ_min) ) ], which implies Eq. [sent-476, score-0.138]

83 Proof of Theorem 1.3 (independent entries model): We start with the claim that, for any sampling set E, we have ||Z̃^E||_2 ≤ ||Z^E||_2. [sent-480, score-0.141]

84 Recall that, as a result of the trimming step, all the entries in trimmed rows and columns of Z E are set to zero. [sent-482, score-0.451]

85 From this observation, it follows that x*^T Z̃^E y* = x*^T Z^E y*, since the trimmed matrix Z̃^E and the sample noise matrix Z^E only differ in the trimmed rows and columns. [sent-485, score-0.414]

86 Indeed, for any M and M′, ||M′||_2 − ||M||_2 ≤ ||M′ − M||_2 ≤ ||M′ − M||_F, where the first inequality follows from the triangle inequality and the second from the fact that ||·||_F² is the sum of the squared singular values. [sent-501, score-0.253]

87 To this end, we apply the following bound on the expected norm of random matrices with i.i.d. entries. [sent-516, score-0.146]

88 E||Z^E||_2 ≤ C ( E[ max_{i∈[m]} ||Z^E_{i•}|| ] + E[ max_{j∈[n]} ||Z^E_{•j}|| ] ), (26) where Z^E_{i•} and Z^E_{•j} denote the ith row and the jth column of Z^E, respectively. [sent-521, score-0.19]

89 E[ max_j ||Z^E_{•j}||² ] ≤ βσ²ε√α + ∫_0^∞ P( max_j ||Z^E_{•j}||² ≥ βσ²ε√α + z ) dz. (27) [sent-523, score-0.19]

90 To bound the second term, we can apply the union bound to each of the n columns, and use the following bound on each column norm ||Z^E_{•j}||², resulting from a concentration of measure inequality for the i.i.d. entries. [sent-524, score-0.192]

91 Recall that Z^E_{kj} = ξ_{kj} Z_{kj}, where the ξ_{kj}'s are independent Bernoulli random variables such that ξ = 1 with probability ε/√(mn) and zero with probability 1 − ε/√(mn). [sent-530, score-0.218]

92 Since E[ max_j ||Z^E_{•j}|| ] ≤ √( E[ max_j ||Z^E_{•j}||² ] ), applying Eq. [sent-539, score-0.19]

93 Then for any matrix Z from the worst case model, we have ||Z^E||_2 ≤ Z_max ||D^E||_2, since x^T Z^E y ≤ Σ_{i,j} Z_max |x_i| D^E_{ij} |y_j|, which follows from the fact that the Z_ij's are uniformly bounded; here D^E is the adjacency matrix of the corresponding sampling pattern E. [sent-553, score-0.205]

94 Then, C_1 √α Σ_min² d(x, u)² + C_1 √α ||S_0 − Σ||_F² ≤ (1/(nε)) F_0(x) ≤ C_2 √α Σ_max² d(x, u)², for all x ∈ M(m, n) ∩ K(4µ_0) such that d(x, u) ≤ δ, with probability at least 1 − 1/n^4. [sent-564, score-0.234]

95 Then ||grad F_0(x)||² ≥ C n ε² Σ_min⁴ d(x, u)², for all x ∈ M(m, n) ∩ K(4µ_0) such that d(x, u) ≤ δ, with probability at least 1 − 1/n^4. [sent-569, score-0.277]

96 Under the hypotheses of Lemma 8, ⟨grad F_0(x), w⟩ ≥ C n ε √α Σ_min² d(x, u)², for all x ∈ M(m, n) ∩ K(4µ_0) such that d(x, u) ≤ δ, with probability at least 1 − 1/n^4. [sent-573, score-0.277]

97 Optspace: A gradient descent algorithm on the grassman manifold for matrix completion. [sent-643, score-0.227]

98 Fixed point and Bregman iterative methods for matrix rank minimization. [sent-665, score-0.162]

99 Guaranteed minimum rank solutions of matrix equations via nuclear norm minimization. [sent-672, score-0.292]

100 An accelerated proximal gradient algorithm for nuclear norm regularized least squares problems. [sent-736, score-0.165]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('keshavan', 0.323), ('ze', 0.221), ('mn', 0.218), ('cand', 0.214), ('eshavan', 0.198), ('ontanari', 0.198), ('xsy', 0.198), ('pe', 0.195), ('ntries', 0.18), ('xsqt', 0.18), ('ompletion', 0.154), ('trimming', 0.152), ('pace', 0.149), ('pt', 0.143), ('entries', 0.141), ('min', 0.139), ('oisy', 0.138), ('grad', 0.138), ('zmax', 0.126), ('trimmed', 0.123), ('montanari', 0.111), ('revealed', 0.108), ('ziej', 0.108), ('sy', 0.108), ('singular', 0.106), ('plan', 0.101), ('atrix', 0.101), ('max', 0.095), ('rank', 0.092), ('cmmax', 0.09), ('rmse', 0.089), ('oh', 0.081), ('arxiv', 0.078), ('manifold', 0.075), ('nuclear', 0.075), ('mmax', 0.072), ('zi', 0.07), ('matrix', 0.07), ('theoretic', 0.066), ('noise', 0.063), ('completion', 0.063), ('salakhutdinov', 0.058), ('relaxation', 0.057), ('norm', 0.055), ('optspace', 0.054), ('seginer', 0.054), ('rr', 0.053), ('xk', 0.053), ('zk', 0.052), ('mi', 0.051), ('bound', 0.051), ('lemma', 0.048), ('theorem', 0.048), ('recht', 0.048), ('descent', 0.047), ('ni', 0.047), ('srebro', 0.046), ('log', 0.046), ('achlioptas', 0.046), ('stanford', 0.045), ('root', 0.043), ('yt', 0.043), ('rn', 0.043), ('nr', 0.043), ('pr', 0.043), ('iterations', 0.042), ('cn', 0.042), ('incoherence', 0.042), ('collaborative', 0.041), ('matrices', 0.04), ('inequality', 0.039), ('proof', 0.039), ('fazel', 0.038), ('entry', 0.038), ('numerical', 0.036), ('worst', 0.036), ('moracle', 0.036), ('msdp', 0.036), ('raghunandan', 0.036), ('sewoong', 0.036), ('uik', 0.036), ('constants', 0.036), ('fit', 0.036), ('rows', 0.035), ('gradient', 0.035), ('analogously', 0.034), ('lipschitz', 0.034), ('rm', 0.034), ('square', 0.033), ('stress', 0.032), ('analogous', 0.031), ('proves', 0.031), ('lemmas', 0.031), ('mcsherry', 0.031), ('frieze', 0.031), ('whence', 0.031), ('triangular', 0.03), ('projection', 0.03), ('uniformly', 0.029), ('svd', 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 72 jmlr-2010-Matrix Completion from Noisy Entries

Author: Raghunandan H. Keshavan, Andrea Montanari, Sewoong Oh

Abstract: Given a matrix M of low-rank, we consider the problem of reconstructing it from noisy observations of a small, random subset of its entries. The problem arises in a variety of applications, from collaborative filtering (the ‘Netflix problem’) to structure-from-motion and positioning. We study a low complexity algorithm introduced by Keshavan, Montanari, and Oh (2010), based on a combination of spectral techniques and manifold optimization, that we call here O PT S PACE. We prove performance guarantees that are order-optimal in a number of circumstances. Keywords: matrix completion, low-rank matrices, spectral methods, manifold optimization

2 0.1695682 105 jmlr-2010-Spectral Regularization Algorithms for Learning Large Incomplete Matrices

Author: Rahul Mazumder, Trevor Hastie, Robert Tibshirani

Abstract: We use convex relaxation techniques to provide a sequence of regularized low-rank solutions for large-scale matrix completion problems. Using the nuclear norm as a regularizer, we provide a simple and very efficient convex algorithm for minimizing the reconstruction error subject to a bound on the nuclear norm. Our algorithm S OFT-I MPUTE iteratively replaces the missing elements with those obtained from a soft-thresholded SVD. With warm starts this allows us to efficiently compute an entire regularization path of solutions on a grid of values of the regularization parameter. The computationally intensive part of our algorithm is in computing a low-rank SVD of a dense matrix. Exploiting the problem structure, we show that the task can be performed with a complexity of order linear in the matrix dimensions. Our semidefinite-programming algorithm is readily scalable to large matrices; for example S OFT-I MPUTE takes a few hours to compute low-rank approximations of a 106 × 106 incomplete matrix with 107 observed entries, and fits a rank-95 approximation to the full Netflix training set in 3.3 hours. Our methods achieve good training and test errors and exhibit superior timings when compared to other competitive state-of-the-art techniques. Keywords: collaborative filtering, nuclear norm, spectral regularization, netflix prize, large scale convex optimization

3 0.13139808 84 jmlr-2010-On Spectral Learning

Author: Andreas Argyriou, Charles A. Micchelli, Massimiliano Pontil

Abstract: In this paper, we study the problem of learning a matrix W from a set of linear measurements. Our formulation consists in solving an optimization problem which involves regularization with a spectral penalty term. That is, the penalty term is a function of the spectrum of the covariance of W . Instances of this problem in machine learning include multi-task learning, collaborative filtering and multi-view learning, among others. Our goal is to elucidate the form of the optimal solution of spectral learning. The theory of spectral learning relies on the von Neumann characterization of orthogonally invariant norms and their association with symmetric gauge functions. Using this tool we formulate a representer theorem for spectral regularization and specify it to several useful example, such as Schatten p−norms, trace norm and spectral norm, which should proved useful in applications. Keywords: kernel methods, matrix learning, minimal norm interpolation, multi-task learning, orthogonally invariant norms, regularization

4 0.11248519 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming

Author: Ming Yuan

Abstract: This paper considers the problem of estimating a high dimensional inverse covariance matrix that can be well approximated by “sparse” matrices. Taking advantage of the connection between multivariate linear regression and entries of the inverse covariance matrix, we propose an estimating procedure that can effectively exploit such “sparsity”. The proposed method can be computed using linear programming and therefore has the potential to be used in very high dimensional problems. Oracle inequalities are established for the estimation error in terms of several operator norms, showing that the method is adaptive to different types of sparsity of the problem. Keywords: covariance selection, Dantzig selector, Gaussian graphical model, inverse covariance matrix, Lasso, linear programming, oracle inequality, sparsity

5 0.093927905 45 jmlr-2010-High-dimensional Variable Selection with Sparse Random Projections: Measurement Sparsity and Statistical Efficiency

Author: Dapo Omidiran, Martin J. Wainwright

Abstract: We consider the problem of high-dimensional variable selection: given n noisy observations of a k-sparse vector β∗ ∈ R p , estimate the subset of non-zero entries of β∗ . A significant body of work has studied behavior of ℓ1 -relaxations when applied to random measurement matrices that are dense (e.g., Gaussian, Bernoulli). In this paper, we analyze sparsified measurement ensembles, and consider the trade-off between measurement sparsity, as measured by the fraction γ of nonzero entries, and the statistical efficiency, as measured by the minimal number of observations n required for correct variable selection with probability converging to one. Our main result is to prove that it is possible to let the fraction on non-zero entries γ → 0 at some rate, yielding measurement matrices with a vanishing fraction of non-zeros per row, while retaining the same statistical efficiency as dense ensembles. A variety of simulation results confirm the sharpness of our theoretical predictions. Keywords: variable selection, sparse random projections, high-dimensional statistics, Lasso, consistency, ℓ1 -regularization

6 0.088903084 27 jmlr-2010-Consistent Nonparametric Tests of Independence

7 0.086302079 82 jmlr-2010-On Learning with Integral Operators

8 0.076270141 12 jmlr-2010-Analysis of Multi-stage Convex Relaxation for Sparse Regularization

9 0.070227638 99 jmlr-2010-Restricted Eigenvalue Properties for Correlated Gaussian Designs

10 0.057240371 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes

11 0.051123925 87 jmlr-2010-Online Learning for Matrix Factorization and Sparse Coding

12 0.050688554 5 jmlr-2010-A Quasi-Newton Approach to Nonsmooth Convex Optimization Problems in Machine Learning

13 0.050586648 97 jmlr-2010-Regret Bounds and Minimax Policies under Partial Monitoring

14 0.047482621 62 jmlr-2010-Learning Gradients: Predictive Models that Infer Geometry and Statistical Dependence

15 0.046673786 65 jmlr-2010-Learning Translation Invariant Kernels for Classification

16 0.045334995 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond

17 0.044258986 3 jmlr-2010-A Fast Hybrid Algorithm for Large-Scalel1-Regularized Logistic Regression

18 0.042976808 55 jmlr-2010-Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance

19 0.041828468 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions

20 0.041245263 92 jmlr-2010-Practical Approaches to Principal Component Analysis in the Presence of Missing Values


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.218), (1, -0.151), (2, 0.144), (3, -0.043), (4, -0.171), (5, -0.102), (6, -0.035), (7, -0.104), (8, 0.076), (9, -0.064), (10, -0.112), (11, 0.123), (12, 0.003), (13, 0.046), (14, 0.106), (15, -0.131), (16, -0.047), (17, -0.087), (18, -0.14), (19, 0.142), (20, -0.093), (21, 0.037), (22, 0.227), (23, 0.059), (24, 0.183), (25, -0.208), (26, 0.022), (27, 0.053), (28, 0.002), (29, -0.079), (30, 0.029), (31, -0.008), (32, -0.055), (33, 0.01), (34, 0.064), (35, 0.007), (36, -0.199), (37, -0.014), (38, 0.058), (39, 0.095), (40, -0.037), (41, -0.094), (42, 0.007), (43, 0.037), (44, -0.054), (45, -0.055), (46, 0.139), (47, -0.004), (48, 0.026), (49, 0.043)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92195994 72 jmlr-2010-Matrix Completion from Noisy Entries

Author: Raghunandan H. Keshavan, Andrea Montanari, Sewoong Oh

Abstract: Given a matrix M of low-rank, we consider the problem of reconstructing it from noisy observations of a small, random subset of its entries. The problem arises in a variety of applications, from collaborative filtering (the ‘Netflix problem’) to structure-from-motion and positioning. We study a low complexity algorithm introduced by Keshavan, Montanari, and Oh (2010), based on a combination of spectral techniques and manifold optimization, that we call here O PT S PACE. We prove performance guarantees that are order-optimal in a number of circumstances. Keywords: matrix completion, low-rank matrices, spectral methods, manifold optimization

2 0.89172298 105 jmlr-2010-Spectral Regularization Algorithms for Learning Large Incomplete Matrices

Author: Rahul Mazumder, Trevor Hastie, Robert Tibshirani

Abstract: We use convex relaxation techniques to provide a sequence of regularized low-rank solutions for large-scale matrix completion problems. Using the nuclear norm as a regularizer, we provide a simple and very efficient convex algorithm for minimizing the reconstruction error subject to a bound on the nuclear norm. Our algorithm S OFT-I MPUTE iteratively replaces the missing elements with those obtained from a soft-thresholded SVD. With warm starts this allows us to efficiently compute an entire regularization path of solutions on a grid of values of the regularization parameter. The computationally intensive part of our algorithm is in computing a low-rank SVD of a dense matrix. Exploiting the problem structure, we show that the task can be performed with a complexity of order linear in the matrix dimensions. Our semidefinite-programming algorithm is readily scalable to large matrices; for example S OFT-I MPUTE takes a few hours to compute low-rank approximations of a 106 × 106 incomplete matrix with 107 observed entries, and fits a rank-95 approximation to the full Netflix training set in 3.3 hours. Our methods achieve good training and test errors and exhibit superior timings when compared to other competitive state-of-the-art techniques. Keywords: collaborative filtering, nuclear norm, spectral regularization, netflix prize, large scale convex optimization

3 0.6468032 84 jmlr-2010-On Spectral Learning

Author: Andreas Argyriou, Charles A. Micchelli, Massimiliano Pontil

Abstract: In this paper, we study the problem of learning a matrix W from a set of linear measurements. Our formulation consists in solving an optimization problem which involves regularization with a spectral penalty term. That is, the penalty term is a function of the spectrum of the covariance of W . Instances of this problem in machine learning include multi-task learning, collaborative filtering and multi-view learning, among others. Our goal is to elucidate the form of the optimal solution of spectral learning. The theory of spectral learning relies on the von Neumann characterization of orthogonally invariant norms and their association with symmetric gauge functions. Using this tool we formulate a representer theorem for spectral regularization and specify it to several useful example, such as Schatten p−norms, trace norm and spectral norm, which should proved useful in applications. Keywords: kernel methods, matrix learning, minimal norm interpolation, multi-task learning, orthogonally invariant norms, regularization

4 0.41730785 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming

Author: Ming Yuan

Abstract: This paper considers the problem of estimating a high dimensional inverse covariance matrix that can be well approximated by “sparse” matrices. Taking advantage of the connection between multivariate linear regression and entries of the inverse covariance matrix, we propose an estimating procedure that can effectively exploit such “sparsity”. The proposed method can be computed using linear programming and therefore has the potential to be used in very high dimensional problems. Oracle inequalities are established for the estimation error in terms of several operator norms, showing that the method is adaptive to different types of sparsity of the problem. Keywords: covariance selection, Dantzig selector, Gaussian graphical model, inverse covariance matrix, Lasso, linear programming, oracle inequality, sparsity

5 0.40421787 82 jmlr-2010-On Learning with Integral Operators

Author: Lorenzo Rosasco, Mikhail Belkin, Ernesto De Vito

Abstract: A large number of learning algorithms, for example, spectral clustering, kernel Principal Components Analysis and many manifold methods are based on estimating eigenvalues and eigenfunctions of operators defined by a similarity function or a kernel, given empirical data. Thus for the analysis of algorithms, it is an important problem to be able to assess the quality of such approximations. The contribution of our paper is two-fold: 1. We use a technique based on a concentration inequality for Hilbert spaces to provide new much simplified proofs for a number of results in spectral approximation. 2. Using these methods we provide several new results for estimating spectral properties of the graph Laplacian operator extending and strengthening results from von Luxburg et al. (2008). Keywords: spectral convergence, empirical operators, learning integral operators, perturbation methods

6 0.38409016 45 jmlr-2010-High-dimensional Variable Selection with Sparse Random Projections: Measurement Sparsity and Statistical Efficiency

7 0.31571192 99 jmlr-2010-Restricted Eigenvalue Properties for Correlated Gaussian Designs

8 0.30986947 27 jmlr-2010-Consistent Nonparametric Tests of Independence

9 0.29852888 12 jmlr-2010-Analysis of Multi-stage Convex Relaxation for Sparse Regularization

10 0.27930978 97 jmlr-2010-Regret Bounds and Minimax Policies under Partial Monitoring

11 0.24863143 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes

12 0.2413002 3 jmlr-2010-A Fast Hybrid Algorithm for Large-Scalel1-Regularized Logistic Regression

13 0.23424013 87 jmlr-2010-Online Learning for Matrix Factorization and Sparse Coding

14 0.23313877 43 jmlr-2010-Generalized Power Method for Sparse Principal Component Analysis

15 0.223066 62 jmlr-2010-Learning Gradients: Predictive Models that Infer Geometry and Statistical Dependence

16 0.22002113 2 jmlr-2010-A Convergent Online Single Time Scale Actor Critic Algorithm

17 0.21843247 20 jmlr-2010-Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary β-Mixing Processes

18 0.21556062 96 jmlr-2010-Rate Minimaxity of the Lasso and Dantzig Selector for thelqLoss inlrBalls

19 0.21224818 92 jmlr-2010-Practical Approaches to Principal Component Analysis in the Presence of Missing Values

20 0.21124648 65 jmlr-2010-Learning Translation Invariant Kernels for Classification


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.547), (8, 0.019), (15, 0.01), (21, 0.017), (32, 0.047), (36, 0.02), (37, 0.042), (51, 0.016), (75, 0.133), (81, 0.017), (85, 0.038)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.89142227 47 jmlr-2010-Hilbert Space Embeddings and Metrics on Probability Measures

Author: Bharath K. Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Schölkopf, Gert R.G. Lanckriet

Abstract: A Hilbert space embedding for probability measures has recently been proposed, with applications including dimensionality reduction, homogeneity testing, and independence testing. This embedding represents any probability measure as a mean element in a reproducing kernel Hilbert space (RKHS). A pseudometric on the space of probability measures can be defined as the distance between distribution embeddings: we denote this as γk, indexed by the kernel function k that defines the inner product in the RKHS. We present three theoretical properties of γk. First, we consider the question of determining the conditions on the kernel k for which γk is a metric: such k are denoted characteristic kernels. Unlike pseudometrics, a metric is zero only when two distributions coincide, thus ensuring the RKHS embedding maps all distributions uniquely (i.e., the embedding is injective). While previously published conditions may apply only in restricted circumstances (e.g., on compact domains), and are difficult to check, our conditions are straightforward and intuitive: integrally strictly positive definite kernels are characteristic. Alternatively, if a bounded continuous kernel is translation-invariant on Rd, then it is characteristic if and only if the support of its Fourier transform is the entire Rd. Second, we show that the distance between distributions under γk results from an interplay between the properties of the kernel and the distributions, by demonstrating that distributions are close in the embedding space when their differences occur at higher frequencies. Third, to understand the nature of the topology induced by γk, we relate γk to other popular metrics on probability measures, and present conditions on the kernel k und

same-paper 2 0.80486804 72 jmlr-2010-Matrix Completion from Noisy Entries

Author: Raghunandan H. Keshavan, Andrea Montanari, Sewoong Oh

Abstract: Given a matrix M of low-rank, we consider the problem of reconstructing it from noisy observations of a small, random subset of its entries. The problem arises in a variety of applications, from collaborative filtering (the ‘Netflix problem’) to structure-from-motion and positioning. We study a low complexity algorithm introduced by Keshavan, Montanari, and Oh (2010), based on a combination of spectral techniques and manifold optimization, that we call here O PT S PACE. We prove performance guarantees that are order-optimal in a number of circumstances. Keywords: matrix completion, low-rank matrices, spectral methods, manifold optimization

3 0.48729491 27 jmlr-2010-Consistent Nonparametric Tests of Independence

Author: Arthur Gretton, László Györfi

Abstract: Three simple and explicit procedures for testing the independence of two multi-dimensional random variables are described. Two of the associated test statistics (L1 , log-likelihood) are defined when the empirical distribution of the variables is restricted to finite partitions. A third test statistic is defined as a kernel-based independence measure. Two kinds of tests are provided. Distributionfree strong consistent tests are derived on the basis of large deviation bounds on the test statistics: these tests make almost surely no Type I or Type II error after a random sample size. Asymptotically α-level tests are obtained from the limiting distribution of the test statistics. For the latter tests, the Type I error converges to a fixed non-zero value α, and the Type II error drops to zero, for increasing sample size. All tests reject the null hypothesis of independence if the test statistics become large. The performance of the tests is evaluated experimentally on benchmark data. Keywords: hypothesis test, independence, L1, log-likelihood, kernel methods, distribution-free consistent test

4 0.48016688 84 jmlr-2010-On Spectral Learning

Author: Andreas Argyriou, Charles A. Micchelli, Massimiliano Pontil

Abstract: In this paper, we study the problem of learning a matrix W from a set of linear measurements. Our formulation consists in solving an optimization problem which involves regularization with a spectral penalty term. That is, the penalty term is a function of the spectrum of the covariance of W . Instances of this problem in machine learning include multi-task learning, collaborative filtering and multi-view learning, among others. Our goal is to elucidate the form of the optimal solution of spectral learning. The theory of spectral learning relies on the von Neumann characterization of orthogonally invariant norms and their association with symmetric gauge functions. Using this tool we formulate a representer theorem for spectral regularization and specify it to several useful example, such as Schatten p−norms, trace norm and spectral norm, which should proved useful in applications. Keywords: kernel methods, matrix learning, minimal norm interpolation, multi-task learning, orthogonally invariant norms, regularization

5 0.47784224 20 jmlr-2010-Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary β-Mixing Processes

Author: Liva Ralaivola, Marie Szafranski, Guillaume Stempfel

Abstract: PAC-Bayes bounds are among the most accurate generalization bounds for classifiers learned from independently and identically distributed (IID) data, and it is particularly so for margin classifiers: there have been recent contributions showing how practical these bounds can be either to perform model selection (Ambroladze et al., 2007) or even to directly guide the learning of linear classifiers (Germain et al., 2009). However, there are many practical situations where the training data show some dependencies and where the traditional IID assumption does not hold. Stating generalization bounds for such frameworks is therefore of the utmost interest, both from theoretical and practical standpoints. In this work, we propose the first—to the best of our knowledge—PAC-Bayes generalization bounds for classifiers trained on data exhibiting interdependencies. The approach undertaken to establish our results is based on the decomposition of a so-called dependency graph that encodes the dependencies within the data, in sets of independent data, thanks to graph fractional covers. Our bounds are very general, since being able to find an upper bound on the fractional chromatic number of the dependency graph is sufficient to get new PAC-Bayes bounds for specific settings. We show how our results can be used to derive bounds for ranking statistics (such as AUC) and classifiers trained on data distributed according to a stationary β-mixing process. In the way, we show how our approach seamlessly allows us to deal with U-processes. As a side note, we also provide a PAC-Bayes generalization bound for classifiers learned on data from stationary ϕ-mixing distributions. Keywords: PAC-Bayes bounds, non IID data, ranking, U-statistics, mixing processes

6 0.47628227 105 jmlr-2010-Spectral Regularization Algorithms for Learning Large Incomplete Matrices

7 0.47146821 82 jmlr-2010-On Learning with Integral Operators

8 0.46456 97 jmlr-2010-Regret Bounds and Minimax Policies under Partial Monitoring

9 0.45803165 106 jmlr-2010-Stability Bounds for Stationary φ-mixing and β-mixing Processes

10 0.44353998 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming

11 0.42773855 96 jmlr-2010-Rate Minimaxity of the Lasso and Dantzig Selector for thelqLoss inlrBalls

12 0.42108393 2 jmlr-2010-A Convergent Online Single Time Scale Actor Critic Algorithm

13 0.41338947 60 jmlr-2010-Learnability, Stability and Uniform Convergence

14 0.40757242 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning

15 0.40731481 19 jmlr-2010-Characterization, Stability and Convergence of Hierarchical Clustering Methods

16 0.40289754 25 jmlr-2010-Composite Binary Losses

17 0.39404744 108 jmlr-2010-Stochastic Complexity and Generalization Error of a Restricted Boltzmann Machine in Bayesian Estimation

18 0.3882736 86 jmlr-2010-On the Rate of Convergence of the Bagged Nearest Neighbor Estimate

19 0.38137105 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond

20 0.38023421 102 jmlr-2010-Semi-Supervised Novelty Detection