jmlr jmlr2010 jmlr2010-72 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Raghunandan H. Keshavan, Andrea Montanari, Sewoong Oh
Abstract: Given a matrix M of low-rank, we consider the problem of reconstructing it from noisy observations of a small, random subset of its entries. The problem arises in a variety of applications, from collaborative filtering (the ‘Netflix problem’) to structure-from-motion and positioning. We study a low complexity algorithm introduced by Keshavan, Montanari, and Oh (2010), based on a combination of spectral techniques and manifold optimization, that we call here O PT S PACE. We prove performance guarantees that are order-optimal in a number of circumstances. Keywords: matrix completion, low-rank matrices, spectral methods, manifold optimization
Reference: text
sentIndex sentText sentNum sentScore
1 Keywords: matrix completion, low-rank matrices, spectral methods, manifold optimization 1. Introduction [sent-9, score-0.145]
2 Given a matrix M, its largest singular values—and the associated singular vectors—‘explain’ the most significant correlations in the underlying data source. [sent-11, score-0.282]
3 In many practical circumstances we have access only to a sparse subset of the entries of an m × n matrix M. [sent-14, score-0.211]
4 It has recently been discovered that, if the matrix M has rank r, and unless it is too 'structured', a small random subset of its entries allows one to reconstruct it exactly. [sent-15, score-0.303]
5 OptSpace is intrinsically of low complexity, the most complex operation being the computation of r singular values (and the corresponding singular vectors) of a sparse m × n matrix. [sent-22, score-0.212]
6 (2010) are comparable with the information theoretic lower bound: roughly nr max{r, log n} random entries are needed to reconstruct M exactly (here we assume m of order n). [sent-24, score-0.296]
7 1.1 Model Definition. Let M be an m × n matrix of rank r, that is, M = U Σ V^T. [sent-37, score-0.162]
8 We assume that each entry of M is perturbed, thus producing an 'approximately' low-rank matrix N, with N_ij = M_ij + Z_ij, where the matrix Z will be assumed to be 'small' in an appropriate sense. [sent-39, score-0.178]
9 Out of the m × n entries of N, a subset E ⊆ [m] × [n] is revealed. [sent-40, score-0.141]
10 We let N^E be the m × n matrix that contains the revealed entries of N and is filled with 0's in the other positions: N^E_ij = N_ij if (i, j) ∈ E, and N^E_ij = 0 otherwise. [sent-41, score-0.319]
11 Analogously, we let M^E and Z^E be the m × n matrices that contain the entries of M and Z, respectively, in the revealed positions and are filled with 0's in the other positions. [sent-42, score-0.289]
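To make the notation concrete, here is a short Python/NumPy sketch (our illustration, not the authors' code; the dimensions, rank, noise level and |E| are arbitrary choices) that builds M^E, Z^E and N^E from a uniformly random subset E of the entries.

import numpy as np

rng = np.random.default_rng(0)
m, n, r = 100, 120, 3
num_revealed = 2000                       # |E|

# Low-rank matrix M = U V^T plus a small perturbation Z, so that N = M + Z.
U, V = rng.standard_normal((m, r)), rng.standard_normal((n, r))
M = U @ V.T
Z = 0.1 * rng.standard_normal((m, n))
N = M + Z

# Uniformly random subset E of size |E|, encoded as a 0/1 mask.
flat_idx = rng.choice(m * n, size=num_revealed, replace=False)
mask = np.zeros((m, n))
mask.flat[flat_idx] = 1.0

# M^E, Z^E, N^E: revealed entries kept, zeros in the other positions.
ME, ZE, NE = M * mask, Z * mask, N * mask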
12 The basic idea is to minimize the cost function F(X, Y), defined by F(X, Y) ≡ min_{S ∈ R^{r×r}} F(X, Y, S), where F(X, Y, S) ≡ (1/2) Σ_{(i,j)∈E} (N_ij − (X S Y^T)_ij)^2.  (2) [sent-47, score-0.139]
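The inner minimization over S in Eq. (2) is an ordinary linear least squares problem in the r^2 entries of S. The sketch below (ours, reusing the NE and mask arrays from the sketch above; dense linear algebra, so only meant for small examples) evaluates F(X, Y) in this way.

import numpy as np

def cost_F(X, Y, NE, mask):
    # F(X, Y) = min_S 0.5 * sum_{(i,j) in E} (N_ij - (X S Y^T)_ij)^2.
    # For fixed X, Y this is linear least squares in vec(S).
    r = X.shape[1]
    ii, jj = np.nonzero(mask)                              # revealed positions E
    # Row k of the design matrix represents (X S Y^T)_{ii[k], jj[k]} as a
    # linear function of vec(S): the coefficient of S[a, b] is X[i, a] * Y[j, b].
    A = np.einsum('ka,kb->kab', X[ii], Y[jj]).reshape(len(ii), r * r)
    b = NE[ii, jj]
    S_vec, *_ = np.linalg.lstsq(A, b, rcond=None)
    resid = b - A @ S_vec
    return 0.5 * np.sum(resid ** 2), S_vec.reshape(r, r)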
13 The key insight is that the singular value decomposition (SVD) of N E provides an excellent initial guess, and that the minimum can be found with high probability by standard gradient descent after this initialization. [sent-50, score-0.188]
14 We may note here that the rank of the matrix M, if not known, can be reliably estimated from N E (Keshavan and Oh, 2009). [sent-53, score-0.162]
15 We say that a row is 'over-represented' if it contains more than 2|E|/m revealed entries (i.e., more than twice the average number of revealed entries per row); analogously, a column is over-represented if it contains more than 2|E|/n revealed entries. [sent-56, score-0.249] [sent-58, score-0.249]
17 The trimmed matrix Ñ^E is obtained from N^E by setting to 0 the over-represented rows and columns. [sent-60, score-0.228]
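A minimal sketch of the trimming step (ours), using the 2|E|/m and 2|E|/n thresholds described above:

import numpy as np

def trim(NE, mask):
    # Zero out over-represented rows and columns of N^E.
    m, n = NE.shape
    E_size = mask.sum()
    row_counts = mask.sum(axis=1)          # revealed entries per row
    col_counts = mask.sum(axis=0)          # revealed entries per column
    NE_trim = NE.copy()
    NE_trim[row_counts > 2 * E_size / m, :] = 0.0
    NE_trim[:, col_counts > 2 * E_size / n] = 0.0
    return NE_trim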
18 Let Ñ^E = Σ_{i=1}^{min(m,n)} σ_i x_i y_i^T be the singular value decomposition of Ñ^E, with singular values σ1 ≥ σ2 ≥ · · ·. [sent-62, score-0.212]
19 We then define P_r(Ñ^E) = (mn/|E|) Σ_{i=1}^{r} σ_i x_i y_i^T. [sent-66, score-0.218]
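The rank-r projection only needs the r leading singular triplets of the (sparse) trimmed matrix; the sketch below (ours) uses scipy.sparse.linalg.svds and applies the mn/|E| rescaling:

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

def rank_r_projection(NE_trim, mask, r):
    # P_r(trimmed N^E): keep the top r singular triplets, rescale by mn/|E|.
    m, n = NE_trim.shape
    E_size = mask.sum()
    u, s, vt = svds(csr_matrix(NE_trim), k=r)
    return (m * n / E_size) * (u * s) @ vt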
20 For further details on optimization by gradient descent on matrix manifolds we refer to Edelman et al. [sent-78, score-0.152]
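The following sketch (ours) is only a simplified stand-in for that manifold procedure: it takes a plain gradient step in the factors X and Y for a fixed S and re-orthonormalizes them with a QR factorization, instead of the exact line minimizations on the Grassmann manifold used by OptSpace; in a full loop one would re-solve for S after each step (e.g., with cost_F above).

import numpy as np

def gradient_step(X, Y, S, NE, mask, step=1e-3):
    # Gradient of 0.5 * || P_E(X S Y^T - N) ||_F^2 with respect to X and Y.
    R = mask * (X @ S @ Y.T - NE)                # residual on the revealed entries
    grad_X = R @ Y @ S.T
    grad_Y = R.T @ X @ S
    X_new, _ = np.linalg.qr(X - step * grad_X)   # retraction: re-orthonormalize
    Y_new, _ = np.linalg.qr(Y - step * grad_Y)
    return X_new, Y_new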
21 Probability is taken with respect to the uniformly random subset E ⊆ [m] × [n] given |E| and (eventually) the noise matrix Z. [sent-96, score-0.162]
22 In the case when m = n, ε ≡ |E|/√(mn) corresponds to the average number of revealed entries per row or column. [sent-98, score-0.249]
23 Then it is convenient to work with a model in which each entry is revealed independently with probability ε/√(mn). [sent-99, score-0.146]
24 We define Z̃^E to be the m × n matrix obtained from Z^E, after the trimming step of the pseudocode above, that is, by setting to zero the over-represented rows and columns. [sent-125, score-0.257]
25 Theorem 1.1 Let N = M + Z, where M has rank r, and assume that the subset of revealed entries E ⊆ [m] × [n] is uniformly random with size |E|. [sent-127, score-0.37]
26 Then there exist numerical constants C and C′ such that (1/√(mn)) ‖M − P_r(Ñ^E)‖_F ≤ C M_max ( n r α^{3/2} / |E| )^{1/2} + C′ ( n √(rα) / |E| ) ‖Z̃^E‖_2, with probability larger than 1 − 1/n^3. [sent-129, score-0.29]
27 At a high-level, projection onto rank-r matrices can be interpreted as ‘treat missing entries as zeros’. [sent-131, score-0.211]
28 This theorem shows that this approach is reasonably robust if the number of observed entries is as large as the number of degrees of freedom (which is about (m + n)r) times a large constant. [sent-132, score-0.189]
29 Let us stress that trimming is crucial for achieving this guarantee. [sent-134, score-0.184]
30 Theorem 1.2 Let N = M + Z, where M is a (µ0, µ1)-incoherent matrix of rank r, and assume that the subset of revealed entries E ⊆ [m] × [n] is uniformly random with size |E|. [sent-146, score-0.44]
31 Then there exist numerical constants C and C′ such that if |E| ≥ C n √α κ^2 max{ µ0 √(αr) log n ; µ0^2 α r^2 κ^4 ; µ1^2 α r^2 κ^4 }, then, with probability at least 1 − 1/n^3, (1/√(mn)) ‖M − M̂‖_F ≤ C′ κ^2 ( n √(rα) / |E| ) ‖Z^E‖_2, where M̂ is the estimate returned by OptSpace, provided that the right-hand side is smaller than Σ_min. [sent-149, score-0.431]
32 1.5 Noise Models. In order to make sense of the above results, it is convenient to consider a couple of simple models for the noise matrix Z. In the independent entries model, the entries Z_ij are independent zero-mean random variables with variance bounded by σ^2; in the worst case model, Z is arbitrary, but its entries are uniformly bounded, |Z_ij| ≤ Z_max. [sent-153, score-0.274]
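As an illustration (ours; the precise distributional assumptions, e.g., tail conditions for the independent entries model, are those stated in the paper), noise matrices from the two models could be generated as follows:

import numpy as np

rng = np.random.default_rng(1)
m, n, sigma, Zmax = 100, 120, 0.1, 0.5

# Independent entries model: independent zero-mean entries of scale sigma.
Z_independent = sigma * rng.standard_normal((m, n))

# Worst case model: arbitrary entries, only constrained by |Z_ij| <= Zmax;
# here a deterministic +/-Zmax checkerboard pattern serves as an example.
Z_worst = Zmax * np.sign(np.add.outer(np.arange(m), np.arange(n)) % 2 - 0.5)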
33 Theorem 1.3 If Z is a random matrix drawn according to the independent entries model, then for any sample size |E| there is a constant C such that ‖Z̃^E‖_2 ≤ C σ ( |E| log n / n )^{1/2},  (4) with probability at least 1 − 1/n^3. [sent-163, score-0.257]
34 Then, among other things, this result implies that for the independent entries model the right-hand side of our error estimate, Eq. [sent-169, score-0.141]
35 Our result, however, applies to any number of revealed entries, while that of Achlioptas and McSherry (2007) requires |E| ≥ (8 log n)^4 n (which for n ≤ 5 · 10^8 is larger than n^2). [sent-175, score-0.154]
36 Figures 1 and 2 compare the average root mean square error ‖M − M̂‖_F /√(mn) for the two algorithms as a function of |E| and of the rank r, respectively. [sent-181, score-0.294]
37 Here M is a random rank-r matrix of dimension m = n = 600, generated by letting M = U V^T with U_ij, V_ij i.i.d. [sent-182, score-0.162]
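A sketch of this kind of synthetic experiment (ours; the factor scaling and noise level of the original figures are not reproduced, and only the rank-r projection baseline is computed, with no trimming or manifold step, so the numbers will not match Figure 1):

import numpy as np

rng = np.random.default_rng(2)
n, r, sigma = 600, 4, 1.0
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))   # random rank-r, m = n = 600
N = M + sigma * rng.standard_normal((n, n))

def rmse(M_hat):
    return np.linalg.norm(M_hat - M) / np.sqrt(M.size)           # ||M_hat - M||_F / sqrt(mn)

for E_over_n in (40, 80, 120):
    mask = (rng.random((n, n)) < E_over_n / n).astype(float)     # Bernoulli sampling of E
    NE = N * mask
    u, s, vt = np.linalg.svd(NE, full_matrices=False)
    P_r = (n * n / mask.sum()) * (u[:, :r] * s[:r]) @ vt[:r]     # rank-r projection estimate
    print(E_over_n, rmse(P_r))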
38 These examples are taken from Candès and Plan (2009, Figure [Figure 1 legend: Convex Relaxation, Lower Bound, rank-r projection, OptSpace after 1, 2, 3 and 10 iterations; vertical axis: RMSE] [sent-189, score-0.156]
39 Figure 1: Root mean square error achieved by OptSpace is shown as a function of the number of observed entries |E| and of the number of line minimizations. [sent-194, score-0.174]
40 The performance of nuclear norm minimization and an information theoretic lower bound are also shown. [sent-195, score-0.247]
41 Figure 2: Numerical simulation with random rank-r 600 × 600 matrices and number of observed entries |E|/n = 120. [sent-200, score-0.181]
42 The performance of nuclear norm minimization and an information theoretic lower bound are also shown. [sent-202, score-0.247]
43 Figure 3: Numerical simulation with random rank-2 600 × 600 matrices and number of observed entries |E|/n = 80 and 160. [sent-207, score-0.181]
44 2), from which we took the data points for the convex relaxation approach, as well as the information theoretic lower bound described later in this section. [sent-215, score-0.174]
45 In about 10 iterations it becomes indistinguishable from the information theoretic lower bound for small ranks. [sent-217, score-0.159]
46 Two metrics, the root mean squared error (RMSE) and the fit error ‖P_E(M̂ − N)‖_F /√|E|, are shown as functions of the number of iterations in the manifold optimization step. [sent-219, score-0.16]
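Both metrics are simple to compute; in the sketch below (ours) the normalizations by √(mn) and √|E| are assumptions consistent with the formulas used above:

import numpy as np

def metrics(M_hat, M, N, mask):
    # RMSE and fit error as in the Figure 3 discussion (normalizations assumed).
    rmse = np.linalg.norm(M_hat - M) / np.sqrt(M.size)               # ||M_hat - M||_F / sqrt(mn)
    fit = np.linalg.norm(mask * (M_hat - N)) / np.sqrt(mask.sum())   # ||P_E(M_hat - N)||_F / sqrt(|E|)
    return rmse, fit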
47 For a more complete numerical comparison between various algorithms for matrix completion, including different noise models, real data sets and ill conditioned matrices, we refer to Keshavan and Oh (2009). [sent-230, score-0.169]
48 As far as the error bound is concerned, Candès and Plan (2009) proved that the semidefinite programming approach returns an estimate M̂_SDP which satisfies (1/√(mn)) ‖M̂_SDP − M‖_F ≤ 7 √(n/|E|) ‖Z^E‖_F + (2/(n√α)) ‖Z^E‖_F.  (6) [sent-234, score-0.269]
49 For instance, within the independent entries model with bounded variance σ, ‖Z^E‖_F = Θ(√|E|) while ‖Z^E‖_2 is of order √(|E|/n) (up to logarithmic terms). [sent-241, score-0.141]
50 Suppose, for simplicity, that m = n, and assume that an oracle provides us with a linear subspace T where the correct rank-r matrix M = U Σ V^T lies. [sent-244, score-0.162]
51 The minimum mean square error estimator is computed by projecting the revealed entries onto the subspace T , which can be done by solving a least squares problem. [sent-247, score-0.282]
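A sketch of this oracle least squares step (ours), parameterizing the subspace T = {U A^T + B V^T} by matrices A and B, where U and V are orthonormal bases of the column and row spaces of M supplied by the oracle:

import numpy as np

def oracle_estimate(NE, mask, U, V):
    # Least squares fit of the revealed entries over T = {U A^T + B V^T}.
    m, r = U.shape
    n = V.shape[0]
    ii, jj = np.nonzero(mask)
    K = len(ii)
    # Unknowns: vec(A) (n*r entries) followed by vec(B) (m*r entries).
    D = np.zeros((K, (n + m) * r))
    rows = np.arange(K)
    for a in range(r):
        D[rows, jj * r + a] = U[ii, a]              # coefficient of A[j, a]
        D[rows, n * r + ii * r + a] = V[jj, a]      # coefficient of B[i, a]
    theta, *_ = np.linalg.lstsq(D, NE[ii, jj], rcond=None)
    A = theta[: n * r].reshape(n, r)
    B = theta[n * r:].reshape(m, r)
    return U @ A.T + B @ V.T

Since this parameterization of T is redundant, np.linalg.lstsq returns a minimum-norm solution; the dense design matrix also means the sketch is only intended for small instances.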
52 Candès and Plan (2009) analyzed the root mean squared error of the resulting estimator M̂_Oracle and showed that (1/√(mn)) ‖M̂_Oracle − M‖_F ≈ (√(2nr)/|E|) ‖Z^E‖_F. [sent-248, score-0.261]
53 In this case the oracle estimator yields (for r = o(n)) (1/√(mn)) ‖M̂_Oracle − M‖_F ≈ σ √(2nr/|E|). [sent-254, score-0.218]
54 The bound (6) on the semidefinite programming approach yields (1/√(mn)) ‖M̂_SDP − M‖_F ≤ σ ( 7 √n + 2 √|E|/n ). [sent-255, score-0.269]
55 Using Theorem 1.3, we deduce that OptSpace achieves (1/√(mn)) ‖M̂_OptSpace − M‖_F ≤ C σ √(nr/|E|). [sent-258, score-0.261]
56 An objective function analogous to the one used in the present paper was considered early on in Srebro and Jaakkola (2003), which uses gradient descent in the factors to minimize a weighted sum of square residuals. [sent-270, score-0.146]
57 The basic relation is provided by the identity ‖M‖_* = (1/2) min_{M = X Y^T} ( ‖X‖_F^2 + ‖Y‖_F^2 ),  (7) where ‖M‖_* denotes the nuclear norm of M (the sum of its singular values). [sent-283, score-0.375]
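The identity can be checked numerically: taking X = U √Σ and Y = V √Σ from the SVD of M attains the value ‖M‖_* in Eq. (7) (a small sketch, ours):

import numpy as np

rng = np.random.default_rng(3)
M = rng.standard_normal((8, 6))

nuc = np.linalg.svd(M, compute_uv=False).sum()      # nuclear norm: sum of singular values

# The factorization X = U sqrt(Sigma), Y = V sqrt(Sigma) satisfies M = X Y^T
# and 0.5 * (||X||_F^2 + ||Y||_F^2) equals ||M||_*.
U, s, Vt = np.linalg.svd(M, full_matrices=False)
X = U * np.sqrt(s)
Y = Vt.T * np.sqrt(s)
print(nuc, 0.5 * (np.linalg.norm(X) ** 2 + np.linalg.norm(Y) ** 2))   # equal up to rounding
print(np.allclose(M, X @ Y.T))                                        # True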
58 1.8 On the Spectrum of Sparse Matrices and the Role of Trimming. The trimming step of the OptSpace algorithm is somewhat counter-intuitive in that we seem to be wasting information. [sent-292, score-0.152]
59 One might for instance rescale the entries of such rows/columns. [sent-295, score-0.141]
60 We stick to trimming because we can prove it actually works. [sent-296, score-0.152]
61 Assume, for the sake of simplicity, that m = n, there is no noise in the revealed entries, and M is the rank-one matrix with M_ij = 1 for all i and j. [sent-298, score-0.333]
62 The number of non-zero entries in a column is Binomial(n, ε/n) and is independent for different columns. [sent-303, score-0.141]
63 It is not hard to realize that the column with the largest number of entries has more than C log n/ log log n entries, with positive probability (this probability can be made as large as we want by reducing C). [sent-304, score-0.279]
64 By computing ‖M^E e^(i)‖, we conclude that the largest singular value of M^E is at least √(C log n / log log n). [sent-306, score-0.244]
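A small simulation (ours; the size and ε are arbitrary) illustrating how over-represented columns inflate the leading singular value of M^E, and how trimming removes the effect:

import numpy as np

rng = np.random.default_rng(4)
n, eps = 1000, 2.0                          # about eps revealed entries per row/column
mask = (rng.random((n, n)) < eps / n).astype(float)
ME = mask.copy()                            # M_ij = 1 for all i, j, so M^E = mask

def trim(A, mask):
    B = A.copy()
    thresh = 2 * mask.sum() / n             # 2|E|/n
    B[mask.sum(axis=1) > thresh, :] = 0.0
    B[:, mask.sum(axis=0) > thresh] = 0.0
    return B

sigma1 = np.linalg.norm(ME, 2)              # largest singular value, untrimmed
sigma1_trim = np.linalg.norm(trim(ME, mask), 2)
# The untrimmed value is inflated by columns with unusually many revealed
# entries; after trimming it stays on the scale suggested by eps.
print(eps, sigma1, sigma1_trim)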
65 Also, the phenomenon is more severe in real data sets than in the present model, where each entry is revealed independently. [sent-311, score-0.146]
66 The same argument shows that the largest singular value of Z^E is at least Z_max √(C log n / log log n) ≥ C′(ε) Z_max. This suggests that the largest singular value of the noise matrix Z^E is quite different from the largest singular value of E{Z^E}, which is ε Z_max. [sent-315, score-0.437]
67 Bounds of this type (in particular for the worst case model) simply do not hold without trimming or a similar procedure to normalize the rows and columns of N^E. [sent-318, score-0.188]
68 As explained in the introduction, the crucial idea is to consider the singular value decomposition of the trimmed matrix Ñ^E instead of that of the original matrix N^E. [sent-322, score-0.369]
69 Apart from a trivial rescaling, these singular values are close to the ones of the original matrix M. [sent-323, score-0.176]
70 Proof. For any matrix A, let σ_q(A) denote the q-th singular value of A. [sent-325, score-0.176]
71 Lemma 2 (Keshavan, Montanari, and Oh, 2009) There exists a numerical constant C such that, with probability larger than 1 − 1/n^3, (1/√(mn)) ‖M − (√(mn)/ε) M̃^E‖_2 ≤ C M_max √(α/ε). [sent-328, score-0.254]
72 In particular, we will use this to bound the distance between the original matrix M = U Σ V^T and the starting point of the manifold optimization, X0 S0 Y0^T. [sent-347, score-0.196]
73 In view of Eq. (2), it is also convenient to introduce the notations d−(x, u) ≡ ( Σ_min^2 d(x, u)^2 + ‖S − Σ‖_F^2 )^{1/2} and d+(x, u) ≡ ( Σ_max^2 d(x, u)^2 + ‖S − Σ‖_F^2 )^{1/2}. [sent-356, score-0.234]
74 Then, ‖grad F(x)‖^2 ≥ C1 n ε Σ_min^4 [ d(x, u) − C2 √r Σ_max ‖Z^E‖_2 / (ε Σ_min^2) ]_+^2,  (12) for all x ∈ M(m, n) ∩ K(4µ0) such that d(x, u) ≤ δ, with probability at least 1 − 1/n^4. [sent-371, score-0.277]
75 Using Eqs. (8) and (9) we get F(x) − F(u) ≥ C1 n ε √α Σ_min^2 ( d(x, u)^2 − δ_{0,−}^2 ),  (14) and F(x) − F(u) ≤ C2 n ε √α Σ_max^2 ( d(x, u)^2 + δ_{0,+}^2 ),  (15) with C1 and C2 different from those in Eqs. [sent-384, score-0.139]
76 By Eq. (13) with large enough C, we have δ_{0,−} ≤ δ/20 and δ_{0,+} ≤ (δ/20)(Σ_min/Σ_max). [sent-416, score-0.234]
77 F(x) ≥ F(u) + C1 n ε √α Σ_min^2 δ^2/400. Hence, for all x_k such that d(x_k, u) ∈ [δ/10, δ], we have F(x_k) ≥ F(x_0). [sent-421, score-0.192]
78 Since the cost function is twice differentiable, and because of the above two claims, the sequence {x_k} converges to Ω = { x ∈ K(4µ0) ∩ M(m, n) : d(x, u) ≤ δ, grad F(x) = 0 }. [sent-423, score-0.138]
79 Together with Eqs. (18) and (16), this implies (1/(n√α)) ‖M − X S Y^T‖_F ≤ C √r Σ_max ‖Z^E‖_2 / (ε Σ_min^2), which finishes the proof of Theorem 1.2. [sent-427, score-0.178]
80 It is therefore sufficient to lower bound the scalar product ⟨grad F, w⟩. [sent-465, score-0.189]
81 As shown in Keshavan et al. (2010), ⟨grad F0(x), w⟩ ≥ C n ε √α Σ_min^2 d(x, u)^2  (23) (see Lemma 9 in the Appendix). [sent-468, score-0.277]
82 Together with Eqs. (21) and (23), this implies ⟨grad F(x), w⟩ ≥ C1 n ε √α Σ_min^2 d(x, u) [ d(x, u) − C2 √r Σ_max ‖Z^E‖_2 / (ε Σ_min^2) ], which implies Eq. (12). [sent-476, score-0.138]
83 Proof of Theorem 1.3 (Independent entries model). We start with the claim that for any sampling set E, we have ‖Z̃^E‖_2 ≤ ‖Z^E‖_2. [sent-480, score-0.141]
84 Recall that, as a result of the trimming step, all the entries in the trimmed rows and columns of Z̃^E are set to zero. [sent-482, score-0.451]
85 From this observation, it follows that x*^T Z̃^E y* = x*^T Z^E y*, since the trimmed matrix Z̃^E and the sample noise matrix Z^E only differ in the trimmed rows and columns. [sent-485, score-0.414]
86 Indeed, for any M and M′, | ‖M′‖_2 − ‖M‖_2 | ≤ ‖M′ − M‖_2 ≤ ‖M′ − M‖_F, where the first inequality follows from the triangle inequality and the second from the fact that ‖·‖_F^2 is the sum of the squared singular values. [sent-501, score-0.253]
87 To this end, we apply the following bound on the expected norm of random matrices with i.i.d. entries, due to Seginer (2000). [sent-516, score-0.146]
88 E‖Z^E‖_2 ≤ C ( E max_{i∈[m]} ‖Z^E_{i•}‖ + E max_{j∈[n]} ‖Z^E_{•j}‖ ),  (26) where Z^E_{i•} and Z^E_{•j} denote the i-th row and the j-th column of Z^E, respectively. [sent-521, score-0.19]
89 E[ max_{j∈[n]} ‖Z^E_{•j}‖^2 ] ≤ β σ^2 ε √α + ∫_0^∞ P( max_{j∈[n]} ‖Z^E_{•j}‖^2 ≥ β σ^2 ε √α + z ) dz.  (27) [sent-523, score-0.19]
90 To bound the second term, we can apply a union bound over each of the n columns, and use the following bound on each column norm ‖Z^E_{•j}‖^2, resulting from concentration of measure inequalities for the i.i.d. entries of Z. [sent-524, score-0.192]
91 Recall that Z^E_{kj} = ξ_{kj} Z_{kj}, where the ξ's are independent Bernoulli random variables such that ξ = 1 with probability ε/√(mn) and zero with probability 1 − ε/√(mn). [sent-530, score-0.218]
92 Since E[ max_j ‖Z^E_{•j}‖ ] ≤ ( E[ max_j ‖Z^E_{•j}‖^2 ] )^{1/2}, applying Eq. [sent-539, score-0.19]
93 Then for any matrix Z from the worst case model, we have ‖Z^E‖_2 ≤ Z_max ‖D^E‖_2, since |x^T Z^E y| ≤ Σ_{i,j} Z_max |x_i| D^E_{ij} |y_j|, which follows from the fact that the Z_ij's are uniformly bounded; here D^E is the adjacency matrix of the corresponding sampling pattern (D^E_{ij} = 1 if (i, j) ∈ E and 0 otherwise). [sent-553, score-0.205]
94 Then, C1 √α Σ_min^2 d(x, u)^2 + C1 √α ‖S0 − Σ‖_F^2 ≤ (1/(nε)) F0(x) ≤ C2 √α Σ_max^2 d(x, u)^2, for all x ∈ M(m, n) ∩ K(4µ0) such that d(x, u) ≤ δ, with probability at least 1 − 1/n^4. [sent-564, score-0.234]
95 Then ‖grad F0(x)‖^2 ≥ C n ε^2 Σ_min^4 d(x, u)^2, for all x ∈ M(m, n) ∩ K(4µ0) such that d(x, u) ≤ δ, with probability at least 1 − 1/n^4. [sent-569, score-0.277]
96 Under the hypothesis of Lemma 8, ⟨grad F0(x), w⟩ ≥ C n ε √α Σ_min^2 d(x, u)^2, for all x ∈ M(m, n) ∩ K(4µ0) such that d(x, u) ≤ δ, with probability at least 1 − 1/n^4. [sent-573, score-0.277]
97 OptSpace: A gradient descent algorithm on the Grassmann manifold for matrix completion. [sent-643, score-0.227]
98 Fixed point and Bregman iterative methods for matrix rank minimization. [sent-665, score-0.162]
99 Guaranteed minimum rank solutions of matrix equations via nuclear norm minimization. [sent-672, score-0.292]
100 An accelerated proximal gradient algorithm for nuclear norm regularized least squares problems. [sent-736, score-0.165]
wordName wordTfidf (topN-words)
[('keshavan', 0.323), ('ze', 0.221), ('mn', 0.218), ('cand', 0.214), ('eshavan', 0.198), ('ontanari', 0.198), ('xsy', 0.198), ('pe', 0.195), ('ntries', 0.18), ('xsqt', 0.18), ('ompletion', 0.154), ('trimming', 0.152), ('pace', 0.149), ('pt', 0.143), ('entries', 0.141), ('min', 0.139), ('oisy', 0.138), ('grad', 0.138), ('zmax', 0.126), ('trimmed', 0.123), ('montanari', 0.111), ('revealed', 0.108), ('ziej', 0.108), ('sy', 0.108), ('singular', 0.106), ('plan', 0.101), ('atrix', 0.101), ('max', 0.095), ('rank', 0.092), ('cmmax', 0.09), ('rmse', 0.089), ('oh', 0.081), ('arxiv', 0.078), ('manifold', 0.075), ('nuclear', 0.075), ('mmax', 0.072), ('zi', 0.07), ('matrix', 0.07), ('theoretic', 0.066), ('noise', 0.063), ('completion', 0.063), ('salakhutdinov', 0.058), ('relaxation', 0.057), ('norm', 0.055), ('optspace', 0.054), ('seginer', 0.054), ('rr', 0.053), ('xk', 0.053), ('zk', 0.052), ('mi', 0.051), ('bound', 0.051), ('lemma', 0.048), ('theorem', 0.048), ('recht', 0.048), ('descent', 0.047), ('ni', 0.047), ('srebro', 0.046), ('log', 0.046), ('achlioptas', 0.046), ('stanford', 0.045), ('root', 0.043), ('yt', 0.043), ('rn', 0.043), ('nr', 0.043), ('pr', 0.043), ('iterations', 0.042), ('cn', 0.042), ('incoherence', 0.042), ('collaborative', 0.041), ('matrices', 0.04), ('inequality', 0.039), ('proof', 0.039), ('fazel', 0.038), ('entry', 0.038), ('numerical', 0.036), ('worst', 0.036), ('moracle', 0.036), ('msdp', 0.036), ('raghunandan', 0.036), ('sewoong', 0.036), ('uik', 0.036), ('constants', 0.036), ('fit', 0.036), ('rows', 0.035), ('gradient', 0.035), ('analogously', 0.034), ('lipschitz', 0.034), ('rm', 0.034), ('square', 0.033), ('stress', 0.032), ('analogous', 0.031), ('proves', 0.031), ('lemmas', 0.031), ('mcsherry', 0.031), ('frieze', 0.031), ('whence', 0.031), ('triangular', 0.03), ('projection', 0.03), ('uniformly', 0.029), ('svd', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 72 jmlr-2010-Matrix Completion from Noisy Entries
Author: Raghunandan H. Keshavan, Andrea Montanari, Sewoong Oh
Abstract: Given a matrix M of low-rank, we consider the problem of reconstructing it from noisy observations of a small, random subset of its entries. The problem arises in a variety of applications, from collaborative filtering (the ‘Netflix problem’) to structure-from-motion and positioning. We study a low complexity algorithm introduced by Keshavan, Montanari, and Oh (2010), based on a combination of spectral techniques and manifold optimization, that we call here O PT S PACE. We prove performance guarantees that are order-optimal in a number of circumstances. Keywords: matrix completion, low-rank matrices, spectral methods, manifold optimization
2 0.1695682 105 jmlr-2010-Spectral Regularization Algorithms for Learning Large Incomplete Matrices
Author: Rahul Mazumder, Trevor Hastie, Robert Tibshirani
Abstract: We use convex relaxation techniques to provide a sequence of regularized low-rank solutions for large-scale matrix completion problems. Using the nuclear norm as a regularizer, we provide a simple and very efficient convex algorithm for minimizing the reconstruction error subject to a bound on the nuclear norm. Our algorithm S OFT-I MPUTE iteratively replaces the missing elements with those obtained from a soft-thresholded SVD. With warm starts this allows us to efficiently compute an entire regularization path of solutions on a grid of values of the regularization parameter. The computationally intensive part of our algorithm is in computing a low-rank SVD of a dense matrix. Exploiting the problem structure, we show that the task can be performed with a complexity of order linear in the matrix dimensions. Our semidefinite-programming algorithm is readily scalable to large matrices; for example S OFT-I MPUTE takes a few hours to compute low-rank approximations of a 106 × 106 incomplete matrix with 107 observed entries, and fits a rank-95 approximation to the full Netflix training set in 3.3 hours. Our methods achieve good training and test errors and exhibit superior timings when compared to other competitive state-of-the-art techniques. Keywords: collaborative filtering, nuclear norm, spectral regularization, netflix prize, large scale convex optimization
3 0.13139808 84 jmlr-2010-On Spectral Learning
Author: Andreas Argyriou, Charles A. Micchelli, Massimiliano Pontil
Abstract: In this paper, we study the problem of learning a matrix W from a set of linear measurements. Our formulation consists in solving an optimization problem which involves regularization with a spectral penalty term. That is, the penalty term is a function of the spectrum of the covariance of W . Instances of this problem in machine learning include multi-task learning, collaborative filtering and multi-view learning, among others. Our goal is to elucidate the form of the optimal solution of spectral learning. The theory of spectral learning relies on the von Neumann characterization of orthogonally invariant norms and their association with symmetric gauge functions. Using this tool we formulate a representer theorem for spectral regularization and specify it to several useful example, such as Schatten p−norms, trace norm and spectral norm, which should proved useful in applications. Keywords: kernel methods, matrix learning, minimal norm interpolation, multi-task learning, orthogonally invariant norms, regularization
4 0.11248519 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming
Author: Ming Yuan
Abstract: This paper considers the problem of estimating a high dimensional inverse covariance matrix that can be well approximated by “sparse” matrices. Taking advantage of the connection between multivariate linear regression and entries of the inverse covariance matrix, we propose an estimating procedure that can effectively exploit such “sparsity”. The proposed method can be computed using linear programming and therefore has the potential to be used in very high dimensional problems. Oracle inequalities are established for the estimation error in terms of several operator norms, showing that the method is adaptive to different types of sparsity of the problem. Keywords: covariance selection, Dantzig selector, Gaussian graphical model, inverse covariance matrix, Lasso, linear programming, oracle inequality, sparsity
Author: Dapo Omidiran, Martin J. Wainwright
Abstract: We consider the problem of high-dimensional variable selection: given n noisy observations of a k-sparse vector β∗ ∈ R p , estimate the subset of non-zero entries of β∗ . A significant body of work has studied behavior of ℓ1 -relaxations when applied to random measurement matrices that are dense (e.g., Gaussian, Bernoulli). In this paper, we analyze sparsified measurement ensembles, and consider the trade-off between measurement sparsity, as measured by the fraction γ of nonzero entries, and the statistical efficiency, as measured by the minimal number of observations n required for correct variable selection with probability converging to one. Our main result is to prove that it is possible to let the fraction on non-zero entries γ → 0 at some rate, yielding measurement matrices with a vanishing fraction of non-zeros per row, while retaining the same statistical efficiency as dense ensembles. A variety of simulation results confirm the sharpness of our theoretical predictions. Keywords: variable selection, sparse random projections, high-dimensional statistics, Lasso, consistency, ℓ1 -regularization
6 0.088903084 27 jmlr-2010-Consistent Nonparametric Tests of Independence
7 0.086302079 82 jmlr-2010-On Learning with Integral Operators
8 0.076270141 12 jmlr-2010-Analysis of Multi-stage Convex Relaxation for Sparse Regularization
9 0.070227638 99 jmlr-2010-Restricted Eigenvalue Properties for Correlated Gaussian Designs
10 0.057240371 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes
11 0.051123925 87 jmlr-2010-Online Learning for Matrix Factorization and Sparse Coding
12 0.050688554 5 jmlr-2010-A Quasi-Newton Approach to Nonsmooth Convex Optimization Problems in Machine Learning
13 0.050586648 97 jmlr-2010-Regret Bounds and Minimax Policies under Partial Monitoring
14 0.047482621 62 jmlr-2010-Learning Gradients: Predictive Models that Infer Geometry and Statistical Dependence
15 0.046673786 65 jmlr-2010-Learning Translation Invariant Kernels for Classification
16 0.045334995 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond
17 0.044258986 3 jmlr-2010-A Fast Hybrid Algorithm for Large-Scalel1-Regularized Logistic Regression
18 0.042976808 55 jmlr-2010-Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance
19 0.041828468 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions
20 0.041245263 92 jmlr-2010-Practical Approaches to Principal Component Analysis in the Presence of Missing Values
topicId topicWeight
[(0, -0.218), (1, -0.151), (2, 0.144), (3, -0.043), (4, -0.171), (5, -0.102), (6, -0.035), (7, -0.104), (8, 0.076), (9, -0.064), (10, -0.112), (11, 0.123), (12, 0.003), (13, 0.046), (14, 0.106), (15, -0.131), (16, -0.047), (17, -0.087), (18, -0.14), (19, 0.142), (20, -0.093), (21, 0.037), (22, 0.227), (23, 0.059), (24, 0.183), (25, -0.208), (26, 0.022), (27, 0.053), (28, 0.002), (29, -0.079), (30, 0.029), (31, -0.008), (32, -0.055), (33, 0.01), (34, 0.064), (35, 0.007), (36, -0.199), (37, -0.014), (38, 0.058), (39, 0.095), (40, -0.037), (41, -0.094), (42, 0.007), (43, 0.037), (44, -0.054), (45, -0.055), (46, 0.139), (47, -0.004), (48, 0.026), (49, 0.043)]
simIndex simValue paperId paperTitle
same-paper 1 0.92195994 72 jmlr-2010-Matrix Completion from Noisy Entries
Author: Raghunandan H. Keshavan, Andrea Montanari, Sewoong Oh
Abstract: Given a matrix M of low-rank, we consider the problem of reconstructing it from noisy observations of a small, random subset of its entries. The problem arises in a variety of applications, from collaborative filtering (the ‘Netflix problem’) to structure-from-motion and positioning. We study a low complexity algorithm introduced by Keshavan, Montanari, and Oh (2010), based on a combination of spectral techniques and manifold optimization, that we call here O PT S PACE. We prove performance guarantees that are order-optimal in a number of circumstances. Keywords: matrix completion, low-rank matrices, spectral methods, manifold optimization
2 0.89172298 105 jmlr-2010-Spectral Regularization Algorithms for Learning Large Incomplete Matrices
Author: Rahul Mazumder, Trevor Hastie, Robert Tibshirani
Abstract: We use convex relaxation techniques to provide a sequence of regularized low-rank solutions for large-scale matrix completion problems. Using the nuclear norm as a regularizer, we provide a simple and very efficient convex algorithm for minimizing the reconstruction error subject to a bound on the nuclear norm. Our algorithm S OFT-I MPUTE iteratively replaces the missing elements with those obtained from a soft-thresholded SVD. With warm starts this allows us to efficiently compute an entire regularization path of solutions on a grid of values of the regularization parameter. The computationally intensive part of our algorithm is in computing a low-rank SVD of a dense matrix. Exploiting the problem structure, we show that the task can be performed with a complexity of order linear in the matrix dimensions. Our semidefinite-programming algorithm is readily scalable to large matrices; for example S OFT-I MPUTE takes a few hours to compute low-rank approximations of a 106 × 106 incomplete matrix with 107 observed entries, and fits a rank-95 approximation to the full Netflix training set in 3.3 hours. Our methods achieve good training and test errors and exhibit superior timings when compared to other competitive state-of-the-art techniques. Keywords: collaborative filtering, nuclear norm, spectral regularization, netflix prize, large scale convex optimization
3 0.6468032 84 jmlr-2010-On Spectral Learning
Author: Andreas Argyriou, Charles A. Micchelli, Massimiliano Pontil
Abstract: In this paper, we study the problem of learning a matrix W from a set of linear measurements. Our formulation consists in solving an optimization problem which involves regularization with a spectral penalty term. That is, the penalty term is a function of the spectrum of the covariance of W . Instances of this problem in machine learning include multi-task learning, collaborative filtering and multi-view learning, among others. Our goal is to elucidate the form of the optimal solution of spectral learning. The theory of spectral learning relies on the von Neumann characterization of orthogonally invariant norms and their association with symmetric gauge functions. Using this tool we formulate a representer theorem for spectral regularization and specify it to several useful example, such as Schatten p−norms, trace norm and spectral norm, which should proved useful in applications. Keywords: kernel methods, matrix learning, minimal norm interpolation, multi-task learning, orthogonally invariant norms, regularization
4 0.41730785 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming
Author: Ming Yuan
Abstract: This paper considers the problem of estimating a high dimensional inverse covariance matrix that can be well approximated by “sparse” matrices. Taking advantage of the connection between multivariate linear regression and entries of the inverse covariance matrix, we propose an estimating procedure that can effectively exploit such “sparsity”. The proposed method can be computed using linear programming and therefore has the potential to be used in very high dimensional problems. Oracle inequalities are established for the estimation error in terms of several operator norms, showing that the method is adaptive to different types of sparsity of the problem. Keywords: covariance selection, Dantzig selector, Gaussian graphical model, inverse covariance matrix, Lasso, linear programming, oracle inequality, sparsity
5 0.40421787 82 jmlr-2010-On Learning with Integral Operators
Author: Lorenzo Rosasco, Mikhail Belkin, Ernesto De Vito
Abstract: A large number of learning algorithms, for example, spectral clustering, kernel Principal Components Analysis and many manifold methods are based on estimating eigenvalues and eigenfunctions of operators defined by a similarity function or a kernel, given empirical data. Thus for the analysis of algorithms, it is an important problem to be able to assess the quality of such approximations. The contribution of our paper is two-fold: 1. We use a technique based on a concentration inequality for Hilbert spaces to provide new much simplified proofs for a number of results in spectral approximation. 2. Using these methods we provide several new results for estimating spectral properties of the graph Laplacian operator extending and strengthening results from von Luxburg et al. (2008). Keywords: spectral convergence, empirical operators, learning integral operators, perturbation methods
7 0.31571192 99 jmlr-2010-Restricted Eigenvalue Properties for Correlated Gaussian Designs
8 0.30986947 27 jmlr-2010-Consistent Nonparametric Tests of Independence
9 0.29852888 12 jmlr-2010-Analysis of Multi-stage Convex Relaxation for Sparse Regularization
10 0.27930978 97 jmlr-2010-Regret Bounds and Minimax Policies under Partial Monitoring
11 0.24863143 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes
12 0.2413002 3 jmlr-2010-A Fast Hybrid Algorithm for Large-Scalel1-Regularized Logistic Regression
13 0.23424013 87 jmlr-2010-Online Learning for Matrix Factorization and Sparse Coding
14 0.23313877 43 jmlr-2010-Generalized Power Method for Sparse Principal Component Analysis
15 0.223066 62 jmlr-2010-Learning Gradients: Predictive Models that Infer Geometry and Statistical Dependence
16 0.22002113 2 jmlr-2010-A Convergent Online Single Time Scale Actor Critic Algorithm
17 0.21843247 20 jmlr-2010-Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary β-Mixing Processes
18 0.21556062 96 jmlr-2010-Rate Minimaxity of the Lasso and Dantzig Selector for thelqLoss inlrBalls
19 0.21224818 92 jmlr-2010-Practical Approaches to Principal Component Analysis in the Presence of Missing Values
20 0.21124648 65 jmlr-2010-Learning Translation Invariant Kernels for Classification
topicId topicWeight
[(3, 0.547), (8, 0.019), (15, 0.01), (21, 0.017), (32, 0.047), (36, 0.02), (37, 0.042), (51, 0.016), (75, 0.133), (81, 0.017), (85, 0.038)]
simIndex simValue paperId paperTitle
1 0.89142227 47 jmlr-2010-Hilbert Space Embeddings and Metrics on Probability Measures
Author: Bharath K. Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Schölkopf, Gert R.G. Lanckriet
Abstract: A Hilbert space embedding for probability measures has recently been proposed, with applications including dimensionality reduction, homogeneity testing, and independence testing. This embedding represents any probability measure as a mean element in a reproducing kernel Hilbert space (RKHS). A pseudometric on the space of probability measures can be defined as the distance between distribution embeddings: we denote this as γk , indexed by the kernel function k that defines the inner product in the RKHS. We present three theoretical properties of γk . First, we consider the question of determining the conditions on the kernel k for which γk is a metric: such k are denoted characteristic kernels. Unlike pseudometrics, a metric is zero only when two distributions coincide, thus ensuring the RKHS embedding maps all distributions uniquely (i.e., the embedding is injective). While previously published conditions may apply only in restricted circumstances (e.g., on compact domains), and are difficult to check, our conditions are straightforward and intuitive: integrally strictly positive definite kernels are characteristic. Alternatively, if a bounded continuous kernel is translation-invariant on Rd , then it is characteristic if and only if the support of its Fourier transform is the entire Rd . Second, we show that the distance between distributions under γk results from an interplay between the properties of the kernel and the distributions, by demonstrating that distributions are close in the embedding space when their differences occur at higher frequencies. Third, to understand the ∗. Also at Carnegie Mellon University, Pittsburgh, PA 15213, USA. c 2010 Bharath K. Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Sch¨ lkopf and Gert R. G. Lanckriet. o ¨ S RIPERUMBUDUR , G RETTON , F UKUMIZU , S CH OLKOPF AND L ANCKRIET nature of the topology induced by γk , we relate γk to other popular metrics on probability measures, and present conditions on the kernel k und
same-paper 2 0.80486804 72 jmlr-2010-Matrix Completion from Noisy Entries
Author: Raghunandan H. Keshavan, Andrea Montanari, Sewoong Oh
Abstract: Given a matrix M of low-rank, we consider the problem of reconstructing it from noisy observations of a small, random subset of its entries. The problem arises in a variety of applications, from collaborative filtering (the ‘Netflix problem’) to structure-from-motion and positioning. We study a low complexity algorithm introduced by Keshavan, Montanari, and Oh (2010), based on a combination of spectral techniques and manifold optimization, that we call here O PT S PACE. We prove performance guarantees that are order-optimal in a number of circumstances. Keywords: matrix completion, low-rank matrices, spectral methods, manifold optimization
3 0.48729491 27 jmlr-2010-Consistent Nonparametric Tests of Independence
Author: Arthur Gretton, László Györfi
Abstract: Three simple and explicit procedures for testing the independence of two multi-dimensional random variables are described. Two of the associated test statistics (L1 , log-likelihood) are defined when the empirical distribution of the variables is restricted to finite partitions. A third test statistic is defined as a kernel-based independence measure. Two kinds of tests are provided. Distributionfree strong consistent tests are derived on the basis of large deviation bounds on the test statistics: these tests make almost surely no Type I or Type II error after a random sample size. Asymptotically α-level tests are obtained from the limiting distribution of the test statistics. For the latter tests, the Type I error converges to a fixed non-zero value α, and the Type II error drops to zero, for increasing sample size. All tests reject the null hypothesis of independence if the test statistics become large. The performance of the tests is evaluated experimentally on benchmark data. Keywords: hypothesis test, independence, L1, log-likelihood, kernel methods, distribution-free consistent test
4 0.48016688 84 jmlr-2010-On Spectral Learning
Author: Andreas Argyriou, Charles A. Micchelli, Massimiliano Pontil
Abstract: In this paper, we study the problem of learning a matrix W from a set of linear measurements. Our formulation consists in solving an optimization problem which involves regularization with a spectral penalty term. That is, the penalty term is a function of the spectrum of the covariance of W . Instances of this problem in machine learning include multi-task learning, collaborative filtering and multi-view learning, among others. Our goal is to elucidate the form of the optimal solution of spectral learning. The theory of spectral learning relies on the von Neumann characterization of orthogonally invariant norms and their association with symmetric gauge functions. Using this tool we formulate a representer theorem for spectral regularization and specify it to several useful example, such as Schatten p−norms, trace norm and spectral norm, which should proved useful in applications. Keywords: kernel methods, matrix learning, minimal norm interpolation, multi-task learning, orthogonally invariant norms, regularization
5 0.47784224 20 jmlr-2010-Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary β-Mixing Processes
Author: Liva Ralaivola, Marie Szafranski, Guillaume Stempfel
Abstract: PAC-Bayes bounds are among the most accurate generalization bounds for classifiers learned from independently and identically distributed (IID) data, and it is particularly so for margin classifiers: there have been recent contributions showing how practical these bounds can be either to perform model selection (Ambroladze et al., 2007) or even to directly guide the learning of linear classifiers (Germain et al., 2009). However, there are many practical situations where the training data show some dependencies and where the traditional IID assumption does not hold. Stating generalization bounds for such frameworks is therefore of the utmost interest, both from theoretical and practical standpoints. In this work, we propose the first—to the best of our knowledge—PAC-Bayes generalization bounds for classifiers trained on data exhibiting interdependencies. The approach undertaken to establish our results is based on the decomposition of a so-called dependency graph that encodes the dependencies within the data, in sets of independent data, thanks to graph fractional covers. Our bounds are very general, since being able to find an upper bound on the fractional chromatic number of the dependency graph is sufficient to get new PAC-Bayes bounds for specific settings. We show how our results can be used to derive bounds for ranking statistics (such as AUC) and classifiers trained on data distributed according to a stationary β-mixing process. In the way, we show how our approach seamlessly allows us to deal with U-processes. As a side note, we also provide a PAC-Bayes generalization bound for classifiers learned on data from stationary ϕ-mixing distributions. Keywords: PAC-Bayes bounds, non IID data, ranking, U-statistics, mixing processes c 2010 Liva Ralaivola, Marie Szafranski and Guillaume Stempfel. R ALAIVOLA , S ZAFRANSKI AND S TEMPFEL
6 0.47628227 105 jmlr-2010-Spectral Regularization Algorithms for Learning Large Incomplete Matrices
7 0.47146821 82 jmlr-2010-On Learning with Integral Operators
8 0.46456 97 jmlr-2010-Regret Bounds and Minimax Policies under Partial Monitoring
9 0.45803165 106 jmlr-2010-Stability Bounds for Stationary φ-mixing and β-mixing Processes
10 0.44353998 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming
11 0.42773855 96 jmlr-2010-Rate Minimaxity of the Lasso and Dantzig Selector for thelqLoss inlrBalls
12 0.42108393 2 jmlr-2010-A Convergent Online Single Time Scale Actor Critic Algorithm
13 0.41338947 60 jmlr-2010-Learnability, Stability and Uniform Convergence
14 0.40757242 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning
15 0.40731481 19 jmlr-2010-Characterization, Stability and Convergence of Hierarchical Clustering Methods
16 0.40289754 25 jmlr-2010-Composite Binary Losses
17 0.39404744 108 jmlr-2010-Stochastic Complexity and Generalization Error of a Restricted Boltzmann Machine in Bayesian Estimation
18 0.3882736 86 jmlr-2010-On the Rate of Convergence of the Bagged Nearest Neighbor Estimate
19 0.38137105 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond
20 0.38023421 102 jmlr-2010-Semi-Supervised Novelty Detection