nips nips2011 nips2011-230 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Nicolas Boumal, Pierre-antoine Absil
Abstract: We consider large matrices of low rank. We address the problem of recovering such matrices when most of the entries are unknown. Matrix completion finds applications in recommender systems. In this setting, the rows of the matrix may correspond to items and the columns may correspond to users. The known entries are the ratings given by users to some items. The aim is to predict the unobserved ratings. This problem is commonly stated in a constrained optimization framework. We follow an approach that exploits the geometry of the low-rank constraint to recast the problem as an unconstrained optimization problem on the Grassmann manifold. We then apply first- and second-order Riemannian trust-region methods to solve it. The cost of each iteration is linear in the number of known entries. Our methods, RTRMC 1 and 2, outperform state-of-the-art algorithms on a wide range of problem instances. 1
Reference: text
sentIndex sentText sentNum sentScore
1 RTRMC: A Riemannian trust-region method for low-rank matrix completion Nicolas Boumal∗ ICTEAM Institute Université catholique de Louvain B-1348 Louvain-la-Neuve nicolas. [sent-1, score-0.179]
2 Abstract We consider large matrices of low rank. [sent-8, score-0.055]
3 We address the problem of recovering such matrices when most of the entries are unknown. [sent-9, score-0.125]
4 In this setting, the rows of the matrix may correspond to items and the columns may correspond to users. [sent-11, score-0.064]
5 The known entries are the ratings given by users to some items. [sent-12, score-0.149]
6 We follow an approach that exploits the geometry of the low-rank constraint to recast the problem as an unconstrained optimization problem on the Grassmann manifold. [sent-15, score-0.065]
7 1 Introduction We address the problem of recovering a low-rank m-by-n matrix X of which a few entries are observed, possibly with noise. [sent-19, score-0.134]
8 Let Ω ⊂ {1, . . . , m} × {1, . . . , n} be the set of indices of the observed entries of X, i. [sent-26, score-0.07]
9 Solving this problem is notably useful in recommender systems, where one tries to predict the ratings users would give to items they have not purchased. [sent-29, score-0.109]
10 1 Related work In the noiseless case, one could state the minimum rank matrix recovery problem as follows: $\min_{\hat X} \operatorname{rank}(\hat X)$, such that $\hat X_{ij} = X_{ij}$ $\forall (i, j) \in \Omega$. (1) [sent-31, score-0.231]
11 A possible convex relaxation of (1) introduced by Candès and Recht [CR09] is to use the nuclear norm of $\hat X$ as objective function, i. [sent-33, score-0.048]
12 The SVT method [CCS08] attempts to solve such a convex problem using tools from compressed sensing and the ADMiRA method [LB10] does so using matching pursuit-like techniques. [sent-36, score-0.048]
13 Alternatively, one may minimize the discrepancy between $\hat X$ and $X$ at the entries in $\Omega$ under the constraint that $\operatorname{rank}(\hat X) \le r$ for some small constant $r$. [sent-37, score-0.07]
14 Since any matrix $\hat X$ of rank at most $r$ may be written in the form $UW$ with $U \in \mathbb{R}^{m \times r}$ and $W \in \mathbb{R}^{r \times n}$, a reasonable formulation of the problem reads: $\min_{U \in \mathbb{R}^{m \times r},\, W \in \mathbb{R}^{r \times n}} \sum_{(i,j) \in \Omega} \bigl((UW)_{ij} - X_{ij}\bigr)^2$. (2) [sent-38, score-0.159]
15 One drawback of the latter formulation is that the factorization of a matrix $\hat X$ into the product $UW$ is not unique. [sent-43, score-0.105]
16 Indeed, for any r-by-r invertible matrix $M$, we have $UW = (UM)(M^{-1}W)$. [sent-44, score-0.064]
17 All the matrices U M share the same column space. [sent-45, score-0.055]
18 Dai et al. [DMK11, DKM10] exploit this to recast (2) on the Grassmann manifold G(m, r), i. [sent-48, score-0.124]
19 e., the set of r-dimensional vector subspaces of $\mathbb{R}^m$ (see Section 2): $\min_{\mathcal{U} \in G(m,r)} \min_{W \in \mathbb{R}^{r \times n}} \sum_{(i,j) \in \Omega} \bigl((UW)_{ij} - X_{ij}\bigr)^2$, (3) where $U \in \mathbb{R}^{m \times r}$ is any matrix such that $\operatorname{col}(U) = \mathcal{U}$ and is often chosen to be orthonormal. [sent-50, score-0.142]
20 Unfortunately, the objective function of the outer minimization in (3) may be discontinuous at points $\mathcal{U}$ for which the least-squares problem in W does not have a unique solution. [sent-51, score-0.075]
21 Keshavan et al. [KO09, KM10] state the problem on the Grassmannian too, but propose to simultaneously optimize on the row and column spaces, yielding a smaller least-squares problem which is unlikely to lack a unique solution, resulting in a smooth objective function. [sent-56, score-0.048]
22 In one of their recent papers [KM10], they solve: $\min_{\mathcal{U} \in G(m,r),\, \mathcal{V} \in G(n,r)} \min_{S \in \mathbb{R}^{r \times r}} \sum_{(i,j) \in \Omega} \bigl((USV^\top)_{ij} - X_{ij}\bigr)^2 + \lambda^2 \|USV^\top\|_F^2$, (4) where $U$ and $V$ are any orthonormal bases of $\mathcal{U}$ and $\mathcal{V}$, respectively, and $\lambda$ is a regularization parameter. [sent-57, score-0.108]
23 The authors propose an efficient SVD-based initial guess for U and V which they refine using a steepest descent method, along with strong theoretical guarantees. [sent-58, score-0.067]
24 Balzano et al. [BNR10] introduced GROUSE for subspace identification on the Grassmannian, applicable to matrix completion. [sent-63, score-0.064]
25 Finally, in the preprint [Van11] which became public while we were preparing the camera-ready version of this paper, Vandereycken proposes an approach based on the submanifold geometry of the sets of fixed-rank matrices. [sent-64, score-0.072]
26 Dai et al.'s initial formulation (3) has a discontinuous objective function on the Grassmannian. [sent-67, score-0.075]
27 Furthermore, the OptSpace regularization term is efficiently computable since $\|USV^\top\|_F = \|S\|_F$, but it penalizes all entries instead of just the entries $(i, j) \notin \Omega$. [sent-69, score-0.14]
28 In an effort to combine the best of both worlds, we equip (3) with a regularization term weighted by $\lambda > 0$, which yields a smooth objective function defined over an appropriate search space: $\min_{\mathcal{U} \in G(m,r)} \min_{W \in \mathbb{R}^{r \times n}} \frac{1}{2} \sum_{(i,j) \in \Omega} C_{ij}^2 \bigl((UW)_{ij} - X_{ij}\bigr)^2 + \frac{\lambda^2}{2} \sum_{(i,j) \notin \Omega} (UW)_{ij}^2$. (5) [sent-70, score-0.094]
29 Here, we introduced a confidence index $C_{ij} > 0$ for each observation $X_{ij}$, which may be useful in applications. [sent-71, score-0.038]
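As a point of reference, here is a minimal NumPy sketch of evaluating the cost (5) for given factors. The function and variable names (cost_naive, I, J, X_obs, C_obs, lam) are ours, not the paper's, and the full m-by-n product is formed only for readability, not efficiency; the efficient evaluation is discussed in Section 3.

```python
import numpy as np

def cost_naive(U, W, I, J, X_obs, C_obs, lam):
    """Reference evaluation of the regularized cost (5).

    U: (m, r) orthonormal basis, W: (r, n) coefficients.
    I, J: index arrays of the observed entries (the set Omega).
    X_obs, C_obs: observed values and confidence weights at those entries.
    lam: weight on the unobserved entries. Forms the full product UW,
    so this is only meant for small problems and testing.
    """
    UW = U @ W
    mask = np.zeros(UW.shape, dtype=bool)
    mask[I, J] = True                                   # True exactly on Omega
    fit = 0.5 * np.sum((C_obs * (UW[I, J] - X_obs)) ** 2)
    reg = 0.5 * lam ** 2 * np.sum(UW[~mask] ** 2)
    return fit + reg

# toy usage on a synthetic rank-3 matrix
rng = np.random.default_rng(0)
m, n, r = 20, 30, 3
U = np.linalg.qr(rng.standard_normal((m, r)))[0]
W = rng.standard_normal((r, n))
idx = rng.choice(m * n, size=150, replace=False)
I, J = np.unravel_index(idx, (m, n))
X_obs = (U @ W)[I, J] + 0.01 * rng.standard_normal(I.size)
print(cost_naive(U, W, I, J, X_obs, np.ones(I.size), lam=1e-3))
```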
30 As we will see, introducing a regularization term is essential to ensure smoothness of the objective and hence obtain good convergence properties. [sent-72, score-0.048]
31 GenRTR is readily available as a free Matlab package and comes with strong convergence results that are naturally inherited by our algorithms. [sent-75, score-0.063]
32 In Section 3, we derive expressions for the gradient and the Hessian of our objective function while paying special attention to complexity. [sent-77, score-0.048]
33 2 Geometry of the Grassmann manifold Our objective function f (10) is defined over the Grassmann manifold G(m, r), i. [sent-80, score-0.236]
34 Each point $\mathcal{U} \in G(m, r)$ is a vector subspace we may represent numerically as the column space of a full-rank matrix $U \in \mathbb{R}^{m \times r}$: $\mathcal{U} = \operatorname{col}(U)$. [sent-86, score-0.064]
35 For numerical reasons, we will only use orthonormal matrices $U \in U(m, r) = \{U \in \mathbb{R}^{m \times r} : U^\top U = I_r\}$. [sent-87, score-0.147]
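For illustration, one standard way to obtain such an orthonormal representative numerically is a thin QR factorization; this is a generic sketch, not code from the paper.

```python
import numpy as np

def orth_basis(A):
    """Orthonormal basis of col(A) via a thin QR factorization."""
    Q, _ = np.linalg.qr(A)
    return Q

rng = np.random.default_rng(1)
U = orth_basis(rng.standard_normal((10, 3)))
print(np.allclose(U.T @ U, np.eye(3)))   # U lies in U(m, r): U^T U = I_r
```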
36 The Grassmannian is a Riemannian manifold, and as such we can define a tangent space to G(m, r) at each point $\mathcal{U}$, denoted $T_{\mathcal{U}} G(m, r)$. [sent-89, score-0.119]
37 A tangent vector $\mathcal{H} \in T_{\mathcal{U}} G(m, r)$, where we represent $\mathcal{U}$ as the orthonormal matrix $U$, is represented by a unique matrix $H \in \mathbb{R}^{m \times r}$ verifying $U^\top H = 0$ and $\frac{\mathrm{d}}{\mathrm{d}t} \operatorname{col}(U + tH)\big|_{t=0} = \mathcal{H}$. [sent-91, score-0.309]
38 Each tangent space is endowed with an inner product, the Riemannian metric, that varies smoothly from point to point. [sent-93, score-0.119]
39 It is inherited from the embedding space of the matrix representation of tangent vectors, $\mathbb{R}^{m \times r}$: $\forall H_1, H_2 \in T_{\mathcal{U}} G(m, r)$: $\langle H_1, H_2 \rangle_{\mathcal{U}} = \operatorname{Trace}(H_2^\top H_1)$. [sent-94, score-0.211]
40 The orthogonal projector from $\mathbb{R}^{m \times r}$ onto the tangent space $T_{\mathcal{U}} G(m, r)$ is given by: $P_U : \mathbb{R}^{m \times r} \to T_{\mathcal{U}} G(m, r) : H \mapsto P_U H = (I - UU^\top)H$. [sent-95, score-0.119]
41 One can also project a vector onto the tangent space of the Stiefel manifold: $P_U^{St} : \mathbb{R}^{m \times r} \to T_U U(m, r) : H \mapsto P_U^{St} H = (I - UU^\top)H + U \operatorname{skew}(U^\top H)$, where $\operatorname{skew}(X) = (X - X^\top)/2$ extracts the skew-symmetric part of X. [sent-96, score-0.119]
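These two projectors are straightforward to implement; the following sketch (our names, not the paper's code) follows the formulas above directly.

```python
import numpy as np

def proj_grassmann(U, H):
    """Orthogonal projector onto T_U G(m, r): H -> (I - U U^T) H."""
    return H - U @ (U.T @ H)

def proj_stiefel(U, H):
    """Projector onto the Stiefel tangent space:
    H -> (I - U U^T) H + U skew(U^T H)."""
    UtH = U.T @ H
    return H - U @ UtH + U @ ((UtH - UtH.T) / 2)

rng = np.random.default_rng(2)
U, _ = np.linalg.qr(rng.standard_normal((8, 2)))
H = rng.standard_normal((8, 2))
print(np.allclose(U.T @ proj_grassmann(U, H), 0))   # horizontal: U^T H = 0
```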
42 This is useful for the computation of $\operatorname{grad} f(\mathcal{U}) \in T_{\mathcal{U}} G(m, r)$. [sent-97, score-0.176]
43 39)], considering $\bar f : \mathbb{R}^{m \times r} \to \mathbb{R}$, its restriction $\bar f|_{U(m,r)}$ to the Stiefel manifold and $f : G(m, r) \to \mathbb{R}$ such that $f(\operatorname{col}(U)) = \bar f|_{U(m,r)}(U)$ is well-defined, as will be the case in Section 3, we have (with a slight abuse of notation): $\operatorname{grad} f(U) = \operatorname{grad} \bar f|_{U(m,r)}(U) = P_U^{St} \operatorname{grad} \bar f(U)$. [sent-101, score-0.498]
44 18)]: $\operatorname{Hess} f(\mathcal{U})[H] = P_U\bigl(\mathrm{D}\bigl(U \mapsto P_U^{St} \operatorname{grad} \bar f(U)\bigr)(U)[H]\bigr)$, (7) where $\mathrm{D}g(X)[H]$ is the directional derivative of g at X along H, in the classical sense. [sent-105, score-0.201]
45 For our optimization algorithms, it is important to be able to move along the manifold from some initial point U in some prescribed direction specified by a tangent vector H. [sent-106, score-0.266]
46 3 Computation of the objective function and its derivatives We seek an m-by-n matrix $\hat X$ of rank not more than r such that $\hat X$ is as close as possible to a given matrix X at the entries in the observation set Ω. [sent-108, score-0.318]
47 Furthermore, we are given a weight matrix C ∈ Rm×n indicating the confidence we have in each observed entry of X. [sent-109, score-0.064]
48 The matrix C is positive at entries in Ω and zero elsewhere. [sent-110, score-0.134]
49 Picking a small but positive λ will ensure that the objective function f (10) is smooth. [sent-112, score-0.048]
50 For a fixed U, computing the matrix W that minimizes $\hat f$ is a least-squares problem. [sent-113, score-0.064]
51 By virtue of the discussion in Section 1, we know that the mapping $U \mapsto \hat f(U, W_U)$, with $U \in \mathbb{R}^{m \times r}$, is constant over sets of full-rank matrices U spanning the same column space. [sent-116, score-0.055]
52 Hence, considering these sets as equivalence classes $\mathcal{U}$, the following function f over the Grassmann manifold is well-defined: $f : G(m, r) \to \mathbb{R} : \mathcal{U} \mapsto f(\mathcal{U}) = \hat f(U, W_U)$, (10) with any full-rank $U \in \mathbb{R}^{m \times r}$ such that $\operatorname{col}(U) = \mathcal{U}$. [sent-117, score-0.094]
53 The interpretation is as follows: we are looking for an optimal matrix $\hat X = UW$ of rank at most r; we have confidence $C_{ij}$ that $\hat X_{ij}$ should equal $X_{ij}$ for $(i, j) \in \Omega$, and (very small) confidence $\lambda$ that $\hat X_{ij}$ should equal 0 for $(i, j) \notin \Omega$. [sent-118, score-0.136]
54 3.1 Rearranging the objective Considering (9), it looks like evaluating $\hat f(U, W)$ will require the computation of the product UW at the entries in $\bar\Omega$, i. [sent-120, score-0.118]
55 e., we would need to compute the whole matrix UW, which cannot cost much less than O(mnr). [sent-122, score-0.064]
56 Alternatively, if we restrict ourselves, without loss of generality, to orthonormal matrices U, we observe that $\|UW\|_\Omega^2 + \|UW\|_{\bar\Omega}^2 = \|UW\|_F^2 = \|W\|_F^2$. (11) [sent-124, score-0.117]
57 This only requires the computation of UW at entries in $\Omega$, at a cost of $O(|\Omega| r)$. [sent-126, score-0.07]
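To make the gain concrete, here is a hedged NumPy sketch of evaluating the cost in O(|Ω| r) time, under our reading of (5) and (11); the names are ours, and we assume U is orthonormal and each observed entry appears once in the index arrays.

```python
import numpy as np

def uw_at_omega(U, W, I, J):
    """Entries (UW)_{ij} for (i, j) in Omega only, in O(|Omega| r) time."""
    return np.einsum('kr,rk->k', U[I, :], W[:, J])

def cost_sparse(U, W, I, J, X_obs, C_obs, lam):
    """Cost (5) evaluated without forming UW, using (11) and U^T U = I."""
    p = uw_at_omega(U, W, I, J)                      # predictions on Omega
    fit = 0.5 * np.sum((C_obs * (p - X_obs)) ** 2)
    # ||UW||^2 over the complement of Omega = ||W||_F^2 - ||UW||_Omega^2
    reg = 0.5 * lam ** 2 * (np.sum(W ** 2) - np.sum(p ** 2))
    return fit + reg
```

On the toy data from the earlier sketch, this agrees with cost_naive up to rounding while never forming the full product UW.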
58 3.2 Gradient and Hessian of the objective We now derive formulas for the first- and second-order derivatives of f. [sent-129, score-0.098]
59 In deriving these formulas, it is useful to note that, for a suitably smooth mapping g, $\operatorname{grad}\bigl(X \mapsto \tfrac{1}{2}\|g(X)\|_F^2\bigr)(X) = \mathrm{D}g(X)^*[g(X)]$, (12) where $\mathrm{D}g(X)^*$ is the adjoint of the differential of g at X. [sent-130, score-0.052]
60 For ease of notation, let us define the following m-by-n matrix with the sparsity structure induced by Ω: $\hat C_{ij} = C_{ij}^2 - \lambda^2$ if $(i, j) \in \Omega$, and $\hat C_{ij} = 0$ otherwise. (13) [sent-131, score-0.064]
61 We also introduce a sparse residue matrix $R_U$ that will come up in various formulas: $R_U = \hat C \odot (U W_U - X_\Omega) - \lambda^2 X_\Omega$, (14) where $\odot$ denotes the entrywise product and $X_\Omega$ agrees with X on Ω and is zero elsewhere. [sent-132, score-0.064]
62 Successively using the chain rule, the optimality of $W_U$ and (12), we obtain: $\operatorname{grad} \bar f(U) = \frac{\mathrm{d}}{\mathrm{d}U} \check f(U, W_U) = \frac{\partial}{\partial U} \check f(U, W_U) + \frac{\partial}{\partial W} \check f(U, W_U) \cdot \frac{\mathrm{d}W_U}{\mathrm{d}U} = \frac{\partial}{\partial U} \check f(U, W_U) = R_U W_U^\top$. (15) [sent-133, score-0.176]
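Taken together, (13)-(15) give the Euclidean gradient at a cost linear in |Ω|; below is a hedged NumPy sketch of that computation (names are ours, and W is assumed to be the optimal W_U for the current U). The Riemannian gradient is then obtained by applying the Stiefel projector of Section 2 to the result, per the identity recalled above.

```python
import numpy as np

def euclidean_grad(U, W, I, J, X_obs, C_obs, lam):
    """grad f-bar(U) = R_U W_U^T, with R_U supported on Omega only, per (14)-(15).

    W is assumed to be the optimal W_U for this U. The cost is O(|Omega| r)
    for the residue plus the accumulation into an m-by-r matrix.
    """
    p = np.einsum('kr,rk->k', U[I, :], W[:, J])            # (UW)_ij on Omega
    r_vals = (C_obs**2 - lam**2) * (p - X_obs) - lam**2 * X_obs
    G = np.zeros_like(U)
    # accumulate (R_U W^T)[i, :] = sum over observed j of R_U[i, j] * W[:, j]
    np.add.at(G, I, r_vals[:, None] * W[:, J].T)
    return G
```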
63 We denote by H a matrix representation of the tangent vector $\mathcal{H}$, chosen in accordance with U, and by $W_{U,H} = \mathrm{D}(U \mapsto W_U)(U)[H]$ the derivative of the mapping $U \mapsto W_U$ at U along the tangent direction H. [sent-136, score-0.327]
64 We assume $U \in U(m, r)$ since we use orthonormal matrices to represent points on the Grassmannian, and $U^\top H = 0$ since H is a tangent vector at U. [sent-140, score-0.236]
65 We use the vectorization operator, vec, that transforms matrices into vectors by stacking their columns—in Matlab notation, vec(A) = A(:). [sent-141, score-0.055]
66 Denoting the Kronecker product of two matrices by ⊗, we will use the well-known identity for matrices A, Y, B of appropriate sizes [Bro05]: $\operatorname{vec}(AYB) = (B^\top \otimes A)\operatorname{vec}(Y)$. [sent-142, score-0.11]
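A quick numerical check of this identity, with column-major vec as in Matlab's A(:); this is only a sanity-check snippet, not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 3))
Y = rng.standard_normal((3, 5))
B = rng.standard_normal((5, 2))
vec = lambda M: M.reshape(-1, order='F')       # stack columns, like A(:)
lhs = vec(A @ Y @ B)
rhs = np.kron(B.T, A) @ vec(Y)
print(np.allclose(lhs, rhs))                   # vec(A Y B) = (B^T kron A) vec(Y)
```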
67 We also write IΩ for the orthonormal |Ω|-by-mn matrix such that vecΩ (M ) = IΩ vec(M ) is a vector of length |Ω| corresponding to the entries Mij for (i, j) ∈ Ω, taken in order from vec(M ). [sent-143, score-0.196]
68 Computing $W_U$ comes down to minimizing the least-squares objective $\check f(U, W)$ (11) with respect to W. [sent-144, score-0.048]
69 (17) Observe that the matrix A is block-diagonal, with n symmetric blocks of size r. [sent-153, score-0.064]
70 To solve systems in A, we compute the Cholesky factorization of each block, at a total cost of $O(nr^3)$. [sent-156, score-0.066]
71 Collecting all equations in this subsection, we obtain a closed-form formula for $W_U$: $\operatorname{vec}(W_U) = A^{-1} \operatorname{vec}\bigl(U^\top [C^{(2)} \odot X_\Omega]\bigr)$, (18) where A is a function of U. [sent-158, score-0.384]
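A hedged sketch of this least-squares solve follows. We read $C^{(2)}$ as the entrywise square of C (an assumption on the notation) and derive the r-by-r blocks directly from (5) under that reading; function and variable names are ours, and the loop over columns is kept explicit for clarity.

```python
import numpy as np

def solve_W(U, I, J, X_obs, C_obs, lam, n):
    """Closed-form W_U via one small SPD system per column of W.

    For column j with observed rows I_j:
      A_j = lam^2 I_r + U[I_j]^T diag(C^2 - lam^2) U[I_j]
      b_j = U[I_j]^T (C^2 * X)_j
    Each block is solved with a Cholesky factorization, O(n r^3) in total.
    """
    m, r = U.shape
    W = np.zeros((r, n))
    order = np.argsort(J)                     # group the observations by column
    I, J, X_obs, C_obs = I[order], J[order], X_obs[order], C_obs[order]
    starts = np.searchsorted(J, np.arange(n))
    ends = np.searchsorted(J, np.arange(n), side='right')
    for j in range(n):
        rows = I[starts[j]:ends[j]]
        Uj = U[rows, :]
        c2 = C_obs[starts[j]:ends[j]] ** 2
        A_j = lam**2 * np.eye(r) + Uj.T @ ((c2 - lam**2)[:, None] * Uj)
        b_j = Uj.T @ (c2 * X_obs[starts[j]:ends[j]])
        L = np.linalg.cholesky(A_j)           # SPD whenever lam > 0 and U has full rank
        W[:, j] = np.linalg.solve(L.T, np.linalg.solve(L, b_j))
    return W
```

Feeding the result to the gradient sketch above closes the loop: given U, solve for W_U, form the residue, and read off the gradient.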
72 Using bilinearity and associativity of ⊗ as well as the formula $\mathrm{D}(Y \mapsto Y^{-1})(X)[H] = -X^{-1} H X^{-1}$ [Bro05], some algebra yields: $\operatorname{vec}(W_{U,H}) = -A^{-1} \operatorname{vec}\bigl(H^\top R_U + U^\top [\hat C \odot (H W_U)]\bigr)$. [sent-160, score-0.384]
73 Table 1: All complexities are essentially linear in |Ω|, the number of observed entries. [sent-173, score-0.048]
74 At the current iterate U = col(U ), the RTR method uses the retraction RU (8) to build a quadratic model mU : TU G(m, r) → R of the lifted objective function f ◦ RU (lift). [sent-182, score-0.087]
75 It then classically minimizes the model inside a trust region on this vector space (solve), and retracts the resulting tangent vector H to a candidate U + = RU (H) on the Grassmannian (retract). [sent-183, score-0.171]
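Equation (8) is not reproduced in the fragments above; a common Grassmann retraction, which we assume here purely for illustration, maps U and a tangent H to the orthonormal factor of a thin QR factorization of U + H.

```python
import numpy as np

def retract(U, H):
    """Assumed QR-based retraction: R_U(H) = qf(U + H).

    Returns an orthonormal basis of col(U + H); assumes U + H has full rank.
    """
    Q, R = np.linalg.qr(U + H)
    return Q * np.sign(np.diag(R))   # fix column signs so qf is well defined
```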
76 Likewise, the radius of the trust region is adapted based on the observed quality of the model. [sent-185, score-0.052]
77 Typically, the faster one can compute A(U )[H], the faster one can minimize mU (H) in the trust region. [sent-187, score-0.052]
78 $m_U(H) = f(\mathcal{U}) + \langle \operatorname{grad} f(\mathcal{U}), H \rangle_{\mathcal{U}} + \frac{1}{2} \langle A(\mathcal{U})[H], H \rangle_{\mathcal{U}}$. A powerful property of the RTR method is that global convergence of the algorithm toward critical points (local minimizers in practice since it is a descent method) is guaranteed independently of $A(\mathcal{U})$ [ABG07, Thm 4. [sent-188, score-0.176]
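The accept/reject and radius-update logic described above can be sketched as follows. This is a schematic only: cost, grad, hess_vec, retract, inner and solve_tr_subproblem are placeholders for the problem-specific pieces, and the thresholds are common textbook defaults rather than the paper's settings.

```python
def rtr_step(U, cost, grad, hess_vec, retract, inner, solve_tr_subproblem, radius):
    """One trust-region iteration: model, (approximately) solve, retract, accept/reject."""
    g = grad(U)
    H = solve_tr_subproblem(g, lambda V: hess_vec(U, V), radius)      # tangent step
    # model decrease m_U(0) - m_U(H); the subproblem solver is assumed to make it > 0
    decrease = -(inner(U, g, H) + 0.5 * inner(U, hess_vec(U, H), H))
    U_plus = retract(U, H)
    rho = (cost(U) - cost(U_plus)) / decrease        # actual vs predicted decrease
    if rho < 0.25:
        radius *= 0.25                               # poor model: shrink the region
    elif rho > 0.75 and inner(U, H, H) ** 0.5 >= 0.99 * radius:
        radius *= 2.0                                # good model, step at the boundary
    if rho > 0.1:
        U = U_plus                                   # accept the candidate
    return U, radius
```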
79 Additionally, if we take A(U ) to be the Hessian of f at U (16), we get a quadratic convergence rate, even if we only approximately minimize mU within the trust region using a few steps of a well chosen iterative method [ABG07, Thm 4. [sent-194, score-0.052]
80 Table 2: All Matlab implementations call subroutines in non-Matlab code to efficiently deal with the sparsity of the matrices involved. [sent-204, score-0.055]
81 For the initial guess U0 , we use the OptSpace trimmed SVD. [sent-210, score-0.081]
82 Our methods (RTRMC 1 and 2) and Balanced Factorization require knowledge of the target rank r. [sent-219, score-0.072]
83 OptSpace, ADMiRA and LMaFit include a mechanism to guess the rank, but benefit from knowing it, hence we provide the true rank to these methods too. [sent-220, score-0.114]
84 The dimension of the manifold of m-by-n matrices of rank r is d = r(m+n−r). [sent-226, score-0.221]
85 5d entries uniformly at random, which yields a sampling ratio of 0. [sent-233, score-0.07]
86 Figure 1 is typical and shows the evolution of the RMSE as a function of time (left) and iteration count (right). [sent-235, score-0.06]
87 Be wary though that this formula is numerically inaccurate when the RMSE is much smaller than the norm of either AB or U V , owing to the computation of the difference of close large numbers. [sent-237, score-0.057]
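The RMSE formula referred to here is not reproduced in the fragments; a standard way to compute it without forming the full matrices, which exhibits exactly the cancellation issue being warned about, is sketched below (our names, our code).

```python
import numpy as np

def rmse_lowrank(A, B, U, V):
    """RMSE between the full products A B and U V without forming them.

    Uses ||AB - UV||_F^2 = tr((A^T A)(B B^T)) + tr((U^T U)(V V^T))
                           - 2 tr((A^T U)(V B^T)),
    i.e. only small Gram-type products. Because it subtracts nearly equal
    large numbers, it loses accuracy once the true RMSE is very small.
    """
    m, n = A.shape[0], B.shape[1]
    sq = (np.trace((A.T @ A) @ (B @ B.T))
          + np.trace((U.T @ U) @ (V @ V.T))
          - 2.0 * np.trace((A.T @ U) @ (V @ B.T)))
    return np.sqrt(max(sq, 0.0) / (m * n))
```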
88 In this second test, we repeat the previous experiment with rectangular matrices: m = 1 000, n = 30 000, r = 5 and a sampling ratio of 2. [sent-239, score-0.048]
89 We expect RTRMC to perform well on rectangular matrices since the dimension of the Grassmann manifold we optimize on only grows linearly with min(m, n), whereas it is the (simple) least-squares problem dimension that grows linearly in max(m, n). [sent-241, score-0.197]
90 Following the protocol in [KMO09], we test our method on the Jester dataset 1 [GRGP01] of ratings of a hundred jokes by 24 983 users. [sent-244, score-0.054]
91 We randomly select 4 000 users and the corresponding continuous ratings in the range [−10, 10]. [sent-245, score-0.079]
92 For each user, we extract two ratings at random as test data. [sent-246, score-0.054]
93 We run the different matrix completion algorithms with a prescribed rank on the remaining training data, N = 100 times for each rank. [sent-247, score-0.245]
94 6 Conclusion Our contribution is an efficient numerical method to solve large low-rank matrix completion problems. [sent-250, score-0.2]
95 All algorithms solve the problem in well under a minute for rank 7. [sent-260, score-0.097]
96 For rectangular matrices, RTRMC is especially efficient owing to the linear growth of the dimension of the search space in min(m, n), whereas for most methods the growth is linear in m + n. [sent-305, score-0.071]
97 Subspace evolution and transfer (SET) for low-rank matrix completion. [sent-365, score-0.096]
98 Low-rank matrix completion with noisy observations: a quantitative comparison. [sent-386, score-0.145]
99 OptSpace: A gradient descent algorithm on the Grassman manifold for matrix completion. [sent-395, score-0.158]
100 Solving a low-rank factorization model for matrix completion by a nonlinear successive over-relaxation algorithm. [sent-432, score-0.186]
wordName wordTfidf (topN-words)
[('rtrmc', 0.391), ('wu', 0.351), ('vec', 0.35), ('ru', 0.239), ('gradf', 0.176), ('admira', 0.174), ('optspace', 0.172), ('col', 0.148), ('riemannian', 0.144), ('pu', 0.142), ('grassmannian', 0.138), ('rm', 0.125), ('xij', 0.124), ('tu', 0.121), ('grassmann', 0.12), ('lmafit', 0.12), ('svt', 0.12), ('tangent', 0.119), ('rmse', 0.111), ('uw', 0.103), ('manifold', 0.094), ('matlab', 0.093), ('absil', 0.086), ('completion', 0.081), ('rank', 0.072), ('hessian', 0.07), ('mu', 0.07), ('entries', 0.07), ('rtr', 0.069), ('irn', 0.069), ('dai', 0.067), ('matrix', 0.064), ('skew', 0.063), ('propack', 0.063), ('cij', 0.062), ('orthonormal', 0.062), ('rr', 0.062), ('genrtr', 0.059), ('nmae', 0.059), ('rrn', 0.059), ('stiefel', 0.059), ('matrices', 0.055), ('ratings', 0.054), ('keshavan', 0.054), ('trust', 0.052), ('allerton', 0.052), ('grad', 0.052), ('hessf', 0.052), ('formulas', 0.05), ('complexities', 0.048), ('rectangular', 0.048), ('objective', 0.048), ('guess', 0.042), ('mn', 0.041), ('factorization', 0.041), ('sv', 0.04), ('hwu', 0.039), ('jester', 0.039), ('parallelizable', 0.039), ('retraction', 0.039), ('trimmed', 0.039), ('ij', 0.038), ('cholesky', 0.037), ('mij', 0.037), ('dg', 0.037), ('preprint', 0.037), ('arxiv', 0.036), ('package', 0.035), ('geometry', 0.035), ('st', 0.035), ('belgian', 0.034), ('meyer', 0.034), ('qf', 0.034), ('thm', 0.034), ('balzano', 0.034), ('catholique', 0.034), ('icteam', 0.034), ('louvain', 0.034), ('formula', 0.034), ('scenario', 0.033), ('evolution', 0.032), ('subspaces', 0.032), ('numerical', 0.03), ('recommender', 0.03), ('goldberg', 0.03), ('recast', 0.03), ('svd', 0.03), ('count', 0.028), ('prescribed', 0.028), ('inherited', 0.028), ('cand', 0.028), ('balanced', 0.027), ('dence', 0.027), ('discontinuous', 0.027), ('users', 0.025), ('along', 0.025), ('solve', 0.025), ('owing', 0.023), ('tools', 0.023), ('min', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 230 nips-2011-RTRMC: A Riemannian trust-region method for low-rank matrix completion
Author: Nicolas Boumal, Pierre-antoine Absil
Abstract: We consider large matrices of low rank. We address the problem of recovering such matrices when most of the entries are unknown. Matrix completion finds applications in recommender systems. In this setting, the rows of the matrix may correspond to items and the columns may correspond to users. The known entries are the ratings given by users to some items. The aim is to predict the unobserved ratings. This problem is commonly stated in a constrained optimization framework. We follow an approach that exploits the geometry of the low-rank constraint to recast the problem as an unconstrained optimization problem on the Grassmann manifold. We then apply first- and second-order Riemannian trust-region methods to solve it. The cost of each iteration is linear in the number of known entries. Our methods, RTRMC 1 and 2, outperform state-of-the-art algorithms on a wide range of problem instances. 1
2 0.12408377 83 nips-2011-Efficient inference in matrix-variate Gaussian models with \iid observation noise
Author: Oliver Stegle, Christoph Lippert, Joris M. Mooij, Neil D. Lawrence, Karsten M. Borgwardt
Abstract: Inference in matrix-variate Gaussian models has major applications for multioutput prediction and joint learning of row and column covariances from matrixvariate data. Here, we discuss an approach for efficient inference in such models that explicitly account for iid observation noise. Computational tractability can be retained by exploiting the Kronecker product between row and column covariance matrices. Using this framework, we show how to generalize the Graphical Lasso in order to learn a sparse inverse covariance between features while accounting for a low-rank confounding covariance between samples. We show practical utility on applications to biology, where we model covariances with more than 100,000 dimensions. We find greater accuracy in recovering biological network structures and are able to better reconstruct the confounders. 1
3 0.10056312 211 nips-2011-Penalty Decomposition Methods for Rank Minimization
Author: Yong Zhang, Zhaosong Lu
Abstract: In this paper we consider general rank minimization problems with rank appearing in either objective function or constraint. We first show that a class of matrix optimization problems can be solved as lower dimensional vector optimization problems. As a consequence, we establish that a class of rank minimization problems have closed form solutions. Using this result, we then propose penalty decomposition methods for general rank minimization problems. The convergence results of the PD methods have been shown in the longer version of the paper [19]. Finally, we test the performance of our methods by applying them to matrix completion and nearest low-rank correlation matrix problems. The computational results demonstrate that our methods generally outperform the existing methods in terms of solution quality and/or speed. 1
4 0.092984945 248 nips-2011-Semi-supervised Regression via Parallel Field Regularization
Author: Binbin Lin, Chiyuan Zhang, Xiaofei He
Abstract: This paper studies the problem of semi-supervised learning from the vector field perspective. Many of the existing work use the graph Laplacian to ensure the smoothness of the prediction function on the data manifold. However, beyond smoothness, it is suggested by recent theoretical work that we should ensure second order smoothness for achieving faster rates of convergence for semisupervised regression problems. To achieve this goal, we show that the second order smoothness measures the linearity of the function, and the gradient field of a linear function has to be a parallel vector field. Consequently, we propose to find a function which minimizes the empirical error, and simultaneously requires its gradient field to be as parallel as possible. We give a continuous objective function on the manifold and discuss how to discretize it by using random points. The discretized optimization problem turns out to be a sparse linear system which can be solved very efficiently. The experimental results have demonstrated the effectiveness of our proposed approach. 1
5 0.088750616 287 nips-2011-The Manifold Tangent Classifier
Author: Salah Rifai, Yann N. Dauphin, Pascal Vincent, Yoshua Bengio, Xavier Muller
Abstract: We combine three important ideas present in previous work for building classifiers: the semi-supervised hypothesis (the input distribution contains information about the classifier), the unsupervised manifold hypothesis (data density concentrates near low-dimensional manifolds), and the manifold hypothesis for classification (different classes correspond to disjoint manifolds separated by low density). We exploit a novel algorithm for capturing manifold structure (high-order contractive auto-encoders) and we show how it builds a topological atlas of charts, each chart being characterized by the principal singular vectors of the Jacobian of a representation mapping. This representation learning algorithm can be stacked to yield a deep architecture, and we combine it with a domain knowledge-free version of the TangentProp algorithm to encourage the classifier to be insensitive to local directions changes along the manifold. Record-breaking classification results are obtained. 1
6 0.077139869 102 nips-2011-Generalised Coupled Tensor Factorisation
7 0.076530844 258 nips-2011-Sparse Bayesian Multi-Task Learning
8 0.075270273 257 nips-2011-SpaRCS: Recovering low-rank and sparse matrices from compressive measurements
9 0.074700102 262 nips-2011-Sparse Inverse Covariance Matrix Estimation Using Quadratic Approximation
10 0.068105415 270 nips-2011-Statistical Performance of Convex Tensor Decomposition
11 0.065534547 165 nips-2011-Matrix Completion for Multi-label Image Classification
12 0.062008224 164 nips-2011-Manifold Precis: An Annealing Technique for Diverse Sampling of Manifolds
13 0.057228472 5 nips-2011-A Denoising View of Matrix Completion
14 0.054557785 73 nips-2011-Divide-and-Conquer Matrix Factorization
15 0.052790333 203 nips-2011-On the accuracy of l1-filtering of signals with block-sparse structure
16 0.049385734 260 nips-2011-Sparse Features for PCA-Like Linear Regression
17 0.049100965 71 nips-2011-Directed Graph Embedding: an Algorithm based on Continuous Limits of Laplacian-type Operators
18 0.047160648 271 nips-2011-Statistical Tests for Optimization Efficiency
19 0.046951152 263 nips-2011-Sparse Manifold Clustering and Embedding
20 0.046680298 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions
topicId topicWeight
[(0, 0.129), (1, 0.005), (2, -0.03), (3, -0.093), (4, -0.077), (5, 0.051), (6, 0.009), (7, 0.029), (8, 0.062), (9, 0.046), (10, 0.061), (11, -0.023), (12, -0.035), (13, -0.029), (14, 0.034), (15, -0.078), (16, -0.106), (17, 0.053), (18, -0.063), (19, -0.021), (20, -0.046), (21, 0.02), (22, 0.068), (23, -0.007), (24, 0.029), (25, 0.033), (26, -0.145), (27, -0.169), (28, 0.031), (29, -0.023), (30, 0.016), (31, 0.031), (32, -0.066), (33, -0.054), (34, -0.012), (35, 0.043), (36, -0.099), (37, 0.08), (38, -0.022), (39, -0.048), (40, -0.122), (41, -0.028), (42, 0.134), (43, 0.04), (44, 0.131), (45, 0.023), (46, -0.047), (47, 0.009), (48, 0.145), (49, -0.076)]
simIndex simValue paperId paperTitle
same-paper 1 0.93941432 230 nips-2011-RTRMC: A Riemannian trust-region method for low-rank matrix completion
Author: Nicolas Boumal, Pierre-antoine Absil
Abstract: We consider large matrices of low rank. We address the problem of recovering such matrices when most of the entries are unknown. Matrix completion finds applications in recommender systems. In this setting, the rows of the matrix may correspond to items and the columns may correspond to users. The known entries are the ratings given by users to some items. The aim is to predict the unobserved ratings. This problem is commonly stated in a constrained optimization framework. We follow an approach that exploits the geometry of the low-rank constraint to recast the problem as an unconstrained optimization problem on the Grassmann manifold. We then apply first- and second-order Riemannian trust-region methods to solve it. The cost of each iteration is linear in the number of known entries. Our methods, RTRMC 1 and 2, outperform state-of-the-art algorithms on a wide range of problem instances. 1
2 0.61331707 164 nips-2011-Manifold Precis: An Annealing Technique for Diverse Sampling of Manifolds
Author: Nitesh Shroff, Pavan Turaga, Rama Chellappa
Abstract: In this paper, we consider the Pr´ cis problem of sampling K representative yet e diverse data points from a large dataset. This problem arises frequently in applications such as video and document summarization, exploratory data analysis, and pre-filtering. We formulate a general theory which encompasses not just traditional techniques devised for vector spaces, but also non-Euclidean manifolds, thereby enabling these techniques to shapes, human activities, textures and many other image and video based datasets. We propose intrinsic manifold measures for measuring the quality of a selection of points with respect to their representative power, and their diversity. We then propose efficient algorithms to optimize the cost function using a novel annealing-based iterative alternation algorithm. The proposed formulation is applicable to manifolds of known geometry as well as to manifolds whose geometry needs to be estimated from samples. Experimental results show the strength and generality of the proposed approach.
3 0.6125074 5 nips-2011-A Denoising View of Matrix Completion
Author: Weiran Wang, Zhengdong Lu, Miguel Á. Carreira-Perpiñán
Abstract: In matrix completion, we are given a matrix where the values of only some of the entries are present, and we want to reconstruct the missing ones. Much work has focused on the assumption that the data matrix has low rank. We propose a more general assumption based on denoising, so that we expect that the value of a missing entry can be predicted from the values of neighboring points. We propose a nonparametric version of denoising based on local, iterated averaging with meanshift, possibly constrained to preserve local low-rank manifold structure. The few user parameters required (the denoising scale, number of neighbors and local dimensionality) and the number of iterations can be estimated by cross-validating the reconstruction error. Using our algorithms as a postprocessing step on an initial reconstruction (provided by e.g. a low-rank method), we show consistent improvements with synthetic, image and motion-capture data. Completing a matrix from a few given entries is a fundamental problem with many applications in machine learning, computer vision, network engineering, and data mining. Much interest in matrix completion has been caused by recent theoretical breakthroughs in compressed sensing [1, 2] as well as by the now celebrated Netflix challenge on practical prediction problems [3, 4]. Since completion of arbitrary matrices is not a well-posed problem, it is often assumed that the underlying matrix comes from a restricted class. Matrix completion models almost always assume a low-rank structure of the matrix, which is partially justified through factor models [4] and fast convex relaxation [2], and often works quite well when the observations are sparse and/or noisy. The low-rank structure of the matrix essentially asserts that all the column vectors (or the row vectors) live on a low-dimensional subspace. This assumption is arguably too restrictive for problems with richer structure, e.g. when each column of the matrix represents a snapshot of a seriously corrupted motion capture sequence (see section 3), for which a more flexible model, namely a curved manifold, is more appropriate. In this paper, we present a novel view of matrix completion based on manifold denoising, which conceptually generalizes the low-rank assumption to curved manifolds. Traditional manifold denoising is performed on fully observed data [5, 6], aiming to send the data corrupted by noise back to the correct surface (defined in some way). However, with a large proportion of missing entries, we may not have a good estimate of the manifold. Instead, we start with a poor estimate and improve it iteratively. Therefore the “noise” may be due not just to intrinsic noise, but mostly to inaccurately estimated missing entries. We show that our algorithm can be motivated from an objective purely based on denoising, and prove its convergence under some conditions. We then consider a more general case with a nonlinear low-dimensional manifold and use a stopping criterion that works successfully in practice. Our model reduces to a low-rank model when we require the manifold to be flat, showing a relation with a recent thread of matrix completion models based on alternating projection [7]. In our experiments, we show that our denoising-based matrix completion model can make better use of the latent manifold structure on both artificial and real-world data sets, and yields superior recovery of the missing entries. 
The paper is organized as follows: section 1 reviews nonparametric denoising methods based on mean-shift updates, section 2 extends this to matrix completion by using denoising with constraints, section 3 gives experimental results, and section 4 discusses related work. 1 1 Denoising with (manifold) blurring mean-shift algorithms (GBMS/MBMS) In Gaussian blurring mean-shift (GBMS), denoising is performed in a nonparametric way by local averaging: each data point moves to the average of its neighbors (to a certain scale), and the process is repeated. We follow the derivation in [8]. Consider a dataset {xn }N ⊂ RD and define a n=1 Gaussian kernel density estimate p(x) = 1 N N Gσ (x, xn ) (1) n=1 1 with bandwidth σ > 0 and kernel Gσ (x, xn ) ∝ exp − 2 ( x − xn /σ)2 (other kernels may be used, such as the Epanechnikov kernel, which results in sparse affinities). The (non-blurring) mean-shift algorithm rearranges the stationary point equation ∇p(x) = 0 into the iterative scheme x(τ +1) = f (x(τ ) ) with N x (τ +1) = f (x (τ ) p(n|x )= (τ ) )xn p(n|x (τ ) n=1 )= exp − 1 (x(τ ) − xn )/σ 2 N n′ =1 2 exp − 1 (x(τ ) − xn′ )/σ 2 2 . (2) This converges to a mode of p from almost every initial x ∈ RD , and can be seen as taking selfadapting step sizes along the gradient (since the mean shift f (x) − x is parallel to ∇p(x)). This iterative scheme was originally proposed by [9] and it or variations of it have found widespread application in clustering [8, 10–12] and denoising of 3D point sets (surface fairing; [13, 14]) and manifolds in general [5, 6]. The blurring mean-shift algorithm applies one step of the previous scheme, initialized from every point, in parallel for all points. That is, given the dataset X = {x1 , . . . , xN }, for each xn ∈ X ˜ we obtain a new point xn = f (xn ) by applying one step of the mean-shift algorithm, and then we ˜ replace X with the new dataset X, which is a blurred (shrunk) version of X. By iterating this process we obtain a sequence of datasets X(0) , X(1) , . . . (and a corresponding sequence of kernel density estimates p(0) (x), p(1) (x), . . .) where X(0) is the original dataset and X(τ ) is obtained by blurring X(τ −1) with one mean-shift step. We can see this process as maximizing the following objective function [10] by taking parallel steps of the form (2) for each point: N p(xn ) = E(X) = n=1 1 N N N 1 e− 2 Gσ (xn , xm ) ∝ xn −xm σ 2 . (3) n,m=1 n,m=1 This process eventually converges to a dataset X(∞) where all points are coincident: a completely denoised dataset where all structure has been erased. As shown by [8], this process can be stopped early to return clusters (= locally denoised subsets of points); the number of clusters obtained is controlled by the bandwidth σ. However, here we are interested in the denoising behavior of GBMS. ˜ The GBMS step can be formulated in a matrix form reminiscent of spectral clustering [8] as X = X P where X = (x1 , . . . , xN ) is a D×N matrix of data points; W is the N ×N matrix of Gaussian N affinities wnm = Gσ (xn , xm ); D = diag ( n=1 wnm ) is the degree matrix; and P = WD−1 is N an N × N stochastic matrix: pnm = p(n|xm ) ∈ (0, 1) and n=1 pnm = 1. P (or rather its transpose) is the stochastic matrix of the random walk in a graph [15], which in GBMS represents the posterior probabilities of each point under the kernel density estimate (1). P is similar to the 1 1 matrix N = D− 2 WD− 2 derived from the normalized graph Laplacian commonly used in spectral clustering, e.g. in the normalized cut [16]. 
Since, by the Perron-Frobenius theorem [17, ch. 8], all left eigenvalues of P(X) have magnitude less than 1 except for one that equals 1 and is associated with ˜ an eigenvector of constant entries, iterating X = X P(X) converges to the stationary distribution of each P(X), where all points coincide. ˜ From this point of view, the product X = X P(X) can be seen as filtering the dataset X with a datadependent low-pass filter P(X), which makes clear the denoising behavior. This also suggests using ˜ other filters [12] X = X φ(P(X)) as long as φ(1) = 1 and |φ(r)| < 1 for r ∈ [0, 1), such as explicit schemes φ(P) = (1 − η)I + ηP for η ∈ (0, 2], power schemes φ(P) = Pn for n = 1, 2, 3 . . . or implicit schemes φ(P) = ((1 + η)I − ηP)−1 for η > 0. One important problem with GBMS is that it denoises equally in all directions. When the data lies on a low-dimensional manifold, denoising orthogonally to it removes out-of-manifold noise, but 2 denoising tangentially to it perturbs intrinsic degrees of freedom of the data and causes shrinkage of the entire manifold (most strongly near its boundary). To prevent this, the manifold blurring meanshift algorithm (MBMS) [5] first computes a predictor averaging step with GBMS, and then for each point xn a corrector projective step removes the step direction that lies in the local tangent space of xn (obtained from local PCA run on its k nearest neighbors). In practice, both GBMS and MBMS must be stopped early to prevent excessive denoising and manifold distortions. 2 Blurring mean-shift denoising algorithms for matrix completion We consider the natural extension of GBMS to the matrix completion case by adding the constraints given by the present values. We use the subindex notation XM and XP to indicate selection of the missing or present values of the matrix XD×N , where P ⊂ U , M = U \ P and U = {(d, n): d = 1, . . . , D, n = 1, . . . , N }. The indices P and values XP of the present matrix entries are the data of the problem. Then we have the following constrained optimization problem: N Gσ (xn , xm ) max E(X) = X s.t. XP = XP . (4) n,m=1 This is similar to low-rank formulations for matrix completion that have the same constraints but use as objective function the reconstruction error with a low-rank assumption, e.g. X − ABX 2 with AD×L , BL×D and L < D. We initialize XM to the output of some other method for matrix completion, such as singular value projection (SVP; [7]). For simple constraints such as ours, gradient projection algorithms are attractive. The gradient of E wrt X is a matrix of D × N whose nth column is: ∇xn E(X) = 2 σ2 N 1 e− 2 xn −xm σ 2 N (xm − xn ) ∝ m=1 2 p(m|xn )xm p(xn ) −xn + σ2 m=1 (5) and its projection on the constraint space is given by zeroing its entries having indices in P; call ΠP this projection operator. Then, we have the following step of length α ≥ 0 along the projected gradient: (τ +1) X(τ +1) = X(τ ) + αΠP (∇X E(X(τ ) )) ⇐⇒ XM (τ ) = XM + α ΠP (∇X E(X(τ ) )) M (6) which updates only the missing entries XM . Since our search direction is ascent and makes an angle with the gradient that is bounded away from π/2, and E is lower bounded, continuously differentiable and has bounded Hessian (thus a Lipschitz continuous gradient) in RN L , by carrying out a line search that satisfies the Wolfe conditions, we are guaranteed convergence to a local stationary point, typically a maximizer [18, th. 3.2]. 
However, as reasoned later, we do not perform a line search at all, instead we fix the step size to the GBMS self-adapting step size, which results in a simple and faster algorithm consisting of carrying out a GBMS step on X (i.e., X(τ +1) = X(τ ) P(X(τ ) )) and then refilling XP to the present values. While we describe the algorithm in this way for ease of explanation, in practice we do not actually compute the GBMS step for all xdn values, but only for the missing ones, which is all we need. Thus, our algorithm carries out GBMS denoising steps within the missing-data subspace. We can derive this result in a different way by starting from N the unconstrained optimization problem maxXP E(X) = n,m=1 Gσ (xn , xm ) (equivalent to (4)), computing its gradient wrt XP , equating it to zero and rearranging (in the same way the mean-shift algorithm is derived) to obtain a fixed-point iteration identical to our update above. Fig. 1 shows the pseudocode for our denoising-based matrix completion algorithms (using three nonparametric denoising algorithms: GBMS, MBMS and LTP). Convergence and stopping criterion As noted above, we have guaranteed convergence by simply satisfying standard line search conditions, but a line search is costly. At present we do not have (τ +1) a proof that the GBMS step size satisfies such conditions, or indeed that the new iterate XM increases or leaves unchanged the objective, although we have never encountered a counterexample. In fact, it turns out that none of the work about GBMS that we know about proves that either: [10] proves that ∅(X(τ +1) ) ≤ ∅(X(τ ) ) for 0 < ρ < 1, where ∅(·) is the set diameter, while [8, 12] 3 notes that P(X) has a single eigenvalue of value 1 and all others of magnitued less than 1. While this shows that all points converge to the same location, which indeed is the global maximum of (3), it does not necessarily follow that each step decreases E. GBMS (k, σ) with full or k-nn graph: given XD×N , M repeat for n = 1, . . . , N Nn ← {1, . . . , N } (full graph) or k nearest neighbors of xn (k-nn graph) Gσ (xn ,xm ) mean-shift xm ∂xn ← −xn + m∈Nn step m′ ∈Nn Gσ (xn ,xm′ ) end XM ← XM + (∂X)M move points’ missing entries until validation error increases return X However, the question of convergence as τ → ∞ has no practical interest in a denoising setting, because achieving a total denoising almost never yields a good matrix completion. What we want is to achieve just enough denoising and stop the algorithm, as was the case with GBMS clustering, and as is the case in algorithms for image denoising. We propose to determine the optimal number of iterations, as well as the bandwidth σ and any other parameters, by cross-validation. Specifically, we select a held-out set by picking a random subset of the present entries and considering them as missing; this allows us to evaluate an error between our completion for them and the ground truth. We stop iterating when this error increases. MBMS (L, k, σ) with full or k-nn graph: given XD×N , M repeat for n = 1, . . . , N Nn ← {1, . . . 
, N } (full graph) or k nearest neighbors of xn (k-nn graph) Gσ (xn ,xm ) mean-shift xm ∂xn ← −xn + m∈Nn step m′ ∈Nn Gσ (xn ,xm′ ) Xn ← k nearest neighbors of xn (µn , Un ) ← PCA(Xn , L) estimate L-dim tangent space at xn subtract parallel motion ∂xn ← (I − Un UT )∂xn n end XM ← XM + (∂X)M move points’ missing entries until validation error increases return X This argument justifies an algorithmic, as opposed to an opLTP (L, k) with k-nn graph: given XD×N , M timization, view of denoisingrepeat based matrix completion: apfor n = 1, . . . , N ply a denoising step, refill the Xn ← k nearest neighbors of xn present values, iterate until the (µn , Un ) ← PCA(Xn , L) estimate L-dim tangent space at xn validation error increases. This project point onto tangent space allows very general definitions ∂xn ← (I − Un UT )(µn − xn ) n end of denoising, and indeed a lowXM ← XM + (∂X)M move points’ missing entries rank projection is a form of deuntil validation error increases noising where points are not alreturn X lowed outside the linear manifold. Our formulation using Figure 1: Our denoising matrix completion algorithms, based on the objective function (4) is still Manifold Blurring Mean Shift (MBMS) and its particular cases useful in that it connects our Local Tangent Projection (LTP, k-nn graph, σ = ∞) and Gauss- denoising assumption with the ian Blurring Mean Shift (GBMS, L = 0); see [5] for details. Nn more usual low-rank assumption contains all N points (full graph) or only xn ’s nearest neighbors that has been used in much ma(k-nn graph). The index M selects the components of its input trix completion work, and juscorresponding to missing values. Parameters: denoising scale σ, tifies the refilling step as renumber of neighbors k, local dimensionality L. sulting from the present-data constraints under a gradientprojection optimization. MBMS denoising for matrix completion Following our algorithmic-based approach to denois˜ ing, we could consider generalized GBMS steps of the form X = X φ(P(X)). For clustering, Carreira-Perpi˜ an [12] found an overrelaxed explicit step φ(P) = (1 − η)I + ηP with η ≈ 1.25 to n´ achieve similar clusterings but faster. Here, we focus instead on the MBMS variant of GBMS that allows only for orthogonal, not tangential, point motions (defined wrt their local tangent space as estimated by local PCA), with the goal of preserving low-dimensional manifold structure. MBMS has 3 user parameters: the bandwidth σ (for denoising), and the latent dimensionality L and the 4 number of neighbors k (for the local tangent space and the neighborhood graph). A special case of MBMS called local tangent projection (LTP) results by using a neighborhood graph and setting σ = ∞ (so only two user parameters are needed: L and k). LTP can be seen as doing a low-rank matrix completion locally. LTP was found in [5] to have nearly as good performance as the best σ in several problems. MBMS also includes as particular cases GBMS (L = 0), PCA (k = N , σ = ∞), and no denoising (σ = 0 or L = D). Note that if we apply MBMS to a dataset that lies on a linear manifold of dimensionality d using L ≥ d then no denoising occurs whatsoever because the GBMS updates lie on the d-dimensional manifold and are removed by the corrector step. In practice, even if the data are assumed noiseless, the reconstruction from a low-rank method will lie close to but not exactly on the d-dimensional manifold. 
However, this suggests using largish ranks for the low-rank method used to reconstruct X and lower L values in the subsequent MBMS run. In summary, this yields a matrix completion algorithm where we apply an MBMS step, refill the present values, and iterate until the validation error increases. Again, in an actual implementation we compute the MBMS step only for the missing entries of X. The shrinking problem of GBMS is less pronounced in our matrix completion setting, because we constrain some values not to change. Still, in agreement with [5], we find MBMS to be generally superior to GBMS. Computational cost With a full graph, the cost per iteration of GBMS and MBMS is O(N 2 D) and O(N 2 D + N (D + k) min(D, k)2 ), respectively. In practice with high-dimensional data, best denoising results are obtained using a neighborhood graph [5], so that the sums over points in eqs. (3) or (4) extend only to the neighbors. With a k-nearest-neighbor graph and if we do not update the neighbors at each iteration (which affects the result little), the respective cost per iteration is O(N kD) and O(N kD + N (D + k) min(D, k)2 ), thus linear in N . The graph is constructed on the initial X we use, consisting of the present values and an imputation for the missing ones achieved with a standard matrix completion method, and has a one-off cost of O(N 2 D). The cost when we have a fraction µ = |M| ∈ [0, 1] of missing data is simply the above times µ. Hence the run time ND of our mean-shift-based matrix completion algorithms is faster the more present data we have, and thus faster than the usual GBMS or MBMS case, where all data are effectively missing. 3 Experimental results We compare with representative methods of several approaches: a low-rank matrix completion method, singular value projection (SVP [7], whose performance we found similar to that of alternating least squares, ALS [3, 4]); fitting a D-dimensional Gaussian model with EM and imputing the missing values of each xn as the conditional mean E {xn,Mn |xn,Pn } (we use the implementation of [19]); and the nonlinear method of [20] (nlPCA). We initialize GBMS and MBMS from some or all of these algorithms. For methods with user parameters, we set them by cross-validation in the following way: we randomly select 10% of the present entries and pretend they are missing as well, we run the algorithm on the remaining 90% of the present values, and we evaluate the reconstruction at the 10% entries we kept earlier. We repeat this over different parameters’ values and pick the one with lowest reconstruction error. We then run the algorithm with these parameters values on the entire present data and report the (test) error with the ground truth for the missing values. 100D Swissroll We created a 3D swissroll data set with 3 000 points and lifted it to 100D with a random orthonormal mapping, and added a little noise (spherical Gaussian with stdev 0.1). We selected uniformly at random 6.76% of the entries to be present. We use the Gaussian model and SVP (fixed rank = 3) as initialization for our algorithm. We typically find that these initial X are very noisy (fig. 3), with some reconstructed points lying between different branches of the manifold and causing a big reconstruction error. We fixed L = 2 (the known dimensionality) for MBMS and cross-validated the other parameters: σ and k for MBMS and GBMS (both using k-nn graph), and the number of iterations τ to be used. 
Table 1 gives the performance of MBMS and GBMS for testing, along with their optimal parameters. Fig. 3 shows the results of different methods at a few iterations. MBMS initialized from the Gaussian model gives the most remarkable denoising effect. To show that there is a wide range of σ and number of iterations τ that give good performance with GBMS and MBMS, we fix k = 50 and run the algorithm with varying σ values and plot the reconstruction error for missing entries over iterations in fig. 2. Both GBMS can achieve good 5 Methods Gaussian + GBMS (∞, 10, 0, 1) + MBMS (1, 20, 2, 25) SVP + GBMS (3, 50, 0, 1) + MBMS (3, 50, 2, 2) RSSE 168.1 165.8 157.2 156.8 151.4 151.8 mean 2.63 2.57 2.36 1.94 1.89 1.87 stdev 1.59 1.61 1.63 2.10 2.02 2.05 Methods nlPCA SVP + GBMS (400,140,0,1) + MBMS (500,140,9,5) Table 1: Swissroll data set: reconstruction errors obtained by different algorithms along with their optimal parameters (σ, k, L, no. iterations τ ). The three columns show the root sum of squared errors on missing entries, the mean, and the standard deviation of the pointwise reconstruction error, resp. SVP + GBMS error (RSSE) 180 170 SVP + MBMS Gaussian + GBMS 180 180 170 170 170 160 160 ∞ 160 150 0 1 2 3 4 5 6 7 8 910 12 14 16 18 20 iteration τ stdev 42.6 39.3 37.7 34.9 Gaussian + MBMS 180 8 10 15 25 mean 26.1 21.8 18.8 17.0 Table 2: MNIST-7 data set: errors of the different algorithms and their optimal parameters (σ, k, L, no. iterations τ ). The three columns show the root sum of squared errors on missing entries (×10−4 ), the mean, and the standard deviation of pixel errors, respectively. 160 0.3 0.5 1 2 3 5 RSSE 7.77 6.99 6.54 6.03 150 0 1 2 3 4 5 6 7 8 910 12 14 16 18 20 iteration τ 150 0 1 2 3 4 5 6 7 8 910 12 14 16 18 20 iteration τ 150 0 1 2 3 4 5 6 7 8 910 12 14 16 18 20 iteration τ Figure 2: Reconstruction error of GBMS/MBMS over iterations (each curve is a different σ value). denoising (and reconstruction), but MBMS is more robust, with good results occurring for a wide range of iterations, indicating it is able to preserve the manifold structure better. Mocap data We use the running-motion sequence 09 01 from the CMU mocap database with 148 samples (≈ 1.7 cycles) with 150 sensor readings (3D positions of 50 joints on a human body). The motion is intrinsically 1D, tracing a loop in 150D. We compare nlPCA, SVP, the Gaussian model, and MBMS initialized from the first three algorithms. For nlPCA, we do a grid search for the weight decay coefficient while fixing its structure to be 2 × 10 × 150 units, and use an early stopping criterion. For SVP, we do grid search on {1, 2, 3, 5, 7, 10} for the rank. For MBMS (L = 1) and GBMS (L = 0), we do grid search for σ and k. We report the reconstruction error as a function of the proportion of missing entries from 50% to 95%. For each missing-data proportion, we randomly select 5 different sets of present values and run all algorithms for them. Fig. 4 gives the mean errors of all algorithms. All methods perform well when missing-data proportion is small. nlPCA, being prone to local optima, is less stable than SVP and the Gaussian model, especially when the missing-data proportion is large. The Gaussian model gives the best and most stable initialization. At 95%, all methods fail to give an acceptable reconstruction, but up to 90% missing entries, MBMS and GBMS always beat the other algorithms. Fig. 4 shows selected reconstructions from all algorithms. 
MNIST digit ‘7’ The MNIST digit ‘7’ data set contains 6 265 greyscale (0–255) images of size 28 × 28. We create missing entries in a way reminiscent of run-length errors in transmission. We generate 16 to 26 rectangular boxes of an area approximately 25 pixels at random locations in each image and use them to black out pixels. In this way, we create a high dimensional data set (784 dimensions) with about 50% entries missing on average. Because of the loss of spatial correlations within the blocks, this missing data pattern is harder than random. The Gaussian model cannot handle such a big data set because it involves inverting large covariance matrices. nlPCA is also very slow and we cannot afford cross-validating its structure or the weight decay coefficient, so we picked a reasonable structure (10 × 30 × 784 units), used the default weight decay parameter in the code (10−3 ), and allowed up to 500 iterations. We only use SVP as initialization for our algorithm. Since the intrinsic dimension of MNIST is suspected to be not very high, 6 SVP τ =0 SVP + GBMS τ =1 SVP + MBMS τ =2 Gaussian τ =0 Gaussian + GBMS τ =1 Gaussian + MBMS τ = 25 20 20 20 20 20 20 15 15 15 15 15 15 10 10 10 10 10 10 5 5 5 5 5 5 0 0 0 0 0 0 −5 −5 −5 −5 −5 −5 −10 −10 −15 −15 −10 −5 0 5 10 15 20 −10 −15 −15 −10 −5 0 5 10 15 20 −15 −15 −10 −10 −5 0 5 10 15 20 −15 −15 −10 −10 −5 0 5 10 15 20 −15 −15 −10 −10 −5 0 5 10 15 20 −15 −15 −10 −5 0 5 10 15 20 Figure 3: Denoising effect of the different algorithms. For visualization, we project the 100D data to 3D with the projection matrix used for creating the data. Present values are refilled for all plots. 7000 6000 error 5000 4000 frame 2 (leg distance) frame 10 (foot pose) frame 147 (leg pose) nlPCA nlPCA + GBMS nlPCA + MBMS SVP SVP + GBMS SVP + MBMS Gaussian Gaussian + GBMS Gaussian + MBMS 3000 2000 1000 0 50 60 70 80 85 90 95 % of missing data Figure 4: Left: mean of errors (RSSE) of 5 runs obtained by different algorithms for varying percentage of missing values. Errorbars shown only for Gaussian + MBMS to avoid clutter. Right: sample reconstructions when 85% percent data is missing. Row 1: initialization. Row 2: init+GBMS. Row 3: init+MBMS. Color indicates different initialization: black, original data; red, nlPCA; blue, SVP; green, Gaussian. we used rank 10 for SVP and L = 9 for MBMS. We also use the same k = 140 as in [5]. So we only had to choose σ and the number of iterations via cross-validation. Table 2 shows the methods and their corresponding error. Fig. 5 shows some representative reconstructions from different algorithms, with present values refilled. The mean-shift averaging among closeby neighbors (a soft form of majority voting) helps to eliminate noise, unusual strokes and other artifacts created by SVP, which by their nature tend to occur in different image locations over the neighborhood of images. 4 Related work Matrix completion is widely studied in theoretical compressed sensing [1, 2] as well as practical recommender systems [3, 4]. Most matrix completion models rely on a low-rank assumption, and cannot fully exploit a more complex structure of the problem, such as curved manifolds. Related work is on multi-task learning in a broad sense, which extracts the common structure shared by multiple related objects and achieves simultaneous learning on them. This includes applications such as alignment of noise-corrupted images [21], recovery of images with occlusion [22], and even learning of multiple related regressors or classifiers [23]. 
Again, all these works are essentially based on a subspace assumption and do not generalize to more complex situations. A line of work based on a nonlinear low-rank assumption (with a latent variable z of dimensionality L < D) sets up a least-squares error function $\min_{f,Z} \sum_{n=1}^{N} \| x_n - f(z_n) \|^2 = \sum_{n,d=1}^{N,D} (x_{dn} - f_d(z_n))^2$, where one ignores the terms for which $x_{dn}$ is missing, and estimates the function f and the low-dimensional data projections Z by alternating optimization. Linear functions f have been used in the homogeneity analysis literature [24], where this approach is called “missing data deleted”. Nonlinear functions f have been used more recently (neural nets [20]; Gaussian processes for collaborative filtering [25]). Better results are obtained by adding a projection term $\sum_{n=1}^{N} \| z_n - F(x_n) \|^2$ and optimizing over the missing data as well [26].

Figure 5: Selected reconstructions of MNIST block-occluded digits ‘7’ with different methods (columns: original, missing, nlPCA, SVP, GBMS, MBMS).

Prior to our denoising-based work there have been efforts to extend low-rank models to smooth manifolds, mostly in the context of compressed sensing. Baraniuk and Wakin [27] show that certain random measurements, e.g. random projections to a low-dimensional subspace, can preserve the metric of a manifold fairly well if the intrinsic dimension and the curvature of the manifold are both small enough. However, these observations are not suitable for matrix completion, and no algorithm is given for recovering the signal. Chen et al. [28] explicitly model a pre-determined manifold and use it to regularize the signal when recovering the missing values. They estimate the manifold from complete data, whereas no complete data is assumed in our matrix completion setting. Another related work is [29], where a manifold modeled with Isomap is used to estimate the positions of satellite cameras in an iterative manner. Finally, our expectation that the value of a missing entry can be predicted from the values of neighboring points is similar to one category of collaborative filtering methods that essentially use similar users/items to predict missing values [3, 4].

5 Conclusion

We have proposed a new paradigm for matrix completion, denoising, which generalizes the commonly used assumption of low rank. Assuming low rank implies a restrictive form of denoising where the data is forced to have zero variance away from a linear manifold. More general definitions of denoising can potentially handle data that lives on a nonlinear low-dimensional manifold, or whose dimensionality varies (e.g. a set of manifolds), or that does not have low rank at all, and they naturally handle noise in the data. Denoising works because of the fundamental fact that a missing value can be predicted by averaging nearby present values. Although we motivate our framework from a constrained optimization point of view (denoise subject to respecting the present data), we argue for an algorithmic view of denoising-based matrix completion: apply a denoising step, refill the present values, and iterate until the validation error increases (see the sketch below). In turn, this allows different forms of denoising, such as low-rank projection (earlier work) or local averaging with blurring mean-shift (this paper).
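A minimal sketch of this denoise-and-refill loop is given below, assuming a generic denoiser (for instance the mbms_step sketched earlier, or a low-rank projection). The names complete_by_denoising, observed_mask, x_obs and val_error are ours, introduced only for illustration; the stopping rule follows the description above, halting once the error on a held-out set of known entries goes up.

```python
import numpy as np

def complete_by_denoising(X0, observed_mask, x_obs, denoise, val_error, max_iter=20):
    """Denoising-based matrix completion: denoise, refill the present values,
    and stop once the error on a held-out set of known entries increases."""
    X = X0.copy()
    best_err, best_X = np.inf, X.copy()
    for _ in range(max_iter):
        X = denoise(X)                  # e.g. lambda Y: mbms_step(Y, sigma, k, L)
        X[observed_mask] = x_obs        # refill the observed entries
        err = val_error(X)              # validation error on held-out known entries
        if err >= best_err:             # validation error went up: stop
            break
        best_err, best_X = err, X.copy()
    return best_X
```

For instance, the SVP + MBMS setting of Table 2 would correspond to something like complete_by_denoising(X_svp, mask, x_obs, lambda Y: mbms_step(Y, 500, 140, 9), val_err), where X_svp is an SVP-based initialization; all variable names here are placeholders, not the authors' code.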
Our nonparametric choice of mean-shift averaging further relaxes assumptions about the data and results in a simple algorithm with very few user parameters, which afford user control (denoising scale, local dimensionality) but can also be set automatically by cross-validation. Our algorithms are intended to be used as a postprocessing step over a user-provided initialization of the missing values, and we show that they consistently improve upon existing algorithms. The MBMS-based algorithm bridges the gap between pure denoising (GBMS) and local low rank. Other definitions of denoising are possible, for example using temporal as well as spatial neighborhoods; denoising could even be applied to discrete data if we regard it as majority voting among the neighbors of a vector (with suitable definitions of votes and neighborhoods).

Acknowledgments

Work supported by NSF CAREER award IIS-0754089.

References

[1] Emmanuel J. Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772, December 2009.
[2] Emmanuel J. Candès and Terence Tao. The power of convex relaxation: Near-optimal matrix completion. IEEE Trans. Information Theory, 56(5):2053–2080, April 2010.
[3] Yehuda Koren. Factorization meets the neighborhood: A multifaceted collaborative filtering model. SIGKDD 2008, pages 426–434, Las Vegas, NV, August 24–27, 2008.
[4] Robert Bell and Yehuda Koren. Scalable collaborative filtering with jointly derived neighborhood interpolation weights. ICDM 2007, pages 43–52, October 28–31, 2007.
[5] Weiran Wang and Miguel Á. Carreira-Perpiñán. Manifold blurring mean shift algorithms for manifold denoising. CVPR 2010, pages 1759–1766, San Francisco, CA, June 13–18, 2010.
[6] Matthias Hein and Markus Maier. Manifold denoising. NIPS 2006, 19:561–568. MIT Press, 2007.
[7] Prateek Jain, Raghu Meka, and Inderjit S. Dhillon. Guaranteed rank minimization via singular value projection. NIPS 2010, 23:937–945. MIT Press, 2011.
[8] Miguel Á. Carreira-Perpiñán. Fast nonparametric clustering with Gaussian blurring mean-shift. ICML 2006, pages 153–160, Pittsburgh, PA, June 25–29, 2006.
[9] Keinosuke Fukunaga and Larry D. Hostetler. The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Trans. Information Theory, 21(1):32–40, January 1975.
[10] Yizong Cheng. Mean shift, mode seeking, and clustering. IEEE Trans. PAMI, 17(8):790–799, 1995.
[11] Dorin Comaniciu and Peter Meer. Mean shift: A robust approach toward feature space analysis. IEEE Trans. PAMI, 24(5):603–619, May 2002.
[12] Miguel Á. Carreira-Perpiñán. Generalised blurring mean-shift algorithms for nonparametric clustering. CVPR 2008, Anchorage, AK, June 23–28, 2008.
[13] Gabriel Taubin. A signal processing approach to fair surface design. SIGGRAPH 1995, pages 351–358.
[14] Mathieu Desbrun, Mark Meyer, Peter Schröder, and Alan H. Barr. Implicit fairing of irregular meshes using diffusion and curvature flow. SIGGRAPH 1999, pages 317–324.
[15] Fan R. K. Chung. Spectral Graph Theory. American Mathematical Society, Providence, RI, 1997.
[16] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Trans. PAMI, 22(8):888–905, August 2000.
[17] Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, 1986.
[18] Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer-Verlag, New York, second edition, 2006.
[19] Tapio Schneider. Analysis of incomplete climate data: Estimation of mean values and covariance matrices and imputation of missing values. Journal of Climate, 14(5):853–871, March 2001.
[20] Matthias Scholz, Fatma Kaplan, Charles L. Guy, Joachim Kopka, and Joachim Selbig. Non-linear PCA: A missing data approach. Bioinformatics, 21(20):3887–3895, October 15, 2005.
[21] Yigang Peng, Arvind Ganesh, John Wright, Wenli Xu, and Yi Ma. RASL: Robust alignment by sparse and low-rank decomposition for linearly correlated images. CVPR 2010, pages 763–770, 2010.
[22] A. M. Buchanan and A. W. Fitzgibbon. Damped Newton algorithms for matrix factorization with missing data. CVPR 2005, pages 316–322, San Diego, CA, June 20–25, 2005.
[23] Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. Multi-task feature learning. NIPS 2006, 19:41–48. MIT Press, 2007.
[24] Albert Gifi. Nonlinear Multivariate Analysis. John Wiley & Sons, 1990.
[25] Neil D. Lawrence and Raquel Urtasun. Non-linear matrix factorization with Gaussian processes. ICML 2009, Montreal, Canada, June 14–18, 2009.
[26] Miguel Á. Carreira-Perpiñán and Zhengdong Lu. Manifold learning and missing data recovery through unsupervised regression. ICDM 2011, December 11–14, 2011.
[27] Richard G. Baraniuk and Michael B. Wakin. Random projections of smooth manifolds. Foundations of Computational Mathematics, 9(1):51–77, February 2009.
[28] Minhua Chen, Jorge Silva, John Paisley, Chunping Wang, David Dunson, and Lawrence Carin. Compressive sensing on manifolds using a nonparametric mixture of factor analyzers: Algorithm and performance bounds. IEEE Trans. Signal Processing, 58(12):6140–6155, December 2010.
[29] Michael B. Wakin. A manifold lifting algorithm for multi-view compressive imaging. In Proc. 27th Conference on Picture Coding Symposium (PCS'09), pages 381–384, 2009.
4 0.56940776 211 nips-2011-Penalty Decomposition Methods for Rank Minimization
Author: Yong Zhang, Zhaosong Lu
Abstract: In this paper we consider general rank minimization problems with rank appearing in either objective function or constraint. We first show that a class of matrix optimization problems can be solved as lower dimensional vector optimization problems. As a consequence, we establish that a class of rank minimization problems have closed form solutions. Using this result, we then propose penalty decomposition methods for general rank minimization problems. The convergence results of the PD methods have been shown in the longer version of the paper [19]. Finally, we test the performance of our methods by applying them to matrix completion and nearest low-rank correlation matrix problems. The computational results demonstrate that our methods generally outperform the existing methods in terms of solution quality and/or speed. 1
5 0.53612149 248 nips-2011-Semi-supervised Regression via Parallel Field Regularization
Author: Binbin Lin, Chiyuan Zhang, Xiaofei He
Abstract: This paper studies the problem of semi-supervised learning from the vector field perspective. Many of the existing works use the graph Laplacian to ensure the smoothness of the prediction function on the data manifold. However, beyond smoothness, it is suggested by recent theoretical work that we should ensure second order smoothness for achieving faster rates of convergence for semisupervised regression problems. To achieve this goal, we show that the second order smoothness measures the linearity of the function, and the gradient field of a linear function has to be a parallel vector field. Consequently, we propose to find a function which minimizes the empirical error, and simultaneously requires its gradient field to be as parallel as possible. We give a continuous objective function on the manifold and discuss how to discretize it by using random points. The discretized optimization problem turns out to be a sparse linear system which can be solved very efficiently. The experimental results have demonstrated the effectiveness of our proposed approach. 1
6 0.525464 73 nips-2011-Divide-and-Conquer Matrix Factorization
7 0.50101775 287 nips-2011-The Manifold Tangent Classifier
8 0.47927329 297 nips-2011-Universal low-rank matrix recovery from Pauli measurements
9 0.475182 257 nips-2011-SpaRCS: Recovering low-rank and sparse matrices from compressive measurements
10 0.4529247 203 nips-2011-On the accuracy of l1-filtering of signals with block-sparse structure
11 0.44435498 102 nips-2011-Generalised Coupled Tensor Factorisation
12 0.43615821 165 nips-2011-Matrix Completion for Multi-label Image Classification
13 0.43551165 270 nips-2011-Statistical Performance of Convex Tensor Decomposition
14 0.42862368 83 nips-2011-Efficient inference in matrix-variate Gaussian models with iid observation noise
15 0.38860875 262 nips-2011-Sparse Inverse Covariance Matrix Estimation Using Quadratic Approximation
16 0.37588018 65 nips-2011-Convergent Fitted Value Iteration with Linear Function Approximation
17 0.36835742 264 nips-2011-Sparse Recovery with Brownian Sensing
18 0.36737487 51 nips-2011-Clustered Multi-Task Learning Via Alternating Structure Optimization
19 0.36009017 71 nips-2011-Directed Graph Embedding: an Algorithm based on Continuous Limits of Laplacian-type Operators
20 0.35309872 107 nips-2011-Global Solution of Fully-Observed Variational Bayesian Matrix Factorization is Column-Wise Independent
topicId topicWeight
[(0, 0.046), (4, 0.028), (10, 0.011), (20, 0.027), (26, 0.026), (31, 0.044), (33, 0.014), (43, 0.054), (45, 0.084), (57, 0.034), (74, 0.037), (83, 0.064), (87, 0.363), (99, 0.077)]
simIndex simValue paperId paperTitle
same-paper 1 0.77180654 230 nips-2011-RTRMC: A Riemannian trust-region method for low-rank matrix completion
Author: Nicolas Boumal, Pierre-antoine Absil
Abstract: We consider large matrices of low rank. We address the problem of recovering such matrices when most of the entries are unknown. Matrix completion finds applications in recommender systems. In this setting, the rows of the matrix may correspond to items and the columns may correspond to users. The known entries are the ratings given by users to some items. The aim is to predict the unobserved ratings. This problem is commonly stated in a constrained optimization framework. We follow an approach that exploits the geometry of the low-rank constraint to recast the problem as an unconstrained optimization problem on the Grassmann manifold. We then apply first- and second-order Riemannian trust-region methods to solve it. The cost of each iteration is linear in the number of known entries. Our methods, RTRMC 1 and 2, outperform state-of-the-art algorithms on a wide range of problem instances. 1
2 0.58662492 153 nips-2011-Learning large-margin halfspaces with more malicious noise
Author: Phil Long, Rocco Servedio
Abstract: We describe a simple algorithm that runs in time poly(n, 1/γ, 1/ε) and learns an unknown n-dimensional γ-margin halfspace to accuracy 1 − ε in the presence of malicious noise, when the noise rate is allowed to be as high as Θ(εγ log(1/γ)). Previous efficient algorithms could only learn to accuracy ε in the presence of malicious noise of rate at most Θ(εγ). Our algorithm does not work by optimizing a convex loss function. We show that no algorithm for learning γ-margin halfspaces that minimizes a convex proxy for misclassification error can tolerate malicious noise at a rate greater than Θ(εγ); this may partially explain why previous algorithms could not achieve the higher noise tolerance of our new algorithm. 1
3 0.57607335 17 nips-2011-Accelerated Adaptive Markov Chain for Partition Function Computation
Author: Stefano Ermon, Carla P. Gomes, Ashish Sabharwal, Bart Selman
Abstract: We propose a novel Adaptive Markov Chain Monte Carlo algorithm to compute the partition function. In particular, we show how to accelerate a flat histogram sampling technique by significantly reducing the number of “null moves” in the chain, while maintaining asymptotic convergence properties. Our experiments show that our method converges quickly to highly accurate solutions on a range of benchmark instances, outperforming other state-of-the-art methods such as IJGP, TRW, and Gibbs sampling both in run-time and accuracy. We also show how obtaining a so-called density of states distribution allows for efficient weight learning in Markov Logic theories. 1
4 0.51624089 210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits
Author: Yevgeny Seldin, Peter Auer, John S. Shawe-taylor, Ronald Ortner, François Laviolette
Abstract: We derive an instantaneous (per-round) data-dependent regret bound for stochastic multiarmed bandits with side information (also known as contextual bandits). The scaling of our regret bound with the number of states (contexts) N goes as $\sqrt{N I_{\rho_t}(S;A)}$, where $I_{\rho_t}(S;A)$ is the mutual information between states and actions (the side information) used by the algorithm at round t. If the algorithm uses all the side information, the regret bound scales as $\sqrt{N \ln K}$, where K is the number of actions (arms). However, if the side information $I_{\rho_t}(S;A)$ is not fully used, the regret bound is significantly tighter. In the extreme case, when $I_{\rho_t}(S;A) = 0$, the dependence on the number of states reduces from linear to logarithmic. Our analysis allows us to provide the algorithm with a large amount of side information, let the algorithm decide which side information is relevant for the task, and penalize the algorithm only for the side information that it actually uses. We also present an algorithm for multiarmed bandits with side information with O(K) computational complexity per game round. 1
5 0.44360554 29 nips-2011-Algorithms and hardness results for parallel large margin learning
Author: Phil Long, Rocco Servedio
Abstract: We study the fundamental problem of learning an unknown large-margin halfspace in the context of parallel computation. Our main positive result is a parallel algorithm for learning a large-margin halfspace that is based on interior point methods from convex optimization and fast parallel algorithms for matrix computations. We show that this algorithm learns an unknown γ-margin halfspace over n dimensions using poly(n, 1/γ) processors and runs in time Õ(1/γ) + O(log n). In contrast, naive parallel algorithms that learn a γ-margin halfspace in time that depends polylogarithmically on n have Ω(1/γ^2) runtime dependence on γ. Our main negative result deals with boosting, which is a standard approach to learning large-margin halfspaces. We give an information-theoretic proof that in the original PAC framework, in which a weak learning algorithm is provided as an oracle that is called by the booster, boosting cannot be parallelized: the ability to call the weak learner multiple times in parallel within a single boosting stage does not reduce the overall number of successive stages of boosting that are required. 1
6 0.40161619 257 nips-2011-SpaRCS: Recovering low-rank and sparse matrices from compressive measurements
7 0.39700398 186 nips-2011-Noise Thresholds for Spectral Clustering
8 0.38911191 102 nips-2011-Generalised Coupled Tensor Factorisation
9 0.38844258 270 nips-2011-Statistical Performance of Convex Tensor Decomposition
10 0.38456458 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions
11 0.38190997 135 nips-2011-Information Rates and Optimal Decoding in Large Neural Populations
12 0.38084236 37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes
13 0.38060227 202 nips-2011-On the Universality of Online Mirror Descent
14 0.37978062 263 nips-2011-Sparse Manifold Clustering and Embedding
15 0.3788327 71 nips-2011-Directed Graph Embedding: an Algorithm based on Continuous Limits of Laplacian-type Operators
16 0.37865332 144 nips-2011-Learning Auto-regressive Models from Sequence and Non-sequence Data
17 0.37849599 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity
18 0.37739822 271 nips-2011-Statistical Tests for Optimization Efficiency
19 0.37725231 185 nips-2011-Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction
20 0.37707543 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries