jmlr jmlr2012 jmlr2012-108 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Nicolas Gillis
Abstract: Nonnegative matrix factorization (NMF) has become a very popular technique in machine learning because it automatically extracts meaningful features through a sparse and part-based representation. However, NMF has the drawback of being highly ill-posed, that is, there typically exist many different but equivalent factorizations. In this paper, we introduce a completely new way of obtaining more well-posed NMF problems whose solutions are sparser. Our technique is based on the preprocessing of the nonnegative input data matrix, and relies on the theory of M-matrices and the geometric interpretation of NMF. This approach provably leads to optimal and sparse solutions under the separability assumption of Donoho and Stodden (2003), and, for rank-three matrices, makes the number of exact factorizations finite. We illustrate the effectiveness of our technique on several image data sets. Keywords: nonnegative matrix factorization, data preprocessing, uniqueness, sparsity, inverse-positive matrices
Reference: text
sentIndex sentText sentNum sentScore
1 Our technique is based on the preprocessing of the nonnegative input data matrix, and relies on the theory of M-matrices and the geometric interpretation of NMF. [sent-6, score-0.327]
2 Keywords: nonnegative matrix factorization, data preprocessing, uniqueness, sparsity, inverse-positive matrices 1. [sent-9, score-0.268]
3 Introduction Given an m-by-n nonnegative matrix M ≥ 0 and a factorization rank r, nonnegative matrix factorization (NMF) looks for two nonnegative matrices U and V of dimension m-by-r and r-by-n respectively such that M ≈ UV . [sent-10, score-0.889]
4 Since M:j ≈ ∑_{k=1}^{r} U:k Vkj ∀ j, each column M:j of M is reconstructed using an additive linear combination of nonnegative basis elements (the columns of U). [sent-13, score-0.315]
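To make the factorization model concrete, here is a minimal NumPy sketch that computes M ≈ UV with nonnegative factors using standard multiplicative updates; this is only an illustration (the experiments in the paper rely on an accelerated HALS algorithm), and the toy matrix, rank and iteration count below are arbitrary placeholders.

```python
import numpy as np

def nmf_multiplicative(M, r, n_iter=200, eps=1e-10, seed=0):
    """Return nonnegative U (m-by-r) and V (r-by-n) with M approximately equal to U @ V."""
    rng = np.random.default_rng(seed)
    m, n = M.shape
    U = rng.random((m, r)) + eps
    V = rng.random((r, n)) + eps
    for _ in range(n_iter):
        V *= (U.T @ M) / (U.T @ U @ V + eps)   # update V with U fixed
        U *= (M @ V.T) / (U @ V @ V.T + eps)   # update U with V fixed
    return U, V

M = np.abs(np.random.default_rng(1).random((20, 30)))   # toy nonnegative data
U, V = nmf_multiplicative(M, r=5)
print(np.linalg.norm(M - U @ V) / np.linalg.norm(M))    # relative residual
```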
5 Therefore, the columns of θ(M) are convex combinations (linear combinations with nonnegative weights summing to one) of the columns of θ(U). [sent-38, score-0.372]
6 An exact NMF M = UV can then be geometrically interpreted as a polytope T = conv(θ(U)) nested between an inner polytope conv(θ(M)) and an outer polytope ∆m . [sent-40, score-0.394]
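As a small illustration of the map θ used in this geometric interpretation, the sketch below normalizes the non-zero columns of a nonnegative matrix so that they sum to one and hence lie on the simplex ∆m; the column-sum normalization and the tolerance are assumptions consistent with the outer polytope ∆m, not code taken from the paper.

```python
import numpy as np

def theta(X, tol=1e-12):
    """Normalize the non-zero columns of X >= 0 so that each sums to one."""
    s = X.sum(axis=0)
    keep = s > tol                      # discard (near-)zero columns
    return X[:, keep] / s[keep]

M = np.array([[1.0, 0.0, 2.0],
              [1.0, 3.0, 0.0],
              [2.0, 1.0, 2.0]])
print(theta(M).sum(axis=0))             # every column of theta(M) sums to one
```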
7 Hence, finding the minimal number of nonnegative rank-one factors needed to reconstruct M exactly is equivalent to finding a polytope T with the minimum number of vertices nested between two given polytopes: the inner polytope conv(θ(M)) and the outer polytope ∆m. [sent-41, score-0.648]
8 This problem is referred to as the nested polytopes problem (NPP), and is then equivalent to computing an exact nonnegative matrix factorization (Hazewinkel, 1984); see also Gillis and Glineur (2012a) and the references therein. [sent-42, score-0.359]
9 In the remainder of the paper, we will denote by NPP(M) the NPP instance corresponding to M, with inner polytope conv(θ(M)) and outer polytope ∆m. [sent-43, score-0.228]
10 Geometrically, requiring the matrix U to be sparse is equivalent to requiring the vertices of the nested polytope conv(θ(U)) to be located on the low-dimensional faces of the outer polytope ∆m , hence making the problem more well posed. [sent-61, score-0.522]
11 Volume maximization of conv(θ(U)) is also possible, leading to a sparser factor U (since the columns of U will be encouraged to be on the faces of ∆m ), see Wang et al. [sent-69, score-0.217]
12 Geometrically, it amounts to positioning the vertices of conv(θ(U)) on the low-dimensional faces of ∆m so that if one of the columns of θ(U) is not on a facet of ∆m (that is, Uik > 0 for some i, k), then all the other columns of U must be on that facet (that is, Uip = 0 ∀ p ≠ k). [sent-78, score-0.338]
13 Our technique is based on a preprocessing of the input matrix M to make it sparser while preserving its nonnegativity and its column space. [sent-84, score-0.325]
14 The motivation is based on the geometric interpretation of NMF which shows that sparser matrices will correspond to more well-posed NMF problems whose solutions are sparser. [sent-85, score-0.201]
15 In Section 3, we introduce a preprocessing P (M) = MQ of M where Q is an inverse-positive matrix, that is, Q has full rank and its inverse Q−1 is nonnegative. [sent-88, score-0.25]
16 Moreover, in the exact case for rank-three matrices (that is, M = UV and rank(M) = 3), we show how the preprocessing can be used to obtain an equivalent NMF problem with a finite number of solutions. [sent-91, score-0.201]
17 Non-Uniqueness, Geometry and Sparsity Let M ∈ R^{m×n}_+ and (U,V) ∈ R^{m×r}_+ × R^{r×n}_+ be an exact nonnegative matrix factorization of M, that is, M = UV. [sent-95, score-0.296]
18 The minimum r for which such a decomposition exists is the nonnegative rank of M and will be denoted rank+(M). [sent-96, score-0.29]
19 However, if all columns of θ(M) are located on k-dimensional faces of T having exactly k + 1 vertices, then the convex combinations given by V are unique (Sun and Xin, 2011). [sent-99, score-0.203]
20 It is interesting to notice that the columns of M containing zero entries correspond to points located on the boundary of the outer polytope ∆m, and these points must be on the boundary of any solution T of NPP(M). [sent-107, score-0.374]
21 In particular, Donoho and Stodden (2003) showed that “requiring that some of the data are spread across the faces of the nonnegative orthant, there is a unique simplicial cone”, that is, there is a unique conv(θ(U)). [sent-109, score-0.254]
22 If r columns of θ(M) coincide with r different vertices of ∆m ∩ col(θ(M)), then the exact NMF of M is unique. [sent-112, score-0.232]
23 Since r columns of θ(M) coincide with r vertices of ∆m ∩ col(θ(M)), we have that conv(θ(U)) = conv(θ(M)) is the unique solution of NPP(M), and Theorem 3 allows us to conclude. [sent-115, score-0.257]
24 It is interesting to notice that this result implies that the only 3-by-3 rank-three nonnegative matrices having a unique exact NMF are the monomial matrices (permutations and scalings of the identity matrix), since all other matrices have at least two distinct exact NMFs: M = MI = IM. [sent-131, score-0.447]
25 Finally, although sparsity is neither a necessary (see Remark 7 below) nor a sufficient condition for uniqueness (except in some cases, see for example Theorem 6 or Donoho and Stodden, 2003), the geometric interpretation of NMF shows that sparser matrices M lead to more well-posed NMF problems. [sent-132, score-0.25]
26 In fact, many points of the inner polytope in NPP(M) are located on the boundary of the outer polytope ∆m . [sent-133, score-0.292]
27 As it was shown in the previous paragraph, this can be achieved by working with sparser nonnegative matrices. [sent-148, score-0.261]
28 For this reason, we will restrict the search space to the subset of Z-matrices, that is, inverse-positive matrices of the form Q = sI − B, where s is a nonnegative scalar, I is the identity matrix of appropriate dimension, and B is a nonnegative matrix such that ρ(B) < s; see Section 3. [sent-200, score-0.487]
29 This means that we will subtract from each column of M a nonnegative linear combination of the other columns of M in order to maximize its sparsity while keeping its nonnegativity; see Equation (7). [sent-215, score-0.366]
30 Properties of the Preprocessing In the remainder of the paper, we denote by B∗(M) the set of optimal solutions of problem (8) for the data matrix M, and by P the preprocessing operator defined as P : R^{m×n} → R^{m×n} : M ↦ P(M) = M(I − B∗), where B∗ ∈ B∗(M). [sent-228, score-0.216]
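The following is a hedged sketch of the column-wise computation behind P(M) = M(I − B∗): for each column it looks for nonnegative weights on the other columns whose subtraction shrinks the column while keeping it nonnegative. The paper solves a constrained least-squares problem (problem (8), typically with a solver such as lsqlin or quadprog); the NNLS-plus-scaling heuristic below only approximates that step, and every name in it is illustrative.

```python
import numpy as np
from scipy.optimize import nnls

def preprocess_columns(M):
    """Heuristic stand-in for problem (8): return (M(I - B), B) with B >= 0, diag(B) = 0."""
    m, n = M.shape
    B = np.zeros((n, n))
    for i in range(n):
        others = [k for k in range(n) if k != i]
        A = M[:, others]
        b, _ = nnls(A, M[:, i])          # b >= 0 minimizing ||M[:, i] - A b||_2
        combo = A @ b
        # back off so that the preprocessed column stays nonnegative
        pos = combo > 1e-12
        t = 1.0 if not pos.any() else min(1.0, np.min(M[pos, i] / combo[pos]))
        B[others, i] = t * b
    return M @ (np.eye(n) - B), B        # P(M) = M(I - B) >= 0 by construction

M = np.abs(np.random.default_rng(0).random((10, 6)))
P, B = preprocess_columns(M)
print(P.min() >= -1e-9)                  # the preprocessed matrix remains nonnegative
```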
31 • The preprocessing operator P is invariant to permutation and scaling of the columns of M (Lemma 16). [sent-230, score-0.275]
32 • If the matrix M is separable, then the preprocessing allows us to recover a sparse and optimal solution of the corresponding NMF problem (Theorem 24). [sent-234, score-0.203]
33 • If the matrix has rank three, then the preprocessing yields an instance in which the number of solutions of the exact NMF problem is finite (Theorem 29). [sent-236, score-0.237]
34 Another important property of the preprocessing is its invariance to permutation and scaling of the columns of M. [sent-242, score-0.275]
35 Lemma 16 Let M be a nonnegative matrix and P be a monomial matrix. [sent-243, score-0.264]
36 We extend the definition to matrices with zero columns as follows: θ(X) is the matrix whose columns are the normalized non-zero columns of X, that is, letting Y be the matrix X with its zero columns removed, we define θ(X) = θ(Y). [sent-265, score-0.497]
37 Another straightforward property is that the preprocessing can only inflate the convex hull defined by the columns of θ(M). [sent-267, score-0.265]
38 Because the vertices of θ(M) are non-repeated, we have M:i ∉ conv(θ(M(:, J))), while P(M):i = M:i − ∑_{k≠i} bki M:k ⟺ M:i = P(M):i + ∑_{k≠i} bki M:k. [sent-275, score-0.295]
39 In fact, a preprocessed zero column remains zero while it cannot influence the preprocessing of the other columns (see Equation (7)). [sent-281, score-0.4]
40 We now prove that if no column of M is a multiple of another column (that is, the columns of θ(M) are distinct), then ρ(B∗) < 1 for any B∗ ∈ B∗(M), whence Q = I − B∗ is an inverse-positive matrix. [sent-285, score-0.2]
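A quick numerical illustration of why ρ(B∗) < 1 matters: for any nonnegative B with spectral radius below one, Q = I − B is a nonsingular M-matrix and Q⁻¹ = ∑_k B^k is nonnegative. The matrix B below is a random stand-in, not one produced by the preprocessing.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.random((6, 6))
B *= 0.9 / np.max(np.abs(np.linalg.eigvals(B)))   # force rho(B) = 0.9 < 1

Q = np.eye(6) - B
print(np.max(np.abs(np.linalg.eigvals(B))) < 1)   # rho(B) < 1
print(np.all(np.linalg.inv(Q) >= -1e-12))         # Q^{-1} = sum_k B^k is nonnegative
```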
41 We can assume without loss of generality that the first r columns of M correspond to the vertices of conv(θ(M)). [sent-377, score-0.211]
42 In fact, by assumption, the last columns of M belong to the convex cone of the first r ones and can then be set to zero (which is optimal) using only the first r columns (cf. [sent-379, score-0.23]
43 Lemma 20 applies to the matrix Q1 and M(:, 1:r) since MQ(:, 1:r) = M(:, 1:r) − M(:, 1:r)B∗_1 ≥ 0, while by assumption no column of M(:, 1:r) belongs to the convex hull of the other columns, so that Q1 is strictly diagonally dominant, hence a nonsingular M-matrix. [sent-381, score-0.26]
44 In that case, the preprocessing is unique and the preprocessed matrix has the same rank as the original one. [sent-383, score-0.445]
45 In fact, given an NMF (U,V′) of the preprocessed matrix P(M) = MQ ≈ UV′, we can obtain the optimal factor V for matrix M by solving the nonnegative least squares problem V = argmin_{X≥0} ||M − UX||_F^2 (instead of taking V = V′Q⁻¹) and obtain M ≈ UV. [sent-385, score-0.412]
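A sketch of this post-processing step: given the basis U obtained on the preprocessed matrix, the factor V for the original M is recovered by nonnegative least squares, column by column. The matrices below are random placeholders; scipy.optimize.nnls is used here as the per-column solver, which is one possible choice rather than the paper's prescription.

```python
import numpy as np
from scipy.optimize import nnls

def recover_V(M, U):
    """argmin_{X >= 0} ||M - U X||_F, solved one column at a time."""
    r, n = U.shape[1], M.shape[1]
    V = np.zeros((r, n))
    for j in range(n):
        V[:, j], _ = nnls(U, M[:, j])
    return V

rng = np.random.default_rng(2)
M = np.abs(rng.random((15, 8)))
U = np.abs(rng.random((15, 3)))
V = recover_V(M, U)
print(V.min() >= 0, np.linalg.norm(M - U @ V) <= np.linalg.norm(M))
```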
46 2 Recovery Under Separability A nonnegative matrix M is called separable if it can be written as M = UV where U ∈ R^{m×r}_+, V ∈ R^{r×n}_+, and, for each i = 1, . . . , r, the i-th column of U appears (possibly up to a scaling factor) as a column of M. [sent-387, score-0.241]
47 Arora et al. (2012) showed that the NMF problem corresponding to a separable nonnegative matrix can be solved in polynomial time (while NMF is NP-hard in general; see Introduction). [sent-393, score-0.241]
48 In this section, we show that the preprocessing is able to solve this problem while generating a sparser solution than the one obtained with the algorithm of Arora et al. [sent-394, score-0.245]
49 Geometrically, separability means that the vertices of conv(θ(M)) are given by the columns of θ(U). [sent-399, score-0.211]
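A hedged illustration of separability consistent with this geometric picture: build M = UV where V contains an identity block, so that every column of U literally appears among the columns of M. The dimensions and the absence of scaling and permutation are simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
m, r, n = 10, 3, 12
U = np.abs(rng.random((m, r)))
V = np.hstack([np.eye(r), np.abs(rng.random((r, n - r)))])  # identity block => separable
M = U @ V
print(np.allclose(M[:, :r], U))   # the columns of U appear verbatim among those of M
```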
50 Theorem 24 shows that the preprocessing is able to identify the r columns of M = UV corresponding to the vertices of θ(M). [sent-407, score-0.342]
51 Moreover, it returns a sparser matrix S, namely P (U), whose cone contains the columns of M. [sent-408, score-0.255]
52 Corollary 25 For any rank-two nonnegative matrix M whose columns are not multiples of each other, P (M) has only two non-zero columns, say S:1 and S:2 such that conv(θ(M)) ⊆ conv(θ(S)), that is, there exists R ≥ 0 such that M = SR. [sent-410, score-0.33]
53 In other words, the preprocessing technique is optimal as it is able to identify an optimal nonnegative basis for the NMF problem corresponding to the matrix M. [sent-411, score-0.35]
54 3 Uniqueness and Robustness Through Preprocessing A potential drawback of the preprocessing is that it might increase the nonnegative rank of M. [sent-445, score-0.421]
55 Lemma 26 Let M be a nonnegative matrix such that the vertices of conv(θ(M)) are non-repeated. [sent-449, score-0.342]
56 Lemma 27 Let M be a nonnegative matrix such that the vertices of conv(θ(M)) are non-repeated; then the supremum ᾱ = sup { α ∈ [0, 1] : rank+(P^α(M)) = rank+(M) } (11) is attained. [sent-454, score-0.342]
57 In fact, if M:i = 0 for some i then P α (M):i = 0 for all α ∈ [0, 1] so that the nonnegative rank of P α (M) is not affected by the zero columns of M. [sent-456, score-0.378]
58 Finally, the result follows from the upper-semicontinuity of the nonnegative rank (Bocci et al. [sent-459, score-0.29]
59 1): ‘If P is a nonnegative matrix, without zero columns and with rank+ (P) = k, then there exists a ball B (P, ε) centered at P and of radius ε > 0 such that rank+ (N) ≥ k for all N ∈ B (P, ε)’. [sent-461, score-0.259]
60 Hence working with the matrix P^ᾱ(M) instead of M will reduce the number of solutions of the NMF problem while preserving the nonnegative rank: Theorem 28 Let M be a nonnegative matrix for which the vertices of conv(θ(M)) are non-repeated, and let ᾱ be defined as in Equation (11). [sent-463, score-0.598]
61 In Example 2, the vertices of M can be perturbed and, as long as they remain inside the square defined by conv(P^ᾱ(M)) (see Figure 2), the exact NMF of conv(P^ᾱ(M)) will provide an exact NMF for the perturbed matrix M. [sent-486, score-0.213]
62 Aggarwal et al. (1989) showed that x(t1) can be taken as a vertex of a feasible solution of NPP(P^ᾱ(M)) with k vertices if and only if fk(t1) = tk+1 ≥ t1 + 1, that is, we were able to turn around Q inside P in k + 1 steps (in fact, x(t1), x(t2), . [sent-509, score-0.252]
63 Aggarwal et al. (1989) also showed that the function fk is continuous, non-decreasing, and depends continuously on the vertices of Q (see also Appendix A). [sent-514, score-0.196]
64 Moreover, the vertices of Q are located on the boundary of P (because ᾱ = 1) on at least two different sides of P (three vertices cannot be on the same side). [sent-526, score-0.31]
65 Intuitively, the preprocessing P α (M) of M grows the inner polytope Q as long as the corresponding NPP instance has a solution with rank+ (M) vertices. [sent-563, score-0.246]
66 In fact, checking whether the nonnegative rank of an m-by-n matrix is equal to rank(M) can be done in polynomial time in m and n provided that the rank is fixed (Arora et al. [sent-569, score-0.409]
67 2 Normalization of the Columns of P (M) Since the aim eventually is to provide a good approximate NMF to the original data matrix M, we observed that normalizing the columns of the preprocessed matrix P (M) to match the norm of the corresponding columns of M gives better results. [sent-590, score-0.397]
68 This scaling does not change the nice properties of the preprocessing since D is a monomial matrix, hence QD still is an inverse-positive matrix. [sent-592, score-0.216]
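A sketch of this renormalization: rescale each column of the preprocessed matrix so that its norm matches that of the corresponding column of M, which amounts to multiplying by a diagonal (hence monomial) matrix D. The use of the ℓ2-norm and the tolerance are assumptions; the extract above does not pin these details down.

```python
import numpy as np

def renormalize(P, M, tol=1e-12):
    """Rescale columns of P so their l2-norms match those of M; returns (P D, D)."""
    pn = np.linalg.norm(P, axis=0)
    mn = np.linalg.norm(M, axis=0)
    d = np.where(pn > tol, mn / np.maximum(pn, tol), 1.0)
    return P * d, np.diag(d)             # D is diagonal, hence monomial

rng = np.random.default_rng(4)
M = np.abs(rng.random((8, 5)))
P = 0.5 * M                              # stand-in for a preprocessed matrix
PD, D = renormalize(P, M)
print(np.allclose(np.linalg.norm(PD, axis=0), np.linalg.norm(M, axis=0)))
```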
69 This is particularly critical if there are outliers in the data set: the outliers do not look similar to the other columns of M hence their preprocessing will not reduce much their ℓ2 -norm (because they are further away from the convex cone generated by the other columns of M). [sent-601, score-0.381]
70 In practice, this technique also allows us to obtain preprocessed matrices with more entries equal to or smaller than zero. [sent-611, score-0.2]
71 When choosing the parameter ε, it is very important to check whether ρ(B∗_ε) < 1, so that the rank of Pε(M) is equal to the rank of M and no information is lost (we can recover the original matrix M = Pε(M)(I − B∗_ε)⁻¹ given Pε(M) and B∗_ε). [sent-612, score-0.286]
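A small consistency check of this point: whenever ρ(B∗) < 1, the matrix I − B∗ is invertible and M = Pε(M)(I − B∗)⁻¹, so no information is lost. The matrices below are random stand-ins rather than outputs of the actual preprocessing.

```python
import numpy as np

rng = np.random.default_rng(5)
M = np.abs(rng.random((7, 4)))
B = np.abs(rng.random((4, 4)))
np.fill_diagonal(B, 0.0)
B *= 0.5 / np.max(np.abs(np.linalg.eigvals(B)))   # enforce rho(B) < 1

P = M @ (np.eye(4) - B)                           # preprocessed matrix
M_rec = P @ np.linalg.inv(np.eye(4) - B)          # recover the original data
print(np.max(np.abs(np.linalg.eigvals(B))) < 1, np.allclose(M, M_rec))
```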
72 As we will see, this will highlight certain localized parts of these images, essentially because the preprocessed matrices are sparser than the original ones. [sent-616, score-0.264]
73 We will then show that combining the preprocessing with standard NMF algorithms naturally leads to better part-based decompositions, because sparser matrices lead to sparser NMF solutions, see Section 2. [sent-617, score-0.36]
74 A direct comparison between NMF applied to the original matrix and NMF applied to the preprocessed matrix is not very informative in itself: while the former will feature a lower approximation error, the latter will provide a sparser part-based representation. [sent-618, score-0.311]
75 Notice that the preprocessed matrix may contain negative entries (when ε > 0), which is handled by A-HALS. [sent-633, score-0.199]
76 Therefore, when indicating the sparsity of the preprocessed matrix, negative entries will be counted as zeros as they lead to even sparser NMF decompositions. [sent-637, score-0.292]
77 The negative entries of the preprocessed matrix Pε (M) for ε > 0 will be counted as zeros. [sent-659, score-0.199]
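A sketch of the sparsity measure used for the tables under this convention: the fraction of entries counted as zeros, with negative entries of the preprocessed matrix included in that count. The tolerance is an illustrative assumption.

```python
import numpy as np

def sparsity(X, tol=1e-9):
    """Fraction of entries counted as zeros (entries <= 0 included, as stated above)."""
    return float(np.mean(X <= tol))

X = np.array([[0.0, 0.5, -0.1],
              [0.2, 0.0,  0.0]])
print(sparsity(X))   # 4 of 6 entries -> about 0.67
```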
78 Table 1 reports the sparsity and the value of ρ(B∗_ε) for the preprocessed matrices for different values of the parameter ε. [sent-668, score-0.225]
79 Table 1: CBCL data set: sparsity of the preprocessed matrices Pε(M) = MQ and corresponding spectral radius of B∗ = I − Q. [sent-680, score-0.247]
80 Table 3: Hubble data set: sparsity of the preprocessed matrices Pε(M) = MD and corresponding spectral radius of B∗ = I − D. [sent-794, score-0.247]
81 It is based on the preprocessing of the nonnegative data matrix M: given M, we compute an inverse-positive matrix Q such that the preprocessed matrix P(M) = MQ remains nonnegative and is sparse. [sent-845, score-0.742]
82 We proved that the preprocessing is well-defined, invariant to permutation and scaling of the columns of matrix M, and preserves the rank of M (as long as the vertices of conv(θ(M)) are non-repeated). [sent-847, score-0.565]
83 In particular, we were able to show that • Under the separability assumption of Donoho and Stodden (2003), the preprocessing is optimal as it identifies the vertices of the convex hull of the columns of M. [sent-849, score-0.41]
84 • Since any rank-two matrix satisfies the separability assumption, the preprocessing is optimal for any nonnegative rank-two matrix. [sent-850, score-0.372]
85 Moreover, it generates sparser preprocessed matrices hence sparser NMF solutions. [sent-854, score-0.374]
86 Another possibility would be to use the following heuristic: since the preprocessing removes from each column of M a linear combination of the other columns, one could use only a subset of k columns of M to be subtracted from the other columns of M. [sent-863, score-0.363]
87 For example, the matrix M = [0 1 1; 1 0 1; 1 1 0] would not be modified by our preprocessing (because each column contains a zero entry corresponding to positive ones in all other columns) although its NMF is not unique (cf. [sent-873, score-0.228]
88 This example shows that working with a larger set of inverse-positive matrices would allow us to obtain sparser preprocessed data matrices, hence more well-posed NMF problems with sparser solutions. [sent-876, score-0.374]
89 These points correspond to the vertices of P (P has at most m vertices since it is a polygon defined by m inequalities); or, [sent-899, score-0.306]
90 These points where the description of fk changes (and where fk is not continuously differentiable) are called the contact change points. [sent-903, score-0.197]
91 Low-dimensional polytope approximation and its applications to nonnegative matrix factorization. [sent-997, score-0.31]
92 On the equivalence of nonnegative matrix factorization and spectral clustering. [sent-1021, score-0.297]
93 Accelerated multiplicative updates and hierarchical ALS algorithms for nonnegative matrix factorization. [sent-1069, score-0.219]
94 Fast and robust recursive algorithms for separable nonnegative matrix factorization. [sent-1075, score-0.241]
95 Minimum dispersion constrained nonnegative matrix factorization to unmix hyperspectral data. [sent-1092, score-0.322]
96 Learning the parts of objects by nonnegative matrix factorization. [sent-1129, score-0.219]
97 Endmember extraction from highly mixed data using minimum volume constrained nonnegative matrix factorization. [sent-1134, score-0.219]
98 Two algorithms for orthogonal nonnegative matrix factorization with application to clustering. [sent-1163, score-0.275]
99 Guaranteed minimum rank solutions to linear matrix equations via nuclear norm minimization. [sent-1172, score-0.204]
100 Minimum-volume-constrained nonnegative matrix factorization: Enhanced ability of learning parts. [sent-1228, score-0.219]
wordName wordTfidf (topN-words)
[('nmf', 0.63), ('conv', 0.387), ('npp', 0.186), ('col', 0.182), ('nonnegative', 0.171), ('uv', 0.167), ('snmf', 0.159), ('preprocessing', 0.131), ('illis', 0.126), ('preprocessed', 0.125), ('vertices', 0.123), ('hrough', 0.12), ('nique', 0.12), ('rank', 0.119), ('mq', 0.104), ('reprocessing', 0.102), ('gillis', 0.1), ('polytope', 0.091), ('sparser', 0.09), ('columns', 0.088), ('bki', 0.086), ('aggarwal', 0.085), ('fk', 0.073), ('clls', 0.066), ('hubble', 0.066), ('polygon', 0.06), ('parse', 0.057), ('factorization', 0.056), ('column', 0.056), ('cbcl', 0.053), ('mb', 0.051), ('contact', 0.051), ('sparsity', 0.051), ('matrices', 0.049), ('matrix', 0.048), ('hyperspectral', 0.047), ('outer', 0.046), ('monomial', 0.045), ('stodden', 0.04), ('nonsingular', 0.04), ('faces', 0.039), ('solutions', 0.037), ('permutation', 0.036), ('arora', 0.036), ('boundary', 0.035), ('nested', 0.035), ('uniqueness', 0.035), ('bii', 0.034), ('constitutive', 0.033), ('glineur', 0.033), ('tk', 0.032), ('vertex', 0.032), ('rr', 0.032), ('images', 0.031), ('mp', 0.031), ('diagonally', 0.031), ('donoho', 0.029), ('located', 0.029), ('cone', 0.029), ('implying', 0.029), ('facets', 0.028), ('materials', 0.028), ('polytopes', 0.028), ('rm', 0.028), ('irreducibly', 0.027), ('entries', 0.026), ('geometric', 0.025), ('convex', 0.025), ('solution', 0.024), ('multiples', 0.023), ('remote', 0.023), ('laurberg', 0.023), ('unique', 0.022), ('spectral', 0.022), ('ud', 0.022), ('separable', 0.022), ('separability', 0.022), ('exact', 0.021), ('hull', 0.021), ('intersections', 0.021), ('ding', 0.021), ('scaling', 0.02), ('squares', 0.02), ('hence', 0.02), ('berman', 0.02), ('geoscience', 0.02), ('lsqlin', 0.02), ('mpp', 0.02), ('pauca', 0.02), ('polygons', 0.02), ('quadprog', 0.02), ('telescope', 0.02), ('uq', 0.02), ('vavasis', 0.02), ('strictly', 0.019), ('entry', 0.019), ('bp', 0.019), ('irreducible', 0.019), ('ti', 0.019), ('geometrically', 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 108 jmlr-2012-Sparse and Unique Nonnegative Matrix Factorization Through Data Preprocessing
Author: Nicolas Gillis
Abstract: Nonnegative matrix factorization (NMF) has become a very popular technique in machine learning because it automatically extracts meaningful features through a sparse and part-based representation. However, NMF has the drawback of being highly ill-posed, that is, there typically exist many different but equivalent factorizations. In this paper, we introduce a completely new way of obtaining more well-posed NMF problems whose solutions are sparser. Our technique is based on the preprocessing of the nonnegative input data matrix, and relies on the theory of M-matrices and the geometric interpretation of NMF. This approach provably leads to optimal and sparse solutions under the separability assumption of Donoho and Stodden (2003), and, for rank-three matrices, makes the number of exact factorizations finite. We illustrate the effectiveness of our technique on several image data sets. Keywords: nonnegative matrix factorization, data preprocessing, uniqueness, sparsity, inverse-positive matrices
2 0.27084213 75 jmlr-2012-NIMFA : A Python Library for Nonnegative Matrix Factorization
Author: Marinka Žitnik, Blaž Zupan
Abstract: NIMFA is an open-source Python library that provides a unified interface to nonnegative matrix factorization algorithms. It includes implementations of state-of-the-art factorization methods, initialization approaches, and quality scoring. It supports both dense and sparse matrix representation. NIMFA’s component-based implementation and hierarchical design should help the users to employ already implemented techniques or design and code new strategies for matrix factorization tasks. Keywords: nonnegative matrix factorization, initialization methods, quality measures, scripting, Python
3 0.050909273 83 jmlr-2012-Online Learning in the Embedded Manifold of Low-rank Matrices
Author: Uri Shalit, Daphna Weinshall, Gal Chechik
Abstract: When learning models that are represented in matrix forms, enforcing a low-rank constraint can dramatically improve the memory and run time complexity, while providing a natural regularization of the model. However, naive approaches to minimizing functions over the set of low-rank matrices are either prohibitively time consuming (repeated singular value decomposition of the matrix) or numerically unstable (optimizing a factored representation of the low-rank matrix). We build on recent advances in optimization over manifolds, and describe an iterative online learning procedure, consisting of a gradient step, followed by a second-order retraction back to the manifold. While the ideal retraction is costly to compute, and so is the projection operator that approximates it, we describe another retraction that can be computed efficiently. It has run time and memory complexity of O ((n + m)k) for a rank-k matrix of dimension m × n, when using an online procedure with rank-one gradients. We use this algorithm, L ORETA, to learn a matrix-form similarity measure over pairs of documents represented as high dimensional vectors. L ORETA improves the mean average precision over a passive-aggressive approach in a factorized model, and also improves over a full model trained on pre-selected features using the same memory requirements. We further adapt L ORETA to learn positive semi-definite low-rank matrices, providing an online algorithm for low-rank metric learning. L ORETA also shows consistent improvement over standard weakly supervised methods in a large (1600 classes and 1 million images, using ImageNet) multi-label image classification task. Keywords: low rank, Riemannian manifolds, metric learning, retractions, multitask learning, online learning
4 0.048477974 43 jmlr-2012-Fast Approximation of Matrix Coherence and Statistical Leverage
Author: Petros Drineas, Malik Magdon-Ismail, Michael W. Mahoney, David P. Woodruff
Abstract: The statistical leverage scores of a matrix A are the squared row-norms of the matrix containing its (top) left singular vectors and the coherence is the largest leverage score. These quantities are of interest in recently-popular problems such as matrix completion and Nystr¨ m-based low-rank o matrix approximation as well as in large-scale statistical data analysis applications more generally; moreover, they are of interest since they define the key structural nonuniformity that must be dealt with in developing fast randomized matrix algorithms. Our main result is a randomized algorithm that takes as input an arbitrary n × d matrix A, with n ≫ d, and that returns as output relative-error approximations to all n of the statistical leverage scores. The proposed algorithm runs (under assumptions on the precise values of n and d) in O(nd log n) time, as opposed to the O(nd 2 ) time required by the na¨ve algorithm that involves computing an orthogonal basis for the ı range of A. Our analysis may be viewed in terms of computing a relative-error approximation to an underconstrained least-squares approximation problem, or, relatedly, it may be viewed as an application of Johnson-Lindenstrauss type ideas. Several practically-important extensions of our basic result are also described, including the approximation of so-called cross-leverage scores, the extension of these ideas to matrices with n ≈ d, and the extension to streaming environments. Keywords: matrix coherence, statistical leverage, randomized algorithm
5 0.041048463 99 jmlr-2012-Restricted Strong Convexity and Weighted Matrix Completion: Optimal Bounds with Noise
Author: Sahand Negahban, Martin J. Wainwright
Abstract: We consider the matrix completion problem under a form of row/column weighted entrywise sampling, including the case of uniform entrywise sampling as a special case. We analyze the associated random observation operator, and prove that with high probability, it satisfies a form of restricted strong convexity with respect to weighted Frobenius norm. Using this property, we obtain as corollaries a number of error bounds on matrix completion in the weighted Frobenius norm under noisy sampling and for both exact and near low-rank matrices. Our results are based on measures of the “spikiness” and “low-rankness” of matrices that are less restrictive than the incoherence conditions imposed in previous work. Our technique involves an M-estimator that includes controls on both the rank and spikiness of the solution, and we establish non-asymptotic error bounds in weighted Frobenius norm for recovering matrices lying with ℓq -“balls” of bounded spikiness. Using information-theoretic methods, we show that no algorithm can achieve better estimates (up to a logarithmic factor) over these same sets, showing that our conditions on matrices and associated rates are essentially optimal. Keywords: matrix completion, collaborative filtering, convex optimization
6 0.041038256 88 jmlr-2012-PREA: Personalized Recommendation Algorithms Toolkit
7 0.036509871 10 jmlr-2012-A Unified View of Performance Metrics: Translating Threshold Choice into Expected Classification Loss
8 0.034053579 52 jmlr-2012-Iterative Reweighted Algorithms for Matrix Rank Minimization
9 0.033860017 67 jmlr-2012-Minimax-Optimal Rates For Sparse Additive Models Over Kernel Classes Via Convex Programming
10 0.032747239 24 jmlr-2012-Causal Bounds and Observable Constraints for Non-deterministic Models
11 0.031668469 3 jmlr-2012-A Geometric Approach to Sample Compression
12 0.031499203 103 jmlr-2012-Sampling Methods for the Nyström Method
13 0.030155063 97 jmlr-2012-Regularization Techniques for Learning with Matrices
14 0.029848071 40 jmlr-2012-Exact Covariance Thresholding into Connected Components for Large-Scale Graphical Lasso
15 0.029132614 94 jmlr-2012-Query Strategies for Evading Convex-Inducing Classifiers
16 0.027611062 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting
17 0.027139176 92 jmlr-2012-Positive Semidefinite Metric Learning Using Boosting-like Algorithms
18 0.026536029 91 jmlr-2012-Plug-in Approach to Active Learning
19 0.026354035 81 jmlr-2012-On the Convergence Rate oflp-Norm Multiple Kernel Learning
20 0.02494148 60 jmlr-2012-Local and Global Scaling Reduce Hubs in Space
topicId topicWeight
[(0, -0.128), (1, 0.045), (2, 0.102), (3, -0.008), (4, -0.053), (5, 0.055), (6, 0.047), (7, -0.11), (8, 0.102), (9, -0.37), (10, -0.287), (11, -0.228), (12, 0.142), (13, 0.027), (14, 0.1), (15, -0.043), (16, -0.226), (17, 0.103), (18, 0.106), (19, 0.374), (20, 0.065), (21, -0.087), (22, -0.025), (23, 0.008), (24, 0.01), (25, -0.13), (26, -0.104), (27, 0.117), (28, -0.003), (29, -0.018), (30, 0.02), (31, -0.041), (32, 0.044), (33, 0.024), (34, 0.012), (35, 0.025), (36, -0.029), (37, -0.029), (38, -0.037), (39, -0.013), (40, -0.009), (41, 0.024), (42, 0.002), (43, -0.026), (44, 0.006), (45, 0.006), (46, -0.047), (47, 0.017), (48, -0.019), (49, -0.033)]
simIndex simValue paperId paperTitle
same-paper 1 0.94272107 108 jmlr-2012-Sparse and Unique Nonnegative Matrix Factorization Through Data Preprocessing
Author: Nicolas Gillis
Abstract: Nonnegative matrix factorization (NMF) has become a very popular technique in machine learning because it automatically extracts meaningful features through a sparse and part-based representation. However, NMF has the drawback of being highly ill-posed, that is, there typically exist many different but equivalent factorizations. In this paper, we introduce a completely new way of obtaining more well-posed NMF problems whose solutions are sparser. Our technique is based on the preprocessing of the nonnegative input data matrix, and relies on the theory of M-matrices and the geometric interpretation of NMF. This approach provably leads to optimal and sparse solutions under the separability assumption of Donoho and Stodden (2003), and, for rank-three matrices, makes the number of exact factorizations finite. We illustrate the effectiveness of our technique on several image data sets. Keywords: nonnegative matrix factorization, data preprocessing, uniqueness, sparsity, inverse-positive matrices
2 0.92315656 75 jmlr-2012-NIMFA : A Python Library for Nonnegative Matrix Factorization
Author: Marinka Žitnik, Blaž Zupan
Abstract: NIMFA is an open-source Python library that provides a unified interface to nonnegative matrix factorization algorithms. It includes implementations of state-of-the-art factorization methods, initialization approaches, and quality scoring. It supports both dense and sparse matrix representation. NIMFA’s component-based implementation and hierarchical design should help the users to employ already implemented techniques or design and code new strategies for matrix factorization tasks. Keywords: nonnegative matrix factorization, initialization methods, quality measures, scripting, Python
3 0.23616773 83 jmlr-2012-Online Learning in the Embedded Manifold of Low-rank Matrices
Author: Uri Shalit, Daphna Weinshall, Gal Chechik
Abstract: When learning models that are represented in matrix forms, enforcing a low-rank constraint can dramatically improve the memory and run time complexity, while providing a natural regularization of the model. However, naive approaches to minimizing functions over the set of low-rank matrices are either prohibitively time consuming (repeated singular value decomposition of the matrix) or numerically unstable (optimizing a factored representation of the low-rank matrix). We build on recent advances in optimization over manifolds, and describe an iterative online learning procedure, consisting of a gradient step, followed by a second-order retraction back to the manifold. While the ideal retraction is costly to compute, and so is the projection operator that approximates it, we describe another retraction that can be computed efficiently. It has run time and memory complexity of O ((n + m)k) for a rank-k matrix of dimension m × n, when using an online procedure with rank-one gradients. We use this algorithm, L ORETA, to learn a matrix-form similarity measure over pairs of documents represented as high dimensional vectors. L ORETA improves the mean average precision over a passive-aggressive approach in a factorized model, and also improves over a full model trained on pre-selected features using the same memory requirements. We further adapt L ORETA to learn positive semi-definite low-rank matrices, providing an online algorithm for low-rank metric learning. L ORETA also shows consistent improvement over standard weakly supervised methods in a large (1600 classes and 1 million images, using ImageNet) multi-label image classification task. Keywords: low rank, Riemannian manifolds, metric learning, retractions, multitask learning, online learning
4 0.22553435 43 jmlr-2012-Fast Approximation of Matrix Coherence and Statistical Leverage
Author: Petros Drineas, Malik Magdon-Ismail, Michael W. Mahoney, David P. Woodruff
Abstract: The statistical leverage scores of a matrix A are the squared row-norms of the matrix containing its (top) left singular vectors and the coherence is the largest leverage score. These quantities are of interest in recently-popular problems such as matrix completion and Nystr¨ m-based low-rank o matrix approximation as well as in large-scale statistical data analysis applications more generally; moreover, they are of interest since they define the key structural nonuniformity that must be dealt with in developing fast randomized matrix algorithms. Our main result is a randomized algorithm that takes as input an arbitrary n × d matrix A, with n ≫ d, and that returns as output relative-error approximations to all n of the statistical leverage scores. The proposed algorithm runs (under assumptions on the precise values of n and d) in O(nd log n) time, as opposed to the O(nd 2 ) time required by the na¨ve algorithm that involves computing an orthogonal basis for the ı range of A. Our analysis may be viewed in terms of computing a relative-error approximation to an underconstrained least-squares approximation problem, or, relatedly, it may be viewed as an application of Johnson-Lindenstrauss type ideas. Several practically-important extensions of our basic result are also described, including the approximation of so-called cross-leverage scores, the extension of these ideas to matrices with n ≈ d, and the extension to streaming environments. Keywords: matrix coherence, statistical leverage, randomized algorithm
5 0.17486894 52 jmlr-2012-Iterative Reweighted Algorithms for Matrix Rank Minimization
Author: Karthik Mohan, Maryam Fazel
Abstract: The problem of minimizing the rank of a matrix subject to affine constraints has applications in several areas including machine learning, and is known to be NP-hard. A tractable relaxation for this problem is nuclear norm (or trace norm) minimization, which is guaranteed to find the minimum rank matrix under suitable assumptions. In this paper, we propose a family of Iterative Reweighted Least Squares algorithms IRLS-p (with 0 ≤ p ≤ 1), as a computationally efficient way to improve over the performance of nuclear norm minimization. The algorithms can be viewed as (locally) minimizing certain smooth approximations to the rank function. When p = 1, we give theoretical guarantees similar to those for nuclear norm minimization, that is, recovery of low-rank matrices under certain assumptions on the operator defining the constraints. For p < 1, IRLSp shows better empirical performance in terms of recovering low-rank matrices than nuclear norm minimization. We provide an efficient implementation for IRLS-p, and also present a related family of algorithms, sIRLS-p. These algorithms exhibit competitive run times and improved recovery when compared to existing algorithms for random instances of the matrix completion problem, as well as on the MovieLens movie recommendation data set. Keywords: matrix rank minimization, matrix completion, iterative algorithms, null-space property
6 0.17026162 99 jmlr-2012-Restricted Strong Convexity and Weighted Matrix Completion: Optimal Bounds with Noise
7 0.15575407 103 jmlr-2012-Sampling Methods for the Nyström Method
8 0.13703534 15 jmlr-2012-Algebraic Geometric Comparison of Probability Distributions
9 0.13649204 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models
10 0.13240831 88 jmlr-2012-PREA: Personalized Recommendation Algorithms Toolkit
11 0.12965338 10 jmlr-2012-A Unified View of Performance Metrics: Translating Threshold Choice into Expected Classification Loss
12 0.12846954 3 jmlr-2012-A Geometric Approach to Sample Compression
13 0.12527566 112 jmlr-2012-Structured Sparsity via Alternating Direction Methods
14 0.12509774 97 jmlr-2012-Regularization Techniques for Learning with Matrices
15 0.11865259 24 jmlr-2012-Causal Bounds and Observable Constraints for Non-deterministic Models
16 0.11429866 17 jmlr-2012-An Active Learning Algorithm for Ranking from Pairwise Preferences with an Almost Optimal Query Complexity
17 0.11408477 33 jmlr-2012-Distance Metric Learning with Eigenvalue Optimization
18 0.11098631 66 jmlr-2012-Metric and Kernel Learning Using a Linear Transformation
19 0.11020085 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting
20 0.10840953 40 jmlr-2012-Exact Covariance Thresholding into Connected Components for Large-Scale Graphical Lasso
topicId topicWeight
[(7, 0.018), (21, 0.032), (26, 0.031), (27, 0.012), (29, 0.032), (35, 0.021), (49, 0.014), (56, 0.017), (64, 0.013), (69, 0.032), (75, 0.043), (77, 0.02), (79, 0.019), (92, 0.082), (96, 0.115), (99, 0.404)]
simIndex simValue paperId paperTitle
same-paper 1 0.69191664 108 jmlr-2012-Sparse and Unique Nonnegative Matrix Factorization Through Data Preprocessing
Author: Nicolas Gillis
Abstract: Nonnegative matrix factorization (NMF) has become a very popular technique in machine learning because it automatically extracts meaningful features through a sparse and part-based representation. However, NMF has the drawback of being highly ill-posed, that is, there typically exist many different but equivalent factorizations. In this paper, we introduce a completely new way of obtaining more well-posed NMF problems whose solutions are sparser. Our technique is based on the preprocessing of the nonnegative input data matrix, and relies on the theory of M-matrices and the geometric interpretation of NMF. This approach provably leads to optimal and sparse solutions under the separability assumption of Donoho and Stodden (2003), and, for rank-three matrices, makes the number of exact factorizations finite. We illustrate the effectiveness of our technique on several image data sets. Keywords: nonnegative matrix factorization, data preprocessing, uniqueness, sparsity, inverse-positive matrices
2 0.40962282 67 jmlr-2012-Minimax-Optimal Rates For Sparse Additive Models Over Kernel Classes Via Convex Programming
Author: Garvesh Raskutti, Martin J. Wainwright, Bin Yu
Abstract: Sparse additive models are families of d-variate functions with the additive decomposition f ∗ = ∑ j∈S f j∗ , where S is an unknown subset of cardinality s ≪ d. In this paper, we consider the case where each univariate component function f j∗ lies in a reproducing kernel Hilbert space (RKHS), and analyze a method for estimating the unknown function f ∗ based on kernels combined with ℓ1 -type convex regularization. Working within a high-dimensional framework that allows both the dimension d and sparsity s to increase with n, we derive convergence rates in the L2 (P) and L2 (Pn ) norms over the class Fd,s,H of sparse additive models with each univariate function f j∗ in the unit ball of a univariate RKHS with bounded kernel function. We complement our upper bounds by deriving minimax lower bounds on the L2 (P) error, thereby showing the optimality of our method. Thus, we obtain optimal minimax rates for many interesting classes of sparse additive models, including polynomials, splines, and Sobolev classes. We also show that if, in contrast to our univariate conditions, the d-variate function class is assumed to be globally bounded, then much √ faster estimation rates are possible for any sparsity s = Ω( n), showing that global boundedness is a significant restriction in the high-dimensional setting. Keywords: sparsity, kernel, non-parametric, convex, minimax
3 0.36057457 64 jmlr-2012-Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning
Author: Sangkyun Lee, Stephen J. Wright
Abstract: Iterative methods that calculate their steps from approximate subgradient directions have proved to be useful for stochastic learning problems over large and streaming data sets. When the objective consists of a loss function plus a nonsmooth regularization term, the solution often lies on a lowdimensional manifold of parameter space along which the regularizer is smooth. (When an ℓ1 regularizer is used to induce sparsity in the solution, for example, this manifold is defined by the set of nonzero components of the parameter vector.) This paper shows that a regularized dual averaging algorithm can identify this manifold, with high probability, before reaching the solution. This observation motivates an algorithmic strategy in which, once an iterate is suspected of lying on an optimal or near-optimal manifold, we switch to a “local phase” that searches in this manifold, thus converging rapidly to a near-optimal point. Computational results are presented to verify the identification property and to illustrate the effectiveness of this approach. Keywords: regularization, dual averaging, partly smooth manifold, manifold identification
4 0.36037707 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches
Author: Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, Lin Xiao
Abstract: Online prediction methods are typically presented as serial algorithms running on a single processor. However, in the age of web-scale prediction problems, it is increasingly common to encounter situations where a single processor cannot keep up with the high rate at which inputs arrive. In this work, we present the distributed mini-batch algorithm, a method of converting many serial gradient-based online prediction algorithms into distributed algorithms. We prove a regret bound for this method that is asymptotically optimal for smooth convex loss functions and stochastic inputs. Moreover, our analysis explicitly takes into account communication latencies between nodes in the distributed environment. We show how our method can be used to solve the closely-related distributed stochastic optimization problem, achieving an asymptotically linear speed-up over multiple processors. Finally, we demonstrate the merits of our approach on a web-scale online prediction problem. Keywords: distributed computing, online learning, stochastic optimization, regret bounds, convex optimization
5 0.3558422 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting
Author: Matus Telgarsky
Abstract: Boosting combines weak learners into a predictor with low empirical risk. Its dual constructs a high entropy distribution upon which weak learners and training labels are uncorrelated. This manuscript studies this primal-dual relationship under a broad family of losses, including the exponential loss of AdaBoost and the logistic loss, revealing: • Weak learnability aids the whole loss family: for any ε > 0, O (ln(1/ε)) iterations suffice to produce a predictor with empirical risk ε-close to the infimum; • The circumstances granting the existence of an empirical risk minimizer may be characterized in terms of the primal and dual problems, yielding a new proof of the known rate O (ln(1/ε)); • Arbitrary instances may be decomposed into the above two, granting rate O (1/ε), with a matching lower bound provided for the logistic loss. Keywords: boosting, convex analysis, weak learnability, coordinate descent, maximum entropy
6 0.35379624 115 jmlr-2012-Trading Regret for Efficiency: Online Convex Optimization with Long Term Constraints
7 0.35167107 73 jmlr-2012-Multi-task Regression using Minimal Penalties
8 0.35085082 103 jmlr-2012-Sampling Methods for the Nyström Method
9 0.35077506 92 jmlr-2012-Positive Semidefinite Metric Learning Using Boosting-like Algorithms
10 0.34879059 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models
11 0.34815907 86 jmlr-2012-Optimistic Bayesian Sampling in Contextual-Bandit Problems
12 0.34799668 91 jmlr-2012-Plug-in Approach to Active Learning
13 0.34792858 80 jmlr-2012-On Ranking and Generalization Bounds
14 0.34682012 13 jmlr-2012-Active Learning via Perfect Selective Classification
15 0.34677696 36 jmlr-2012-Efficient Methods for Robust Classification Under Uncertainty in Kernel Matrices
16 0.3448846 71 jmlr-2012-Multi-Instance Learning with Any Hypothesis Class
17 0.34469789 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods
18 0.34448081 84 jmlr-2012-Online Submodular Minimization
19 0.34343696 105 jmlr-2012-Selective Sampling and Active Learning from Single and Multiple Teachers
20 0.3422192 7 jmlr-2012-A Multi-Stage Framework for Dantzig Selector and LASSO