jmlr jmlr2013 jmlr2013-53 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Shusen Wang, Zhihua Zhang
Abstract: The CUR matrix decomposition and the Nyström approximation are two important low-rank matrix approximation techniques. The Nyström method approximates a symmetric positive semidefinite matrix in terms of a small number of its columns, while CUR approximates an arbitrary data matrix by a small number of its columns and rows. Thus, CUR decomposition can be regarded as an extension of the Nyström approximation. In this paper we establish a more general error bound for the adaptive column/row sampling algorithm, based on which we propose more accurate CUR and Nyström algorithms with expected relative-error bounds. The proposed CUR and Nyström algorithms also have low time complexity and can avoid maintaining the whole data matrix in RAM. In addition, we give theoretical analysis for the lower error bounds of the standard Nyström method and the ensemble Nyström method. The main theoretical results established in this paper are novel, and our analysis makes no special assumption on the data matrices. Keywords: large-scale matrix computation, CUR matrix decomposition, the Nyström method, randomized algorithms, adaptive sampling
Reference: text
sentIndex sentText sentNum sentScore
1 CN Department of Computer Science and Engineering Shanghai Jiao Tong University 800 Dong Chuan Road, Shanghai, China 200240 Editor: Mehryar Mohri Abstract The CUR matrix decomposition and the Nyström approximation are two important low-rank matrix approximation techniques. [sent-5, score-0.184]
2 The Nyström method approximates a symmetric positive semidefinite matrix in terms of a small number of its columns, while CUR approximates an arbitrary data matrix by a small number of its columns and rows. [sent-6, score-0.179]
3 In this paper we establish a more general error bound for the adaptive column/row sampling algorithm, based on which we propose more accurate CUR and Nyström algorithms with expected relative-error bounds. [sent-8, score-0.379]
4 Keywords: large-scale matrix computation, CUR matrix decomposition, the Nyström method, randomized algorithms, adaptive sampling 1. [sent-12, score-0.433]
5 Given a matrix A ∈ Rm×n , column selection algorithms aim to choose c columns of A to construct a matrix C ∈ Rm×c such that ∥A − CC† A∥ξ achieves the minimum. [sent-44, score-0.252]
6 Here “ξ = 2,” “ξ = F,” and “ξ = ∗” respectively represent the matrix spectral norm, the matrix Frobenius norm, and the matrix nuclear norm, and C† denotes the Moore-Penrose inverse of C. [sent-45, score-0.184]
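To make the objective concrete, the following NumPy sketch evaluates ∥A − CC†A∥ for a given set of selected columns. It is an illustration only (the random matrix and the index set are made up), not part of any algorithm in the paper.

    import numpy as np

    def column_selection_error(A, idx, ord="fro"):
        """Return ||A - C C^+ A|| for the columns of A indexed by idx."""
        C = A[:, idx]                      # m x c matrix of selected columns
        proj = C @ np.linalg.pinv(C) @ A   # projection of A onto span(C)
        return np.linalg.norm(A - proj, ord)

    rng = np.random.default_rng(0)
    A = rng.standard_normal((100, 50))
    err_F = column_selection_error(A, idx=[0, 3, 7, 19])          # Frobenius norm
    err_2 = column_selection_error(A, idx=[0, 3, 7, 19], ord=2)   # spectral norm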
7 Particularly, a CUR decomposition algorithm seeks to find a subset of c columns of A to form a matrix C ∈ Rm×c , a subset of r rows to form a matrix R ∈ Rr×n , and an intersection matrix U ∈ Rc×r such that ∥A − CUR∥ξ is small. [sent-70, score-0.315]
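A minimal CUR sketch in the same spirit is given below. The intersection matrix is taken to be U = C†AR†, the Frobenius-optimal choice for fixed C and R; this particular U and the uniformly chosen index sets are assumptions made for illustration, not the paper's Algorithm 2.

    import numpy as np

    def cur_decomposition(A, col_idx, row_idx):
        """Build C, U, R from given column/row index sets, with U = C^+ A R^+."""
        C = A[:, col_idx]                                  # m x c
        R = A[row_idx, :]                                  # r x n
        U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)      # c x r intersection matrix
        return C, U, R

    rng = np.random.default_rng(1)
    A = rng.standard_normal((200, 80))
    cols = rng.choice(80, size=10, replace=False)
    rows = rng.choice(200, size=12, replace=False)
    C, U, R = cur_decomposition(A, cols, rows)
    err = np.linalg.norm(A - C @ U @ R, "fro")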
8 For example, for an m×n matrix A and a target rank k ≪ min{m, n}, the subspace sampling algorithm (Drineas et al. [sent-83, score-0.425]
9 The subspace sampling algorithm selects columns/rows according to the statistical leverage scores, so the computational cost of this algorithm is at least equal to the cost of the truncated SVD of A, that is, O (mnk) in general. [sent-87, score-0.392]
10 (2012) devised a fast approximation to the statistical leverage scores which can be used to speed up the subspace sampling algorithm heuristically; however, no theoretical results have been reported showing that the leverage-score approximation yields a provably efficient subspace sampling algorithm. [sent-90, score-0.84]
11 Despite their strong resemblance, CUR is a harder problem than column selection because “one can get good columns or rows separately” does not mean that one can get good columns and rows together. [sent-94, score-0.291]
12 Specifically, given an m×m SPSD matrix A, they require sampling c (< m) columns of A to construct an m × c matrix C. [sent-110, score-0.372]
13 Since there exists an m×m permutation matrix Π such that ΠC consists of the first c columns of ΠAΠT , we always assume that C consists of the first c columns of A without loss of generality. [sent-111, score-0.211]
14 Then the ensemble method combines the samples to construct an approximation in the form of Ã^ens_{t,c} = ∑_{i=1}^t µ^(i) C^(i) W^(i)† C^(i)T , (2) where µ^(i) are the weights of the samples. [sent-122, score-0.208]
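The sketch below spells out both constructions: the standard Nyström approximation C W† C^T built from one column subset, and the ensemble form (2). Uniform column sampling and uniform weights µ^(i) = 1/t are assumptions made for illustration; they are not the sampling schemes analyzed in the paper.

    import numpy as np

    def nystrom(A, idx):
        """Standard Nystrom approximation C W^+ C^T from one set of columns of SPSD A."""
        C = A[:, idx]                        # m x c
        W = A[np.ix_(idx, idx)]              # c x c block at the sampled rows/columns
        return C @ np.linalg.pinv(W) @ C.T

    def ensemble_nystrom(A, index_sets, weights=None):
        """Ensemble Nystrom: weighted combination of standard Nystrom approximations."""
        t = len(index_sets)
        if weights is None:
            weights = [1.0 / t] * t          # uniform weights mu^(i) = 1/t (an assumption)
        return sum(mu * nystrom(A, idx) for mu, idx in zip(weights, index_sets))

    rng = np.random.default_rng(2)
    X = rng.standard_normal((300, 5))
    A = X @ X.T                              # an SPSD test matrix
    sets = [rng.choice(300, size=20, replace=False) for _ in range(3)]
    A_ens = ensemble_nystrom(A, sets)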
15 The matrix multiplications can be executed very efficiently in a multi-processor environment, so ideally computing the intersection matrix costs time only linear in m. [sent-129, score-0.179]
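If the intersection matrix referred to here is the Frobenius-optimal choice U = C†A(C†)^T for the model CUC^T (this reading is an assumption of the sketch), it can be formed with two large matrix multiplications, which is where the parallelism pays off:

    import numpy as np

    def modified_nystrom_intersection(A, idx):
        """Intersection matrix U = C^+ A (C^+)^T for the model C U C^T."""
        C = A[:, idx]
        Cp = np.linalg.pinv(C)       # c x m pseudo-inverse; cheap since c << m
        return Cp @ A @ Cp.T         # the two multiplications against A dominate the cost

    rng = np.random.default_rng(3)
    X = rng.standard_normal((400, 8))
    A = X @ X.T
    idx = np.arange(25)
    U = modified_nystrom_intersection(A, idx)
    A_approx = A[:, idx] @ U @ A[:, idx].T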
16 To generate effective approximations, much work has been built on the upper error bounds of the sampling techniques for the Nyström method. [sent-134, score-0.231]
17 4 Contributions and Outline The main technical contribution of this work is the adaptive sampling bound in Theorem 5, which is an extension of Theorem 2. [sent-149, score-0.352]
18 (2006) bounds the error incurred by projection onto column or row space, while our Theorem 5 bounds the error incurred by the projection simultaneously onto column space and row space. [sent-154, score-0.288]
19 More importantly, our adaptive sampling bound provides an approach for improving CUR and the Nyström approximation: no matter which relative-error column selection algorithm is employed, Theorem 5 ensures relative-error bounds for CUR and the Nyström approximation. [sent-158, score-0.436]
20 Based on the adaptive sampling bound in Theorem 5 and its corollary 7, we provide a concrete CUR algorithm which beats the best existing algorithm—the subspace sampling algorithm—both theoretically and empirically. [sent-160, score-0.681]
21 In Table 1 we present a comparison between our proposed CUR algorithm and the subspace sampling algorithm. [sent-162, score-0.329]
22 Our method is more scalable because it works on only a few columns or rows of the data matrix in question; in contrast, the subspace sampling algorithm must maintain the whole data matrix in RAM to compute the SVD. [sent-164, score-0.545]
23 Another important application of the adaptive sampling bound is to yield an algorithm for the modified Nyström method. [sent-165, score-0.352]
24 The algorithm has a strong relative-error upper bound: for a target rank k, by sampling 2k(1 + o(1)) columns it achieves a relative-error bound in expectation. [sent-166, score-0.32]
25 From the table we can see that the upper error bound of our adaptive sampling algorithm for the modified Nyström method is even better than the lower bounds of the conventional Nyström methods. [sent-175, score-0.433]
26 Informally speaking, the columns (or rows) with high leverage scores have greater influence in rank-k approximation than those with low leverage scores. [sent-201, score-0.209]
27 1 we present an adaptive sampling algorithm and its relative-error bound established by Deshpande et al. [sent-214, score-0.352]
28 1 The Adaptive Sampling Algorithm Adaptive sampling is an effective and efficient column sampling algorithm for reducing the error incurred by the first round of sampling. [sent-224, score-0.465]
29 After one has selected a small subset of columns (denoted C1 ), an adaptive sampling method is used to further select a proportion of columns according to the residual of the first round, that is, A − C1 C1† A. [sent-225, score-0.513]
30 The approximation error is guaranteed to be decreasing by a factor after the adaptive sampling (Deshpande et al. [sent-226, score-0.392]
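A high-level sketch of this two-round scheme follows. The first round here is plain uniform sampling and the second round uses probabilities proportional to the squared column norms of the residual A − C1C1†A; treat it as an illustration of the idea rather than the exact algorithm analyzed in Theorem 5.

    import numpy as np

    def adaptive_sample_columns(A, c1, c2, rng):
        """Two-round column sampling: uniform first round, residual-based second round."""
        m, n = A.shape
        idx1 = rng.choice(n, size=c1, replace=False)          # first-round columns C1
        C1 = A[:, idx1]
        B = A - C1 @ np.linalg.pinv(C1) @ A                   # residual of the first round
        p = np.sum(B * B, axis=0)                             # squared column norms of B
        p = p / p.sum()
        idx2 = rng.choice(n, size=c2, replace=False, p=p)     # adaptive second round
        return np.concatenate([idx1, idx2])

    rng = np.random.default_rng(4)
    A = rng.standard_normal((150, 60))
    idx = adaptive_sample_columns(A, c1=5, c2=10, rng=rng)
    C = A[:, idx]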
31 We will establish in Theorem 5 a more general and more useful error bound for this adaptive sampling algorithm. [sent-242, score-0.379]
32 , 2011), and the adaptive sampling algorithm (Deshpande et al. [sent-255, score-0.335]
33 Given a matrix A ∈ Rm×n , the truncated pivoted QR decomposition procedure deterministically finds a set of columns C ∈ Rm×c by column pivoting, whose span approximates the column space of A, and computes an upper triangular matrix TC ∈ Rc×c that orthogonalizes those columns. [sent-277, score-0.383]
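For reference, SciPy's column-pivoted QR gives exactly this kind of deterministic column selection (this is plain library usage, not necessarily the full SCRA procedure discussed later):

    import numpy as np
    from scipy.linalg import qr

    rng = np.random.default_rng(5)
    A = rng.standard_normal((120, 40))

    # Column-pivoted QR: piv orders the columns so that early columns carry most new span.
    Q, Rfac, piv = qr(A, mode="economic", pivoting=True)

    c = 10
    col_idx = piv[:c]            # indices of the first c pivoted columns
    C = A[:, col_idx]
    TC = Rfac[:c, :c]            # upper triangular; C @ inv(TC) has orthonormal columns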
34 The sampling probabilities in the two stages are proportional to the leverage scores of A and C, respectively. [sent-291, score-0.236]
35 That is, in the first stage the sampling probabilities are proportional to the squared ℓ2 -norm of the rows of VA,k ; in the second stage the sampling probabilities are proportional to the squared ℓ2 -norm of the rows of UC . [sent-292, score-0.424]
36 That is why it is called the subspace sampling algorithm. [sent-293, score-0.329]
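A sketch of the first stage is shown below: compute the column leverage scores from the top-k right singular vectors and sample accordingly. The truncated SVD is obtained here via scipy.sparse.linalg.svds; the random test matrix and the parameter values are made up for illustration.

    import numpy as np
    from scipy.sparse.linalg import svds

    def column_leverage_scores(A, k):
        """Squared l2-norms of the rows of V_{A,k} (one score per column of A)."""
        _, _, Vt = svds(A, k=k)              # Vt has shape (k, n)
        return np.sum(Vt * Vt, axis=0)

    rng = np.random.default_rng(6)
    A = rng.standard_normal((500, 200))
    k, c = 10, 30
    scores = column_leverage_scores(A, k)
    p = scores / scores.sum()                # subspace-sampling probabilities
    cols = rng.choice(A.shape[1], size=c, replace=False, p=p)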
37 Here we show the main results of the subspace sampling algorithm in the following lemma. [sent-294, score-0.329]
38 Lemma 3 (Subspace Sampling for CUR) Given an m × n matrix A and a target rank k ≪ min{m, n}, the subspace sampling algorithm selects c = O (kε−2 log k log(1/δ)) columns and r = O (cε−2 log c log(1/δ)) rows without replacement. [sent-295, score-0.543]
39 2, the Nyström algorithm also samples columns by the subspace sampling of Drineas et al. [sent-304, score-0.41]
40 Here S ∈ Rm×c is a column selection matrix with sij = 1 if the i-th column of A is the j-th column selected, and D ∈ Rc×c is a diagonal scaling matrix satisfying djj = 1/√(c pi) if sij = 1. [sent-308, score-0.263]
41 Lemma 4 (Subspace Sampling for the Nyström Approximation) Given an m × m SPSD matrix A and a target rank k ≪ m, the subspace sampling algorithm selects c = 3200ε−1 k log(16k/δ) columns without replacement and constructs C and W by scaling the selected columns. [sent-309, score-0.506]
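The selection matrix S and scaling matrix D defined above can be materialized directly from the sampled indices and their probabilities; the small sketch below only illustrates the definition (the uniform probabilities are made up).

    import numpy as np

    def selection_and_scaling(n, sampled_idx, p):
        """S is n x c with s_ij = 1 for selected columns; D is diagonal with d_jj = 1/sqrt(c p_i)."""
        c = len(sampled_idx)
        S = np.zeros((n, c))
        S[np.asarray(sampled_idx), np.arange(c)] = 1.0
        D = np.diag(1.0 / np.sqrt(c * p[np.asarray(sampled_idx)]))
        return S, D

    rng = np.random.default_rng(7)
    G = rng.standard_normal((60, 60)); A = G @ G.T         # an SPSD example
    p = np.full(60, 1.0 / 60)                              # uniform probabilities (made up)
    idx = rng.choice(60, size=8, replace=False, p=p)
    S, D = selection_and_scaling(60, idx, p)
    C = A @ S @ D                                          # selected, rescaled columns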
42 We establish a new error bound for the adaptive sampling algorithm in Section 4. [sent-314, score-0.379]
43 We apply adaptive sampling to the CUR and modified Nyström problems, obtaining effective and efficient CUR and Nyström algorithms in Section 4. [sent-316, score-0.335]
44 1 Adaptive Sampling The relative-error adaptive sampling algorithm is originally established in Theorem 2. [sent-327, score-0.335]
45 Here we prove a new and more general error bound for the same adaptive sampling algorithm. [sent-332, score-0.379]
46 Remark 6 This theorem shows a more general bound for adaptive sampling than the original one in Theorem 2. [sent-345, score-0.371]
47 The original one bounds the error incurred by projection onto the column space of C, while Theorem 5 bounds the error incurred by projection onto the column space of C and row space of R simultaneously; such a situation arises in problems such as CUR and the Nyström approximation. [sent-348, score-0.288]
48 2, selecting good columns or rows separately does not ensure good columns and rows together for CUR and the Nyström approximation. [sent-355, score-0.236]
49 Based on Corollary 7, we attempt to solve the CUR and Nyström problems by adaptive sampling algorithms. [sent-364, score-0.335]
50 2 Adaptive Sampling for CUR Matrix Decomposition Guaranteed by the novel adaptive sampling bound in Theorem 5, we combine the near-optimal column selection algorithm of Boutsidis et al. [sent-371, score-0.407]
51 (2011) and the adaptive sampling algorithm for solving the CUR problem, giving rise to an algorithm with a much tighter theoretical bound than existing algorithms. [sent-372, score-0.352]
52 Theorem 8 (Adaptive Sampling for CUR) Given a matrix A ∈ Rm×n and a positive integer k ≪ min{m, n}, the CUR algorithm described in Algorithm 2 randomly selects c = 2kε−1 (1+o(1)) columns of A to construct C ∈ Rm×c , and then selects r = cε−1 (1+ε) rows of A to construct R ∈ Rr×n . [sent-375, score-0.203]
53 Neither the near-optimal column selection algorithm nor the adaptive sampling algorithm requires loading the whole of A into RAM. [sent-380, score-0.39]
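A simplified end-to-end version of this pipeline is sketched below. The column-selection step is left generic (uniform here) and rows are then drawn adaptively from the residual A − CC†A, with U = C†AR†; these simplifications are assumptions of the sketch, so it should not be read as Algorithm 2 itself.

    import numpy as np

    def simple_adaptive_cur(A, c, r, rng):
        """Generic columns first, then residual-based adaptive rows, then U = C^+ A R^+."""
        m, n = A.shape
        cols = rng.choice(n, size=c, replace=False)            # stand-in column selector
        C = A[:, cols]
        resid = A - C @ np.linalg.pinv(C) @ A                  # residual w.r.t. span(C)
        p = np.sum(resid * resid, axis=1); p = p / p.sum()     # squared row norms
        rows = rng.choice(m, size=r, replace=False, p=p)       # adaptive row sampling
        R = A[rows, :]
        U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)
        return C, U, R

    rng = np.random.default_rng(8)
    A = rng.standard_normal((300, 120))
    C, U, R = simple_adaptive_cur(A, c=20, r=30, rng=rng)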
54 Remark 9 If we replace the near-optimal column selection algorithm in Theorem 8 by the optimal algorithm of Guruswami and Sinop (2012), it suffices to select c = kε−1 (1 + o(1)) columns and r = cε−1 (1 + ε) rows totally. [sent-384, score-0.191]
55 If one uses the optimal column selection algorithm of Guruswami and Sinop (2012), which is less efficient, the error bound is further improved: only c = kε−2 (1 + o(1)) columns are required. [sent-401, score-0.198]
56 Let Ã denote either the rank-c approximation to A constructed by the standard Nyström method in (1), or the approximation constructed by the ensemble Nyström method in (2) with t non-overlapping samples, each of which contains c columns of A. [sent-416, score-0.219]
57 Then there exists an SPSD matrix such that for any sampling strategy the approximation errors of the conventional Nyström methods, that is, ∥A − Ã∥ξ (ξ = 2, F, or “∗”), are lower bounded by some factors which are shown in Table 3. [sent-417, score-0.279]
58 Repeating the sampling procedure t times and letting X^(i) correspond to the error ratio of the i-th sample, we obtain an upper bound on the failure probability: P( min_i X^(i) > 1 + sε ) = P( X^(i) > 1 + sε, ∀ i = 1, · · · , t ) < ((1+ε)/(1+sε))^t δ, (4) which decays exponentially with t. [sent-430, score-0.249]
59 2 we conduct empirical comparisons between the standard Nyström and our modified Nyström, and comparisons among three sampling algorithms. [sent-443, score-0.194]
60 1 Comparison among the CUR Algorithms In this section we empirically compare our adaptive sampling based CUR algorithm (Algorithm 2) with the subspace sampling algorithm of Drineas et al. [sent-451, score-0.664]
61 As for the subspace sampling algorithm, we compute the leverage scores exactly via the truncated SVD. [sent-454, score-0.379]
62 , 2012) can significantly speed up subspace sampling, we do not use it because the approximation has no theoretical guarantee when applied to subspace sampling. [sent-456, score-0.338]
63 We repeat each of the two randomized algorithms 10 times, and report the minimum error ratio and the total elapsed time of the 10 rounds. [sent-538, score-0.192]
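The reported quantity is presumably the relative-error ratio ∥A − Ã∥F / ∥A − Ak∥F used throughout the analysis; a small helper for the repeat-and-keep-best protocol could look like the following (a uniform-sampling CUR stands in for the algorithms under test).

    import numpy as np
    import time

    def error_ratio(A, A_approx, k):
        """||A - A_approx||_F divided by ||A - A_k||_F (A_k from a truncated SVD)."""
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        A_k = (U[:, :k] * s[:k]) @ Vt[:k, :]
        return np.linalg.norm(A - A_approx, "fro") / np.linalg.norm(A - A_k, "fro")

    rng = np.random.default_rng(9)
    A = rng.standard_normal((200, 100))
    k, c, r = 5, 15, 20

    t0, ratios = time.time(), []
    for _ in range(10):                                   # repeat and keep the best run
        cols = rng.choice(100, size=c, replace=False)
        rows = rng.choice(200, size=r, replace=False)
        C, R = A[:, cols], A[rows, :]
        U_int = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)
        ratios.append(error_ratio(A, C @ U_int @ R, k))
    best_ratio, total_time = min(ratios), time.time() - t0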
64 We depict the error ratios and the elapsed time of the three CUR matrix decomposition algorithms in Figures 1, 2, 3, and 4. [sent-539, score-0.237]
65 We can see from Figures 1, 2, 3, and 4 that our adaptive sampling based CUR algorithm has much lower approximation error than the subspace sampling algorithm in all cases. [sent-540, score-0.721]
66 Our adaptive sampling based algorithm is better than the deterministic SCRA on the Farm Ads data set and the Gisette data set, worse than SCRA on the Enron data set, and comparable to SCRA on the Dexter data set. [sent-541, score-0.358]
67 As for the running time, the subspace sampling algorithm and our adaptive sampling based algorithm are much more efficient than SCRA, especially when c and r are large. [sent-544, score-0.192]
68 Our adaptive sampling based algorithm is comparable to the subspace sampling algorithm when c and r are small; however, our algorithm becomes less efficient when c and r are large. [sent-545, score-0.664]
69 First, the computational cost of the subspace sampling algorithm is dominated by the truncated SVD of A, which is determined by the target rank k and the size and sparsity of the data matrix. [sent-547, score-0.402]
70 The four data sets are all very sparse, so the subspace sampling algorithm has advantages. [sent-551, score-0.329]
71 2 Comparison among the Nyström Algorithms In this section we empirically compare our adaptive sampling algorithm (in Theorem 10) with some other sampling algorithms including the subspace sampling of Drineas et al. [sent-555, score-0.839]
72 We also conduct a comparison between the standard Nyström and our modified Nyström, both using the three sampling algorithms to select columns. [sent-557, score-0.194]
73 We run each algorithm 10 times, and report the minimum error ratio as well as the total elapsed time of the 10 repeats. [sent-613, score-0.192]
74 The standard deviation of the leverage scores reflects whether the advanced importance sampling techniques such as the subspace sampling and adaptive sampling are useful. [sent-623, score-0.919]
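One quick way to inspect this quantity for a kernel matrix is sketched below; the RBF kernel and its bandwidth convention are assumptions chosen to mirror the experimental setting, not a reproduction of it.

    import numpy as np
    from scipy.spatial.distance import cdist
    from scipy.sparse.linalg import svds

    def rbf_kernel(X, sigma):
        """Dense RBF kernel K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)) (one common convention)."""
        return np.exp(-cdist(X, X, "sqeuclidean") / (2.0 * sigma ** 2))

    def leverage_score_std(K, k):
        """Standard deviation of the top-k leverage scores of a symmetric matrix K."""
        _, _, Vt = svds(K, k=k)
        scores = np.sum(Vt * Vt, axis=0)
        return scores.std()

    rng = np.random.default_rng(10)
    X = rng.standard_normal((300, 6))
    K = rbf_kernel(X, sigma=1.0)
    spread = leverage_score_std(K, k=10)    # a large spread favors non-uniform sampling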
75 Figures 5, 6, and 7 show that the advantage of the subspace sampling and adaptive sampling over the uniform sampling is significant whenever the standard deviation of the leverage scores is large (see Table 5), and vice versa. [sent-624, score-0.96]
76 The experimental results also show that the subspace sampling and adaptive sampling algorithms significantly outperform the uniform sampling when c is reasonably small, say c < 10k. [sent-627, score-0.88]
77 This indicates that the subspace sampling and adaptive sampling algorithms are good at choosing “good” columns as basis vectors. [sent-628, score-0.745]
78 In most cases our adaptive sampling algorithm achieves the lowest approximation error among the three algorithms. [sent-631, score-0.392]
79 As for the running time, our adaptive sampling algorithm is more efficient than the subspace sampling algorithm. [sent-782, score-0.664]
80 This is partly because the RBF kernel matrix is dense, and hence the subspace sampling algorithm costs O (m2 k) time to compute the truncated SVD. [sent-783, score-0.461]
81 Conclusion In this paper we have built a novel and more general relative-error bound for the adaptive sampling algorithm. [sent-789, score-0.352]
82 To achieve a relative-error bound, the best previous algorithm, the subspace sampling algorithm, requires c = O (kε−2 log k) columns and r = O (cε−2 log c) rows. [sent-792, score-0.41]
83 We have shown that our adaptive sampling algorithm for the modified Nyström achieves a relative-error upper bound by sampling only c = 2kε−2 (1+o(1)) columns, which even beats the lower error bounds of the standard Nyström and the ensemble Nyström. [sent-794, score-0.661]
84 Our proposed CUR and Nyström algorithms are scalable because they only need to maintain a small fraction of columns or rows in RAM, and their time complexities are low provided that matrix multiplication can be executed highly efficiently. [sent-795, score-0.19]
85 Theorem 15 (The Adaptive Sampling Algorithm) Given a matrix A ∈ Rm×n and a matrix R ∈ Rr×n such that rank(R) = rank(AR† R) = ρ (ρ ≤ r ≤ m), let C1 ∈ Rm×c1 consist of c1 columns of A, and define the residual B = A − C1 C1† A. [sent-831, score-0.195]
86 Let C2 ∈ Rm×c2 contain the c2 sampled columns and C = [C1 , C2 ] ∈ Rm×(c1 +c2 ) contain the columns of both C1 and C2 , all of which are columns of A. [sent-837, score-0.243]
87 2 The Proof of Corollary 7 Since C is constructed from columns of A and the column space of C is contained in the column space of A, we have rank(CC† A) = rank(C) = ρ ≤ c. [sent-872, score-0.191]
88 Then the adaptive sampling algorithm costs O (nk2 ε−2 ) + TMultiply (mnkε−1 ) time to construct R2 . [sent-894, score-0.41]
89 The near-optimal column selection algorithm costs O (mk2 ε−4/3 + mk3 ε−2/3 ) + TMultiply (m2 kε−2/3 ) time to select c1 = O (kε−1 ) columns of A to construct C1 . [sent-900, score-0.211]
90 Then the adaptive sampling algorithm costs O (mk2 ε−2 ) + TMultiply (m2 kε−1 ) time to select c2 = O (kε−2 ) columns to construct C2 . [sent-901, score-0.491]
91 Proof Let Ĉ consist of c columns sampled from A and Ĉi consist of ci columns sampled from the i-th block diagonal matrix in A. [sent-983, score-0.219]
92 Without loss of generality, we assume Ĉi consists of the first ci columns of B̂, and accordingly Ŵi consists of the top left ci × ci block of B̂. [sent-984, score-0.183]
93 where B̃^ens_{t,c} = (1/t) ∑_{i=1}^t C^(i) W^(i)† C^(i)T . [sent-1027, score-0.202]
94 Figure 8: An illustration of the matrix B − B̃^ens_{t,c} for the ensemble Nyström method where B is defined in (9). [sent-1032, score-0.209]
95 For the ensemble Nyström method without overlapping, the matrix B − B̃^ens_{t,c} can always be partitioned into four regions as annotated. [sent-1035, score-0.209]
96 We can express it as B − B̃^ens_{t,c} = B − (1/t) ∑_{i=1}^t C^(i) W^(i)† C^(i)T = (1/t) ∑_{i=1}^t (B − C^(i) W^(i)† C^(i)T ), (18) and then a careful examination reveals that B − B̃^ens_{t,c} can be partitioned into four kinds of regions as illustrated in Figure 8. [sent-1039, score-0.202]
97 Therefore, the nuclear norm of (B − B̃^ens_{t,c}) equals the matrix trace, that is, ∥B − B̃^ens_{t,c}∥∗ = tr(B − B̃^ens_{t,c}) = tc · ((t−1)/t) · (1−η) + (m − tc) · (1−η) = (1−α)(m−c) · (c + 1/α)/(c + (1−α)/α), which proves the nuclear norm bound in the lemma. [sent-1045, score-0.539]
98 Furthermore, since the matrix B̂ − (1/t) ∑_{i=1}^t Ĉ_j^(i) Ŵ† Ĉ_j^(i)T is SPSD by Lemma 23, the block diagonal matrix (A − Ã^ens_{t,c}) is also SPSD. [sent-1058, score-0.199]
99 Thus we have ∥A − Ã^ens_{t,c}∥∗ = (1−α) ∑_{i=1}^k (p − ci) · (ci + 1/α)/(ci + (1−α)/α) ≥ (1−α)(m−c) · (1 + k/(c + (1−α)k/α)), which proves the nuclear norm bound in the theorem. [sent-1059, score-0.281]
100 Then there exists an m×m SPSD matrix A such that the relative-error ratios of the ensemble Nyström method are lower bounded by ∥A − Ã^ens_{t,c}∥F / ∥A − Ak∥F ≥ sqrt( ((m − 2c + c/t − k)/(m − k)) · (1 + k(m − 2c + c/t)/c^2) ) and ∥A − Ã^ens_{t,c}∥∗ / ∥A − Ak∥∗ ≥ ((m − c)/(m − k)) · (1 + k/c), where Ã^ens_{t,c} = (1/t) ∑_{i=1}^t C^(i) W^(i)† C^(i)T . [sent-1061, score-0.441]
wordName wordTfidf (topN-words)
[('cur', 0.501), ('nystr', 0.487), ('modified', 0.417), ('sampling', 0.175), ('ar', 0.164), ('adaptive', 0.16), ('subspace', 0.154), ('cc', 0.121), ('tmultiply', 0.113), ('elapsed', 0.112), ('ens', 0.101), ('drineas', 0.093), ('spsd', 0.093), ('ak', 0.091), ('ecomposition', 0.09), ('om', 0.09), ('ystr', 0.09), ('near', 0.088), ('columns', 0.081), ('mproving', 0.077), ('ct', 0.068), ('bnys', 0.068), ('mnk', 0.068), ('pproximation', 0.067), ('atrix', 0.06), ('ensemble', 0.059), ('deshpande', 0.058), ('mahoney', 0.058), ('column', 0.055), ('svd', 0.054), ('cw', 0.052), ('boutsidis', 0.05), ('matrix', 0.049), ('tc', 0.048), ('scra', 0.045), ('pc', 0.044), ('hang', 0.044), ('uniform', 0.041), ('rows', 0.037), ('leverage', 0.037), ('nuclear', 0.037), ('rm', 0.035), ('qr', 0.035), ('nys', 0.035), ('costs', 0.034), ('ci', 0.034), ('incurred', 0.033), ('vt', 0.031), ('bv', 0.031), ('approximation', 0.03), ('ratio', 0.03), ('bounds', 0.029), ('rank', 0.029), ('sparsi', 0.028), ('singular', 0.028), ('frobenius', 0.028), ('cj', 0.027), ('guruswami', 0.027), ('error', 0.027), ('truncated', 0.026), ('decomposition', 0.026), ('conventional', 0.025), ('rbf', 0.024), ('intersection', 0.024), ('span', 0.024), ('scores', 0.024), ('norm', 0.024), ('time', 0.023), ('deterministic', 0.023), ('anys', 0.023), ('sinop', 0.023), ('wang', 0.022), ('av', 0.022), ('ck', 0.021), ('im', 0.02), ('talwalkar', 0.019), ('theorem', 0.019), ('standard', 0.019), ('kumar', 0.019), ('optimal', 0.018), ('construct', 0.018), ('cuct', 0.018), ('pivoted', 0.018), ('target', 0.018), ('abalone', 0.017), ('rv', 0.017), ('apc', 0.017), ('pi', 0.017), ('bc', 0.017), ('bound', 0.017), ('wk', 0.016), ('rr', 0.016), ('stewart', 0.016), ('residual', 0.016), ('ops', 0.015), ('gittens', 0.015), ('blkdiag', 0.015), ('avr', 0.015), ('lemma', 0.015), ('rt', 0.015)]
simIndex simValue paperId paperTitle
same-paper 1 0.9999994 53 jmlr-2013-Improving CUR Matrix Decomposition and the Nystrom Approximation via Adaptive Sampling
Author: Shusen Wang, Zhihua Zhang
Abstract: The CUR matrix decomposition and the Nyström approximation are two important low-rank matrix approximation techniques. The Nyström method approximates a symmetric positive semidefinite matrix in terms of a small number of its columns, while CUR approximates an arbitrary data matrix by a small number of its columns and rows. Thus, CUR decomposition can be regarded as an extension of the Nyström approximation. In this paper we establish a more general error bound for the adaptive column/row sampling algorithm, based on which we propose more accurate CUR and Nyström algorithms with expected relative-error bounds. The proposed CUR and Nyström algorithms also have low time complexity and can avoid maintaining the whole data matrix in RAM. In addition, we give theoretical analysis for the lower error bounds of the standard Nyström method and the ensemble Nyström method. The main theoretical results established in this paper are novel, and our analysis makes no special assumption on the data matrices. Keywords: large-scale matrix computation, CUR matrix decomposition, the Nyström method, randomized algorithms, adaptive sampling
2 0.40476725 59 jmlr-2013-Large-scale SVD and Manifold Learning
Author: Ameet Talwalkar, Sanjiv Kumar, Mehryar Mohri, Henry Rowley
Abstract: This paper examines the efficacy of sampling-based low-rank approximation techniques when applied to large dense kernel matrices. We analyze two common approximate singular value decomposition techniques, namely the Nyström and Column sampling methods. We present a theoretical comparison between these two methods, provide novel insights regarding their suitability for various tasks and present experimental results that support our theory. Our results illustrate the relative strengths of each method. We next examine the performance of these two techniques on the large-scale task of extracting low-dimensional manifold structure given millions of high-dimensional face images. We address the computational challenges of non-linear dimensionality reduction via Isomap and Laplacian Eigenmaps, using a graph containing about 18 million nodes and 65 million edges. We present extensive experiments on learning low-dimensional embeddings for two large face data sets: CMU-PIE (35 thousand faces) and a web data set (18 million faces). Our comparisons show that the Nyström approximation is superior to the Column sampling method for this task. Furthermore, approximate Isomap tends to perform better than Laplacian Eigenmaps on both clustering and classification with the labeled CMU-PIE data set. Keywords: low-rank approximation, manifold learning, large-scale matrix factorization
3 0.061525665 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
Author: Tony Cai, Wen-Xin Zhou
Abstract: We consider in this paper the problem of noisy 1-bit matrix completion under a general non-uniform sampling distribution using the max-norm as a convex relaxation for the rank. A max-norm constrained maximum likelihood estimate is introduced and studied. The rate of convergence for the estimate is obtained. Information-theoretical methods are used to establish a minimax lower bound under the general sampling model. The minimax upper and lower bounds together yield the optimal rate of convergence for the Frobenius norm loss. Computational algorithms and numerical performance are also discussed. Keywords: 1-bit matrix completion, low-rank matrix, max-norm, trace-norm, constrained optimization, maximum likelihood estimate, optimal rate of convergence
4 0.06004997 50 jmlr-2013-Greedy Feature Selection for Subspace Clustering
Author: Eva L. Dyer, Aswin C. Sankaranarayanan, Richard G. Baraniuk
Abstract: Unions of subspaces provide a powerful generalization of single subspace models for collections of high-dimensional data; however, learning multiple subspaces from data is challenging due to the fact that segmentation—the identification of points that live in the same subspace—and subspace estimation must be performed simultaneously. Recently, sparse recovery methods were shown to provide a provable and robust strategy for exact feature selection (EFS)—recovering subsets of points from the ensemble that live in the same subspace. In parallel with recent studies of EFS with ℓ1 -minimization, in this paper, we develop sufficient conditions for EFS with a greedy method for sparse signal recovery known as orthogonal matching pursuit (OMP). Following our analysis, we provide an empirical study of feature selection strategies for signals living on unions of subspaces and characterize the gap between sparse recovery methods and nearest neighbor (NN)-based approaches. In particular, we demonstrate that sparse recovery methods provide significant advantages over NN methods and that the gap between the two approaches is particularly pronounced when the sampling of subspaces in the data set is sparse. Our results suggest that OMP may be employed to reliably recover exact feature sets in a number of regimes where NN approaches fail to reveal the subspace membership of points in the ensemble. Keywords: subspace clustering, unions of subspaces, hybrid linear models, sparse approximation, structured sparsity, nearest neighbors, low-rank approximation
5 0.03800888 17 jmlr-2013-Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees
Author: Nima Noorshams, Martin J. Wainwright
Abstract: The sum-product or belief propagation (BP) algorithm is a widely used message-passing technique for computing approximate marginals in graphical models. We introduce a new technique, called stochastic orthogonal series message-passing (SOSMP), for computing the BP fixed point in models with continuous random variables. It is based on a deterministic approximation of the messages via orthogonal series basis expansion, and a stochastic estimation of the basis coefficients via Monte Carlo techniques and damped updates. We prove that the SOSMP iterates converge to a δ-neighborhood of the unique BP fixed point for any tree-structured graph, and for any graphs with cycles in which the BP updates satisfy a contractivity condition. In addition, we demonstrate how to choose the number of basis coefficients as a function of the desired approximation accuracy δ and smoothness of the compatibility functions. We illustrate our theory with both simulated examples and in application to optical flow estimation. Keywords: graphical models, sum-product for continuous state spaces, low-complexity belief propagation, stochastic approximation, Monte Carlo methods, orthogonal basis expansion
6 0.037900161 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso
7 0.031978197 106 jmlr-2013-Stationary-Sparse Causality Network Learning
8 0.031297613 47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference
9 0.030863581 114 jmlr-2013-The Rate of Convergence of AdaBoost
10 0.030712619 28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning
11 0.030659508 101 jmlr-2013-Sparse Activity and Sparse Connectivity in Supervised Learning
12 0.03020256 8 jmlr-2013-A Theory of Multiclass Boosting
13 0.02800444 34 jmlr-2013-Distance Preserving Embeddings for General n-Dimensional Manifolds
14 0.026601091 5 jmlr-2013-A Near-Optimal Algorithm for Differentially-Private Principal Components
15 0.025999237 74 jmlr-2013-Multivariate Convex Regression with Adaptive Partitioning
16 0.025777778 84 jmlr-2013-PC Algorithm for Nonparanormal Graphical Models
17 0.02254216 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning
18 0.021059969 99 jmlr-2013-Semi-Supervised Learning Using Greedy Max-Cut
19 0.020991957 10 jmlr-2013-Algorithms and Hardness Results for Parallel Large Margin Learning
20 0.020773007 90 jmlr-2013-Quasi-Newton Method: A New Direction
topicId topicWeight
[(0, -0.147), (1, 0.106), (2, 0.033), (3, -0.229), (4, -0.033), (5, 0.063), (6, -0.072), (7, -0.173), (8, 0.233), (9, 0.413), (10, 0.511), (11, 0.042), (12, 0.223), (13, -0.042), (14, -0.099), (15, 0.083), (16, -0.009), (17, 0.075), (18, 0.109), (19, -0.025), (20, -0.071), (21, -0.058), (22, -0.003), (23, 0.014), (24, 0.031), (25, 0.018), (26, -0.001), (27, -0.005), (28, -0.042), (29, -0.006), (30, -0.063), (31, -0.025), (32, 0.001), (33, 0.025), (34, -0.024), (35, 0.018), (36, -0.022), (37, -0.017), (38, 0.017), (39, 0.03), (40, -0.007), (41, -0.025), (42, 0.007), (43, -0.011), (44, -0.008), (45, -0.013), (46, 0.023), (47, -0.016), (48, 0.022), (49, 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 0.95983011 53 jmlr-2013-Improving CUR Matrix Decomposition and the Nystrom Approximation via Adaptive Sampling
Author: Shusen Wang, Zhihua Zhang
Abstract: The CUR matrix decomposition and the Nyström approximation are two important low-rank matrix approximation techniques. The Nyström method approximates a symmetric positive semidefinite matrix in terms of a small number of its columns, while CUR approximates an arbitrary data matrix by a small number of its columns and rows. Thus, CUR decomposition can be regarded as an extension of the Nyström approximation. In this paper we establish a more general error bound for the adaptive column/row sampling algorithm, based on which we propose more accurate CUR and Nyström algorithms with expected relative-error bounds. The proposed CUR and Nyström algorithms also have low time complexity and can avoid maintaining the whole data matrix in RAM. In addition, we give theoretical analysis for the lower error bounds of the standard Nyström method and the ensemble Nyström method. The main theoretical results established in this paper are novel, and our analysis makes no special assumption on the data matrices. Keywords: large-scale matrix computation, CUR matrix decomposition, the Nyström method, randomized algorithms, adaptive sampling
2 0.86996657 59 jmlr-2013-Large-scale SVD and Manifold Learning
Author: Ameet Talwalkar, Sanjiv Kumar, Mehryar Mohri, Henry Rowley
Abstract: This paper examines the efficacy of sampling-based low-rank approximation techniques when applied to large dense kernel matrices. We analyze two common approximate singular value decomposition techniques, namely the Nyström and Column sampling methods. We present a theoretical comparison between these two methods, provide novel insights regarding their suitability for various tasks and present experimental results that support our theory. Our results illustrate the relative strengths of each method. We next examine the performance of these two techniques on the large-scale task of extracting low-dimensional manifold structure given millions of high-dimensional face images. We address the computational challenges of non-linear dimensionality reduction via Isomap and Laplacian Eigenmaps, using a graph containing about 18 million nodes and 65 million edges. We present extensive experiments on learning low-dimensional embeddings for two large face data sets: CMU-PIE (35 thousand faces) and a web data set (18 million faces). Our comparisons show that the Nyström approximation is superior to the Column sampling method for this task. Furthermore, approximate Isomap tends to perform better than Laplacian Eigenmaps on both clustering and classification with the labeled CMU-PIE data set. Keywords: low-rank approximation, manifold learning, large-scale matrix factorization
3 0.27908397 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
Author: Tony Cai, Wen-Xin Zhou
Abstract: We consider in this paper the problem of noisy 1-bit matrix completion under a general non-uniform sampling distribution using the max-norm as a convex relaxation for the rank. A max-norm constrained maximum likelihood estimate is introduced and studied. The rate of convergence for the estimate is obtained. Information-theoretical methods are used to establish a minimax lower bound under the general sampling model. The minimax upper and lower bounds together yield the optimal rate of convergence for the Frobenius norm loss. Computational algorithms and numerical performance are also discussed. Keywords: 1-bit matrix completion, low-rank matrix, max-norm, trace-norm, constrained optimization, maximum likelihood estimate, optimal rate of convergence
4 0.20827514 50 jmlr-2013-Greedy Feature Selection for Subspace Clustering
Author: Eva L. Dyer, Aswin C. Sankaranarayanan, Richard G. Baraniuk
Abstract: Unions of subspaces provide a powerful generalization of single subspace models for collections of high-dimensional data; however, learning multiple subspaces from data is challenging due to the fact that segmentation—the identification of points that live in the same subspace—and subspace estimation must be performed simultaneously. Recently, sparse recovery methods were shown to provide a provable and robust strategy for exact feature selection (EFS)—recovering subsets of points from the ensemble that live in the same subspace. In parallel with recent studies of EFS with ℓ1 -minimization, in this paper, we develop sufficient conditions for EFS with a greedy method for sparse signal recovery known as orthogonal matching pursuit (OMP). Following our analysis, we provide an empirical study of feature selection strategies for signals living on unions of subspaces and characterize the gap between sparse recovery methods and nearest neighbor (NN)-based approaches. In particular, we demonstrate that sparse recovery methods provide significant advantages over NN methods and that the gap between the two approaches is particularly pronounced when the sampling of subspaces in the data set is sparse. Our results suggest that OMP may be employed to reliably recover exact feature sets in a number of regimes where NN approaches fail to reveal the subspace membership of points in the ensemble. Keywords: subspace clustering, unions of subspaces, hybrid linear models, sparse approximation, structured sparsity, nearest neighbors, low-rank approximation
5 0.1529886 100 jmlr-2013-Similarity-based Clustering by Left-Stochastic Matrix Factorization
Author: Raman Arora, Maya R. Gupta, Amol Kapila, Maryam Fazel
Abstract: For similarity-based clustering, we propose modeling the entries of a given similarity matrix as the inner products of the unknown cluster probabilities. To estimate the cluster probabilities from the given similarity matrix, we introduce a left-stochastic non-negative matrix factorization problem. A rotation-based algorithm is proposed for the matrix factorization. Conditions for unique matrix factorizations and clusterings are given, and an error bound is provided. The algorithm is particularly efficient for the case of two clusters, which motivates a hierarchical variant for cases where the number of desired clusters is large. Experiments show that the proposed left-stochastic decomposition clustering model produces relatively high within-cluster similarity on most data sets and can match given class labels, and that the efficient hierarchical variant performs surprisingly well. Keywords: clustering, non-negative matrix factorization, rotation, indefinite kernel, similarity, completely positive
6 0.15140536 106 jmlr-2013-Stationary-Sparse Causality Network Learning
7 0.14633353 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso
8 0.14565216 5 jmlr-2013-A Near-Optimal Algorithm for Differentially-Private Principal Components
9 0.12832524 17 jmlr-2013-Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees
10 0.12516725 47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference
11 0.1235923 28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning
12 0.11873133 51 jmlr-2013-Greedy Sparsity-Constrained Optimization
13 0.11482601 101 jmlr-2013-Sparse Activity and Sparse Connectivity in Supervised Learning
14 0.11299884 114 jmlr-2013-The Rate of Convergence of AdaBoost
15 0.11056688 43 jmlr-2013-Fast MCMC Sampling for Markov Jump Processes and Extensions
16 0.10891166 13 jmlr-2013-Approximating the Permanent with Fractional Belief Propagation
17 0.10580428 116 jmlr-2013-Truncated Power Method for Sparse Eigenvalue Problems
18 0.098798677 8 jmlr-2013-A Theory of Multiclass Boosting
19 0.094300658 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization
20 0.094219156 76 jmlr-2013-Nonparametric Sparsity and Regularization
topicId topicWeight
[(0, 0.027), (5, 0.131), (6, 0.038), (10, 0.071), (23, 0.021), (34, 0.415), (67, 0.025), (68, 0.016), (70, 0.024), (75, 0.024), (85, 0.02), (87, 0.045), (89, 0.013)]
simIndex simValue paperId paperTitle
same-paper 1 0.65110123 53 jmlr-2013-Improving CUR Matrix Decomposition and the Nystrom Approximation via Adaptive Sampling
Author: Shusen Wang, Zhihua Zhang
Abstract: The CUR matrix decomposition and the Nystr¨ m approximation are two important low-rank matrix o approximation techniques. The Nystr¨ m method approximates a symmetric positive semidefinite o matrix in terms of a small number of its columns, while CUR approximates an arbitrary data matrix by a small number of its columns and rows. Thus, CUR decomposition can be regarded as an extension of the Nystr¨ m approximation. o In this paper we establish a more general error bound for the adaptive column/row sampling algorithm, based on which we propose more accurate CUR and Nystr¨ m algorithms with expected o relative-error bounds. The proposed CUR and Nystr¨ m algorithms also have low time complexity o and can avoid maintaining the whole data matrix in RAM. In addition, we give theoretical analysis for the lower error bounds of the standard Nystr¨ m method and the ensemble Nystr¨ m method. o o The main theoretical results established in this paper are novel, and our analysis makes no special assumption on the data matrices. Keywords: large-scale matrix computation, CUR matrix decomposition, the Nystr¨ m method, o randomized algorithms, adaptive sampling
2 0.5736903 73 jmlr-2013-Multicategory Large-Margin Unified Machines
Author: Chong Zhang, Yufeng Liu
Abstract: Hard and soft classifiers are two important groups of techniques for classification problems. Logistic regression and Support Vector Machines are typical examples of soft and hard classifiers respectively. The essential difference between these two groups is whether one needs to estimate the class conditional probability for the classification task or not. In particular, soft classifiers predict the label based on the obtained class conditional probabilities, while hard classifiers bypass the estimation of probabilities and focus on the decision boundary. In practice, for the goal of accurate classification, it is unclear which one to use in a given situation. To tackle this problem, the Largemargin Unified Machine (LUM) was recently proposed as a unified family to embrace both groups. The LUM family enables one to study the behavior change from soft to hard binary classifiers. For multicategory cases, however, the concept of soft and hard classification becomes less clear. In that case, class probability estimation becomes more involved as it requires estimation of a probability vector. In this paper, we propose a new Multicategory LUM (MLUM) framework to investigate the behavior of soft versus hard classification under multicategory settings. Our theoretical and numerical results help to shed some light on the nature of multicategory classification and its transition behavior from soft to hard classifiers. The numerical results suggest that the proposed tuned MLUM yields very competitive performance. Keywords: hard classification, large-margin, soft classification, support vector machine
3 0.36769724 59 jmlr-2013-Large-scale SVD and Manifold Learning
Author: Ameet Talwalkar, Sanjiv Kumar, Mehryar Mohri, Henry Rowley
Abstract: This paper examines the efficacy of sampling-based low-rank approximation techniques when applied to large dense kernel matrices. We analyze two common approximate singular value decomposition techniques, namely the Nyström and Column sampling methods. We present a theoretical comparison between these two methods, provide novel insights regarding their suitability for various tasks and present experimental results that support our theory. Our results illustrate the relative strengths of each method. We next examine the performance of these two techniques on the large-scale task of extracting low-dimensional manifold structure given millions of high-dimensional face images. We address the computational challenges of non-linear dimensionality reduction via Isomap and Laplacian Eigenmaps, using a graph containing about 18 million nodes and 65 million edges. We present extensive experiments on learning low-dimensional embeddings for two large face data sets: CMU-PIE (35 thousand faces) and a web data set (18 million faces). Our comparisons show that the Nyström approximation is superior to the Column sampling method for this task. Furthermore, approximate Isomap tends to perform better than Laplacian Eigenmaps on both clustering and classification with the labeled CMU-PIE data set. Keywords: low-rank approximation, manifold learning, large-scale matrix factorization
4 0.36059776 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization
Author: Yuchen Zhang, John C. Duchi, Martin J. Wainwright
Abstract: We analyze two communication-efficient algorithms for distributed optimization in statistical settings involving large-scale data sets. The first algorithm is a standard averaging method that distributes the N data samples evenly to m machines, performs separate minimization on each subset, and then averages the estimates. We provide a sharp analysis of this average mixture algorithm, showing that under a reasonable set of conditions, the combined parameter achieves mean-squared error (MSE) that decays as O (N −1 + (N/m)−2 ). Whenever m ≤ √N, this guarantee matches the best possible rate achievable by a centralized algorithm having access to all N samples. The second algorithm is a novel method, based on an appropriate form of bootstrap subsampling. Requiring only a single round of communication, it has mean-squared error that decays as O (N −1 + (N/m)−3 ), and so is more robust to the amount of parallelization. In addition, we show that a stochastic gradient-based method attains mean-squared error decaying as O (N −1 + (N/m)−3/2 ), easing computation at the expense of a potentially slower MSE rate. We also provide an experimental evaluation of our methods, investigating their performance both on simulated data and on a large-scale regression problem from the internet search domain. In particular, we show that our methods can be used to efficiently solve an advertisement prediction problem from the Chinese SoSo Search Engine, which involves logistic regression with N ≈ 2.4 × 108 samples and d ≈ 740,000 covariates. Keywords: distributed learning, stochastic optimization, averaging, subsampling
5 0.3553623 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
Author: Tony Cai, Wen-Xin Zhou
Abstract: We consider in this paper the problem of noisy 1-bit matrix completion under a general non-uniform sampling distribution using the max-norm as a convex relaxation for the rank. A max-norm constrained maximum likelihood estimate is introduced and studied. The rate of convergence for the estimate is obtained. Information-theoretical methods are used to establish a minimax lower bound under the general sampling model. The minimax upper and lower bounds together yield the optimal rate of convergence for the Frobenius norm loss. Computational algorithms and numerical performance are also discussed. Keywords: 1-bit matrix completion, low-rank matrix, max-norm, trace-norm, constrained optimization, maximum likelihood estimate, optimal rate of convergence
7 0.35373864 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression
8 0.35182342 17 jmlr-2013-Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees
9 0.35134998 28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning
11 0.35010305 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning
12 0.34872395 50 jmlr-2013-Greedy Feature Selection for Subspace Clustering
13 0.34861237 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation
14 0.34758419 10 jmlr-2013-Algorithms and Hardness Results for Parallel Large Margin Learning
15 0.34695584 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso
16 0.34649178 99 jmlr-2013-Semi-Supervised Learning Using Greedy Max-Cut
17 0.34575644 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems
18 0.3457011 46 jmlr-2013-GURLS: A Least Squares Library for Supervised Learning
19 0.34458673 15 jmlr-2013-Bayesian Canonical Correlation Analysis
20 0.3445583 114 jmlr-2013-The Rate of Convergence of AdaBoost