jmlr jmlr2013 jmlr2013-53 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Shusen Wang, Zhihua Zhang
Abstract: The CUR matrix decomposition and the Nyström approximation are two important low-rank matrix approximation techniques. The Nyström method approximates a symmetric positive semidefinite matrix in terms of a small number of its columns, while CUR approximates an arbitrary data matrix by a small number of its columns and rows. Thus, CUR decomposition can be regarded as an extension of the Nyström approximation. In this paper we establish a more general error bound for the adaptive column/row sampling algorithm, based on which we propose more accurate CUR and Nyström algorithms with expected relative-error bounds. The proposed CUR and Nyström algorithms also have low time complexity and can avoid maintaining the whole data matrix in RAM. In addition, we give theoretical analysis for the lower error bounds of the standard Nyström method and the ensemble Nyström method. The main theoretical results established in this paper are novel, and our analysis makes no special assumption on the data matrices. Keywords: large-scale matrix computation, CUR matrix decomposition, the Nyström method, randomized algorithms, adaptive sampling
Reference: text
sentIndex sentText sentNum sentScore
1 CN Department of Computer Science and Engineering Shanghai Jiao Tong University 800 Dong Chuan Road, Shanghai, China 200240 Editor: Mehryar Mohri Abstract The CUR matrix decomposition and the Nyström approximation are two important low-rank matrix approximation techniques. [sent-5, score-0.184]
2 The Nyström method approximates a symmetric positive semidefinite matrix in terms of a small number of its columns, while CUR approximates an arbitrary data matrix by a small number of its columns and rows. [sent-6, score-0.179]
3 In this paper we establish a more general error bound for the adaptive column/row sampling algorithm, based on which we propose more accurate CUR and Nyström algorithms with expected relative-error bounds. [sent-8, score-0.379]
4 Keywords: large-scale matrix computation, CUR matrix decomposition, the Nyström method, randomized algorithms, adaptive sampling 1. [sent-12, score-0.433]
5 Given a matrix A ∈ Rm×n , column selection algorithms aim to choose c columns of A to construct a matrix C ∈ Rm×c such that ∥A − CC† A∥ξ achieves the minimum. [sent-44, score-0.252]
6 Here “ξ = 2,” “ξ = F,” and “ξ = ∗” respectively represent the matrix spectral norm, the matrix Frobenius norm, and the matrix nuclear norm, and C† denotes the Moore-Penrose inverse of C. [sent-45, score-0.184]
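To make the objective concrete, the following NumPy sketch evaluates ∥A − CC†A∥ for a given set of selected columns. It is an illustration only (the random matrix and the index set are made up), not part of any algorithm in the paper.

    import numpy as np

    def column_selection_error(A, idx, ord="fro"):
        """Return ||A - C C^+ A|| for the columns of A indexed by idx."""
        C = A[:, idx]                      # m x c matrix of selected columns
        proj = C @ np.linalg.pinv(C) @ A   # projection of A onto span(C)
        return np.linalg.norm(A - proj, ord)

    rng = np.random.default_rng(0)
    A = rng.standard_normal((100, 50))
    err_F = column_selection_error(A, idx=[0, 3, 7, 19])          # Frobenius norm
    err_2 = column_selection_error(A, idx=[0, 3, 7, 19], ord=2)   # spectral norm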
7 Particularly, a CUR decomposition algorithm seeks to find a subset of c columns of A to form a matrix C ∈ Rm×c , a subset of r rows to form a matrix R ∈ Rr×n , and an intersection matrix U ∈ Rc×r such that ∥A − CUR∥ξ is small. [sent-70, score-0.315]
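A minimal CUR sketch in the same spirit is given below. The intersection matrix is taken to be U = C†AR†, the Frobenius-optimal choice for fixed C and R; this particular U and the uniformly chosen index sets are assumptions made for illustration, not the paper's Algorithm 2.

    import numpy as np

    def cur_decomposition(A, col_idx, row_idx):
        """Build C, U, R from given column/row index sets, with U = C^+ A R^+."""
        C = A[:, col_idx]                                  # m x c
        R = A[row_idx, :]                                  # r x n
        U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)      # c x r intersection matrix
        return C, U, R

    rng = np.random.default_rng(1)
    A = rng.standard_normal((200, 80))
    cols = rng.choice(80, size=10, replace=False)
    rows = rng.choice(200, size=12, replace=False)
    C, U, R = cur_decomposition(A, cols, rows)
    err = np.linalg.norm(A - C @ U @ R, "fro")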
8 For example, for an m×n matrix A and a target rank k ≪ min{m, n}, the subspace sampling algorithm (Drineas et al. [sent-83, score-0.425]
9 The subspace sampling algorithm selects columns/rows according to the statistical leverage scores, so the computational cost of this algorithm is at least equal to the cost of the truncated SVD of A, that is, O (mnk) in general. [sent-87, score-0.392]
10 (2012) devised a fast approximation to the statistical leverage scores which can be used to speed up the subspace sampling algorithm heuristically; however, no theoretical results have been reported showing that the leverage-score approximation yields a provably efficient subspace sampling algorithm. [sent-90, score-0.84]
11 Despite their strong resemblance, CUR is a harder problem than column selection because “one can get good columns or rows separately” does not mean that one can get good columns and rows together. [sent-94, score-0.291]
12 Specifically, given an m×m SPSD matrix A, they require sampling c (< m) columns of A to construct an m × c matrix C. [sent-110, score-0.372]
13 Since there exists an m×m permutation matrix Π such that ΠC consists of the first c columns of ΠAΠT , we always assume that C consists of the first c columns of A without loss of generality. [sent-111, score-0.211]
14 Then the ensemble method combines the samples to construct an approximation in the form of Ã^ens_{t,c} = ∑_{i=1}^t µ^(i) C^(i) W^(i)† C^(i)T , (2) where µ^(i) are the weights of the samples. [sent-122, score-0.208]
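The sketch below spells out both constructions: the standard Nyström approximation C W† C^T built from one column subset, and the ensemble form (2). Uniform column sampling and uniform weights µ^(i) = 1/t are assumptions made for illustration; they are not the sampling schemes analyzed in the paper.

    import numpy as np

    def nystrom(A, idx):
        """Standard Nystrom approximation C W^+ C^T from one set of columns of SPSD A."""
        C = A[:, idx]                        # m x c
        W = A[np.ix_(idx, idx)]              # c x c block at the sampled rows/columns
        return C @ np.linalg.pinv(W) @ C.T

    def ensemble_nystrom(A, index_sets, weights=None):
        """Ensemble Nystrom: weighted combination of standard Nystrom approximations."""
        t = len(index_sets)
        if weights is None:
            weights = [1.0 / t] * t          # uniform weights mu^(i) = 1/t (an assumption)
        return sum(mu * nystrom(A, idx) for mu, idx in zip(weights, index_sets))

    rng = np.random.default_rng(2)
    X = rng.standard_normal((300, 5))
    A = X @ X.T                              # an SPSD test matrix
    sets = [rng.choice(300, size=20, replace=False) for _ in range(3)]
    A_ens = ensemble_nystrom(A, sets)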
15 The matrix multiplications can be executed very efficiently in a multi-processor environment, so ideally computing the intersection matrix costs time only linear in m. [sent-129, score-0.179]
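If the intersection matrix referred to here is the Frobenius-optimal choice U = C†A(C†)^T for the model CUC^T (this reading is an assumption of the sketch), it can be formed with two large matrix multiplications, which is where the parallelism pays off:

    import numpy as np

    def modified_nystrom_intersection(A, idx):
        """Intersection matrix U = C^+ A (C^+)^T for the model C U C^T."""
        C = A[:, idx]
        Cp = np.linalg.pinv(C)       # c x m pseudo-inverse; cheap since c << m
        return Cp @ A @ Cp.T         # the two multiplications against A dominate the cost

    rng = np.random.default_rng(3)
    X = rng.standard_normal((400, 8))
    A = X @ X.T
    idx = np.arange(25)
    U = modified_nystrom_intersection(A, idx)
    A_approx = A[:, idx] @ U @ A[:, idx].T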
16 To generate effective approximations, much work has been built on the upper error bounds of the sampling techniques for the Nyström method. [sent-134, score-0.231]
17 4 Contributions and Outline The main technical contribution of this work is the adaptive sampling bound in Theorem 5, which is an extension of Theorem 2. [sent-149, score-0.352]
18 (2006) bounds the error incurred by projection onto column or row space, while our Theorem 5 bounds the error incurred by the projection simultaneously onto column space and row space. [sent-154, score-0.288]
19 More importantly, our adaptive sampling bound provides an approach for improving CUR and the Nyström approximation: no matter which relative-error column selection algorithm is employed, Theorem 5 ensures relative-error bounds for CUR and the Nyström approximation. [sent-158, score-0.436]
20 Based on the adaptive sampling bound in Theorem 5 and its corollary 7, we provide a concrete CUR algorithm which beats the best existing algorithm—the subspace sampling algorithm—both theoretically and empirically. [sent-160, score-0.681]
21 In Table 1 we present a comparison between our proposed CUR algorithm and the subspace sampling algorithm. [sent-162, score-0.329]
22 Our method is more scalable because it works on only a few columns or rows of the data matrix in question; in contrast, the subspace sampling algorithm must maintain the whole data matrix in RAM to compute the SVD. [sent-164, score-0.545]
23 Another important application of the adaptive sampling bound is to yield an algorithm for the modified Nyström method. [sent-165, score-0.352]
24 The algorithm has a strong relative-error upper bound: for a target rank k, by sampling 2k(1 + o(1)) columns it achieves a relative-error bound in expectation. [sent-166, score-0.32]
25 From the table we can see that the upper error bound of our adaptive sampling algorithm for the modified Nyström method is even better than the lower bounds of the conventional Nyström methods. [sent-175, score-0.433]
26 Informally speaking, the columns (or rows) with high leverage scores have greater influence in rank-k approximation than those with low leverage scores. [sent-201, score-0.209]
27 1 we present an adaptive sampling algorithm and its relative-error bound established by Deshpande et al. [sent-214, score-0.352]
28 1 The Adaptive Sampling Algorithm Adaptive sampling is an effective and efficient column sampling algorithm for reducing the error incurred by the first round of sampling. [sent-224, score-0.465]
29 After one has selected a small subset of columns (denoted C1 ), an adaptive sampling method is used to further select a proportion of columns according to the residual of the first round, that is, A − C1 C1† A. [sent-225, score-0.513]
30 The approximation error is guaranteed to be decreasing by a factor after the adaptive sampling (Deshpande et al. [sent-226, score-0.392]
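A high-level sketch of this two-round scheme follows. The first round here is plain uniform sampling and the second round uses probabilities proportional to the squared column norms of the residual A − C1C1†A; treat it as an illustration of the idea rather than the exact algorithm analyzed in Theorem 5.

    import numpy as np

    def adaptive_sample_columns(A, c1, c2, rng):
        """Two-round column sampling: uniform first round, residual-based second round."""
        m, n = A.shape
        idx1 = rng.choice(n, size=c1, replace=False)          # first-round columns C1
        C1 = A[:, idx1]
        B = A - C1 @ np.linalg.pinv(C1) @ A                   # residual of the first round
        p = np.sum(B * B, axis=0)                             # squared column norms of B
        p = p / p.sum()
        idx2 = rng.choice(n, size=c2, replace=False, p=p)     # adaptive second round
        return np.concatenate([idx1, idx2])

    rng = np.random.default_rng(4)
    A = rng.standard_normal((150, 60))
    idx = adaptive_sample_columns(A, c1=5, c2=10, rng=rng)
    C = A[:, idx]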
31 We will establish in Theorem 5 a more general and more useful error bound for this adaptive sampling algorithm. [sent-242, score-0.379]
32 , 2011), and the adaptive sampling algorithm (Deshpande et al. [sent-255, score-0.335]
33 Given a matrix A ∈ Rm×n , the truncated pivoted QR decomposition procedure deterministically finds a set of columns C ∈ Rm×c by column pivoting, whose span approximates the column space of A, and computes an upper triangular matrix TC ∈ Rc×c that orthogonalizes those columns. [sent-277, score-0.383]
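For reference, SciPy's column-pivoted QR gives exactly this kind of deterministic column selection (this is plain library usage, not necessarily the full SCRA procedure discussed later):

    import numpy as np
    from scipy.linalg import qr

    rng = np.random.default_rng(5)
    A = rng.standard_normal((120, 40))

    # Column-pivoted QR: piv orders the columns so that early columns carry most new span.
    Q, Rfac, piv = qr(A, mode="economic", pivoting=True)

    c = 10
    col_idx = piv[:c]            # indices of the first c pivoted columns
    C = A[:, col_idx]
    TC = Rfac[:c, :c]            # upper triangular; C @ inv(TC) has orthonormal columns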
34 The sampling probabilities in the two stages are proportional to the leverage scores of A and C, respectively. [sent-291, score-0.236]
35 That is, in the first stage the sampling probabilities are proportional to the squared ℓ2 -norm of the rows of VA,k ; in the second stage the sampling probabilities are proportional to the squared ℓ2 -norm of the rows of UC . [sent-292, score-0.424]
36 That is why it is called the subspace sampling algorithm. [sent-293, score-0.329]
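A sketch of the first stage is shown below: compute the column leverage scores from the top-k right singular vectors and sample accordingly. The truncated SVD is obtained here via scipy.sparse.linalg.svds; the random test matrix and the parameter values are made up for illustration.

    import numpy as np
    from scipy.sparse.linalg import svds

    def column_leverage_scores(A, k):
        """Squared l2-norms of the rows of V_{A,k} (one score per column of A)."""
        _, _, Vt = svds(A, k=k)              # Vt has shape (k, n)
        return np.sum(Vt * Vt, axis=0)

    rng = np.random.default_rng(6)
    A = rng.standard_normal((500, 200))
    k, c = 10, 30
    scores = column_leverage_scores(A, k)
    p = scores / scores.sum()                # subspace-sampling probabilities
    cols = rng.choice(A.shape[1], size=c, replace=False, p=p)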
37 Here we show the main results of the subspace sampling algorithm in the following lemma. [sent-294, score-0.329]
38 Lemma 3 (Subspace Sampling for CUR) Given an m × n matrix A and a target rank k ≪ min{m, n}, the subspace sampling algorithm selects c = O (kε−2 log k log(1/δ)) columns and r = O (cε−2 log c log(1/δ)) rows without replacement. [sent-295, score-0.543]
39 2, the Nyström algorithm also samples columns by the subspace sampling of Drineas et al. [sent-304, score-0.41]
40 Here S ∈ Rm×c is a column selection matrix with sij = 1 if the i-th column of A is the j-th column selected, and D ∈ Rc×c is a diagonal scaling matrix satisfying djj = 1/√(c pi) if sij = 1. [sent-308, score-0.263]
41 Lemma 4 (Subspace Sampling for the Nyström Approximation) Given an m × m SPSD matrix A and a target rank k ≪ m, the subspace sampling algorithm selects c = 3200ε−1 k log(16k/δ) columns without replacement and constructs C and W by scaling the selected columns. [sent-309, score-0.506]
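The selection matrix S and scaling matrix D defined above can be materialized directly from the sampled indices and their probabilities; the small sketch below only illustrates the definition (the uniform probabilities are made up).

    import numpy as np

    def selection_and_scaling(n, sampled_idx, p):
        """S is n x c with s_ij = 1 for selected columns; D is diagonal with d_jj = 1/sqrt(c p_i)."""
        c = len(sampled_idx)
        S = np.zeros((n, c))
        S[np.asarray(sampled_idx), np.arange(c)] = 1.0
        D = np.diag(1.0 / np.sqrt(c * p[np.asarray(sampled_idx)]))
        return S, D

    rng = np.random.default_rng(7)
    G = rng.standard_normal((60, 60)); A = G @ G.T         # an SPSD example
    p = np.full(60, 1.0 / 60)                              # uniform probabilities (made up)
    idx = rng.choice(60, size=8, replace=False, p=p)
    S, D = selection_and_scaling(60, idx, p)
    C = A @ S @ D                                          # selected, rescaled columns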
42 We establish a new error bound for the adaptive sampling algorithm in Section 4. [sent-314, score-0.379]
43 We apply adaptive sampling to the CUR and modified Nyström problems, obtaining effective and efficient CUR and Nyström algorithms in Section 4. [sent-316, score-0.335]
44 1 Adaptive Sampling The relative-error adaptive sampling algorithm is originally established in Theorem 2. [sent-327, score-0.335]
45 Here we prove a new and more general error bound for the same adaptive sampling algorithm. [sent-332, score-0.379]
46 Remark 6 This theorem shows a more general bound for adaptive sampling than the original one in Theorem 2. [sent-345, score-0.371]
47 The original one bounds the error incurred by projection onto the column space of C, while Theorem 5 bounds the error incurred by projection onto the column space of C and row space of R simultaneously; such a situation arises in problems such as CUR and the Nyström approximation. [sent-348, score-0.288]
48 2, selecting good columns or rows separately does not ensure good columns and rows together for CUR and the Nyström approximation. [sent-355, score-0.236]
49 Based on Corollary 7, we attempt to solve the CUR and Nyström problems by adaptive sampling algorithms. [sent-364, score-0.335]
50 2 Adaptive Sampling for CUR Matrix Decomposition Guaranteed by the novel adaptive sampling bound in Theorem 5, we combine the near-optimal column selection algorithm of Boutsidis et al. [sent-371, score-0.407]
51 (2011) and the adaptive sampling algorithm for solving the CUR problem, giving rise to an algorithm with a much tighter theoretical bound than existing algorithms. [sent-372, score-0.352]
52 Theorem 8 (Adaptive Sampling for CUR) Given a matrix A ∈ Rm×n and a positive integer k ≪ min{m, n}, the CUR algorithm described in Algorithm 2 randomly selects c = 2kε−1 (1+o(1)) columns of A to construct C ∈ Rm×c , and then selects r = cε−1 (1+ε) rows of A to construct R ∈ Rr×n . [sent-375, score-0.203]
53 Neither the near-optimal column selection algorithm nor the adaptive sampling algorithm requires loading the whole of A into RAM. [sent-380, score-0.39]
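A simplified end-to-end version of this pipeline is sketched below. The column-selection step is left generic (uniform here) and rows are then drawn adaptively from the residual A − CC†A, with U = C†AR†; these simplifications are assumptions of the sketch, so it should not be read as Algorithm 2 itself.

    import numpy as np

    def simple_adaptive_cur(A, c, r, rng):
        """Generic columns first, then residual-based adaptive rows, then U = C^+ A R^+."""
        m, n = A.shape
        cols = rng.choice(n, size=c, replace=False)            # stand-in column selector
        C = A[:, cols]
        resid = A - C @ np.linalg.pinv(C) @ A                  # residual w.r.t. span(C)
        p = np.sum(resid * resid, axis=1); p = p / p.sum()     # squared row norms
        rows = rng.choice(m, size=r, replace=False, p=p)       # adaptive row sampling
        R = A[rows, :]
        U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)
        return C, U, R

    rng = np.random.default_rng(8)
    A = rng.standard_normal((300, 120))
    C, U, R = simple_adaptive_cur(A, c=20, r=30, rng=rng)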
54 Remark 9 If we replace the near-optimal column selection algorithm in Theorem 8 by the optimal algorithm of Guruswami and Sinop (2012), it suffices to select c = kε−1 (1 + o(1)) columns and r = cε−1 (1 + ε) rows totally. [sent-384, score-0.191]
55 If one uses the optimal column selection algorithm of Guruswami and Sinop (2012), which is less efficient, the error bound is further improved: only c = kε−2 (1 + o(1)) columns are required. [sent-401, score-0.198]
56 Let Ã denote either the rank-c approximation to A constructed by the standard Nyström method in (1), or the approximation constructed by the ensemble Nyström method in (2) with t non-overlapping samples, each of which contains c columns of A. [sent-416, score-0.219]
57 Then there exists an SPSD matrix such that for any sampling strategy the approximation errors of the conventional Nyström methods, that is, ∥A − Ã∥ξ (ξ = 2, F, or “∗”), are lower bounded by some factors which are shown in Table 3. [sent-417, score-0.279]
58 Repeating the sampling procedure t times and letting X^(i) correspond to the error ratio of the i-th sample, we obtain an upper bound on the failure probability: P( min_i X^(i) > 1 + sε ) = P( X^(i) > 1 + sε, ∀ i = 1, · · · , t ) < ((1+ε)/(1+sε))^t δ, (4) which decays exponentially with t. [sent-430, score-0.249]
59 2 we conduct empirical comparisons between the standard Nyström and our modified Nyström, and comparisons among three sampling algorithms. [sent-443, score-0.194]
60 1 Comparison among the CUR Algorithms In this section we empirically compare our adaptive sampling based CUR algorithm (Algorithm 2) with the subspace sampling algorithm of Drineas et al. [sent-451, score-0.664]
61 As for the subspace sampling algorithm, we compute the leverage scores exactly via the truncated SVD. [sent-454, score-0.379]
62 , 2012) can significantly speed up subspace sampling, we do not use it because the approximation has no theoretical guarantee when applied to subspace sampling. [sent-456, score-0.338]
63 We repeat each of the two randomized algorithms 10 times, and report the minimum error ratio and the total elapsed time of the 10 rounds. [sent-538, score-0.192]
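The reported quantity is presumably the relative-error ratio ∥A − Ã∥F / ∥A − Ak∥F used throughout the analysis; a small helper for the repeat-and-keep-best protocol could look like the following (a uniform-sampling CUR stands in for the algorithms under test).

    import numpy as np
    import time

    def error_ratio(A, A_approx, k):
        """||A - A_approx||_F divided by ||A - A_k||_F (A_k from a truncated SVD)."""
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        A_k = (U[:, :k] * s[:k]) @ Vt[:k, :]
        return np.linalg.norm(A - A_approx, "fro") / np.linalg.norm(A - A_k, "fro")

    rng = np.random.default_rng(9)
    A = rng.standard_normal((200, 100))
    k, c, r = 5, 15, 20

    t0, ratios = time.time(), []
    for _ in range(10):                                   # repeat and keep the best run
        cols = rng.choice(100, size=c, replace=False)
        rows = rng.choice(200, size=r, replace=False)
        C, R = A[:, cols], A[rows, :]
        U_int = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)
        ratios.append(error_ratio(A, C @ U_int @ R, k))
    best_ratio, total_time = min(ratios), time.time() - t0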
64 We depict the error ratios and the elapsed time of the three CUR matrix decomposition algorithms in Figures 1, 2, 3, and 4. [sent-539, score-0.237]
65 We can see from Figures 1, 2, 3, and 4 that our adaptive sampling based CUR algorithm has much lower approximation error than the subspace sampling algorithm in all cases. [sent-540, score-0.721]
66 Our adaptive sampling based algorithm is better than the deterministic SCRA on the Farm Ads data set and the Gisette data set, worse than SCRA on the Enron data set, and comparable to SCRA on the Dexter data set. [sent-541, score-0.358]
67 As for the running time, the subspace sampling algorithm and our adaptive sampling based algorithm are much more efficient than SCRA, especially when c and r are large. [sent-544, score-0.192]
68 Our adaptive sampling based algorithm is comparable to the subspace sampling algorithm when c and r are small; however, our algorithm becomes less efficient when c and r are large. [sent-545, score-0.664]
69 First, the computational cost of the subspace sampling algorithm is dominated by the truncated SVD of A, which is determined by the target rank k and the size and sparsity of the data matrix. [sent-547, score-0.402]
70 The four data sets are all very sparse, so the subspace sampling algorithm has advantages. [sent-551, score-0.329]
71 2 Comparison among the Nyström Algorithms In this section we empirically compare our adaptive sampling algorithm (in Theorem 10) with some other sampling algorithms including the subspace sampling of Drineas et al. [sent-555, score-0.839]
72 We also conduct a comparison between the standard Nyström and our modified Nyström, both using the three sampling algorithms to select columns. [sent-557, score-0.194]
73 We run each algorithm 10 times, and report the minimum error ratio as well as the total elapsed time of the 10 repeats. [sent-613, score-0.192]
74 The standard deviation of the leverage scores reflects whether the advanced importance sampling techniques such as the subspace sampling and adaptive sampling are useful. [sent-623, score-0.919]
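One quick way to inspect this quantity for a kernel matrix is sketched below; the RBF kernel and its bandwidth convention are assumptions chosen to mirror the experimental setting, not a reproduction of it.

    import numpy as np
    from scipy.spatial.distance import cdist
    from scipy.sparse.linalg import svds

    def rbf_kernel(X, sigma):
        """Dense RBF kernel K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)) (one common convention)."""
        return np.exp(-cdist(X, X, "sqeuclidean") / (2.0 * sigma ** 2))

    def leverage_score_std(K, k):
        """Standard deviation of the top-k leverage scores of a symmetric matrix K."""
        _, _, Vt = svds(K, k=k)
        scores = np.sum(Vt * Vt, axis=0)
        return scores.std()

    rng = np.random.default_rng(10)
    X = rng.standard_normal((300, 6))
    K = rbf_kernel(X, sigma=1.0)
    spread = leverage_score_std(K, k=10)    # a large spread favors non-uniform sampling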
75 Figures 5, 6, and 7 show that the advantage of the subspace sampling and adaptive sampling over the uniform sampling is significant whenever the standard deviation of the leverage scores is large (see Table 5), and vice versa. [sent-624, score-0.96]
76 The experimental results also show that the subspace sampling and adaptive sampling algorithms significantly outperform the uniform sampling when c is reasonably small, say c < 10k. [sent-627, score-0.88]
77 This indicates that the subspace sampling and adaptive sampling algorithms are good at choosing “good” columns as basis vectors. [sent-628, score-0.745]
78 In most cases our adaptive sampling algorithm achieves the lowest approximation error among the three algorithms. [sent-631, score-0.392]
79 As for the running time, our adaptive sampling algorithm is more efficient than the subspace sampling algorithm. [sent-782, score-0.664]
80 This is partly because the RBF kernel matrix is dense, and hence the subspace sampling algorithm costs O (m2 k) time to compute the truncated SVD. [sent-783, score-0.461]
81 Conclusion In this paper we have built a novel and more general relative-error bound for the adaptive sampling algorithm. [sent-789, score-0.352]
82 To achieve a relative-error bound, the best previous algorithm, the subspace sampling algorithm, requires c = O (kε−2 log k) columns and r = O (cε−2 log c) rows. [sent-792, score-0.41]
83 We have shown that our adaptive sampling algorithm for the modified Nyström achieves a relative-error upper bound by sampling only c = 2kε−2 (1+o(1)) columns, which even beats the lower error bounds of the standard Nyström and the ensemble Nyström. [sent-794, score-0.661]
84 Our proposed CUR and Nyström algorithms are scalable because they only need to maintain a small fraction of columns or rows in RAM, and their time complexities are low provided that matrix multiplication can be executed highly efficiently. [sent-795, score-0.19]
85 Theorem 15 (The Adaptive Sampling Algorithm) Given a matrix A ∈ Rm×n and a matrix R ∈ Rr×n such that rank(R) = rank(AR† R) = ρ (ρ ≤ r ≤ m), let C1 ∈ Rm×c1 consist of c1 columns of A, and define the residual B = A − C1 C1† A. [sent-831, score-0.195]
86 Let C2 ∈ Rm×c2 contain the c2 sampled columns and C = [C1 , C2 ] ∈ Rm×(c1 +c2 ) contain the columns of both C1 and C2 , all of which are columns of A. [sent-837, score-0.243]
87 2 The Proof of Corollary 7 Since C is constructed from columns of A and the column space of C is contained in the column space of A, we have rank(CC† A) = rank(C) = ρ ≤ c. [sent-872, score-0.191]
88 Then the adaptive sampling algorithm costs O (nk2 ε−2 ) + TMultiply (mnkε−1 ) time to construct R2 . [sent-894, score-0.41]
89 The near-optimal column selection algorithm costs O (mk2 ε−4/3 + mk3 ε−2/3 ) + TMultiply (m2 kε−2/3 ) time to select c1 = O (kε−1 ) columns of A to construct C1 . [sent-900, score-0.211]
90 Then the adaptive sampling algorithm costs O (mk2 ε−2 ) + TMultiply (m2 kε−1 ) time to select c2 = O (kε−2 ) columns to construct C2 . [sent-901, score-0.491]
91 Proof Let Ĉ consist of c columns sampled from A and Ĉi consist of ci columns sampled from the i-th block diagonal matrix in A. [sent-983, score-0.219]
92 Without loss of generality, we assume Ĉi consists of the first ci columns of B̂, and accordingly Ŵi consists of the top left ci × ci block of B̂. [sent-984, score-0.183]
93 where B̃^ens_{t,c} = (1/t) ∑_{i=1}^t C^(i) W^(i)† C^(i)T . [sent-1027, score-0.202]
94 Figure 8: An illustration of the matrix B − B̃^ens_{t,c} for the ensemble Nyström method where B is defined in (9). [sent-1032, score-0.209]
95 For the ensemble Nyström method without overlapping, the matrix B − B̃^ens_{t,c} can always be partitioned into four regions as annotated. [sent-1035, score-0.209]
96 We can express it as B − B̃^ens_{t,c} = B − (1/t) ∑_{i=1}^t C^(i) W^(i)† C^(i)T = (1/t) ∑_{i=1}^t (B − C^(i) W^(i)† C^(i)T ), (18) and then a careful examination reveals that B − B̃^ens_{t,c} can be partitioned into four kinds of regions as illustrated in Figure 8. [sent-1039, score-0.202]
97 Therefore, the nuclear norm of (B − B̃^ens_{t,c}) equals the matrix trace, that is, ∥B − B̃^ens_{t,c}∥∗ = tr(B − B̃^ens_{t,c}) = tc · ((t−1)/t) · (1−η) + (m − tc) · (1−η) = (1−α)(m−c) · (c + 1/α)/(c + (1−α)/α), which proves the nuclear norm bound in the lemma. [sent-1045, score-0.539]
98 Furthermore, since the matrix B̂ − (1/t) ∑_{i=1}^t Ĉ_j^(i) Ŵ† Ĉ_j^(i)T is SPSD by Lemma 23, the block diagonal matrix (A − Ã^ens_{t,c}) is also SPSD. [sent-1058, score-0.199]
99 Thus we have ∥A − Ã^ens_{t,c}∥∗ = (1−α) ∑_{i=1}^k (p − ci) · (ci + 1/α)/(ci + (1−α)/α) ≥ (1−α)(m−c) · (1 + k/(c + (1−α)k/α)), which proves the nuclear norm bound in the theorem. [sent-1059, score-0.281]
100 Then there exists an m×m SPSD matrix A such that the relative-error ratios of the ensemble Nyström method are lower bounded by ∥A − Ã^ens_{t,c}∥F / ∥A − Ak∥F ≥ sqrt( ((m − 2c + c/t − k)/(m − k)) · (1 + k(m − 2c + c/t)/c^2) ) and ∥A − Ã^ens_{t,c}∥∗ / ∥A − Ak∥∗ ≥ ((m − c)/(m − k)) · (1 + k/c), where Ã^ens_{t,c} = (1/t) ∑_{i=1}^t C^(i) W^(i)† C^(i)T . [sent-1061, score-0.441]
wordName wordTfidf (topN-words)
[('cur', 0.501), ('nystr', 0.487), ('modified', 0.417), ('sampling', 0.175), ('ar', 0.164), ('adaptive', 0.16), ('subspace', 0.154), ('cc', 0.121), ('tmultiply', 0.113), ('elapsed', 0.112), ('ens', 0.101), ('drineas', 0.093), ('spsd', 0.093), ('ak', 0.091), ('ecomposition', 0.09), ('om', 0.09), ('ystr', 0.09), ('near', 0.088), ('columns', 0.081), ('mproving', 0.077), ('ct', 0.068), ('bnys', 0.068), ('mnk', 0.068), ('pproximation', 0.067), ('atrix', 0.06), ('ensemble', 0.059), ('deshpande', 0.058), ('mahoney', 0.058), ('column', 0.055), ('svd', 0.054), ('cw', 0.052), ('boutsidis', 0.05), ('matrix', 0.049), ('tc', 0.048), ('scra', 0.045), ('pc', 0.044), ('hang', 0.044), ('uniform', 0.041), ('rows', 0.037), ('leverage', 0.037), ('nuclear', 0.037), ('rm', 0.035), ('qr', 0.035), ('nys', 0.035), ('costs', 0.034), ('ci', 0.034), ('incurred', 0.033), ('vt', 0.031), ('bv', 0.031), ('approximation', 0.03), ('ratio', 0.03), ('bounds', 0.029), ('rank', 0.029), ('sparsi', 0.028), ('singular', 0.028), ('frobenius', 0.028), ('cj', 0.027), ('guruswami', 0.027), ('error', 0.027), ('truncated', 0.026), ('decomposition', 0.026), ('conventional', 0.025), ('rbf', 0.024), ('intersection', 0.024), ('span', 0.024), ('scores', 0.024), ('norm', 0.024), ('time', 0.023), ('deterministic', 0.023), ('anys', 0.023), ('sinop', 0.023), ('wang', 0.022), ('av', 0.022), ('ck', 0.021), ('im', 0.02), ('talwalkar', 0.019), ('theorem', 0.019), ('standard', 0.019), ('kumar', 0.019), ('optimal', 0.018), ('construct', 0.018), ('cuct', 0.018), ('pivoted', 0.018), ('target', 0.018), ('abalone', 0.017), ('rv', 0.017), ('apc', 0.017), ('pi', 0.017), ('bc', 0.017), ('bound', 0.017), ('wk', 0.016), ('rr', 0.016), ('stewart', 0.016), ('residual', 0.016), ('ops', 0.015), ('gittens', 0.015), ('blkdiag', 0.015), ('avr', 0.015), ('lemma', 0.015), ('rt', 0.015)]
simIndex simValue paperId paperTitle
same-paper 1 0.9999994 53 jmlr-2013-Improving CUR Matrix Decomposition and the Nystrom Approximation via Adaptive Sampling
Author: Shusen Wang, Zhihua Zhang
Abstract: The CUR matrix decomposition and the Nyström approximation are two important low-rank matrix approximation techniques. The Nyström method approximates a symmetric positive semidefinite matrix in terms of a small number of its columns, while CUR approximates an arbitrary data matrix by a small number of its columns and rows. Thus, CUR decomposition can be regarded as an extension of the Nyström approximation. In this paper we establish a more general error bound for the adaptive column/row sampling algorithm, based on which we propose more accurate CUR and Nyström algorithms with expected relative-error bounds. The proposed CUR and Nyström algorithms also have low time complexity and can avoid maintaining the whole data matrix in RAM. In addition, we give theoretical analysis for the lower error bounds of the standard Nyström method and the ensemble Nyström method. The main theoretical results established in this paper are novel, and our analysis makes no special assumption on the data matrices. Keywords: large-scale matrix computation, CUR matrix decomposition, the Nyström method, randomized algorithms, adaptive sampling
2 0.40476725 59 jmlr-2013-Large-scale SVD and Manifold Learning
Author: Ameet Talwalkar, Sanjiv Kumar, Mehryar Mohri, Henry Rowley
Abstract: This paper examines the efficacy of sampling-based low-rank approximation techniques when applied to large dense kernel matrices. We analyze two common approximate singular value decomposition techniques, namely the Nyström and Column sampling methods. We present a theoretical comparison between these two methods, provide novel insights regarding their suitability for various tasks and present experimental results that support our theory. Our results illustrate the relative strengths of each method. We next examine the performance of these two techniques on the large-scale task of extracting low-dimensional manifold structure given millions of high-dimensional face images. We address the computational challenges of non-linear dimensionality reduction via Isomap and Laplacian Eigenmaps, using a graph containing about 18 million nodes and 65 million edges. We present extensive experiments on learning low-dimensional embeddings for two large face data sets: CMU-PIE (35 thousand faces) and a web data set (18 million faces). Our comparisons show that the Nyström approximation is superior to the Column sampling method for this task. Furthermore, approximate Isomap tends to perform better than Laplacian Eigenmaps on both clustering and classification with the labeled CMU-PIE data set. Keywords: low-rank approximation, manifold learning, large-scale matrix factorization
3 0.061525665 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
Author: Tony Cai, Wen-Xin Zhou
Abstract: We consider in this paper the problem of noisy 1-bit matrix completion under a general non-uniform sampling distribution using the max-norm as a convex relaxation for the rank. A max-norm constrained maximum likelihood estimate is introduced and studied. The rate of convergence for the estimate is obtained. Information-theoretical methods are used to establish a minimax lower bound under the general sampling model. The minimax upper and lower bounds together yield the optimal rate of convergence for the Frobenius norm loss. Computational algorithms and numerical performance are also discussed. Keywords: 1-bit matrix completion, low-rank matrix, max-norm, trace-norm, constrained optimization, maximum likelihood estimate, optimal rate of convergence
4 0.06004997 50 jmlr-2013-Greedy Feature Selection for Subspace Clustering
Author: Eva L. Dyer, Aswin C. Sankaranarayanan, Richard G. Baraniuk
Abstract: Unions of subspaces provide a powerful generalization of single subspace models for collections of high-dimensional data; however, learning multiple subspaces from data is challenging due to the fact that segmentation—the identification of points that live in the same subspace—and subspace estimation must be performed simultaneously. Recently, sparse recovery methods were shown to provide a provable and robust strategy for exact feature selection (EFS)—recovering subsets of points from the ensemble that live in the same subspace. In parallel with recent studies of EFS with ℓ1 -minimization, in this paper, we develop sufficient conditions for EFS with a greedy method for sparse signal recovery known as orthogonal matching pursuit (OMP). Following our analysis, we provide an empirical study of feature selection strategies for signals living on unions of subspaces and characterize the gap between sparse recovery methods and nearest neighbor (NN)-based approaches. In particular, we demonstrate that sparse recovery methods provide significant advantages over NN methods and that the gap between the two approaches is particularly pronounced when the sampling of subspaces in the data set is sparse. Our results suggest that OMP may be employed to reliably recover exact feature sets in a number of regimes where NN approaches fail to reveal the subspace membership of points in the ensemble. Keywords: subspace clustering, unions of subspaces, hybrid linear models, sparse approximation, structured sparsity, nearest neighbors, low-rank approximation
5 0.03800888 17 jmlr-2013-Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees
Author: Nima Noorshams, Martin J. Wainwright
Abstract: The sum-product or belief propagation (BP) algorithm is a widely used message-passing technique for computing approximate marginals in graphical models. We introduce a new technique, called stochastic orthogonal series message-passing (SOSMP), for computing the BP fixed point in models with continuous random variables. It is based on a deterministic approximation of the messages via orthogonal series basis expansion, and a stochastic estimation of the basis coefficients via Monte Carlo techniques and damped updates. We prove that the SOSMP iterates converge to a δ-neighborhood of the unique BP fixed point for any tree-structured graph, and for any graphs with cycles in which the BP updates satisfy a contractivity condition. In addition, we demonstrate how to choose the number of basis coefficients as a function of the desired approximation accuracy δ and smoothness of the compatibility functions. We illustrate our theory with both simulated examples and in application to optical flow estimation. Keywords: graphical models, sum-product for continuous state spaces, low-complexity belief propagation, stochastic approximation, Monte Carlo methods, orthogonal basis expansion
6 0.037900161 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso
7 0.031978197 106 jmlr-2013-Stationary-Sparse Causality Network Learning
8 0.031297613 47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference
9 0.030863581 114 jmlr-2013-The Rate of Convergence of AdaBoost
10 0.030712619 28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning
11 0.030659508 101 jmlr-2013-Sparse Activity and Sparse Connectivity in Supervised Learning
12 0.03020256 8 jmlr-2013-A Theory of Multiclass Boosting
13 0.02800444 34 jmlr-2013-Distance Preserving Embeddings for General n-Dimensional Manifolds
14 0.026601091 5 jmlr-2013-A Near-Optimal Algorithm for Differentially-Private Principal Components
15 0.025999237 74 jmlr-2013-Multivariate Convex Regression with Adaptive Partitioning
16 0.025777778 84 jmlr-2013-PC Algorithm for Nonparanormal Graphical Models
17 0.02254216 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning
18 0.021059969 99 jmlr-2013-Semi-Supervised Learning Using Greedy Max-Cut
19 0.020991957 10 jmlr-2013-Algorithms and Hardness Results for Parallel Large Margin Learning
20 0.020773007 90 jmlr-2013-Quasi-Newton Method: A New Direction
topicId topicWeight
[(0, -0.147), (1, 0.106), (2, 0.033), (3, -0.229), (4, -0.033), (5, 0.063), (6, -0.072), (7, -0.173), (8, 0.233), (9, 0.413), (10, 0.511), (11, 0.042), (12, 0.223), (13, -0.042), (14, -0.099), (15, 0.083), (16, -0.009), (17, 0.075), (18, 0.109), (19, -0.025), (20, -0.071), (21, -0.058), (22, -0.003), (23, 0.014), (24, 0.031), (25, 0.018), (26, -0.001), (27, -0.005), (28, -0.042), (29, -0.006), (30, -0.063), (31, -0.025), (32, 0.001), (33, 0.025), (34, -0.024), (35, 0.018), (36, -0.022), (37, -0.017), (38, 0.017), (39, 0.03), (40, -0.007), (41, -0.025), (42, 0.007), (43, -0.011), (44, -0.008), (45, -0.013), (46, 0.023), (47, -0.016), (48, 0.022), (49, 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 0.95983011 53 jmlr-2013-Improving CUR Matrix Decomposition and the Nystrom Approximation via Adaptive Sampling
Author: Shusen Wang, Zhihua Zhang
Abstract: The CUR matrix decomposition and the Nyström approximation are two important low-rank matrix approximation techniques. The Nyström method approximates a symmetric positive semidefinite matrix in terms of a small number of its columns, while CUR approximates an arbitrary data matrix by a small number of its columns and rows. Thus, CUR decomposition can be regarded as an extension of the Nyström approximation. In this paper we establish a more general error bound for the adaptive column/row sampling algorithm, based on which we propose more accurate CUR and Nyström algorithms with expected relative-error bounds. The proposed CUR and Nyström algorithms also have low time complexity and can avoid maintaining the whole data matrix in RAM. In addition, we give theoretical analysis for the lower error bounds of the standard Nyström method and the ensemble Nyström method. The main theoretical results established in this paper are novel, and our analysis makes no special assumption on the data matrices. Keywords: large-scale matrix computation, CUR matrix decomposition, the Nyström method, randomized algorithms, adaptive sampling
2 0.86996657 59 jmlr-2013-Large-scale SVD and Manifold Learning
Author: Ameet Talwalkar, Sanjiv Kumar, Mehryar Mohri, Henry Rowley
Abstract: This paper examines the efficacy of sampling-based low-rank approximation techniques when applied to large dense kernel matrices. We analyze two common approximate singular value decomposition techniques, namely the Nyström and Column sampling methods. We present a theoretical comparison between these two methods, provide novel insights regarding their suitability for various tasks and present experimental results that support our theory. Our results illustrate the relative strengths of each method. We next examine the performance of these two techniques on the large-scale task of extracting low-dimensional manifold structure given millions of high-dimensional face images. We address the computational challenges of non-linear dimensionality reduction via Isomap and Laplacian Eigenmaps, using a graph containing about 18 million nodes and 65 million edges. We present extensive experiments on learning low-dimensional embeddings for two large face data sets: CMU-PIE (35 thousand faces) and a web data set (18 million faces). Our comparisons show that the Nyström approximation is superior to the Column sampling method for this task. Furthermore, approximate Isomap tends to perform better than Laplacian Eigenmaps on both clustering and classification with the labeled CMU-PIE data set. Keywords: low-rank approximation, manifold learning, large-scale matrix factorization
3 0.27908397 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
Author: Tony Cai, Wen-Xin Zhou
Abstract: We consider in this paper the problem of noisy 1-bit matrix completion under a general non-uniform sampling distribution using the max-norm as a convex relaxation for the rank. A max-norm constrained maximum likelihood estimate is introduced and studied. The rate of convergence for the estimate is obtained. Information-theoretical methods are used to establish a minimax lower bound under the general sampling model. The minimax upper and lower bounds together yield the optimal rate of convergence for the Frobenius norm loss. Computational algorithms and numerical performance are also discussed. Keywords: 1-bit matrix completion, low-rank matrix, max-norm, trace-norm, constrained optimization, maximum likelihood estimate, optimal rate of convergence
4 0.20827514 50 jmlr-2013-Greedy Feature Selection for Subspace Clustering
Author: Eva L. Dyer, Aswin C. Sankaranarayanan, Richard G. Baraniuk
Abstract: Unions of subspaces provide a powerful generalization of single subspace models for collections of high-dimensional data; however, learning multiple subspaces from data is challenging due to the fact that segmentation—the identification of points that live in the same subspace—and subspace estimation must be performed simultaneously. Recently, sparse recovery methods were shown to provide a provable and robust strategy for exact feature selection (EFS)—recovering subsets of points from the ensemble that live in the same subspace. In parallel with recent studies of EFS with ℓ1 -minimization, in this paper, we develop sufficient conditions for EFS with a greedy method for sparse signal recovery known as orthogonal matching pursuit (OMP). Following our analysis, we provide an empirical study of feature selection strategies for signals living on unions of subspaces and characterize the gap between sparse recovery methods and nearest neighbor (NN)-based approaches. In particular, we demonstrate that sparse recovery methods provide significant advantages over NN methods and that the gap between the two approaches is particularly pronounced when the sampling of subspaces in the data set is sparse. Our results suggest that OMP may be employed to reliably recover exact feature sets in a number of regimes where NN approaches fail to reveal the subspace membership of points in the ensemble. Keywords: subspace clustering, unions of subspaces, hybrid linear models, sparse approximation, structured sparsity, nearest neighbors, low-rank approximation
5 0.1529886 100 jmlr-2013-Similarity-based Clustering by Left-Stochastic Matrix Factorization
Author: Raman Arora, Maya R. Gupta, Amol Kapila, Maryam Fazel
Abstract: For similarity-based clustering, we propose modeling the entries of a given similarity matrix as the inner products of the unknown cluster probabilities. To estimate the cluster probabilities from the given similarity matrix, we introduce a left-stochastic non-negative matrix factorization problem. A rotation-based algorithm is proposed for the matrix factorization. Conditions for unique matrix factorizations and clusterings are given, and an error bound is provided. The algorithm is particularly efficient for the case of two clusters, which motivates a hierarchical variant for cases where the number of desired clusters is large. Experiments show that the proposed left-stochastic decomposition clustering model produces relatively high within-cluster similarity on most data sets and can match given class labels, and that the efficient hierarchical variant performs surprisingly well. Keywords: clustering, non-negative matrix factorization, rotation, indefinite kernel, similarity, completely positive
6 0.15140536 106 jmlr-2013-Stationary-Sparse Causality Network Learning
7 0.14633353 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso
8 0.14565216 5 jmlr-2013-A Near-Optimal Algorithm for Differentially-Private Principal Components
9 0.12832524 17 jmlr-2013-Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees
10 0.12516725 47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference
11 0.1235923 28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning
12 0.11873133 51 jmlr-2013-Greedy Sparsity-Constrained Optimization
13 0.11482601 101 jmlr-2013-Sparse Activity and Sparse Connectivity in Supervised Learning
14 0.11299884 114 jmlr-2013-The Rate of Convergence of AdaBoost
15 0.11056688 43 jmlr-2013-Fast MCMC Sampling for Markov Jump Processes and Extensions
16 0.10891166 13 jmlr-2013-Approximating the Permanent with Fractional Belief Propagation
17 0.10580428 116 jmlr-2013-Truncated Power Method for Sparse Eigenvalue Problems
18 0.098798677 8 jmlr-2013-A Theory of Multiclass Boosting
19 0.094300658 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization
20 0.094219156 76 jmlr-2013-Nonparametric Sparsity and Regularization
topicId topicWeight
[(0, 0.027), (5, 0.131), (6, 0.038), (10, 0.071), (23, 0.021), (34, 0.415), (67, 0.025), (68, 0.016), (70, 0.024), (75, 0.024), (85, 0.02), (87, 0.045), (89, 0.013)]
simIndex simValue paperId paperTitle
same-paper 1 0.65110123 53 jmlr-2013-Improving CUR Matrix Decomposition and the Nystrom Approximation via Adaptive Sampling
Author: Shusen Wang, Zhihua Zhang
Abstract: The CUR matrix decomposition and the Nystr¨ m approximation are two important low-rank matrix o approximation techniques. The Nystr¨ m method approximates a symmetric positive semidefinite o matrix in terms of a small number of its columns, while CUR approximates an arbitrary data matrix by a small number of its columns and rows. Thus, CUR decomposition can be regarded as an extension of the Nystr¨ m approximation. o In this paper we establish a more general error bound for the adaptive column/row sampling algorithm, based on which we propose more accurate CUR and Nystr¨ m algorithms with expected o relative-error bounds. The proposed CUR and Nystr¨ m algorithms also have low time complexity o and can avoid maintaining the whole data matrix in RAM. In addition, we give theoretical analysis for the lower error bounds of the standard Nystr¨ m method and the ensemble Nystr¨ m method. o o The main theoretical results established in this paper are novel, and our analysis makes no special assumption on the data matrices. Keywords: large-scale matrix computation, CUR matrix decomposition, the Nystr¨ m method, o randomized algorithms, adaptive sampling
2 0.5736903 73 jmlr-2013-Multicategory Large-Margin Unified Machines
Author: Chong Zhang, Yufeng Liu
Abstract: Hard and soft classifiers are two important groups of techniques for classification problems. Logistic regression and Support Vector Machines are typical examples of soft and hard classifiers respectively. The essential difference between these two groups is whether one needs to estimate the class conditional probability for the classification task or not. In particular, soft classifiers predict the label based on the obtained class conditional probabilities, while hard classifiers bypass the estimation of probabilities and focus on the decision boundary. In practice, for the goal of accurate classification, it is unclear which one to use in a given situation. To tackle this problem, the Largemargin Unified Machine (LUM) was recently proposed as a unified family to embrace both groups. The LUM family enables one to study the behavior change from soft to hard binary classifiers. For multicategory cases, however, the concept of soft and hard classification becomes less clear. In that case, class probability estimation becomes more involved as it requires estimation of a probability vector. In this paper, we propose a new Multicategory LUM (MLUM) framework to investigate the behavior of soft versus hard classification under multicategory settings. Our theoretical and numerical results help to shed some light on the nature of multicategory classification and its transition behavior from soft to hard classifiers. The numerical results suggest that the proposed tuned MLUM yields very competitive performance. Keywords: hard classification, large-margin, soft classification, support vector machine
3 0.36769724 59 jmlr-2013-Large-scale SVD and Manifold Learning
Author: Ameet Talwalkar, Sanjiv Kumar, Mehryar Mohri, Henry Rowley
Abstract: This paper examines the efficacy of sampling-based low-rank approximation techniques when applied to large dense kernel matrices. We analyze two common approximate singular value decomposition techniques, namely the Nyström and Column sampling methods. We present a theoretical comparison between these two methods, provide novel insights regarding their suitability for various tasks and present experimental results that support our theory. Our results illustrate the relative strengths of each method. We next examine the performance of these two techniques on the large-scale task of extracting low-dimensional manifold structure given millions of high-dimensional face images. We address the computational challenges of non-linear dimensionality reduction via Isomap and Laplacian Eigenmaps, using a graph containing about 18 million nodes and 65 million edges. We present extensive experiments on learning low-dimensional embeddings for two large face data sets: CMU-PIE (35 thousand faces) and a web data set (18 million faces). Our comparisons show that the Nyström approximation is superior to the Column sampling method for this task. Furthermore, approximate Isomap tends to perform better than Laplacian Eigenmaps on both clustering and classification with the labeled CMU-PIE data set. Keywords: low-rank approximation, manifold learning, large-scale matrix factorization
4 0.36059776 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization
Author: Yuchen Zhang, John C. Duchi, Martin J. Wainwright
Abstract: We analyze two communication-efficient algorithms for distributed optimization in statistical settings involving large-scale data sets. The first algorithm is a standard averaging method that distributes the N data samples evenly to m machines, performs separate minimization on each subset, and then averages the estimates. We provide a sharp analysis of this average mixture algorithm, showing that under a reasonable set of conditions, the combined parameter achieves mean-squared error (MSE) that decays as O (N −1 + (N/m)−2 ). Whenever m ≤ √N, this guarantee matches the best possible rate achievable by a centralized algorithm having access to all N samples. The second algorithm is a novel method, based on an appropriate form of bootstrap subsampling. Requiring only a single round of communication, it has mean-squared error that decays as O (N −1 + (N/m)−3 ), and so is more robust to the amount of parallelization. In addition, we show that a stochastic gradient-based method attains mean-squared error decaying as O (N −1 + (N/m)−3/2 ), easing computation at the expense of a potentially slower MSE rate. We also provide an experimental evaluation of our methods, investigating their performance both on simulated data and on a large-scale regression problem from the internet search domain. In particular, we show that our methods can be used to efficiently solve an advertisement prediction problem from the Chinese SoSo Search Engine, which involves logistic regression with N ≈ 2.4 × 108 samples and d ≈ 740,000 covariates. Keywords: distributed learning, stochastic optimization, averaging, subsampling
5 0.3553623 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
Author: Tony Cai, Wen-Xin Zhou
Abstract: We consider in this paper the problem of noisy 1-bit matrix completion under a general non-uniform sampling distribution using the max-norm as a convex relaxation for the rank. A max-norm constrained maximum likelihood estimate is introduced and studied. The rate of convergence for the estimate is obtained. Information-theoretical methods are used to establish a minimax lower bound under the general sampling model. The minimax upper and lower bounds together yield the optimal rate of convergence for the Frobenius norm loss. Computational algorithms and numerical performance are also discussed. Keywords: 1-bit matrix completion, low-rank matrix, max-norm, trace-norm, constrained optimization, maximum likelihood estimate, optimal rate of convergence
7 0.35373864 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression
8 0.35182342 17 jmlr-2013-Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees
9 0.35134998 28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning
11 0.35010305 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning
12 0.34872395 50 jmlr-2013-Greedy Feature Selection for Subspace Clustering
13 0.34861237 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation
14 0.34758419 10 jmlr-2013-Algorithms and Hardness Results for Parallel Large Margin Learning
15 0.34695584 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso
16 0.34649178 99 jmlr-2013-Semi-Supervised Learning Using Greedy Max-Cut
17 0.34575644 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems
18 0.3457011 46 jmlr-2013-GURLS: A Least Squares Library for Supervised Learning
19 0.34458673 15 jmlr-2013-Bayesian Canonical Correlation Analysis
20 0.3445583 114 jmlr-2013-The Rate of Convergence of AdaBoost