jmlr jmlr2012 jmlr2012-43 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Petros Drineas, Malik Magdon-Ismail, Michael W. Mahoney, David P. Woodruff
Abstract: The statistical leverage scores of a matrix A are the squared row-norms of the matrix containing its (top) left singular vectors and the coherence is the largest leverage score. These quantities are of interest in recently-popular problems such as matrix completion and Nyström-based low-rank matrix approximation as well as in large-scale statistical data analysis applications more generally; moreover, they are of interest since they define the key structural nonuniformity that must be dealt with in developing fast randomized matrix algorithms. Our main result is a randomized algorithm that takes as input an arbitrary n × d matrix A, with n ≫ d, and that returns as output relative-error approximations to all n of the statistical leverage scores. The proposed algorithm runs (under assumptions on the precise values of n and d) in O(nd log n) time, as opposed to the O(nd²) time required by the naïve algorithm that involves computing an orthogonal basis for the range of A. Our analysis may be viewed in terms of computing a relative-error approximation to an underconstrained least-squares approximation problem, or, relatedly, it may be viewed as an application of Johnson-Lindenstrauss type ideas. Several practically-important extensions of our basic result are also described, including the approximation of so-called cross-leverage scores, the extension of these ideas to matrices with n ≈ d, and the extension to streaming environments. Keywords: matrix coherence, statistical leverage, randomized algorithm
Reference: text
sentIndex sentText sentNum sentScore
1 The statistical leverage scores of a matrix A are the squared row-norms of the matrix containing its (top) left singular vectors and the coherence is the largest leverage score. [sent-11, score-1.663]
2 Our main result is a randomized algorithm that takes as input an arbitrary n × d matrix A, with n ≫ d, and that returns as output relative-error approximations to all n of the statistical leverage scores. [sent-13, score-0.732]
3 A related notion is that of matrix coherence, which has been of interest in recently popular problems such as matrix completion and Nyström-based low-rank matrix approximation (Candes and Recht, 2008; Talwalkar and Rostamizadeh, 2010). [sent-20, score-0.445]
4 As described more precisely below, the statistical leverage scores may be computed as the squared Euclidean norms of the rows of the matrix containing the top left singular vectors, and the coherence of the matrix is the largest statistical leverage score. [sent-24, score-1.835]
5 Statistical leverage scores have a long history in statistical data analysis, where they have been used for outlier detection in regression diagnostics (Hoaglin and Welsch, 1978; Chatterjee and Hadi, 1986). [sent-25, score-0.824]
6 Statistical leverage scores have also proved crucial recently in the development of improved worst-case randomized matrix algorithms that are also amenable to high-quality numerical implementation and that are useful to domain scientists (Drineas et al. [sent-26, score-0.98]
7 We present a randomized algorithm to compute relative-error approximations to every statistical leverage score in time qualitatively faster than the time required to compute an orthogonal basis. [sent-32, score-0.786]
8 1 Overview and Definitions. We start with the following definition of the statistical leverage scores of a matrix. [sent-37, score-0.824]
9 Definition 1 Given an arbitrary n × d matrix A, with n > d, let U denote the n × d matrix consisting of the d left singular vectors of A, and let U(i) denote the i-th row of the matrix U as a row vector. [sent-38, score-0.443]
10 Then, the statistical leverage scores of the rows of A are given by ℓi = ‖U(i)‖₂², for i ∈ {1, . . . , n}. [sent-39, score-0.927]
11 Moreover, the coherence of A is defined as max over i ∈ {1, . . . , n} of ℓi, that is, it is the largest statistical leverage score of A; and the (i, j)-cross-leverage scores ci j are ci j = ⟨U(i), U(j)⟩, that is, they are the dot products between the i-th row and the j-th row of U. [sent-45, score-0.997]
12 That is, the statistical leverage scores of a matrix A are equal to the diagonal elements of the projection matrix onto the span of its columns. [sent-49, score-1.067]
13 Clearly, O(nd²) time suffices to compute all the statistical leverage scores exactly: simply perform the SVD or compute a QR decomposition of A in order to obtain any orthogonal basis for the range of A, and then compute the Euclidean norms of the rows of the resulting matrix. [sent-51, score-1.16]
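To make the naïve O(nd²) procedure concrete, here is a minimal NumPy sketch; the function name and the random test matrix are illustrative and not taken from the paper.

```python
import numpy as np

def exact_leverage_scores(A):
    """Exact statistical leverage scores of the rows of a tall matrix A (O(n d^2) work)."""
    Q, _ = np.linalg.qr(A)          # thin QR: Q is n x d, its columns span range(A)
    return np.sum(Q * Q, axis=1)    # squared Euclidean row norms of Q

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 20))
print(exact_leverage_scores(A).sum())   # equals d = 20 (the rank of A) up to round-off
```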
14 Second, one could also define leverage scores for the columns of a “tall” matrix A, but clearly those are all equal to one unless n < d or A is rank-deficient. [sent-56, score-0.93]
15 Third, and more generally, given a rank parameter k, one can define the statistical leverage scores relative to the best rank-k approximation to A to be the n diagonal elements of the projection matrix onto the span of Ak , the best rank-k approximation to A. [sent-57, score-1.167]
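A hedged sketch of this generalized definition, computed here via a full SVD purely for illustration (the paper's point is precisely to avoid this cost, and the caveats about repeated singular values discussed later apply):

```python
import numpy as np

def rank_k_leverage_scores(A, k):
    """Leverage scores relative to the best rank-k approximation A_k:
    the squared row norms of the top-k left singular vectors of A."""
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    Uk = U[:, :k]
    return np.sum(Uk * Uk, axis=1)   # diagonal of the projection onto the span of A_k
```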
16 Our main algorithm for computing approximations to the statistical leverage scores (see Algorithm 1 in Section 3) will amount to constructing a “randomized sketch” of the input matrix and then computing the Euclidean norms of the rows of that sketch. [sent-60, score-1.099]
17 This sketch can also be used to compute approximations to the large cross-leverage scores (see Algorithm 2 of Section 3). [sent-61, score-0.457]
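The following NumPy sketch conveys the idea of such a randomized sketch under simplifying assumptions: dense Gaussian matrices stand in for the subsampled randomized Hadamard transform Π1 and the Johnson-Lindenstrauss matrix Π2 (so this version gives up the fast transform's speed advantage), and the sketch sizes r1 and r2 are heuristic defaults rather than the values dictated by the theory.

```python
import numpy as np

def approx_leverage_scores(A, r1=None, r2=None, rng=None):
    """Sketch-based approximation of the row leverage scores of a full-column-rank A."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = A.shape
    r1 = 4 * d if r1 is None else r1                                  # rows of Pi_1 (heuristic)
    r2 = max(8, int(np.ceil(4 * np.log(n)))) if r2 is None else r2    # columns of Pi_2 (heuristic)
    Pi1 = rng.standard_normal((r1, n)) / np.sqrt(r1)                  # stand-in for the SRHT
    _, R = np.linalg.qr(Pi1 @ A)                                      # Pi_1 A = Q R, with R d x d
    Pi2 = rng.standard_normal((d, r2)) / np.sqrt(r2)                  # stand-in for the JL matrix
    Omega = A @ np.linalg.solve(R, Pi2)                               # n x r2 sketch of A R^{-1}
    return np.sum(Omega * Omega, axis=1)                              # approximate leverage scores
```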
18 Theorem 2 Let A be a full-rank n × d matrix, with n ≫ d; let ε ∈ (0, 1/2] be an error parameter; and recall the definition of the statistical leverage scores ℓi from Definition 1. [sent-63, score-0.824]
19 Observe that if U consists of d columns from the identity, then the leverage scores are extremely nonuniform: d of them are equal to one and the remainder are equal to zero. [sent-73, score-0.834]
20 On the other hand, if U consists of d columns of a normalized Hadamard matrix (see Section 2.3 for a definition), then the leverage scores are very uniform: all n of them are equal to d/n. [sent-75, score-0.794]
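These two extremes are easy to check numerically; the small demonstration below (sizes chosen only for illustration) contrasts d columns of the identity with d columns of a normalized Hadamard matrix.

```python
import numpy as np

def lev(A):
    Q, _ = np.linalg.qr(A)
    return np.sum(Q * Q, axis=1)

n, d = 8, 4
# Extremely nonuniform: d columns of the identity -> d scores equal to one, the rest zero.
A_spiky = np.eye(n)[:, :d]
# Very uniform: d columns of a normalized Hadamard matrix -> all scores equal to d/n.
H = np.array([[1.0]])
while H.shape[0] < n:                 # Sylvester construction of an n x n Hadamard matrix
    H = np.block([[H, H], [H, -H]])
A_flat = H[:, :d] / np.sqrt(n)

print(lev(A_spiky))   # [1. 1. 1. 1. 0. 0. 0. 0.]
print(lev(A_flat))    # [0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5] = d/n everywhere
```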
21 Assuming d ≤ n ≤ e^d, the running time of the algorithm is O(nd ln(dε⁻¹) + ndε⁻² ln n + d³ε⁻² (ln n) ln(dε⁻¹)). [sent-80, score-0.429]
22 Algorithm 1 provides a relative-error approximation to all of the statistical leverage scores ℓi of A and, assuming d ln d = o(n / ln n), ln n = o(d), and treating ε as a constant, its running time is o(nd²), as desired. [sent-81, score-1.292]
23 As a corollary, the largest leverage score (and thus the coherence) is approximated to relative-error in o(nd 2 ) time. [sent-82, score-0.479]
24 The statistical leverage scores define the key structural nonuniformity that must be dealt with (i. [sent-96, score-0.824]
25 Roughly, the best random sampling algorithms use these scores (or the generalized leverage scores relative to the best rank-k approximation to A) as an importance sampling distribution. [sent-103, score-1.216]
26 On the other hand, the best random projection algorithms rotate to a basis where these scores are approximately uniform and thus in which uniform sampling is appropriate. [sent-104, score-0.433]
27 By our main result, the leverage scores (and thus these probabilities) can be approximated in time that depends on an application of a Fast Johnson-Lindenstrauss Transform. [sent-113, score-0.794]
28 The statistical leverage scores and the scores relative to the best rank-k approximation to A are equal to the diagonal elements of the so-called “hat matrix” (Hoaglin and Welsch, 1978; Chatterjee and Hadi, 1988). [sent-129, score-1.178]
29 When applied to low-rank matrix approximation problems, the leverage score ℓ j quantifies the amount of leverage or influence exerted by the jth column of A on its optimal low-rank approximation. [sent-132, score-1.093]
30 First, note that statistical leverage and matrix coherence are important concepts in statistics and machine learning. [sent-149, score-0.68]
31 But recall that the existence of those projection algorithms in no way implies that it is easy or obvious how to compute the statistical leverage scores efficiently. [sent-152, score-0.925]
32 Sixth, we develop algorithms that can compute leverage scores and related statistics even in streaming environments. [sent-156, score-0.907]
33 4 Empirical Discussion of Our Algorithms. Although the main contribution of our paper is to provide a rigorous theoretical understanding of fast leverage score approximation, our paper does analyze the theoretical performance of what is meant to be a practical algorithm. [sent-158, score-0.479]
34 Moreover, for both Blendenpik and LSRN (when implemented with a Hadamard-based random projection), the hidden constants in the Hadamard-based random projection are so small that the random projection algorithm (and thus the empirical running time of our main algorithm for approximating leverage 3. [sent-172, score-0.617]
35 Finally, for any orthogonal matrix U ∈ Rn×ℓ, let U⊥ ∈ Rn×(n−ℓ) denote an orthogonal matrix whose columns are an orthonormal basis spanning the subspace of Rn that is orthogonal to the subspace spanned by the columns of U (i.e., the columns of U⊥ span the orthogonal complement of the range of U). [sent-196, score-0.455]
36 The hard part of computing the scores ℓi according to Equation (5) is computing an orthogonal matrix U whose columns span the range of A, which takes O(nd²) time. [sent-266, score-0.461]
37 Not surprisingly, the sketch A(Π1 A)† Π2 can be used in other ways: for example, by considering the dot products between two different rows of this randomized sketching matrix (and performing some additional manipulations), Algorithm 2 approximates the large cross-leverage scores of A. [sent-285, score-0.735]
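A hedged sketch of this use of the sketch for cross-leverage scores, with the same Gaussian stand-ins and heuristic sketch sizes as above; forming the full n × n Gram matrix is only sensible for small n, whereas Algorithm 2 targets just the large entries.

```python
import numpy as np

def approx_cross_leverage(A, r1=None, r2=None, rng=None):
    """Approximate cross-leverage scores c_ij = <U_(i), U_(j)> from the randomized sketch."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = A.shape
    r1 = 4 * d if r1 is None else r1
    r2 = max(8, int(np.ceil(4 * np.log(n)))) if r2 is None else r2
    Pi1 = rng.standard_normal((r1, n)) / np.sqrt(r1)       # stand-in for the SRHT
    _, R = np.linalg.qr(Pi1 @ A)
    Pi2 = rng.standard_normal((d, r2)) / np.sqrt(r2)       # stand-in for the JL matrix
    Omega = A @ np.linalg.solve(R, Pi2)                    # n x r2 sketch
    return Omega @ Omega.T    # (i, j) entry approximates <U_(i), U_(j)>; additive error off-diagonal
```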
38 Algorithm 1: Approximating the (diagonal) statistical leverage scores ℓi. [sent-300, score-0.824]
39 This lemma says that the ℓ̂i of Equation (7) can be computed with any QR decomposition, rather than with the SVD; but note that one would still have to post-multiply by Π2, as in Algorithm 1, in order to compute “quickly” the approximations of the leverage scores. [sent-319, score-0.608]
40 Then, the pairwise dot-products of the rows of Ω are additive-error approximations to the leverage scores and cross-leverage scores: |⟨U(i), U(j)⟩ − ⟨Ω(i), Ω(j)⟩| ≤ (3ε/(1−ε)) ‖U(i)‖₂ ‖U(j)‖₂. [sent-325, score-0.934]
41 Lemma 9 For i, j ∈ [n], |⟨U(i), U(j)⟩ − ⟨ûi, ûj⟩| ≤ (ε/(1−ε)) ‖U(i)‖₂ ‖U(j)‖₂. Lemma 10 For i, j ∈ [n], |⟨ûi, ûj⟩ − ⟨ũi, ũj⟩| ≤ 2ε ‖ûi‖₂ ‖ûj‖₂. [sent-388, score-0.618]
42 Lemma 9 states that ⟨ûi, ûj⟩ is an additive-error approximation to all the cross-leverage scores (i ≠ j) and a relative-error approximation for the diagonals (i = j). [sent-390, score-0.537]
43 To conclude the proof, multiply throughout by ‖ûi‖ ‖ûj‖ and use the homogeneity of the inner product, together with the linearity of Π2, to obtain: ⟨ûi, ûj⟩ − 2ε ‖ûi‖ ‖ûj‖ ≤ ⟨ûi Π2, ûj Π2⟩ ≤ ⟨ûi, ûj⟩ + 2ε ‖ûi‖ ‖ûj‖. [sent-408, score-0.948]
44 It follows that the asymptotic running time is O(nd ln(dε⁻¹) + ndε⁻² ln n + d³ε⁻² (ln n) ln(dε⁻¹)). [sent-417, score-0.429]
45 Let ũ1, . . . , ũn denote the rows of Ũ; then, from Lemma 8, ⟨ui, uj⟩ − (3ε/(1−ε)) ‖ui‖ ‖uj‖ ≤ ⟨ũi, ũj⟩ ≤ ⟨ui, uj⟩ + (3ε/(1−ε)) ‖ui‖ ‖uj‖. (11) [sent-465, score-0.823]
46 Given ε and κ, assume that for the pair of vectors ui and uj we have ⟨ui, uj⟩² ≥ (1/κ) ‖UᵀU‖F² + 12ε ‖ui‖² ‖uj‖² = d/κ + 12ε ‖ui‖² ‖uj‖², where the last equality follows from UᵀU = Id and ‖Id‖F² = d, using ε < 0.5. [sent-466, score-0.432]
47 Thus, ⟨ũi, ũj⟩² ≥ ⟨ui, uj⟩² − 12ε ‖ui‖² ‖uj‖² or, equivalently, ⟨ui, uj⟩² ≤ ⟨ũi, ũj⟩² + 12ε ‖ui‖² ‖uj‖². [sent-467, score-0.804]
48 We conclude that ⟨ui, uj⟩² ≥ d/κ + 12ε ‖ui‖² ‖uj‖² implies ⟨ũi, ũj⟩² ≥ d/κ. By construction, Algorithm 3 is invoked with κ′ = κ (1 + 30dε) ≥ κ ‖ΩᵀΩ‖F² / d, and thus ⟨ũi, ũj⟩² ≥ d/κ ≥ ‖ΩᵀΩ‖F² / κ′. [sent-470, score-0.618]
49 Extending Our Algorithm to General Matrices. In this section, we will describe an important extension of our main result, namely the computation of the statistical leverage scores relative to the best rank-k approximation to a general matrix A. [sent-478, score-0.959]
50 More specifically, we consider the estimation of leverage scores for the case of general “fat” matrices, namely input matrices A ∈ Rn×d , where both n and d are large, for example, when d = n or d = Θ(n). [sent-479, score-0.835]
51 Clearly, the leverage scores of any full rank n × n matrix are exactly uniform. [sent-480, score-1.008]
52 In this case, we wish to obtain the statistical leverage scores ℓi = ‖(Uk)(i)‖₂² for Ak = Uk Σk Vkᵀ, the best rank-k approximation to A. [sent-488, score-0.863]
53 Equivalently, we seek the normalized leverage scores pi = ℓi / k. [sent-489, score-0.925]
54 In this case, Uk is not unique and the leverage scores are not well-defined. [sent-495, score-0.794]
55 Moreover, for the obvious (n choose k) equivalent choices for Uk, the leverage scores defined according to any one of these choices do not provide a relative error approximation to the leverage scores defined according to any other choice. [sent-496, score-1.627]
56 In this example, the leverage scores for Ak are well defined. [sent-499, score-0.794]
57 Any algorithm which cannot distinguish the singular values with an error less than γ will confuse the k-th and (k + 1)-th singular vectors and consequently will fail to get an accurate approximation to the leverage scores for Ak . [sent-503, score-1.019]
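The ill-posedness is easy to see on a toy example with a repeated singular value at position k (constructed here purely for illustration):

```python
import numpy as np

# sigma_2 = sigma_3, so the top-2 left singular subspace of A is not unique.
A = np.diag([2.0, 1.0, 1.0, 0.5])
k = 2
# Two equally valid choices of U_k give very different "rank-k leverage scores":
Uk_a = np.eye(4)[:, [0, 1]]
Uk_b = np.eye(4)[:, [0, 2]]
print(np.sum(Uk_a**2, axis=1))   # [1. 1. 0. 0.]
print(np.sum(Uk_b**2, axis=1))   # [1. 0. 1. 0.]
```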
58 To do so, recall that the leverage scores and the related normalized leverage scores of Equation (14) are used to approximate the matrix in some way; for example, we might be seeking a low-rank approximation to the matrix with respect to the spectral norm (Drineas et al. [sent-505, score-1.901]
59 In all these cases, we only care that the estimated leverage scores are a good approximation to the leverage scores of some “good” low-rank approximation to A. [sent-510, score-1.666]
60 We are now ready to define our approximations to the normalized leverage scores of any matrix A ∈ Rn×d given a rank parameter k ≪ min {n, d}. [sent-514, score-1.089]
61 Instead of seeking to approximate the pi of Equation (14) (a problem that is ill-posed as discussed above), we will be satisfied if we can approximate the normalized leverage scores of some matrix X ∈ Sε . [sent-515, score-1.021]
62 We call the numbers p̂i (for all i ∈ [n]) β-approximations to the normalized leverage scores of Ak (the best rank-k approximation to A) if, for some matrix X ∈ Sε, p̂i ≥ (β/k) ‖(UX)(i)‖₂² and ∑ᵢ∈[n] p̂i = 1. [sent-518, score-1.234]
63 Thus, we will seek algorithms whose output is a set of numbers, with the requirement that those numbers are good approximations to the normalized leverage scores of some matrix X ∈ Sε (instead of Ak ). [sent-520, score-0.971]
64 Next, we will give two examples of algorithms that compute such β-approximations to the normalized leverage scores of a general matrix A with a rank parameter k for two popular norms, the spectral norm and the Frobenius norm. [sent-522, score-1.14]
65 1 Leverage Scores for Spectral Norm Approximators. Algorithm 4 approximates the statistical leverage scores of a general matrix A with rank parameter k in the spectral norm case. [sent-524, score-1.076]
66 It takes as inputs a matrix A ∈ Rn×d with rank(A) = ρ and a rank parameter k ≪ ρ, and outputs a set of numbers p̂i for all i ∈ [n], namely our approximations to the normalized leverage scores of A with rank parameter k. [sent-525, score-1.294]
67 The next lemma argues that there exists a matrix X ∈ Rn×d of rank k that is sufficiently close to A (in particular, it is a member of Sε with constant probability) and, additionally, can be written as X = BY, where Y ∈ R2k×d is a matrix of rank k. [sent-526, score-0.47]
68 Approximately compute the statistical leverage scores of the “tall” matrix B by calling Algorithm 1 with inputs B and ε; let ℓ̂i (for all i ∈ [n]) be the outputs of Algorithm 1. [sent-550, score-0.97]
69 Algorithm 4: Approximating the statistical leverage scores of a general matrix A (spectral norm case). [sent-553, score-0.92]
70 If B = (AAᵀ)^q AΠ, where q ≥ ln(1 + √(k/(k−1)) + (e√(2k)/k) √(min{n, d} − k)) / (2 ln(1 + ε/10)) − 1/2, (16) then there exists a matrix X ∈ Rn×d of rank k satisfying X = BY (with Y ∈ R2k×d) such that E[‖A − X‖₂] ≤ (1 + ε/10) ‖A − Ak‖₂. [sent-558, score-0.476]
71 The next step of the proposed algorithm is to approximately compute the leverage scores of B ∈ Rn×2k via Algorithm 1. [sent-569, score-0.844]
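A hedged sketch of this spectral-norm pipeline: a power-iterated Gaussian random projection builds B, and for clarity the leverage scores of B are computed exactly here instead of via Algorithm 1; the number of power iterations q and the Gaussian Π are illustrative stand-ins for the choices prescribed by Equation (16).

```python
import numpy as np

def spectral_norm_leverage(A, k, q=2, rng=None):
    """Illustrative pipeline in the spirit of the spectral-norm case (Algorithm 4)."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = A.shape
    B = A @ rng.standard_normal((d, 2 * k))      # initial sketch A Pi, with Pi of size d x 2k
    for _ in range(q):                           # power iteration: B <- (A A^T) B
        B, _ = np.linalg.qr(B)                   # re-orthonormalize for numerical stability
        B = A @ (A.T @ B)
    Q, _ = np.linalg.qr(B)                       # orthonormal basis for range(B)
    ell = np.sum(Q * Q, axis=1)                  # (here: exact) leverage scores of B
    return ell / ell.sum()                       # normalized approximations p_i
```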
72 Now consider the approximate leverage scores ℓ̂i computed by Algorithm 1 and note that (by Theorem 2), |ℓ̂i − ‖(UB)(i)‖₂²| ≤ ε ‖(UB)(i)‖₂² holds with probability at least 0. [sent-574, score-0.794]
73 Clearly, ‖(UX)(i)‖₂² / k are the normalized leverage scores of the matrix X. [sent-578, score-0.794]
74 These bounds hold with probability at least 0.9; combining them and using Definition 14, we conclude that the scores p̂i of Equation (15) are (1−ε)/(2(1+ε))-approximations to the normalized leverage scores of A with rank parameter k. [sent-580, score-1.358]
75 The proposed algorithm runs in O(ndk ln(min{n, d}) / ln(1 + ε) + nkε⁻² ln n) time. [sent-583, score-0.422]
76 Algorithm 5: Approximating the statistical leverage scores of a general matrix A (Frobenius norm case). [sent-599, score-0.824]
77 2 Leverage Scores for Frobenius Norm Approximators. Algorithm 5 approximates the statistical leverage scores of a general matrix A with rank parameter k in the Frobenius norm case. [sent-601, score-1.038]
78 It takes as inputs a matrix A ∈ Rn×d with rank(A) = ρ and a rank parameter k ≪ ρ, and outputs a set of numbers p̂i for all i ∈ [n], namely our approximations to the normalized leverage scores of A with rank parameter k. [sent-602, score-1.257]
79 Unlike the previous section (the spectral norm case), we will now be able to provide a closed-form formula for this matrix X and, more importantly, the normalized leverage scores of X will be exactly equal to the p̂i returned by our algorithm. [sent-605, score-1.059]
80 Thus, in the parlance of Definition 14, we will get a 1-approximation to the normalized leverage scores of A with rank parameter k. [sent-606, score-0.956]
81 In the above, ΣQᵀA,k ∈ Rk×k is the diagonal matrix containing the top k singular values of QᵀA, and VᵀQᵀA,k ∈ Rk×d is the matrix whose rows are the top k right singular vectors of QᵀA. [sent-624, score-0.481]
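A hedged sketch of the Frobenius-norm construction: Q is an orthonormal basis for the range of a Gaussian sketch of A (an illustrative stand-in for the paper's choice of sketch and width), the small matrix QᵀA is decomposed exactly, and the normalized scores are read off in closed form.

```python
import numpy as np

def frobenius_norm_leverage(A, k, rng=None):
    """Illustrative closed-form construction in the spirit of the Frobenius-norm case (Algorithm 5)."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = A.shape
    Q, _ = np.linalg.qr(A @ rng.standard_normal((d, 2 * k)))   # Q is n x 2k with orthonormal columns
    Uc, _, _ = np.linalg.svd(Q.T @ A, full_matrices=False)     # SVD of the small 2k x d matrix Q^T A
    UXk = Q @ Uc[:, :k]                    # left singular vectors of X = Q (Q^T A)_k
    return np.sum(UXk * UXk, axis=1) / k   # normalized scores p_i, which sum to 1
```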
82 Theorem 18 Given A ∈ Rn×d, a rank parameter k, and an accuracy parameter ε, Algorithm 5 computes a set of normalized leverage scores p̂i that are 1-approximations to the normalized leverage scores of A with rank parameter k with probability at least 0. [sent-632, score-1.999]
83 1 A Related Estimator for the Leverage Scores. Magdon-Ismail (2010) presented the following algorithm to estimate the statistical leverage scores: given as input an n × d matrix A, with n ≫ d, the algorithm proceeds as follows. [sent-641, score-0.605]
84 Magdon-Ismail (2010) argued that the output p̃i achieves an O(ln² n) approximation to all of the (normalized) statistical leverage scores of A in roughly O(nd²/ln n) time. [sent-649, score-1.081]
85 (To our knowledge, prior to our work here, this is the only known estimator that obtains any nontrivial provable approximation to the leverage scores of a matrix in o(nd²) time.) [sent-650, score-0.929]
86 This truncation-renormalization step has the effect of inflating the estimates of the small leverage scores by an O(ln2 n) factor. [sent-654, score-0.794]
87 A direction of considerable practical interest is to evaluate empirically the performance of these two estimators, either for estimating all the leverage scores or (more interestingly) for estimating the largest leverage scores for data matrices for which the leverage scores are quite nonuniform. [sent-660, score-2.423]
88 If x̃opt is computed via Algorithm 6 then, with probability at least 1 − δ, ‖xopt − x̃opt‖₂ ≤ 2ε ‖xopt‖₂. [sent-675, score-0.724]
89 Thus, we conclude our proof by observing that ‖xopt − x̃opt‖₂ = ‖(In + E) Σ⁻¹Uᵀb − Σ⁻¹Uᵀb‖₂ = ‖E Σ⁻¹Uᵀb‖₂ ≤ ‖E‖₂ ‖Σ⁻¹Uᵀb‖₂ ≤ 2ε ‖xopt‖₂. [sent-693, score-0.543]
90 It should be clear that we can use Theorem 2 and the related Algorithm 1 to approximate the statistical leverage scores, thus bypassing the need to exactly compute them. [sent-698, score-0.559]
91 Second, instead of approximating the statistical leverage scores needed in Algorithm 6, we could use the randomized Hadamard transform (essentially post-multiply A by a randomized Hadamard transform to make all statistical leverage scores uniform). [sent-699, score-1.904]
92 3 Extension to Streaming Environments. In this section, we consider the estimation of the leverage scores and of related statistics when the input data set is so large that an appropriate way to view the data is as a data stream (Muthukrishnan, 2005). [sent-704, score-0.794]
93 • As the data streams by, compute TA, for an appropriate problem-dependent linear sketching matrix T , and also compute ΠA, for a random projection matrix Π. [sent-725, score-0.419]
94 With the procedure outlined above, the matrix T is effectively applied to the rows of A R⁻¹ Π2, that is, to the sketch of A whose rows have Euclidean norms approximately equal to the row norms of U, and pairwise inner products approximately equal to those in U. [sent-728, score-0.495]
95 In our setting, we can apply a random projection matrix Π and a linear sketching matrix T which has O(dτ⁻¹ε⁻² log³(n) log(τ⁻¹)) rows in the following manner. [sent-747, score-0.422]
96 (2010), we can find all the leverage scores ‖U(i)‖₂² that are of magnitude at least τ ‖U‖F² in small space and a single pass over the data. [sent-752, score-0.829]
97 Thus, to compute the entropy of the leverage score distribution, we can do the following. [sent-766, score-0.529]
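Offline, once the normalized scores are available, the entropy itself is a one-liner; the point of the streaming discussion is to estimate it without materializing all the pi, so the helper below is only an illustrative offline counterpart.

```python
import numpy as np

def leverage_entropy(p, eps=1e-12):
    """Shannon entropy of a normalized leverage-score distribution p (p_i >= 0, sum(p) = 1)."""
    p = np.asarray(p, dtype=float)
    return float(-np.sum(p * np.log(np.maximum(p, eps))))
```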
98 Another natural problem is that of obtaining samples of rows of A proportional to their leverage score importance sampling probabilities. [sent-775, score-0.616]
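An offline illustration of such leverage-score importance sampling (exact scores via QR, which the streaming setting avoids; the rescaling convention is the usual sketch-and-solve one and is an assumption, not taken from the paper):

```python
import numpy as np

def sample_rows_by_leverage(A, m, rng=None):
    """Draw m rows of A i.i.d. with probabilities proportional to their leverage scores."""
    rng = np.random.default_rng() if rng is None else rng
    Q, _ = np.linalg.qr(A)
    p = np.sum(Q * Q, axis=1)
    p = p / p.sum()                                        # importance-sampling probabilities
    idx = rng.choice(A.shape[0], size=m, replace=True, p=p)
    S = A[idx] / np.sqrt(m * p[idx])[:, None]              # rescaled sample: E[S^T S] = A^T A
    return idx, S
```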
99 , 2010) as used above for finding the large leverage scores. [sent-777, score-0.479]
100 Influential observations, high leverage points, and outliers in linear regression. [sent-875, score-0.479]
wordName wordTfidf (topN-words)
[('leverage', 0.479), ('scores', 0.315), ('drineas', 0.257), ('xopt', 0.181), ('agdon', 0.153), ('oodruff', 0.153), ('rineas', 0.153), ('smail', 0.153), ('ui', 0.144), ('everage', 0.143), ('mahoney', 0.142), ('ln', 0.131), ('ahoney', 0.131), ('oherence', 0.122), ('tatistical', 0.122), ('rank', 0.118), ('fjlt', 0.105), ('rows', 0.103), ('qt', 0.096), ('matrix', 0.096), ('pproximation', 0.095), ('ak', 0.094), ('singular', 0.093), ('rn', 0.091), ('randomized', 0.09), ('pi', 0.087), ('srht', 0.086), ('atrix', 0.085), ('svd', 0.079), ('avron', 0.076), ('chatterjee', 0.076), ('sketching', 0.076), ('coherence', 0.075), ('boutsidis', 0.075), ('ux', 0.073), ('ar', 0.07), ('paschou', 0.067), ('sarl', 0.067), ('streaming', 0.063), ('andoni', 0.057), ('uqt', 0.057), ('welsch', 0.057), ('rokhlin', 0.057), ('rr', 0.057), ('sketch', 0.055), ('qr', 0.054), ('projection', 0.051), ('compute', 0.05), ('orthogonal', 0.05), ('hadamard', 0.049), ('bekas', 0.048), ('blendenpik', 0.048), ('hadi', 0.048), ('hoaglin', 0.048), ('lsrn', 0.048), ('monemizadeh', 0.048), ('quqt', 0.048), ('woodruff', 0.048), ('ailon', 0.047), ('heavy', 0.045), ('ub', 0.044), ('normalized', 0.044), ('uj', 0.042), ('lemma', 0.042), ('matrices', 0.041), ('tar', 0.041), ('ci', 0.041), ('columns', 0.04), ('pseudoinverse', 0.04), ('norms', 0.039), ('approximation', 0.039), ('frobenius', 0.038), ('spectral', 0.038), ('harvey', 0.038), ('uu', 0.038), ('transform', 0.038), ('ta', 0.037), ('tall', 0.037), ('approximations', 0.037), ('aa', 0.037), ('rd', 0.036), ('running', 0.036), ('pass', 0.035), ('sampling', 0.034), ('basis', 0.033), ('row', 0.031), ('hn', 0.031), ('bottleneck', 0.03), ('ls', 0.03), ('pairs', 0.03), ('statistical', 0.03), ('halko', 0.029), ('products', 0.029), ('apack', 0.029), ('ganguly', 0.029), ('jlt', 0.029), ('meng', 0.029), ('ndk', 0.029), ('ndr', 0.029), ('overconstrained', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999964 43 jmlr-2012-Fast Approximation of Matrix Coherence and Statistical Leverage
Author: Petros Drineas, Malik Magdon-Ismail, Michael W. Mahoney, David P. Woodruff
Abstract: The statistical leverage scores of a matrix A are the squared row-norms of the matrix containing its (top) left singular vectors and the coherence is the largest leverage score. These quantities are of interest in recently-popular problems such as matrix completion and Nyström-based low-rank matrix approximation as well as in large-scale statistical data analysis applications more generally; moreover, they are of interest since they define the key structural nonuniformity that must be dealt with in developing fast randomized matrix algorithms. Our main result is a randomized algorithm that takes as input an arbitrary n × d matrix A, with n ≫ d, and that returns as output relative-error approximations to all n of the statistical leverage scores. The proposed algorithm runs (under assumptions on the precise values of n and d) in O(nd log n) time, as opposed to the O(nd²) time required by the naïve algorithm that involves computing an orthogonal basis for the range of A. Our analysis may be viewed in terms of computing a relative-error approximation to an underconstrained least-squares approximation problem, or, relatedly, it may be viewed as an application of Johnson-Lindenstrauss type ideas. Several practically-important extensions of our basic result are also described, including the approximation of so-called cross-leverage scores, the extension of these ideas to matrices with n ≈ d, and the extension to streaming environments. Keywords: matrix coherence, statistical leverage, randomized algorithm
2 0.13146228 103 jmlr-2012-Sampling Methods for the Nyström Method
Author: Sanjiv Kumar, Mehryar Mohri, Ameet Talwalkar
Abstract: The Nyström method is an efficient technique to generate low-rank matrix approximations and is used in several large-scale learning applications. A key aspect of this method is the procedure according to which columns are sampled from the original matrix. In this work, we explore the efficacy of a variety of fixed and adaptive sampling schemes. We also propose a family of ensemble-based sampling algorithms for the Nyström method. We report results of extensive experiments that provide a detailed comparison of various fixed and adaptive sampling techniques, and demonstrate the performance improvement associated with the ensemble Nyström method when used in conjunction with either fixed or adaptive sampling schemes. Corroborating these empirical findings, we present a theoretical analysis of the Nyström method, providing novel error bounds guaranteeing a better convergence rate of the ensemble Nyström method in comparison to the standard Nyström method. Keywords: low-rank approximation, Nyström method, ensemble methods, large-scale learning
3 0.088831365 82 jmlr-2012-On the Necessity of Irrelevant Variables
Author: David P. Helmbold, Philip M. Long
Abstract: This work explores the effects of relevant and irrelevant boolean variables on the accuracy of classifiers. The analysis uses the assumption that the variables are conditionally independent given the class, and focuses on a natural family of learning algorithms for such sources when the relevant variables have a small advantage over random guessing. The main result is that algorithms relying predominately on irrelevant variables have error probabilities that quickly go to 0 in situations where algorithms that limit the use of irrelevant variables have errors bounded below by a positive constant. We also show that accurate learning is possible even when there are so few examples that one cannot determine with high confidence whether or not any individual variable is relevant. Keywords: feature selection, generalization, learning theory
4 0.085422039 99 jmlr-2012-Restricted Strong Convexity and Weighted Matrix Completion: Optimal Bounds with Noise
Author: Sahand Negahban, Martin J. Wainwright
Abstract: We consider the matrix completion problem under a form of row/column weighted entrywise sampling, including the case of uniform entrywise sampling as a special case. We analyze the associated random observation operator, and prove that with high probability, it satisfies a form of restricted strong convexity with respect to weighted Frobenius norm. Using this property, we obtain as corollaries a number of error bounds on matrix completion in the weighted Frobenius norm under noisy sampling and for both exact and near low-rank matrices. Our results are based on measures of the “spikiness” and “low-rankness” of matrices that are less restrictive than the incoherence conditions imposed in previous work. Our technique involves an M-estimator that includes controls on both the rank and spikiness of the solution, and we establish non-asymptotic error bounds in weighted Frobenius norm for recovering matrices lying with ℓq -“balls” of bounded spikiness. Using information-theoretic methods, we show that no algorithm can achieve better estimates (up to a logarithmic factor) over these same sets, showing that our conditions on matrices and associated rates are essentially optimal. Keywords: matrix completion, collaborative filtering, convex optimization
5 0.082669064 97 jmlr-2012-Regularization Techniques for Learning with Matrices
Author: Sham M. Kakade, Shai Shalev-Shwartz, Ambuj Tewari
Abstract: There is growing body of learning problems for which it is natural to organize the parameters into a matrix. As a result, it becomes easy to impose sophisticated prior knowledge by appropriately regularizing the parameters under some matrix norm. This work describes and analyzes a systematic method for constructing such matrix-based regularization techniques. In particular, we focus on how the underlying statistical properties of a given problem can help us decide which regularization function is appropriate. Our methodology is based on a known duality phenomenon: a function is strongly convex with respect to some norm if and only if its conjugate function is strongly smooth with respect to the dual norm. This result has already been found to be a key component in deriving and analyzing several learning algorithms. We demonstrate the potential of this framework by deriving novel generalization and regret bounds for multi-task learning, multi-class learning, and multiple kernel learning. Keywords: regularization, strong convexity, regret bounds, generalization bounds, multi-task learning, multi-class learning, multiple kernel learning
6 0.071495362 83 jmlr-2012-Online Learning in the Embedded Manifold of Low-rank Matrices
7 0.06496878 52 jmlr-2012-Iterative Reweighted Algorithms for Matrix Rank Minimization
8 0.062364984 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods
9 0.062005918 10 jmlr-2012-A Unified View of Performance Metrics: Translating Threshold Choice into Expected Classification Loss
10 0.058500148 80 jmlr-2012-On Ranking and Generalization Bounds
11 0.055882592 58 jmlr-2012-Linear Fitted-Q Iteration with Multiple Reward Functions
12 0.053531196 117 jmlr-2012-Variable Selection in High-dimensional Varying-coefficient Models with Global Optimality
13 0.049993534 73 jmlr-2012-Multi-task Regression using Minimal Penalties
14 0.048477974 108 jmlr-2012-Sparse and Unique Nonnegative Matrix Factorization Through Data Preprocessing
15 0.046122793 76 jmlr-2012-Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics
16 0.044648129 14 jmlr-2012-Activized Learning: Transforming Passive to Active with Improved Label Complexity
17 0.043820262 5 jmlr-2012-A Local Spectral Method for Graphs: With Applications to Improving Graph Partitions and Exploring Data Graphs Locally
18 0.042495921 17 jmlr-2012-An Active Learning Algorithm for Ranking from Pairwise Preferences with an Almost Optimal Query Complexity
19 0.042145126 91 jmlr-2012-Plug-in Approach to Active Learning
20 0.041228879 111 jmlr-2012-Structured Sparsity and Generalization
topicId topicWeight
[(0, -0.209), (1, 0.057), (2, -0.021), (3, -0.003), (4, -0.074), (5, -0.092), (6, 0.024), (7, -0.001), (8, 0.05), (9, -0.218), (10, -0.145), (11, -0.008), (12, -0.037), (13, -0.005), (14, 0.121), (15, -0.049), (16, 0.221), (17, 0.094), (18, 0.15), (19, -0.157), (20, -0.093), (21, -0.203), (22, -0.237), (23, 0.037), (24, 0.145), (25, 0.07), (26, 0.166), (27, 0.075), (28, 0.005), (29, 0.017), (30, -0.003), (31, -0.088), (32, -0.042), (33, -0.082), (34, 0.043), (35, -0.062), (36, 0.005), (37, 0.078), (38, 0.035), (39, 0.067), (40, 0.042), (41, 0.082), (42, 0.058), (43, 0.004), (44, -0.02), (45, 0.028), (46, -0.096), (47, -0.003), (48, 0.093), (49, -0.025)]
simIndex simValue paperId paperTitle
same-paper 1 0.94889116 43 jmlr-2012-Fast Approximation of Matrix Coherence and Statistical Leverage
Author: Petros Drineas, Malik Magdon-Ismail, Michael W. Mahoney, David P. Woodruff
Abstract: The statistical leverage scores of a matrix A are the squared row-norms of the matrix containing its (top) left singular vectors and the coherence is the largest leverage score. These quantities are of interest in recently-popular problems such as matrix completion and Nyström-based low-rank matrix approximation as well as in large-scale statistical data analysis applications more generally; moreover, they are of interest since they define the key structural nonuniformity that must be dealt with in developing fast randomized matrix algorithms. Our main result is a randomized algorithm that takes as input an arbitrary n × d matrix A, with n ≫ d, and that returns as output relative-error approximations to all n of the statistical leverage scores. The proposed algorithm runs (under assumptions on the precise values of n and d) in O(nd log n) time, as opposed to the O(nd²) time required by the naïve algorithm that involves computing an orthogonal basis for the range of A. Our analysis may be viewed in terms of computing a relative-error approximation to an underconstrained least-squares approximation problem, or, relatedly, it may be viewed as an application of Johnson-Lindenstrauss type ideas. Several practically-important extensions of our basic result are also described, including the approximation of so-called cross-leverage scores, the extension of these ideas to matrices with n ≈ d, and the extension to streaming environments. Keywords: matrix coherence, statistical leverage, randomized algorithm
2 0.78557187 103 jmlr-2012-Sampling Methods for the Nyström Method
Author: Sanjiv Kumar, Mehryar Mohri, Ameet Talwalkar
Abstract: The Nyström method is an efficient technique to generate low-rank matrix approximations and is used in several large-scale learning applications. A key aspect of this method is the procedure according to which columns are sampled from the original matrix. In this work, we explore the efficacy of a variety of fixed and adaptive sampling schemes. We also propose a family of ensemble-based sampling algorithms for the Nyström method. We report results of extensive experiments that provide a detailed comparison of various fixed and adaptive sampling techniques, and demonstrate the performance improvement associated with the ensemble Nyström method when used in conjunction with either fixed or adaptive sampling schemes. Corroborating these empirical findings, we present a theoretical analysis of the Nyström method, providing novel error bounds guaranteeing a better convergence rate of the ensemble Nyström method in comparison to the standard Nyström method. Keywords: low-rank approximation, Nyström method, ensemble methods, large-scale learning
3 0.46000397 52 jmlr-2012-Iterative Reweighted Algorithms for Matrix Rank Minimization
Author: Karthik Mohan, Maryam Fazel
Abstract: The problem of minimizing the rank of a matrix subject to affine constraints has applications in several areas including machine learning, and is known to be NP-hard. A tractable relaxation for this problem is nuclear norm (or trace norm) minimization, which is guaranteed to find the minimum rank matrix under suitable assumptions. In this paper, we propose a family of Iterative Reweighted Least Squares algorithms IRLS-p (with 0 ≤ p ≤ 1), as a computationally efficient way to improve over the performance of nuclear norm minimization. The algorithms can be viewed as (locally) minimizing certain smooth approximations to the rank function. When p = 1, we give theoretical guarantees similar to those for nuclear norm minimization, that is, recovery of low-rank matrices under certain assumptions on the operator defining the constraints. For p < 1, IRLSp shows better empirical performance in terms of recovering low-rank matrices than nuclear norm minimization. We provide an efficient implementation for IRLS-p, and also present a related family of algorithms, sIRLS-p. These algorithms exhibit competitive run times and improved recovery when compared to existing algorithms for random instances of the matrix completion problem, as well as on the MovieLens movie recommendation data set. Keywords: matrix rank minimization, matrix completion, iterative algorithms, null-space property
4 0.42194605 83 jmlr-2012-Online Learning in the Embedded Manifold of Low-rank Matrices
Author: Uri Shalit, Daphna Weinshall, Gal Chechik
Abstract: When learning models that are represented in matrix forms, enforcing a low-rank constraint can dramatically improve the memory and run time complexity, while providing a natural regularization of the model. However, naive approaches to minimizing functions over the set of low-rank matrices are either prohibitively time consuming (repeated singular value decomposition of the matrix) or numerically unstable (optimizing a factored representation of the low-rank matrix). We build on recent advances in optimization over manifolds, and describe an iterative online learning procedure, consisting of a gradient step, followed by a second-order retraction back to the manifold. While the ideal retraction is costly to compute, and so is the projection operator that approximates it, we describe another retraction that can be computed efficiently. It has run time and memory complexity of O ((n + m)k) for a rank-k matrix of dimension m × n, when using an online procedure with rank-one gradients. We use this algorithm, L ORETA, to learn a matrix-form similarity measure over pairs of documents represented as high dimensional vectors. L ORETA improves the mean average precision over a passive-aggressive approach in a factorized model, and also improves over a full model trained on pre-selected features using the same memory requirements. We further adapt L ORETA to learn positive semi-definite low-rank matrices, providing an online algorithm for low-rank metric learning. L ORETA also shows consistent improvement over standard weakly supervised methods in a large (1600 classes and 1 million images, using ImageNet) multi-label image classification task. Keywords: low rank, Riemannian manifolds, metric learning, retractions, multitask learning, online learning
5 0.40553892 99 jmlr-2012-Restricted Strong Convexity and Weighted Matrix Completion: Optimal Bounds with Noise
Author: Sahand Negahban, Martin J. Wainwright
Abstract: We consider the matrix completion problem under a form of row/column weighted entrywise sampling, including the case of uniform entrywise sampling as a special case. We analyze the associated random observation operator, and prove that with high probability, it satisfies a form of restricted strong convexity with respect to weighted Frobenius norm. Using this property, we obtain as corollaries a number of error bounds on matrix completion in the weighted Frobenius norm under noisy sampling and for both exact and near low-rank matrices. Our results are based on measures of the “spikiness” and “low-rankness” of matrices that are less restrictive than the incoherence conditions imposed in previous work. Our technique involves an M-estimator that includes controls on both the rank and spikiness of the solution, and we establish non-asymptotic error bounds in weighted Frobenius norm for recovering matrices lying with ℓq -“balls” of bounded spikiness. Using information-theoretic methods, we show that no algorithm can achieve better estimates (up to a logarithmic factor) over these same sets, showing that our conditions on matrices and associated rates are essentially optimal. Keywords: matrix completion, collaborative filtering, convex optimization
6 0.35515633 82 jmlr-2012-On the Necessity of Irrelevant Variables
7 0.32825595 108 jmlr-2012-Sparse and Unique Nonnegative Matrix Factorization Through Data Preprocessing
9 0.31710386 97 jmlr-2012-Regularization Techniques for Learning with Matrices
10 0.31390145 10 jmlr-2012-A Unified View of Performance Metrics: Translating Threshold Choice into Expected Classification Loss
11 0.30463055 73 jmlr-2012-Multi-task Regression using Minimal Penalties
12 0.26128528 117 jmlr-2012-Variable Selection in High-dimensional Varying-coefficient Models with Global Optimality
13 0.25988761 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting
14 0.25589648 49 jmlr-2012-Hope and Fear for Discriminative Training of Statistical Translation Models
15 0.25567204 58 jmlr-2012-Linear Fitted-Q Iteration with Multiple Reward Functions
16 0.25371021 80 jmlr-2012-On Ranking and Generalization Bounds
17 0.24972494 14 jmlr-2012-Activized Learning: Transforming Passive to Active with Improved Label Complexity
18 0.22557221 48 jmlr-2012-High-Dimensional Gaussian Graphical Model Selection: Walk Summability and Local Separation Criterion
19 0.22445881 16 jmlr-2012-Algorithms for Learning Kernels Based on Centered Alignment
20 0.22361179 17 jmlr-2012-An Active Learning Algorithm for Ranking from Pairwise Preferences with an Almost Optimal Query Complexity
topicId topicWeight
[(7, 0.015), (21, 0.026), (26, 0.041), (29, 0.03), (35, 0.021), (49, 0.01), (56, 0.482), (69, 0.014), (75, 0.035), (79, 0.03), (92, 0.122), (96, 0.082)]
simIndex simValue paperId paperTitle
1 0.9672237 61 jmlr-2012-ML-Flex: A Flexible Toolbox for Performing Classification Analyses In Parallel
Author: Stephen R. Piccolo, Lewis J. Frey
Abstract: Motivated by a need to classify high-dimensional, heterogeneous data from the bioinformatics domain, we developed ML-Flex, a machine-learning toolbox that enables users to perform two-class and multi-class classification analyses in a systematic yet flexible manner. ML-Flex was written in Java but is capable of interfacing with third-party packages written in other programming languages. It can handle multiple input-data formats and supports a variety of customizations. MLFlex provides implementations of various validation strategies, which can be executed in parallel across multiple computing cores, processors, and nodes. Additionally, ML-Flex supports aggregating evidence across multiple algorithms and data sets via ensemble learning. This open-source software package is freely available from http://mlflex.sourceforge.net. Keywords: toolbox, classification, parallel, ensemble, reproducible research
same-paper 2 0.80799979 43 jmlr-2012-Fast Approximation of Matrix Coherence and Statistical Leverage
Author: Petros Drineas, Malik Magdon-Ismail, Michael W. Mahoney, David P. Woodruff
Abstract: The statistical leverage scores of a matrix A are the squared row-norms of the matrix containing its (top) left singular vectors and the coherence is the largest leverage score. These quantities are of interest in recently-popular problems such as matrix completion and Nyström-based low-rank matrix approximation as well as in large-scale statistical data analysis applications more generally; moreover, they are of interest since they define the key structural nonuniformity that must be dealt with in developing fast randomized matrix algorithms. Our main result is a randomized algorithm that takes as input an arbitrary n × d matrix A, with n ≫ d, and that returns as output relative-error approximations to all n of the statistical leverage scores. The proposed algorithm runs (under assumptions on the precise values of n and d) in O(nd log n) time, as opposed to the O(nd²) time required by the naïve algorithm that involves computing an orthogonal basis for the range of A. Our analysis may be viewed in terms of computing a relative-error approximation to an underconstrained least-squares approximation problem, or, relatedly, it may be viewed as an application of Johnson-Lindenstrauss type ideas. Several practically-important extensions of our basic result are also described, including the approximation of so-called cross-leverage scores, the extension of these ideas to matrices with n ≈ d, and the extension to streaming environments. Keywords: matrix coherence, statistical leverage, randomized algorithm
3 0.43502682 103 jmlr-2012-Sampling Methods for the Nyström Method
Author: Sanjiv Kumar, Mehryar Mohri, Ameet Talwalkar
Abstract: The Nyström method is an efficient technique to generate low-rank matrix approximations and is used in several large-scale learning applications. A key aspect of this method is the procedure according to which columns are sampled from the original matrix. In this work, we explore the efficacy of a variety of fixed and adaptive sampling schemes. We also propose a family of ensemble-based sampling algorithms for the Nyström method. We report results of extensive experiments that provide a detailed comparison of various fixed and adaptive sampling techniques, and demonstrate the performance improvement associated with the ensemble Nyström method when used in conjunction with either fixed or adaptive sampling schemes. Corroborating these empirical findings, we present a theoretical analysis of the Nyström method, providing novel error bounds guaranteeing a better convergence rate of the ensemble Nyström method in comparison to the standard Nyström method. Keywords: low-rank approximation, Nyström method, ensemble methods, large-scale learning
4 0.37700701 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches
Author: Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, Lin Xiao
Abstract: Online prediction methods are typically presented as serial algorithms running on a single processor. However, in the age of web-scale prediction problems, it is increasingly common to encounter situations where a single processor cannot keep up with the high rate at which inputs arrive. In this work, we present the distributed mini-batch algorithm, a method of converting many serial gradient-based online prediction algorithms into distributed algorithms. We prove a regret bound for this method that is asymptotically optimal for smooth convex loss functions and stochastic inputs. Moreover, our analysis explicitly takes into account communication latencies between nodes in the distributed environment. We show how our method can be used to solve the closely-related distributed stochastic optimization problem, achieving an asymptotically linear speed-up over multiple processors. Finally, we demonstrate the merits of our approach on a web-scale online prediction problem. Keywords: distributed computing, online learning, stochastic optimization, regret bounds, convex optimization
Author: Michael W. Mahoney, Lorenzo Orecchia, Nisheeth K. Vishnoi
Abstract: The second eigenvalue of the Laplacian matrix and its associated eigenvector are fundamental features of an undirected graph, and as such they have found widespread use in scientific computing, machine learning, and data analysis. In many applications, however, graphs that arise have several local regions of interest, and the second eigenvector will typically fail to provide information fine-tuned to each local region. In this paper, we introduce a locally-biased analogue of the second eigenvector, and we demonstrate its usefulness at highlighting local properties of data graphs in a semi-supervised manner. To do so, we first view the second eigenvector as the solution to a constrained optimization problem, and we incorporate the local information as an additional constraint; we then characterize the optimal solution to this new problem and show that it can be interpreted as a generalization of a Personalized PageRank vector; and finally, as a consequence, we show that the solution can be computed in nearly-linear time. In addition, we show that this locally-biased vector can be used to compute an approximation to the best partition near an input seed set in a manner analogous to the way in which the second eigenvector of the Laplacian can be used to obtain an approximation to the best partition in the entire input graph. Such a primitive is useful for identifying and refining clusters locally, as it allows us to focus on a local region of interest in a semi-supervised manner. Finally, we provide a detailed empirical evaluation of our method by showing how it can applied to finding locally-biased sparse cuts around an input vertex seed set in social and information networks. Keywords: spectral graph partitioning, local spectral algorithms, Laplacian matrix, semi-supervised learning, personalized pagerank c 2012 Michael W. Mahoney and Lorenzo Orecchia and Nisheeth K. Vishnoi. M AHONEY, O RECCHIA , AND V ISHNOI
6 0.37058711 75 jmlr-2012-NIMFA : A Python Library for Nonnegative Matrix Factorization
7 0.35538271 53 jmlr-2012-Jstacs: A Java Framework for Statistical Analysis and Classification of Biological Sequences
8 0.3540298 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting
9 0.35027909 84 jmlr-2012-Online Submodular Minimization
10 0.34981567 82 jmlr-2012-On the Necessity of Irrelevant Variables
11 0.34945938 71 jmlr-2012-Multi-Instance Learning with Any Hypothesis Class
12 0.34737992 13 jmlr-2012-Active Learning via Perfect Selective Classification
13 0.34717762 80 jmlr-2012-On Ranking and Generalization Bounds
14 0.34629229 104 jmlr-2012-Security Analysis of Online Centroid Anomaly Detection
15 0.34551415 42 jmlr-2012-Facilitating Score and Causal Inference Trees for Large Observational Studies
16 0.3448239 64 jmlr-2012-Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning
17 0.34020218 34 jmlr-2012-Dynamic Policy Programming
18 0.339726 115 jmlr-2012-Trading Regret for Efficiency: Online Convex Optimization with Long Term Constraints
19 0.33853567 69 jmlr-2012-Mixability is Bayes Risk Curvature Relative to Log Loss
20 0.33777636 117 jmlr-2012-Variable Selection in High-dimensional Varying-coefficient Models with Global Optimality