jmlr jmlr2012 jmlr2012-103 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Sanjiv Kumar, Mehryar Mohri, Ameet Talwalkar
Abstract: The Nyström method is an efficient technique to generate low-rank matrix approximations and is used in several large-scale learning applications. A key aspect of this method is the procedure according to which columns are sampled from the original matrix. In this work, we explore the efficacy of a variety of fixed and adaptive sampling schemes. We also propose a family of ensemble-based sampling algorithms for the Nyström method. We report results of extensive experiments that provide a detailed comparison of various fixed and adaptive sampling techniques, and demonstrate the performance improvement associated with the ensemble Nyström method when used in conjunction with either fixed or adaptive sampling schemes. Corroborating these empirical findings, we present a theoretical analysis of the Nyström method, providing novel error bounds guaranteeing a better convergence rate of the ensemble Nyström method in comparison to the standard Nyström method. Keywords: low-rank approximation, Nyström method, ensemble methods, large-scale learning
Reference: text
sentIndex sentText sentNum sentScore
1 In this work, we explore the efficacy of a variety of fixed and adaptive sampling schemes. [sent-8, score-0.233]
2 We report results of extensive experiments that provide a detailed comparison of various fixed and adaptive sampling techniques, and demonstrate the performance improvement associated with the ensemble Nyström method when used in conjunction with either fixed or adaptive sampling schemes. [sent-10, score-0.666]
3 An attractive solution to this problem involves using the Nyström method to generate a low-rank approximation of the original matrix from a subset of its columns (Williams and Seeger, 2000). [sent-20, score-0.224]
4 This paper presents an analysis of different sampling techniques for the Nyström method both empirically and theoretically. [sent-23, score-0.222]
5 The Nyström method was first introduced to the machine learning community (Williams and Seeger, 2000) using uniform sampling without replacement, and this remains the sampling method most commonly used in practice (Talwalkar et al. [sent-25, score-0.458]
6 More recently, the Nyström method has been theoretically analyzed assuming sampling from fixed, non-uniform distributions over the columns (Drineas and Mahoney, 2005; Belabbas and Wolfe, 2009; Mahoney and Drineas, 2009). [sent-28, score-0.323]
7 In this work, we present novel experiments with several real-world data sets comparing the performance of the Nyström method when used with uniform versus non-uniform sampling distributions. [sent-29, score-0.254]
8 , 2008), our results are the first to compare uniform sampling with the sampling technique for which the Nyström method has theoretical guarantees. [sent-32, score-0.465]
9 (2006), in which an adaptive, error-driven sampling technique with relative error bounds was introduced for the related problem of matrix projection (see Kumar et al. [sent-39, score-0.287]
10 (2006) for adaptive sampling and uses only a small submatrix at each step. [sent-43, score-0.233]
11 Our empirical results suggest a trade-off between time and space requirements, as adaptive techniques spend more time to find a concise subset of informative columns but provide improved approximation accuracy. [sent-44, score-0.252]
12 Next, we show that a new family of algorithms based on mixtures of Nyström approximations, ensemble Nyström algorithms, yields more accurate low-rank approximations than the standard Nyström method. [sent-45, score-0.223]
13 We describe several variants of these algorithms, including one based on simple averaging of p Nyström solutions, an exponential weighting method, and a regression method which consists of estimating the mixture parameters of the ensemble using a few columns sampled from the matrix. [sent-48, score-0.364]
14 We first present a novel bound for the Nyström method as it is often used in practice, that is, using uniform sampling without replacement. [sent-56, score-0.276]
15 In Section 4, we provide a survey of various adaptive techniques used for sampling-based low-rank approximation and introduce a novel adaptive sampling algorithm. [sent-61, score-0.349]
16 We can describe this matrix in terms of its SVD as T_k = U_{T,k} Σ_{T,k} V_{T,k}^⊤, where Σ_{T,k} is a diagonal matrix of the top k singular values of T and U_{T,k} and V_{T,k} are the matrices formed by the associated left and right singular vectors. [sent-76, score-0.22]
17 Let C denote the n × l matrix formed by these columns and W the l × l matrix consisting of the intersection of these l columns with the corresponding l rows of K. [sent-86, score-0.31]
18 Without loss of generality, the columns and rows of K can be rearranged based on this sampling so that K and C can be written as follows: K = [ W, K_{21}^⊤ ; K_{21}, K_{22} ] and C = [ W ; K_{21} ]. [sent-88, score-0.305]
19 Assuming a uniform sampling of the columns, the Nyström method generates a rank-k approximation K̃ of K for k < n defined by: K̃_k^{nys} = C W_k^+ C^⊤ ≈ K, where W_k is the best rank-k approximation of W with respect to the spectral or Frobenius norm and W_k^+ denotes the pseudo-inverse of W_k. [sent-91, score-0.404]
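The construction above can be sketched in a few lines of NumPy. This is a minimal illustration only, assuming K is a symmetric positive semidefinite kernel matrix that fits in memory; the function names (nystrom, best_rank_k_pinv), the eigendecomposition-based pseudo-inverse, and the toy data are our own choices, not code from the paper.

```python
import numpy as np

def best_rank_k_pinv(W, k):
    """Pseudo-inverse of the best rank-k approximation of the SPSD matrix W."""
    # For an SPSD matrix the eigendecomposition plays the role of the SVD in
    # the text, so W_k^+ is formed from the top-k eigenpairs of W.
    vals, vecs = np.linalg.eigh(W)
    idx = np.argsort(vals)[::-1][:k]
    vals_k, vecs_k = vals[idx], vecs[:, idx]
    inv_vals = np.zeros_like(vals_k)
    mask = vals_k > 1e-12
    inv_vals[mask] = 1.0 / vals_k[mask]
    return (vecs_k * inv_vals) @ vecs_k.T

def nystrom(K, cols, k):
    """Rank-k Nystrom approximation C W_k^+ C^T built from the sampled columns."""
    C = K[:, cols]                    # n x l matrix of sampled columns
    W = K[np.ix_(cols, cols)]         # l x l intersection block
    return C @ best_rank_k_pinv(W, k) @ C.T

# Toy usage with uniform sampling without replacement of l = 20 columns.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
K = X @ X.T                           # linear kernel matrix (SPSD)
cols = rng.choice(K.shape[0], size=20, replace=False)
K_nys = nystrom(K, cols, k=10)
rel_err = np.linalg.norm(K - K_nys, "fro") / np.linalg.norm(K, "fro")
```

Note that only the l sampled columns of K are needed to form C and W, which is what makes the method attractive at large scale.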
20 , 2006; Rudelson and Vershynin, 2007); relative error bounds using adaptive sampling techniques (Deshpande et al. [sent-114, score-0.291]
21 The theoretical performance of CUR approximations has been analyzed using a variety of sampling schemes, although the column-selection processes associated with these analyses often require operating on the entire input matrix (Goreinov et al. [sent-123, score-0.263]
22 More recently, it was presented in Williams and Seeger (2000) to speed up kernel algorithms and has been studied theoretically using a variety of sampling schemes (Smola and Schölkopf, 2000; Drineas and Mahoney, 2005; Zhang et al. [sent-128, score-0.23]
23 A closely related algorithm, known as the Incomplete Cholesky Decomposition (Fine and Scheinberg, 2002; Bach and Jordan, 2002, 2005), can also be viewed as a specific sampling technique associated with the Nyström method (Bach and Jordan, 2005). [sent-135, score-0.229]
24 In the remainder of the paper, we will discuss various sampling options that aim to select informative columns from K. [sent-142, score-0.322]
25 We begin with the most common class of sampling techniques that select columns using a fixed probability distribution. [sent-143, score-0.323]
26 The most basic sampling technique involves uniform sampling of the columns. [sent-144, score-0.447]
27 There are additional computational costs associated with these non-uniform sampling methods: O(n) time and space requirements for diagonal sampling and O(n²) time and space for column-norm sampling. [sent-147, score-0.402]
28 These non-uniform sampling techniques are often presented using sampling with replacement to simplify theoretical analysis. [sent-148, score-0.457]
29 Column-norm sampling has been used to analyze a general SVD approximation algorithm. [sent-149, score-0.237]
30 Further, diagonal sampling with replacement was used by Drineas and Mahoney (2005) and Belabbas and Wolfe (2009) to bound the reconstruction error of the Nyström method. [sent-150, score-0.326]
31 In Drineas and Mahoney (2005), however, the authors suggest that column-norm sampling would be a better sampling assumption for the analysis of the Nyström method. [sent-151, score-0.372]
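As a concrete illustration of the three fixed schemes discussed above, the sketch below builds the corresponding column distributions and draws column indices from them. It is a hypothetical helper, not the authors' code: the uniform scheme weights every column equally, the diagonal scheme uses weights proportional to K_ii, and the column-norm scheme uses weights proportional to the squared ℓ2 norm of each column.

```python
import numpy as np

def fixed_sampling_probabilities(K, scheme="uniform"):
    """Column distributions for the fixed sampling schemes discussed above."""
    n = K.shape[0]
    if scheme == "uniform":
        p = np.ones(n)                               # every column equally likely
    elif scheme == "diagonal":
        p = np.diag(K).astype(float)                 # weights proportional to K_ii
    elif scheme == "column-norm":
        p = np.sum(K.astype(float) ** 2, axis=0)     # weights proportional to ||K_i||_2^2
    else:
        raise ValueError("unknown scheme: %s" % scheme)
    return p / p.sum()

def sample_columns(K, l, scheme="uniform", with_replacement=True, seed=0):
    """Draw l column indices of K according to the chosen fixed distribution."""
    rng = np.random.default_rng(seed)
    p = fixed_sampling_probabilities(K, scheme)
    return rng.choice(K.shape[0], size=l, replace=with_replacement, p=p)
```

The diagonal and column-norm branches make the O(n) versus O(n²) cost difference mentioned above visible: one touches only the diagonal, the other every entry of K.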
32 Previous studies have compared uniform and non-uniform sampling in a more restrictive setting, using fewer types of kernels and focusing only on column-norm sampling (Drineas et al. [sent-156, score-0.236]
33 Experiments. We used the data sets described in the previous section to test the approximation accuracy for each sampling method. [sent-166, score-0.271]
34 Low-rank approximations of K were generated using the Nyström method along with these sampling methods, and we measured the accuracy of reconstruction relative to the optimal rank-k approximation. [sent-167, score-0.362]
35 Although Drineas and Mahoney (2005) claim to weight each column proportionally to K_ii², they in fact use the diagonal sampling we present in this work, that is, weights proportional to K_ii (Drineas, 2008). [sent-168, score-0.235]
36 Table 1: Description of the data sets and kernels used in fixed and adaptive sampling experiments (Sim et al. [sent-170, score-0.233]
Data set: PIE-2.7K, PIE-7K, MNIST, ESS, ABN
Type: faces (profile), faces (front), digit images, proteins, abalones
n: 2731, 7412, 4000, 4728, 4177
d: 2304, 2304, 784, 16, 8
Kernel: linear, linear, linear, RBF, RBF
37 Figure 1: (a) Nyström relative accuracy for various sampling techniques on PIE-7K. [sent-228, score-0.278]
38 (b) Nyström relative accuracy for various sampling methods for two values of l/n with k = 100. [sent-229, score-0.26]
39 No error (‘-’) is reported for diagonal sampling with RBF kernels since diagonal sampling is equivalent to uniform sampling in this case. [sent-232, score-0.668]
40 We first compared the effectiveness of the three sampling techniques using sampling with replacement. [sent-236, score-0.39]
41 The results across all data sets show that uniform sampling outperforms all other methods, while being much cheaper computationally and space-wise. [sent-238, score-0.236]
42 Thus, while non-uniform sampling techniques may be effective in extreme cases where a few columns of K dominate in terms of ‖·‖₂, this situation does not tend to arise with real-world data, where uniform sampling is most effective. [sent-239, score-0.559]
43 Figure 2: Comparison of uniform sampling with and without replacement measured by the difference in relative accuracy. [sent-281, score-0.343]
44 (a) Improvement in relative accuracy for PIE-7K when sampling without replacement. [sent-282, score-0.26]
45 (b) Improvement in relative accuracy when sampling without replacement across all data sets for various l/n percentages. [sent-283, score-0.327]
46 Next, we compared the performance of uniform sampling with and without replacement. [sent-284, score-0.236]
47 The results show that uniform sampling without replacement improves the accuracy of the Nyström method over sampling with replacement, even when sampling less than 5% of the total columns. [sent-287, score-0.749]
48 In summary, these experimental results show that uniform sampling without replacement is the cheapest and most efficient sampling technique across several data sets (it is also the most commonly used method in practice). [sent-288, score-0.532]
49 In this section, we discuss various sampling options that aim to select more informative columns from K, while storing and operating on only O(ln) entries of K. [sent-292, score-0.322]
50 Whereas SMGA was proposed as a sampling scheme to be used in conjunction with the Nyström method, ICL generates a low-rank factorization of K on-the-fly as it adaptively selects columns based on potential pivots of the Incomplete Cholesky Decomposition. [sent-297, score-0.305]
51 Related greedy adaptive sampling techniques were proposed by Ouimet and Bengio (2005) and Liu et al. [sent-302, score-0.251]
52 Finally, an adaptive sampling technique with strong theoretical foundations (adaptive-full) was proposed in Deshpande et al. [sent-308, score-0.258]
53 (2006) and then present empirical results comparing the performance of this new algorithm with uniform sampling as well as SMGA, ICL, K-means and the adaptive-full techniques. [sent-312, score-0.236]
54 Adaptive Nyström Sampling. Instead of sampling all l columns from a fixed distribution, adaptive sampling alternates between selecting a set of columns and updating the distribution over all the columns. [sent-314, score-0.657]
55 The probabilities are then updated as a function of previously chosen columns and s new columns are sampled and incorporated in C′ . [sent-316, score-0.262]
56 for j ∈ [1 . . n] do: if j ∈ R then P_j ← 0, else P_j ← ‖E_j‖₂²; P ← P/‖P‖₂; return P. Figure 3: The adaptive sampling technique (Deshpande et al. [sent-327, score-0.258]
57 We propose a simple sampling technique (adaptive-partial) that incorporates the advantages of adaptive sampling while avoiding the computational and storage burdens of the adaptive-full technique. [sent-329, score-0.462]
58 One artifact of using a k′ smaller than the rank of C′ is that all the columns of K will have a non-zero probability of being selected, which could lead to the selection of previously selected columns in the next iteration. [sent-334, score-0.274]
59 for j ∈ [1 . . n] do: if j ∈ R then P_j ← 0 (sample without replacement), else P_j ← ‖E_(j)‖₂²; P ← P/‖P‖₂; return P. Figure 4: The proposed adaptive sampling technique that uses a small subset of the original matrix K to adaptively choose columns. [sent-342, score-0.361]
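The loop below is a sketch of this adaptive strategy in NumPy: alternate between drawing s columns and recomputing the distribution from the residual of the current fit, zeroing the probability of columns already chosen. For simplicity it computes the residual against the full matrix, in the spirit of the adaptive-full routine of Figure 3; the adaptive-partial variant of Figure 4 restricts this computation to a small submatrix, which this illustration does not do. The names, the rank-k′ truncation, and the stopping logic are illustrative assumptions.

```python
import numpy as np

def adaptive_column_sampling(K, l, s, k_prime, seed=0):
    """Select l columns of K, s at a time, reweighting by the residual error."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    chosen = list(rng.choice(n, size=s, replace=False))     # initial uniform draw
    while len(chosen) < l:
        C = K[:, chosen]
        W = K[np.ix_(chosen, chosen)]
        # Rank-k' pseudo-inverse of W for the current Nystrom-style fit.
        vals, vecs = np.linalg.eigh(W)
        idx = np.argsort(vals)[::-1][:k_prime]
        top = vals[idx]
        inv = np.where(top > 1e-12, 1.0 / np.maximum(top, 1e-12), 0.0)
        W_pinv = (vecs[:, idx] * inv) @ vecs[:, idx].T
        E = K - C @ W_pinv @ C.T             # residual of the current approximation
        # Zero the probability of columns already in the sample, weight the rest
        # by the squared norm of their residual column, then renormalize.
        P = np.sum(E ** 2, axis=0)
        P[chosen] = 0.0
        P = P / P.sum()
        new = rng.choice(n, size=min(s, l - len(chosen)), replace=False, p=P)
        chosen.extend(int(j) for j in new)
    return np.array(chosen)
```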
60 Table 2: Nyström spectral reconstruction accuracy for various sampling methods for all data sets for k = 100 and three l/n percentages. [sent-515, score-0.311]
61 Experiments. We used the data sets in Table 1, and compared the effect of different sampling techniques on the relative accuracy of Nyström spectral reconstruction for k = 100. [sent-520, score-0.369]
62 We further note that ICL performs the worst of all the adaptive techniques, and it is often worse than random sampling (this observation is also noted by Zhang et al. [sent-618, score-0.233]
63 The empirical results also suggest that the performance gain due to adaptive sampling is inversely proportional to the percentage of sampled columns—random sampling actually outperforms many of the adaptive approaches when sampling 20% of the columns. [sent-620, score-0.676]
64 (2010), (5) provides an alternative description of the ensemble Nyström method as a block-diagonal approximation of W_ens^+, where W_ens is the pl × pl SPSD matrix associated with the pl sampled columns. [sent-650, score-0.363]
65 In this study, we focus on the class of base learners generated from Nyström approximation with uniform sampling of columns or from the adaptive K-means method. [sent-666, score-0.592]
66 Alternatively, these base learners could be generated using other (or a combination of) sampling schemes discussed in Sections 3 and 4. [sent-667, score-0.347]
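The ensemble construction itself is straightforward to sketch: draw p disjoint samples of l columns, build one base Nyström approximation per sample (reusing the hypothetical nystrom helper from the earlier sketch), and mix them. The exponential-weight branch below, which weights each learner by exp(−η · error) measured on a few held-out columns, reflects our reading of the exponential weighting variant and should be treated as an assumption about its exact form.

```python
import numpy as np

def ensemble_nystrom(K, p, l, k, weights="uniform", eta=1.0, seed=0):
    """Mix p base Nystrom approximations built from disjoint uniform samples."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    # Assumes p * l <= n so that the p samples can be disjoint.
    samples = rng.choice(n, size=p * l, replace=False).reshape(p, l)
    base = [nystrom(K, cols, k) for cols in samples]   # base learners (earlier sketch)

    if weights == "uniform":
        mu = np.full(p, 1.0 / p)
    elif weights == "exponential":
        # Weight each learner by exp(-eta * error), with the error measured on a
        # small held-out set of columns (the exact error measure is an assumption).
        held_out = rng.choice(n, size=min(20, n), replace=False)
        errs = np.array([np.linalg.norm(K[:, held_out] - A[:, held_out], "fro")
                         for A in base])
        mu = np.exp(-eta * errs)
        mu = mu / mu.sum()
    else:
        raise ValueError("unknown weighting: %s" % weights)

    return sum(m * A for m, A in zip(mu, base))
```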
67 Let K_V denote the matrix formed by using the samples from V as its columns and let K̂_V^r denote the submatrix of K̂_r containing the columns corresponding to the columns in V. [sent-669, score-0.393]
68 The mixture weights can be computed in constant time for the uniform method, in O(psn) for the exponential weight method, or in O(p³ + p²ns) for the ridge regression method, where O(p²ns) time is required to compute a p × p matrix and O(p³) time is required for inverting this matrix. [sent-678, score-0.261]
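A sketch of the ridge regression variant is given below: s columns of K are sampled, each base learner's reconstruction of those columns becomes one regression feature, and the p mixture weights are obtained by solving a p × p ridge system, matching the O(p³ + p²ns) cost noted above. The regularization value, the absence of centering, and the helper name are assumptions for illustration.

```python
import numpy as np

def ridge_mixture_weights(K, base, s=20, lam=1e-2, seed=1):
    """Learn the p mixture weights by ridge regression on s sampled columns of K."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    cols = rng.choice(n, size=s, replace=False)
    # Each base learner contributes one regression feature: its reconstruction
    # of the sampled columns, flattened into a vector of length n * s.
    A = np.stack([approx[:, cols].ravel() for approx in base], axis=1)   # (n*s) x p
    y = K[:, cols].ravel()
    p = A.shape[1]
    # Forming A^T A costs O(p^2 * ns) and solving the p x p system costs O(p^3),
    # matching the cost analysis in the text.
    return np.linalg.solve(A.T @ A + lam * np.eye(p), A.T @ y)
```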
69 In contrast, the ensemble Nyström method represents K as the sum of products of low-rank matrices, where each of the p terms corresponds to a base learner. [sent-688, score-0.262]
70 Hence, when working on a cluster, using an ensemble Nyström approximation along with the Woodbury approximation requires only a log₂(p) factor more time than using the standard Nyström method. [sent-696, score-0.284]
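To make the role of the Woodbury approximation concrete, the sketch below factors a rank-k Nyström approximation as L L^⊤ and solves the regularized system (L L^⊤ + λI)x = b with only a k × k linear solve instead of an n × n one. The factorization L = C V_k Λ_k^{−1/2} is a standard manipulation of C W_k^+ C^⊤, not code taken from the paper, and the same idea applies term by term to the ensemble approximation.

```python
import numpy as np

def nystrom_factor(K, cols, k):
    """Return L such that L @ L.T equals the rank-k Nystrom approximation C W_k^+ C^T."""
    C = K[:, cols]
    W = K[np.ix_(cols, cols)]
    vals, vecs = np.linalg.eigh(W)
    idx = np.argsort(vals)[::-1][:k]
    top = vals[idx]
    inv_sqrt = np.where(top > 1e-12, 1.0 / np.sqrt(np.maximum(top, 1e-12)), 0.0)
    return C @ (vecs[:, idx] * inv_sqrt)          # n x k factor

def woodbury_solve(L, lam, b):
    """Solve (L L^T + lam * I) x = b via Woodbury: only a k x k system is inverted."""
    k = L.shape[1]
    inner = lam * np.eye(k) + L.T @ L             # k x k
    return (b - L @ np.linalg.solve(inner, L.T @ b)) / lam
```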
71 Ensemble Nyström with Various Mixture Weights. We first show results for the ensemble Nyström method using different techniques to choose the mixture weights, as previously discussed. [sent-704, score-0.239]
72 In these experiments, we focused on base learners generated via the Nyström method with uniform sampling of columns. [sent-705, score-0.393]
73 Furthermore, for the exponential and the ridge regression variants, we sampled a set of s = 20 columns and used an additional 20 columns (s′) as a hold-out set for selecting the optimal values of η and λ. [sent-706, score-0.379]
74 Table 4: Relative accuracy for the ensemble Nyström method with Nyström base learners generated with uniform sampling of columns or via the K-means algorithm. [sent-770, score-0.728]
75 Effect of Rank. As mentioned earlier, the rank of the ensemble approximations can be p times greater than the rank of each of the base learners. [sent-773, score-0.357]
76 Hence, to validate the results in Figure 5, we performed a simple experiment in which we compared the performance of the best base learner to the best rank k approximation of the uniform ensemble approximation (obtained via SVD of the uniform ensemble approximation). [sent-774, score-0.69]
77 In these experiments, we again used base learners generated via the Nyström method with uniform sampling of columns. [sent-782, score-0.393]
78 Ensemble K-means Nyström. In the previous experiments, we focused on base learners generated via the Nyström method with uniform sampling of columns. [sent-785, score-0.393]
79 We fixed the number of base learners to p = 10 and, when using ridge regression to learn weights, we set s = s′ = 20. [sent-787, score-0.256]
80 As shown in Table 4, similar performance gains in comparison to the average or best base learner can be seen when using an ensemble of base learners derived from the K-means algorithm. [sent-788, score-0.409]
84 Figure 5: Relative accuracy for the ensemble Nyström method using uniform (‘uni’), exponential (‘exp’), ridge (‘ridge’) and optimal (‘optimal’) mixture weights as well as the best (‘best b.l.’) of the p base learners used to create the ensemble approximations. Each panel (visible titles include Ensemble Method − MNIST, − ESS and − ABN) plots relative accuracy against the number of base learners (p). [sent-801, score-0.741]
89 Figure 6: Relative accuracy for the ensemble Nyström method using uniform (‘uni’) mixture weights, the optimal rank-k approximation of the uniform ensemble result (‘uni rank-k’) as well as the best (‘best b.l.’) of the p base learners used to create the ensemble approximations. Each panel (visible titles include Effect of Rank − MNIST, − ESS and − ABN) plots relative accuracy against the number of base learners (p). [sent-817, score-0.815]
91 sampling without replacement, and holds for both the standard Nyström method as well as the ensemble Nyström method discussed in Section 5. [sent-820, score-0.404]
92 Our theoretical analysis of the Nyström method uses some results previously shown by Drineas and Mahoney (2005) as well as the following generalization of McDiarmid's concentration bound to sampling without replacement (Cortes et al. [sent-821, score-0.271]
93 Theorem 2. Let K̃ denote the rank-k Nyström approximation of K based on l columns sampled uniformly at random without replacement from K, and K_k the best rank-k approximation of K. [sent-858, score-0.312]
94 Proof. To bound the norm-2 error of the Nyström method in the scenario of sampling without replacement, we start with the following general inequality given by Drineas and Mahoney (2005) [Proof of Lemma 4]: ‖K − K̃‖₂ ≤ ‖K − K_k‖₂ + 2‖XX^⊤ − ZZ^⊤‖₂, where Z = √(n/l) XS. [sent-860, score-0.226]
95 Let S′ be a sampling matrix selecting the same columns as S except for one, and let Z′ denote √(n/l) XS′. [sent-862, score-0.341]
96 Theorem 3. Let S be a sample of pl columns drawn uniformly at random without replacement from K, decomposed into p subsamples of size l, S₁, . [sent-875, score-0.221]
97 However, the bounds for the ensemble Nyström are tighter than those for any Nyström expert based on a single sample of size l, even for a uniform weighting. [sent-892, score-0.25]
98 We discussed both fixed and adaptive methods for sampling the columns of a matrix. [sent-896, score-0.352]
99 We saw that the approximation performance is significantly affected by the choice of the sampling algorithm and also that there is a tradeoff between choosing a more informative set of columns and the efficiency of the sampling algorithm. [sent-897, score-0.559]
100 We concluded with a theoretical analysis of the Nyström method (both the standard approach and the ensemble method) as it is often used in practice, namely using uniform sampling without replacement. [sent-899, score-0.436]
wordName wordTfidf (topN-words)
[('nystr', 0.738), ('sampling', 0.186), ('ensemble', 0.182), ('talwalkar', 0.159), ('drineas', 0.144), ('om', 0.127), ('ystr', 0.127), ('columns', 0.119), ('abn', 0.118), ('kk', 0.118), ('ridge', 0.117), ('ess', 0.116), ('umar', 0.11), ('mahoney', 0.096), ('ohri', 0.094), ('icl', 0.093), ('kens', 0.093), ('kr', 0.09), ('learners', 0.077), ('deshpande', 0.076), ('xx', 0.074), ('mnist', 0.072), ('pie', 0.068), ('replacement', 0.067), ('ethod', 0.067), ('ampling', 0.067), ('kumar', 0.067), ('zr', 0.065), ('ameet', 0.065), ('base', 0.062), ('svd', 0.061), ('ethods', 0.06), ('kmax', 0.06), ('belabbas', 0.059), ('smga', 0.059), ('spsd', 0.059), ('woodbury', 0.058), ('zz', 0.052), ('approximation', 0.051), ('nys', 0.051), ('sanjiv', 0.051), ('rep', 0.051), ('petros', 0.051), ('uniform', 0.05), ('spectral', 0.048), ('singular', 0.047), ('adaptive', 0.047), ('dmax', 0.046), ('uni', 0.044), ('reconstruction', 0.043), ('cur', 0.043), ('kv', 0.043), ('approximations', 0.041), ('relative', 0.04), ('rank', 0.036), ('matrix', 0.036), ('pl', 0.035), ('mehryar', 0.035), ('frobenius', 0.034), ('sr', 0.034), ('accuracy', 0.034), ('diagonal', 0.03), ('mohri', 0.029), ('kii', 0.029), ('cortes', 0.027), ('cand', 0.026), ('learner', 0.026), ('zl', 0.026), ('nkmax', 0.025), ('unys', 0.025), ('technique', 0.025), ('matrices', 0.024), ('sampled', 0.024), ('wolfe', 0.023), ('rbf', 0.023), ('kernel', 0.022), ('corinna', 0.022), ('fine', 0.022), ('pj', 0.022), ('schemes', 0.022), ('reo', 0.022), ('fowlkes', 0.022), ('ens', 0.022), ('sch', 0.021), ('mixture', 0.021), ('lkopf', 0.02), ('bach', 0.02), ('smola', 0.02), ('halko', 0.02), ('pdate', 0.02), ('santosh', 0.02), ('weights', 0.019), ('storage', 0.018), ('rostamizadeh', 0.018), ('qr', 0.018), ('method', 0.018), ('techniques', 0.018), ('expert', 0.018), ('informative', 0.017), ('bcd', 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 0.9999997 103 jmlr-2012-Sampling Methods for the Nyström Method
Author: Sanjiv Kumar, Mehryar Mohri, Ameet Talwalkar
2 0.13146228 43 jmlr-2012-Fast Approximation of Matrix Coherence and Statistical Leverage
Author: Petros Drineas, Malik Magdon-Ismail, Michael W. Mahoney, David P. Woodruff
Abstract: The statistical leverage scores of a matrix A are the squared row-norms of the matrix containing its (top) left singular vectors and the coherence is the largest leverage score. These quantities are of interest in recently-popular problems such as matrix completion and Nyström-based low-rank matrix approximation as well as in large-scale statistical data analysis applications more generally; moreover, they are of interest since they define the key structural nonuniformity that must be dealt with in developing fast randomized matrix algorithms. Our main result is a randomized algorithm that takes as input an arbitrary n × d matrix A, with n ≫ d, and that returns as output relative-error approximations to all n of the statistical leverage scores. The proposed algorithm runs (under assumptions on the precise values of n and d) in O(nd log n) time, as opposed to the O(nd²) time required by the naïve algorithm that involves computing an orthogonal basis for the range of A. Our analysis may be viewed in terms of computing a relative-error approximation to an underconstrained least-squares approximation problem, or, relatedly, it may be viewed as an application of Johnson-Lindenstrauss type ideas. Several practically-important extensions of our basic result are also described, including the approximation of so-called cross-leverage scores, the extension of these ideas to matrices with n ≈ d, and the extension to streaming environments. Keywords: matrix coherence, statistical leverage, randomized algorithm
3 0.071236059 16 jmlr-2012-Algorithms for Learning Kernels Based on Centered Alignment
Author: Corinna Cortes, Mehryar Mohri, Afshin Rostamizadeh
Abstract: This paper presents new and effective algorithms for learning kernels. In particular, as shown by our empirical results, these algorithms consistently outperform the so-called uniform combination solution that has proven to be difficult to improve upon in the past, as well as other algorithms for learning kernels based on convex combinations of base kernels in both classification and regression. Our algorithms are based on the notion of centered alignment which is used as a similarity measure between kernels or kernel matrices. We present a number of novel algorithmic, theoretical, and empirical results for learning kernels based on our notion of centered alignment. In particular, we describe efficient algorithms for learning a maximum alignment kernel by showing that the problem can be reduced to a simple QP and discuss a one-stage algorithm for learning both a kernel and a hypothesis based on that kernel using an alignment-based regularization. Our theoretical results include a novel concentration bound for centered alignment between kernel matrices, the proof of the existence of effective predictors for kernels with high alignment, both for classification and for regression, and the proof of stability-based generalization bounds for a broad family of algorithms for learning kernels based on centered alignment. We also report the results of experiments with our centered alignment-based algorithms in both classification and regression. Keywords: kernel methods, learning kernels, feature selection
4 0.049715575 99 jmlr-2012-Restricted Strong Convexity and Weighted Matrix Completion: Optimal Bounds with Noise
Author: Sahand Negahban, Martin J. Wainwright
Abstract: We consider the matrix completion problem under a form of row/column weighted entrywise sampling, including the case of uniform entrywise sampling as a special case. We analyze the associated random observation operator, and prove that with high probability, it satisfies a form of restricted strong convexity with respect to weighted Frobenius norm. Using this property, we obtain as corollaries a number of error bounds on matrix completion in the weighted Frobenius norm under noisy sampling and for both exact and near low-rank matrices. Our results are based on measures of the “spikiness” and “low-rankness” of matrices that are less restrictive than the incoherence conditions imposed in previous work. Our technique involves an M-estimator that includes controls on both the rank and spikiness of the solution, and we establish non-asymptotic error bounds in weighted Frobenius norm for recovering matrices lying with ℓq -“balls” of bounded spikiness. Using information-theoretic methods, we show that no algorithm can achieve better estimates (up to a logarithmic factor) over these same sets, showing that our conditions on matrices and associated rates are essentially optimal. Keywords: matrix completion, collaborative filtering, convex optimization
5 0.044364382 95 jmlr-2012-Random Search for Hyper-Parameter Optimization
Author: James Bergstra, Yoshua Bengio
Abstract: Grid search and manual search are the most widely used strategies for hyper-parameter optimization. This paper shows empirically and theoretically that randomly chosen trials are more efficient for hyper-parameter optimization than trials on a grid. Empirical evidence comes from a comparison with a large previous study that used grid search and manual search to configure neural networks and deep belief networks. Compared with neural networks configured by a pure grid search, we find that random search over the same domain is able to find models that are as good or better within a small fraction of the computation time. Granting random search the same computational budget, random search finds better models by effectively searching a larger, less promising configuration space. Compared with deep belief networks configured by a thoughtful combination of manual search and grid search, purely random search over the same 32-dimensional configuration space found statistically equal performance on four of seven data sets, and superior performance on one of seven. A Gaussian process analysis of the function from hyper-parameters to validation set performance reveals that for most data sets only a few of the hyper-parameters really matter, but that different hyper-parameters are important on different data sets. This phenomenon makes grid search a poor choice for configuring algorithms for new data sets. Our analysis casts some light on why recent "High Throughput" methods achieve surprising success: they appear to search through a large number of hyper-parameters because most hyper-parameters do not matter much. We anticipate that growing interest in large hierarchical models will place an increasing burden on techniques for hyper-parameter optimization; this work shows that random search is a natural baseline against which to judge progress in the development of adaptive (sequential) hyper-parameter optimization algorithms. Keyword
6 0.036086097 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods
7 0.03521372 97 jmlr-2012-Regularization Techniques for Learning with Matrices
8 0.033690508 61 jmlr-2012-ML-Flex: A Flexible Toolbox for Performing Classification Analyses In Parallel
9 0.033548936 47 jmlr-2012-GPLP: A Local and Parallel Computation Toolbox for Gaussian Process Regression
10 0.032952942 52 jmlr-2012-Iterative Reweighted Algorithms for Matrix Rank Minimization
11 0.032806024 62 jmlr-2012-MULTIBOOST: A Multi-purpose Boosting Package
12 0.031587038 48 jmlr-2012-High-Dimensional Gaussian Graphical Model Selection: Walk Summability and Local Separation Criterion
13 0.031499203 108 jmlr-2012-Sparse and Unique Nonnegative Matrix Factorization Through Data Preprocessing
14 0.029302225 83 jmlr-2012-Online Learning in the Embedded Manifold of Low-rank Matrices
15 0.027733969 35 jmlr-2012-EP-GIG Priors and Applications in Bayesian Sparse Learning
16 0.026918337 17 jmlr-2012-An Active Learning Algorithm for Ranking from Pairwise Preferences with an Almost Optimal Query Complexity
17 0.02664707 28 jmlr-2012-Confidence-Weighted Linear Classification for Text Categorization
18 0.026574112 66 jmlr-2012-Metric and Kernel Learning Using a Linear Transformation
20 0.025806203 77 jmlr-2012-Non-Sparse Multiple Kernel Fisher Discriminant Analysis
topicId topicWeight
[(0, -0.134), (1, 0.034), (2, 0.067), (3, 0.031), (4, -0.016), (5, -0.009), (6, -0.013), (7, -0.039), (8, -0.017), (9, -0.125), (10, -0.128), (11, -0.01), (12, -0.043), (13, 0.022), (14, 0.083), (15, 0.031), (16, 0.211), (17, 0.034), (18, 0.091), (19, -0.134), (20, -0.102), (21, -0.285), (22, -0.178), (23, 0.131), (24, 0.121), (25, 0.068), (26, 0.26), (27, 0.155), (28, 0.043), (29, 0.077), (30, -0.013), (31, 0.033), (32, 0.159), (33, 0.014), (34, -0.023), (35, -0.097), (36, 0.193), (37, 0.061), (38, -0.069), (39, 0.148), (40, -0.022), (41, 0.054), (42, 0.024), (43, -0.121), (44, -0.026), (45, 0.057), (46, -0.158), (47, 0.123), (48, 0.03), (49, -0.013)]
simIndex simValue paperId paperTitle
same-paper 1 0.96037513 103 jmlr-2012-Sampling Methods for the Nyström Method
Author: Sanjiv Kumar, Mehryar Mohri, Ameet Talwalkar
2 0.66917872 43 jmlr-2012-Fast Approximation of Matrix Coherence and Statistical Leverage
Author: Petros Drineas, Malik Magdon-Ismail, Michael W. Mahoney, David P. Woodruff
3 0.41696265 16 jmlr-2012-Algorithms for Learning Kernels Based on Centered Alignment
Author: Corinna Cortes, Mehryar Mohri, Afshin Rostamizadeh
4 0.28370395 52 jmlr-2012-Iterative Reweighted Algorithms for Matrix Rank Minimization
Author: Karthik Mohan, Maryam Fazel
Abstract: The problem of minimizing the rank of a matrix subject to affine constraints has applications in several areas including machine learning, and is known to be NP-hard. A tractable relaxation for this problem is nuclear norm (or trace norm) minimization, which is guaranteed to find the minimum rank matrix under suitable assumptions. In this paper, we propose a family of Iterative Reweighted Least Squares algorithms IRLS-p (with 0 ≤ p ≤ 1), as a computationally efficient way to improve over the performance of nuclear norm minimization. The algorithms can be viewed as (locally) minimizing certain smooth approximations to the rank function. When p = 1, we give theoretical guarantees similar to those for nuclear norm minimization, that is, recovery of low-rank matrices under certain assumptions on the operator defining the constraints. For p < 1, IRLSp shows better empirical performance in terms of recovering low-rank matrices than nuclear norm minimization. We provide an efficient implementation for IRLS-p, and also present a related family of algorithms, sIRLS-p. These algorithms exhibit competitive run times and improved recovery when compared to existing algorithms for random instances of the matrix completion problem, as well as on the MovieLens movie recommendation data set. Keywords: matrix rank minimization, matrix completion, iterative algorithms, null-space property
5 0.2316246 61 jmlr-2012-ML-Flex: A Flexible Toolbox for Performing Classification Analyses In Parallel
Author: Stephen R. Piccolo, Lewis J. Frey
Abstract: Motivated by a need to classify high-dimensional, heterogeneous data from the bioinformatics domain, we developed ML-Flex, a machine-learning toolbox that enables users to perform two-class and multi-class classification analyses in a systematic yet flexible manner. ML-Flex was written in Java but is capable of interfacing with third-party packages written in other programming languages. It can handle multiple input-data formats and supports a variety of customizations. MLFlex provides implementations of various validation strategies, which can be executed in parallel across multiple computing cores, processors, and nodes. Additionally, ML-Flex supports aggregating evidence across multiple algorithms and data sets via ensemble learning. This open-source software package is freely available from http://mlflex.sourceforge.net. Keywords: toolbox, classification, parallel, ensemble, reproducible research
6 0.23113713 38 jmlr-2012-Entropy Search for Information-Efficient Global Optimization
7 0.22100909 95 jmlr-2012-Random Search for Hyper-Parameter Optimization
8 0.2173913 99 jmlr-2012-Restricted Strong Convexity and Weighted Matrix Completion: Optimal Bounds with Noise
9 0.21094981 83 jmlr-2012-Online Learning in the Embedded Manifold of Low-rank Matrices
10 0.20512892 62 jmlr-2012-MULTIBOOST: A Multi-purpose Boosting Package
11 0.20449264 108 jmlr-2012-Sparse and Unique Nonnegative Matrix Factorization Through Data Preprocessing
12 0.18981498 17 jmlr-2012-An Active Learning Algorithm for Ranking from Pairwise Preferences with an Almost Optimal Query Complexity
13 0.17832343 47 jmlr-2012-GPLP: A Local and Parallel Computation Toolbox for Gaussian Process Regression
14 0.1617361 86 jmlr-2012-Optimistic Bayesian Sampling in Contextual-Bandit Problems
15 0.16124691 48 jmlr-2012-High-Dimensional Gaussian Graphical Model Selection: Walk Summability and Local Separation Criterion
17 0.1502202 97 jmlr-2012-Regularization Techniques for Learning with Matrices
18 0.14577213 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models
19 0.1443125 44 jmlr-2012-Feature Selection via Dependence Maximization
20 0.14105526 12 jmlr-2012-Active Clustering of Biological Sequences
topicId topicWeight
[(7, 0.014), (21, 0.038), (26, 0.054), (29, 0.034), (35, 0.028), (49, 0.033), (51, 0.346), (56, 0.07), (57, 0.011), (69, 0.015), (75, 0.045), (77, 0.012), (79, 0.018), (92, 0.096), (96, 0.09)]
simIndex simValue paperId paperTitle
same-paper 1 0.68808347 103 jmlr-2012-Sampling Methods for the Nyström Method
Author: Sanjiv Kumar, Mehryar Mohri, Ameet Talwalkar
2 0.68282712 86 jmlr-2012-Optimistic Bayesian Sampling in Contextual-Bandit Problems
Author: Benedict C. May, Nathan Korda, Anthony Lee, David S. Leslie
Abstract: In sequential decision problems in an unknown environment, the decision maker often faces a dilemma over whether to explore to discover more about the environment, or to exploit current knowledge. We address the exploration-exploitation dilemma in a general setting encompassing both standard and contextualised bandit problems. The contextual bandit problem has recently resurfaced in attempts to maximise click-through rates in web based applications, a task with significant commercial interest. In this article we consider an approach of Thompson (1933) which makes use of samples from the posterior distributions for the instantaneous value of each action. We extend the approach by introducing a new algorithm, Optimistic Bayesian Sampling (OBS), in which the probability of playing an action increases with the uncertainty in the estimate of the action value. This results in better directed exploratory behaviour. We prove that, under unrestrictive assumptions, both approaches result in optimal behaviour with respect to the average reward criterion of Yang and Zhu (2002). We implement OBS and measure its performance in simulated Bernoulli bandit and linear regression domains, and also when tested with the task of personalised news article recommendation on a Yahoo! Front Page Today Module data set. We find that OBS performs competitively when compared to recently proposed benchmark algorithms and outperforms Thompson’s method throughout. Keywords: multi-armed bandits, contextual bandits, exploration-exploitation, sequential allocation, Thompson sampling
3 0.43324226 43 jmlr-2012-Fast Approximation of Matrix Coherence and Statistical Leverage
Author: Petros Drineas, Malik Magdon-Ismail, Michael W. Mahoney, David P. Woodruff
4 0.4176566 85 jmlr-2012-Optimal Distributed Online Prediction Using Mini-Batches
Author: Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, Lin Xiao
Abstract: Online prediction methods are typically presented as serial algorithms running on a single processor. However, in the age of web-scale prediction problems, it is increasingly common to encounter situations where a single processor cannot keep up with the high rate at which inputs arrive. In this work, we present the distributed mini-batch algorithm, a method of converting many serial gradient-based online prediction algorithms into distributed algorithms. We prove a regret bound for this method that is asymptotically optimal for smooth convex loss functions and stochastic inputs. Moreover, our analysis explicitly takes into account communication latencies between nodes in the distributed environment. We show how our method can be used to solve the closely-related distributed stochastic optimization problem, achieving an asymptotically linear speed-up over multiple processors. Finally, we demonstrate the merits of our approach on a web-scale online prediction problem. Keywords: distributed computing, online learning, stochastic optimization, regret bounds, convex optimization
5 0.40994745 64 jmlr-2012-Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning
Author: Sangkyun Lee, Stephen J. Wright
Abstract: Iterative methods that calculate their steps from approximate subgradient directions have proved to be useful for stochastic learning problems over large and streaming data sets. When the objective consists of a loss function plus a nonsmooth regularization term, the solution often lies on a lowdimensional manifold of parameter space along which the regularizer is smooth. (When an ℓ1 regularizer is used to induce sparsity in the solution, for example, this manifold is defined by the set of nonzero components of the parameter vector.) This paper shows that a regularized dual averaging algorithm can identify this manifold, with high probability, before reaching the solution. This observation motivates an algorithmic strategy in which, once an iterate is suspected of lying on an optimal or near-optimal manifold, we switch to a “local phase” that searches in this manifold, thus converging rapidly to a near-optimal point. Computational results are presented to verify the identification property and to illustrate the effectiveness of this approach. Keywords: regularization, dual averaging, partly smooth manifold, manifold identification
6 0.40952229 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting
7 0.40308455 80 jmlr-2012-On Ranking and Generalization Bounds
8 0.40255284 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models
9 0.4014442 115 jmlr-2012-Trading Regret for Efficiency: Online Convex Optimization with Long Term Constraints
10 0.40115869 71 jmlr-2012-Multi-Instance Learning with Any Hypothesis Class
11 0.4007352 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods
12 0.40030581 117 jmlr-2012-Variable Selection in High-dimensional Varying-coefficient Models with Global Optimality
13 0.39885896 81 jmlr-2012-On the Convergence Rate oflp-Norm Multiple Kernel Learning
14 0.39833802 13 jmlr-2012-Active Learning via Perfect Selective Classification
15 0.39807224 73 jmlr-2012-Multi-task Regression using Minimal Penalties
16 0.39796877 111 jmlr-2012-Structured Sparsity and Generalization
17 0.39785993 82 jmlr-2012-On the Necessity of Irrelevant Variables
18 0.39426357 7 jmlr-2012-A Multi-Stage Framework for Dantzig Selector and LASSO
19 0.39370409 92 jmlr-2012-Positive Semidefinite Metric Learning Using Boosting-like Algorithms
20 0.39123595 34 jmlr-2012-Dynamic Policy Programming