jmlr jmlr2013 jmlr2013-59 knowledge-graph by maker-knowledge-mining

59 jmlr-2013-Large-scale SVD and Manifold Learning


Source: pdf

Author: Ameet Talwalkar, Sanjiv Kumar, Mehryar Mohri, Henry Rowley

Abstract: This paper examines the efficacy of sampling-based low-rank approximation techniques when applied to large dense kernel matrices. We analyze two common approximate singular value decomposition techniques, namely the Nyström and Column sampling methods. We present a theoretical comparison between these two methods, provide novel insights regarding their suitability for various tasks and present experimental results that support our theory. Our results illustrate the relative strengths of each method. We next examine the performance of these two techniques on the large-scale task of extracting low-dimensional manifold structure given millions of high-dimensional face images. We address the computational challenges of non-linear dimensionality reduction via Isomap and Laplacian Eigenmaps, using a graph containing about 18 million nodes and 65 million edges. We present extensive experiments on learning low-dimensional embeddings for two large face data sets: CMU-PIE (35 thousand faces) and a web data set (18 million faces). Our comparisons show that the Nyström approximation is superior to the Column sampling method for this task. Furthermore, approximate Isomap tends to perform better than Laplacian Eigenmaps on both clustering and classification with the labeled CMU-PIE data set. Keywords: low-rank approximation, manifold learning, large-scale matrix factorization

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 We analyze two common approximate singular value decomposition techniques, namely the Nyström and Column sampling methods. [sent-8, score-0.278]

2 We next examine the performance of these two techniques on the large-scale task of extracting low-dimensional manifold structure given millions of high-dimensional face images. [sent-11, score-0.217]

3 We present extensive experiments on learning low-dimensional embeddings for two large face data sets: CMU-PIE (35 thousand faces) and a web data set (18 million faces). [sent-13, score-0.245]

4 Keywords: low-rank approximation, manifold learning, large-scale matrix factorization. [sent-16, score-0.201]

5 We then examine the performance of these two low-rank approximation techniques on the task of extracting low-dimensional manifold structure given millions of high-dimensional face images. [sent-52, score-0.25]

6 Good sampling of the underlying manifold is essential for this assumption to hold. [sent-57, score-0.241]

7 In fact, many manifold learning techniques provide guarantees that the accuracy of the recovered manifold increases as the number of data samples increases (Tenenbaum et al. [sent-58, score-0.352]

8 However, there is a trade-off between improved sampling of the manifold and the computational cost of manifold learning algorithms, and we explore these computational challenges in this work. [sent-60, score-0.397]

9 Hence, in this work we evaluate the efficacy of the Nyström method and the Column sampling method on the task of large-scale manifold learning. [sent-64, score-0.241]

10 We also discuss our efficient implementation of a scalable manifold learning pipeline that leverages modern distributed computing architecture in order to construct neighborhood graphs, calculate shortest paths within these graphs and finally compute large-scale low-rank matrix approximations. [sent-65, score-0.306]

11 First, we show connections between two random sampling based singular value decomposition algorithms and provide the first direct comparison of their performances on a variety of approximation tasks. [sent-67, score-0.311]

12 In particular, we show that the Column sampling method is superior for approximating singular values, singular vectors and matrix projection approximations (defined in Section 3. [sent-68, score-0.644]

13 2), while the Nyström method is better for spectral reconstruction approximations (also defined in Section 3. [sent-69, score-0.205]

14 Second, we apply these two algorithms to the task of large-scale manifold learning and present the largest scale study so far on manifold learning, using 18M data points. [sent-71, score-0.312]

15 To date, the largest manifold learning study involves the analysis of music data using 267K points (Platt, 2004). [sent-72, score-0.156]

16 Assuming that rank(T) = r, we can write the compact Singular Value Decomposition (SVD) of this matrix as T = U_T Σ_T V_T^⊤, where Σ_T is diagonal and contains the singular values of T sorted in decreasing order, and U_T ∈ R^{a×r} and V_T ∈ R^{b×r} are the corresponding left and right singular vectors of T. [sent-87, score-0.465]
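For illustration, a minimal numpy sketch of this compact SVD notation (the matrix T and the rank r below are arbitrary placeholders, not data from the paper):

```python
import numpy as np

# Build a small matrix T of shape a x b with rank(T) = r (illustrative values only).
rng = np.random.default_rng(0)
a, b, r = 6, 4, 2
T = rng.standard_normal((a, r)) @ rng.standard_normal((r, b))

# Compact SVD: keep only the r non-zero singular values and their vectors.
U, s, Vt = np.linalg.svd(T, full_matrices=False)
U_T, Sigma_T, V_T = U[:, :r], np.diag(s[:r]), Vt[:r, :].T

# T = U_T Sigma_T V_T^T holds exactly for the compact factorization.
assert np.allclose(T, U_T @ Sigma_T @ V_T.T)
```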

17 Let C denote the n×l matrix formed by these columns and W the l ×l matrix consisting of the intersection of these l columns with the corresponding l rows of K. [sent-96, score-0.19]

18 More recently, it was presented in Williams and Seeger (2000) to speed up kernel algorithms and has been used in applications ranging from manifold learning to image segmentation (Platt, 2004; Fowlkes et al. [sent-102, score-0.198]

19 Assuming a uniform sampling of the columns, the Nyström method generates a rank-k approximation of K for k < n defined by: K^{nys}_k = C W_k^+ C^⊤ ≈ K, (2) where W_k is the best rank-k approximation of W with respect to the spectral or Frobenius norm and W_k^+ denotes the pseudo-inverse of W_k. [sent-106, score-0.224]
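A minimal sketch of the Nyström approximation in (2), assuming K is SPSD and the l columns are drawn uniformly without replacement (function and variable names are illustrative, not the authors' code):

```python
import numpy as np

def nystrom_rank_k(K, l, k, seed=0):
    """Rank-k Nystrom approximation C W_k^+ C^T built from l uniformly sampled columns of K."""
    n = K.shape[0]
    rng = np.random.default_rng(seed)
    idx = rng.choice(n, size=l, replace=False)          # uniform column sample
    C = K[:, idx]                                       # n x l
    W = K[np.ix_(idx, idx)]                             # l x l intersection block
    vals, vecs = np.linalg.eigh(W)                      # W is SPSD
    top = np.argsort(vals)[::-1][:k]                    # k largest eigenvalues of W
    W_k = (vecs[:, top] * vals[top]) @ vecs[:, top].T   # best rank-k approximation of W
    return C @ np.linalg.pinv(W_k) @ C.T                # C W_k^+ C^T
```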

20 If we write the SVD of C as C = U_C Σ_C V_C^⊤, then the Column sampling method approximates the top k singular values (Σ_k) and singular vectors (U_k) of K as: Σ_col = √(n/l) Σ_{C,k} and U_col = U_C = C V_{C,k} Σ_{C,k}^+. [sent-113, score-0.505]
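Under the same column sample, a sketch of the Column sampling approximations in (4); the √(n/l) scaling follows the reconstructed formula above and should be read as an assumption:

```python
import numpy as np

def column_sampling_top_k(K, idx, k):
    """Approximate top-k singular values/vectors of SPSD K from its sampled columns C."""
    n, l = K.shape[0], len(idx)
    C = K[:, idx]
    U_C, s_C, _ = np.linalg.svd(C, full_matrices=False)
    Sigma_col = np.sqrt(n / l) * s_C[:k]   # scaled singular values of C
    U_col = U_C[:, :k]                     # left singular vectors of C (orthonormal)
    return Sigma_col, U_col
```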

21 Nyström Versus Column Sampling. Given that two sampling-based techniques exist to approximate the SVD of SPSD matrices, we pose a natural question: which method should one use to approximate singular values, singular vectors and low-rank approximations? [sent-117, score-0.469]

22 Singular Values and Singular Vectors. As shown in (3) and (4), the singular values of K are approximated as the scaled singular values of W and C, respectively. [sent-121, score-0.386]

23 Formally, these scaling terms make the approximations in (3) and (4) unbiased estimators of the true singular values. [sent-123, score-0.253]

24 The Column sampling singular vectors (Ucol ) are orthonormal since they are the singular vectors of C. [sent-125, score-0.58]

25 In contrast, the Nyström singular vectors (U_nys) are approximated by extrapolating the singular vectors of W as shown in (3), and are not orthonormal. [sent-126, score-0.454]

26 This adversely affects the accuracy of singular vector approximation from the Nyström method. [sent-128, score-0.266]

27 It is possible to orthonormalize the Nyström singular vectors by using QR decomposition. [sent-129, score-0.227]
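A one-step sketch of that orthonormalization, assuming U_nys holds the (non-orthonormal) Nyström singular vectors as columns:

```python
import numpy as np

def orthonormalize_columns(U_nys):
    """Orthonormal basis for the span of the Nystrom singular vectors via thin QR."""
    Q, _ = np.linalg.qr(U_nys)
    return Q
```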

28 Low-rank Approximation. Several studies have empirically shown that the accuracy of low-rank approximations of kernel matrices is tied to the performance of kernel-based learning algorithms (Williams and Seeger, 2000; Talwalkar and Rostamizadeh, 2010). [sent-136, score-0.165]

29 The optimal K_k is given by K_k = U_k Σ_k U_k^⊤ = U_k U_k^⊤ K = K U_k U_k^⊤, where the columns of U_k are the k singular vectors of K corresponding to the top k singular values of K. [sent-141, score-0.25]

30 We refer to U_k Σ_k U_k^⊤ as Spectral Reconstruction, since it uses both the singular values and vectors of K, and U_k U_k^⊤ K as Matrix Projection, since it uses only singular vectors to compute the projection of K onto the space spanned by the vectors U_k. [sent-142, score-0.522]

31 These two low-rank approximations are equal only if Σk and Uk contain the true singular values and singular vectors of K. [sent-143, score-0.48]

32 Matrix Projection. For Column sampling using (4), the low-rank approximation via matrix projection is K^{col}_k = U_{col,k} U_{col,k}^⊤ K = U_{C,k} U_{C,k}^⊤ K = C ((C^⊤C)_k)^+ C^⊤ K, (5) where ((C^⊤C)_k)^+ = V_{C,k} (Σ^2_{C,k})^+ V_{C,k}^⊤. [sent-148, score-0.197]
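A hedged sketch of the matrix projection approximation in (5), expressed through the left singular vectors of C (algebraically equivalent to the C((C^⊤C)_k)^+ C^⊤ K form above):

```python
import numpy as np

def matrix_projection_col(K, idx, k):
    """Column sampling matrix projection: U_{C,k} U_{C,k}^T K."""
    C = K[:, idx]
    U_C, _, _ = np.linalg.svd(C, full_matrices=False)
    U_k = U_C[:, :k]
    return U_k @ (U_k.T @ K)   # project K onto the span of the top-k left singular vectors of C
```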

33 Theorem 1 The Column sampling and Nyström matrix projections are of the form U_C R U_C^⊤ K, where R ∈ R^{l×l} is SPSD. [sent-154, score-0.168]

34 Further, Column sampling gives the lowest reconstruction error (measured in ‖·‖_F) among all such approximations if k = l. [sent-155, score-0.217]

35 Observation 1 For k = l, matrix projection for Column sampling reconstructs C exactly. [sent-156, score-0.188]

36 Observation 2 For k = l, the span of the orthogonalized Nyström singular vectors equals the span of U_col, as discussed in Section 3. [sent-158, score-0.295]

37 Hence, matrix projection is identical for Column sampling and Orthonormal Nyström for k = l. [sent-160, score-0.164]

38 Table 1: Description of the data sets used in our experiments comparing sampling-based matrix approximations (Sim et al. [sent-164, score-0.225]
Data set   PIE-2.7K  PIE-7K  MNIST   ESS       ABN
Data       faces     faces   digits  proteins  abalones
n          2731      7412    4000    4728      4177
d          2304      2304    784     16        8
Kernel     linear    linear  linear  RBF       RBF

39 The Column sampling spectral reconstruction has a similar form as (7): K^{col}_k = U_{col,k} Σ_{col,k} U_{col,k}^⊤ = √(n/l) C ((C^⊤C)_k^{1/2})^+ C^⊤. [sent-172, score-0.23]
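A corresponding sketch of the Column sampling spectral reconstruction; the √(n/l) factor mirrors the reconstructed formula above and is an assumption:

```python
import numpy as np

def spectral_reconstruction_col(K, idx, k):
    """Column sampling spectral reconstruction: sqrt(n/l) * U_{C,k} Sigma_{C,k} U_{C,k}^T."""
    n, l = K.shape[0], len(idx)
    C = K[:, idx]
    U_C, s_C, _ = np.linalg.svd(C, full_matrices=False)
    return np.sqrt(n / l) * (U_C[:, :k] * s_C[:k]) @ U_C[:, :k].T
```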

40 Theorem 2 shows that the optimal spectral reconstruction is data dependent and may differ from the Nyström and Column sampling approximations. [sent-178, score-0.23]

41 Theorem 2 Column sampling and Nyström spectral reconstructions of rank k are of the form X^⊤ U_{X′,k} Z U_{X′,k}^⊤ X, where Z ∈ R^{k×k} is SPSD. [sent-180, score-0.206]

42 Further, among all approximations of this form, neither the Column sampling nor the Nyström approximation is optimal (in ‖·‖_F). [sent-181, score-0.178]

43 Empirical Comparison. To test the accuracy of singular values/vectors and low-rank approximations for different methods, we used several kernel matrices arising in different applications, as described in Table 3. [sent-186, score-0.358]

44 For singular values, we measured percentage accuracy of the approximate singular values with respect to the exact ones. [sent-190, score-0.448]

45 The empirical results show that the Column sampling method generates more accurate singular values than the Nyström method. [sent-193, score-0.278]

46 For singular vectors, the accuracy was measured by the dot product, that is, cosine of principal angles between the exact and the approximate singular vectors. [sent-195, score-0.426]
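A small sketch of that accuracy measure, assuming the exact and approximate singular vectors are stored column-wise and matched by index:

```python
import numpy as np

def cosine_accuracy(U_exact, U_approx):
    """Absolute cosine between matched exact and approximate singular vectors."""
    Ue = U_exact / np.linalg.norm(U_exact, axis=0)
    Ua = U_approx / np.linalg.norm(U_approx, axis=0)
    return np.abs(np.sum(Ue * Ua, axis=0))   # one value per singular vector
```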

47 Figure 1(b) shows the difference in mean accuracy between Nyström and Column sampling methods, once again bucketed by groups of singular vectors sorted in descending order based on their corresponding singular values. [sent-196, score-0.545]

48 The top 100 singular vectors were all better approximated by Column sampling for all data sets. [sent-197, score-0.312]

49 Furthermore, even when the Nyström singular vectors are orthogonalized, the Column sampling approximations are superior, as shown in Figure 1(c). [sent-199, score-0.372]

50 Figure 1 (panels (b), (c)): Comparison of singular values and vectors across singular vector buckets (values above zero indicate better performance of Nyström). [sent-218, score-0.227]

51 Next we compared the low-rank approximations generated by the two methods using matrix projection and spectral reconstruction. [sent-222, score-0.212]

52 As motivated by Theorem 1, Column sampling generates better reconstructions via matrix projection (Figure 2(a)). [sent-224, score-0.186]

53 These results may appear somewhat surprising given the relatively poor quality of the singular values/vectors for the Nyström method, but they are in agreement with the consequences of Theorem 3 and the fact that the kernel matrices we consider (aside from ‘DEXT’) are nearly low-rank. [sent-226, score-0.258]

54 (f) Percentage of columns (l/n) needed to achieve 75% relative accuracy for Nyström spectral reconstruction as a function of n. [sent-251, score-0.235]

55 The non-orthonormality of the Nyström method’s singular vectors (Section 3. [sent-252, score-0.227]

56 Also, the accuracy of Orthonormal Nyström spectral reconstruction is worse relative to the standard Nyström approximation, as shown in Figure 2(d). [sent-255, score-0.185]

57 This result can be attributed to the fact that orthonormalization of the singular vectors leads to the loss of some of the unique properties described in Section 3. [sent-256, score-0.227]

58 We next tested the accuracy of spectral reconstruction for the two methods for varying values of k and a fixed l. [sent-260, score-0.185]

59 For each n, we performed grid search over l to find the minimal l for which the relative accuracy of Nyström spectral reconstruction was at least 75%. [sent-264, score-0.185]

60 Out-of-sample extension is often discussed in the context of manifold learning, in which case it involves efficiently deriving a low-dimensional embedding for an arbitrary test point given embeddings from a set of training points (rather than rerunning the entire manifold learning algorithm). [sent-267, score-0.512]

61 Moreover, it is possible to use Column sampling to learn embeddings on the initial sample, and then use the Nyström method for subsequent out-of-sample extension. [sent-269, score-0.246]

62 Hence, given a large set of samples, both the Nyström method and Column sampling are viable options to enhance the scalability of manifold learning methods, as we will explore in Section 4. [sent-270, score-0.241]

63 Although we analyzed the effectiveness of these techniques for approximating singular values, singular vectors and low-rank matrix reconstruction, we have yet to discuss the effectiveness of these techniques in the context of actual machine learning tasks. [sent-273, score-0.465]

64 In this section, we will discuss how approximate embeddings can be used in the context of manifold learning, relying on the sampling based algorithms from the previous section to generate an approximate SVD. [sent-277, score-0.402]

65 We present the largest study to date for manifold learning, and provide a quantitative comparison of Isomap and Laplacian Eigenmaps for large scale face manifold construction on clustering and classification tasks. [sent-278, score-0.414]

66 Isomap. Isomap aims to extract a low-dimensional data representation that best preserves all pairwise distances between input points, as measured by their geodesic distances along the manifold (Tenenbaum et al. [sent-290, score-0.287]

67 It approximates the geodesic distance assuming that input space distance provides good approximations for nearby points, and for faraway points it estimates distance as a series of hops between neighboring points. [sent-292, score-0.2]

68 It then minimizes an objective function that penalizes nearby inputs for being mapped to faraway outputs, with ‘nearness’ measured by the weight matrix W, and the solution to the objective is the bottom singular vectors of the symmetrized, normalized form of the graph Laplacian, L. [sent-299, score-0.297]

69 Using (3), the Nyström low-dimensional embeddings are: Y_nys = Σ_{nys,k}^{1/2} U_{nys,k}^⊤ = ((Σ_W)_k^{1/2})^+ U_{W,k}^⊤ C^⊤. [sent-304, score-0.161]

70 Similarly, from (4) we can express the Column sampling low-dimensional embeddings as: Y_col = Σ_{col,k}^{1/2} U_{col,k}^⊤ = (n/l)^{1/4} ((Σ_C)_k^{1/2})^+ V_{C,k}^⊤ C^⊤. [sent-305, score-0.246]
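A sketch of both approximate embeddings as reconstructed above; the inverse-square-root scalings and the (n/l)^{1/4} factor follow those reconstructions and should be treated as assumptions:

```python
import numpy as np

def nystrom_embedding(C, W, k):
    """Y_nys = ((Sigma_W)_k^{1/2})^+ U_{W,k}^T C^T, a k x n embedding (points as columns)."""
    vals, vecs = np.linalg.eigh(W)
    top = np.argsort(vals)[::-1][:k]
    v = np.maximum(vals[top], 0.0)              # guard against tiny negative eigenvalues
    inv_sqrt = np.zeros_like(v)
    inv_sqrt[v > 1e-12] = 1.0 / np.sqrt(v[v > 1e-12])
    return (inv_sqrt[:, None] * vecs[:, top].T) @ C.T

def column_sampling_embedding(C, n, k):
    """Y_col = (n/l)^{1/4} ((Sigma_C)_k^{1/2})^+ V_{C,k}^T C^T, a k x n embedding."""
    l = C.shape[1]                              # assumes k <= l
    _, s, Vt = np.linalg.svd(C, full_matrices=False)
    inv_sqrt = np.zeros(k)
    inv_sqrt[s[:k] > 1e-12] = 1.0 / np.sqrt(s[:k][s[:k] > 1e-12])
    return (n / l) ** 0.25 * (inv_sqrt[:, None] * Vt[:k]) @ C.T
```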

71 Further, notice that the optimal low-dimensional embeddings are in fact the square root of the optimal rank-k approximation to the associated SPSD matrix, that is, Y^⊤Y = K_k, for Isomap. [sent-307, score-0.22]

72 As such, there is a connection between the task of approximating low-dimensional embeddings and the task of generating low-rank approximate spectral reconstructions, as discussed in Section 3. [sent-308, score-0.234]

73 Figure 3: Comparison of embeddings as a function of the percentage of columns sampled (l/n); values above zero indicate better performance of Nyström. [sent-319, score-0.161]

74 3 to measure the quality of the low-dimensional embeddings generated by the two techniques and see if the same trend exists. [sent-323, score-0.161]

75 We measured the quality of the low-dimensional embeddings by calculating the extent to which they preserve distances, which is the appropriate criterion in the context of manifold learning. [sent-324, score-0.317]

76 We then computed the approximate low-dimensional embeddings using the Nyström and Column sampling methods, and then used these embeddings to compute the associated approximate squared distance matrix, D. [sent-326, score-0.432]
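Given a k × n embedding Y (points as columns), the associated approximate squared distance matrix can be formed as in this sketch:

```python
import numpy as np

def squared_distances(Y):
    """Pairwise squared Euclidean distances between the columns of Y."""
    sq = np.sum(Y ** 2, axis=0)
    D = sq[:, None] + sq[None, :] - 2.0 * (Y.T @ Y)
    return np.maximum(D, 0.0)   # clip tiny negatives caused by round-off
```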

77 Surprisingly, we do not see the same trend in our empirical results for embeddings as we previously observed for spectral reconstruction, as the two techniques exhibit roughly similar behavior across data sets. [sent-331, score-0.234]

78 As a result, we decided to use both the Nyström and Column sampling methods for our subsequent manifold learning study. [sent-332, score-0.241]

79 In addition to this visualization, comparison of exact neighbors and spill tree approximations for smaller subsets suggested good performance of spill trees. [sent-363, score-0.178]

80 These edges stem from inadequately sampled regions of the manifold and false positives introduced by the face detector. [sent-371, score-0.217]

81 Approximating Geodesics. To construct the similarity matrix K in Isomap, one approximates geodesic distance by shortest-path lengths between every pair of nodes in the neighborhood graph. [sent-382, score-0.184]
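At small scale, this shortest-path step can be sketched with a standard sparse-graph routine (the paper's pipeline distributes this computation; the snippet below does not attempt that, and the graph construction here is a placeholder):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import shortest_path

def geodesic_distances(rows, cols, weights, n_nodes):
    """All-pairs shortest-path lengths on a sparse, undirected neighborhood graph."""
    G = csr_matrix((weights, (rows, cols)), shape=(n_nodes, n_nodes))
    return shortest_path(G, method="D", directed=False)   # Dijkstra on the kNN graph
```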

82 (b) Number of components in the Webfaces-18M neighbor graph and the percentage of images within the largest connected component (‘% Largest’) for varying numbers of neighbors (t) with and without an upper limit on neighbor distances. [sent-401, score-0.194]

83 Generating Low-dimensional Embeddings. Before generating low-dimensional embeddings using Isomap, one needs to convert distances into similarities using a process called centering (Tenenbaum et al. [sent-405, score-0.221]

84 For the Nyström approximation, we computed W by double centering D, the l × l matrix of squared geodesic distances between all landmark nodes, as W = −(1/2) H D H, where H = I_l − (1/l) 11^⊤ is the centering matrix, I_l is the l × l identity matrix and 1 is a column vector of all ones. [sent-407, score-0.346]
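A direct transcription of this double-centering step (sketch; D is the l × l matrix of squared geodesic distances between landmark nodes):

```python
import numpy as np

def double_center(D_sq):
    """W = -1/2 * H D H with H = I_l - (1/l) 11^T (classical MDS centering)."""
    l = D_sq.shape[0]
    H = np.eye(l) - np.ones((l, l)) / l
    return -0.5 * H @ D_sq @ H
```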

85 Finally, we computed low-dimensional embeddings by multiplying the scaled singular vectors from the approximate decomposition with C. [sent-411, score-0.388]

86 The PIE set contains faces in 13 poses, and such a fine sampling of the pose space makes clustering and classification tasks very challenging. [sent-465, score-0.235]

87 The 2D embeddings of PIE-35K (Figure 7 in Appendix E) reveal that Laplacian Eigenmaps projects data points into a small compact region. [sent-483, score-0.161]

88 When used for clustering, these compact embeddings lead to a few large clusters and several tiny clusters, thus explaining the high accuracy and low purity of the clusters. [sent-484, score-0.242]

89 Finally, the improved clustering results of Isomap over PCA for both data sets verify that the manifold of faces is not linear in the input space. [sent-487, score-0.257]

90 Moreover, we compared the performance of Laplacian Eigenmaps and Isomap embeddings on pose classification. [sent-531, score-0.21]

91 Hence, using KNN to compare low-level embeddings indirectly measures how well nearest neighbor information is preserved. [sent-535, score-0.224]
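A hedged sketch of such a KNN comparison with scikit-learn; the embedding matrix (points as rows), labels, and split are placeholders, and the paper's exact evaluation protocol may differ:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def knn_accuracy(embedding, labels, n_neighbors=1, seed=0):
    """Classification accuracy of KNN run directly on the low-dimensional embedding."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        embedding, labels, test_size=0.3, random_state=seed, stratify=labels)
    clf = KNeighborsClassifier(n_neighbors=n_neighbors).fit(X_tr, y_tr)
    return clf.score(X_te, y_te)
```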

92 Thus, we used Nyström Isomap to generate embeddings for Webfaces-18M. [sent-542, score-0.161]

93 After learning a face manifold from Webfaces-18M, we analyzed the results with various visualizations. [sent-543, score-0.217]

94 The top row of Figure 5 shows the 2D embeddings from Nyström Isomap. [sent-544, score-0.161]

95 It is interesting to see that embeddings tend to cluster the faces by pose. [sent-546, score-0.221]

96 In Figure 5, the top right figure shows the shortest paths on the manifold between different public figures. [sent-550, score-0.212]

97 Conclusion. We have studied sampling-based low-rank approximation algorithms, presenting an analysis of two techniques for approximating the SVD of large dense SPSD matrices and providing a theoretical and empirical comparison. [sent-556, score-0.172]

98 Although the Column sampling method generates more accurate singular values, singular vectors and low-rank matrix projections, the Nyström method constructs better low-rank approximations, which are of great practical interest as they do not use the full matrix. [sent-557, score-0.55]

99 Although several of the images above are not faces, some are actual faces, suggesting that certain areas of the face manifold are not adequately sampled by Webfaces-18M. [sent-579, score-0.254]

100 Thus the columns of X′ span the columns of X and U_{X′,r} is an orthonormal basis for X, that is, I_N − U_{X′,r} U_{X′,r}^⊤ ∈ Null(X). [sent-597, score-0.163]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('nystr', 0.667), ('isomap', 0.227), ('singular', 0.193), ('talwalkar', 0.187), ('svd', 0.178), ('pie', 0.176), ('embeddings', 0.161), ('manifold', 0.156), ('rowley', 0.133), ('uc', 0.129), ('ohri', 0.114), ('umar', 0.114), ('eigenmaps', 0.114), ('abn', 0.105), ('column', 0.104), ('ux', 0.095), ('spsd', 0.089), ('anifold', 0.089), ('sampling', 0.085), ('ess', 0.081), ('knys', 0.076), ('laplacian', 0.076), ('col', 0.073), ('spectral', 0.073), ('reconstruction', 0.072), ('arge', 0.065), ('nys', 0.065), ('geodesic', 0.065), ('mnist', 0.062), ('face', 0.061), ('approximations', 0.06), ('faces', 0.06), ('kcol', 0.057), ('ruc', 0.057), ('yzy', 0.057), ('kumar', 0.051), ('columns', 0.05), ('pose', 0.049), ('neighborhood', 0.049), ('tenenbaum', 0.048), ('unys', 0.048), ('matrix', 0.045), ('neighbors', 0.042), ('kernel', 0.042), ('clustering', 0.041), ('orthonormal', 0.041), ('purity', 0.041), ('accuracy', 0.04), ('embedding', 0.039), ('dext', 0.038), ('rcol', 0.038), ('spill', 0.038), ('projections', 0.038), ('images', 0.037), ('earning', 0.036), ('kk', 0.036), ('mohri', 0.035), ('projection', 0.034), ('vectors', 0.034), ('neighbor', 0.034), ('distances', 0.033), ('approximation', 0.033), ('drineas', 0.033), ('dense', 0.031), ('nearest', 0.029), ('nlk', 0.029), ('frieze', 0.029), ('paths', 0.029), ('seeger', 0.029), ('google', 0.029), ('ameet', 0.029), ('rnys', 0.029), ('ucol', 0.029), ('zcol', 0.029), ('znys', 0.029), ('shortest', 0.027), ('centering', 0.027), ('tr', 0.026), ('rank', 0.026), ('visualization', 0.026), ('uw', 0.025), ('distance', 0.025), ('uk', 0.025), ('graph', 0.025), ('orthogonalized', 0.024), ('reconstructs', 0.024), ('fowlkes', 0.024), ('nystrom', 0.024), ('parallelized', 0.024), ('knn', 0.024), ('silva', 0.024), ('ks', 0.023), ('million', 0.023), ('matrices', 0.023), ('williams', 0.022), ('took', 0.022), ('percentage', 0.022), ('span', 0.022), ('buckets', 0.022), ('reconstructions', 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000008 59 jmlr-2013-Large-scale SVD and Manifold Learning

Author: Ameet Talwalkar, Sanjiv Kumar, Mehryar Mohri, Henry Rowley

Abstract: This paper examines the efficacy of sampling-based low-rank approximation techniques when applied to large dense kernel matrices. We analyze two common approximate singular value decomposition techniques, namely the Nyström and Column sampling methods. We present a theoretical comparison between these two methods, provide novel insights regarding their suitability for various tasks and present experimental results that support our theory. Our results illustrate the relative strengths of each method. We next examine the performance of these two techniques on the large-scale task of extracting low-dimensional manifold structure given millions of high-dimensional face images. We address the computational challenges of non-linear dimensionality reduction via Isomap and Laplacian Eigenmaps, using a graph containing about 18 million nodes and 65 million edges. We present extensive experiments on learning low-dimensional embeddings for two large face data sets: CMU-PIE (35 thousand faces) and a web data set (18 million faces). Our comparisons show that the Nyström approximation is superior to the Column sampling method for this task. Furthermore, approximate Isomap tends to perform better than Laplacian Eigenmaps on both clustering and classification with the labeled CMU-PIE data set. Keywords: low-rank approximation, manifold learning, large-scale matrix factorization

2 0.40476725 53 jmlr-2013-Improving CUR Matrix Decomposition and the Nystrom Approximation via Adaptive Sampling

Author: Shusen Wang, Zhihua Zhang

Abstract: The CUR matrix decomposition and the Nyström approximation are two important low-rank matrix approximation techniques. The Nyström method approximates a symmetric positive semidefinite matrix in terms of a small number of its columns, while CUR approximates an arbitrary data matrix by a small number of its columns and rows. Thus, CUR decomposition can be regarded as an extension of the Nyström approximation. In this paper we establish a more general error bound for the adaptive column/row sampling algorithm, based on which we propose more accurate CUR and Nyström algorithms with expected relative-error bounds. The proposed CUR and Nyström algorithms also have low time complexity and can avoid maintaining the whole data matrix in RAM. In addition, we give theoretical analysis for the lower error bounds of the standard Nyström method and the ensemble Nyström method. The main theoretical results established in this paper are novel, and our analysis makes no special assumption on the data matrices. Keywords: large-scale matrix computation, CUR matrix decomposition, the Nyström method, randomized algorithms, adaptive sampling

3 0.14175171 69 jmlr-2013-Manifold Regularization and Semi-supervised Learning: Some Theoretical Analyses

Author: Partha Niyogi

Abstract: Manifold regularization (Belkin et al., 2006) is a geometrically motivated framework for machine learning within which several semi-supervised algorithms have been constructed. Here we try to provide some theoretical understanding of this approach. Our main result is to expose the natural structure of a class of problems on which manifold regularization methods are helpful. We show that for such problems, no supervised learner can learn effectively. On the other hand, a manifold based learner (that knows the manifold or “learns” it from unlabeled examples) can learn with relatively few labeled examples. Our analysis follows a minimax style with an emphasis on finite sample results (in terms of n: the number of labeled examples). These results allow us to properly interpret manifold regularization and related spectral and geometric algorithms in terms of their potential use in semi-supervised learning. Keywords: semi-supervised learning, manifold regularization, graph Laplacian, minimax rates

4 0.12390742 86 jmlr-2013-Parallel Vector Field Embedding

Author: Binbin Lin, Xiaofei He, Chiyuan Zhang, Ming Ji

Abstract: We propose a novel local isometry based dimensionality reduction method from the perspective of vector fields, which is called parallel vector field embedding (PFE). We first give a discussion on local isometry and global isometry to show the intrinsic connection between parallel vector fields and isometry. The problem of finding an isometry turns out to be equivalent to finding orthonormal parallel vector fields on the data manifold. Therefore, we first find orthonormal parallel vector fields by solving a variational problem on the manifold. Then each embedding function can be obtained by requiring its gradient field to be as close to the corresponding parallel vector field as possible. Theoretical results show that our method can precisely recover the manifold if it is isometric to a connected open subset of Euclidean space. Both synthetic and real data examples demonstrate the effectiveness of our method even if there is heavy noise and high curvature. Keywords: manifold learning, isometry, vector field, covariant derivative, out-of-sample extension

5 0.10458369 34 jmlr-2013-Distance Preserving Embeddings for General n-Dimensional Manifolds

Author: Nakul Verma

Abstract: Low dimensional embeddings of manifold data have gained popularity in the last decade. However, a systematic finite sample analysis of manifold embedding algorithms largely eludes researchers. Here we present two algorithms that embed a general n-dimensional manifold into Rd (where d only depends on some key manifold properties such as its intrinsic dimension, volume and curvature) that guarantee to approximately preserve all interpoint geodesic distances. Keywords: manifold learning, isometric embeddings, non-linear dimensionality reduction, Nash’s embedding theorem

6 0.074026249 112 jmlr-2013-Tapkee: An Efficient Dimension Reduction Library

7 0.07012435 96 jmlr-2013-Regularization-Free Principal Curve Estimation

8 0.058813237 109 jmlr-2013-Stress Functions for Nonlinear Dimension Reduction, Proximity Analysis, and Graph Drawing

9 0.050481912 70 jmlr-2013-Maximum Volume Clustering: A New Discriminative Clustering Approach

10 0.044639558 77 jmlr-2013-On the Convergence of Maximum Variance Unfolding

11 0.041021939 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion

12 0.040213719 101 jmlr-2013-Sparse Activity and Sparse Connectivity in Supervised Learning

13 0.03760637 99 jmlr-2013-Semi-Supervised Learning Using Greedy Max-Cut

14 0.035194211 28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning

15 0.033301167 100 jmlr-2013-Similarity-based Clustering by Left-Stochastic Matrix Factorization

16 0.031130856 50 jmlr-2013-Greedy Feature Selection for Subspace Clustering

17 0.030623162 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation

18 0.03051788 79 jmlr-2013-On the Mutual Nearest Neighbors Estimate in Regression

19 0.02944418 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso

20 0.029438855 47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.196), (1, 0.158), (2, 0.023), (3, -0.421), (4, -0.082), (5, 0.091), (6, -0.1), (7, -0.19), (8, 0.189), (9, 0.34), (10, 0.402), (11, 0.01), (12, 0.181), (13, -0.04), (14, -0.065), (15, 0.091), (16, 0.018), (17, 0.091), (18, 0.108), (19, -0.017), (20, -0.082), (21, -0.013), (22, 0.021), (23, -0.003), (24, 0.007), (25, 0.029), (26, -0.008), (27, -0.012), (28, -0.041), (29, -0.031), (30, -0.026), (31, -0.052), (32, -0.027), (33, 0.001), (34, 0.009), (35, 0.039), (36, -0.002), (37, 0.024), (38, -0.002), (39, 0.011), (40, 0.008), (41, 0.001), (42, 0.007), (43, -0.001), (44, 0.011), (45, 0.003), (46, 0.028), (47, -0.032), (48, 0.021), (49, 0.014)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95676404 59 jmlr-2013-Large-scale SVD and Manifold Learning

Author: Ameet Talwalkar, Sanjiv Kumar, Mehryar Mohri, Henry Rowley

Abstract: This paper examines the efficacy of sampling-based low-rank approximation techniques when applied to large dense kernel matrices. We analyze two common approximate singular value decomposition techniques, namely the Nyström and Column sampling methods. We present a theoretical comparison between these two methods, provide novel insights regarding their suitability for various tasks and present experimental results that support our theory. Our results illustrate the relative strengths of each method. We next examine the performance of these two techniques on the large-scale task of extracting low-dimensional manifold structure given millions of high-dimensional face images. We address the computational challenges of non-linear dimensionality reduction via Isomap and Laplacian Eigenmaps, using a graph containing about 18 million nodes and 65 million edges. We present extensive experiments on learning low-dimensional embeddings for two large face data sets: CMU-PIE (35 thousand faces) and a web data set (18 million faces). Our comparisons show that the Nyström approximation is superior to the Column sampling method for this task. Furthermore, approximate Isomap tends to perform better than Laplacian Eigenmaps on both clustering and classification with the labeled CMU-PIE data set. Keywords: low-rank approximation, manifold learning, large-scale matrix factorization

2 0.90040487 53 jmlr-2013-Improving CUR Matrix Decomposition and the Nystrom Approximation via Adaptive Sampling

Author: Shusen Wang, Zhihua Zhang

Abstract: The CUR matrix decomposition and the Nyström approximation are two important low-rank matrix approximation techniques. The Nyström method approximates a symmetric positive semidefinite matrix in terms of a small number of its columns, while CUR approximates an arbitrary data matrix by a small number of its columns and rows. Thus, CUR decomposition can be regarded as an extension of the Nyström approximation. In this paper we establish a more general error bound for the adaptive column/row sampling algorithm, based on which we propose more accurate CUR and Nyström algorithms with expected relative-error bounds. The proposed CUR and Nyström algorithms also have low time complexity and can avoid maintaining the whole data matrix in RAM. In addition, we give theoretical analysis for the lower error bounds of the standard Nyström method and the ensemble Nyström method. The main theoretical results established in this paper are novel, and our analysis makes no special assumption on the data matrices. Keywords: large-scale matrix computation, CUR matrix decomposition, the Nyström method, randomized algorithms, adaptive sampling

3 0.30285233 69 jmlr-2013-Manifold Regularization and Semi-supervised Learning: Some Theoretical Analyses

Author: Partha Niyogi

Abstract: Manifold regularization (Belkin et al., 2006) is a geometrically motivated framework for machine learning within which several semi-supervised algorithms have been constructed. Here we try to provide some theoretical understanding of this approach. Our main result is to expose the natural structure of a class of problems on which manifold regularization methods are helpful. We show that for such problems, no supervised learner can learn effectively. On the other hand, a manifold based learner (that knows the manifold or “learns” it from unlabeled examples) can learn with relatively few labeled examples. Our analysis follows a minimax style with an emphasis on finite sample results (in terms of n: the number of labeled examples). These results allow us to properly interpret manifold regularization and related spectral and geometric algorithms in terms of their potential use in semi-supervised learning. Keywords: semi-supervised learning, manifold regularization, graph Laplacian, minimax rates

4 0.28730083 86 jmlr-2013-Parallel Vector Field Embedding

Author: Binbin Lin, Xiaofei He, Chiyuan Zhang, Ming Ji

Abstract: We propose a novel local isometry based dimensionality reduction method from the perspective of vector fields, which is called parallel vector field embedding (PFE). We first give a discussion on local isometry and global isometry to show the intrinsic connection between parallel vector fields and isometry. The problem of finding an isometry turns out to be equivalent to finding orthonormal parallel vector fields on the data manifold. Therefore, we first find orthonormal parallel vector fields by solving a variational problem on the manifold. Then each embedding function can be obtained by requiring its gradient field to be as close to the corresponding parallel vector field as possible. Theoretical results show that our method can precisely recover the manifold if it is isometric to a connected open subset of Euclidean space. Both synthetic and real data examples demonstrate the effectiveness of our method even if there is heavy noise and high curvature. Keywords: manifold learning, isometry, vector field, covariant derivative, out-of-sample extension

5 0.28571022 34 jmlr-2013-Distance Preserving Embeddings for General n-Dimensional Manifolds

Author: Nakul Verma

Abstract: Low dimensional embeddings of manifold data have gained popularity in the last decade. However, a systematic finite sample analysis of manifold embedding algorithms largely eludes researchers. Here we present two algorithms that embed a general n-dimensional manifold into Rd (where d only depends on some key manifold properties such as its intrinsic dimension, volume and curvature) that guarantee to approximately preserve all interpoint geodesic distances. Keywords: manifold learning, isometric embeddings, non-linear dimensionality reduction, Nash’s embedding theorem

6 0.23926362 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion

7 0.22478473 112 jmlr-2013-Tapkee: An Efficient Dimension Reduction Library

8 0.19845694 100 jmlr-2013-Similarity-based Clustering by Left-Stochastic Matrix Factorization

9 0.18926448 109 jmlr-2013-Stress Functions for Nonlinear Dimension Reduction, Proximity Analysis, and Graph Drawing

10 0.1760865 50 jmlr-2013-Greedy Feature Selection for Subspace Clustering

11 0.15562186 70 jmlr-2013-Maximum Volume Clustering: A New Discriminative Clustering Approach

12 0.15207723 99 jmlr-2013-Semi-Supervised Learning Using Greedy Max-Cut

13 0.15185449 28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning

14 0.148663 96 jmlr-2013-Regularization-Free Principal Curve Estimation

15 0.14213954 101 jmlr-2013-Sparse Activity and Sparse Connectivity in Supervised Learning

16 0.13600612 5 jmlr-2013-A Near-Optimal Algorithm for Differentially-Private Principal Components

17 0.13421361 106 jmlr-2013-Stationary-Sparse Causality Network Learning

18 0.1201648 77 jmlr-2013-On the Convergence of Maximum Variance Unfolding

19 0.11767348 93 jmlr-2013-Random Walk Kernels and Learning Curves for Gaussian Process Regression on Random Graphs

20 0.11517224 76 jmlr-2013-Nonparametric Sparsity and Regularization


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.026), (5, 0.1), (6, 0.03), (10, 0.101), (20, 0.024), (23, 0.025), (34, 0.011), (41, 0.011), (44, 0.011), (67, 0.352), (68, 0.021), (70, 0.012), (75, 0.07), (85, 0.02), (87, 0.071), (89, 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.73541719 54 jmlr-2013-JKernelMachines: A Simple Framework for Kernel Machines

Author: David Picard, Nicolas Thome, Matthieu Cord

Abstract: JKernelMachines is a Java library for learning with kernels. It is primarily designed to deal with custom kernels that are not easily found in standard libraries, such as kernels on structured data. These types of kernels are often used in computer vision or bioinformatics applications. We provide several kernels leading to state of the art classification performances in computer vision, as well as various kernels on sets. The main focus of the library is to be easily extended with new kernels. Standard SVM optimization algorithms are available, but also more sophisticated learning-based kernel combination methods such as Multiple Kernel Learning (MKL), and a recently published algorithm to learn powered products of similarities (Product Kernel Learning). Keywords: classification, support vector machines, kernel, computer vision

same-paper 2 0.68459189 59 jmlr-2013-Large-scale SVD and Manifold Learning

Author: Ameet Talwalkar, Sanjiv Kumar, Mehryar Mohri, Henry Rowley

Abstract: This paper examines the efficacy of sampling-based low-rank approximation techniques when applied to large dense kernel matrices. We analyze two common approximate singular value decomposition techniques, namely the Nyström and Column sampling methods. We present a theoretical comparison between these two methods, provide novel insights regarding their suitability for various tasks and present experimental results that support our theory. Our results illustrate the relative strengths of each method. We next examine the performance of these two techniques on the large-scale task of extracting low-dimensional manifold structure given millions of high-dimensional face images. We address the computational challenges of non-linear dimensionality reduction via Isomap and Laplacian Eigenmaps, using a graph containing about 18 million nodes and 65 million edges. We present extensive experiments on learning low-dimensional embeddings for two large face data sets: CMU-PIE (35 thousand faces) and a web data set (18 million faces). Our comparisons show that the Nyström approximation is superior to the Column sampling method for this task. Furthermore, approximate Isomap tends to perform better than Laplacian Eigenmaps on both clustering and classification with the labeled CMU-PIE data set. Keywords: low-rank approximation, manifold learning, large-scale matrix factorization

3 0.41518825 116 jmlr-2013-Truncated Power Method for Sparse Eigenvalue Problems

Author: Xiao-Tong Yuan, Tong Zhang

Abstract: This paper considers the sparse eigenvalue problem, which is to extract dominant (largest) sparse eigenvectors with at most k non-zero components. We propose a simple yet effective solution called truncated power method that can approximately solve the underlying nonconvex optimization problem. A strong sparse recovery result is proved for the truncated power method, and this theory is our key motivation for developing the new algorithm. The proposed method is tested on applications such as sparse principal component analysis and the densest k-subgraph problem. Extensive experiments on several synthetic and real-world data sets demonstrate the competitive empirical performance of our method. Keywords: sparse eigenvalue, power method, sparse principal component analysis, densest k-subgraph

4 0.4114475 53 jmlr-2013-Improving CUR Matrix Decomposition and the Nystrom Approximation via Adaptive Sampling

Author: Shusen Wang, Zhihua Zhang

Abstract: The CUR matrix decomposition and the Nyström approximation are two important low-rank matrix approximation techniques. The Nyström method approximates a symmetric positive semidefinite matrix in terms of a small number of its columns, while CUR approximates an arbitrary data matrix by a small number of its columns and rows. Thus, CUR decomposition can be regarded as an extension of the Nyström approximation. In this paper we establish a more general error bound for the adaptive column/row sampling algorithm, based on which we propose more accurate CUR and Nyström algorithms with expected relative-error bounds. The proposed CUR and Nyström algorithms also have low time complexity and can avoid maintaining the whole data matrix in RAM. In addition, we give theoretical analysis for the lower error bounds of the standard Nyström method and the ensemble Nyström method. The main theoretical results established in this paper are novel, and our analysis makes no special assumption on the data matrices. Keywords: large-scale matrix computation, CUR matrix decomposition, the Nyström method, randomized algorithms, adaptive sampling

5 0.41127384 46 jmlr-2013-GURLS: A Least Squares Library for Supervised Learning

Author: Andrea Tacchetti, Pavan K. Mallapragada, Matteo Santoro, Lorenzo Rosasco

Abstract: We present GURLS, a least squares, modular, easy-to-extend software library for efficient supervised learning. GURLS is targeted to machine learning practitioners, as well as non-specialists. It offers a number of state-of-the-art training strategies for medium and large-scale learning, and routines for efficient model selection. The library is particularly well suited for multi-output problems (multi-category/multi-label). GURLS is currently available in two independent implementations: Matlab and C++. It takes advantage of the favorable properties of regularized least squares algorithm to exploit advanced tools in linear algebra. Routines to handle computations with very large matrices by means of memory-mapped storage and distributed task execution are available. The package is distributed under the BSD license and is available for download at https://github.com/LCSL/GURLS. Keywords: regularized least squares, big data, linear algebra

6 0.41053602 52 jmlr-2013-How to Solve Classification and Regression Problems on High-Dimensional Data with a Supervised Extension of Slow Feature Analysis

7 0.40960696 86 jmlr-2013-Parallel Vector Field Embedding

8 0.40903458 77 jmlr-2013-On the Convergence of Maximum Variance Unfolding

9 0.40821901 3 jmlr-2013-A Framework for Evaluating Approximation Methods for Gaussian Process Regression

10 0.40458691 28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning

11 0.40158141 5 jmlr-2013-A Near-Optimal Algorithm for Differentially-Private Principal Components

12 0.40131626 50 jmlr-2013-Greedy Feature Selection for Subspace Clustering

13 0.39691788 47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference

14 0.39601114 75 jmlr-2013-Nested Expectation Propagation for Gaussian Process Classification with a Multinomial Probit Likelihood

15 0.39447191 101 jmlr-2013-Sparse Activity and Sparse Connectivity in Supervised Learning

16 0.39422864 74 jmlr-2013-Multivariate Convex Regression with Adaptive Partitioning

17 0.39349684 23 jmlr-2013-Cluster Analysis: Unsupervised Learning via Supervised Learning with a Non-convex Penalty

18 0.39330339 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation

19 0.3927229 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization

20 0.39203718 22 jmlr-2013-Classifying With Confidence From Incomplete Information