nips nips2013 nips2013-224 knowledge-graph by maker-knowledge-mining

224 nips-2013-On the Sample Complexity of Subspace Learning


Source: pdf

Author: Alessandro Rudi, Guillermo D. Canas, Lorenzo Rosasco

Abstract: A large number of algorithms in machine learning, from principal component analysis (PCA), and its non-linear (kernel) extensions, to more recent spectral embedding and support estimation methods, rely on estimating a linear subspace from samples. In this paper we introduce a general formulation of this problem and derive novel learning error estimates. Our results rely on natural assumptions on the spectral properties of the covariance operator associated to the data distribution, and hold for a wide class of metrics between subspaces. As special cases, we discuss sharp error estimates for the reconstruction properties of PCA and spectral support estimation. Key to our analysis is an operator theoretic approach that has broad applicability to spectral learning methods. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 A large number of algorithms in machine learning, from principal component analysis (PCA), and its non-linear (kernel) extensions, to more recent spectral embedding and support estimation methods, rely on estimating a linear subspace from samples. [sent-6, score-0.596]

2 Our results rely on natural assumptions on the spectral properties of the covariance operator associated to the data distribution, and hold for a wide class of metrics between subspaces. [sent-8, score-0.221]

3 As special cases, we discuss sharp error estimates for the reconstruction properties of PCA and spectral support estimation. [sent-9, score-0.286]

4 Key to our analysis is an operator theoretic approach that has broad applicability to spectral learning methods. [sent-10, score-0.14]

5 1 Introduction The subspace learning problem is that of finding the smallest linear space supporting data drawn from an unknown distribution. [sent-11, score-0.301]

6 It is a classical problem in machine learning and statistics, with several established algorithms addressing it, most notably PCA and kernel PCA [12, 18]. [sent-12, score-0.137]

7 It is also at the core of a number of spectral methods for data analysis, including spectral embedding methods, from classical multidimensional scaling (MDS) [7, 26], to more recent manifold embedding methods [22, 16, 2], and spectral methods for support estimation [9]. [sent-13, score-0.619]

8 Therefore knowledge of the speed of convergence of the subspace learning problem, with respect to the sample size, and the algorithms’ parameters, is of considerable practical importance. [sent-14, score-0.301]

9 Given a measure ρ from which independent samples are drawn, we aim to estimate the smallest subspace Sρ that contains the support of ρ. [sent-15, score-0.404]

10 In some cases, the support may lie on, or close to, a subspace of lower dimension than the embedding space, and it may be of interest to learn such a subspace Sρ in order to replace the original samples by their local encoding with respect to Sρ . [sent-16, score-0.793]

11 They embed the data in a suitable Hilbert space H, and compute a linear subspace that best approximates the embedded data. [sent-19, score-0.329]

12 The local coordinates in this subspace then become the new representation space. [sent-20, score-0.301]

13 Similar spectral techniques may also be used to estimate the support of the data itself, as discussed in [9]. [sent-21, score-0.179]

14 While the subspace estimates are derived from the available samples only, or their embedding, the learning problem is concerned with the quality of the computed subspace as an estimate of Sρ (the true span of the support of ρ). [sent-22, score-0.817]

15 We begin by defining the subspace learning problem (Sec. [sent-24, score-0.325]

16 Our main technical contribution is a general learning rate for the subspace learning problem, which is then particularized to common instances of this problem (Sec. [sent-27, score-0.355]

17 Our proofs use novel tools from linear operator theory to obtain learning rates for the subspace learning problem which are significantly sharper than existing ones, under typical assumptions, but also cover a wider range of performance metrics. [sent-29, score-0.463]

18 samples Xn = {x_i}_{1≤i≤n}, the smallest linear subspace Sρ := span(M) that contains M. [sent-36, score-0.301]

19 The quality of an estimate Ŝ of Sρ, for a given metric (or error criterion) d, is characterized in terms of probabilistic bounds of the form P[ d(Sρ, Ŝ) ≤ ε(δ, n, ρ) ] ≥ 1 − δ, 0 < δ ≤ 1. [sent-37, score-0.131]

20 2 In the remainder the metric projection operator onto a subspace S is denoted by P_S, where P_S = P_S* = P_S^2 (every such P_S is idempotent and self-adjoint). [sent-40, score-0.395]

21 1 Subspace estimates Letting C := E_{x∼ρ}[x ⊗ x] be the (uncentered) covariance operator associated to ρ, it is easy to show that Sρ is the closure of Ran C. [sent-44, score-0.176]

22 Similarly, given the empirical covariance Cn := (1/n) Σ_{i=1}^n x_i ⊗ x_i, we define the empirical subspace estimate Ŝn := span(Xn) = Ran Cn (note that the closure is not needed in this case because Ŝn is finite-dimensional). [sent-45, score-0.507]

23 We also define the k-truncated (kernel) PCA subspace estimate Ŝ^k_n := Ran C^k_n, where C^k_n is obtained from Cn by keeping only its k top eigenvalues. [sent-46, score-0.348]
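A minimal numpy sketch of these two estimators (the empirical covariance Cn and the projection onto the k-truncated PCA subspace); the data, dimension, and function names are illustrative and not taken from the paper.

```python
import numpy as np

def truncated_pca_subspace(X, k):
    """Empirical covariance C_n = (1/n) sum_i x_i (x) x_i and the orthogonal
    projection onto the k-truncated PCA subspace S^k_n (top-k eigenvectors)."""
    n, d = X.shape
    Cn = (X.T @ X) / n                      # uncentered empirical covariance
    evals, evecs = np.linalg.eigh(Cn)       # eigenvalues in ascending order
    order = np.argsort(evals)[::-1]
    Uk = evecs[:, order[:k]]                # top-k eigenvectors of C_n
    Pk = Uk @ Uk.T                          # projection onto S^k_n
    return Pk, evals[order]

# illustrative usage: n = 1000 samples in R^50 with a decaying spectrum
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 50)) / np.arange(1, 51)
Pk, spectrum = truncated_pca_subspace(X, k=10)
```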

24 Note that, since the PCA estimate Ŝ^k_n is spanned by the top k eigenvectors of Cn, then clearly Ŝ^k_n ⊆ Ŝ^k′_n for k < k′, and therefore {Ŝ^k_n}_{k=1}^n is a nested family of subspaces (all of which are contained in Sρ). [sent-47, score-0.186]

25 Since kernel PCA reduces to regular PCA in a feature space [18] (and can be computed with knowledge of the kernel alone), the following discussion applies equally to kernel-PCA estimates, with the understanding that, in that case, Sρ is the span of the support of ρ in the feature space. [sent-49, score-0.245]

26 2 Performance criteria In order for a bound of the form of Equation (1) to be meaningful, a choice of performance criterion d must be made. [sent-51, score-0.11]

27 We define the distance d_{α,p}(U, V) := ‖(P_U − P_V) C^α‖_p (2) between subspaces U, V, which is a metric over the space of subspaces contained in Sρ, for 0 ≤ α ≤ 1/2 and 1 ≤ p ≤ ∞. [sent-52, score-0.256]
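A sketch of how this distance could be evaluated in a finite-dimensional setting, reading ‖·‖_p as the Schatten p-norm (our reading of the notation); the function and variable names are illustrative.

```python
import numpy as np

def d_alpha_p(P_U, P_V, C, alpha, p):
    """d_{alpha,p}(U, V) = || (P_U - P_V) C^alpha ||_p, with ||.||_p read as the
    Schatten p-norm (p = np.inf gives the operator norm)."""
    evals, evecs = np.linalg.eigh(C)
    evals = np.clip(evals, 0.0, None)              # C is positive semi-definite
    C_alpha = (evecs * evals ** alpha) @ evecs.T   # C^alpha through its spectrum
    s = np.linalg.svd((P_U - P_V) @ C_alpha, compute_uv=False)
    return s.max() if np.isinf(p) else float((s ** p).sum() ** (1.0 / p))
```

Under this reading, the reconstruction error of the next sentence is d_alpha_p(P_U, P_V, C, 0.5, 2) ** 2.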

28 In particular, the so-called reconstruction error [13]: d_R(Sρ, Ŝ) := E_{x∼ρ} ‖P_{Sρ}(x) − P_Ŝ(x)‖²_H is d_R(Sρ, ·) = d_{1/2,2}(Sρ, ·)². [sent-55, score-0.123]

29 Note that d_R is a natural criterion because a k-truncated PCA estimate minimizes a suitable error d̂_R over all subspaces of dimension k. [sent-56, score-0.189]

30 (1), for the k-truncated PCA estimate Ŝ^k_n (with the empirical estimate Ŝn := Ŝ^n_n being a particular case), whose proof is postponed to Sec. [sent-62, score-0.133]

31 We begin by bounding the distance d_{α,p} between Sρ and the k-truncated PCA estimate Ŝ^k_n, given a known covariance C. [sent-64, score-0.206]

32 according to a probability measure ρ supported on the unit ball of a separable Hilbert space H, with covariance C. [sent-70, score-0.108]

33 We say that C has eigenvalue decay rate of order r if there are constants q, Q > 0 such that q j^{−r} ≤ σ_j ≤ Q j^{−r}, where σ_j are the (decreasingly ordered) eigenvalues of C, and r > 1. [sent-75, score-0.437]

34 From Equation (2) it is clear that, in order for the subspace learning problem to be well-defined, it must be ‖C^α‖_p < ∞, or alternatively: αp > 1/r. [sent-76, score-0.301]

35 Note that this condition is always met for p = ∞, and also holds in the reconstruction error case (α = 1/2, p = 2), for any decay rate r > 1. [sent-77, score-0.4]

36 Knowledge of an eigenvalue decay rate can be incorporated into Theorem 3. [sent-78, score-0.357]

37 1, it is, with probability 1 − δ, d_{α,p}(Sρ, Ŝ^k_n) ≤ Q′ k^{−rα+1/p} if k < k*_n (polynomial decay), and d_{α,p}(Sρ, Ŝ^k_n) ≤ Q′ (k*_n)^{−rα+1/p} if k ≥ k*_n (plateau), (4) where k*_n = (q n / (9 log(n/δ)))^{1/r} and Q′ = 3 (Q^{1/r} Γ(αp − 1/r) Γ(1 + 1/r) / Γ(1/r))^{1/p}. [sent-84, score-1.18]
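A minimal numeric sketch of this bound, assuming the expressions for k*_n and Q′ as written above (the grouping of the constant is our reading); all inputs are illustrative.

```python
import math

def plateau_threshold(n, q, r, delta):
    """k*_n = (q n / (9 log(n/delta)))**(1/r), as read from the theorem above."""
    return (q * n / (9.0 * math.log(n / delta))) ** (1.0 / r)

def d_alpha_p_bound(k, n, q, Q, r, alpha, p, delta):
    """Upper bound on d_{alpha,p}(S_rho, S^k_n): decays like k**(-r*alpha + 1/p)
    until k reaches k*_n, then stays flat (plateau). Requires alpha*p > 1/r."""
    k_star = plateau_threshold(n, q, r, delta)
    Q_prime = 3.0 * (Q ** (1.0 / r)
                     * math.gamma(alpha * p - 1.0 / r)
                     * math.gamma(1.0 + 1.0 / r)
                     / math.gamma(1.0 / r)) ** (1.0 / p)
    return Q_prime * min(k, k_star) ** (-r * alpha + 1.0 / p)

# reconstruction-error setting: alpha = 1/2, p = 2, decay rate r = 2
print(d_alpha_p_bound(k=50, n=1000, q=0.1, Q=1.0, r=2, alpha=0.5, p=2, delta=0.05))
```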

38 The above theorem guarantees a drop in d_{α,p} with increasing k, at a rate of k^{−rα+1/p}, up to k = k*_n, after which the bound remains constant. [sent-85, score-0.469]

39 The estimated plateau threshold k*_n is thus the truncation level past which the upper bound does not improve. [sent-86, score-0.268]

40 Note that, as described in Section 5, this performance drop and plateau behavior is observed in practice. [sent-87, score-0.178]

41 If we consider an algorithm that produces, for each set of n samples, an estimate Ŝ^k_n with k ≥ k*_n, then, by plugging the definition of k*_n into Eq. [sent-95, score-0.615]

42 Let C have eigenvalue decay rate of order r, and Q′, k*_n be as in Theorem 3. [sent-99, score-0.641]

43 Let Ŝ*_n be a truncated subspace estimate Ŝ^k_n with k ≥ k*_n. [sent-101, score-0.632]

44 Note that, by setting k = n, the above corollary also provides guarantees on the rate of convergence of the empirical estimate Ŝn = span(Xn) to Sρ, of order d_{α,p}(Sρ, Ŝn) = O( ((log n − log δ)/n)^{α − 1/(rp)} ). [sent-104, score-0.215]

45 4 are valid for all n such that k*_n ≤ n (or equivalently such that n^{r−1}(log n − log δ) ≥ q/9). [sent-107, score-0.284]

46 Note that, because ρ is supported on the unit ball, its covariance has eigenvalues no greater than one, and therefore it must be q < 1. [sent-108, score-0.133]

47 It thus suffices to require that n > 3 to ensure that the condition k*_n ≤ n holds. [sent-109, score-0.284]

48 4 Applications of subspace learning We describe next some of the main uses of subspace learning in the literature. [sent-110, score-0.602]

49 1 Kernel PCA and embedding methods One of the main applications of subspace learning is in reducing the dimensionality of the input. [sent-112, score-0.48]

50 In particular, one may find nested subspaces of dimension 1 ≤ k ≤ n that minimize the distances from the original to the projected samples. [sent-113, score-0.139]

51 In particular, the above procedure amounts to computing an eigen-decomposition of the empirical covariance (Sec. [sent-115, score-0.12]

52 1): Cn = Σ_{i=1}^n σ_i u_i ⊗ u_i, where the k-th subspace estimate is Ŝ^k_n := Ran C^k_n = span{u_i : 1 ≤ i ≤ k}. [sent-117, score-0.402]

53 Note that, in the general case of kernel PCA, we assume the samples {xi }1≤i≤n to be in some RKHS H, which are obtained from the observed variables (z1 , . [sent-118, score-0.108]

54 , zn ) ∈ Z n , for some space Z, through an embedding xi := φ(zi ). [sent-121, score-0.167]

55 Typically, due to the very high dimensionality of H, we may only have indirect information about φ in the form of a kernel function K : Z × Z → R: a symmetric, positive definite function satisfying K(z, w) = ⟨φ(z), φ(w)⟩_H [20] (for technical reasons, we also assume K to be continuous). [sent-122, score-0.152]

56 120–121], whereas, given K, the embedding φ is only unique up to an inner product-preserving transformation. [sent-124, score-0.135]

57 Given a point z ∈ Z, we can make use of K to compute the coordinates of the projection of its embedding φ(z) onto Ŝ^k_n ⊆ H by means of a simple k-truncated eigen-decomposition of Kn. [sent-125, score-0.135]
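A sketch of that projection step, assuming an explicit kernel is available (a Gaussian kernel is used here only for illustration); the coordinate formula follows the standard kernel-PCA construction rather than any code from the paper.

```python
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kpca_coordinates(Z, z_new, k, kernel=gaussian_kernel):
    """Coordinates of the projection of phi(z_new) onto the k-truncated
    kernel-PCA subspace spanned by the embedded training points Z."""
    Kn = kernel(Z, Z)                         # Gram matrix (K_n)_ij = K(z_i, z_j)
    evals, evecs = np.linalg.eigh(Kn)
    order = np.argsort(evals)[::-1][:k]       # top-k eigenpairs (assumed > 0)
    lam, V = evals[order], evecs[:, order]
    t = kernel(z_new[None, :], Z).ravel()     # (t_z)_i = K(z_new, z_i)
    # j-th coordinate: <phi(z_new), v_j>, with v_j = sum_i V_ij phi(z_i) / sqrt(lam_j)
    return (V.T @ t) / np.sqrt(lam)

# illustrative usage
rng = np.random.default_rng(0)
Z = rng.standard_normal((200, 3))
coords = kpca_coordinates(Z, z_new=rng.standard_normal(3), k=5)
```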

58 It is easy to see that the k-truncated kernel PCA subspace Ŝ^k_n minimizes the empirical reconstruction error d̂_R(Ŝn, Ŝ), among all subspaces Ŝ of dimension k. [sent-126, score-0.687]

59 Since we are interested in the expected error d_R(Sρ, Ŝ^k_n) (rather than the empirical d̂_R(Ŝn, Ŝ)) of the kernel PCA estimate, we may obtain a learning rate for Equation 5 by particularizing Theorem 3. [sent-128, score-0.227]

60 In particular, recalling that d_R(Sρ, ·) = d_{α,p}(Sρ, ·)² with α = 1/2 and p = 2, and choosing a value of k ≥ k*_n that minimizes the bound of Theorem 3. [sent-132, score-0.366]

61 Let C have eigenvalue decay rate of order r, and Ŝ*_n be as in Corollary 3. [sent-136, score-0.357]

62 2 Support estimation The problem of support estimation consists in recovering the support M of a distribution ρ on a metric space Z from independent and identically distributed samples Zn = (z_i)_{1≤i≤n}. [sent-140, score-0.142]

63 We briefly recall a recently proposed approach to support estimation based on subspace learning [9], and discuss how our results specialize to this setting, producing a qualitative improvement to theirs. [sent-141, score-0.357]

64 Given a suitable reproducing kernel K on Z (with associated feature map φ), the support M can be characterized in terms of the subspace Sρ = span φ(M ) ⊆ H [9]. [sent-142, score-0.546]

65 More precisely, letting d_V(x) = ‖x − P_V x‖_H be the point-subspace distance to a subspace V, it can be shown (see [9]) that, if the kernel separates^1 M, then M = {z ∈ Z | d_{Sρ}(φ(z)) = 0}. [sent-143, score-0.49]

66 This suggests an empirical estimate M̂ = {z ∈ Z | d_Ŝ(φ(z)) ≤ τ} of M, where Ŝ = span φ(Zn), and τ > 0. [sent-144, score-0.167]

67 More precisely, if the eigenfunctions of the covariance operator C = E_{z∼ρ}[φ(z) ⊗ φ(z)] are uniformly bounded, then it suffices for Hausdorff convergence to bound from above d_{(r−1)/(2r), ∞} (where r > 1 is the eigenvalue decay rate of C). [sent-146, score-0.591]

68 Figure 1: The figure shows the experimental behavior of the distance d_{α,∞}(Ŝ^k, Sρ) between the empirical and the actual support subspaces, with respect to the regularization parameter. [sent-153, score-0.149]

69 Here the actual subspace is analytically computed, while the empirical one is computed on a dataset with n = 1000 and 32bit floating point precision. [sent-155, score-0.371]

70 Letting α = (r−1)/(2r) above yields a high probability bound of order O(n^{−(r−1)/(2r)}) (up to logarithmic factors), which is considerably sharper than the bound O(n^{−(r−1)/(2(3r−1))}) found in [8] (Theorem 7). [sent-162, score-0.153]

71 1 A kernel is said to separate M if its associated feature map φ satisfies φ−1 (span φ(M )) = M (e. [sent-163, score-0.108]

72 Using the reproducing property of K, it can be shown that, for z ∈ Z, d_{Ŝ^k_n}(φ(z))² = K(z, z) − ⟨t_z, (K̂^k_n)† t_z⟩, where (t_z)_i = K(z, z_i), K̂n is the Gram matrix (K̂n)_{ij} = K(z_i, z_j), K̂^k_n is the rank-k approximation of K̂n, and (K̂^k_n)† is the pseudo-inverse of K̂^k_n. [sent-175, score-0.178]
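A sketch of that computation and of the thresholded support estimate; the rank-k pseudo-inverse is formed from the top-k eigenpairs of the Gram matrix, and the threshold τ, argument names, and input conventions (Kn is the training Gram matrix, t_z the vector of kernel evaluations K(z, z_i), k_zz = K(z, z)) are ours.

```python
import numpy as np

def support_distance_sq(Kn, t_z, k_zz, k):
    """d_{S^k_n}(phi(z))^2 = K(z,z) - <t_z, (K_n^k)^+ t_z>, with K_n^k the rank-k
    approximation of the Gram matrix K_n and ()^+ its pseudo-inverse."""
    evals, evecs = np.linalg.eigh(Kn)
    order = np.argsort(evals)[::-1][:k]
    lam, V = evals[order], evecs[:, order]        # top-k eigenpairs (assumed > 0)
    pinv_t = V @ ((V.T @ t_z) / lam)              # (K_n^k)^+ t_z = V diag(1/lam) V^T t_z
    return max(float(k_zz - t_z @ pinv_t), 0.0)   # clip tiny negative round-off

def in_estimated_support(Kn, t_z, k_zz, k, tau):
    """Empirical support estimate: keep z iff d_{S^k_n}(phi(z)) <= tau."""
    return np.sqrt(support_distance_sq(Kn, t_z, k_zz, k)) <= tau
```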

73 5 Experiments Figure 2: The spectrum of the empirical covariance (left), and the expected distance from a random sample to the empirical k-truncated kernel-PCA subspace estimate (right), as a function of k (n = 1000, 1000 trials shown in a boxplot). [sent-179, score-0.603]

74 2) is a good estimate of the value k past which the distance stabilizes. [sent-181, score-0.129]

75 Given n samples drawn from ρ, we compute its empirical covariance in H (whose spectrum is plotted in Figure 2 (left)), and truncate its eigen-decomposition to obtain a subspace estimate Ŝ^k_n, as described in Section 2. [sent-185, score-0.51]

76 Figure 2 (right) is a box plot of the reconstruction error d_R(Sρ, Ŝ^k_n) associated with the k-truncated kernel-PCA estimate Ŝ^k_n (the expected distance in H of samples to Ŝ^k_n), with n = 1000 and varying k. [sent-187, score-0.224]

77 Notice from the figure that, as pointed out in [6] and discussed in Section 6, the reconstruction error d_R(Sρ, Ŝ^k_n) is always a non-increasing function of k, due to the fact that the kernel-PCA estimates are nested: Ŝ^k_n ⊂ Ŝ^k′_n for k < k′ (see Section 2. [sent-189, score-0.188]

78 The graph is highly concentrated around a curve with a steep initial drop, until reaching some sufficiently high k, past which the reconstruction (pseudo) distance becomes stable, and does not vanish. [sent-191, score-0.222]

79 In our experiments, this behavior is typical for the reconstruction distance and high-dimensional problems. [sent-192, score-0.151]

80 Due to the simple form of this example, we are able to compute analytically the spectrum of the true covariance C. [sent-193, score-0.154]

81 In this case, the eigenvalues of C decay as 2γ/((kπ)² + γ²), with k ∈ N, and therefore they have a polynomial decay rate r = 2 (see Section 3). [sent-194, score-0.542]

82 Given the known spectrum decay rate, we can estimate the plateau threshold k = k*_n in the bound of Theorem 3. [sent-195, score-0.786]

83 2, which can be seen to be a good approximation of the observed start of a plateau in d_R(Sρ, Ŝ^k_n) (Figure 2, right). [sent-196, score-0.135]
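A small numeric sketch of that estimate for this example, taking γ = 1 and δ = 0.05 (our choices) and fitting the decay constants q, Q from σ_k = 2γ/((kπ)² + γ²); the threshold uses the expression for k*_n as reconstructed above, so the numbers are indicative only.

```python
import numpy as np

gamma_, n, delta, r = 1.0, 1000, 0.05, 2
ks = np.arange(1, 2001)
sigma = 2 * gamma_ / ((ks * np.pi) ** 2 + gamma_ ** 2)   # true eigenvalues of C

# polynomial decay of order r = 2:  q * k**-2 <= sigma_k <= Q * k**-2
q = (sigma * ks ** 2.0).min()
Q = (sigma * ks ** 2.0).max()

k_star = (q * n / (9 * np.log(n / delta))) ** (1.0 / r)  # estimated plateau threshold
print(round(float(q), 3), round(float(Q), 3), round(float(k_star), 2))
```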

84 1) similarly predicts a steep performance drop until the threshold k = k*_n (indicated in the figure by the vertical blue line), and a plateau afterwards. [sent-198, score-0.533]

85 The plot shows the polynomial decay rate c of the high probability bound d_R(Sρ, Ŝ^k_n) = O(n^{−c}), as a function of the eigenvalue decay rate r of the covariance C, computed at the best value k*_n (which minimizes the bound). [sent-200, score-1.096]

86 Figure 3: Known upper bounds for the polynomial decay rate c (for the best choice of k), for the expected distance from a random sample to the empirical k-truncated kernel-PCA estimate, as a function of the covariance eigenvalue decay rate (higher is better). [sent-205, score-0.851]

87 The learning rate exponent c, under a polynomial eigenvalue decay assumption on the data covariance C, is c = s(r−1)/(r−s+sr) for [6] and c = (r−1)/(2r−1) for [19], where s is related to the fourth moment. [sent-208, score-0.397]
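For comparison, the corollary above gives, for the reconstruction error (α = 1/2, p = 2), an exponent c = (r − 1)/r up to logarithmic factors. A short sketch tabulating the three exponents as in Figure 3; the fourth-moment parameter s is left as a free input, and the function names are ours.

```python
def c_this_paper(r):      # reconstruction error via the corollary: (r - 1) / r
    return (r - 1) / r

def c_ref6(r, s):         # rate reported for [6]; s is related to the fourth moment
    return s * (r - 1) / (r - s + s * r)

def c_ref19(r):           # rate reported for [19]
    return (r - 1) / (2 * r - 1)

for r in (2, 4, 6, 8, 10):
    print(r, round(c_this_paper(r), 3), round(c_ref6(r, s=2), 3), round(c_ref19(r), 3))
```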

88 1, the subspace estimates Ŝ^k_n are nested for increasing k (i.e. Ŝ^k_n ⊆ Ŝ^k′_n for k < k′), the distance d_{α,p}(Sρ, Ŝ^k_n), and in particular [sent-214, score-0.439]

89 the reconstruction error d_R(Sρ, Ŝ^k_n), is a non-increasing function of k. [sent-216, score-0.123]

90 Interestingly, however, both in practice (Section 5), and in theory (Section 3), we observe that a typical behavior for the subspace learning problem in high dimensions (e. [sent-219, score-0.301]

91 kernel PCA) is that there is a certain value of k = k*_n, past which performance plateaus. [sent-221, score-0.42]

92 Given an empirical covariance operator Cn, we will consider the truncated version r_λ(Cn) where, in this notation, r_λ is applied to the eigenvalues of Cn, that is, r_λ(Cn) has the same eigen-structure as Cn, but its eigenvalues that are less than or equal to λ are clamped to zero. [sent-226, score-0.288]
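A minimal sketch of that spectral truncation map (names ours): r_λ keeps the eigen-structure of Cn and zeroes out every eigenvalue ≤ λ.

```python
import numpy as np

def r_lambda(Cn, lam):
    """Spectral truncation r_lambda(C_n): same eigenvectors as C_n, with every
    eigenvalue less than or equal to lam clamped to zero."""
    evals, evecs = np.linalg.eigh(Cn)
    kept = np.where(evals > lam, evals, 0.0)
    return (evecs * kept) @ evecs.T
```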

93 In order to prove the bound of Equation (3), we begin by proving a more general upper bound on d_{α,p}(Sρ, Ŝ^k_n), which is split into a random part (A) and a deterministic part (B, C). [sent-227, score-0.128]

94 In cases in which some knowledge of the decay rate of C is available, Lemma 7. [sent-248, score-0.252]

95 1 is simply a particular case for the reconstruction error dR (Sρ , ·) = dα,p (Sρ , ·)2 , with α = 1/2, p = 2. [sent-253, score-0.123]

96 As noted in Section 3, looser bounds would be obtained if classical Bernstein inequalities in Hilbert spaces [14] were used instead. [sent-254, score-0.128]

97 For instance, for p = 2, α = 1/2, and a decay rate r = 2 (as in the example of Section 5), it would be: d1/2,2 (Sρ , Sn ) = O(n−1/4 ) using Theorem 3. [sent-259, score-0.252]

98 Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data. [sent-318, score-0.135]

99 A kernel view of the dimensionality reduction of manifolds. [sent-326, score-0.152]

100 On a connection between kernel pca and metric multidimensional scaling. [sent-422, score-0.367]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('sn', 0.518), ('subspace', 0.301), ('kn', 0.284), ('dr', 0.268), ('cn', 0.255), ('decay', 0.198), ('pca', 0.193), ('ps', 0.144), ('plateau', 0.135), ('embedding', 0.135), ('eigenmaps', 0.122), ('kernel', 0.108), ('eigenvalue', 0.105), ('reconstruction', 0.097), ('hilbert', 0.09), ('subspaces', 0.086), ('ti', 0.082), ('span', 0.081), ('covariance', 0.081), ('spectral', 0.076), ('corollary', 0.075), ('mds', 0.073), ('tz', 0.073), ('lorenzo', 0.064), ('operator', 0.064), ('rosasco', 0.056), ('support', 0.056), ('rate', 0.054), ('distance', 0.054), ('nested', 0.053), ('bound', 0.052), ('ex', 0.052), ('eigenvalues', 0.052), ('purple', 0.051), ('sharper', 0.049), ('hausdorff', 0.049), ('istituto', 0.049), ('paiement', 0.049), ('tecnologia', 0.049), ('hs', 0.047), ('estimate', 0.047), ('qn', 0.044), ('dimensionality', 0.044), ('italiano', 0.043), ('isomap', 0.043), ('looser', 0.043), ('ernesto', 0.043), ('vito', 0.043), ('steep', 0.043), ('drop', 0.043), ('spectrum', 0.042), ('bernstein', 0.042), ('rkhs', 0.04), ('landau', 0.04), ('polynomial', 0.04), ('empirical', 0.039), ('ds', 0.038), ('eigenfunctions', 0.037), ('multidimensional', 0.036), ('theorem', 0.036), ('gu', 0.035), ('instability', 0.034), ('pointed', 0.034), ('ran', 0.033), ('zn', 0.032), ('zi', 0.032), ('semide', 0.031), ('analytically', 0.031), ('estimates', 0.031), ('unfolding', 0.03), ('pv', 0.03), ('alessandro', 0.03), ('metric', 0.03), ('minimizes', 0.03), ('remark', 0.029), ('classical', 0.029), ('criteria', 0.029), ('past', 0.028), ('di', 0.028), ('inequalities', 0.028), ('threshold', 0.028), ('embed', 0.028), ('manifolds', 0.028), ('qj', 0.028), ('bounds', 0.028), ('principal', 0.028), ('ball', 0.027), ('letting', 0.027), ('separating', 0.027), ('ui', 0.027), ('gram', 0.026), ('error', 0.026), ('lemma', 0.026), ('unsupervised', 0.025), ('tools', 0.025), ('holds', 0.025), ('truncation', 0.025), ('begin', 0.024), ('inversion', 0.024), ('proofs', 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999994 224 nips-2013-On the Sample Complexity of Subspace Learning

Author: Alessandro Rudi, Guillermo D. Canas, Lorenzo Rosasco

Abstract: A large number of algorithms in machine learning, from principal component analysis (PCA), and its non-linear (kernel) extensions, to more recent spectral embedding and support estimation methods, rely on estimating a linear subspace from samples. In this paper we introduce a general formulation of this problem and derive novel learning error estimates. Our results rely on natural assumptions on the spectral properties of the covariance operator associated to the data distribution, and hold for a wide class of metrics between subspaces. As special cases, we discuss sharp error estimates for the reconstruction properties of PCA and spectral support estimation. Key to our analysis is an operator theoretic approach that has broad applicability to spectral learning methods. 1

2 0.23425335 192 nips-2013-Minimax Theory for High-dimensional Gaussian Mixtures with Sparse Mean Separation

Author: Martin Azizyan, Aarti Singh, Larry Wasserman

Abstract: While several papers have investigated computationally and statistically efficient methods for learning Gaussian mixtures, precise minimax bounds for their statistical performance as well as fundamental limits in high-dimensional settings are not well-understood. In this paper, we provide precise information theoretic bounds on the clustering accuracy and sample complexity of learning a mixture of two isotropic Gaussians in high dimensions under small mean separation. If there is a sparse subset of relevant dimensions that determine the mean separation, then the sample complexity only depends on the number of relevant dimensions and mean separation, and can be achieved by a simple computationally efficient procedure. Our results provide the first step of a theoretical basis for recent methods that combine feature selection and clustering. 1

3 0.20176381 116 nips-2013-Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA

Author: Vincent Q. Vu, Juhee Cho, Jing Lei, Karl Rohe

Abstract: We propose a novel convex relaxation of sparse principal subspace estimation based on the convex hull of rank-d projection matrices (the Fantope). The convex problem can be solved efficiently using alternating direction method of multipliers (ADMM). We establish a near-optimal convergence rate, in terms of the sparsity, ambient dimension, and sample size, for estimation of the principal subspace of a general covariance matrix without assuming the spiked covariance model. In the special case of d = 1, our result implies the near-optimality of DSPCA (d’Aspremont et al. [1]) even when the solution is not rank 1. We also provide a general theoretical framework for analyzing the statistical properties of the method for arbitrary input matrices that extends the applicability and provable guarantees to a wide array of settings. We demonstrate this with an application to Kendall’s tau correlation matrices and transelliptical component analysis. 1

4 0.1618557 137 nips-2013-High-Dimensional Gaussian Process Bandits

Author: Josip Djolonga, Andreas Krause, Volkan Cevher

Abstract: Many applications in machine learning require optimizing unknown functions defined over a high-dimensional space from noisy samples that are expensive to obtain. We address this notoriously hard challenge, under the assumptions that the function varies only along some low-dimensional subspace and is smooth (i.e., it has a low norm in a Reproducible Kernel Hilbert Space). In particular, we present the SI-BO algorithm, which leverages recent low-rank matrix recovery techniques to learn the underlying subspace of the unknown function and applies Gaussian Process Upper Confidence sampling for optimization of the function. We carefully calibrate the exploration–exploitation tradeoff by allocating the sampling budget to subspace estimation and function optimization, and obtain the first subexponential cumulative regret bounds and convergence rates for Bayesian optimization in high-dimensions under noisy observations. Numerical results demonstrate the effectiveness of our approach in difficult scenarios. 1

5 0.15538742 91 nips-2013-Dirty Statistical Models

Author: Eunho Yang, Pradeep Ravikumar

Abstract: We provide a unified framework for the high-dimensional analysis of “superposition-structured” or “dirty” statistical models: where the model parameters are a superposition of structurally constrained parameters. We allow for any number and types of structures, and any statistical model. We consider the general class of M -estimators that minimize the sum of any loss function, and an instance of what we call a “hybrid” regularization, that is the infimal convolution of weighted regularization functions, one for each structural component. We provide corollaries showcasing our unified framework for varied statistical models such as linear regression, multiple regression and principal component analysis, over varied superposition structures. 1

6 0.15283081 259 nips-2013-Provable Subspace Clustering: When LRR meets SSC

7 0.14495356 233 nips-2013-Online Robust PCA via Stochastic Optimization

8 0.13059512 188 nips-2013-Memory Limited, Streaming PCA

9 0.12755914 173 nips-2013-Least Informative Dimensions

10 0.12652503 285 nips-2013-Robust Transfer Principal Component Analysis with Rank Constraints

11 0.12301978 281 nips-2013-Robust Low Rank Kernel Embeddings of Multivariate Distributions

12 0.11501167 286 nips-2013-Robust learning of low-dimensional dynamics from large neural ensembles

13 0.10628391 232 nips-2013-Online PCA for Contaminated Data

14 0.099577926 144 nips-2013-Inverse Density as an Inverse Problem: the Fredholm Equation Approach

15 0.095354952 300 nips-2013-Solving the multi-way matching problem by permutation synchronization

16 0.095254987 268 nips-2013-Reflection methods for user-friendly submodular optimization

17 0.092652343 179 nips-2013-Low-Rank Matrix and Tensor Completion via Adaptive Sampling

18 0.091375522 158 nips-2013-Learning Multiple Models via Regularized Weighting

19 0.090484396 31 nips-2013-Adaptivity to Local Smoothness and Dimension in Kernel Regression

20 0.085623316 314 nips-2013-Stochastic Optimization of PCA with Capped MSG


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.206), (1, 0.084), (2, 0.129), (3, 0.088), (4, -0.04), (5, 0.046), (6, -0.057), (7, 0.031), (8, -0.208), (9, -0.052), (10, -0.08), (11, 0.14), (12, 0.026), (13, 0.135), (14, 0.19), (15, 0.091), (16, 0.048), (17, 0.034), (18, -0.047), (19, -0.071), (20, 0.067), (21, -0.013), (22, 0.008), (23, -0.032), (24, -0.009), (25, 0.008), (26, -0.001), (27, 0.161), (28, -0.07), (29, 0.016), (30, -0.071), (31, -0.039), (32, 0.02), (33, -0.039), (34, 0.003), (35, 0.069), (36, 0.049), (37, 0.064), (38, 0.109), (39, 0.016), (40, 0.059), (41, 0.069), (42, -0.059), (43, -0.014), (44, 0.035), (45, 0.091), (46, 0.003), (47, 0.042), (48, -0.015), (49, 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96057719 224 nips-2013-On the Sample Complexity of Subspace Learning

Author: Alessandro Rudi, Guillermo D. Canas, Lorenzo Rosasco

Abstract: A large number of algorithms in machine learning, from principal component analysis (PCA), and its non-linear (kernel) extensions, to more recent spectral embedding and support estimation methods, rely on estimating a linear subspace from samples. In this paper we introduce a general formulation of this problem and derive novel learning error estimates. Our results rely on natural assumptions on the spectral properties of the covariance operator associated to the data distribution, and hold for a wide class of metrics between subspaces. As special cases, we discuss sharp error estimates for the reconstruction properties of PCA and spectral support estimation. Key to our analysis is an operator theoretic approach that has broad applicability to spectral learning methods. 1

2 0.73878455 259 nips-2013-Provable Subspace Clustering: When LRR meets SSC

Author: Yu-Xiang Wang, Huan Xu, Chenlei Leng

Abstract: Sparse Subspace Clustering (SSC) and Low-Rank Representation (LRR) are both considered as the state-of-the-art methods for subspace clustering. The two methods are fundamentally similar in that both are convex optimizations exploiting the intuition of “Self-Expressiveness”. The main difference is that SSC minimizes the vector 1 norm of the representation matrix to induce sparsity while LRR minimizes nuclear norm (aka trace norm) to promote a low-rank structure. Because the representation matrix is often simultaneously sparse and low-rank, we propose a new algorithm, termed Low-Rank Sparse Subspace Clustering (LRSSC), by combining SSC and LRR, and develops theoretical guarantees of when the algorithm succeeds. The results reveal interesting insights into the strength and weakness of SSC and LRR and demonstrate how LRSSC can take the advantages of both methods in preserving the “Self-Expressiveness Property” and “Graph Connectivity” at the same time. 1

3 0.72697723 116 nips-2013-Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA

Author: Vincent Q. Vu, Juhee Cho, Jing Lei, Karl Rohe

Abstract: We propose a novel convex relaxation of sparse principal subspace estimation based on the convex hull of rank-d projection matrices (the Fantope). The convex problem can be solved efficiently using alternating direction method of multipliers (ADMM). We establish a near-optimal convergence rate, in terms of the sparsity, ambient dimension, and sample size, for estimation of the principal subspace of a general covariance matrix without assuming the spiked covariance model. In the special case of d = 1, our result implies the near-optimality of DSPCA (d’Aspremont et al. [1]) even when the solution is not rank 1. We also provide a general theoretical framework for analyzing the statistical properties of the method for arbitrary input matrices that extends the applicability and provable guarantees to a wide array of settings. We demonstrate this with an application to Kendall’s tau correlation matrices and transelliptical component analysis. 1

4 0.62894183 192 nips-2013-Minimax Theory for High-dimensional Gaussian Mixtures with Sparse Mean Separation

Author: Martin Azizyan, Aarti Singh, Larry Wasserman

Abstract: While several papers have investigated computationally and statistically efficient methods for learning Gaussian mixtures, precise minimax bounds for their statistical performance as well as fundamental limits in high-dimensional settings are not well-understood. In this paper, we provide precise information theoretic bounds on the clustering accuracy and sample complexity of learning a mixture of two isotropic Gaussians in high dimensions under small mean separation. If there is a sparse subset of relevant dimensions that determine the mean separation, then the sample complexity only depends on the number of relevant dimensions and mean separation, and can be achieved by a simple computationally efficient procedure. Our results provide the first step of a theoretical basis for recent methods that combine feature selection and clustering. 1

5 0.6272729 137 nips-2013-High-Dimensional Gaussian Process Bandits

Author: Josip Djolonga, Andreas Krause, Volkan Cevher

Abstract: Many applications in machine learning require optimizing unknown functions defined over a high-dimensional space from noisy samples that are expensive to obtain. We address this notoriously hard challenge, under the assumptions that the function varies only along some low-dimensional subspace and is smooth (i.e., it has a low norm in a Reproducible Kernel Hilbert Space). In particular, we present the SI-BO algorithm, which leverages recent low-rank matrix recovery techniques to learn the underlying subspace of the unknown function and applies Gaussian Process Upper Confidence sampling for optimization of the function. We carefully calibrate the exploration–exploitation tradeoff by allocating the sampling budget to subspace estimation and function optimization, and obtain the first subexponential cumulative regret bounds and convergence rates for Bayesian optimization in high-dimensions under noisy observations. Numerical results demonstrate the effectiveness of our approach in difficult scenarios. 1

6 0.59597433 91 nips-2013-Dirty Statistical Models

7 0.579997 188 nips-2013-Memory Limited, Streaming PCA

8 0.57056749 314 nips-2013-Stochastic Optimization of PCA with Capped MSG

9 0.54736304 256 nips-2013-Probabilistic Principal Geodesic Analysis

10 0.54628515 233 nips-2013-Online Robust PCA via Stochastic Optimization

11 0.53547913 284 nips-2013-Robust Spatial Filtering with Beta Divergence

12 0.53475237 173 nips-2013-Least Informative Dimensions

13 0.50814474 324 nips-2013-The Fast Convergence of Incremental PCA

14 0.47766414 158 nips-2013-Learning Multiple Models via Regularized Weighting

15 0.47702932 297 nips-2013-Sketching Structured Matrices for Faster Nonlinear Regression

16 0.47034633 130 nips-2013-Generalizing Analytic Shrinkage for Arbitrary Covariance Structures

17 0.46636981 117 nips-2013-Fast Algorithms for Gaussian Noise Invariant Independent Component Analysis

18 0.46137851 285 nips-2013-Robust Transfer Principal Component Analysis with Rank Constraints

19 0.46027839 197 nips-2013-Moment-based Uniform Deviation Bounds for $k$-means and Friends

20 0.4588857 272 nips-2013-Regularized Spectral Clustering under the Degree-Corrected Stochastic Blockmodel


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.015), (16, 0.03), (19, 0.017), (30, 0.174), (33, 0.124), (34, 0.133), (41, 0.043), (49, 0.031), (56, 0.147), (70, 0.013), (85, 0.066), (89, 0.062), (93, 0.029), (95, 0.041)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.89832252 19 nips-2013-Accelerated Mini-Batch Stochastic Dual Coordinate Ascent

Author: Shai Shalev-Shwartz, Tong Zhang

Abstract: Stochastic dual coordinate ascent (SDCA) is an effective technique for solving regularized loss minimization problems in machine learning. This paper considers an extension of SDCA under the mini-batch setting that is often used in practice. Our main contribution is to introduce an accelerated mini-batch version of SDCA and prove a fast convergence rate for this method. We discuss an implementation of our method over a parallel computing system, and compare the results to both the vanilla stochastic dual coordinate ascent and to the accelerated deterministic gradient descent method of Nesterov [2007]. 1

same-paper 2 0.85978132 224 nips-2013-On the Sample Complexity of Subspace Learning

Author: Alessandro Rudi, Guillermo D. Canas, Lorenzo Rosasco

Abstract: A large number of algorithms in machine learning, from principal component analysis (PCA), and its non-linear (kernel) extensions, to more recent spectral embedding and support estimation methods, rely on estimating a linear subspace from samples. In this paper we introduce a general formulation of this problem and derive novel learning error estimates. Our results rely on natural assumptions on the spectral properties of the covariance operator associated to the data distribution, and hold for a wide class of metrics between subspaces. As special cases, we discuss sharp error estimates for the reconstruction properties of PCA and spectral support estimation. Key to our analysis is an operator theoretic approach that has broad applicability to spectral learning methods. 1

3 0.8482002 110 nips-2013-Estimating the Unseen: Improved Estimators for Entropy and other Properties

Author: Paul Valiant, Gregory Valiant

Abstract: Recently, Valiant and Valiant [1, 2] showed that a class of distributional properties, which includes such practically relevant properties as entropy, the number of distinct elements, and distance metrics between pairs of distributions, can be estimated given a sublinear sized sample. Specifically, given a sample consisting of independent draws from any distribution over at most n distinct elements, these properties can be estimated accurately using a sample of size O(n/ log n). We propose a novel modification of this approach and show: 1) theoretically, this estimator is optimal (to constant factors, over worst-case instances), and 2) in practice, it performs exceptionally well for a variety of estimation tasks, on a variety of natural distributions, for a wide range of parameters. Perhaps unsurprisingly, the key step in our approach is to first use the sample to characterize the “unseen” portion of the distribution. This goes beyond such tools as the Good-Turing frequency estimation scheme, which estimates the total probability mass of the unobserved portion of the distribution: we seek to estimate the shape of the unobserved portion of the distribution. This approach is robust, general, and theoretically principled; we expect that it may be fruitfully used as a component within larger machine learning and data analysis systems. 1

4 0.81120598 113 nips-2013-Exact and Stable Recovery of Pairwise Interaction Tensors

Author: Shouyuan Chen, Michael R. Lyu, Irwin King, Zenglin Xu

Abstract: Tensor completion from incomplete observations is a problem of significant practical interest. However, it is unlikely that there exists an efficient algorithm with provable guarantee to recover a general tensor from a limited number of observations. In this paper, we study the recovery algorithm for pairwise interaction tensors, which has recently gained considerable attention for modeling multiple attribute data due to its simplicity and effectiveness. Specifically, in the absence of noise, we show that one can exactly recover a pairwise interaction tensor by solving a constrained convex program which minimizes the weighted sum of nuclear norms of matrices from O(nr log2 (n)) observations. For the noisy cases, we also prove error bounds for a constrained convex program for recovering the tensors. Our experiments on the synthetic dataset demonstrate that the recovery performance of our algorithm agrees well with the theory. In addition, we apply our algorithm on a temporal collaborative filtering task and obtain state-of-the-art results. 1

5 0.80798465 116 nips-2013-Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA

Author: Vincent Q. Vu, Juhee Cho, Jing Lei, Karl Rohe

Abstract: We propose a novel convex relaxation of sparse principal subspace estimation based on the convex hull of rank-d projection matrices (the Fantope). The convex problem can be solved efficiently using alternating direction method of multipliers (ADMM). We establish a near-optimal convergence rate, in terms of the sparsity, ambient dimension, and sample size, for estimation of the principal subspace of a general covariance matrix without assuming the spiked covariance model. In the special case of d = 1, our result implies the near-optimality of DSPCA (d’Aspremont et al. [1]) even when the solution is not rank 1. We also provide a general theoretical framework for analyzing the statistical properties of the method for arbitrary input matrices that extends the applicability and provable guarantees to a wide array of settings. We demonstrate this with an application to Kendall’s tau correlation matrices and transelliptical component analysis. 1

6 0.8040396 102 nips-2013-Efficient Algorithm for Privately Releasing Smooth Queries

7 0.79980975 79 nips-2013-DESPOT: Online POMDP Planning with Regularization

8 0.79802614 184 nips-2013-Marginals-to-Models Reducibility

9 0.79445815 228 nips-2013-Online Learning of Dynamic Parameters in Social Networks

10 0.79344171 249 nips-2013-Polar Operators for Structured Sparse Estimation

11 0.79137003 239 nips-2013-Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result

12 0.7911762 282 nips-2013-Robust Multimodal Graph Matching: Sparse Coding Meets Graph Matching

13 0.79059762 52 nips-2013-Bayesian inference as iterated random functions with applications to sequential inference in graphical models

14 0.79031545 280 nips-2013-Robust Data-Driven Dynamic Programming

15 0.79006559 357 nips-2013-k-Prototype Learning for 3D Rigid Structures

16 0.78920811 63 nips-2013-Cluster Trees on Manifolds

17 0.78765428 55 nips-2013-Bellman Error Based Feature Generation using Random Projections on Sparse Spaces

18 0.78720558 278 nips-2013-Reward Mapping for Transfer in Long-Lived Agents

19 0.78717297 101 nips-2013-EDML for Learning Parameters in Directed and Undirected Graphical Models

20 0.78696752 273 nips-2013-Reinforcement Learning in Robust Markov Decision Processes