nips nips2013 nips2013-314 knowledge-graph by maker-knowledge-mining

314 nips-2013-Stochastic Optimization of PCA with Capped MSG


Source: pdf

Author: Raman Arora, Andy Cotter, Nati Srebro

Abstract: We study PCA as a stochastic optimization problem and propose a novel stochastic approximation algorithm which we refer to as “Matrix Stochastic Gradient” (MSG), as well as a practical variant, Capped MSG. We study the method both theoretically and empirically.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: We study PCA as a stochastic optimization problem and propose a novel stochastic approximation algorithm which we refer to as “Matrix Stochastic Gradient” (MSG), as well as a practical variant, Capped MSG. [sent-4, score-0.152]

2 It is well known that this subspace is the span of the leading k components of the singular value decomposition of the data matrix (or equivalently of the empirical second moment matrix). [sent-10, score-0.122]

3 In this setting, we have some unknown source (“population”) distribution D over $\mathbb{R}^d$, and the goal is to find the k-dimensional subspace maximizing the (uncentered) variance of D inside the subspace (or equivalently, minimizing the average squared residual in the population), based on i.i.d. samples from D. [sent-16, score-0.190]

4 The main point here is that the true objective is not how well the subspace captures the sample [sent-20, score-0.144]

5 (i.e., the “training error”), but rather how well the subspace captures the underlying source distribution (i.e., the “generalization error”). [sent-22, score-0.115]

6 Of course, finding the subspace that best captures the sample is a very reasonable approach to PCA on the population. [sent-28, score-0.115]

7 Such a population-based view of optimization has recently been advocated in machine learning, and has been used to argue for crude stochastic approximation approaches (online-type methods) over sophisticated deterministic optimization of the empirical (training) objective (i.e., ERM). [sent-32, score-0.115]

8 A similar argument was also made in the context of stochastic optimization, where Nemirovski et al. [sent-35, score-0.085]

9 (2009) argue for stochastic approximation (SA) approaches over ERM. [sent-36, score-0.086]

10 We take the same view in order to advocate for, study, and develop stochastic approximation approaches for PCA. [sent-44, score-0.086]

11 In an empirical study of stochastic approximation methods for PCA, a heuristic “incremental” method showed very good empirical performance (Arora et al., 2012). [sent-45, score-0.105]

12 However, no theoretical guarantees or justification were given for incremental PCA. [sent-47, score-0.14]

13 Using an online-to-batch conversion, this online algorithm can be converted to a stochastic approximation algorithm with good iteration complexity; however, the runtime for each iteration is essentially the same as that of ERM [sent-51, score-0.25]

14 (i.e., of PCA on the sample), and thus senseless as a stochastic approximation method (see Section 3). [sent-53, score-0.086]

15 We then present the capped MSG algorithm, which is a more practical variant of MSG, has very similar updates to those of the “incremental” method, works well in practice, and does not get stuck like the “incremental” method. [sent-57, score-0.416]

16 2 Problem Setup: We consider PCA as the problem of finding the maximal (uncentered) variance k-dimensional subspace with respect to an (unknown) distribution D over $x \in \mathbb{R}^d$. [sent-59, score-0.095]

17 We represent a k-dimensional subspace by an orthonormal basis, collected in the columns of a matrix U . [sent-62, score-0.122]
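
In symbols, and as a point of reference for the "Problem 2.1" label that appears below, this population objective can be written in the following standard form (the paper's exact statement may differ in presentation):

```latex
\max_{U \in \mathbb{R}^{d \times k},\; U^{T}U = I_{k}}\;
\mathbb{E}_{x \sim D}\!\left[\,\|U^{T}x\|_{2}^{2}\,\right]
\;=\;
\max_{U^{T}U = I_{k}}\; \mathbb{E}_{x \sim D}\!\left[\, x^{T} U U^{T} x \,\right].
```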

18 As in other studies of stochastic approximation methods, we are less concerned with the number of required samples, and instead care mostly about the overall runtime required to obtain an $\epsilon$-suboptimal solution. [sent-69, score-0.205]

19 A standard approach to Problem 2.1 is empirical risk minimization (ERM): given samples $\{x_t\}_{t=1}^{T}$ drawn from D, we compute the empirical covariance matrix $\hat{C} = \frac{1}{T}\sum_{t=1}^{T} x_t x_t^T$, and take the columns of $\hat{U}$ to be the eigenvectors of $\hat{C}$ corresponding to the top-k eigenvalues. [sent-71, score-0.161]
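
As a concrete illustration of this ERM baseline, here is a minimal numpy sketch; the function name and interface are illustrative, not from the paper.

```python
import numpy as np

def erm_pca(X, k):
    """Batch ERM baseline for PCA: top-k eigenvectors of the empirical second-moment matrix.

    X : (T, d) array whose rows are samples x_1, ..., x_T drawn from D.
    Returns a (d, k) orthonormal basis U_hat for the estimated subspace.
    """
    T = X.shape[0]
    C_hat = (X.T @ X) / T                  # empirical (uncentered) covariance, (1/T) * sum_t x_t x_t^T
    evals, evecs = np.linalg.eigh(C_hat)   # eigenvalues returned in ascending order
    return evecs[:, -k:]                   # eigenvectors of the k largest eigenvalues
```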

20 3 MSG and MEG: A natural stochastic approximation (SA) approach to PCA is projected stochastic gradient descent (SGD) on Problem 2.1. [sent-74, score-0.213]

21 This leads to the stochastic power method, for which, at each iteration, the following update is performed: $U^{(t+1)} = P_{\mathrm{orth}}\big(U^{(t)} + \eta\, x_t x_t^T U^{(t)}\big)$ (3.1). [sent-76, score-0.096]

22 Here, $x_t x_t^T U^{(t)}$ is a stochastic gradient of the PCA objective with respect to U, evaluated at the current iterate, and $P_{\mathrm{orth}}$ denotes projection onto matrices with orthonormal columns. [sent-77, score-0.171]
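
A minimal sketch of this stochastic power method, assuming $P_{\mathrm{orth}}$ is realized by a QR-based re-orthonormalization; the constant step size and random initialization are illustrative choices, not prescriptions from the paper.

```python
import numpy as np

def stochastic_power_method(sample_stream, d, k, eta=0.1, seed=0):
    """Projected SGD on the basis U (stochastic power method, in the style of update (3.1))."""
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.standard_normal((d, k)))   # random orthonormal start
    for x in sample_stream:
        U = U + eta * np.outer(x, x @ U)               # gradient step: U + eta * x x^T U
        U, _ = np.linalg.qr(U)                         # P_orth: re-orthonormalize the columns
    return U

# Hypothetical usage: 10,000 Gaussian samples in R^20, target dimension k = 4.
# rng = np.random.default_rng(1)
# U_hat = stochastic_power_method((rng.standard_normal(20) for _ in range(10000)), d=20, k=4)
```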

23 Instead of representing a linear subspace in terms of its basis matrix U, we parametrize it using the corresponding projection matrix $M = UU^T$. [sent-87, score-0.218]

24 This prompts us to relax the objective by taking the convex hull of the feasible region: maximize $\mathbb{E}_{x\sim D}[x^T M x]$ subject to $M \in \mathbb{R}^{d\times d}$, $0 \preceq M \preceq I$, $\mathrm{tr}\, M = k$ (Problem 3.3). [sent-91, score-0.111]

25 Since the objective is linear and the feasible region is the convex hull of that of the original rank-constrained problem, the relaxation does not change the optimal value. [sent-92, score-0.13]

26 Since the objective is linear, treating the coefficients of the convex combination as defining a discrete distribution, and sampling according to this distribution, yields a rank-k matrix with the desired expected objective function value. [sent-112, score-0.133]

27 The projection is now performed onto the (convex) constraints of Problem 3.3. [sent-119, score-0.121]

28 The MSG update (3.4) is applied T times, each time using an independent sample $x_t \sim D$. [sent-125, score-0.11]

29 Algorithm 1, the Matrix Stochastic Gradient (MSG) update: compute an eigendecomposition of $M + \eta x x^T$ from a rank-m eigendecomposition $M = U\,\mathrm{diag}(\sigma)\,U^T$, and project the resulting solution onto the constraint set. [sent-143, score-0.312]

30 In the last inequality, we used the fact that $M^*$ has k eigenvalues of value 1 each, and hence $\|M^*\|_F = \sqrt{k}$. [sent-149, score-0.098]

31 In this section, we show how to perform this update efficiently by maintaining an up-to-date eigendecomposition of M (t) . [sent-152, score-0.09]

32 Consider the eigendecomposition $M^{(t)} = U\,\mathrm{diag}(\sigma)\,U^T$ at the t-th iteration, where $\mathrm{rank}(M^{(t)}) = k_t$ and $U \in \mathbb{R}^{d \times k_t}$. [sent-154, score-0.164]

33 Given a new observation $x_t$, the eigendecomposition of $M^{(t)} + \eta x_t x_t^T$ can be updated efficiently using a $(k_t+1)\times(k_t+1)$ SVD (Brand, 2002; Arora et al., 2012). [sent-155, score-0.189]

34 This rank-one eigen-update is followed by projection onto the constraints of Problem 3.3. [sent-157, score-0.121]
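
The following is a simplified sketch of one such MSG iteration on the low-rank representation $M^{(t)} = U\,\mathrm{diag}(\sigma)\,U^T$. For clarity it solves the small eigenproblem with numpy's dense routine rather than the incremental SVD of Brand (2002), and it calls a `project_eigenvalues` helper, the shift-and-clip projection sketched after sentence 38 below; the helper name and the interface are assumptions of this sketch, not the paper's.

```python
import numpy as np

def msg_update(U, sigma, x, eta, k, tol=1e-10):
    """One MSG iteration on the factored iterate M = U diag(sigma) U^T:
    rank-one update by eta * x x^T, then projection of the eigenvalues onto
    {0 <= sigma_i <= 1, sum_i sigma_i = k} (helper `project_eigenvalues` assumed)."""
    w = U.T @ x                                   # component of x inside span(U)
    r = x - U @ w                                 # residual orthogonal to span(U)
    r_norm = np.linalg.norm(r)
    if r_norm > tol:                              # the basis grows by one direction
        U_ext = np.hstack([U, (r / r_norm)[:, None]])
        w_ext = np.append(w, r_norm)
        sig_ext = np.append(sigma, 0.0)
    else:
        U_ext, w_ext, sig_ext = U, w, sigma
    # Small (k_t+1) x (k_t+1) eigenproblem: diag(sigma) + eta * w w^T in the extended basis.
    S = np.diag(sig_ext) + eta * np.outer(w_ext, w_ext)
    evals, V = np.linalg.eigh(S)
    sigma_new = project_eigenvalues(evals, k)     # shift-and-clip projection (sketched below)
    keep = sigma_new > tol                        # drop zero eigenvalues to keep the iterate low-rank
    return (U_ext @ V)[:, keep], sigma_new[keep]
```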

35 Let $M \in \mathbb{R}^{d\times d}$ be a symmetric matrix, with eigenvalues $\sigma_1, \ldots, \sigma_d$. [sent-163, score-0.098]

36 Its projection $M' = P(M)$ onto the feasible region of Problem 3.3, [sent-170, score-0.158]

37 with respect to the Frobenius norm, is the unique feasible matrix which has the same eigenvectors as M, with associated eigenvalues $\sigma_i' = \max(0, \min(1, \sigma_i + S))$ for an appropriate shift S. [sent-171, score-0.186]

38 This result shows that projecting onto the feasible region amounts to finding the value of S such that, after shifting the eigenvalues by S and clipping the results to [0, 1], the result is feasible (i.e., the shifted-and-clipped eigenvalues sum to k). [sent-175, score-0.218]
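
A minimal sketch of this projection, operating on a plain length-d eigenvalue vector and ignoring the multiplicity optimization of the paper's Algorithm 2 (described next, and more efficient); the bisection search for S is an illustrative choice rather than the paper's exact procedure.

```python
import numpy as np

def project_eigenvalues(sigma, k, iters=64):
    """Project eigenvalues onto {0 <= s_i <= 1, sum_i s_i = k}: find a shift S so that
    clip(sigma + S, 0, 1) sums to k, here via bisection on the monotone clipped sum."""
    sigma = np.asarray(sigma, dtype=float)

    def clipped_sum(S):
        return np.clip(sigma + S, 0.0, 1.0).sum()

    lo, hi = -sigma.max(), 1.0 - sigma.min()   # clipped_sum(lo) = 0 and clipped_sum(hi) = d >= k
    for _ in range(iters):                     # bisection; clipped_sum is nondecreasing in S
        mid = 0.5 * (lo + hi)
        if clipped_sum(mid) < k:
            lo = mid
        else:
            hi = mid
    return np.clip(sigma + hi, 0.0, 1.0)
```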

39 It is optimized to efficiently handle repeated eigenvalues—rather than receiving the eigenvalues in a length-d list, it instead receives a length-n list containing only the distinct eigenvalues, with κ containing the corresponding multiplicities. [sent-178, score-0.116]

40 It takes as parameters the dimension d, “target” subspace dimension k, and the number of distinct eigenvalues n of the current iterate. [sent-190, score-0.231]

41 The length-n arrays σ and κ contain the distinct eigenvalues and their multiplicities, respectively, of M (with $\sum_{i=1}^{n} \kappa_i = d$). [sent-191, score-0.098]

42 This is a significant improvement over the $O(d^2)$ memory and $O(d^2)$ computation required by a standard implementation of MSG, if the iterates have relatively low rank. [sent-195, score-0.104]

43 3.3 Matrix Exponentiated Gradient: Since M is constrained by its trace, and not by its Frobenius norm, it is tempting to consider mirror descent (MD) (Beck & Teboulle, 2003) instead of SGD updates for solving Problem 3.3. [sent-197, score-0.128]

44 4 MSG runtime and the rank of the iterates: As we saw in Sections 3.1 and 3.2, [sent-215, score-0.252]

45 MSG requires $O(k/\epsilon^2)$ iterations to obtain an $\epsilon$-suboptimal solution, and each iteration costs $O(k_t^2 d)$ operations, where $k_t$ is the rank of the iterate $M^{(t)}$. [sent-217, score-0.269]

46 Summed over the $T = O(k/\epsilon^2)$ iterations, this yields a total runtime on the order of $d \sum_{t=1}^{T} k_t^2$; clearly, the runtime for MSG depends critically on the rank of the iterates. [sent-219, score-0.256]

47 If $k_t$ is as large as d, then MSG achieves a runtime that is cubic in the dimensionality. [sent-220, score-0.191]

48 On the other hand, if the rank of the iterates is O(k), the runtime is linear in the dimensionality. [sent-221, score-0.252]

49 Fortunately, in practice, each $k_t$ is typically much lower than d. [sent-222, score-0.104]

50 The reason for this is that the MSG update performs a rank-1 update followed by a projection onto the constraints. [sent-223, score-0.181]

51 Furthermore, the SGD analysis depends on the Frobenius norm of the stochastic gradients, but since all stochastic gradients are rank one, this is the same as their spectral norm, which comes up in the entropy-case analysis, and again there is no benefit. [sent-227, score-0.211]

52 Therefore, each MSG update will increase the rank of the iterate by at most 1, and has the potential to decrease it, perhaps significantly. [sent-230, score-0.153]

53 It’s very difficult to theoretically quantify how the rank of the iterates will evolve over time, but we have observed empirically that the iterates do tend to have relatively low rank. [sent-231, score-0.291]

54 Observe that Σ(k) has a smoothly-decaying set of eigenvalues and the rate of decay is controlled by τ . [sent-237, score-0.098]

55 The experiments use k ∈ {1, 2, 4}, where k is the desired subspace dimension used by each algorithm. [sent-240, score-0.114]

56 The right plot of Figure 1 shows how the eigenvalues of the MSG iterates evolve over time. [sent-248, score-0.202]

57 This suggests that constraining $k_t$ can lead to significant speedups and motivates capped MSG. (Figure 1: The ranks $k_t$ (left) and the eigenvalues (right) of the MSG iterates $M^{(t)}$.) [sent-250, score-0.75]

58 5 Capped MSG: While, as was observed in the previous section, MSG's iterates will tend to have ranks $k_t$ smaller than d, they will nevertheless also be larger than k. [sent-252, score-0.208]

59 For this reason, we recommend imposing a hard constraint K on the rank of the iterates: maximize $\mathbb{E}_{x\sim D}[x^T M x]$ subject to $M \in \mathbb{R}^{d\times d}$, $0 \preceq M \preceq I$, $\mathrm{tr}\, M = k$, $\mathrm{rank}\, M \le K$ (Problem 5.1). [sent-253, score-0.078]

60 We will refer to MSG where the projection is replaced with a projection onto the constraints of Problem 5.1 as the capped MSG algorithm. [sent-254, score-0.27]

61 Here the iterates are SGD iterates on Problem 5.1. [sent-257, score-0.208]

62 If the hard rank constraint K is strictly larger than the target rank k, then we can easily check whether we are at a global optimum of the capped problem. [sent-267, score-0.085]

63 If the capped MSG algorithm converges to a solution of rank K, then the upper bound K should be increased. [sent-269, score-0.418]

64 There is thus an advantage in using K > k, and we recommend setting K = k + 1, as we do in our experiments, and increasing K only if a rank deficient solution is not found in a timely manner. [sent-271, score-0.095]

65 If we take K = k, then the only way to satisfy the trace constraint is to have all non-zero eigenvalues equal to one, so the feasible set coincides with that of the original non-convex PCA problem. [sent-272, score-0.12]

66 Taking K > k allows us to increase the search rank, allowing for more flexibility in the iterates, while still forcing each iterate to be low-rank, and each update to therefore be efficient, through the rank constraint. [sent-276, score-0.196]

67 5.1 Implementing the projection: The only difference between the implementation of MSG and capped MSG is in the projection step. [sent-278, score-0.478]

68 Hence, if we let σ be the vector of the eigenvalues of M, and suppose that more than K of them are nonzero, then there will be a size-K subset of σ such that applying Algorithm 2 to this set gives the projected eigenvalues. [sent-298, score-0.098]

69 Since we perform only a rank-1 update at every iteration, we must check at most K possibilities, at a total cost of $O(K^2 \log K)$ operations, which has no effect on the asymptotic runtime because Algorithm 1 requires $O(K^2 d)$ operations. [sent-299, score-0.117]
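
A simplified sketch of the capped projection along these lines, reusing the `project_eigenvalues` helper from the earlier sketch: it keeps a single size-K support (the K largest eigenvalues) rather than enumerating the K candidate subsets of the paper's procedure, so it should be read as an approximation of the idea, not as the paper's exact algorithm.

```python
import numpy as np

def capped_project_eigenvalues(sigma, k, K):
    """Capped projection sketch (assumes k <= K): zero all but a size-K support
    (the K largest eigenvalues), then apply the shift-and-clip projection to that subset."""
    sigma = np.asarray(sigma, dtype=float)
    out = np.zeros_like(sigma)
    support = np.argsort(sigma)[::-1][:K]          # indices of the K largest eigenvalues
    out[support] = project_eigenvalues(sigma[support], k)
    return out
```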

70 5.2 Relationship to the incremental PCA method: The capped MSG algorithm with K = k is similar to the incremental algorithm of Arora et al. [sent-301, score-0.639]

71 (2012), which maintains a rank-k approximation of the covariance matrix and updates it according to $M^{(t+1)} = P_{\mathrm{rank}\text{-}k}\big(M^{(t)} + x_t x_t^T\big)$, where the projection is onto the set of rank-k matrices. [sent-302, score-0.309]
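
For comparison, a minimal dense sketch of this incremental update, realizing $P_{\mathrm{rank}\text{-}k}$ by truncating to the top-k eigenpairs; an efficient implementation would instead maintain a low-rank factorization as in Arora et al. (2012).

```python
import numpy as np

def incremental_pca_update(M, x, k):
    """Incremental update M <- P_rank-k(M + x x^T): keep the top-k eigenpairs.
    Note there is no step size, unlike MSG."""
    M = M + np.outer(x, x)
    evals, evecs = np.linalg.eigh(M)               # eigenvalues in ascending order
    U_k, lam_k = evecs[:, -k:], evals[-k:]
    return (U_k * lam_k) @ U_k.T                   # rank-k truncation U_k diag(lam_k) U_k^T
```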

72 Unlike MSG, the incremental algorithm does not have a step-size. [sent-303, score-0.14]

73 In a recent survey of stochastic algorithms for PCA (Arora et al., 2012), [sent-306, score-0.085]

74 the incremental algorithm was found to perform extremely well in practice; it was, in fact, the best among the compared algorithms. [sent-307, score-0.14]

75 The capped MSG algorithm with K > k will not get stuck in such situations, since it will use the additional dimensions to search in the new direction. [sent-311, score-0.385]

76 This distribution presents a challenging test-case for MSG, capped MSG and the incremental algorithm. [sent-314, score-0.48]

77 Figure 2 shows plots of individual runs of MSG, capped MSG with K = k + 1, the incremental algorithm, and Warmuth and Kuzmin’s algorithm, all based on the same sequence of samples drawn from the orthogonal distribution. [sent-315, score-0.5]

78 We compare algorithms in terms of the suboptimality on the population objective based on the largest k eigenvalues of the state matrix M (t) . [sent-316, score-0.26]

79 The plots show the incremental algorithm getting stuck for k ∈ {1, 4}, and the others intermittently plateauing at intermediate solutions before beginning to again converge rapidly towards the optimum. [sent-317, score-0.185]

80 This behavior is to be expected for the capped MSG algorithm, since the dimension of the subspace stored at each iterate is constrained. [sent-318, score-0.498]

81 However, it is somewhat surprising that MSG and Warmuth and Kuzmin's algorithm behaved similarly, and were barely faster than capped MSG. [sent-319, score-0.34]

84 Figure 3: Comparison on the MNIST dataset; the x-axes show estimated runtime on a logarithmic scale. [sent-330, score-0.087]

85 The top row of plots shows suboptimality as a function of iteration count, while the bottom row shows suboptimality as a function of the estimated runtime $\sum_{s=1}^{t}(k_s)^2$. [sent-331, score-0.272]

86 In addition to MSG, capped MSG, the incremental algorithm and Warmuth and Kuzmin's algorithm, we also compare to a Grassmannian SGD algorithm (Balzano et al., 2010). [sent-333, score-0.499]

87 All algorithms except the incremental algorithm have a step-size parameter. [sent-335, score-0.14]

88 The step-size constant c was chosen from a grid of candidate values (up to $2^5$), picking the best c, in terms of the average suboptimality over the run, on a validation set. [sent-339, score-0.078]

89 We are interested in learning a maximum variance subspace of dimension k ∈ {1, 4, 8} in a single “pass” over the training sample. [sent-342, score-0.114]

90 In order to compare MSG, capped MSG, the incremental algorithm and Warmuth and Kuzmin's algorithm in terms of runtime, we calculate the dominant term in the computational complexity, $\sum_{s=1}^{t}(k_s)^2$. [sent-343, score-0.48]

91 We can see from Figure 3 that the incremental algorithm makes the most progress per iteration and is also the fastest of all algorithms. [sent-345, score-0.169]

92 MSG is comparable to the incremental algorithm in terms of the progress made per iteration. [sent-346, score-0.14]

93 However, its runtime is slightly worse because it will often keep a slightly larger representation (of dimension $k_t$). [sent-347, score-0.21]

94 The capped MSG variant (with K = k + 1) is significantly faster, almost as fast as the incremental algorithm, while, as we saw in the previous section, being less prone to getting stuck. [sent-348, score-0.48]

95 Inspection of the underlying data shows that, in the k ∈ {4, 8} experiments, it also tends to have a larger $k_t$ than MSG, and therefore a higher cost per iteration. [sent-350, score-0.104]

96 Grassmannian SGD performs better than Warmuth and Kuzmin’s algorithm, but much worse than MSG and capped MSG. [sent-351, score-0.34]

97 7 Conclusions: In this paper, we presented a careful development and analysis of MSG, a stochastic approximation algorithm for PCA, which enjoys good theoretical guarantees and offers a computationally efficient variant, capped MSG. [sent-352, score-0.445]

98 We show that capped MSG is well-motivated theoretically and that it does not get stuck at a suboptimal solution. [sent-353, score-0.439]

99 Capped MSG is also shown to have excellent empirical performance, and it is therefore a much better alternative to the recently proposed incremental PCA algorithm of Arora et al. (2012). [sent-354, score-0.159]

100 On stochastic approximation of the eigenvectors and eigenvalues of the expectation of a random matrix. [sent-396, score-0.208]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('msg', 0.741), ('capped', 0.34), ('kuzmin', 0.226), ('warmuth', 0.18), ('pca', 0.144), ('incremental', 0.14), ('xt', 0.11), ('kt', 0.104), ('iterates', 0.104), ('eigenvalues', 0.098), ('subspace', 0.095), ('runtime', 0.087), ('sgd', 0.086), ('suboptimality', 0.078), ('arora', 0.073), ('ex', 0.073), ('projection', 0.069), ('mirror', 0.068), ('stochastic', 0.066), ('erm', 0.063), ('rank', 0.061), ('eigendecomposition', 0.06), ('onto', 0.052), ('srebro', 0.05), ('meg', 0.05), ('xxt', 0.047), ('sanger', 0.046), ('stuck', 0.045), ('iterate', 0.044), ('exponentiated', 0.041), ('karhunen', 0.041), ('cj', 0.04), ('nemirovski', 0.038), ('feasible', 0.037), ('shai', 0.036), ('oja', 0.035), ('cotter', 0.034), ('grassmannian', 0.034), ('gradient', 0.032), ('tewari', 0.032), ('suboptimal', 0.032), ('updates', 0.031), ('iterations', 0.031), ('arkadi', 0.031), ('clipped', 0.031), ('clipping', 0.031), ('dkt', 0.031), ('eig', 0.031), ('porth', 0.031), ('yudin', 0.031), ('update', 0.03), ('rounding', 0.03), ('descent', 0.029), ('objective', 0.029), ('iteration', 0.029), ('population', 0.028), ('teboulle', 0.027), ('uncentered', 0.027), ('multiplicities', 0.027), ('convex', 0.027), ('matrix', 0.027), ('sj', 0.025), ('brand', 0.025), ('sa', 0.025), ('frobenius', 0.025), ('project', 0.025), ('optimum', 0.024), ('eigenvectors', 0.024), ('rd', 0.024), ('diag', 0.024), ('ci', 0.023), ('operations', 0.023), ('svd', 0.023), ('raman', 0.022), ('duchi', 0.022), ('theoretically', 0.022), ('trace', 0.022), ('allerton', 0.021), ('yields', 0.021), ('balzano', 0.021), ('orthogonal', 0.02), ('approximation', 0.02), ('captures', 0.02), ('collins', 0.02), ('online', 0.019), ('et', 0.019), ('dimension', 0.019), ('yoram', 0.019), ('tr', 0.019), ('careful', 0.019), ('si', 0.018), ('ks', 0.018), ('il', 0.018), ('norm', 0.018), ('list', 0.018), ('potential', 0.018), ('hull', 0.018), ('pseudocode', 0.017), ('recommend', 0.017), ('solution', 0.017)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 314 nips-2013-Stochastic Optimization of PCA with Capped MSG

Author: Raman Arora, Andy Cotter, Nati Srebro

Abstract: We study PCA as a stochastic optimization problem and propose a novel stochastic approximation algorithm which we refer to as “Matrix Stochastic Gradient” (MSG), as well as a practical variant, Capped MSG. We study the method both theoretically and empirically. 1

2 0.16969554 188 nips-2013-Memory Limited, Streaming PCA

Author: Ioannis Mitliagkas, Constantine Caramanis, Prateek Jain

Abstract: We consider streaming, one-pass principal component analysis (PCA), in the highdimensional regime, with limited memory. Here, p-dimensional samples are presented sequentially, and the goal is to produce the k-dimensional subspace that best approximates these points. Standard algorithms require O(p2 ) memory; meanwhile no algorithm can do better than O(kp) memory, since this is what the output itself requires. Memory (or storage) complexity is most meaningful when understood in the context of computational and sample complexity. Sample complexity for high-dimensional PCA is typically studied in the setting of the spiked covariance model, where p-dimensional points are generated from a population covariance equal to the identity (white noise) plus a low-dimensional perturbation (the spike) which is the signal to be recovered. It is now well-understood that the spike can be recovered when the number of samples, n, scales proportionally with the dimension, p. Yet, all algorithms that provably achieve this, have memory complexity O(p2 ). Meanwhile, algorithms with memory-complexity O(kp) do not have provable bounds on sample complexity comparable to p. We present an algorithm that achieves both: it uses O(kp) memory (meaning storage of any kind) and is able to compute the k-dimensional spike with O(p log p) samplecomplexity – the first algorithm of its kind. While our theoretical analysis focuses on the spiked covariance model, our simulations show that our algorithm is successful on much more general models for the data. 1

3 0.14574605 285 nips-2013-Robust Transfer Principal Component Analysis with Rank Constraints

Author: Yuhong Guo

Abstract: Principal component analysis (PCA), a well-established technique for data analysis and processing, provides a convenient form of dimensionality reduction that is effective for cleaning small Gaussian noises presented in the data. However, the applicability of standard principal component analysis in real scenarios is limited by its sensitivity to large errors. In this paper, we tackle the challenge problem of recovering data corrupted with errors of high magnitude by developing a novel robust transfer principal component analysis method. Our method is based on the assumption that useful information for the recovery of a corrupted data matrix can be gained from an uncorrupted related data matrix. Specifically, we formulate the data recovery problem as a joint robust principal component analysis problem on the two data matrices, with common principal components shared across matrices and individual principal components specific to each data matrix. The formulated optimization problem is a minimization problem over a convex objective function but with non-convex rank constraints. We develop an efficient proximal projected gradient descent algorithm to solve the proposed optimization problem with convergence guarantees. Our empirical results over image denoising tasks show the proposed method can effectively recover images with random large errors, and significantly outperform both standard PCA and robust PCA with rank constraints. 1

4 0.11455495 116 nips-2013-Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA

Author: Vincent Q. Vu, Juhee Cho, Jing Lei, Karl Rohe

Abstract: We propose a novel convex relaxation of sparse principal subspace estimation based on the convex hull of rank-d projection matrices (the Fantope). The convex problem can be solved efficiently using alternating direction method of multipliers (ADMM). We establish a near-optimal convergence rate, in terms of the sparsity, ambient dimension, and sample size, for estimation of the principal subspace of a general covariance matrix without assuming the spiked covariance model. In the special case of d = 1, our result implies the near-optimality of DSPCA (d’Aspremont et al. [1]) even when the solution is not rank 1. We also provide a general theoretical framework for analyzing the statistical properties of the method for arbitrary input matrices that extends the applicability and provable guarantees to a wide array of settings. We demonstrate this with an application to Kendall’s tau correlation matrices and transelliptical component analysis. 1

5 0.10648987 91 nips-2013-Dirty Statistical Models

Author: Eunho Yang, Pradeep Ravikumar

Abstract: We provide a unified framework for the high-dimensional analysis of “superposition-structured” or “dirty” statistical models: where the model parameters are a superposition of structurally constrained parameters. We allow for any number and types of structures, and any statistical model. We consider the general class of M -estimators that minimize the sum of any loss function, and an instance of what we call a “hybrid” regularization, that is the infimal convolution of weighted regularization functions, one for each structural component. We provide corollaries showcasing our unified framework for varied statistical models such as linear regression, multiple regression and principal component analysis, over varied superposition structures. 1

6 0.10046844 324 nips-2013-The Fast Convergence of Incremental PCA

7 0.10028332 233 nips-2013-Online Robust PCA via Stochastic Optimization

8 0.085623316 224 nips-2013-On the Sample Complexity of Subspace Learning

9 0.082165517 232 nips-2013-Online PCA for Contaminated Data

10 0.063654035 137 nips-2013-High-Dimensional Gaussian Process Bandits

11 0.062978759 286 nips-2013-Robust learning of low-dimensional dynamics from large neural ensembles

12 0.062226206 89 nips-2013-Dimension-Free Exponentiated Gradient

13 0.058351986 269 nips-2013-Regression-tree Tuning in a Streaming Setting

14 0.058102172 62 nips-2013-Causal Inference on Time Series using Restricted Structural Equation Models

15 0.057374097 20 nips-2013-Accelerating Stochastic Gradient Descent using Predictive Variance Reduction

16 0.057064779 100 nips-2013-Dynamic Clustering via Asymptotics of the Dependent Dirichlet Process Mixture

17 0.053073198 179 nips-2013-Low-Rank Matrix and Tensor Completion via Adaptive Sampling

18 0.052707173 348 nips-2013-Variational Policy Search via Trajectory Optimization

19 0.051321208 222 nips-2013-On the Linear Convergence of the Proximal Gradient Method for Trace Norm Regularization

20 0.05018289 257 nips-2013-Projected Natural Actor-Critic


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.146), (1, 0.023), (2, 0.105), (3, 0.039), (4, -0.046), (5, 0.015), (6, -0.079), (7, 0.122), (8, -0.056), (9, -0.098), (10, -0.079), (11, 0.055), (12, 0.073), (13, 0.113), (14, 0.01), (15, 0.022), (16, -0.005), (17, 0.069), (18, 0.06), (19, 0.031), (20, 0.087), (21, -0.037), (22, 0.022), (23, -0.038), (24, 0.017), (25, 0.03), (26, 0.025), (27, 0.058), (28, -0.04), (29, 0.024), (30, 0.009), (31, -0.04), (32, -0.026), (33, 0.003), (34, 0.037), (35, -0.03), (36, -0.033), (37, 0.023), (38, -0.015), (39, -0.05), (40, -0.015), (41, -0.02), (42, 0.023), (43, -0.006), (44, 0.012), (45, 0.039), (46, -0.075), (47, 0.046), (48, 0.004), (49, -0.037)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92283762 314 nips-2013-Stochastic Optimization of PCA with Capped MSG

Author: Raman Arora, Andy Cotter, Nati Srebro

Abstract: We study PCA as a stochastic optimization problem and propose a novel stochastic approximation algorithm which we refer to as “Matrix Stochastic Gradient” (MSG), as well as a practical variant, Capped MSG. We study the method both theoretically and empirically. 1

2 0.79451907 285 nips-2013-Robust Transfer Principal Component Analysis with Rank Constraints

Author: Yuhong Guo

Abstract: Principal component analysis (PCA), a well-established technique for data analysis and processing, provides a convenient form of dimensionality reduction that is effective for cleaning small Gaussian noises presented in the data. However, the applicability of standard principal component analysis in real scenarios is limited by its sensitivity to large errors. In this paper, we tackle the challenge problem of recovering data corrupted with errors of high magnitude by developing a novel robust transfer principal component analysis method. Our method is based on the assumption that useful information for the recovery of a corrupted data matrix can be gained from an uncorrupted related data matrix. Specifically, we formulate the data recovery problem as a joint robust principal component analysis problem on the two data matrices, with common principal components shared across matrices and individual principal components specific to each data matrix. The formulated optimization problem is a minimization problem over a convex objective function but with non-convex rank constraints. We develop an efficient proximal projected gradient descent algorithm to solve the proposed optimization problem with convergence guarantees. Our empirical results over image denoising tasks show the proposed method can effectively recover images with random large errors, and significantly outperform both standard PCA and robust PCA with rank constraints. 1

3 0.78807712 188 nips-2013-Memory Limited, Streaming PCA

Author: Ioannis Mitliagkas, Constantine Caramanis, Prateek Jain

Abstract: We consider streaming, one-pass principal component analysis (PCA), in the highdimensional regime, with limited memory. Here, p-dimensional samples are presented sequentially, and the goal is to produce the k-dimensional subspace that best approximates these points. Standard algorithms require O(p2 ) memory; meanwhile no algorithm can do better than O(kp) memory, since this is what the output itself requires. Memory (or storage) complexity is most meaningful when understood in the context of computational and sample complexity. Sample complexity for high-dimensional PCA is typically studied in the setting of the spiked covariance model, where p-dimensional points are generated from a population covariance equal to the identity (white noise) plus a low-dimensional perturbation (the spike) which is the signal to be recovered. It is now well-understood that the spike can be recovered when the number of samples, n, scales proportionally with the dimension, p. Yet, all algorithms that provably achieve this, have memory complexity O(p2 ). Meanwhile, algorithms with memory-complexity O(kp) do not have provable bounds on sample complexity comparable to p. We present an algorithm that achieves both: it uses O(kp) memory (meaning storage of any kind) and is able to compute the k-dimensional spike with O(p log p) samplecomplexity – the first algorithm of its kind. While our theoretical analysis focuses on the spiked covariance model, our simulations show that our algorithm is successful on much more general models for the data. 1

4 0.74315554 233 nips-2013-Online Robust PCA via Stochastic Optimization

Author: Jiashi Feng, Huan Xu, Shuicheng Yan

Abstract: Robust PCA methods are typically based on batch optimization and have to load all the samples into memory during optimization. This prevents them from efficiently processing big data. In this paper, we develop an Online Robust PCA (OR-PCA) that processes one sample per time instance and hence its memory cost is independent of the number of samples, significantly enhancing the computation and storage efficiency. The proposed OR-PCA is based on stochastic optimization of an equivalent reformulation of the batch RPCA. Indeed, we show that OR-PCA provides a sequence of subspace estimations converging to the optimum of its batch counterpart and hence is provably robust to sparse corruption. Moreover, OR-PCA can naturally be applied for tracking dynamic subspace. Comprehensive simulations on subspace recovering and tracking demonstrate the robustness and efficiency advantages of the OR-PCA over online PCA and batch RPCA methods. 1

5 0.70163566 116 nips-2013-Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA

Author: Vincent Q. Vu, Juhee Cho, Jing Lei, Karl Rohe

Abstract: We propose a novel convex relaxation of sparse principal subspace estimation based on the convex hull of rank-d projection matrices (the Fantope). The convex problem can be solved efficiently using alternating direction method of multipliers (ADMM). We establish a near-optimal convergence rate, in terms of the sparsity, ambient dimension, and sample size, for estimation of the principal subspace of a general covariance matrix without assuming the spiked covariance model. In the special case of d = 1, our result implies the near-optimality of DSPCA (d’Aspremont et al. [1]) even when the solution is not rank 1. We also provide a general theoretical framework for analyzing the statistical properties of the method for arbitrary input matrices that extends the applicability and provable guarantees to a wide array of settings. We demonstrate this with an application to Kendall’s tau correlation matrices and transelliptical component analysis. 1

6 0.68678463 232 nips-2013-Online PCA for Contaminated Data

7 0.61416006 120 nips-2013-Faster Ridge Regression via the Subsampled Randomized Hadamard Transform

8 0.60013801 224 nips-2013-On the Sample Complexity of Subspace Learning

9 0.5571472 324 nips-2013-The Fast Convergence of Incremental PCA

10 0.52731949 117 nips-2013-Fast Algorithms for Gaussian Noise Invariant Independent Component Analysis

11 0.52023411 91 nips-2013-Dirty Statistical Models

12 0.51259965 259 nips-2013-Provable Subspace Clustering: When LRR meets SSC

13 0.5067966 286 nips-2013-Robust learning of low-dimensional dynamics from large neural ensembles

14 0.49615481 45 nips-2013-BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables

15 0.48714933 146 nips-2013-Large Scale Distributed Sparse Precision Estimation

16 0.46131533 17 nips-2013-A multi-agent control framework for co-adaptation in brain-computer interfaces

17 0.44895688 247 nips-2013-Phase Retrieval using Alternating Minimization

18 0.44215718 256 nips-2013-Probabilistic Principal Geodesic Analysis

19 0.4343127 225 nips-2013-One-shot learning and big data with n=2

20 0.43361816 342 nips-2013-Unsupervised Spectral Learning of Finite State Transducers


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.022), (16, 0.033), (33, 0.137), (34, 0.091), (36, 0.064), (41, 0.052), (49, 0.027), (56, 0.103), (68, 0.198), (70, 0.032), (85, 0.048), (89, 0.029), (93, 0.027), (95, 0.037)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.81250632 314 nips-2013-Stochastic Optimization of PCA with Capped MSG

Author: Raman Arora, Andy Cotter, Nati Srebro

Abstract: We study PCA as a stochastic optimization problem and propose a novel stochastic approximation algorithm which we refer to as “Matrix Stochastic Gradient” (MSG), as well as a practical variant, Capped MSG. We study the method both theoretically and empirically. 1

2 0.77789891 92 nips-2013-Discovering Hidden Variables in Noisy-Or Networks using Quartet Tests

Author: Yacine Jernite, Yonatan Halpern, David Sontag

Abstract: We give a polynomial-time algorithm for provably learning the structure and parameters of bipartite noisy-or Bayesian networks of binary variables where the top layer is completely hidden. Unsupervised learning of these models is a form of discrete factor analysis, enabling the discovery of hidden variables and their causal relationships with observed data. We obtain an efficient learning algorithm for a family of Bayesian networks that we call quartet-learnable. For each latent variable, the existence of a singly-coupled quartet allows us to uniquely identify and learn all parameters involving that latent variable. We give a proof of the polynomial sample complexity of our learning algorithm, and experimentally compare it to variational EM. 1

3 0.71708113 188 nips-2013-Memory Limited, Streaming PCA

Author: Ioannis Mitliagkas, Constantine Caramanis, Prateek Jain

Abstract: We consider streaming, one-pass principal component analysis (PCA), in the highdimensional regime, with limited memory. Here, p-dimensional samples are presented sequentially, and the goal is to produce the k-dimensional subspace that best approximates these points. Standard algorithms require O(p2 ) memory; meanwhile no algorithm can do better than O(kp) memory, since this is what the output itself requires. Memory (or storage) complexity is most meaningful when understood in the context of computational and sample complexity. Sample complexity for high-dimensional PCA is typically studied in the setting of the spiked covariance model, where p-dimensional points are generated from a population covariance equal to the identity (white noise) plus a low-dimensional perturbation (the spike) which is the signal to be recovered. It is now well-understood that the spike can be recovered when the number of samples, n, scales proportionally with the dimension, p. Yet, all algorithms that provably achieve this, have memory complexity O(p2 ). Meanwhile, algorithms with memory-complexity O(kp) do not have provable bounds on sample complexity comparable to p. We present an algorithm that achieves both: it uses O(kp) memory (meaning storage of any kind) and is able to compute the k-dimensional spike with O(p log p) samplecomplexity – the first algorithm of its kind. While our theoretical analysis focuses on the spiked covariance model, our simulations show that our algorithm is successful on much more general models for the data. 1

4 0.71503866 308 nips-2013-Spike train entropy-rate estimation using hierarchical Dirichlet process priors

Author: Karin C. Knudson, Jonathan W. Pillow

Abstract: Entropy rate quantifies the amount of disorder in a stochastic process. For spiking neurons, the entropy rate places an upper bound on the rate at which the spike train can convey stimulus information, and a large literature has focused on the problem of estimating entropy rate from spike train data. Here we present Bayes least squares and empirical Bayesian entropy rate estimators for binary spike trains using hierarchical Dirichlet process (HDP) priors. Our estimator leverages the fact that the entropy rate of an ergodic Markov Chain with known transition probabilities can be calculated analytically, and many stochastic processes that are non-Markovian can still be well approximated by Markov processes of sufficient depth. Choosing an appropriate depth of Markov model presents challenges due to possibly long time dependencies and short data sequences: a deeper model can better account for long time dependencies, but is more difficult to infer from limited data. Our approach mitigates this difficulty by using a hierarchical prior to share statistical power across Markov chains of different depths. We present both a fully Bayesian and empirical Bayes entropy rate estimator based on this model, and demonstrate their performance on simulated and real neural spike train data. 1

5 0.71296191 317 nips-2013-Streaming Variational Bayes

Author: Tamara Broderick, Nicholas Boyd, Andre Wibisono, Ashia C. Wilson, Michael Jordan

Abstract: We present SDA-Bayes, a framework for (S)treaming, (D)istributed, (A)synchronous computation of a Bayesian posterior. The framework makes streaming updates to the estimated posterior according to a user-specified approximation batch primitive. We demonstrate the usefulness of our framework, with variational Bayes (VB) as the primitive, by fitting the latent Dirichlet allocation model to two large-scale document collections. We demonstrate the advantages of our algorithm over stochastic variational inference (SVI) by comparing the two after a single pass through a known amount of data—a case where SVI may be applied—and in the streaming setting, where SVI does not apply. 1

6 0.71015072 44 nips-2013-B-test: A Non-parametric, Low Variance Kernel Two-sample Test

7 0.70424128 11 nips-2013-A New Convex Relaxation for Tensor Completion

8 0.70410681 233 nips-2013-Online Robust PCA via Stochastic Optimization

9 0.7018308 282 nips-2013-Robust Multimodal Graph Matching: Sparse Coding Meets Graph Matching

10 0.70151585 333 nips-2013-Trading Computation for Communication: Distributed Stochastic Dual Coordinate Ascent

11 0.69964266 45 nips-2013-BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables

12 0.69954216 207 nips-2013-Near-optimal Anomaly Detection in Graphs using Lovasz Extended Scan Statistic

13 0.69857264 318 nips-2013-Structured Learning via Logistic Regression

14 0.6979003 239 nips-2013-Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result

15 0.69769758 350 nips-2013-Wavelets on Graphs via Deep Learning

16 0.69739592 52 nips-2013-Bayesian inference as iterated random functions with applications to sequential inference in graphical models

17 0.69728827 249 nips-2013-Polar Operators for Structured Sparse Estimation

18 0.69695795 311 nips-2013-Stochastic Convex Optimization with Multiple Objectives

19 0.6966992 180 nips-2013-Low-rank matrix reconstruction and clustering via approximate message passing

20 0.6966933 173 nips-2013-Least Informative Dimensions