nips nips2013 nips2013-188 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ioannis Mitliagkas, Constantine Caramanis, Prateek Jain
Abstract: We consider streaming, one-pass principal component analysis (PCA), in the high-dimensional regime, with limited memory. Here, p-dimensional samples are presented sequentially, and the goal is to produce the k-dimensional subspace that best approximates these points. Standard algorithms require O(p^2) memory; meanwhile no algorithm can do better than O(kp) memory, since this is what the output itself requires. Memory (or storage) complexity is most meaningful when understood in the context of computational and sample complexity. Sample complexity for high-dimensional PCA is typically studied in the setting of the spiked covariance model, where p-dimensional points are generated from a population covariance equal to the identity (white noise) plus a low-dimensional perturbation (the spike) which is the signal to be recovered. It is now well-understood that the spike can be recovered when the number of samples, n, scales proportionally with the dimension, p. Yet all algorithms that provably achieve this have memory complexity O(p^2). Meanwhile, algorithms with memory complexity O(kp) do not have provable bounds on sample complexity comparable to p. We present an algorithm that achieves both: it uses O(kp) memory (meaning storage of any kind) and is able to compute the k-dimensional spike with O(p log p) sample complexity – the first algorithm of its kind. While our theoretical analysis focuses on the spiked covariance model, our simulations show that our algorithm is successful on much more general models for the data.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We consider streaming, one-pass principal component analysis (PCA), in the high-dimensional regime, with limited memory. [sent-6, score-0.239]
2 Here, p-dimensional samples are presented sequentially, and the goal is to produce the k-dimensional subspace that best approximates these points. [sent-7, score-0.157]
3 Memory (or storage) complexity is most meaningful when understood in the context of computational and sample complexity. [sent-9, score-0.151]
4 It is now well-understood that the spike can be recovered when the number of samples, n, scales proportionally with the dimension, p. [sent-11, score-0.08]
5 Yet all algorithms that provably achieve this have memory complexity O(p^2). [sent-12, score-0.299]
6 Meanwhile, algorithms with memory complexity O(kp) do not have provable bounds on sample complexity comparable to p. [sent-13, score-0.151]
7 We present an algorithm that achieves both: it uses O(kp) memory (meaning storage of any kind) and is able to compute the k-dimensional spike with O(p log p) sample complexity – the first algorithm of its kind. [sent-14, score-0.467]
8 While our theoretical analysis focuses on the spiked covariance model, our simulations show that our algorithm is successful on much more general models for the data. [sent-15, score-0.382]
9 1 Introduction Principal component analysis is a fundamental tool for dimensionality reduction, clustering, classification, and many more learning tasks. [sent-16, score-0.08]
10 The core computational element of PCA is performing a (partial) singular value decomposition, and much work over the last half century has focused on efficient algorithms (e. [sent-18, score-0.083]
11 The recent focus on understanding high-dimensional data, where the dimensionality of the data scales together with the number of available sample points, has led to an exploration of the sample complexity of covariance estimation. [sent-21, score-0.387]
12 This direction was largely influenced by Johnstone’s spiked covariance model, where data samples are drawn from a distribution whose (population) covariance is a low-rank perturbation of the identity matrix Johnstone (2001). [sent-22, score-0.704]
13 The only currently available algorithms with provable sample complexity guarantees either store all n = O(p) samples (note that for more than a single pass over the data, the samples must all be stored) or explicitly form or approximate the empirical p × p (typically dense) covariance matrix. [sent-25, score-0.575]
14 All cases require as much as O(p^2) storage for exact recovery. [sent-26, score-0.153]
15 In many applications, p often is of the order of 10^10 – 10^12, making the need for O(p^2) memory prohibitive. [sent-28, score-0.17]
16 At many computing scales, manipulating vectors of length O(p) is possible, when storage of O(p2 ) is not. [sent-29, score-0.153]
17 A typical desktop may have 10-20 GB of RAM, but will not have more than a few TB of total storage. [sent-30, score-0.067]
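As a rough back-of-the-envelope illustration of this storage gap (the dimensions below are hypothetical, chosen only to make the arithmetic concrete, and are not taken from the paper):

```python
# Storage for a dense p x p covariance versus O(kp) streaming state (illustrative sizes).
p = 10**6          # hypothetical ambient dimension
k = 10             # hypothetical number of principal components sought
bytes_per_float = 8

cov_bytes = p * p * bytes_per_float      # dense p x p empirical covariance
state_bytes = k * p * bytes_per_float    # a k x p iterate kept by a streaming method

print(f"dense covariance: {cov_bytes / 1e12:.1f} TB")   # 8.0 TB
print(f"O(kp) state:      {state_bytes / 1e9:.2f} GB")  # 0.08 GB
```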
18 In distributed storage systems, the scalability in storage comes at the heavy cost of communication. [sent-32, score-0.306]
19 In this light, we consider the streaming data setting, where the samples x_t ∈ R^p are collected sequentially, and unless we store them, they are irretrievably gone. [sent-33, score-0.521]
20 On the spiked covariance model (and natural generalizations), we show that a simple algorithm requiring O(kp) storage – the best possible – performs as well as batch algorithms (namely, SVD on the empirical covariance matrix), with sample complexity O(p log p). [sent-34, score-1.092]
21 To the best of our knowledge, this is the only algorithm with both storage complexity and sample complexity guarantees. [sent-35, score-0.379]
22 2 Related Work Memory- and computation-efficient algorithms that operate on streaming data are plentiful in the literature and many seem to do well in practice. [sent-41, score-0.248]
23 However, there is no algorithm that provably recovers the principal components in the same noise and sample-complexity regime as the batch PCA algorithm does and maintains a provably light memory footprint. [sent-42, score-0.811]
24 (2012) typically requires much less memory, but there are no guarantees for this; moreover, for certain problem instances, its memory requirement is on the order of p^2. [sent-53, score-0.272]
25 These save on memory and computation by performing SVD on the resulting smaller matrix. [sent-58, score-0.17]
26 More fundamentally, these approaches are not appropriate for data coming from a statistical model such as the spiked covariance model. [sent-60, score-0.382]
27 It is clear that subsampling approaches, for instance, simply correspond to discarding most of the data, and for fundamental sample complexity reasons, cannot work. [sent-61, score-0.151]
28 If the data are, e.g., independent Gaussian vectors, then so is each element of the sketch, and thus this approach again runs against fundamental sample complexity constraints. [sent-65, score-0.151]
29 Indeed, it is straightforward to check that the guarantees presented in (Clarkson & Woodruff (2009); Halko et al. [sent-66, score-0.104]
30 (2011)) seek to have the best subspace estimate at every time (i. [sent-74, score-0.084]
31 , each time a new data sample arrives) but without performing full-blown SVD at each step. [sent-76, score-0.076]
32 While these algorithms indeed reduce both the computational and memory burden of batch-PCA, there are no rigorous guarantees on the quality of the principal components or on the statistical performance of these methods. [sent-77, score-0.467]
33 In a Bayesian mindset, some researchers have come up with expectation maximization approaches Roweis (1998); Tipping & Bishop (1999), that can be used in an incremental fashion. [sent-78, score-0.073]
34 Stochastic-approximation-based algorithms along the lines of Robbins & Monro (1951) are also quite popular, due to their low computational and memory complexity, and excellent performance. [sent-80, score-0.17]
35 They go under a variety of names, including Incremental PCA (though the term Incremental has been used in the online setting as well Herbster & Warmuth (2001)), Hebbian learning, and stochastic power method Arora et al. [sent-81, score-0.08]
36 While empirically these algorithms perform well, to the best of our knowledge - and efforts - there is no associated finite sample guarantee. [sent-85, score-0.076]
37 3 Problem Formulation and Notation We consider the streaming model: at each time step t, we receive a point x_t ∈ R^p. [sent-89, score-0.392]
38 Our goal is to compute the top k principal components of the data: the k-dimensional subspace that offers the best squared-error estimate for the points. [sent-91, score-0.319]
39 Specifically, x_t = A z_t + w_t, (2) where A ∈ R^{p×k} is a fixed matrix and z_t ∈ R^{k×1} is a multivariate normal random variable, i.e., z_t ∼ N(0_{k×1}, I_{k×k}). [sent-93, score-0.254]
40 The vector w_t ∈ R^{p×1} is the “noise” vector, also sampled from a multivariate normal distribution, i.e., w_t ∼ N(0_{p×1}, σ² I_{p×p}). [sent-95, score-0.11]
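A minimal sketch of drawing samples from this generative model; the function name and seeding are illustrative, and the noise is taken to be isotropic Gaussian with standard deviation σ as in the model statement above:

```python
import numpy as np

def spiked_covariance_stream(A, sigma, n, seed=None):
    """Yield n samples x_t = A z_t + w_t with z_t ~ N(0, I_k) and w_t ~ N(0, sigma^2 I_p)."""
    rng = np.random.default_rng(seed)
    p, k = A.shape
    for _ in range(n):
        z = rng.standard_normal(k)          # k-dimensional signal coefficients
        w = sigma * rng.standard_normal(p)  # isotropic noise
        yield A @ z + w
```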
41 In this regime, it is well-known that batch-PCA is asymptotically consistent (hence recovering A up to unitary transformations) with number of samples scaling as n = O(p) Vershynin (2010b). [sent-99, score-0.073]
42 The central goal of this paper is to provide finite sample guarantees for a streaming algorithm that requires memory no more than O(kp) and matches the consistency results of batch PCA in the sampling regime n = O(p) (possibly with additional log factors, or factors depending on σ and k). [sent-101, score-0.87]
43 ‖x‖_q denotes the ℓ_q norm of x; ‖x‖ denotes the ℓ_2 norm of x. [sent-105, score-0.072]
44 ‖A‖ or ‖A‖_2 denotes the spectral norm of A, while ‖A‖_F denotes the Frobenius norm of A. [sent-106, score-0.072]
45 Stochastic versions of the power method already exist in the literature; see Arora et al. [sent-121, score-0.08]
46 This motivates us to consider a modified stochastic power method algorithm that has a variance-reduction step built in. [sent-124, score-0.099]
47 At a high level, our method updates only once in a “block” and within one block we average out noise to reduce the variance. [sent-125, score-0.113]
48 Below, we first illustrate the main ideas of our method as well as our sample complexity proof for the simpler rank-1 case. [sent-126, score-0.151]
49 When we refer to {x_1, . . . , x_n} as “input,” we mean this in the streaming sense: the data are nowhere stored, and can never be revisited unless the algorithm explicitly stores them. [sent-133, score-0.248]
50 1 Rank-One Case We first consider the rank-1 case, for which each sample x_t is generated using: x_t = u z_t + w_t, where u ∈ R^p is the principal component that we wish to recover. [sent-135, score-0.667]
51 Our algorithm is a block-wise method where all the n samples are divided into n/B blocks (for simplicity we assume that n/B is an integer). [sent-136, score-0.11]
52 In the (τ+1)-st block, we compute s_{τ+1} = (1/B) Σ_{t=Bτ+1}^{B(τ+1)} x_t x_t^⊤ q_τ. (3) [sent-137, score-0.144]
53 Then, the iterate q_τ is updated using q_{τ+1} = s_{τ+1} / ‖s_{τ+1}‖_2. [sent-138, score-0.075]
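Below is an illustrative rendering of this rank-1 block update, i.e., equation (3) followed by the normalization step; it is a sketch of the block stochastic power method rather than the authors' reference implementation, and the random unit-norm initialization and the argument names are assumptions:

```python
import numpy as np

def block_power_rank1(stream, p, B, T, seed=None):
    """One-pass estimate of the top principal component from B*T streamed samples."""
    rng = np.random.default_rng(seed)
    it = iter(stream)
    q = rng.standard_normal(p)
    q /= np.linalg.norm(q)              # random unit-norm start
    for _ in range(T):                  # one update per block
        s = np.zeros(p)
        for _ in range(B):
            x = next(it)
            s += x * (x @ q)            # adds (x_t x_t^T) q without forming any p x p matrix
        s /= B                          # s_{tau+1} as in equation (3)
        q = s / np.linalg.norm(s)       # q_{tau+1} = s_{tau+1} / ||s_{tau+1}||_2
    return q
```

Only q, s, and the current sample are held in memory, i.e., O(p) storage for the rank-1 case.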
54 1 Analysis We now present the sample complexity analysis of our proposed method. [sent-143, score-0.151]
55 Then, with probability at least 0.99, ‖q_T − u‖_2 ≤ ǫ, where q_T is the T-th iterate of Algorithm 1. [sent-156, score-0.075]
56 That is, Algorithm 1 obtains an ǫ-accurate solution with number of samples (n) given by: n = Ω̃( (1 + 3(σ + σ²)√p)² log(p/ǫ) / ǫ² ). [sent-157, score-0.073]
57 Note that in the total sample complexity, we use the notation Ω̃(·) to suppress the extra log(T) factor for clarity of exposition, as T already appears in the expression linearly. [sent-160, score-0.125]
58 The proof decomposes the current iterate into the component of the current iterate, qτ , in the direction of the true principal component (the spike) u, and the perpendicular component, showing that the former eventually dominates. [sent-162, score-0.535]
59 Doing so hinges on three key components: (a) for large enough B, the empirical covariance matrix F_{τ+1} = (1/B) Σ_{t=Bτ+1}^{B(τ+1)} x_t x_t^⊤ is close to the true covariance matrix M = u u^⊤ + σ² I, i.e., ‖F_{τ+1} − M‖_2 ≤ ǫ; [sent-163, score-0.554]
60 (b) with probability 0.99 (or any other constant probability), the initial point q_0 has a component of at least O(1/√p) magnitude along the true direction u; (c) after τ iterations, the error in estimation is at most O(γ^τ), where γ < 1 is a constant. [sent-167, score-0.08]
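The O(1/√p) claim in (b) is easy to check numerically for a random unit-norm starting point; this is a quick illustration, not part of the paper's argument:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 10_000
u = np.zeros(p); u[0] = 1.0         # any fixed unit vector playing the role of the spike
q0 = rng.standard_normal(p)
q0 /= np.linalg.norm(q0)            # random point on the unit sphere
print(abs(u @ q0), 1 / np.sqrt(p))  # both quantities are on the order of 1e-2
```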
61 Let B, T and the data stream {xi } be as defined in the theorem. [sent-171, score-0.072]
62 Then: • (Lemma 4): With probability 1 − C/T, for C a universal constant, we have: ‖(1/B) Σ_t x_t x_t^⊤ − u u^⊤ − σ² I‖_2 ≤ ǫ. [sent-172, score-0.183]
63 • (Lemma 5): With probability 1 − C/T, for C a universal constant, we have: u^⊤ s_{τ+1} ≥ u^⊤ q_τ (1 + σ²) (1 − ǫ/(4(1 + σ²))), where s_{τ+1} = (1/B) Σ_{t=Bτ+1}^{B(τ+1)} x_t x_t^⊤ q_τ as in (3). [sent-173, score-0.078]
64 Let q_τ = √(1 − δ_τ) u + √(δ_τ) g_τ, 1 ≤ τ ≤ n/B, where g_τ is the component of q_τ that is perpendicular to u and √(1 − δ_τ) is the magnitude of the component of q_τ along u. [sent-177, score-0.301]
65 With probability ≥ 1 − C/T: g_{τ+1}^⊤ s_{τ+1} = σ² g_{τ+1}^⊤ q_τ + g_{τ+1}^⊤ E_τ q_τ ≤ σ² g_{τ+1}^⊤ q_τ + ‖g_{τ+1}‖_2 ‖E_τ‖_2 ‖q_τ‖_2, where E_τ = F_{τ+1} − M. [sent-185, score-0.144]
66 2 General Rank-k Case In this section, we consider the general rank-k PCA problem where each sample is assumed to be generated using the model of equation (2), where A ∈ Rp×k represents the k principal components that need to be recovered. [sent-204, score-0.311]
67 Similar to the rank-1 problem, our algorithm for the rank-k problem can be viewed as a streaming variant of the classical orthogonal iteration used for SVD. [sent-213, score-0.309]
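A sketch of the corresponding rank-k block update, using a QR step to re-orthonormalize after every block in the spirit of orthogonal iteration; this is an illustrative reading of the method, not the authors' exact pseudocode, and the function and argument names are assumptions:

```python
import numpy as np

def block_power_rank_k(stream, p, k, B, T, seed=None):
    """One-pass estimate of the top-k principal subspace from B*T streamed samples."""
    rng = np.random.default_rng(seed)
    it = iter(stream)
    Q, _ = np.linalg.qr(rng.standard_normal((p, k)))  # random p x k orthonormal start
    for _ in range(T):
        S = np.zeros((p, k))
        for _ in range(B):
            x = next(it)
            S += np.outer(x, x @ Q)    # adds (x_t x_t^T) Q, again without any p x p matrix
        Q, _ = np.linalg.qr(S / B)     # re-orthonormalize the block average
    return Q                           # columns span the estimated principal subspace
```

Only Q and S (each p × k) are kept, which matches the O(kp) memory claim.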
68 Interestingly, the analysis is entirely different from the standard analysis of the orthogonal iteration: there, the empirical estimate of the covariance matrix is fixed, while in our case it varies with each block. [sent-215, score-0.266]
69 For the spiked covariance model, it is straightforward to see that this is equivalent to the usual PCA figure-of-merit, the expressed variance. [sent-217, score-0.382]
70 Consider a data stream, where x_t ∈ R^p for every t is generated by (2), and the SVD of A ∈ R^{p×k} is given by A = U Λ V^⊤. [sent-219, score-0.144]
71 Hence, the sufficient number of samples for ǫ-accurate recovery of all the top-k principal components is: n = Ω̃( ((1 + σ)²√k + σ√(1 + σ²)√(kp))² log(p/kǫ) / ǫ² ). [sent-228, score-0.371]
72 The key part of the proof requires the following additional lemmas that bound the energy of the current iterate along the desired subspace and its perpendicular space (Lemmas 8 and 9), and Lemma 10, which controls the quality of the initialization. [sent-232, score-0.399]
73 Let the data stream, A, B, and T be as defined in Theorem 3, let σ² be the variance of the noise, and let F_{τ+1} = (1/B) Σ_{t=Bτ+1}^{B(τ+1)} x_t x_t^⊤; under a condition on the initial estimate, the initialization step provides us a better distance. [sent-234, score-0.106]
74 This initialization step enables us to give a tighter sample complexity, as the r√p in the numerator above can be replaced by √(rp). [sent-237, score-0.196]
75 5 Experiments In this section, we show that, as predicted by our theoretical results, our algorithm performs close to the optimal batch SVD. [sent-238, score-0.182]
76 We provide the results from simulating the spiked covariance model, and demonstrate the phase-transition in the probability of successful recovery that is inherent to the statistical problem. [sent-239, score-0.445]
77 Then we stray from the analyzed model and performance metric and test our algorithm on real-world (and some very big) datasets, using the metric of explained variance. [sent-240, score-0.093]
78 As expected, our algorithm exhibits linear scaling with respect to the ambient dimension p – the same as the batch SVD. [sent-245, score-0.223]
79 The missing point on batch SVD’s curve (Figure 1(a)) corresponds to p > 2. [sent-246, score-0.182]
80 Performing SVD on a dense p × p matrix either fails or takes a very long time on most modern desktop computers; in contrast, our streaming algorithm easily runs on problems of this size. [sent-248, score-0.315]
81 The phase transition plot in Figure 1(b) shows the empirical sample complexity on a large class of problems and corroborates the scaling with respect to the noise variance that we obtain theoretically. [sent-266, score-0.26]
82 Figure 1: (a) Number of samples required for recovery of a single component (k = 1) from the spiked covariance model, with noise standard deviation σ = 0. [sent-260, score-0.646]
83 (b) Fraction of trials in which Algorithm 1 successfully recovers the principal component (k = 1) in the same model, with ǫ = 0.05 and n = 1000 samples. [sent-263, score-0.239]
84 (c) Explained variance by Algorithm 1 compared to the optimal batch SVD, on the NIPS bag-of-words dataset; (d) [panel title: “…2M samples, p=140K”; x-axis: k (number of components)]. [sent-264, score-0.243]
86 Figures 1 (c)-(d) complement our complete treatment of the spiked covariance model, with some out-of-model experiments. [sent-267, score-0.382]
87 We evaluated our algorithm’s performance with respect to the fraction of explained variance metric: given the p × k matrix V output from the algorithm, and all the provided samples in matrix X, the fraction of explained variance is defined as Tr(V T XX T V )/ Tr(XX T ). [sent-270, score-0.471]
88 To be consistent with our theory, for a dataset of n samples of dimension p, we set the number of blocks to be T = ⌈log(p)⌉ and the size of blocks to B = ⌊n/T ⌋ in our algorithm. [sent-271, score-0.147]
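A sketch of this evaluation protocol: choose T = ⌈log p⌉ and B = ⌊n/T⌋, then compute the explained-variance fraction in a second pass; since Tr(V^⊤ X X^⊤ V) = ‖X^⊤ V‖_F² and Tr(X X^⊤) = ‖X‖_F², the p × p matrix X X^⊤ never needs to be formed. The function name is illustrative, log is read as the natural logarithm, and the p and n values below correspond to the NIPS bag-of-words dataset described next:

```python
import numpy as np

def explained_variance(samples, V):
    """Fraction Tr(V^T X X^T V) / Tr(X X^T), accumulated one sample at a time."""
    num, den = 0.0, 0.0
    for x in samples:                  # second pass over the data
        num += np.sum((x @ V) ** 2)    # adds ||V^T x||^2
        den += np.sum(x ** 2)          # adds ||x||^2
    return num / den

# Block parameters used in the experiments (natural log assumed):
p, n = 12_000, 1_500                   # ~12K words, 1500 documents
T = int(np.ceil(np.log(p)))            # number of blocks, here 10
B = n // T                             # samples per block, here 150
```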
89 The NIPS dataset is the smallest, with 1500 documents and 12K words, and allowed us to compare our algorithm with the optimal batch SVD. [sent-272, score-0.182]
90 The figure is consistent with our theoretical result: our algorithm performs as well as the batch, with an added log(p) factor in the sample complexity. [sent-275, score-0.076]
91 Both the NY Times and PubMed datasets are of prohibitive size for traditional batch methods – the latter including 8. [sent-277, score-0.182]
92 It was able to extract the top 7 components for each dataset in a few hours on a desktop computer. [sent-279, score-0.143]
93 A second pass was made on the data to evaluate the results, and we saw 7-10 percent of the variance explained on spaces with p > 10^4. [sent-280, score-0.154]
94 Online identification and tracking of subspaces from highly incomplete information. [sent-290, score-0.066]
95 Fast low-rank modifications of the thin singular value decomposition. [sent-294, score-0.083]
96 Incremental singular value decomposition of uncertain data with missing values. [sent-297, score-0.083]
97 Tracking a few extreme singular values and vectors in signal processing. [sent-308, score-0.083]
98 On the distribution of the largest eigenvalue in principal components analysis. [sent-327, score-0.235]
99 Finite sample approximation results for principal component analysis: a matrix perturbation approach. [sent-339, score-0.404]
100 How close is the sample covariance matrix to the actual covariance matrix? [sent-362, score-0.441]
wordName wordTfidf (topN-words)
[('streaming', 0.248), ('spiked', 0.222), ('svd', 0.217), ('batch', 0.182), ('pca', 0.178), ('memory', 0.17), ('covariance', 0.16), ('principal', 0.159), ('kp', 0.157), ('storage', 0.153), ('dist', 0.15), ('vershynin', 0.15), ('xt', 0.144), ('perpendicular', 0.141), ('mitliagkas', 0.13), ('warmuth', 0.119), ('arora', 0.117), ('qt', 0.115), ('rp', 0.115), ('lemmas', 0.099), ('constantine', 0.098), ('uu', 0.098), ('explained', 0.093), ('ioannis', 0.09), ('brand', 0.09), ('pubmed', 0.09), ('kuzmin', 0.09), ('woodruff', 0.09), ('golub', 0.087), ('clarkson', 0.084), ('halko', 0.084), ('subspace', 0.084), ('singular', 0.083), ('component', 0.08), ('spike', 0.08), ('allerton', 0.077), ('sample', 0.076), ('components', 0.076), ('complexity', 0.075), ('iterate', 0.075), ('balzano', 0.074), ('herbster', 0.074), ('samples', 0.073), ('incremental', 0.073), ('stream', 0.072), ('johnstone', 0.069), ('span', 0.068), ('regime', 0.068), ('desktop', 0.067), ('tracking', 0.066), ('comon', 0.065), ('proj', 0.065), ('loan', 0.065), ('monro', 0.065), ('block', 0.065), ('wt', 0.064), ('log', 0.064), ('recovery', 0.063), ('guarantees', 0.062), ('variance', 0.061), ('orthogonal', 0.061), ('nadler', 0.06), ('gb', 0.06), ('wlog', 0.056), ('sketching', 0.056), ('store', 0.056), ('stored', 0.054), ('provably', 0.054), ('porteous', 0.054), ('lemma', 0.052), ('bag', 0.051), ('ip', 0.051), ('tb', 0.049), ('caramanis', 0.049), ('tipping', 0.049), ('suppress', 0.049), ('noise', 0.048), ('zt', 0.046), ('matrix', 0.045), ('bishop', 0.045), ('initialization', 0.045), ('perturbation', 0.044), ('robbins', 0.044), ('manfred', 0.043), ('et', 0.042), ('ambient', 0.041), ('requirement', 0.04), ('ram', 0.04), ('universal', 0.039), ('xx', 0.038), ('power', 0.038), ('jain', 0.037), ('blocks', 0.037), ('therein', 0.037), ('roweis', 0.037), ('regret', 0.037), ('ny', 0.036), ('norm', 0.036), ('meanwhile', 0.036), ('arxiv', 0.036)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 188 nips-2013-Memory Limited, Streaming PCA
Author: Ioannis Mitliagkas, Constantine Caramanis, Prateek Jain
Abstract: We consider streaming, one-pass principal component analysis (PCA), in the high-dimensional regime, with limited memory. Here, p-dimensional samples are presented sequentially, and the goal is to produce the k-dimensional subspace that best approximates these points. Standard algorithms require O(p^2) memory; meanwhile no algorithm can do better than O(kp) memory, since this is what the output itself requires. Memory (or storage) complexity is most meaningful when understood in the context of computational and sample complexity. Sample complexity for high-dimensional PCA is typically studied in the setting of the spiked covariance model, where p-dimensional points are generated from a population covariance equal to the identity (white noise) plus a low-dimensional perturbation (the spike) which is the signal to be recovered. It is now well-understood that the spike can be recovered when the number of samples, n, scales proportionally with the dimension, p. Yet all algorithms that provably achieve this have memory complexity O(p^2). Meanwhile, algorithms with memory complexity O(kp) do not have provable bounds on sample complexity comparable to p. We present an algorithm that achieves both: it uses O(kp) memory (meaning storage of any kind) and is able to compute the k-dimensional spike with O(p log p) sample complexity – the first algorithm of its kind. While our theoretical analysis focuses on the spiked covariance model, our simulations show that our algorithm is successful on much more general models for the data. 1
2 0.21984677 116 nips-2013-Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA
Author: Vincent Q. Vu, Juhee Cho, Jing Lei, Karl Rohe
Abstract: We propose a novel convex relaxation of sparse principal subspace estimation based on the convex hull of rank-d projection matrices (the Fantope). The convex problem can be solved efficiently using alternating direction method of multipliers (ADMM). We establish a near-optimal convergence rate, in terms of the sparsity, ambient dimension, and sample size, for estimation of the principal subspace of a general covariance matrix without assuming the spiked covariance model. In the special case of d = 1, our result implies the near-optimality of DSPCA (d’Aspremont et al. [1]) even when the solution is not rank 1. We also provide a general theoretical framework for analyzing the statistical properties of the method for arbitrary input matrices that extends the applicability and provable guarantees to a wide array of settings. We demonstrate this with an application to Kendall’s tau correlation matrices and transelliptical component analysis. 1
3 0.20780893 233 nips-2013-Online Robust PCA via Stochastic Optimization
Author: Jiashi Feng, Huan Xu, Shuicheng Yan
Abstract: Robust PCA methods are typically based on batch optimization and have to load all the samples into memory during optimization. This prevents them from efficiently processing big data. In this paper, we develop an Online Robust PCA (OR-PCA) that processes one sample per time instance and hence its memory cost is independent of the number of samples, significantly enhancing the computation and storage efficiency. The proposed OR-PCA is based on stochastic optimization of an equivalent reformulation of the batch RPCA. Indeed, we show that OR-PCA provides a sequence of subspace estimations converging to the optimum of its batch counterpart and hence is provably robust to sparse corruption. Moreover, OR-PCA can naturally be applied for tracking dynamic subspace. Comprehensive simulations on subspace recovering and tracking demonstrate the robustness and efficiency advantages of the OR-PCA over online PCA and batch RPCA methods. 1
4 0.1792589 285 nips-2013-Robust Transfer Principal Component Analysis with Rank Constraints
Author: Yuhong Guo
Abstract: Principal component analysis (PCA), a well-established technique for data analysis and processing, provides a convenient form of dimensionality reduction that is effective for cleaning small Gaussian noises presented in the data. However, the applicability of standard principal component analysis in real scenarios is limited by its sensitivity to large errors. In this paper, we tackle the challenge problem of recovering data corrupted with errors of high magnitude by developing a novel robust transfer principal component analysis method. Our method is based on the assumption that useful information for the recovery of a corrupted data matrix can be gained from an uncorrupted related data matrix. Specifically, we formulate the data recovery problem as a joint robust principal component analysis problem on the two data matrices, with common principal components shared across matrices and individual principal components specific to each data matrix. The formulated optimization problem is a minimization problem over a convex objective function but with non-convex rank constraints. We develop an efficient proximal projected gradient descent algorithm to solve the proposed optimization problem with convergence guarantees. Our empirical results over image denoising tasks show the proposed method can effectively recover images with random large errors, and significantly outperform both standard PCA and robust PCA with rank constraints. 1
5 0.17278947 232 nips-2013-Online PCA for Contaminated Data
Author: Jiashi Feng, Huan Xu, Shie Mannor, Shuicheng Yan
Abstract: We consider the online Principal Component Analysis (PCA) where contaminated samples (containing outliers) are revealed sequentially to the Principal Components (PCs) estimator. Due to their sensitiveness to outliers, previous online PCA algorithms fail in this case and their results can be arbitrarily skewed by the outliers. Here we propose the online robust PCA algorithm, which is able to improve the PCs estimation upon an initial one steadily, even when faced with a constant fraction of outliers. We show that the final result of the proposed online RPCA has an acceptable degradation from the optimum. Actually, under mild conditions, online RPCA achieves the maximal robustness with a 50% breakdown point. Moreover, online RPCA is shown to be efficient for both storage and computation, since it need not re-explore the previous samples as in traditional robust PCA algorithms. This endows online RPCA with scalability for large scale data. 1
6 0.16969554 314 nips-2013-Stochastic Optimization of PCA with Capped MSG
7 0.13844796 91 nips-2013-Dirty Statistical Models
8 0.1370241 286 nips-2013-Robust learning of low-dimensional dynamics from large neural ensembles
9 0.13559929 247 nips-2013-Phase Retrieval using Alternating Minimization
10 0.13059512 224 nips-2013-On the Sample Complexity of Subspace Learning
11 0.12331086 269 nips-2013-Regression-tree Tuning in a Streaming Setting
12 0.11879069 137 nips-2013-High-Dimensional Gaussian Process Bandits
13 0.11538184 146 nips-2013-Large Scale Distributed Sparse Precision Estimation
14 0.11009249 68 nips-2013-Confidence Intervals and Hypothesis Testing for High-Dimensional Statistical Models
15 0.10640953 179 nips-2013-Low-Rank Matrix and Tensor Completion via Adaptive Sampling
16 0.10547861 235 nips-2013-Online learning in episodic Markovian decision processes by relative entropy policy search
17 0.10528146 324 nips-2013-The Fast Convergence of Incremental PCA
18 0.10315633 89 nips-2013-Dimension-Free Exponentiated Gradient
19 0.10009585 222 nips-2013-On the Linear Convergence of the Proximal Gradient Method for Trace Norm Regularization
20 0.097536221 206 nips-2013-Near-Optimal Entrywise Sampling for Data Matrices
topicId topicWeight
[(0, 0.283), (1, 0.057), (2, 0.168), (3, 0.046), (4, -0.104), (5, 0.033), (6, -0.073), (7, 0.148), (8, -0.112), (9, -0.109), (10, -0.104), (11, 0.096), (12, 0.084), (13, 0.186), (14, 0.002), (15, -0.033), (16, 0.043), (17, 0.044), (18, 0.04), (19, 0.008), (20, 0.036), (21, -0.04), (22, -0.029), (23, -0.019), (24, 0.052), (25, 0.088), (26, -0.008), (27, 0.044), (28, -0.04), (29, -0.024), (30, 0.086), (31, 0.004), (32, -0.044), (33, 0.01), (34, -0.013), (35, -0.026), (36, -0.06), (37, 0.058), (38, 0.005), (39, -0.058), (40, -0.005), (41, -0.049), (42, 0.018), (43, -0.004), (44, 0.034), (45, -0.03), (46, -0.058), (47, -0.011), (48, -0.014), (49, -0.015)]
simIndex simValue paperId paperTitle
same-paper 1 0.95057726 188 nips-2013-Memory Limited, Streaming PCA
Author: Ioannis Mitliagkas, Constantine Caramanis, Prateek Jain
Abstract: We consider streaming, one-pass principal component analysis (PCA), in the high-dimensional regime, with limited memory. Here, p-dimensional samples are presented sequentially, and the goal is to produce the k-dimensional subspace that best approximates these points. Standard algorithms require O(p^2) memory; meanwhile no algorithm can do better than O(kp) memory, since this is what the output itself requires. Memory (or storage) complexity is most meaningful when understood in the context of computational and sample complexity. Sample complexity for high-dimensional PCA is typically studied in the setting of the spiked covariance model, where p-dimensional points are generated from a population covariance equal to the identity (white noise) plus a low-dimensional perturbation (the spike) which is the signal to be recovered. It is now well-understood that the spike can be recovered when the number of samples, n, scales proportionally with the dimension, p. Yet all algorithms that provably achieve this have memory complexity O(p^2). Meanwhile, algorithms with memory complexity O(kp) do not have provable bounds on sample complexity comparable to p. We present an algorithm that achieves both: it uses O(kp) memory (meaning storage of any kind) and is able to compute the k-dimensional spike with O(p log p) sample complexity – the first algorithm of its kind. While our theoretical analysis focuses on the spiked covariance model, our simulations show that our algorithm is successful on much more general models for the data. 1
2 0.87858069 314 nips-2013-Stochastic Optimization of PCA with Capped MSG
Author: Raman Arora, Andy Cotter, Nati Srebro
Abstract: We study PCA as a stochastic optimization problem and propose a novel stochastic approximation algorithm which we refer to as “Matrix Stochastic Gradient” (MSG), as well as a practical variant, Capped MSG. We study the method both theoretically and empirically. 1
3 0.86046427 233 nips-2013-Online Robust PCA via Stochastic Optimization
Author: Jiashi Feng, Huan Xu, Shuicheng Yan
Abstract: Robust PCA methods are typically based on batch optimization and have to load all the samples into memory during optimization. This prevents them from efficiently processing big data. In this paper, we develop an Online Robust PCA (OR-PCA) that processes one sample per time instance and hence its memory cost is independent of the number of samples, significantly enhancing the computation and storage efficiency. The proposed OR-PCA is based on stochastic optimization of an equivalent reformulation of the batch RPCA. Indeed, we show that OR-PCA provides a sequence of subspace estimations converging to the optimum of its batch counterpart and hence is provably robust to sparse corruption. Moreover, OR-PCA can naturally be applied for tracking dynamic subspace. Comprehensive simulations on subspace recovering and tracking demonstrate the robustness and efficiency advantages of the OR-PCA over online PCA and batch RPCA methods. 1
4 0.82487941 232 nips-2013-Online PCA for Contaminated Data
Author: Jiashi Feng, Huan Xu, Shie Mannor, Shuicheng Yan
Abstract: We consider the online Principal Component Analysis (PCA) where contaminated samples (containing outliers) are revealed sequentially to the Principal Components (PCs) estimator. Due to their sensitiveness to outliers, previous online PCA algorithms fail in this case and their results can be arbitrarily skewed by the outliers. Here we propose the online robust PCA algorithm, which is able to improve the PCs estimation upon an initial one steadily, even when faced with a constant fraction of outliers. We show that the final result of the proposed online RPCA has an acceptable degradation from the optimum. Actually, under mild conditions, online RPCA achieves the maximal robustness with a 50% breakdown point. Moreover, online RPCA is shown to be efficient for both storage and computation, since it need not re-explore the previous samples as in traditional robust PCA algorithms. This endows online RPCA with scalability for large scale data. 1
5 0.78312457 116 nips-2013-Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA
Author: Vincent Q. Vu, Juhee Cho, Jing Lei, Karl Rohe
Abstract: We propose a novel convex relaxation of sparse principal subspace estimation based on the convex hull of rank-d projection matrices (the Fantope). The convex problem can be solved efficiently using alternating direction method of multipliers (ADMM). We establish a near-optimal convergence rate, in terms of the sparsity, ambient dimension, and sample size, for estimation of the principal subspace of a general covariance matrix without assuming the spiked covariance model. In the special case of d = 1, our result implies the near-optimality of DSPCA (d’Aspremont et al. [1]) even when the solution is not rank 1. We also provide a general theoretical framework for analyzing the statistical properties of the method for arbitrary input matrices that extends the applicability and provable guarantees to a wide array of settings. We demonstrate this with an application to Kendall’s tau correlation matrices and transelliptical component analysis. 1
6 0.78157723 285 nips-2013-Robust Transfer Principal Component Analysis with Rank Constraints
7 0.70243102 120 nips-2013-Faster Ridge Regression via the Subsampled Randomized Hadamard Transform
8 0.66500968 117 nips-2013-Fast Algorithms for Gaussian Noise Invariant Independent Component Analysis
9 0.65309989 224 nips-2013-On the Sample Complexity of Subspace Learning
10 0.62660158 91 nips-2013-Dirty Statistical Models
11 0.61585927 286 nips-2013-Robust learning of low-dimensional dynamics from large neural ensembles
12 0.58448792 324 nips-2013-The Fast Convergence of Incremental PCA
13 0.58384365 146 nips-2013-Large Scale Distributed Sparse Precision Estimation
14 0.58167434 259 nips-2013-Provable Subspace Clustering: When LRR meets SSC
15 0.57237047 247 nips-2013-Phase Retrieval using Alternating Minimization
16 0.56817353 284 nips-2013-Robust Spatial Filtering with Beta Divergence
17 0.55976439 137 nips-2013-High-Dimensional Gaussian Process Bandits
18 0.55800992 45 nips-2013-BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables
19 0.55800694 130 nips-2013-Generalizing Analytic Shrinkage for Arbitrary Covariance Structures
20 0.55286443 209 nips-2013-New Subsampling Algorithms for Fast Least Squares Regression
topicId topicWeight
[(16, 0.059), (33, 0.153), (34, 0.106), (36, 0.284), (41, 0.014), (49, 0.032), (56, 0.132), (70, 0.018), (85, 0.06), (89, 0.034), (93, 0.04), (95, 0.018)]
simIndex simValue paperId paperTitle
1 0.83662385 317 nips-2013-Streaming Variational Bayes
Author: Tamara Broderick, Nicholas Boyd, Andre Wibisono, Ashia C. Wilson, Michael Jordan
Abstract: We present SDA-Bayes, a framework for (S)treaming, (D)istributed, (A)synchronous computation of a Bayesian posterior. The framework makes streaming updates to the estimated posterior according to a user-specified approximation batch primitive. We demonstrate the usefulness of our framework, with variational Bayes (VB) as the primitive, by fitting the latent Dirichlet allocation model to two large-scale document collections. We demonstrate the advantages of our algorithm over stochastic variational inference (SVI) by comparing the two after a single pass through a known amount of data—a case where SVI may be applied—and in the streaming setting, where SVI does not apply. 1
same-paper 2 0.81020147 188 nips-2013-Memory Limited, Streaming PCA
Author: Ioannis Mitliagkas, Constantine Caramanis, Prateek Jain
Abstract: We consider streaming, one-pass principal component analysis (PCA), in the high-dimensional regime, with limited memory. Here, p-dimensional samples are presented sequentially, and the goal is to produce the k-dimensional subspace that best approximates these points. Standard algorithms require O(p^2) memory; meanwhile no algorithm can do better than O(kp) memory, since this is what the output itself requires. Memory (or storage) complexity is most meaningful when understood in the context of computational and sample complexity. Sample complexity for high-dimensional PCA is typically studied in the setting of the spiked covariance model, where p-dimensional points are generated from a population covariance equal to the identity (white noise) plus a low-dimensional perturbation (the spike) which is the signal to be recovered. It is now well-understood that the spike can be recovered when the number of samples, n, scales proportionally with the dimension, p. Yet all algorithms that provably achieve this have memory complexity O(p^2). Meanwhile, algorithms with memory complexity O(kp) do not have provable bounds on sample complexity comparable to p. We present an algorithm that achieves both: it uses O(kp) memory (meaning storage of any kind) and is able to compute the k-dimensional spike with O(p log p) sample complexity – the first algorithm of its kind. While our theoretical analysis focuses on the spiked covariance model, our simulations show that our algorithm is successful on much more general models for the data. 1
3 0.79315084 308 nips-2013-Spike train entropy-rate estimation using hierarchical Dirichlet process priors
Author: Karin C. Knudson, Jonathan W. Pillow
Abstract: Entropy rate quantifies the amount of disorder in a stochastic process. For spiking neurons, the entropy rate places an upper bound on the rate at which the spike train can convey stimulus information, and a large literature has focused on the problem of estimating entropy rate from spike train data. Here we present Bayes least squares and empirical Bayesian entropy rate estimators for binary spike trains using hierarchical Dirichlet process (HDP) priors. Our estimator leverages the fact that the entropy rate of an ergodic Markov Chain with known transition probabilities can be calculated analytically, and many stochastic processes that are non-Markovian can still be well approximated by Markov processes of sufficient depth. Choosing an appropriate depth of Markov model presents challenges due to possibly long time dependencies and short data sequences: a deeper model can better account for long time dependencies, but is more difficult to infer from limited data. Our approach mitigates this difficulty by using a hierarchical prior to share statistical power across Markov chains of different depths. We present both a fully Bayesian and empirical Bayes entropy rate estimator based on this model, and demonstrate their performance on simulated and real neural spike train data. 1
4 0.77871066 24 nips-2013-Actor-Critic Algorithms for Risk-Sensitive MDPs
Author: Prashanth L.A., Mohammad Ghavamzadeh
Abstract: In many sequential decision-making problems we may want to manage risk by minimizing some measure of variability in rewards in addition to maximizing a standard criterion. Variance-related risk measures are among the most common risk-sensitive criteria in finance and operations research. However, optimizing many such criteria is known to be a hard problem. In this paper, we consider both discounted and average reward Markov decision processes. For each formulation, we first define a measure of variability for a policy, which in turn gives us a set of risk-sensitive criteria to optimize. For each of these criteria, we derive a formula for computing its gradient. We then devise actor-critic algorithms for estimating the gradient and updating the policy parameters in the ascent direction. We establish the convergence of our algorithms to locally risk-sensitive optimal policies. Finally, we demonstrate the usefulness of our algorithms in a traffic signal control application. 1
5 0.75338089 44 nips-2013-B-test: A Non-parametric, Low Variance Kernel Two-sample Test
Author: Wojciech Zaremba, Arthur Gretton, Matthew Blaschko
Abstract: A family of maximum mean discrepancy (MMD) kernel two-sample tests is introduced. Members of the test family are called Block-tests or B-tests, since the test statistic is an average over MMDs computed on subsets of the samples. The choice of block size allows control over the tradeoff between test power and computation time. In this respect, the B-test family combines favorable properties of previously proposed MMD two-sample tests: B-tests are more powerful than a linear time test where blocks are just pairs of samples, yet they are more computationally efficient than a quadratic time test where a single large block incorporating all the samples is used to compute a U-statistic. A further important advantage of the B-tests is their asymptotically Normal null distribution: this is by contrast with the U-statistic, which is degenerate under the null hypothesis, and for which estimates of the null distribution are computationally demanding. Recent results on kernel selection for hypothesis testing transfer seamlessly to the B-tests, yielding a means to optimize test power via kernel choice. 1
6 0.69679457 314 nips-2013-Stochastic Optimization of PCA with Capped MSG
7 0.68677801 233 nips-2013-Online Robust PCA via Stochastic Optimization
8 0.68189812 232 nips-2013-Online PCA for Contaminated Data
9 0.66919208 23 nips-2013-Active Learning for Probabilistic Hypotheses Using the Maximum Gibbs Error Criterion
10 0.66654521 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation
11 0.66441911 187 nips-2013-Memoized Online Variational Inference for Dirichlet Process Mixture Models
12 0.66365987 249 nips-2013-Polar Operators for Structured Sparse Estimation
13 0.65973639 324 nips-2013-The Fast Convergence of Incremental PCA
14 0.6565057 13 nips-2013-A Scalable Approach to Probabilistic Latent Space Inference of Large-Scale Networks
15 0.65542775 124 nips-2013-Forgetful Bayes and myopic planning: Human learning and decision-making in a bandit setting
16 0.65522039 25 nips-2013-Adaptive Anonymity via $b$-Matching
17 0.65378851 60 nips-2013-Buy-in-Bulk Active Learning
18 0.65144187 313 nips-2013-Stochastic Majorization-Minimization Algorithms for Large-Scale Optimization
19 0.64955562 355 nips-2013-Which Space Partitioning Tree to Use for Search?
20 0.64929062 312 nips-2013-Stochastic Gradient Riemannian Langevin Dynamics on the Probability Simplex