nips nips2013 nips2013-324 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Akshay Balsubramani, Sanjoy Dasgupta, Yoav Freund
Abstract: We consider a situation in which we see samples Xn ∈ Rd drawn i.i.d. from some distribution with mean zero and unknown covariance A. We wish to compute the top eigenvector of A in an incremental fashion - with an algorithm that maintains an estimate of the top eigenvector in O(d) space, and incrementally adjusts the estimate with each new data point that arrives. Two classical such schemes are due to Krasulina (1969) and Oja (1983). We give finite-sample convergence rates for both. 1
Reference: text
sentIndex sentText sentNum sentScore
1 from some distribution with mean zero and unknown covariance A. [sent-10, score-0.036]
2 We wish to compute the top eigenvector of A in an incremental fashion - with an algorithm that maintains an estimate of the top eigenvector in O(d) space, and incrementally adjusts the estimate with each new data point that arrives. [sent-11, score-0.36]
3 1 Introduction Principal component analysis (PCA) is a popular form of dimensionality reduction that projects a data set on the top eigenvector(s) of its covariance matrix. [sent-14, score-0.07]
4 The default method for computing these eigenvectors uses O(d²) space for data in Rd , which can be prohibitive in practice. [sent-15, score-0.049]
5 It is therefore of interest to study incremental schemes that take one data point at a time, updating their estimates of the desired eigenvectors with each new point. [sent-16, score-0.201]
6 For the case of the top eigenvector, this problem has long been studied, and two elegant solutions were obtained by Krasulina [7] and Oja [9]. [sent-18, score-0.034]
7 At time n − 1, they have some estimate Vn−1 ∈ Rd of the top eigenvector. [sent-20, score-0.034]
8 from a distribution on Rd with mean zero and covariance matrix A. [sent-28, score-0.036]
9 The original papers proved that these estimators converge almost surely to the top eigenvector of A (call it v∗) under mild conditions: • Σn γn = ∞ while Σn γn² < ∞. [sent-29, score-0.13]
10 • If λ1 , λ2 denote the top two eigenvalues of A, then λ1 > λ2 . [sent-30, score-0.069]
11 There are also other incremental estimators for which convergence has not been established; see, for instance, [12] and [16]. [sent-32, score-0.206]
12 In this paper, we analyze the rate of convergence of the Krasulina and Oja estimators. [sent-33, score-0.097]
13 They can be treated in a common framework, as stochastic approximation algorithms for maximizing the Rayleigh quotient G(v) = (v^T A v)/‖v‖². [sent-34, score-0.092]
14 Since E[Xn Xn^T] = A, we see that Krasulina’s method is stochastic gradient descent. [sent-37, score-0.055]
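For concreteness, here is a minimal numpy sketch of one step of each estimator, written in the forms in which the two updates are commonly stated (the function names and normalization details are our own; this is an illustration, not the authors' code):

```python
import numpy as np

def oja_step(v, x, gamma):
    """One Oja update: push v toward x by gamma * x (x . v), then renormalize."""
    w = v + gamma * x * (x @ v)
    return w / np.linalg.norm(w)

def krasulina_step(v, x, gamma):
    """One Krasulina update: a stochastic gradient step on the Rayleigh quotient."""
    xv = x @ v
    return v + gamma * (x * xv - (xv ** 2 / (v @ v)) * v)
```

Both take the current estimate v, a fresh sample x, and the step size γn, and return the updated estimate in O(d) time and space.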
15 Recently, there has been a lot of work on rates of convergence for stochastic gradient descent (for instance, [11]), but this has typically been limited to convex cost functions. [sent-39, score-0.164]
16 We measure the quality of the solution Vn at time n using the potential function Ψn = 1 − (Vn · v∗)²/‖Vn‖², where v∗ is taken to have unit norm. [sent-42, score-0.047]
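In simulations where the target v∗ is known, this potential is one line of code; a small helper in the same numpy style as above (v_star is assumed to have unit norm, as in the text):

```python
def psi(v, v_star):
    """Potential Psi = 1 - (v . v*)^2 / ||v||^2; zero when v is parallel to v*."""
    return 1.0 - (v @ v_star) ** 2 / (v @ v)
```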
17 This quantity lies in the range [0, 1], and we are interested in the rate at which it approaches zero. [sent-43, score-0.048]
18 The result, in brief, is that E[Ψn ] = O(1/n), under conditions that are similar to those above, but stronger. [sent-44, score-0.032]
19 Initialize Vno uniformly at random from the unit sphere in Rd . [sent-53, score-0.1]
20 The first step is similar to using a learning rate of the form γn = c/(n + no ), as is often done in stochastic gradient descent implementations [1]. [sent-61, score-0.126]
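A sketch of how this step-size schedule and the random initialization might be wired into a one-pass streaming loop, using the Oja update for concreteness (the scaffolding and default parameter values are our own, not the paper's):

```python
import numpy as np

def incremental_top_eigenvector(stream, d, c=1.0, n0=100, rng=None):
    """One-pass Oja-style estimate of the top eigenvector from a stream of points in R^d."""
    rng = np.random.default_rng() if rng is None else rng
    v = rng.standard_normal(d)
    v /= np.linalg.norm(v)            # V_{n_o}: uniform at random on the unit sphere
    for n, x in enumerate(stream, start=1):
        gamma = c / (n + n0)          # step size gamma_n = c / (n + n_o)
        v += gamma * x * (x @ v)      # Oja update ...
        v /= np.linalg.norm(v)        # ... followed by renormalization
    return v
```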
21 , ±σed , where the ei are coordinate directions and 0 < σ < 1 is a small constant. [sent-72, score-0.096]
22 Suppose further that the distribution of X is specified by a single positive number p < 1: Pr(X = e1) = Pr(X = −e1) = p/2, and Pr(X = σei) = Pr(X = −σei) = (1 − p)/(2(d − 1)) for i > 1. Then X has mean zero and covariance diag(p, σ²(1 − p)/(d − 1), . [sent-73, score-0.036]
23 We will assume that p and σ are chosen so that p > σ²(1 − p)/(d − 1); in our notation, the top eigenvalues are then λ1 = p and λ2 = σ²(1 − p)/(d − 1), and the target vector is v∗ = e1. [sent-77, score-0.069]
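This example distribution is easy to sample and to sanity-check numerically; a short sketch (the parameter values are ours and only need to satisfy p > σ²(1 − p)/(d − 1)):

```python
import numpy as np

def sample_example(n, d, p=0.5, sigma=0.3, rng=None):
    """Draw n points: +/- e1 with prob. p, +/- sigma*e_i (i > 1) with prob. (1-p)/(2(d-1)) each."""
    rng = np.random.default_rng() if rng is None else rng
    coord = np.where(rng.random(n) < p, 0, rng.integers(1, d, size=n))
    sign = rng.choice([-1.0, 1.0], size=n)
    scale = np.where(coord == 0, 1.0, sigma)
    X = np.zeros((n, d))
    X[np.arange(n), coord] = sign * scale
    return X

X = sample_example(200000, d=5)
print(X.mean(axis=0))                  # close to zero
print(np.diag(X.T @ X / len(X)))       # close to diag(p, sigma^2 (1-p)/(d-1), ...)
```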
24 If Vn is ever orthogonal to some ei, it will remain so forever. [sent-78, score-0.093]
25 If Vno is initialized to a random data point, then with probability 1 − p, it will be assigned to some ei with i > 1, and will converge to a multiple of that same ei rather than to e1 . [sent-80, score-0.14]
26 Setting Vno to a random unit vector avoids this problem. [sent-82, score-0.047]
27 3 The setting of the learning rate In order to get a sense of what rates of convergence we might expect, let’s return to the example of a T random vector X with 2d possible values. [sent-85, score-0.134]
28 Since the Xn correspond to coordinate directions, each update changes just one coordinate of V: Xn = ±e1 ⟹ Vn,1 = Vn−1,1(1 + γn); Xn = ±σei ⟹ Vn,i = Vn−1,i(1 + σ²γn). Recall that we initialize Vno to a random vector from the unit sphere. [sent-87, score-0.124]
29 For simplicity, let’s just suppose that no = 0 and that this initial value is the all-ones vector (again, we don’t have to worry about normalization). [sent-88, score-0.079]
30 Hence (Vn,1)²/‖Vn‖² ≈ n^{2cλ1}/(n^{2cλ1} + (d − 1)n^{2cλ2}), so that Ψn ≈ (d − 1)n^{−2c(λ1−λ2)} once the first term dominates. (This is all very rough, but can be made precise by obtaining concentration bounds for ln Vn,i . [sent-92, score-0.029]
31 ) From this, we can see that it is not possible to achieve an O(1/n) rate unless c ≥ 1/(2(λ1 − λ2)). [sent-93, score-0.048]
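Under this heuristic the exponent of n in Ψn is controlled entirely by c and the eigenvalue gap; a tiny numeric check of the threshold (a sketch of the heuristic only, not of the formal result):

```python
lam1, lam2 = 0.5, 0.01125   # lambda1 = p, lambda2 = sigma^2 (1-p)/(d-1) for p=0.5, sigma=0.3, d=5
for c in (0.2, 1.0 / (2 * (lam1 - lam2)), 2.0):
    print(f"c = {c:.3f}   heuristic exponent of n in Psi_n: {-2 * c * (lam1 - lam2):.2f}")
# the exponent reaches -1 (an O(1/n) rate) exactly when c >= 1/(2 (lam1 - lam2))
```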
32 An interesting practical question, to which we do not have an answer, is how one would empirically set c without prior knowledge of the eigenvalue gap. [sent-95, score-0.036]
33 4 Nested sample spaces For n ≥ no , let Fn denote the sigma-field of all outcomes up to and including time n: Fn = σ(Vno , Xno +1 , . [sent-97, score-0.05]
34 For instance, if the initial Vno is picked uniformly at random from the surface of the unit sphere in Rd , then we’d expect Ψno ≈ 1 − 1/d. [sent-103, score-0.184]
35 This means that the initial rate of decrease is very small, because of the (1 − Ψn−1 ) term. [sent-104, score-0.072]
36 We use martingale large deviation bounds to bound the length of each epoch, and also to argue that Ψn does not regress. [sent-106, score-0.095]
37 In particular, we establish a sequence of times nj such that (with high probability) sup_{n≥nj} Ψn ≤ 1 − 2^j/d. (1) [sent-107, score-0.304]
38 The analysis of each epoch uses martingale arguments, but at the same time, assumes that Ψn remains bounded above. [sent-108, score-0.127]
39 Let Ω denote the sample space of all realizations (vno , xno +1 , xno +2 , . [sent-110, score-0.359]
40 For any δ > 0, we define a nested sequence of spaces Ω ⊃ Ωno ⊃ Ωno +1 ⊃ · · · such that each Ωn is Fn−1 -measurable, has probability P (Ωn ) ≥ 1 − δ, and moreover consists exclusively of realizations ω ∈ Ω that satisfy the constraints (1) up to and including time n − 1. [sent-114, score-0.112]
41 We can then build martingale arguments by restricting attention to Ωn when computing the conditional expectations of quantities at time n. [sent-115, score-0.101]
42 The eigenvalues λ1 ≥ λ2 ≥ · · · ≥ λd of A satisfy λ1 > λ2 . [sent-122, score-0.035]
43 Under these conditions, we get the following rate of convergence for the Krasulina update. [sent-124, score-0.097]
44 Set the step sizes to γn = c/n, where c = co/(2(λ1 − λ2)), and set the starting time to no ≥ (Ao B²c²d²/δ⁴) ln(1/δ). [sent-129, score-0.159]
45 Then there is a nested sequence of subsets of the sample space Ω ⊃ Ωno ⊃ Ωno+1 ⊃ · · · such that for any n ≥ no , we have: P(Ωn) ≥ 1 − δ and En[(Vn · v∗)²/‖Vn‖²] ≥ 1 − (c²B²e^{co/no}/(2(co − 2)))·(1/(n + 1)) − A1·(d/δ²)^a·((no + 1)/(n + 1))^{co/2}, where En denotes expectation restricted to Ωn . [sent-130, score-0.171]
46 Since co > 2, this bound is of the form En [Ψn ] = O(1/n). [sent-131, score-0.138]
47 We also remark that a small modification to the final step in the proof of the above yields a rate of En [Ψn ] = O(n−a ), with an identical definition of En [Ψn ]. [sent-133, score-0.048]
48 Such works do provide finite-sample guarantees, but they apply only to the batch case and/or are computationally intensive, rather than considering an efficient incremental algorithm. [sent-138, score-0.129]
49 Among incremental algorithms, the work of Warmuth and Kuzmin [15] describes and analyzes worst-case online PCA, using an experts-setting algorithm with a super-quadratic per-iteration cost. [sent-139, score-0.163]
50 More efficient general-purpose incremental PCA algorithms have lacked finite-sample analyses [2]. [sent-140, score-0.129]
51 The present paper directly analyzes the oldest known incremental PCA algorithms under relatively mild assumptions. [sent-142, score-0.183]
52 4 An additional piece of notation: we will use û to denote u/‖u‖, the unit vector in the direction of u ∈ Rd . [sent-149, score-0.047]
53 Thus, for instance, the Rayleigh quotient can be written G(v) = v̂^T A v̂. [sent-150, score-0.061]
54 Recall from the algorithm specification that we advance the clock so as to skip the pre-no phase. [sent-172, score-0.038]
55 If the initial estimate Vno is a random unit vector, then E[Ψno] = 1 − 1/d and, roughly speaking, Pr(Ψno > 1 − ε/d) = O(√ε). [sent-174, score-0.071]
56 Suppose the initial estimate Vno is chosen uniformly at random from the surface of the unit sphere in Rd . [sent-179, score-0.161]
57 Then for any 0 < ε < 1, if no ≥ 2B²c²d²/ε², we have Pr(sup_{n≥no} Ψn ≥ 1 − ε/d) ≤ √(2eε). [sent-181, score-0.061]
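The quantity this lemma controls at initialization is easy to probe by Monte Carlo; a sketch (our own experiment, shown only to illustrate the O(√ε) scaling of the failure probability at time no):

```python
import numpy as np

def prob_bad_start(d, eps, trials=200000, seed=0):
    """Estimate Pr(Psi >= 1 - eps/d) for V uniform on the unit sphere, taking v* = e1."""
    rng = np.random.default_rng(seed)
    V = rng.standard_normal((trials, d))
    psi0 = 1.0 - V[:, 0] ** 2 / np.sum(V ** 2, axis=1)
    return np.mean(psi0 >= 1.0 - eps / d)

for eps in (0.04, 0.16, 0.64):
    print(eps, prob_bad_start(d=20, eps=eps))   # roughly doubles as eps quadruples, i.e. ~ sqrt(eps)
```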
58 To prove this, we start with a simple recurrence for the moment-generating function of Ψn . [sent-182, score-0.061]
59 This relation shows how to define a supermartingale based on e^{tYn}, from which we can derive a large deviation bound on Yn . [sent-188, score-0.049]
60 Then, for any integer m and any ∆, t > 0, Pr(sup_{n≥m} Yn ≥ ∆) ≤ E[e^{tYm}] exp(−t(∆ − (β + tζ²/8))). [sent-193, score-0.061]
61 Suppose a vector V is picked uniformly at random from the surface of the unit sphere in Rd , where d ≥ 3. [sent-197, score-0.16]
62 3 Intermediate epochs of improvement We have seen that, for suitable ε and no , it is likely that Ψn ≤ 1 − ε/d for all n ≥ no . [sent-203, score-0.05]
63 We now define a series of epochs in which 1 − Ψn successively doubles, until Ψn finally drops below 1/2. [sent-204, score-0.071]
64 To do this, we specify intermediate goals (no, εo), (n1, ε1), (n2, ε2), . [sent-205, score-0.054]
65 , (nJ, εJ), where no < n1 < · · · < nJ and εo < ε1 < · · · < εJ = 1/2, with the intention that: For all 0 ≤ j ≤ J, we have sup_{n≥nj} Ψn ≤ 1 − εj. [sent-208, score-0.061]
66 Let Ω denote the sample space of all realizations (vno , xno +1 , xno +2 , . [sent-210, score-0.359]
67 Call this set Ω′: Ω′ = {ω ∈ Ω : sup_{n≥nj} Ψn(ω) ≤ 1 − εj for all 0 ≤ j ≤ J}. [sent-216, score-0.061]
68 For technical reasons, we also need to look at realizations that are good up to time n−1. [sent-217, score-0.057]
69 Specifically, for each n, define Ωn = {ω ∈ Ω : sup_{nj ≤ n′ < n} Ψn′(ω) ≤ 1 − εj for all 0 ≤ j ≤ J}. [sent-218, score-0.061]
70 We can talk about expectations under the distribution P restricted to subsets of Ω. [sent-220, score-0.031]
71 As for expectations with respect to Pn , for any function f : Ω → R, we define En f = (1/P(Ωn)) ∫_{Ωn} f(ω) P(dω). [sent-222, score-0.031]
72 Assume that γn = c/n, where c = co /(2(λ1 − λ2 )) and co > 0. [sent-226, score-0.276]
73 , (nJ, εJ) that satisfies the conditions εo = δ²/(8ed), and (3/2)εj ≤ εj+1 ≤ 2εj for 0 ≤ j < J, and εJ−1 ≤ 1/4, (3) (nj+1 + 1) ≥ e^{5/co}(nj + 1) for 0 ≤ j < J, as well as no ≥ (20c²B²/εo²) ln(4/δ). [sent-230, score-0.032]
74 Suppose also that γn = c/n, where c = co /(2(λ1 − λ2 )). [sent-236, score-0.138]
75 Then for any t > 0, En[e^{tΨn}] ≤ En[exp(tΨn−1(1 − co εj/n))] · exp(c²B²t(1 + 32t)/(4n²)). [sent-237, score-0.138]
76 Then for 0 ≤ j < J and any t > 0, Enj+1[e^{tΨnj+1}] ≤ exp(t(1 − εj+1) − tεj + (tc²B²(1 + 32t)/4)·(1/nj − 1/nj+1)). [sent-249, score-0.243]
77 Now that we have bounds on the moment-generating functions of intermediate Ψn , we can apply martingale deviation bounds, as in Lemma 2. [sent-250, score-0.126]
78 4 The final epoch Recall the definition of the intermediate goals (nj, εj) in (2), (3). [sent-259, score-0.111]
79 The final epoch is the period n ≥ nJ , at which point Ψn ≤ 1/2. [sent-260, score-0.057]
80 8 captures the rate at which Ψ decreases during this phase. [sent-264, score-0.048]
81 By solving this recurrence relation, and piecing together the various epochs, we get the overall convergence result of Theorem 1. [sent-268, score-0.11]
82 11 closely resembles the recurrence relation followed by the squared L2 distance from the optimum of stochastic gradient descent (SGD) on a strongly convex function [11]. [sent-271, score-0.163]
83 As Ψn → 0, the incremental PCA algorithms we study have convergence rates of the same form as SGD in this scenario. [sent-272, score-0.215]
84 3 Experiments When performing PCA in practice with massive d and a large/growing dataset, an incremental method like that of Krasulina or Oja remains practically viable, even as quadratic-time and -memory algorithms become increasingly impractical. [sent-273, score-0.129]
85 [2] have a more complete discussion of the empirical necessity of incremental PCA algorithms, including a version of Oja’s method which is shown to be extremely competitive in practice. [sent-275, score-0.129]
86 Since the efficiency benefits of these types of algorithms are well understood, we now instead focus on the effect of the learning rate on the performance of Oja’s algorithm (results for Krasulina’s are extremely similar). [sent-276, score-0.048]
87 1 and the discussion in the introduction that varying c (the constant in the learning rate) will influence the overall rate of convergence. [sent-281, score-0.048]
88 In particular, if c is low, then halving it can be expected to halve the exponent of n, and the slope of the log-log convergence graph (ref. [sent-282, score-0.049]
89 The dotted line in that figure is a convergence rate of 1/n, drawn as a guide. [sent-288, score-0.097]
90 First, the convergence rates of the two incremental schemes depend on the multiplier c in the learning rate γn . [sent-295, score-0.286]
91 If it is too low, convergence will be slower than O(1/n). [sent-296, score-0.049]
92 If it is too high, the constant in the rate of convergence will be large. [sent-297, score-0.097]
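Both regimes are easy to see in a small simulation: run Oja's rule once per value of c on synthetic Gaussian data and compare the final error (the data, dimensions, and parameter values here are our own choices, not the paper's experimental setup):

```python
import numpy as np

def run_oja(X, c, n0=50, seed=0):
    """Single pass of Oja's rule with step size c/(n + n0); returns the final estimate."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    v = rng.standard_normal(d)
    v /= np.linalg.norm(v)
    for n, x in enumerate(X, start=1):
        v += (c / (n + n0)) * x * (x @ v)
        v /= np.linalg.norm(v)
    return v

rng = np.random.default_rng(1)
d, N = 20, 100000
lam = np.array([1.0] + [0.2] * (d - 1))          # lambda1 = 1.0, lambda2 = 0.2
X = rng.standard_normal((N, d)) * np.sqrt(lam)   # mean zero, covariance diag(lam)
for c in (0.1, 1.0 / (2 * (1.0 - 0.2)), 2.5):    # below, at, and above 1/(2(lambda1 - lambda2))
    v = run_oja(X, c)
    print(f"c = {c:.3f}   Psi_N = {1.0 - v[0] ** 2:.2e}")
```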
93 Second, what can be said about incrementally estimating the top p eigenvectors, for p > 1? [sent-299, score-0.061]
94 In Oja’s algorithm, for instance, when a new data point Xn ∈ Rd arrives, the following update is performed: Wn = Vn−1 + γn Xn Xn^T Vn−1, Vn = orth(Wn), where the second step orthonormalizes the columns, for instance by Gram-Schmidt. [sent-301, score-0.052]
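A minimal sketch of this block update, using a QR factorization as one concrete way to realize the orth(·) step (Gram-Schmidt, as mentioned in the text, is another):

```python
import numpy as np

def oja_block_step(V, x, gamma):
    """One step of Oja's rule for the top p eigenvectors; V is d x p with orthonormal columns."""
    W = V + gamma * np.outer(x, x @ V)   # W_n = V_{n-1} + gamma_n * X_n X_n^T V_{n-1}
    Q, _ = np.linalg.qr(W)               # V_n = orth(W_n)
    return Q
```

The columns of Q span the same subspace as W, so sign flips introduced by the QR factorization are harmless for subspace estimation.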
95 It would be interesting to characterize the rate of convergence of this scheme. [sent-302, score-0.097]
96 A method of stochastic approximation for the determination of the least eigenvalue of a symmetrical matrix. [sent-349, score-0.067]
97 On stochastic approximation of the eigenvectors and eigenvalues of the expectation of a random matrix. [sent-364, score-0.115]
98 Making gradient descent optimal for strongly convex stochastic optimization. [sent-371, score-0.078]
99 Minimax rates of estimation for sparse PCA in high dimensions. [sent-387, score-0.037]
100 On the convergence of eigenspaces in kernel principal component analysis. [sent-406, score-0.098]
wordName wordTfidf (topN-words)
[('vn', 0.529), ('oja', 0.346), ('krasulina', 0.326), ('vno', 0.301), ('nj', 0.243), ('en', 0.176), ('xn', 0.166), ('fn', 0.163), ('xno', 0.151), ('co', 0.138), ('pca', 0.13), ('incremental', 0.129), ('zn', 0.121), ('ei', 0.07), ('martingale', 0.07), ('eigenvector', 0.068), ('pr', 0.063), ('quotient', 0.061), ('recurrence', 0.061), ('sup', 0.061), ('epoch', 0.057), ('realizations', 0.057), ('rd', 0.056), ('rayleigh', 0.055), ('sphere', 0.053), ('etyn', 0.05), ('pie', 0.05), ('epochs', 0.05), ('yn', 0.05), ('eigenvectors', 0.049), ('convergence', 0.049), ('lemma', 0.048), ('rate', 0.048), ('unit', 0.047), ('diego', 0.042), ('uc', 0.04), ('arora', 0.04), ('clock', 0.038), ('rates', 0.037), ('surface', 0.037), ('cotter', 0.036), ('av', 0.036), ('ao', 0.036), ('covariance', 0.036), ('eigenvalue', 0.036), ('san', 0.036), ('eigenvalues', 0.035), ('lemmas', 0.034), ('analyzes', 0.034), ('top', 0.034), ('nested', 0.033), ('suppose', 0.033), ('cmu', 0.032), ('warmuth', 0.032), ('pn', 0.032), ('conditions', 0.032), ('stochastic', 0.031), ('intermediate', 0.031), ('expectations', 0.031), ('wn', 0.031), ('ln', 0.029), ('principal', 0.029), ('outcomes', 0.028), ('estimators', 0.028), ('arrives', 0.027), ('incrementally', 0.027), ('instance', 0.027), ('coordinate', 0.026), ('sgd', 0.026), ('deviation', 0.025), ('nc', 0.025), ('theorem', 0.025), ('update', 0.025), ('gradient', 0.024), ('relation', 0.024), ('likewise', 0.024), ('dasgupta', 0.024), ('pick', 0.024), ('initial', 0.024), ('goals', 0.023), ('descent', 0.023), ('remain', 0.023), ('schemes', 0.023), ('picked', 0.023), ('worry', 0.022), ('terascale', 0.022), ('wildly', 0.022), ('mitliagkas', 0.022), ('spacing', 0.022), ('capped', 0.022), ('zwald', 0.022), ('spaces', 0.022), ('spectrum', 0.022), ('drops', 0.021), ('starting', 0.021), ('dud', 0.02), ('oldest', 0.02), ('ussr', 0.02), ('doubles', 0.02), ('eigenspaces', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 324 nips-2013-The Fast Convergence of Incremental PCA
Author: Akshay Balsubramani, Sanjoy Dasgupta, Yoav Freund
Abstract: We consider a situation in which we see samples Xn ∈ Rd drawn i.i.d. from some distribution with mean zero and unknown covariance A. We wish to compute the top eigenvector of A in an incremental fashion - with an algorithm that maintains an estimate of the top eigenvector in O(d) space, and incrementally adjusts the estimate with each new data point that arrives. Two classical such schemes are due to Krasulina (1969) and Oja (1983). We give finite-sample convergence rates for both. 1
2 0.25345826 242 nips-2013-PAC-Bayes-Empirical-Bernstein Inequality
Author: Ilya O. Tolstikhin, Yevgeny Seldin
Abstract: We present a PAC-Bayes-Empirical-Bernstein inequality. The inequality is based on a combination of the PAC-Bayesian bounding technique with an Empirical Bernstein bound. We show that when the empirical variance is significantly smaller than the empirical loss the PAC-Bayes-Empirical-Bernstein inequality is significantly tighter than the PAC-Bayes-kl inequality of Seeger (2002) and otherwise it is comparable. Our theoretical analysis is confirmed empirically on a synthetic example and several UCI datasets. The PAC-Bayes-Empirical-Bernstein inequality is an interesting example of an application of the PAC-Bayesian bounding technique to self-bounding functions. 1
3 0.14081553 269 nips-2013-Regression-tree Tuning in a Streaming Setting
Author: Samory Kpotufe, Francesco Orabona
Abstract: We consider the problem of maintaining the data-structures of a partition-based regression procedure in a setting where the training data arrives sequentially over time. We prove that it is possible to maintain such a structure in time O(log n) at any time step n while achieving a nearly-optimal regression rate of Õ(n^{−2/(2+d)}) in terms of the unknown metric dimension d. Finally we prove a new regression lower-bound which is independent of a given data size, and hence is more appropriate for the streaming setting. 1
4 0.10953279 289 nips-2013-Scalable kernels for graphs with continuous attributes
Author: Aasa Feragen, Niklas Kasenburg, Jens Petersen, Marleen de Bruijne, Karsten Borgwardt
Abstract: While graphs with continuous node attributes arise in many applications, state-of-the-art graph kernels for comparing continuous-attributed graphs suffer from a high runtime complexity. For instance, the popular shortest path kernel scales as O(n⁴), where n is the number of nodes. In this paper, we present a class of graph kernels with computational complexity O(n²(m + log n + δ² + d)), where δ is the graph diameter, m is the number of edges, and d is the dimension of the node attributes. Due to the sparsity and small diameter of real-world graphs, these kernels typically scale comfortably to large graphs. In our experiments, the presented kernels outperform state-of-the-art kernels in terms of speed and accuracy on classification benchmark datasets. 1
5 0.10949855 313 nips-2013-Stochastic Majorization-Minimization Algorithms for Large-Scale Optimization
Author: Julien Mairal
Abstract: Majorization-minimization algorithms consist of iteratively minimizing a majorizing surrogate of an objective function. Because of its simplicity and its wide applicability, this principle has been very popular in statistics and in signal processing. In this paper, we intend to make this principle scalable. We introduce a stochastic majorization-minimization scheme which is able to deal with largescale or possibly infinite data sets. When applied to convex optimization problems under suitable assumptions, we show that it achieves an expected convergence √ rate of O(1/ n) after n iterations, and of O(1/n) for strongly convex functions. Equally important, our scheme almost surely converges to stationary points for a large class of non-convex problems. We develop several efficient algorithms based on our framework. First, we propose a new stochastic proximal gradient method, which experimentally matches state-of-the-art solvers for large-scale ℓ1 logistic regression. Second, we develop an online DC programming algorithm for non-convex sparse estimation. Finally, we demonstrate the effectiveness of our approach for solving large-scale structured matrix factorization problems. 1
6 0.10528146 188 nips-2013-Memory Limited, Streaming PCA
7 0.10046844 314 nips-2013-Stochastic Optimization of PCA with Capped MSG
8 0.094870619 192 nips-2013-Minimax Theory for High-dimensional Gaussian Mixtures with Sparse Mean Separation
9 0.089754805 300 nips-2013-Solving the multi-way matching problem by permutation synchronization
10 0.082879975 116 nips-2013-Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA
11 0.081885427 232 nips-2013-Online PCA for Contaminated Data
12 0.077497892 233 nips-2013-Online Robust PCA via Stochastic Optimization
13 0.076999173 285 nips-2013-Robust Transfer Principal Component Analysis with Rank Constraints
14 0.073984146 224 nips-2013-On the Sample Complexity of Subspace Learning
15 0.066930763 52 nips-2013-Bayesian inference as iterated random functions with applications to sequential inference in graphical models
16 0.052553751 194 nips-2013-Model Selection for High-Dimensional Regression under the Generalized Irrepresentability Condition
17 0.050786972 49 nips-2013-Bayesian Inference and Online Experimental Design for Mapping Neural Microcircuits
18 0.049828928 91 nips-2013-Dirty Statistical Models
19 0.048871249 201 nips-2013-Multi-Task Bayesian Optimization
20 0.048719257 193 nips-2013-Mixed Optimization for Smooth Functions
topicId topicWeight
[(0, 0.14), (1, 0.036), (2, 0.083), (3, 0.005), (4, -0.005), (5, 0.072), (6, -0.031), (7, 0.062), (8, -0.073), (9, 0.01), (10, -0.01), (11, 0.076), (12, 0.054), (13, 0.062), (14, 0.087), (15, 0.036), (16, 0.051), (17, 0.029), (18, 0.005), (19, 0.073), (20, 0.055), (21, 0.022), (22, 0.023), (23, 0.081), (24, -0.145), (25, 0.04), (26, -0.096), (27, 0.154), (28, -0.113), (29, -0.034), (30, -0.062), (31, -0.022), (32, -0.133), (33, 0.053), (34, 0.026), (35, 0.006), (36, -0.054), (37, 0.021), (38, 0.035), (39, 0.002), (40, 0.003), (41, -0.063), (42, 0.325), (43, -0.032), (44, 0.123), (45, 0.029), (46, -0.015), (47, 0.023), (48, 0.06), (49, -0.023)]
simIndex simValue paperId paperTitle
same-paper 1 0.94065636 324 nips-2013-The Fast Convergence of Incremental PCA
Author: Akshay Balsubramani, Sanjoy Dasgupta, Yoav Freund
Abstract: We consider a situation in which we see samples Xn ∈ Rd drawn i.i.d. from some distribution with mean zero and unknown covariance A. We wish to compute the top eigenvector of A in an incremental fashion - with an algorithm that maintains an estimate of the top eigenvector in O(d) space, and incrementally adjusts the estimate with each new data point that arrives. Two classical such schemes are due to Krasulina (1969) and Oja (1983). We give finite-sample convergence rates for both. 1
2 0.67737031 242 nips-2013-PAC-Bayes-Empirical-Bernstein Inequality
Author: Ilya O. Tolstikhin, Yevgeny Seldin
Abstract: We present a PAC-Bayes-Empirical-Bernstein inequality. The inequality is based on a combination of the PAC-Bayesian bounding technique with an Empirical Bernstein bound. We show that when the empirical variance is significantly smaller than the empirical loss the PAC-Bayes-Empirical-Bernstein inequality is significantly tighter than the PAC-Bayes-kl inequality of Seeger (2002) and otherwise it is comparable. Our theoretical analysis is confirmed empirically on a synthetic example and several UCI datasets. The PAC-Bayes-Empirical-Bernstein inequality is an interesting example of an application of the PAC-Bayesian bounding technique to self-bounding functions. 1
3 0.54179746 192 nips-2013-Minimax Theory for High-dimensional Gaussian Mixtures with Sparse Mean Separation
Author: Martin Azizyan, Aarti Singh, Larry Wasserman
Abstract: While several papers have investigated computationally and statistically efficient methods for learning Gaussian mixtures, precise minimax bounds for their statistical performance as well as fundamental limits in high-dimensional settings are not well-understood. In this paper, we provide precise information theoretic bounds on the clustering accuracy and sample complexity of learning a mixture of two isotropic Gaussians in high dimensions under small mean separation. If there is a sparse subset of relevant dimensions that determine the mean separation, then the sample complexity only depends on the number of relevant dimensions and mean separation, and can be achieved by a simple computationally efficient procedure. Our results provide the first step of a theoretical basis for recent methods that combine feature selection and clustering. 1
4 0.51879072 313 nips-2013-Stochastic Majorization-Minimization Algorithms for Large-Scale Optimization
Author: Julien Mairal
Abstract: Majorization-minimization algorithms consist of iteratively minimizing a majorizing surrogate of an objective function. Because of its simplicity and its wide applicability, this principle has been very popular in statistics and in signal processing. In this paper, we intend to make this principle scalable. We introduce a stochastic majorization-minimization scheme which is able to deal with largescale or possibly infinite data sets. When applied to convex optimization problems under suitable assumptions, we show that it achieves an expected convergence √ rate of O(1/ n) after n iterations, and of O(1/n) for strongly convex functions. Equally important, our scheme almost surely converges to stationary points for a large class of non-convex problems. We develop several efficient algorithms based on our framework. First, we propose a new stochastic proximal gradient method, which experimentally matches state-of-the-art solvers for large-scale ℓ1 logistic regression. Second, we develop an online DC programming algorithm for non-convex sparse estimation. Finally, we demonstrate the effectiveness of our approach for solving large-scale structured matrix factorization problems. 1
5 0.50269353 269 nips-2013-Regression-tree Tuning in a Streaming Setting
Author: Samory Kpotufe, Francesco Orabona
Abstract: We consider the problem of maintaining the data-structures of a partition-based regression procedure in a setting where the training data arrives sequentially over time. We prove that it is possible to maintain such a structure in time O(log n) at any time step n while achieving a nearly-optimal regression rate of Õ(n^{−2/(2+d)}) in terms of the unknown metric dimension d. Finally we prove a new regression lower-bound which is independent of a given data size, and hence is more appropriate for the streaming setting. 1
6 0.43530795 224 nips-2013-On the Sample Complexity of Subspace Learning
7 0.42883462 252 nips-2013-Predictive PAC Learning and Process Decompositions
8 0.42808333 314 nips-2013-Stochastic Optimization of PCA with Capped MSG
9 0.4244591 188 nips-2013-Memory Limited, Streaming PCA
10 0.41253021 225 nips-2013-One-shot learning and big data with n=2
11 0.40705276 300 nips-2013-Solving the multi-way matching problem by permutation synchronization
12 0.39735195 197 nips-2013-Moment-based Uniform Deviation Bounds for $k$-means and Friends
13 0.38987502 199 nips-2013-More data speeds up training time in learning halfspaces over sparse vectors
14 0.37416577 232 nips-2013-Online PCA for Contaminated Data
15 0.36535215 31 nips-2013-Adaptivity to Local Smoothness and Dimension in Kernel Regression
16 0.34909931 272 nips-2013-Regularized Spectral Clustering under the Degree-Corrected Stochastic Blockmodel
17 0.34648219 111 nips-2013-Estimation, Optimization, and Parallelism when Data is Sparse
18 0.34150481 209 nips-2013-New Subsampling Algorithms for Fast Least Squares Regression
19 0.33123207 271 nips-2013-Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima
20 0.32474953 87 nips-2013-Density estimation from unweighted k-nearest neighbor graphs: a roadmap
topicId topicWeight
[(1, 0.267), (2, 0.017), (16, 0.032), (33, 0.129), (34, 0.107), (36, 0.03), (41, 0.034), (49, 0.023), (56, 0.105), (70, 0.033), (85, 0.047), (89, 0.054), (93, 0.029), (95, 0.014)]
simIndex simValue paperId paperTitle
1 0.83555168 112 nips-2013-Estimation Bias in Multi-Armed Bandit Algorithms for Search Advertising
Author: Min Xu, Tao Qin, Tie-Yan Liu
Abstract: In search advertising, the search engine needs to select the most profitable advertisements to display, which can be formulated as an instance of online learning with partial feedback, also known as the stochastic multi-armed bandit (MAB) problem. In this paper, we show that the naive application of MAB algorithms to search advertising for advertisement selection will produce sample selection bias that harms the search engine by decreasing expected revenue and “estimation of the largest mean” (ELM) bias that harms the advertisers by increasing game-theoretic player-regret. We then propose simple bias-correction methods with benefits to both the search engine and the advertisers. 1
2 0.78293139 328 nips-2013-The Total Variation on Hypergraphs - Learning on Hypergraphs Revisited
Author: Matthias Hein, Simon Setzer, Leonardo Jost, Syama Sundar Rangapuram
Abstract: Hypergraphs allow one to encode higher-order relationships in data and are thus a very flexible modeling tool. Current learning methods are either based on approximations of the hypergraphs via graphs or on tensor methods which are only applicable under special conditions. In this paper, we present a new learning framework on hypergraphs which fully uses the hypergraph structure. The key element is a family of regularization functionals based on the total variation on hypergraphs. 1
same-paper 3 0.77415413 324 nips-2013-The Fast Convergence of Incremental PCA
Author: Akshay Balsubramani, Sanjoy Dasgupta, Yoav Freund
Abstract: We consider a situation in which we see samples Xn ∈ Rd drawn i.i.d. from some distribution with mean zero and unknown covariance A. We wish to compute the top eigenvector of A in an incremental fashion - with an algorithm that maintains an estimate of the top eigenvector in O(d) space, and incrementally adjusts the estimate with each new data point that arrives. Two classical such schemes are due to Krasulina (1969) and Oja (1983). We give finite-sample convergence rates for both. 1
4 0.72557569 235 nips-2013-Online learning in episodic Markovian decision processes by relative entropy policy search
Author: Alexander Zimin, Gergely Neu
Abstract: We study the problem of online learning in finite episodic Markov decision processes (MDPs) where the loss function is allowed to change between episodes. The natural performance measure in this learning problem is the regret defined as the difference between the total loss of the best stationary policy and the total loss suffered by the learner. We assume that the learner is given access to a finite action space A and the state space X has a layered structure with L layers, so that state transitions are only possible between consecutive layers. We describe a variant of the recently proposed Relative Entropy Policy Search algorithm and show that its regret after T episodes is 2√(L|X||A|T log(|X||A|/L)) in the bandit setting and 2L√(T log(|X||A|/L)) in the full information setting, given that the learner has perfect knowledge of the transition probabilities of the underlying MDP. These guarantees largely improve previously known results under much milder assumptions and cannot be significantly improved under general assumptions. 1
5 0.69004643 333 nips-2013-Trading Computation for Communication: Distributed Stochastic Dual Coordinate Ascent
Author: Tianbao Yang
Abstract: We present and study a distributed optimization algorithm by employing a stochastic dual coordinate ascent method. Stochastic dual coordinate ascent methods enjoy strong theoretical guarantees and often have better performances than stochastic gradient descent methods in optimizing regularized loss minimization problems. It still lacks of efforts in studying them in a distributed framework. We make a progress along the line by presenting a distributed stochastic dual coordinate ascent algorithm in a star network, with an analysis of the tradeoff between computation and communication. We verify our analysis by experiments on real data sets. Moreover, we compare the proposed algorithm with distributed stochastic gradient descent methods and distributed alternating direction methods of multipliers for optimizing SVMs in the same distributed framework, and observe competitive performances. 1
6 0.62637317 159 nips-2013-Learning Prices for Repeated Auctions with Strategic Buyers
7 0.62175483 116 nips-2013-Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA
8 0.62062341 310 nips-2013-Statistical analysis of coupled time series with Kernel Cross-Spectral Density operators.
9 0.62002611 173 nips-2013-Least Informative Dimensions
10 0.61905557 308 nips-2013-Spike train entropy-rate estimation using hierarchical Dirichlet process priors
11 0.61840624 233 nips-2013-Online Robust PCA via Stochastic Optimization
12 0.61773467 49 nips-2013-Bayesian Inference and Online Experimental Design for Mapping Neural Microcircuits
13 0.61695284 350 nips-2013-Wavelets on Graphs via Deep Learning
14 0.6165458 285 nips-2013-Robust Transfer Principal Component Analysis with Rank Constraints
15 0.61572266 239 nips-2013-Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result
16 0.61525524 45 nips-2013-BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables
18 0.61446261 52 nips-2013-Bayesian inference as iterated random functions with applications to sequential inference in graphical models
19 0.61431032 236 nips-2013-Optimal Neural Population Codes for High-dimensional Stimulus Variables
20 0.61386734 282 nips-2013-Robust Multimodal Graph Matching: Sparse Coding Meets Graph Matching