nips nips2013 nips2013-209 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Paramveer Dhillon, Yichao Lu, Dean P. Foster, Lyle Ungar
Abstract: We address the problem of fast estimation of ordinary least squares (OLS) from large amounts of data (n ≫ p). We propose three methods which solve the big data problem by subsampling the covariance matrix using either a single or two stage estimation. All three run in the order of the size of the input, i.e. O(np), and our best method, Uluru, gives an error bound of O(√(p/n)) which is independent of the amount of subsampling as long as it is above a threshold. We provide theoretical bounds for our algorithms in the fixed design (with Randomized Hadamard preconditioning) as well as sub-Gaussian random design setting. We also compare the performance of our methods on synthetic and real-world datasets and show that if observations are i.i.d., sub-Gaussian then one can directly subsample without the expensive Randomized Hadamard preconditioning without loss of accuracy.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract: We address the problem of fast estimation of ordinary least squares (OLS) from large amounts of data (n ≫ p). [sent-10, score-0.049]
2 We propose three methods which solve the big data problem by subsampling the covariance matrix using either a single or two stage estimation. [sent-11, score-0.468]
3 All three run in time O(np), and our best method, Uluru, gives an error bound of O(√(p/n)) which is independent of the amount of subsampling as long as it is above a threshold. [sent-14, score-0.357]
4 We provide theoretical bounds for our algorithms in the fixed design (with Randomized Hadamard preconditioning) as well as sub-Gaussian random design setting. [sent-15, score-0.31]
5 We also compare the performance of our methods on synthetic and real-world datasets and show that if observations are i.i.d. [sent-16, score-0.104]
6 sub-Gaussian, then one can directly subsample without the expensive Randomized Hadamard preconditioning without loss of accuracy. [sent-19, score-0.224]
7 This paper focuses on the setting (n ≫ p), where n is the number of observations and p is the number of covariates or features, a common one for web-scale data. [sent-23, score-0.068]
8 The intuition behind this approach is that these frequency domain transformations uniformize the data and smear the signal across all the observations so that there are no longer any high leverage points whose omission could unduly influence the parameter estimates. [sent-26, score-0.114]
9 Another way of looking at this approach is as preconditioning the design matrix with a carefully constructed data-independent random matrix before subsampling. [sent-28, score-0.385]
10 It is worth noting that these approaches assume a fixed design setting. [sent-31, score-0.136]
11 Novel Subsampling Algorithms for OLS: We propose three novel algorithms for fast estimation of OLS which work by subsampling the covariance matrix. [sent-33, score-0.393]
12 Some recent results in [8] allow us to bound the difference between the parameter vector (w) we estimate from the subsampled data and the true underlying parameter (w0 ) which generates the data. [sent-34, score-0.132]
13 We provide theoretical analysis of our algorithms in the fixed design (with Randomized Hadamard preconditioning) as well as sub-Gaussian random design setting. [sent-35, score-0.293]
14 The error bound of our best algorithm, Uluru, is independent of the fraction of data subsampled (above a minimum threshold of sub-sampling) and depends only on the characteristics of the data/design matrix X. [sent-36, score-0.231]
15 Randomized Hadamard preconditioning not always needed: We show that the error bounds for all three algorithms are similar for both the fixed design and the sub-Gaussian random design settings. [sent-38, score-0.528]
16 In other words, one can either transform the data/design matrix via a Randomized Hadamard transform (fixed design setting) and then use any of our three algorithms, or, if the observations are i.i.d. sub-Gaussian (random design setting), directly subsample the data without any preconditioning. [sent-39, score-0.407]
17 Thus, another contribution of this paper is to show that if the observations are i.i.d. [sent-43, score-0.045]
18 and sub-Gaussian then one does not need the slow Randomized Hadamard preconditioning step and one can get similar accuracies much faster. [sent-46, score-0.163]
19 The remainder of the paper is organized as follows: In the next section, we formally define notation for the regression problem, then in Sections 3 and 4, we describe our algorithms and provide theorems characterizing their performance. [sent-47, score-0.074]
20 Finally, we compare the empirical performance of our methods on synthetic and real world data. [sent-48, score-0.061]
21 2 Notation and Preliminaries Let X be the n × p design matrix. [sent-49, score-0.136]
22 For the random design case we assume the rows of X are n i.i.d. draws from some distribution (e.g., sub-Gaussian). [sent-50, score-0.161]
23 Y is the real valued n × 1 response vector which contains n corresponding values of the dependent variable Y (in general we use bold letter for samples and normal letter for random variables or vectors). [sent-56, score-0.04]
24 More formally, we can write the true model as: Y = Xw_0 + ε, where ε ~iid N(0, σ²). The sample solution to the above equation (in matrix notation) is given by w_sample = (X'X)^{-1} X'Y, and by consistency of the OLS estimator we know that w_sample → w_0 as n → ∞. [sent-61, score-0.176]
25 Classical algorithms to estimate w_sample use QR decomposition or bidiagonalization [9] and they require O(np²) floating point operations. [sent-62, score-0.077]
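For concreteness, here is a minimal numpy sketch of the baseline full-data OLS fit just described; the function name and the use of np.linalg.lstsq are our own illustrative choices rather than anything from the paper, and the factorization-based solve inside lstsq is what incurs the O(np²) cost mentioned above.

import numpy as np

def ols_full(X, Y):
    # Baseline OLS estimate w_sample = (X'X)^{-1} X'Y.
    # np.linalg.lstsq solves the least-squares problem via a stable
    # factorization (cost O(n p^2)) instead of forming X'X explicitly.
    w, _, _, _ = np.linalg.lstsq(X, Y, rcond=None)
    return w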
26 Since our algorithms are based on subsampling the covariance matrix, we need some extra notation. [sent-63, score-0.357]
27 Let r = n_subs/n (< 1) be the subsampling ratio, giving the ratio of the number of observations (n_subs) in the subsampled matrix X_subs to the number of observations (n) in the original X matrix. [sent-64, score-0.904]
28 Let Xrem , Yrem denote the data and response vector for the remaining n − nsubs observations. [sent-68, score-0.356]
29 Also, let ΣXX be the covariance of X and ΣXY be the covariance between X and Y . [sent-70, score-0.108]
30 Then, for the fixed design setting Σ_XX = X'X/n and Σ_XY = X'Y/n, and for the random design setting Σ_XX = E(X'X) and Σ_XY = E(X'Y). [sent-71, score-0.272]
31 For the fixed design setting, MSE = (w_0 − w)' X'X (w_0 − w)/n = (w_0 − w)' Σ_XX (w_0 − w). For the random design setting, MSE = E_X (X w_0 − X w)² = (w_0 − w)' Σ_XX (w_0 − w). (Footnote 1: One of our algorithms (FS) is similar to [4], as we describe in Related Work.) [sent-73, score-0.293]
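As a small illustrative aid (the function and variable names are ours, not the paper's), the fixed-design MSE above can be computed as:

import numpy as np

def fixed_design_mse(X, w0, w):
    # MSE = (w0 - w)' (X'X/n) (w0 - w) = (w0 - w)' Sigma_XX (w0 - w)
    d = w0 - w
    return d @ (X.T @ X / X.shape[0]) @ d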
32 2.1 Design Matrix and Preconditioning: Thus far, we have not made any assumptions about the design matrix X. [sent-76, score-0.179]
33 In fact, our algorithms and analysis work for both fixed design and random design settings. [sent-77, score-0.293]
34 As mentioned earlier, our algorithms involve subsampling the observations, so we have to ensure that we do not leave behind any observations which are outliers/high leverage points; This is done differently for fixed and random designs. [sent-78, score-0.383]
35 For the fixed design setting the design matrix X is arbitrary and may contain high leverage points. [sent-79, score-0.35]
36 Therefore, before subsampling we precondition the matrix by a Randomized Hadamard/Fourier Transform [1, 4]; after preconditioning, the probability of having high leverage points in the new design matrix becomes very small. [sent-80, score-0.58]
37 On the other hand, if we assume X to be random design and its rows are i.i.d. [sent-81, score-0.161]
38 draws from some nice distribution like sub-Gaussian, then the probability of having high leverage points is very small and we can happily subsample X without preconditioning. [sent-84, score-0.096]
39 In this paper we analyze both the fixed as well as sub-Gaussian random design settings. [sent-85, score-0.136]
40 Since the fixed design analysis would involve transforming the design matrix with a preconditioner before subsampling, some background on SRHT is warranted. [sent-86, score-0.334]
41 Subsampled Randomized Hadamard Transform (SRHT): In the fixed design setting we precondition and subsample the data with an n_subs × n randomized Hadamard transform matrix Θ (= √(n/n_subs) · RHD), applied as Θ · X. [sent-87, score-0.988]
42 The matrices R, H, and D are defined as: • R ∈ R^{n_subs × n} is a set of n_subs rows from the n × n identity matrix, where the rows are chosen uniformly at random without replacement. [sent-88, score-0.406]
43 • H ∈ R^{n×n} is a normalized Walsh-Hadamard matrix. • D ∈ R^{n×n} is a random diagonal matrix whose entries are independent random signs, i.e., ±1 each with probability 1/2. [sent-89, score-0.043]
44 It is worth noting that HD is the preconditioning matrix and R is the subsampling matrix. [sent-94, score-0.488]
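A small illustrative sketch of applying Θ = √(n/n_subs)·RHD is given below, under the assumptions stated above (H taken as the normalized Walsh-Hadamard matrix, n a power of 2). It materializes the dense Hadamard matrix via scipy.linalg.hadamard purely for clarity; a practical implementation would use a fast O(np log n) Walsh-Hadamard transform instead, and the function and variable names are our own.

import numpy as np
from scipy.linalg import hadamard  # dense Walsh-Hadamard matrix; requires n to be a power of 2

def srht(X, n_subs, seed=0):
    # Apply Theta = sqrt(n / n_subs) * R H D to X (illustrative, not the fast transform).
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    d = rng.choice([-1.0, 1.0], size=n)                 # diagonal of D: independent random signs
    H = hadamard(n) / np.sqrt(n)                        # normalized Hadamard matrix
    HDX = H @ (d[:, None] * X)                          # preconditioning step: H D X
    rows = rng.choice(n, size=n_subs, replace=False)    # R: rows chosen uniformly without replacement
    return np.sqrt(n / n_subs) * HDX[rows]              # Theta X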
45 3 Three subsampling algorithms for fast linear regression: All our algorithms subsample the X matrix, followed by a single- or two-stage fit, and are described below. [sent-99, score-0.517]
46 The algorithms given below are for the random design setting. [sent-100, score-0.157]
47 The algorithms for the fixed design are exactly the same as below, except that Xsubs , Ysubs are replaced by Θ · X, Θ · Y and Xrem , Yrem with Θrem · X, Θrem · Y, where Θ is the SRHT matrix defined in the previous section and Θrem is the same as Θ, except that R is of size nrem × n. [sent-101, score-0.237]
48 Full Subsampling (FS): Full subsampling provides a baseline for comparison; In it we simply r-subsample (X, Y) as (Xsubs , Ysubs ) and use the subsampled data to estimate both the ΣXX and ΣXY covariance matrices. [sent-103, score-0.45]
49 Covariance Subsampling (CovS): In Covariance Subsampling we r-subsample X as Xsubs only to estimate the ΣXX covariance matrix; we use all the n observations to compute the ΣXY covariance matrix. [sent-104, score-0.153]
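Minimal numpy sketches of FS and CovS as described above, in the random design setting (no preconditioning); the function names, the seed argument, and drawing the subsample uniformly without replacement are our own illustrative choices.

import numpy as np

def fs_estimate(X, Y, r, seed=0):
    # Full Subsampling (FS): estimate both Sigma_XX and Sigma_XY from the r-subsample.
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    idx = rng.choice(n, size=int(r * n), replace=False)
    Xs, Ys = X[idx], Y[idx]
    return np.linalg.solve(Xs.T @ Xs, Xs.T @ Ys)

def covs_estimate(X, Y, r, seed=0):
    # Covariance Subsampling (CovS): subsample only for Sigma_XX; use all n rows for Sigma_XY.
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    idx = rng.choice(n, size=int(r * n), replace=False)
    Sxx = X[idx].T @ X[idx] / len(idx)   # subsampled covariance of X
    Sxy = X.T @ Y / n                    # full-data covariance between X and Y
    return np.linalg.solve(Sxx, Sxy)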
50 Uluru: Uluru is a two-stage fitting algorithm. [sent-105, score-0.051]
51 In the first stage it uses the r-subsampled (X, Y) to get an initial estimate of w (i.e., the FS estimate w_FS). [sent-106, score-0.051]
52 In the second stage it uses the remaining data (X_rem, Y_rem) to estimate the bias of the first-stage estimator, w_correct = w_0 − w_FS. [sent-109, score-0.273]
53 The final estimate (w_Uluru) is taken to be a weighted combination (generally just the sum) of the FS estimator and the second-stage estimator (w_correct). [sent-110, score-0.205]
54 In the second stage, since w_FS is known, on the remaining data we have Y_rem = X_rem w_0 + ε_rem, and hence R_rem = Y_rem − X_rem · w_FS = X_rem (w_0 − w_FS) + ε_rem. The above formula shows we can estimate w_correct = w_0 − w_FS with another regression, i.e., by regressing the residuals R_rem on X_rem. [sent-112, score-0.348]
55 Finally, we combine w_FS and w_correct to get w_Uluru. [sent-116, score-0.262]
56 The estimate w_correct can be seen as an almost unbiased estimate of the error w_0 − w_FS, so we correct almost all of the error, hence reducing the bias. [sent-117, score-0.187]
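A sketch of Uluru consistent with the description above: stage 1 is the FS fit on the subsample, stage 2 uses the residuals of the remaining data to estimate w_correct, and the two are summed. We reuse the subsampled covariance in stage 2 so the whole procedure stays O(np); the exact scaling and weighting of the combination used in the paper may differ, so treat this purely as an illustration.

import numpy as np

def uluru_estimate(X, Y, r, seed=0):
    # Two-stage Uluru fit (illustrative sketch, unweighted sum of the two stages).
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    mask = np.zeros(n, dtype=bool)
    mask[rng.choice(n, size=int(r * n), replace=False)] = True
    Xs, Ys = X[mask], Y[mask]            # subsample (X_subs, Y_subs)
    Xr, Yr = X[~mask], Y[~mask]          # remaining data (X_rem, Y_rem)

    Sxx = Xs.T @ Xs / len(Ys)                                 # subsampled covariance
    w_fs = np.linalg.solve(Sxx, Xs.T @ Ys / len(Ys))          # stage 1: FS estimate
    R_rem = Yr - Xr @ w_fs                                    # residuals on remaining data
    w_correct = np.linalg.solve(Sxx, Xr.T @ R_rem / len(Yr))  # stage 2: bias estimate
    return w_fs + w_correct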
57 4.1 Fixed Design Setting: Here we assume preconditioning and subsampling with SRHT as described in previous sections. [sent-125, score-0.445]
58 Let's consider the case n_subs ≪ n_rem, since only in this situation does subsampling reduce the computational cost significantly. [sent-134, score-0.675]
59 4.2 Sub-Gaussian Bounds: Under the assumption that the rows of the design matrix X are i.i.d. sub-Gaussian draws, similar bounds hold. [sent-148, score-0.204]
60 Consider the case r ≪ 1, since this is the only case where subsampling reduces computational cost significantly. [sent-151, score-0.282]
61 These errors are exactly the same as in the fixed design case. [sent-153, score-0.136]
62 3 Discussion We can make a few salient observations from the error expressions for the algorithms presented in Remarks 1 & 3. [sent-155, score-0.103]
63 The second term for the error of the Uluru algorithm does not contain r at all. [sent-156, score-0.037]
64 If it is the dominating term, which is the case if r > O(√(p/n)) (Eq. 7), then the error of Uluru is approximately O(σ√(p/n)), which is completely independent of r. [sent-157, score-0.067]
65 Thus, as long as Eq. 7 holds, the error bound for Uluru is not a function of r. [sent-161, score-0.055]
66 In other words, as long as Eq. 7 holds, we do not increase the error by using less data in estimating the covariance matrix in Uluru. [sent-163, score-0.134]
67 The FS algorithm does not have this property, since its error is proportional to 1/√r. [sent-164, score-0.037]
68 Similarly, for the CovS algorithm, when r > O(‖w_0‖²/σ²) (Eq. 8), the second term dominates and we can conclude that the error does not change with r. [sent-165, score-0.037]
69 In the low-noise regime, Eq. 8 fails since it implies r > O(1), and the error bound of the CovS algorithm then increases as r decreases. [sent-170, score-0.055]
70 To sum this up, Uluru has the nice property that its error bound does not increase as r gets smaller as long as r is greater than a threshold. [sent-171, score-0.055]
71 This threshold is completely independent of how noisy the data is and only depends on the characteristics of the design/data matrix (n, p). [sent-172, score-0.043]
72 4.4 Run Time Complexity: Table 1 summarizes the run time complexity and theoretically predicted error bounds for all the methods. [sent-174, score-0.054]
73 5 Experiments In this section we elucidate the relative merits of our methods by comparing their empirical performance on both synthetic and real world datasets. [sent-176, score-0.061]
74 nsubs is the number of observations in the subsample, n is the number of observations, and p is the number of predictors. [sent-183, score-0.401]
75 * indicates that no uniform error bounds are known. [sent-184, score-0.054]
76 The subsample size, n_subs, for FS should be O(n/p) to keep the CPU time O(np), which leads to an accuracy of p²/n. [sent-185, score-0.417]
77 For Uluru, to keep the CPU time O(np), nsubs should be O(n/p) or equivalently r = O(1/p). [sent-191, score-0.356]
78 As stated in the discussion after the theorems, when r ≥ O(√(p/n)) (in this setup we want r = O(1/p), which implies n ≥ O(p³)), Uluru has error bound O(√(p/n)) no matter what signal-to-noise ratio the problem has. [sent-192, score-0.055]
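As a hedged worked example (the numbers are ours, chosen only for illustration and not taken from the paper): with n = 10^6 and p = 100, the budget-driven choice r = 1/p = 0.01 exactly meets the threshold √(p/n) = √(10^-4) = 0.01, matching the requirement n ≥ O(p³) = 10^6; for any larger n, the O(np)-time choice r = 1/p sits strictly above the threshold and the O(√(p/n)) bound for Uluru applies.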
79 5.2 Synthetic Datasets: We generated synthetic data by distributing the signal uniformly across all the p singular values, picking the p singular values to be λ_i = 1/i², i = 1:p, and further varying the amount of signal. [sent-194, score-0.056]
80 5.3 Real World Datasets: We also compared the performance of the algorithms on two UCI datasets: CPUSMALL (n=8192, p=12) and CADATA (n=20640, p=8), and the PERMA sentiment analysis dataset described in [11] (n=1505, p=30), which uses LR-MVL word embeddings [12] as features. [sent-196, score-0.062]
81 5.4 Results: The results for synthetic data are shown in Figure 1 (top row) and for real-world datasets in Figure 1 (bottom row). [sent-198, score-0.084]
82 To generate the plots, we vary the amount of data used in the subsampling, nsubs , from 1. [sent-199, score-0.376]
83 For FS, this simply means using a fraction of the data; for CovS and Uluru, only the data for the covariance matrix is subsampled. [sent-201, score-0.116]
84 For the real datasets we do not know the true population parameter, w_0, so we replace it with its consistent estimator w_MLE, which is computed using standard OLS on the entire dataset. [sent-203, score-0.044]
85 The horizontal gray line in the figures is the overfitting point; it is the error generated by a ŵ vector of all zeros. [sent-204, score-0.037]
86 Looking at the results, we can see two trends for the synthetic data. [sent-206, score-0.036]
87 Firstly, our algorithms with no preconditioning are much faster than their counterparts with preconditioning and give similar accuracies. [sent-207, score-0.366]
88 For real world datasets also, Uluru is almost always better than the other algorithms, both with and without preconditioning. [sent-209, score-0.048]
Figure 1: Results for synthetic datasets (n=4096, p=8) in the top row and for (PERMA, CPUSMALL, CADATA, left to right) in the bottom row; the x-axis is # FLOPS/(n·p). [sent-242, score-0.059]
90 In all settings, we varied the amount of subsampling from 1. [sent-244, score-0.302]
91 Solid lines indicate direct subsampling (i.i.d. random design) and dashed lines indicate fixed design with Randomized Hadamard preconditioning. [sent-249, score-0.136]
92 Most prior work considers solving an overdetermined linear system (ŵ = arg min_{w∈R^p} ‖Xw − Y‖₂), while we are working in a statistical setup (a regression problem Y = Xβ + ε), which leads to a different error analysis. [sent-256, score-0.087]
93 Our FS algorithm is essentially the same as the subsampling algorithm proposed by [4]. [sent-257, score-0.282]
94 However, our theoretical analysis of it is novel, and furthermore they only consider it in the fixed design setting with Hadamard preconditioning. [sent-258, score-0.136]
95 7 Conclusion In this paper we proposed three subsampling methods for faster least squares regression. [sent-260, score-0.35]
96 Our best method, Uluru, gave an error bound which is independent of the amount of subsampling as long as it is above a threshold. [sent-262, score-0.357]
97 If one further assumes that the observations are sub-Gaussian (perhaps as a result of a preprocessing step, or simply because they are 0/1 or Gaussian), then subsampling methods without a Randomized Hadamard transformation suffice. [sent-267, score-0.3]
98 As shown in our experiments, dropping the Randomized Hadamard transformation significantly speeds up the algorithms and, in the i.i.d. sub-Gaussian setting, does not reduce accuracy. [sent-268, score-0.039]
99 : Improved matrix algorithms via the subsampled randomized hadamard transform. [sent-273, score-0.457]
100 : A fast randomized algorithm for overdetermined linear least-squares regression. [sent-280, score-0.162]
wordName wordTfidf (topN-words)
[('uluru', 0.469), ('nsubs', 0.356), ('subsampling', 0.282), ('xrem', 0.262), ('covs', 0.244), ('fs', 0.204), ('xsubs', 0.187), ('hadamard', 0.165), ('preconditioning', 0.163), ('wf', 0.156), ('wcorrect', 0.15), ('ols', 0.138), ('design', 0.136), ('yrem', 0.131), ('srht', 0.116), ('randomized', 0.114), ('subsampled', 0.114), ('luru', 0.112), ('nr', 0.111), ('rem', 0.099), ('flops', 0.083), ('xx', 0.078), ('rrem', 0.075), ('ysubs', 0.075), ('transform', 0.072), ('ln', 0.07), ('subsample', 0.061), ('failure', 0.057), ('wsample', 0.056), ('xy', 0.056), ('covariance', 0.054), ('stage', 0.051), ('ungar', 0.046), ('observations', 0.045), ('matrix', 0.043), ('wu', 0.043), ('precondition', 0.041), ('nrem', 0.037), ('perma', 0.037), ('tygert', 0.037), ('wcovs', 0.037), ('xsub', 0.037), ('error', 0.037), ('np', 0.036), ('corr', 0.036), ('synthetic', 0.036), ('leverage', 0.035), ('ep', 0.035), ('foster', 0.033), ('cadata', 0.033), ('srft', 0.033), ('theorems', 0.033), ('hn', 0.032), ('squares', 0.031), ('cpusmall', 0.03), ('overdetermined', 0.03), ('vershynin', 0.03), ('dominating', 0.03), ('dhillon', 0.029), ('rn', 0.028), ('ip', 0.026), ('fourier', 0.025), ('rows', 0.025), ('world', 0.025), ('se', 0.024), ('oating', 0.024), ('xed', 0.024), ('mahoney', 0.023), ('datasets', 0.023), ('transformed', 0.023), ('covariates', 0.023), ('cpu', 0.023), ('johns', 0.022), ('hopkins', 0.022), ('september', 0.022), ('estimator', 0.021), ('algorithms', 0.021), ('regression', 0.02), ('letter', 0.02), ('big', 0.02), ('amount', 0.02), ('april', 0.019), ('supplementary', 0.019), ('material', 0.019), ('transforming', 0.019), ('fraction', 0.019), ('faster', 0.019), ('transformation', 0.018), ('three', 0.018), ('fast', 0.018), ('bound', 0.018), ('embeddings', 0.018), ('remark', 0.017), ('bounds', 0.017), ('please', 0.017), ('uniformize', 0.017), ('toledo', 0.017), ('homoskedastic', 0.017), ('omission', 0.017), ('paramveer', 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 209 nips-2013-New Subsampling Algorithms for Fast Least Squares Regression
Author: Paramveer Dhillon, Yichao Lu, Dean P. Foster, Lyle Ungar
Abstract: We address the problem of fast estimation of ordinary least squares (OLS) from large amounts of data (n ≫ p). We propose three methods which solve the big data problem by subsampling the covariance matrix using either a single or two stage estimation. All three run in the order of the size of the input, i.e. O(np), and our best method, Uluru, gives an error bound of O(√(p/n)) which is independent of the amount of subsampling as long as it is above a threshold. We provide theoretical bounds for our algorithms in the fixed design (with Randomized Hadamard preconditioning) as well as sub-Gaussian random design setting. We also compare the performance of our methods on synthetic and real-world datasets and show that if observations are i.i.d., sub-Gaussian then one can directly subsample without the expensive Randomized Hadamard preconditioning without loss of accuracy.
2 0.1521543 120 nips-2013-Faster Ridge Regression via the Subsampled Randomized Hadamard Transform
Author: Yichao Lu, Paramveer Dhillon, Dean P. Foster, Lyle Ungar
Abstract: We propose a fast algorithm for ridge regression when the number of features is much larger than the number of observations (p ≫ n). The standard way to solve ridge regression in this setting works in the dual space and gives a running time of O(n²p). Our algorithm Subsampled Randomized Hadamard Transform-Dual Ridge Regression (SRHT-DRR) runs in time O(np log(n)) and works by preconditioning the design matrix by a Randomized Walsh-Hadamard Transform with a subsequent subsampling of features. We provide risk bounds for our SRHT-DRR algorithm in the fixed design setting and show experimental results on synthetic and real datasets.
3 0.068091631 96 nips-2013-Distributed Representations of Words and Phrases and their Compositionality
Author: Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, Jeff Dean
Abstract: The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. In this paper we present several extensions that improve both the quality of the vectors and the training speed. By subsampling of the frequent words we obtain significant speedup and also learn more regular word representations. We also describe a simple alternative to the hierarchical softmax called negative sampling. An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. For example, the meanings of “Canada” and “Air” cannot be easily combined to obtain “Air Canada”. Motivated by this example, we present a simple method for finding phrases in text, and show that learning good vector representations for millions of phrases is possible.
4 0.063489459 109 nips-2013-Estimating LASSO Risk and Noise Level
Author: Mohsen Bayati, Murat A. Erdogdu, Andrea Montanari
Abstract: We study the fundamental problems of variance and risk estimation in high dimensional statistical modeling. In particular, we consider the problem of learning a coefficient vector θ0 ∈ Rp from noisy linear observations y = Xθ0 + w ∈ Rn (p > n) and the popular estimation procedure of solving the 1 -penalized least squares objective known as the LASSO or Basis Pursuit DeNoising (BPDN). In this context, we develop new estimators for the 2 estimation risk θ − θ0 2 and the variance of the noise when distributions of θ0 and w are unknown. These can be used to select the regularization parameter optimally. Our approach combines Stein’s unbiased risk estimate [Ste81] and the recent results of [BM12a][BM12b] on the analysis of approximate message passing and the risk of LASSO. We establish high-dimensional consistency of our estimators for sequences of matrices X of increasing dimensions, with independent Gaussian entries. We establish validity for a broader class of Gaussian designs, conditional on a certain conjecture from statistical physics. To the best of our knowledge, this result is the first that provides an asymptotically consistent risk estimator for the LASSO solely based on data. In addition, we demonstrate through simulations that our variance estimation outperforms several existing methods in the literature. 1
5 0.061953366 68 nips-2013-Confidence Intervals and Hypothesis Testing for High-Dimensional Statistical Models
Author: Adel Javanmard, Andrea Montanari
Abstract: Fitting high-dimensional statistical models often requires the use of non-linear parameter estimation procedures. As a consequence, it is generally impossible to obtain an exact characterization of the probability distribution of the parameter estimates. This in turn implies that it is extremely challenging to quantify the uncertainty associated with a certain parameter estimate. Concretely, no commonly accepted procedure exists for computing classical measures of uncertainty and statistical significance as confidence intervals or p-values. We consider here a broad class of regression problems, and propose an efficient algorithm for constructing confidence intervals and p-values. The resulting confidence intervals have nearly optimal size. When testing for the null hypothesis that a certain parameter is vanishing, our method has nearly optimal power. Our approach is based on constructing a ‘de-biased’ version of regularized Mestimators. The new construction improves over recent work in the field in that it does not assume a special structure on the design matrix. Furthermore, proofs are remarkably simple. We test our method on a diabetes prediction problem. 1
6 0.060496237 306 nips-2013-Speeding up Permutation Testing in Neuroimaging
7 0.056274779 242 nips-2013-PAC-Bayes-Empirical-Bernstein Inequality
8 0.052081782 297 nips-2013-Sketching Structured Matrices for Faster Nonlinear Regression
9 0.048897795 188 nips-2013-Memory Limited, Streaming PCA
10 0.046094663 39 nips-2013-Approximate Gaussian process inference for the drift function in stochastic differential equations
11 0.045682359 145 nips-2013-It is all in the noise: Efficient multi-task Gaussian process inference with structured residuals
12 0.045398284 179 nips-2013-Low-Rank Matrix and Tensor Completion via Adaptive Sampling
13 0.038182668 4 nips-2013-A Comparative Framework for Preconditioned Lasso Algorithms
14 0.037930209 91 nips-2013-Dirty Statistical Models
15 0.037705693 356 nips-2013-Zero-Shot Learning Through Cross-Modal Transfer
16 0.036439922 185 nips-2013-Matrix Completion From any Given Set of Observations
17 0.035618648 206 nips-2013-Near-Optimal Entrywise Sampling for Data Matrices
18 0.035224952 143 nips-2013-Integrated Non-Factorized Variational Inference
19 0.034442328 271 nips-2013-Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima
20 0.033155777 194 nips-2013-Model Selection for High-Dimensional Regression under the Generalized Irrepresentability Condition
topicId topicWeight
[(0, 0.096), (1, 0.034), (2, 0.031), (3, 0.018), (4, 0.005), (5, 0.019), (6, -0.01), (7, 0.006), (8, -0.063), (9, 0.013), (10, 0.001), (11, 0.019), (12, -0.05), (13, -0.012), (14, -0.012), (15, -0.014), (16, 0.017), (17, 0.045), (18, -0.053), (19, 0.022), (20, -0.011), (21, -0.081), (22, -0.042), (23, 0.025), (24, -0.02), (25, 0.081), (26, -0.026), (27, 0.031), (28, -0.02), (29, -0.069), (30, 0.005), (31, -0.048), (32, -0.022), (33, 0.023), (34, 0.027), (35, -0.03), (36, -0.063), (37, -0.067), (38, 0.087), (39, -0.077), (40, -0.034), (41, -0.122), (42, 0.038), (43, -0.028), (44, -0.137), (45, -0.041), (46, -0.072), (47, -0.058), (48, 0.011), (49, 0.009)]
simIndex simValue paperId paperTitle
same-paper 1 0.87688267 209 nips-2013-New Subsampling Algorithms for Fast Least Squares Regression
Author: Paramveer Dhillon, Yichao Lu, Dean P. Foster, Lyle Ungar
Abstract: We address the problem of fast estimation of ordinary least squares (OLS) from large amounts of data (n ≫ p). We propose three methods which solve the big data problem by subsampling the covariance matrix using either a single or two stage estimation. All three run in the order of the size of the input, i.e. O(np), and our best method, Uluru, gives an error bound of O(√(p/n)) which is independent of the amount of subsampling as long as it is above a threshold. We provide theoretical bounds for our algorithms in the fixed design (with Randomized Hadamard preconditioning) as well as sub-Gaussian random design setting. We also compare the performance of our methods on synthetic and real-world datasets and show that if observations are i.i.d., sub-Gaussian then one can directly subsample without the expensive Randomized Hadamard preconditioning without loss of accuracy. 1
2 0.72203743 120 nips-2013-Faster Ridge Regression via the Subsampled Randomized Hadamard Transform
Author: Yichao Lu, Paramveer Dhillon, Dean P. Foster, Lyle Ungar
Abstract: We propose a fast algorithm for ridge regression when the number of features is much larger than the number of observations (p ≫ n). The standard way to solve ridge regression in this setting works in the dual space and gives a running time of O(n²p). Our algorithm Subsampled Randomized Hadamard Transform-Dual Ridge Regression (SRHT-DRR) runs in time O(np log(n)) and works by preconditioning the design matrix by a Randomized Walsh-Hadamard Transform with a subsequent subsampling of features. We provide risk bounds for our SRHT-DRR algorithm in the fixed design setting and show experimental results on synthetic and real datasets. 1
3 0.57922226 130 nips-2013-Generalizing Analytic Shrinkage for Arbitrary Covariance Structures
Author: Daniel Bartz, Klaus-Robert Müller
Abstract: Analytic shrinkage is a statistical technique that offers a fast alternative to crossvalidation for the regularization of covariance matrices and has appealing consistency properties. We show that the proof of consistency requires bounds on the growth rates of eigenvalues and their dispersion, which are often violated in data. We prove consistency under assumptions which do not restrict the covariance structure and therefore better match real world data. In addition, we propose an extension of analytic shrinkage –orthogonal complement shrinkage– which adapts to the covariance structure. Finally we demonstrate the superior performance of our novel approach on data from the domains of finance, spoken letter and optical character recognition, and neuroscience. 1
4 0.54590231 297 nips-2013-Sketching Structured Matrices for Faster Nonlinear Regression
Author: Haim Avron, Vikas Sindhwani, David Woodruff
Abstract: Motivated by the desire to extend fast randomized techniques to nonlinear lp regression, we consider a class of structured regression problems. These problems involve Vandermonde matrices which arise naturally in various statistical modeling settings, including classical polynomial fitting problems, additive models and approximations to recently developed randomized techniques for scalable kernel methods. We show that this structure can be exploited to further accelerate the solution of the regression problem, achieving running times that are faster than “input sparsity”. We present empirical results confirming both the practical value of our modeling framework, as well as speedup benefits of randomized regression. 1
5 0.53050989 225 nips-2013-One-shot learning and big data with n=2
Author: Lee H. Dicker, Dean P. Foster
Abstract: We model a “one-shot learning” situation, where very few observations y1 , ..., yn ∈ R are available. Associated with each observation yi is a very highdimensional vector xi ∈ Rd , which provides context for yi and enables us to predict subsequent observations, given their own context. One of the salient features of our analysis is that the problems studied here are easier when the dimension of xi is large; in other words, prediction becomes easier when more context is provided. The proposed methodology is a variant of principal component regression (PCR). Our rigorous analysis sheds new light on PCR. For instance, we show that classical PCR estimators may be inconsistent in the specified setting, unless they are multiplied by a scalar c > 1; that is, unless the classical estimator is expanded. This expansion phenomenon appears to be somewhat novel and contrasts with shrinkage methods (c < 1), which are far more common in big data analyses. 1
6 0.51868588 145 nips-2013-It is all in the noise: Efficient multi-task Gaussian process inference with structured residuals
7 0.48870799 68 nips-2013-Confidence Intervals and Hypothesis Testing for High-Dimensional Statistical Models
8 0.4831275 302 nips-2013-Sparse Inverse Covariance Estimation with Calibration
9 0.47780713 76 nips-2013-Correlated random features for fast semi-supervised learning
10 0.46434486 142 nips-2013-Information-theoretic lower bounds for distributed statistical estimation with communication constraints
11 0.45120731 31 nips-2013-Adaptivity to Local Smoothness and Dimension in Kernel Regression
12 0.43849671 242 nips-2013-PAC-Bayes-Empirical-Bernstein Inequality
13 0.43303072 198 nips-2013-More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server
14 0.4257955 178 nips-2013-Locally Adaptive Bayesian Multivariate Time Series
15 0.42446354 96 nips-2013-Distributed Representations of Words and Phrases and their Compositionality
16 0.42063117 271 nips-2013-Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima
17 0.41920808 109 nips-2013-Estimating LASSO Risk and Noise Level
18 0.40381587 188 nips-2013-Memory Limited, Streaming PCA
19 0.40283367 186 nips-2013-Matrix factorization with binary components
20 0.39171168 327 nips-2013-The Randomized Dependence Coefficient
topicId topicWeight
[(2, 0.015), (16, 0.05), (33, 0.123), (34, 0.072), (41, 0.024), (49, 0.018), (50, 0.35), (56, 0.094), (70, 0.011), (85, 0.043), (89, 0.056), (93, 0.037)]
simIndex simValue paperId paperTitle
same-paper 1 0.67940533 209 nips-2013-New Subsampling Algorithms for Fast Least Squares Regression
Author: Paramveer Dhillon, Yichao Lu, Dean P. Foster, Lyle Ungar
Abstract: We address the problem of fast estimation of ordinary least squares (OLS) from large amounts of data (n ≫ p). We propose three methods which solve the big data problem by subsampling the covariance matrix using either a single or two stage estimation. All three run in the order of the size of the input, i.e. O(np), and our best method, Uluru, gives an error bound of O(√(p/n)) which is independent of the amount of subsampling as long as it is above a threshold. We provide theoretical bounds for our algorithms in the fixed design (with Randomized Hadamard preconditioning) as well as sub-Gaussian random design setting. We also compare the performance of our methods on synthetic and real-world datasets and show that if observations are i.i.d., sub-Gaussian then one can directly subsample without the expensive Randomized Hadamard preconditioning without loss of accuracy. 1
2 0.6343236 187 nips-2013-Memoized Online Variational Inference for Dirichlet Process Mixture Models
Author: Michael Hughes, Erik Sudderth
Abstract: Variational inference algorithms provide the most effective framework for largescale training of Bayesian nonparametric models. Stochastic online approaches are promising, but are sensitive to the chosen learning rate and often converge to poor local optima. We present a new algorithm, memoized online variational inference, which scales to very large (yet finite) datasets while avoiding the complexities of stochastic gradient. Our algorithm maintains finite-dimensional sufficient statistics from batches of the full dataset, requiring some additional memory but still scaling to millions of examples. Exploiting nested families of variational bounds for infinite nonparametric models, we develop principled birth and merge moves allowing non-local optimization. Births adaptively add components to the model to escape local optima, while merges remove redundancy and improve speed. Using Dirichlet process mixture models for image clustering and denoising, we demonstrate major improvements in robustness and accuracy.
3 0.60469824 101 nips-2013-EDML for Learning Parameters in Directed and Undirected Graphical Models
Author: Khaled Refaat, Arthur Choi, Adnan Darwiche
Abstract: EDML is a recently proposed algorithm for learning parameters in Bayesian networks. It was originally derived in terms of approximate inference on a metanetwork, which underlies the Bayesian approach to parameter estimation. While this initial derivation helped discover EDML in the first place and provided a concrete context for identifying some of its properties (e.g., in contrast to EM), the formal setting was somewhat tedious in the number of concepts it drew on. In this paper, we propose a greatly simplified perspective on EDML, which casts it as a general approach to continuous optimization. The new perspective has several advantages. First, it makes immediate some results that were non-trivial to prove initially. Second, it facilitates the design of EDML algorithms for new graphical models, leading to a new algorithm for learning parameters in Markov networks. We derive this algorithm in this paper, and show, empirically, that it can sometimes learn estimates more efficiently from complete data, compared to commonly used optimization methods, such as conjugate gradient and L-BFGS. 1
4 0.53206617 99 nips-2013-Dropout Training as Adaptive Regularization
Author: Stefan Wager, Sida Wang, Percy Liang
Abstract: Dropout and other feature noising schemes control overfitting by artificially corrupting the training data. For generalized linear models, dropout performs a form of adaptive regularization. Using this viewpoint, we show that the dropout regularizer is first-order equivalent to an L2 regularizer applied after scaling the features by an estimate of the inverse diagonal Fisher information matrix. We also establish a connection to AdaGrad, an online learning algorithm, and find that a close relative of AdaGrad operates by repeatedly solving linear dropout-regularized problems. By casting dropout as regularization, we develop a natural semi-supervised algorithm that uses unlabeled data to create a better adaptive regularizer. We apply this idea to document classification tasks, and show that it consistently boosts the performance of dropout training, improving on state-of-the-art results on the IMDB reviews dataset. 1
5 0.47634277 233 nips-2013-Online Robust PCA via Stochastic Optimization
Author: Jiashi Feng, Huan Xu, Shuicheng Yan
Abstract: Robust PCA methods are typically based on batch optimization and have to load all the samples into memory during optimization. This prevents them from efficiently processing big data. In this paper, we develop an Online Robust PCA (OR-PCA) that processes one sample per time instance and hence its memory cost is independent of the number of samples, significantly enhancing the computation and storage efficiency. The proposed OR-PCA is based on stochastic optimization of an equivalent reformulation of the batch RPCA. Indeed, we show that OR-PCA provides a sequence of subspace estimations converging to the optimum of its batch counterpart and hence is provably robust to sparse corruption. Moreover, OR-PCA can naturally be applied for tracking dynamic subspace. Comprehensive simulations on subspace recovering and tracking demonstrate the robustness and efficiency advantages of the OR-PCA over online PCA and batch RPCA methods. 1
6 0.47393897 45 nips-2013-BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables
7 0.473728 350 nips-2013-Wavelets on Graphs via Deep Learning
8 0.47337461 116 nips-2013-Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA
9 0.47283596 310 nips-2013-Statistical analysis of coupled time series with Kernel Cross-Spectral Density operators.
10 0.47235778 232 nips-2013-Online PCA for Contaminated Data
11 0.47182655 120 nips-2013-Faster Ridge Regression via the Subsampled Randomized Hadamard Transform
12 0.47164103 285 nips-2013-Robust Transfer Principal Component Analysis with Rank Constraints
13 0.47095197 355 nips-2013-Which Space Partitioning Tree to Use for Search?
14 0.47077036 288 nips-2013-Scalable Influence Estimation in Continuous-Time Diffusion Networks
15 0.47012502 142 nips-2013-Information-theoretic lower bounds for distributed statistical estimation with communication constraints
16 0.46944106 218 nips-2013-On Sampling from the Gibbs Distribution with Random Maximum A-Posteriori Perturbations
17 0.46899927 178 nips-2013-Locally Adaptive Bayesian Multivariate Time Series
18 0.46874413 259 nips-2013-Provable Subspace Clustering: When LRR meets SSC
19 0.46858606 354 nips-2013-When in Doubt, SWAP: High-Dimensional Sparse Recovery from Correlated Measurements
20 0.46827433 74 nips-2013-Convex Tensor Decomposition via Structured Schatten Norm Regularization