nips nips2013 nips2013-120 knowledge-graph by maker-knowledge-mining

120 nips-2013-Faster Ridge Regression via the Subsampled Randomized Hadamard Transform


Source: pdf

Author: Yichao Lu, Paramveer Dhillon, Dean P. Foster, Lyle Ungar

Abstract: We propose a fast algorithm for ridge regression when the number of features is much larger than the number of observations (p ≫ n). The standard way to solve ridge regression in this setting works in the dual space and gives a running time of O(n^2 p). Our algorithm Subsampled Randomized Hadamard Transform-Dual Ridge Regression (SRHT-DRR) runs in time O(np log(n)) and works by preconditioning the design matrix by a Randomized Walsh-Hadamard Transform with a subsequent subsampling of features. We provide risk bounds for our SRHT-DRR algorithm in the fixed design setting and show experimental results on synthetic and real datasets. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: We propose a fast algorithm for ridge regression when the number of features is much larger than the number of observations (p ≫ n). [sent-10, score-0.266]

2 The standard way to solve ridge regression in this setting works in the dual space and gives a running time of O(n^2 p). [sent-11, score-0.287]

3 Our algorithm Subsampled Randomized Hadamard Transform-Dual Ridge Regression (SRHT-DRR) runs in time O(np log(n)) and works by preconditioning the design matrix by a Randomized Walsh-Hadamard Transform with a subsequent subsampling of features. [sent-12, score-0.16]

4 We provide risk bounds for our SRHT-DRR algorithm in the fixed design setting and show experimental results on synthetic and real datasets. [sent-13, score-0.177]

5 1 Introduction: Ridge Regression, which penalizes the ℓ2 norm of the weight vector and shrinks it towards zero, is the most widely used penalized regression method. [sent-14, score-0.074]

6 It is of particular interest in the p > n case (p is the number of features and n is the number of observations), as standard ordinary least squares regression (OLS) breaks down in this setting. [sent-15, score-0.098]

7 Thus efficient algorithms to solve ridge regression are highly desirable. [sent-17, score-0.219]

8 The current method of choice for efficiently solving RR is [19], which works in the dual space and has a running time of O(n^2 p), which can be slow for huge p. [sent-18, score-0.119]

9 As the runtime suggests, the bottleneck is the computation of XX^T, where X is the design matrix. [sent-19, score-0.064]

10 For example, suppose X has rank k; if we randomly subsample psubs of the p features (k < psubs ≪ p), then the matrix multiplication can be performed in O(n^2 psubs) time, which is very fast! [sent-21, score-2.045]

11 This approach can also be viewed as preconditioning the design matrix with a carefully constructed data-independent random matrix. [sent-27, score-0.123]

12 1 In this paper, we build on the above line of research and provide a fast algorithm for ridge regression (RR) which applies a Randomized Hadamard transform to the columns of the X matrix and then samples psubs = O(n) columns. [sent-29, score-1.043]

13 This allows the bottleneck matrix multiplication in the dual RR to be computed in O(np log(n)) time, so we call our algorithm Subsampled Randomized Hadamard Transform-Dual Ridge Regression (SRHT-DRR). [sent-30, score-0.16]

14 In addition to being computationally efficient, we also prove that in the fixed design setting SRHT-DRR only increases the risk by a factor of (1 + C sqrt(k/psubs)) w.h.p., where k is the rank of the data matrix. [sent-31, score-0.152]

15 1.1 Related Work: Using randomized algorithms to handle large matrices is an active area of research, and has been used in a variety of setups. [sent-36, score-0.86]

16 Most of these algorithms involve a step that randomly projects the original large matrix down to lower dimensions [9, 16, 8]. [sent-37, score-0.051]

17 However, computing a random projection is still expensive as it requires multiplying a huge data matrix by another random dense matrix. [sent-41, score-0.098]

18 [18] introduced the idea of using structured random projection for making matrix multiplication substantially faster. [sent-42, score-0.074]

19 Recently, several randomized algorithms have been developed for kernel approximation. [sent-43, score-0.234]

20 [3] provided a fast method for low rank kernel approximation by randomly selecting q samples to construct a rank q approximation of the original kernel matrix. [sent-44, score-0.162]

21 Their approximation can reduce the cost to O(nq^2). [sent-45, score-0.045]

22 Also, it is worth distinguishing our setup from standard kernel learning. [sent-48, score-0.062]

23 Therefore, in this paper we propose a randomized scheme to reduce the dimension of X and accelerate the computation of XX^T. [sent-51, score-0.192]

24 2 Faster Ridge Regression via SRHT: In this section we first review the traditional dual solution of RR and its computational cost. [sent-52, score-0.118]

25 2.1 Ridge Regression: Let X be the n × p design matrix containing n i.i.d. samples. [sent-55, score-0.097]

26 The step that dominates the computational cost is the matrix inversion, which takes O(p^3) FLOPS and will be extremely slow when p ≫ n ≫ 1. [sent-67, score-0.157]

27 A straightforward improvement is to solve Equation (1) in the dual space. [sent-68, score-0.068]

28 Please see [19] for a detailed derivation of this dual solution. [sent-71, score-0.068]

29 In the p ≫ n case the step that dominates the computational cost in the dual solution is computing the linear kernel matrix K = XX^T, which takes O(n^2 p) FLOPS. [sent-72, score-0.274]

30 This is regarded as the computational cost of the true RR solution in our setup. [sent-73, score-0.123]
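
To make the dual computation concrete, here is a minimal NumPy sketch of the dual RR solution discussed above. It uses the standard ridge identity β = X^T (XX^T + λI_n)^{-1} Y; the function name and the exact placement of λ are illustrative choices, not the exact routine from [19].

```python
import numpy as np

def dual_ridge(X, Y, lam):
    """Dual ridge regression for p >> n.
    Standard identity: beta = X^T (X X^T + lam * I_n)^{-1} Y.
    The dominant cost is forming the n x n linear kernel K = X X^T,
    which takes O(n^2 p) FLOPS."""
    n, p = X.shape
    K = X @ X.T                                      # bottleneck: O(n^2 p)
    alpha = np.linalg.solve(K + lam * np.eye(n), Y)  # n x n solve: O(n^3), cheap when n << p
    beta = X.T @ alpha                               # map back to primal weights: O(n p)
    return beta

# Tiny usage example with p >> n:
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4096))
Y = rng.standard_normal(50)
beta = dual_ridge(X, Y, lam=1.0)
```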

31 The Walsh-Hadamard matrix of size p × p is defined recursively as Hp = [Hp/2, Hp/2; Hp/2, −Hp/2], with H2 = [+1, +1; +1, −1]. [sent-78, score-0.051]

32 • D is a p × p diagonal matrix whose diagonal elements are i.i.d. Rademacher random variables (equal to +1 or −1 with probability 1/2 each). [sent-80, score-0.051]

33 First, due to the recursive structure of the H matrix, it takes only O(p log(psubs)) FLOPS to compute Θv, where v is a generic p × 1 dense vector, while for an arbitrary unstructured psubs × p dense matrix A the cost of computing Av is O(psubs p) FLOPS. [sent-85, score-0.772]

34 Second, after projecting any matrix W ∈ R^{p×k} with orthonormal columns down to low dimensions with SRHT, the columns of ΘW ∈ R^{psubs×k} are still approximately orthonormal. [sent-86, score-0.748]

35 Let W be a p × k (p > k) matrix where W^T W = Ik. [sent-88, score-0.051]

36 Let Θ be a psubs × p SRHT matrix where p > psubs > k. [sent-89, score-1.335]

37 Then, with probability at least 1 − (δ + p/e^k), ‖(ΘW)^T ΘW − Ik‖_2 ≤ c sqrt(k log(2k/δ) / psubs) (3). The bound is in terms of the spectral norm of the matrix. [sent-90, score-0.667]

38 The tools for the random matrix theory part of the proof come from [20] and [21]. [sent-92, score-0.051]

39 2.3 The Algorithm: Our fast algorithm SRHT-DRR is described below. SRHT-DRR Input: dataset X ∈ R^{n×p}, response Y ∈ R^{n×1}, and subsampling size psubs. [sent-95, score-0.705]

40 • Compute βH,λ = XH^T αH,λ. (Since the SRHT is only defined for p = 2^q with integer q, if the dimension p is not a power of 2 we can concatenate a block of zero columns to the feature matrix X to make the dimension a power of 2.) [sent-99, score-0.102]

41 Once we have XH, computing αH,λ costs O(n^2 psubs) FLOPS, with the dominating step being computing KH = XH XH^T. [sent-103, score-0.661]

42 So the computational cost of computing αH,λ is O(np log(psubs) + n^2 psubs), compared to the true RR which costs O(n^2 p). [sent-104, score-0.737]

43 We will discuss how large psubs should be later after stating the main theorem. [sent-105, score-0.642]
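
The following NumPy sketch puts the pieces together, assuming the usual SRHT convention Θ = sqrt(p/psubs)·S·H·D with an orthogonal (1/sqrt(p))-normalized Hadamard matrix H, i.i.d. ±1 signs D, and uniform column subsampling S; the helper names (fwht, srht_drr), the seed handling, and the exact normalization are our own illustrative choices, not code from the paper.

```python
import numpy as np

def fwht(A):
    """Unnormalized fast Walsh-Hadamard transform applied to each row of A.
    A.shape[1] must be a power of 2; cost is O(n p log p) for an n x p input."""
    A = A.astype(float)
    n, p = A.shape
    h = 1
    while h < p:
        for i in range(0, p, 2 * h):
            x = A[:, i:i + h].copy()
            y = A[:, i + h:i + 2 * h].copy()
            A[:, i:i + h] = x + y
            A[:, i + h:i + 2 * h] = x - y
        h *= 2
    return A

def srht_drr(X, Y, p_subs, lam, seed=0):
    """Sketch of SRHT-DRR: randomized Walsh-Hadamard preconditioning of the
    columns of X, subsampling of p_subs columns, then dual ridge regression."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # Pad the feature dimension to the next power of 2 with zero columns.
    p2 = 1 << (p - 1).bit_length()
    if p2 > p:
        X = np.hstack([X, np.zeros((n, p2 - p))])
        p = p2
    # D: i.i.d. random +/-1 signs on the columns.
    signs = rng.choice([-1.0, 1.0], size=p)
    # Apply H D along the feature dimension; dividing by sqrt(p) makes H orthogonal.
    XHD = fwht(X * signs) / np.sqrt(p)
    # S: uniform subsample of p_subs columns, rescaled by sqrt(p / p_subs).
    cols = rng.choice(p, size=p_subs, replace=False)
    X_H = XHD[:, cols] * np.sqrt(p / p_subs)
    # Dual ridge regression on the sketched n x p_subs design matrix.
    K_H = X_H @ X_H.T                                    # O(n^2 p_subs)
    alpha_H = np.linalg.solve(K_H + lam * np.eye(n), Y)
    beta_H = X_H.T @ alpha_H                             # weights in the sketched feature space
    return beta_H, X_H
```

The pure-Python fwht loop is written for clarity; at the p = 8192 scale used in the experiments one would vectorize it or call an optimized Walsh-Hadamard routine.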

44 3 Theory: In this section we bound the risk of SRHT-DRR and compare it with the risk of the true dual ridge estimator in the fixed design setting. [sent-106, score-0.518]

45 As earlier, let X be an arbitrary n × p design matrix such that p ≫ n. [sent-107, score-0.097]

46 [5] and [3] did similar analysis for the risk of RR under similar fixed design setups. [sent-109, score-0.152]

47 The risk for any prediction Ŷ ∈ R^{n×1} is (1/n) E‖Ŷ − Z‖^2. [sent-127, score-0.106]

48 For any n × n symmetric positive definite matrix M, define the following risk function. [sent-128, score-0.157]

49 Under the fixed design setting, the risk for the true RR solution is R(K) and the risk for SRHT-DRR is R(KH ). [sent-130, score-0.314]

50 In the risk function, the first term is the variance and the second term is the bias. [sent-134, score-0.106]

51 3.2 Risk Inflation Bound: The following theorem bounds the risk inflation of SRHT-DRR compared with the true RR solution. [sent-136, score-0.134]

52 With probability at least 1 − (δ + p/e^k), R(KH) ≤ (1 − ∆)^{-2} R(K), where ∆ = C sqrt(k log(2k/δ) / psubs) (7). Proof. [sent-139, score-0.667]
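
For readability, here is a LaTeX restatement of the two bounds reconstructed above (the lemma's equation (3) and Theorem 1's equation (7)); the failure-probability term p/e^k is our best reading of the garbled extraction and should be checked against the original paper.

```latex
% Lemma, eq. (3): approximate orthonormality after the SRHT (reconstructed reading)
\left\| (\Theta W)^\top \Theta W - I_k \right\|_2
  \le c \sqrt{\frac{k \log(2k/\delta)}{p_{\mathrm{subs}}}}
  \quad \text{w.p. at least } 1 - \left(\delta + \frac{p}{e^k}\right).

% Theorem 1, eq. (7): risk inflation of SRHT-DRR over the true RR solution
R(K_H) \le (1 - \Delta)^{-2}\, R(K),
  \qquad \Delta = C \sqrt{\frac{k \log(2k/\delta)}{p_{\mathrm{subs}}}}.
```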

53 Theorem 1 gives us an idea of how large psubs should be. [sent-146, score-0.642]

54 Assuming ∆ (the risk inflation ratio) is fixed, we get psubs = C k log(2k/δ) / ∆^2 = O(k). [sent-147, score-0.748]

55 If k = n, then it suffices to choose psubs = O(n). [sent-150, score-0.642]
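
A small back-of-the-envelope helper for this sizing rule is sketched below; the constant C in the bound is unspecified, so the default C = 1 and the example values of δ and the target ∆ are placeholders for illustration only.

```python
import numpy as np

def psubs_for_target_inflation(k, delta=0.05, target_delta=0.1, C=1.0):
    """Smallest p_subs with Delta = C*sqrt(k*log(2k/delta)/p_subs) <= target_delta,
    i.e. p_subs >= C^2 * k * log(2k/delta) / target_delta^2."""
    return int(np.ceil(C**2 * k * np.log(2 * k / delta) / target_delta**2))

print(psubs_for_target_inflation(k=200))  # e.g. k = n = 200 as in the synthetic experiments
```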

56 Combining this with Remark 1, we can see that the cost of computing XH is O(np log(n)). [sent-151, score-0.045]

57 Hence, under the ideal setup where p is huge, so that the dominating step of SRHT-DRR is computing XH, the computational cost of SRHT-DRR is O(np log(n)) FLOPS. [sent-152, score-0.136]

58 5 Comparison with PCA Another way to handle high dimensional features is to use PCA and run regression only on the top few principal components (this procedure is called PCR), as illustrated by [13] and many other papers. [sent-153, score-0.102]

59 Recently, [5] have shown that the risk of PCR and RR are related and that the risk of PCR is bounded by four times the risk of RR. [sent-156, score-0.318]

60 Therefore the eigenvectors of the sample covariance matrix obtained by PCA may be very different from the truth. [sent-160, score-0.051]

61 An alternative is to use a randomized algorithm such as [16] or [9] to compute PCA. [sent-163, score-0.192]

62 Again, whether randomized PCA is better than our SRHT-DRR algorithm depends on the problem. [sent-164, score-0.192]

63 4 Experiments In this section we show experimental results on synthetic as well as real-world data highlighting the merits of SRHT, namely, lower computational cost compared to the true Ridge Regression (RR) solution, without any significant loss of accuracy. [sent-166, score-0.12]

64 We also compare our approach against “standard” PCA as well as randomized PCA [16]. [sent-167, score-0.192]

65 As far as PCA algorithms are concerned, we implemented standard PCA using the built-in SVD function in MATLAB, and for randomized PCA we used the block power iteration-like approach proposed by [16]. [sent-169, score-0.192]

66 We always achieved convergence in three power iterations of randomized PCA. [sent-170, score-0.192]

67 4.1 Measures of Performance: Since we know the true β which generated the synthetic data, we report the MSE/risk for the fixed design setting (they are equivalent for squared loss) as the measure of accuracy. [sent-172, score-0.099]

68 In order to compare the computational cost of SRHT-DRR with true RR, we need to estimate the number of FLOPS used by each. [sent-175, score-0.095]

69 As per [4, 6], the theoretical cost of applying the Randomized Hadamard Transform is O(np log(psubs)). [sent-178, score-0.045]

70 However, the MATLAB implementation we used took about np log(p) FLOPS to compute XH . [sent-179, score-0.077]

71 So, for SRHT-DRR, the total computational cost is np log(p) FLOPS for getting XH and a further 2n^2 psubs FLOPS to compute KH. [sent-180, score-0.786]

72 As mentioned earlier, the true dual RR solution takes ≈ 2n^2 p FLOPS. [sent-181, score-0.124]

73 So, in our experiments, we report the relative computational cost, which is computed as the ratio of the two. [sent-182, score-0.087]

74 Relative cost = (np log(p) + 2n^2 psubs) / (2n^2 p). 4.2 Synthetic Data: We generated synthetic data with p = 8192 and varied the number of observations n = 20, 100, 200. [sent-184, score-0.667]
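
A quick sketch of this relative-cost ratio, evaluated at the synthetic-data dimensions; treating the log in np log(p) as base-2 is our assumption for the FLOP count.

```python
import numpy as np

def relative_cost(n, p, p_subs):
    """Relative computational cost of SRHT-DRR vs. true dual RR:
    (n*p*log2(p) + 2*n^2*p_subs) / (2*n^2*p)."""
    return (n * p * np.log2(p) + 2 * n**2 * p_subs) / (2 * n**2 * p)

# Synthetic-data dimensions (p = 8192, n = 200) at a few subsampling sizes:
for p_subs in (250, 500, 1000, 2000):
    print(p_subs, round(relative_cost(200, 8192, p_subs), 3))
```

At psubs = 2000 the ratio comes out to roughly 0.28, in line with the ≈ 30% relative cost reported below.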

75 We generated an n × n matrix R ∼ MVN(0, I), where MVN(µ, Σ) is the multivariate normal distribution with mean vector µ and variance-covariance matrix Σ, and βj ∼ N(0, 1) ∀j = 1, . . . , p. [sent-185, score-0.102]

76 The final X matrix was generated by rotating R with a randomly generated n × p rotation matrix. [sent-189, score-0.051]

77 The boxplots show the median error rates for SRHT-DRR for different psubs . [sent-244, score-0.717]

78 The solid red line is the median error rate for the true RR using all the features. [sent-245, score-0.101]

79 The green line is the median error rate for PCR when PCA is computed by SVD in MATLAB. [sent-246, score-0.073]

80 The black dashed line is the median error rate for PCR when PCA is computed by randomized PCA. [sent-247, score-0.265]

81 For PCA and randomized PCA, we tried keeping r PCs in the range 10 to n and finally chose the value of r which gave the minimum error on the training set. [sent-248, score-0.237]

82 We tried 10 different values for psubs, from n + 10 to 2000. [sent-249, score-0.671]

83 First, in all cases, SRHT-DRR gets very close in accuracy to the true RR with only ≈ 30% of its computational cost. [sent-253, score-0.05]

84 For PCA and randomized PCA, we tried keeping r = 10, 20, 30, 40, 50, 60, 70, 80, 90 PCs and finally chose the value of r which gave the minimum error on the training set (r = 30). [sent-263, score-0.237]

85 As earlier, we tried 10 different values for psubs : 150, 250, 400, 600, 800, 1000, 1200, 1600, 2000, 2500. [sent-264, score-0.671]

86 Randomized PCA is fast but less accurate than standard (“true”) PCA; its computational cost for r = 30 can be approximately calculated as about 240np (see [9] for details), which in this case is roughly the same as computing XX^T (≈ 2n^2 p). [sent-266, score-0.093]

87 As can be seen, SRHT-DRR comes very close in accuracy to the true RR solution with just ≈ 30% of its computational cost. [sent-268, score-0.078]

88 SRHT-DRR beats PCA and Randomized PCA even more comprehensively, achieving the same or better accuracy at just ≈ 18% of their computational cost. [sent-269, score-0.048]

89 5 Conclusion: In this paper we proposed a fast algorithm, SRHT-DRR, for ridge regression in the p ≫ n ≫ 1 setting. SRHT-DRR preconditions the design matrix by a Randomized Walsh-Hadamard Transform with a subsequent subsampling of features. [sent-270, score-0.379]

90 In addition to being significantly faster than the true dual ridge regression solution, SRHT-DRR only inflates the risk w.r.t. the true solution by a small factor. [sent-271, score-0.439]

91 Figure 2: The boxplots show the median error rates for SRHT-DRR for different psubs. [sent-294, score-0.717]

92 The solid red line is the median error rate for the true RR using all the features. [sent-295, score-0.101]

93 The green line is the median error rate for PCR with top 30 PCs when PCA is computed by SVD in MATLAB. [sent-296, score-0.073]

94 The black dashed line is the median error rate for PCR with the top 30 PCs computed by randomized PCA. [sent-297, score-0.265]

95 Fast dimension reduction using Rademacher series on dual BCH codes. [sent-302, score-0.087]

96 Improved matrix algorithms via the subsampled randomized hadamard transform. [sent-309, score-0.441]

97 A risk comparison of ordinary least squares vs ridge regression. [sent-317, score-0.292]

98 Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. [sent-334, score-0.051]

99 Analysis of a randomized approximation scheme for matrix multiplication. [sent-343, score-0.243]

100 A fast randomized algorithm for overdetermined linear least-squares regression. [sent-374, score-0.218]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('psubs', 0.642), ('kh', 0.402), ('rr', 0.279), ('pca', 0.234), ('randomized', 0.192), ('srht', 0.189), ('ridge', 0.164), ('xh', 0.156), ('pcr', 0.123), ('hadamard', 0.111), ('flops', 0.11), ('risk', 0.106), ('subsampled', 0.087), ('np', 0.077), ('dual', 0.068), ('transform', 0.062), ('xx', 0.056), ('regression', 0.055), ('matrix', 0.051), ('pcs', 0.05), ('median', 0.049), ('corr', 0.048), ('design', 0.046), ('cost', 0.045), ('svd', 0.044), ('kernel', 0.042), ('ation', 0.039), ('hp', 0.039), ('subsampling', 0.037), ('firstly', 0.036), ('arcene', 0.036), ('tamas', 0.036), ('homoskedastic', 0.031), ('paramveer', 0.031), ('ik', 0.031), ('huge', 0.03), ('tried', 0.029), ('saunders', 0.029), ('ailon', 0.029), ('lyle', 0.029), ('rokhlin', 0.029), ('true', 0.028), ('solution', 0.028), ('halko', 0.027), ('martinsson', 0.027), ('mark', 0.027), ('fast', 0.026), ('preconditioning', 0.026), ('boxplots', 0.026), ('beats', 0.026), ('rank', 0.026), ('principal', 0.026), ('synthetic', 0.025), ('ep', 0.025), ('nir', 0.025), ('secondly', 0.024), ('ols', 0.024), ('line', 0.024), ('multiplication', 0.023), ('ud', 0.023), ('joel', 0.023), ('vladimir', 0.022), ('computational', 0.022), ('ordinary', 0.022), ('sham', 0.022), ('tr', 0.022), ('features', 0.021), ('foster', 0.021), ('earlier', 0.021), ('slow', 0.021), ('setup', 0.02), ('relative', 0.02), ('du', 0.02), ('log', 0.02), ('columns', 0.019), ('subsample', 0.019), ('dominating', 0.019), ('rademacher', 0.019), ('multiply', 0.019), ('shrinks', 0.019), ('lemma', 0.019), ('parallel', 0.019), ('dhillon', 0.019), ('faster', 0.018), ('dominates', 0.018), ('bottleneck', 0.018), ('dense', 0.017), ('individuals', 0.017), ('orthonormal', 0.017), ('dean', 0.017), ('fourier', 0.016), ('remark', 0.016), ('gave', 0.016), ('kakade', 0.016), ('kills', 0.016), ('ates', 0.016), ('nq', 0.016), ('edo', 0.016), ('isabelle', 0.016), ('omission', 0.016)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 120 nips-2013-Faster Ridge Regression via the Subsampled Randomized Hadamard Transform

Author: Yichao Lu, Paramveer Dhillon, Dean P. Foster, Lyle Ungar

Abstract: We propose a fast algorithm for ridge regression when the number of features is much larger than the number of observations (p ≫ n). The standard way to solve ridge regression in this setting works in the dual space and gives a running time of O(n^2 p). Our algorithm Subsampled Randomized Hadamard Transform-Dual Ridge Regression (SRHT-DRR) runs in time O(np log(n)) and works by preconditioning the design matrix by a Randomized Walsh-Hadamard Transform with a subsequent subsampling of features. We provide risk bounds for our SRHT-DRR algorithm in the fixed design setting and show experimental results on synthetic and real datasets. 1

2 0.18787648 144 nips-2013-Inverse Density as an Inverse Problem: the Fredholm Equation Approach

Author: Qichao Que, Mikhail Belkin

Abstract: We address the problem of estimating the ratio q/p, where p is a density function and q is another density, or, more generally, an arbitrary function. Knowing or approximating this ratio is needed in various problems of inference and integration often referred to as importance sampling in statistical inference. It is also closely related to the problem of covariate shift in transfer learning. Our approach is based on reformulating the problem of estimating the ratio as an inverse problem in terms of an integral operator corresponding to a kernel, known as the Fredholm problem of the first kind. This formulation, combined with the techniques of regularization leads to a principled framework for constructing algorithms and for analyzing them theoretically. The resulting family of algorithms (FIRE, for Fredholm Inverse Regularized Estimator) is flexible, simple and easy to implement. We provide detailed theoretical analysis including concentration bounds and convergence rates for the Gaussian kernel for densities defined on R^d and smooth d-dimensional sub-manifolds of the Euclidean space. Model selection for unsupervised or semi-supervised inference is generally a difficult problem. It turns out that in the density ratio estimation setting, when samples from both distributions are available, simple completely unsupervised model selection methods are available. We call this mechanism CD-CV for Cross-Density Cross-Validation. We show encouraging experimental results including applications to classification within the covariate shift framework. 1

3 0.1521543 209 nips-2013-New Subsampling Algorithms for Fast Least Squares Regression

Author: Paramveer Dhillon, Yichao Lu, Dean P. Foster, Lyle Ungar

Abstract: We address the problem of fast estimation of ordinary least squares (OLS) from large amounts of data (n ≫ p). We propose three methods which solve the big data problem by subsampling the covariance matrix using either a single or two stage estimation. All three run in the order of the size of the input, i.e. O(np), and our best method, Uluru, gives an error bound of O(sqrt(p/n)) which is independent of the amount of subsampling as long as it is above a threshold. We provide theoretical bounds for our algorithms in the fixed design (with Randomized Hadamard preconditioning) as well as sub-Gaussian random design setting. We also compare the performance of our methods on synthetic and real-world datasets and show that if observations are i.i.d., sub-Gaussian then one can directly subsample without the expensive Randomized Hadamard preconditioning without loss of accuracy. 1

4 0.13359459 285 nips-2013-Robust Transfer Principal Component Analysis with Rank Constraints

Author: Yuhong Guo

Abstract: Principal component analysis (PCA), a well-established technique for data analysis and processing, provides a convenient form of dimensionality reduction that is effective for cleaning small Gaussian noises presented in the data. However, the applicability of standard principal component analysis in real scenarios is limited by its sensitivity to large errors. In this paper, we tackle the challenge problem of recovering data corrupted with errors of high magnitude by developing a novel robust transfer principal component analysis method. Our method is based on the assumption that useful information for the recovery of a corrupted data matrix can be gained from an uncorrupted related data matrix. Specifically, we formulate the data recovery problem as a joint robust principal component analysis problem on the two data matrices, with common principal components shared across matrices and individual principal components specific to each data matrix. The formulated optimization problem is a minimization problem over a convex objective function but with non-convex rank constraints. We develop an efficient proximal projected gradient descent algorithm to solve the proposed optimization problem with convergence guarantees. Our empirical results over image denoising tasks show the proposed method can effectively recover images with random large errors, and significantly outperform both standard PCA and robust PCA with rank constraints. 1

5 0.11265843 232 nips-2013-Online PCA for Contaminated Data

Author: Jiashi Feng, Huan Xu, Shie Mannor, Shuicheng Yan

Abstract: We consider the online Principal Component Analysis (PCA) where contaminated samples (containing outliers) are revealed sequentially to the Principal Components (PCs) estimator. Due to their sensitiveness to outliers, previous online PCA algorithms fail in this case and their results can be arbitrarily skewed by the outliers. Here we propose the online robust PCA algorithm, which is able to improve the PCs estimation upon an initial one steadily, even when faced with a constant fraction of outliers. We show that the final result of the proposed online RPCA has an acceptable degradation from the optimum. Actually, under mild conditions, online RPCA achieves the maximal robustness with a 50% breakdown point. Moreover, online RPCA is shown to be efficient for both storage and computation, since it need not re-explore the previous samples as in traditional robust PCA algorithms. This endows online RPCA with scalability for large scale data. 1

6 0.09394294 116 nips-2013-Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA

7 0.092372015 225 nips-2013-One-shot learning and big data with n=2

8 0.091974758 233 nips-2013-Online Robust PCA via Stochastic Optimization

9 0.091264762 188 nips-2013-Memory Limited, Streaming PCA

10 0.065899082 351 nips-2013-What Are the Invariant Occlusive Components of Image Patches? A Probabilistic Generative Approach

11 0.064416416 224 nips-2013-On the Sample Complexity of Subspace Learning

12 0.063805871 297 nips-2013-Sketching Structured Matrices for Faster Nonlinear Regression

13 0.057585679 186 nips-2013-Matrix factorization with binary components

14 0.052023128 72 nips-2013-Convex Calibrated Surrogates for Low-Rank Loss Matrices with Applications to Subset Ranking Losses

15 0.051459316 109 nips-2013-Estimating LASSO Risk and Noise Level

16 0.049816482 91 nips-2013-Dirty Statistical Models

17 0.048904784 314 nips-2013-Stochastic Optimization of PCA with Capped MSG

18 0.047123134 251 nips-2013-Predicting Parameters in Deep Learning

19 0.045903232 171 nips-2013-Learning with Noisy Labels

20 0.04566462 340 nips-2013-Understanding variable importances in forests of randomized trees


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.117), (1, 0.056), (2, 0.064), (3, 0.044), (4, 0.0), (5, 0.041), (6, -0.053), (7, 0.066), (8, -0.128), (9, -0.018), (10, -0.047), (11, 0.082), (12, 0.015), (13, 0.082), (14, 0.065), (15, 0.029), (16, 0.001), (17, 0.016), (18, -0.001), (19, -0.028), (20, 0.014), (21, -0.025), (22, -0.012), (23, -0.028), (24, 0.026), (25, 0.072), (26, 0.051), (27, 0.088), (28, 0.049), (29, -0.033), (30, 0.067), (31, -0.009), (32, -0.047), (33, -0.009), (34, 0.033), (35, -0.126), (36, -0.091), (37, -0.066), (38, 0.005), (39, -0.09), (40, 0.009), (41, -0.159), (42, 0.046), (43, -0.052), (44, -0.223), (45, -0.047), (46, -0.138), (47, -0.061), (48, 0.096), (49, 0.051)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.91319329 120 nips-2013-Faster Ridge Regression via the Subsampled Randomized Hadamard Transform

Author: Yichao Lu, Paramveer Dhillon, Dean P. Foster, Lyle Ungar

Abstract: We propose a fast algorithm for ridge regression when the number of features is much larger than the number of observations (p n). The standard way to solve ridge regression in this setting works in the dual space and gives a running time of O(n2 p). Our algorithm Subsampled Randomized Hadamard Transform- Dual Ridge Regression (SRHT-DRR) runs in time O(np log(n)) and works by preconditioning the design matrix by a Randomized Walsh-Hadamard Transform with a subsequent subsampling of features. We provide risk bounds for our SRHT-DRR algorithm in the fixed design setting and show experimental results on synthetic and real datasets. 1

2 0.64824188 209 nips-2013-New Subsampling Algorithms for Fast Least Squares Regression

Author: Paramveer Dhillon, Yichao Lu, Dean P. Foster, Lyle Ungar

Abstract: We address the problem of fast estimation of ordinary least squares (OLS) from large amounts of data (n p). We propose three methods which solve the big data problem by subsampling the covariance matrix using either a single or two stage estimation. All three run in the order of size of input i.e. O(np) and our best method, Uluru, gives an error bound of O( p/n) which is independent of the amount of subsampling as long as it is above a threshold. We provide theoretical bounds for our algorithms in the fixed design (with Randomized Hadamard preconditioning) as well as sub-Gaussian random design setting. We also compare the performance of our methods on synthetic and real-world datasets and show that if observations are i.i.d., sub-Gaussian then one can directly subsample without the expensive Randomized Hadamard preconditioning without loss of accuracy. 1

3 0.56535047 232 nips-2013-Online PCA for Contaminated Data

Author: Jiashi Feng, Huan Xu, Shie Mannor, Shuicheng Yan

Abstract: We consider the online Principal Component Analysis (PCA) where contaminated samples (containing outliers) are revealed sequentially to the Principal Components (PCs) estimator. Due to their sensitiveness to outliers, previous online PCA algorithms fail in this case and their results can be arbitrarily skewed by the outliers. Here we propose the online robust PCA algorithm, which is able to improve the PCs estimation upon an initial one steadily, even when faced with a constant fraction of outliers. We show that the final result of the proposed online RPCA has an acceptable degradation from the optimum. Actually, under mild conditions, online RPCA achieves the maximal robustness with a 50% breakdown point. Moreover, online RPCA is shown to be efficient for both storage and computation, since it need not re-explore the previous samples as in traditional robust PCA algorithms. This endows online RPCA with scalability for large scale data. 1

4 0.52873582 225 nips-2013-One-shot learning and big data with n=2

Author: Lee H. Dicker, Dean P. Foster

Abstract: We model a “one-shot learning” situation, where very few observations y1 , ..., yn ∈ R are available. Associated with each observation yi is a very highdimensional vector xi ∈ Rd , which provides context for yi and enables us to predict subsequent observations, given their own context. One of the salient features of our analysis is that the problems studied here are easier when the dimension of xi is large; in other words, prediction becomes easier when more context is provided. The proposed methodology is a variant of principal component regression (PCR). Our rigorous analysis sheds new light on PCR. For instance, we show that classical PCR estimators may be inconsistent in the specified setting, unless they are multiplied by a scalar c > 1; that is, unless the classical estimator is expanded. This expansion phenomenon appears to be somewhat novel and contrasts with shrinkage methods (c < 1), which are far more common in big data analyses. 1

5 0.51328629 233 nips-2013-Online Robust PCA via Stochastic Optimization

Author: Jiashi Feng, Huan Xu, Shuicheng Yan

Abstract: Robust PCA methods are typically based on batch optimization and have to load all the samples into memory during optimization. This prevents them from efficiently processing big data. In this paper, we develop an Online Robust PCA (OR-PCA) that processes one sample per time instance and hence its memory cost is independent of the number of samples, significantly enhancing the computation and storage efficiency. The proposed OR-PCA is based on stochastic optimization of an equivalent reformulation of the batch RPCA. Indeed, we show that OR-PCA provides a sequence of subspace estimations converging to the optimum of its batch counterpart and hence is provably robust to sparse corruption. Moreover, OR-PCA can naturally be applied for tracking dynamic subspace. Comprehensive simulations on subspace recovering and tracking demonstrate the robustness and efficiency advantages of the OR-PCA over online PCA and batch RPCA methods. 1

6 0.49830246 188 nips-2013-Memory Limited, Streaming PCA

7 0.49623013 285 nips-2013-Robust Transfer Principal Component Analysis with Rank Constraints

8 0.48532629 144 nips-2013-Inverse Density as an Inverse Problem: the Fredholm Equation Approach

9 0.47514635 116 nips-2013-Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA

10 0.46044806 297 nips-2013-Sketching Structured Matrices for Faster Nonlinear Regression

11 0.45612741 314 nips-2013-Stochastic Optimization of PCA with Capped MSG

12 0.43533298 302 nips-2013-Sparse Inverse Covariance Estimation with Calibration

13 0.42850515 130 nips-2013-Generalizing Analytic Shrinkage for Arbitrary Covariance Structures

14 0.42821309 256 nips-2013-Probabilistic Principal Geodesic Analysis

15 0.41929209 76 nips-2013-Correlated random features for fast semi-supervised learning

16 0.39896691 117 nips-2013-Fast Algorithms for Gaussian Noise Invariant Independent Component Analysis

17 0.35097855 145 nips-2013-It is all in the noise: Efficient multi-task Gaussian process inference with structured residuals

18 0.34999505 224 nips-2013-On the Sample Complexity of Subspace Learning

19 0.34992847 186 nips-2013-Matrix factorization with binary components

20 0.33186635 146 nips-2013-Large Scale Distributed Sparse Precision Estimation


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.014), (16, 0.063), (33, 0.125), (34, 0.094), (36, 0.015), (41, 0.025), (47, 0.01), (49, 0.012), (56, 0.082), (70, 0.011), (85, 0.051), (89, 0.055), (93, 0.052), (95, 0.029), (97, 0.259)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.75111771 120 nips-2013-Faster Ridge Regression via the Subsampled Randomized Hadamard Transform

Author: Yichao Lu, Paramveer Dhillon, Dean P. Foster, Lyle Ungar

Abstract: We propose a fast algorithm for ridge regression when the number of features is much larger than the number of observations (p ≫ n). The standard way to solve ridge regression in this setting works in the dual space and gives a running time of O(n^2 p). Our algorithm Subsampled Randomized Hadamard Transform-Dual Ridge Regression (SRHT-DRR) runs in time O(np log(n)) and works by preconditioning the design matrix by a Randomized Walsh-Hadamard Transform with a subsequent subsampling of features. We provide risk bounds for our SRHT-DRR algorithm in the fixed design setting and show experimental results on synthetic and real datasets. 1

2 0.70561284 186 nips-2013-Matrix factorization with binary components

Author: Martin Slawski, Matthias Hein, Pavlo Lutsik

Abstract: Motivated by an application in computational biology, we consider low-rank matrix factorization with {0, 1}-constraints on one of the factors and optionally convex constraints on the second one. In addition to the non-convexity shared with other matrix factorization schemes, our problem is further complicated by a combinatorial constraint set of size 2^{m·r}, where m is the dimension of the data points and r the rank of the factorization. Despite apparent intractability, we provide (in the line of recent work on non-negative matrix factorization by Arora et al. (2012)) an algorithm that provably recovers the underlying factorization in the exact case with O(mr·2^r + mnr + r^2 n) operations for n datapoints. To obtain this result, we use theory around the Littlewood-Offord lemma from combinatorics.

3 0.69836855 252 nips-2013-Predictive PAC Learning and Process Decompositions

Author: Cosma Shalizi, Aryeh Kontorovitch

Abstract: We informally call a stochastic process learnable if it admits a generalization error approaching zero in probability for any concept class with finite VC-dimension (IID processes are the simplest example). A mixture of learnable processes need not be learnable itself, and certainly its generalization error need not decay at the same rate. In this paper, we argue that it is natural in predictive PAC to condition not on the past observations but on the mixture component of the sample path. This definition not only matches what a realistic learner might demand, but also allows us to sidestep several otherwise grave problems in learning from dependent data. In particular, we give a novel PAC generalization bound for mixtures of learnable processes with a generalization error that is not worse than that of each mixture component. We also provide a characterization of mixtures of absolutely regular (β-mixing) processes, of independent probability-theoretic interest. 1

4 0.60580945 45 nips-2013-BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables

Author: Cho-Jui Hsieh, Matyas A. Sustik, Inderjit Dhillon, Pradeep Ravikumar, Russell Poldrack

Abstract: The ℓ1-regularized Gaussian maximum likelihood estimator (MLE) has been shown to have strong statistical guarantees in recovering a sparse inverse covariance matrix even under high-dimensional settings. However, it requires solving a difficult non-smooth log-determinant program with the number of parameters scaling quadratically with the number of Gaussian variables. State-of-the-art methods thus do not scale to problems with more than 20,000 variables. In this paper, we develop an algorithm BIGQUIC, which can solve 1-million-dimensional ℓ1-regularized Gaussian MLE problems (which would thus have 1000 billion parameters) using a single machine, with bounded memory. In order to do so, we carefully exploit the underlying structure of the problem. Our innovations include a novel block-coordinate descent method with the blocks chosen via a clustering scheme to minimize repeated computations; and allowing for inexact computation of specific components. In spite of these modifications, we are able to theoretically analyze our procedure and show that BIGQUIC can achieve super-linear or even quadratic convergence rates. 1

5 0.60565412 350 nips-2013-Wavelets on Graphs via Deep Learning

Author: Raif Rustamov, Leonidas Guibas

Abstract: An increasing number of applications require processing of signals defined on weighted graphs. While wavelets provide a flexible tool for signal processing in the classical setting of regular domains, the existing graph wavelet constructions are less flexible – they are guided solely by the structure of the underlying graph and do not take directly into consideration the particular class of signals to be processed. This paper introduces a machine learning framework for constructing graph wavelets that can sparsely represent a given class of signals. Our construction uses the lifting scheme, and is based on the observation that the recurrent nature of the lifting scheme gives rise to a structure resembling a deep auto-encoder network. Particular properties that the resulting wavelets must satisfy determine the training objective and the structure of the involved neural networks. The training is unsupervised, and is conducted similarly to the greedy pre-training of a stack of auto-encoders. After training is completed, we obtain a linear wavelet transform that can be applied to any graph signal in time and memory linear in the size of the graph. Improved sparsity of our wavelet transform for the test signals is confirmed via experiments both on synthetic and real data. 1

6 0.60176194 285 nips-2013-Robust Transfer Principal Component Analysis with Rank Constraints

7 0.60070103 116 nips-2013-Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA

8 0.59980571 232 nips-2013-Online PCA for Contaminated Data

9 0.59907448 201 nips-2013-Multi-Task Bayesian Optimization

10 0.59868097 99 nips-2013-Dropout Training as Adaptive Regularization

11 0.59747952 233 nips-2013-Online Robust PCA via Stochastic Optimization

12 0.59725553 238 nips-2013-Optimistic Concurrency Control for Distributed Unsupervised Learning

13 0.59472424 104 nips-2013-Efficient Online Inference for Bayesian Nonparametric Relational Models

14 0.59388852 333 nips-2013-Trading Computation for Communication: Distributed Stochastic Dual Coordinate Ascent

15 0.59360081 234 nips-2013-Online Variational Approximations to non-Exponential Family Change Point Models: With Application to Radar Tracking

16 0.59347886 173 nips-2013-Least Informative Dimensions

17 0.59346753 97 nips-2013-Distributed Submodular Maximization: Identifying Representative Elements in Massive Data

18 0.59332705 204 nips-2013-Multiscale Dictionary Learning for Estimating Conditional Distributions

19 0.59296083 287 nips-2013-Scalable Inference for Logistic-Normal Topic Models

20 0.59257144 288 nips-2013-Scalable Influence Estimation in Continuous-Time Diffusion Networks