nips nips2013 nips2013-297 knowledge-graph by maker-knowledge-mining

297 nips-2013-Sketching Structured Matrices for Faster Nonlinear Regression


Source: pdf

Author: Haim Avron, Vikas Sindhwani, David Woodruff

Abstract: Motivated by the desire to extend fast randomized techniques to nonlinear lp regression, we consider a class of structured regression problems. These problems involve Vandermonde matrices which arise naturally in various statistical modeling settings, including classical polynomial fitting problems, additive models and approximations to recently developed randomized techniques for scalable kernel methods. We show that this structure can be exploited to further accelerate the solution of the regression problem, achieving running times that are faster than “input sparsity”. We present empirical results confirming both the practical value of our modeling framework, as well as speedup benefits of randomized regression. 1

Reference: text


Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 com Abstract Motivated by the desire to extend fast randomized techniques to nonlinear lp regression, we consider a class of structured regression problems. [sent-8, score-0.421]

2 These problems involve Vandermonde matrices which arise naturally in various statistical modeling settings, including classical polynomial fitting problems, additive models and approximations to recently developed randomized techniques for scalable kernel methods. [sent-9, score-0.338]

3 We show that this structure can be exploited to further accelerate the solution of the regression problem, achieving running times that are faster than “input sparsity”. [sent-10, score-0.267]

4 We present empirical results confirming both the practical value of our modeling framework, as well as speedup benefits of randomized regression. [sent-11, score-0.119]

5 1 Introduction Recent literature has advocated the use of randomization as a key algorithmic device with which to dramatically accelerate statistical learning with lp regression or low-rank matrix approximation techniques [12, 6, 8, 10]. [sent-12, score-0.31]

6 Consider the following class of regression problems, arg min Zx − b p , where p = 1, 2 x∈C (1) where C is a convex constraint set, Z ∈ Rn×k is a sample-by-feature design matrix, and b ∈ Rn is the target vector. [sent-13, score-0.216]

7 Similarly, a randomized solver for l1 regression runs in time O(nk log n) + poly(k −1 ) [5]. [sent-18, score-0.269]

8 In many settings, what makes such acceleration possible is the existence of a suitable oblivious subspace embedding (OSE). [sent-19, score-0.173]

9 An OSE can be thought of as a data-independent random “sketching” matrix S ∈ Rt×n whose approximate isometry properties over a subspace (e. [sent-20, score-0.15]

10 Sketching matrices include Gaussian random matrices, structured random matrices which admit fast matrix multiplication via FFT-like operations, and others. [sent-23, score-0.321]

11 Clarkson and Woodruff have recently shown that when Z is sparse and has nnz(Z) nk non-zeros, it is possible to achieve much faster “input-sparsity” runtime using hashing-based sketching matrices [7]. [sent-25, score-0.385]

12 ◦ Can faster and more accurate sketching techniques be designed for nonlinear and nonparametric regression? [sent-27, score-0.313]

13 To see that this is intertwined with the previous question, consider the basic problem q of fitting a polynomial model, b = i=1 βi z i to a set of samples (zi , bi ) ∈ R × R, i = 1, . [sent-28, score-0.095]

14 Then, the design matrix Z has Vandermonde structure which can potentially be exploited in a regression solver. [sent-32, score-0.29]

15 xq−1 0 1 n−1 Vandermonde matrices of dimension q × n require only O(n) implicit storage and admit O((n + q) log2 q) matrix-vector multiplication time. [sent-67, score-0.118]

16 We also define the following matrix operator Tq which maps a matrix A to a block-Vandermonde structured matrix. [sent-68, score-0.252]

17 Definition 2 (Matrix Operator) Given a matrix A ∈ Rn×d , we define the following matrix: Tq (A) = Vq,n (A1,1 , . [sent-69, score-0.074]

18 , An,d )T In this paper, we consider regression problems, Eqn. [sent-78, score-0.169]

19 1, where Z can be written as Z = Tq (A) (2) for an n × d matrix A, so that k = dq. [sent-79, score-0.074]

20 The operator Tq expands each feature (column) of the original dataset A to q columns of Z by applying monomial transformations upto degree q − 1. [sent-80, score-0.111]

21 Such structure naturally arises in polynomial regression problems, but also applies more broadly to non-parametric additive models and kernel methods as we discuss below. [sent-82, score-0.39]

22 Our contributions in this paper are as follows: ◦ For p = 2, we provide an algorithm that solves the structured regression problem above in time O(nnz(A) log2 q) + poly(dq −1 ). [sent-84, score-0.271]

23 By combining our sketching methods with preconditioned iterative solvers, we can also obtain logarithmic dependence on . [sent-85, score-0.238]

24 For p = 1, we provide an algorithm with runtime O(nnz(A) log n log2 q) + poly(dq −1 log n). [sent-86, score-0.083]

25 e, Z = A) to nonlinear regression (Z = Tq (A))) incurs only a mild additional log2 q runtime cost, while requiring no extra storage! [sent-88, score-0.233]

26 Since nnz(Tq (A)) = q nnz(A), this provides - to our knowledge - the first sketching approach that operates faster than “input-sparsity” time, i. [sent-89, score-0.28]

27 we sketch Tq (A) in time faster than nnz(Tq (A)). [sent-91, score-0.095]

28 ◦ Our algorithms apply to a broad class of nonlinear models for both least squares regression and their robust l1 regression counterparts. [sent-92, score-0.398]

29 We argue that our approach provides a more flexible modeling framework when compared to randomized Fourier maps for kernel methods [16, 11]. [sent-94, score-0.148]

30 2 Polynomial Fitting, Additive Models and Random Fourier Maps Our primary goal in this section is to motivate sketching approaches for a versatile class of BlockVandermonde structured regression problems by showing that these problems arise naturally in various statistical modeling settings. [sent-96, score-0.483]

31 The most basic application is the one-dimensional (d = 1) polynomial regression. [sent-97, score-0.095]

32 In multivariate additive regression models, a continuous target variable y ∈ R and input variables d z ∈ Rd are related through the model y = µ + i=1 fi (zi ) + i where µ is an intercept term, i are zero-mean Gaussian error terms and fi are smooth univariate functions. [sent-98, score-0.295]

33 β dq ]T typically by a constrained or penalized least squares model, argminx∈C Zx − b 2 where b = (y1 . [sent-105, score-0.354]

34 It is easy to see that choosing a monomial basis hi,s (u) = us immediately maps the design matrix Z to the structured regression form of Eqn. [sent-116, score-0.422]

35 For p = 1, our algorithms also provide fast solvers for robust polynomial additive models. [sent-118, score-0.235]

36 A degree-q multivariate polynomial function space Pq is spanned by {z α , α ∈ {0, . [sent-130, score-0.095]

37 Pq admits all possible degree-q interactions but has dimensionality dq which is computationally infeasible to explicitly work with except for low-degrees and low-dimensional or sparse datasets [3]. [sent-134, score-0.327]

38 Kernel methods with polynomial kernels q k(z, z ) = z T z = α z α z α provide an implicit mechanism to compute inner products in the feature space associated with Pq . [sent-135, score-0.095]

39 However, they require O(n3 ) computation for solving associated kernelized (ridge) regression problems and O(n2 ) storage of dense n × n Gram matrices K (given by Kij = k(zi , zj )), and therefore do not scale well. [sent-136, score-0.262]

40 For a d × D matrix G let SG be the subspace spanned by   t  d  Gij zi , t = 1 . [sent-137, score-0.188]

41   i=1 Assuming D = dq and that G is a random matrix of i. [sent-144, score-0.401]

42 An intuitively appealing explicit scalable approach is then to use D dq . [sent-147, score-0.353]

43 In that case SG essentially spans a random subspace of Pq . [sent-148, score-0.076]

44 The design matrix for solving the multivariate T T polynomial regression restricted to SG has the form Z = Tq (AG) where A = [z1 . [sent-149, score-0.361]

45 This scheme can be in fact related to the idea of random Fourier features introduced by Rahimi and Recht [16] in the context of approximating shift-invariant kernel functions, with the Gaussian Kernel k(z, z ) = exp (− z − z 2 /2σ 2 ) as the primary example. [sent-153, score-0.098]

46 This implies that the Gram matrix of the Gaussian kernel, Kij = exp (− zi − zj 2 /2σ 2 ) may be approximated 2 with high concentration as K ≈ RRT where R = [cos(AG) sin(AG)] ∈ Rn×2D (sine and cosine are applied elementwise as scalar functions). [sent-156, score-0.161]

47 This randomized explicit feature mapping for the Gaussian kernel implies that standard linear regression, with R as the design matrix, can then be used to obtain a solution in time O(nD2 ). [sent-157, score-0.143]

48 3 Fast Structured Regression with Sketching We now develop our randomized solvers for block-Vandermonde structured lp regression problems. [sent-161, score-0.38]

49 In the theoretical developments below, we consider unconstrained regression though our results generalize straightforwardly to convex constraint sets C. [sent-162, score-0.193]

50 One can always repeat the regression procedure O(log(1/δ)) times, each time with independent randomness, and choose the best solution found. [sent-164, score-0.169]

51 Well-Conditioning and Sampling of A Matrix Definition 3 ((α, β, 1)-well-conditioning [8]) Given a matrix M ∈ Rn×d , we say M is (α, β, 1)well-conditioned if (1) x ∞ ≤ β M x 1 for any x ∈ Rd , and (2) M 1 ≤ α. [sent-184, score-0.074]

52 Lemma 4 (Implicit in [20]) Suppose S is an r × n matrix so that for all x ∈ Rd , Mx 1 ≤ SM x 1 ≤ κ M x 1. [sent-185, score-0.074]

53 2 of [8]) Suppose U is an (α, β, 1)-well-conditioned basis of an n × d matrix A. [sent-189, score-0.074]

54 U 1 ε δ Suppose we independently sample each row with probability pi , and create a diagonal matrix S where Si,i = 0 if i is not sampled, and Si,i = 1/pi if i is sampled. [sent-191, score-0.135]

55 Theorem 6 Let U ∈ Rn×d be an (α, β, 1)-well-conditioned basis of an n × d matrix A. [sent-195, score-0.074]

56 • Φ ∈ {0, 1}t×n is a t × n binary matrix with Φh(i),i = 1, and all remaining entries 0. [sent-211, score-0.074]

57 We will also use an oblivious subspace embedding known as the subsampled randomized Hadamard transform, or SRHT. [sent-217, score-0.283]

58 99 over the choice of Π , for any fixed A ∈ Rn×d , we have simultaneously for all x ∈ Rd , (1 − ε) · Ax ≤ (1 + ε) · Ax 2 , √ √ where the number of rows of Π is t = O(ε−2 (log d)( d + log n)2 ), and the time to compute Π A is O(nd log t ). [sent-220, score-0.12]

59 2 ≤ Π Ax 2 1 Norm The results can be generalized to subspace embeddings with respect to the 1 -norm [7, 14, 21]. [sent-221, score-0.122]

60 The best known bounds are due to Woodruff and Zhang [21], so we use their family of embedding matrices in what follows. [sent-222, score-0.105]

61 Their family of matrices Ψ is chosen to be of the form Π · E, where Π is as above with parameter t = d1+γ for arbitrarily small constant γ > 0, and E is a diagonal matrix with Ei,i = 1/ui , where u1 , . [sent-226, score-0.144]

62 Let Π = ΦD be a sparse embedding matrix for the 2 norm with associated hash function h : [n] → [t] for an arbitrary value of t, and let E be any diagonal matrix. [sent-246, score-0.19]

63 3: 4: 5: 6: Let Π = ΦD be a sparse embedding matrix for the 2 norm with t = O((dq)2 /ε2 ). [sent-252, score-0.163]

64 Compute Π (ΠTq (A)) and Π Πb, where Π is a subsampled randomized Hadamard transform of Theorem 7 with t √ √ O(ε−2 (log(dq))( dq + log t)2 ) rows. [sent-255, score-0.49]

65 Hence, in what follows, we assume that d = 1 and our matrix A is a column vector a. [sent-258, score-0.099]

66 For any vector z, there is a deterministic algorithm to compute the matrix vector product Tq (A) · z in O((nnz(A) + dq) log2 q) time. [sent-276, score-0.074]

67 For any vector z, there is a deterministic algorithm to compute the matrix vector product z · Tq (A) in O((nnz(A) + dq) log2 q) time. [sent-279, score-0.074]

68 3 Fast 2 -regression We start by considering the structured regression problem in the case p = 2. [sent-282, score-0.245]

69 p the structured regression with p = 2 in time O(nnz(A) log2 q) + poly(dq/ε). [sent-286, score-0.245]

70 Proof: By the properties of a sparse embedding matrix (see Section 3. [sent-287, score-0.136]

71 99, for t = O((dq)2 /ε2 ), we have simultaneously for all y in the span of the columns of Tq (A) adjoined with b, (1 − ε) y 2 ≤ Πy 2 ≤ (1 + ε) y 2 , since the span of this space has dimension at most dq + 1. [sent-290, score-0.378]

72 3: Let Ψ = ΠE = ΦDE be a subspace embedding matrix for the 1 norm with t = (dq + 1)1+γ for an arbitrarily small constant γ > 0. [sent-301, score-0.239]

73 ˜ Let S be the diagonal matrix of Theorem 6 formed by sampling O(q 1+γ/2 d4+γ/2 ε−2 ) rows of Tq (A) and corresponding entries of b using the scheme of Theorem 6. [sent-311, score-0.143]

74 Theorem 13 There is an algorithm which solves the structured regression problem with p = 2 in time O((nnz(A) + dq) log(1/ε)) + poly(dq) w. [sent-316, score-0.271]

75 4 Fast 1 -regression We now consider the structured regression in the case p = 1. [sent-321, score-0.245]

76 p the structured regression in problem with p = 1 in time O(nnz(A) log n log2 q) + poly(dqε−1 log n). [sent-325, score-0.297]

77 4 Experiments We report two sets of experiments on classification and regression datasets. [sent-328, score-0.169]

78 The first set of experiments compares generalization performance of our structured nonlinear least squares regression models against standard linear regression, and nonlinear regression with random fourier features [16]. [sent-329, score-0.665]

79 As expected, ordinary 2 linear regression is very fast, especially if the matrix is sparse. [sent-333, score-0.243]

80 Additive polynomial regression maintains the sparsity structure so it can exploit fast sparse solvers. [sent-336, score-0.298]

81 When compared with random Fourier features, for the same number of random features D, additive polynomial regression with random features get better results than regression with random Fourier features. [sent-338, score-0.617]

82 However, computing the random features is one of the most expensive steps, so computing better approximations with fewer random features is desirable. [sent-340, score-0.104]

83 Figure 1 reports the benefit of sketching in terms of running times, and the trade-off in terms of accuracy. [sent-341, score-0.238]

84 9 sec classification n = 60, 000, d = 784 k = 10, 000 CPU regression n = 6, 554, d = 21 k = 819 ADULT classification n = 32, 561, d = 123 k = 16, 281 CENSUS regression n = 18, 186, d = 119 k = 2, 273 25. [sent-391, score-0.548]

85 3 sec FOREST COVER classification n = 522, 910, d = 54 k = 58, 102 Table 1: Comparison of testing error and training time of the different methods. [sent-393, score-0.21]

86 For regression tasks, we report yp − y 2 / y where yp is the predicted values and y is the ground truth. [sent-402, score-0.377]

87 ” stands for additive polynomial 7 Sketching Sampling Exact 6 5 4 3 0% 10% 20% 30% 40% Sketch Size (% of Examples) (b) Figure 1: Examining the performance of sketching. [sent-404, score-0.175]

88 50% (c) We compute 1,500 random features, and then solve the corresponding additive polynomial regression problem with q = 4, both exactly and with sketching to different number of rows. [sent-405, score-0.582]

89 Figure 1 (a) plots the speedup of the sketched method relative to the exact solution. [sent-407, score-0.087]

90 In these experiments we use a non-optimized straightforward implementation that does not exploit fast Vandermonde multiplication or parallel processing. [sent-408, score-0.085]

91 We measured only the time required to solve the regression problem. [sent-410, score-0.169]

92 Figure 1 (b) explores the sub-optimality in solving the regression problem. [sent-413, score-0.169]

93 We see that indeed the error decreases as the size of the sampled matrix grows, and that with a sketch size that is not too big we get to about a 10% larger objective. [sent-415, score-0.127]

94 Encouragingly, a sketch as small as 15% of the number of examples is enough to have a very small increase in error rate, while still solving the regression problem more than 5 times faster (the speedup is expected to grow for larger datasets). [sent-417, score-0.309]

95 The Fast Cauchy Transform and faster faster robust regression. [sent-458, score-0.084]

96 Low rank approximation and regression in input sparsity time. [sent-475, score-0.169]

97 Low-distortion subspace embeddings in input-sparsity time and applications to robust linear regression. [sent-522, score-0.122]

98 OSNAP: Faster numerical linear algebra algorithms via sparser subspace embeddings. [sent-529, score-0.076]

99 A fast randomized algorithm for overdetermined linear least-squares regression. [sent-540, score-0.108]

100 Subspace embeddings and lp regression using exponential random variables. [sent-561, score-0.25]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('tq', 0.596), ('nnz', 0.328), ('dq', 0.327), ('sketching', 0.238), ('sec', 0.21), ('regression', 0.169), ('vandermonde', 0.156), ('ax', 0.127), ('poly', 0.11), ('fourier', 0.106), ('yp', 0.104), ('polynomial', 0.095), ('additive', 0.08), ('egression', 0.078), ('truct', 0.078), ('zx', 0.078), ('structured', 0.076), ('subspace', 0.076), ('matrix', 0.074), ('randomized', 0.074), ('rn', 0.07), ('pq', 0.068), ('ag', 0.065), ('woodruff', 0.063), ('embedding', 0.062), ('clarkson', 0.06), ('dtq', 0.058), ('sketch', 0.053), ('features', 0.052), ('sg', 0.052), ('monomial', 0.052), ('multiplication', 0.051), ('rd', 0.048), ('embeddings', 0.046), ('kernel', 0.046), ('speedup', 0.045), ('matrices', 0.043), ('sketched', 0.042), ('faster', 0.042), ('rows', 0.042), ('xq', 0.041), ('drineas', 0.041), ('detq', 0.039), ('maclaurin', 0.039), ('oly', 0.039), ('ose', 0.039), ('ourier', 0.039), ('sax', 0.039), ('sine', 0.039), ('stoc', 0.039), ('lk', 0.039), ('zi', 0.038), ('lemma', 0.038), ('mahoney', 0.037), ('subsampled', 0.036), ('lp', 0.035), ('oblivious', 0.035), ('pi', 0.034), ('upto', 0.034), ('fast', 0.034), ('sm', 0.034), ('nonlinear', 0.033), ('minx', 0.033), ('sarl', 0.032), ('title', 0.032), ('accelerate', 0.032), ('id', 0.031), ('runtime', 0.031), ('nk', 0.031), ('hadamard', 0.03), ('meng', 0.03), ('symposium', 0.029), ('rahimi', 0.028), ('maps', 0.028), ('transform', 0.027), ('boutsidis', 0.027), ('diagonal', 0.027), ('norm', 0.027), ('xn', 0.027), ('squares', 0.027), ('zj', 0.026), ('simultaneously', 0.026), ('log', 0.026), ('argminx', 0.026), ('charikar', 0.026), ('appealing', 0.026), ('solvers', 0.026), ('solves', 0.026), ('ny', 0.026), ('columns', 0.025), ('column', 0.025), ('storage', 0.024), ('constraint', 0.024), ('kij', 0.024), ('exploited', 0.024), ('df', 0.023), ('design', 0.023), ('fi', 0.023), ('ibm', 0.023), ('cosine', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 297 nips-2013-Sketching Structured Matrices for Faster Nonlinear Regression

Author: Haim Avron, Vikas Sindhwani, David Woodruff

Abstract: Motivated by the desire to extend fast randomized techniques to nonlinear lp regression, we consider a class of structured regression problems. These problems involve Vandermonde matrices which arise naturally in various statistical modeling settings, including classical polynomial fitting problems, additive models and approximations to recently developed randomized techniques for scalable kernel methods. We show that this structure can be exploited to further accelerate the solution of the regression problem, achieving running times that are faster than “input sparsity”. We present empirical results confirming both the practical value of our modeling framework, as well as speedup benefits of randomized regression. 1

2 0.099849164 135 nips-2013-Heterogeneous-Neighborhood-based Multi-Task Local Learning Algorithms

Author: Yu Zhang

Abstract: All the existing multi-task local learning methods are defined on homogeneous neighborhood which consists of all data points from only one task. In this paper, different from existing methods, we propose local learning methods for multitask classification and regression problems based on heterogeneous neighborhood which is defined on data points from all tasks. Specifically, we extend the knearest-neighbor classifier by formulating the decision function for each data point as a weighted voting among the neighbors from all tasks where the weights are task-specific. By defining a regularizer to enforce the task-specific weight matrix to approach a symmetric one, a regularized objective function is proposed and an efficient coordinate descent method is developed to solve it. For regression problems, we extend the kernel regression to multi-task setting in a similar way to the classification case. Experiments on some toy data and real-world datasets demonstrate the effectiveness of our proposed methods. 1

3 0.098512642 91 nips-2013-Dirty Statistical Models

Author: Eunho Yang, Pradeep Ravikumar

Abstract: We provide a unified framework for the high-dimensional analysis of “superposition-structured” or “dirty” statistical models: where the model parameters are a superposition of structurally constrained parameters. We allow for any number and types of structures, and any statistical model. We consider the general class of M -estimators that minimize the sum of any loss function, and an instance of what we call a “hybrid” regularization, that is the infimal convolution of weighted regularization functions, one for each structural component. We provide corollaries showcasing our unified framework for varied statistical models such as linear regression, multiple regression and principal component analysis, over varied superposition structures. 1

4 0.088246852 206 nips-2013-Near-Optimal Entrywise Sampling for Data Matrices

Author: Dimitris Achlioptas, Zohar S. Karnin, Edo Liberty

Abstract: We consider the problem of selecting non-zero entries of a matrix A in order to produce a sparse sketch of it, B, that minimizes A B 2 . For large m n matrices, such that n m (for example, representing n observations over m attributes) we give sampling distributions that exhibit four important properties. First, they have closed forms computable from minimal information regarding A. Second, they allow sketching of matrices whose non-zeros are presented to the algorithm in arbitrary order as a stream, with O 1 computation per non-zero. Third, the resulting sketch matrices are not only sparse, but their non-zero entries are highly compressible. Lastly, and most importantly, under mild assumptions, our distributions are provably competitive with the optimal offline distribution. Note that the probabilities in the optimal offline distribution may be complex functions of all the entries in the matrix. Therefore, regardless of computational complexity, the optimal distribution might be impossible to compute in the streaming model. 1

5 0.083639696 137 nips-2013-High-Dimensional Gaussian Process Bandits

Author: Josip Djolonga, Andreas Krause, Volkan Cevher

Abstract: Many applications in machine learning require optimizing unknown functions defined over a high-dimensional space from noisy samples that are expensive to obtain. We address this notoriously hard challenge, under the assumptions that the function varies only along some low-dimensional subspace and is smooth (i.e., it has a low norm in a Reproducible Kernel Hilbert Space). In particular, we present the SI-BO algorithm, which leverages recent low-rank matrix recovery techniques to learn the underlying subspace of the unknown function and applies Gaussian Process Upper Confidence sampling for optimization of the function. We carefully calibrate the exploration–exploitation tradeoff by allocating the sampling budget to subspace estimation and function optimization, and obtain the first subexponential cumulative regret bounds and convergence rates for Bayesian optimization in high-dimensions under noisy observations. Numerical results demonstrate the effectiveness of our approach in difficult scenarios. 1

6 0.074932761 188 nips-2013-Memory Limited, Streaming PCA

7 0.073346242 33 nips-2013-An Approximate, Efficient LP Solver for LP Rounding

8 0.071960047 281 nips-2013-Robust Low Rank Kernel Embeddings of Multivariate Distributions

9 0.066475786 146 nips-2013-Large Scale Distributed Sparse Precision Estimation

10 0.065281786 179 nips-2013-Low-Rank Matrix and Tensor Completion via Adaptive Sampling

11 0.063805871 120 nips-2013-Faster Ridge Regression via the Subsampled Randomized Hadamard Transform

12 0.063699029 68 nips-2013-Confidence Intervals and Hypothesis Testing for High-Dimensional Statistical Models

13 0.061954215 224 nips-2013-On the Sample Complexity of Subspace Learning

14 0.060708094 116 nips-2013-Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA

15 0.057060558 31 nips-2013-Adaptivity to Local Smoothness and Dimension in Kernel Regression

16 0.055892427 80 nips-2013-Data-driven Distributionally Robust Polynomial Optimization

17 0.052081782 209 nips-2013-New Subsampling Algorithms for Fast Least Squares Regression

18 0.049135324 185 nips-2013-Matrix Completion From any Given Set of Observations

19 0.048098162 259 nips-2013-Provable Subspace Clustering: When LRR meets SSC

20 0.047533967 318 nips-2013-Structured Learning via Logistic Regression


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.151), (1, 0.057), (2, 0.068), (3, 0.047), (4, 0.015), (5, 0.034), (6, -0.021), (7, 0.003), (8, -0.085), (9, 0.004), (10, 0.026), (11, 0.019), (12, -0.006), (13, 0.015), (14, 0.044), (15, 0.027), (16, -0.027), (17, 0.035), (18, -0.028), (19, -0.025), (20, -0.044), (21, -0.041), (22, -0.014), (23, 0.006), (24, 0.017), (25, -0.016), (26, 0.03), (27, -0.024), (28, 0.014), (29, -0.021), (30, -0.036), (31, -0.056), (32, 0.011), (33, 0.022), (34, 0.001), (35, 0.058), (36, -0.085), (37, -0.015), (38, 0.026), (39, -0.08), (40, -0.027), (41, -0.058), (42, -0.112), (43, 0.003), (44, -0.121), (45, 0.032), (46, 0.021), (47, 0.027), (48, 0.015), (49, -0.041)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.90803611 297 nips-2013-Sketching Structured Matrices for Faster Nonlinear Regression

Author: Haim Avron, Vikas Sindhwani, David Woodruff

Abstract: Motivated by the desire to extend fast randomized techniques to nonlinear lp regression, we consider a class of structured regression problems. These problems involve Vandermonde matrices which arise naturally in various statistical modeling settings, including classical polynomial fitting problems, additive models and approximations to recently developed randomized techniques for scalable kernel methods. We show that this structure can be exploited to further accelerate the solution of the regression problem, achieving running times that are faster than “input sparsity”. We present empirical results confirming both the practical value of our modeling framework, as well as speedup benefits of randomized regression. 1

2 0.66227371 91 nips-2013-Dirty Statistical Models

Author: Eunho Yang, Pradeep Ravikumar

Abstract: We provide a unified framework for the high-dimensional analysis of “superposition-structured” or “dirty” statistical models: where the model parameters are a superposition of structurally constrained parameters. We allow for any number and types of structures, and any statistical model. We consider the general class of M -estimators that minimize the sum of any loss function, and an instance of what we call a “hybrid” regularization, that is the infimal convolution of weighted regularization functions, one for each structural component. We provide corollaries showcasing our unified framework for varied statistical models such as linear regression, multiple regression and principal component analysis, over varied superposition structures. 1

3 0.64692473 142 nips-2013-Information-theoretic lower bounds for distributed statistical estimation with communication constraints

Author: Yuchen Zhang, John Duchi, Michael Jordan, Martin J. Wainwright

Abstract: We establish lower bounds on minimax risks for distributed statistical estimation under a communication budget. Such lower bounds reveal the minimum amount of communication required by any procedure to achieve the centralized minimax-optimal rates for statistical estimation. We study two classes of protocols: one in which machines send messages independently, and a second allowing for interactive communication. We establish lower bounds for several problems, including various types of location models, as well as for parameter estimation in regression models. 1

4 0.63987619 209 nips-2013-New Subsampling Algorithms for Fast Least Squares Regression

Author: Paramveer Dhillon, Yichao Lu, Dean P. Foster, Lyle Ungar

Abstract: We address the problem of fast estimation of ordinary least squares (OLS) from large amounts of data (n p). We propose three methods which solve the big data problem by subsampling the covariance matrix using either a single or two stage estimation. All three run in the order of size of input i.e. O(np) and our best method, Uluru, gives an error bound of O( p/n) which is independent of the amount of subsampling as long as it is above a threshold. We provide theoretical bounds for our algorithms in the fixed design (with Randomized Hadamard preconditioning) as well as sub-Gaussian random design setting. We also compare the performance of our methods on synthetic and real-world datasets and show that if observations are i.i.d., sub-Gaussian then one can directly subsample without the expensive Randomized Hadamard preconditioning without loss of accuracy. 1

5 0.63358498 120 nips-2013-Faster Ridge Regression via the Subsampled Randomized Hadamard Transform

Author: Yichao Lu, Paramveer Dhillon, Dean P. Foster, Lyle Ungar

Abstract: We propose a fast algorithm for ridge regression when the number of features is much larger than the number of observations (p n). The standard way to solve ridge regression in this setting works in the dual space and gives a running time of O(n2 p). Our algorithm Subsampled Randomized Hadamard Transform- Dual Ridge Regression (SRHT-DRR) runs in time O(np log(n)) and works by preconditioning the design matrix by a Randomized Walsh-Hadamard Transform with a subsequent subsampling of features. We provide risk bounds for our SRHT-DRR algorithm in the fixed design setting and show experimental results on synthetic and real datasets. 1

6 0.61215031 33 nips-2013-An Approximate, Efficient LP Solver for LP Rounding

7 0.60896891 116 nips-2013-Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA

8 0.60408306 76 nips-2013-Correlated random features for fast semi-supervised learning

9 0.5940299 186 nips-2013-Matrix factorization with binary components

10 0.59229255 31 nips-2013-Adaptivity to Local Smoothness and Dimension in Kernel Regression

11 0.58868718 80 nips-2013-Data-driven Distributionally Robust Polynomial Optimization

12 0.58455032 245 nips-2013-Pass-efficient unsupervised feature selection

13 0.56238753 144 nips-2013-Inverse Density as an Inverse Problem: the Fredholm Equation Approach

14 0.56209356 137 nips-2013-High-Dimensional Gaussian Process Bandits

15 0.55218983 259 nips-2013-Provable Subspace Clustering: When LRR meets SSC

16 0.54707873 244 nips-2013-Parametric Task Learning

17 0.54090649 296 nips-2013-Sinkhorn Distances: Lightspeed Computation of Optimal Transport

18 0.5399968 271 nips-2013-Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima

19 0.53891367 224 nips-2013-On the Sample Complexity of Subspace Learning

20 0.53888917 170 nips-2013-Learning with Invariance via Linear Functionals on Reproducing Kernel Hilbert Space


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.032), (16, 0.071), (33, 0.119), (34, 0.077), (36, 0.02), (41, 0.032), (45, 0.269), (49, 0.028), (56, 0.098), (70, 0.012), (85, 0.051), (89, 0.032), (93, 0.051), (95, 0.033)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.73151559 297 nips-2013-Sketching Structured Matrices for Faster Nonlinear Regression

Author: Haim Avron, Vikas Sindhwani, David Woodruff

Abstract: Motivated by the desire to extend fast randomized techniques to nonlinear lp regression, we consider a class of structured regression problems. These problems involve Vandermonde matrices which arise naturally in various statistical modeling settings, including classical polynomial fitting problems, additive models and approximations to recently developed randomized techniques for scalable kernel methods. We show that this structure can be exploited to further accelerate the solution of the regression problem, achieving running times that are faster than “input sparsity”. We present empirical results confirming both the practical value of our modeling framework, as well as speedup benefits of randomized regression. 1

2 0.72692984 288 nips-2013-Scalable Influence Estimation in Continuous-Time Diffusion Networks

Author: Nan Du, Le Song, Manuel Gomez-Rodriguez, Hongyuan Zha

Abstract: If a piece of information is released from a media site, can we predict whether it may spread to one million web pages, in a month ? This influence estimation problem is very challenging since both the time-sensitive nature of the task and the requirement of scalability need to be addressed simultaneously. In this paper, we propose a randomized algorithm for influence estimation in continuous-time diffusion networks. Our algorithm can estimate the influence of every node in a network with |V| nodes and |E| edges to an accuracy of using n = O(1/ 2 ) randomizations and up to logarithmic factors O(n|E|+n|V|) computations. When used as a subroutine in a greedy influence maximization approach, our proposed algorithm is guaranteed to find a set of C nodes with the influence of at least (1 − 1/e) OPT −2C , where OPT is the optimal value. Experiments on both synthetic and real-world data show that the proposed algorithm can easily scale up to networks of millions of nodes while significantly improves over previous state-of-the-arts in terms of the accuracy of the estimated influence and the quality of the selected nodes in maximizing the influence. 1

3 0.64593685 218 nips-2013-On Sampling from the Gibbs Distribution with Random Maximum A-Posteriori Perturbations

Author: Tamir Hazan, Subhransu Maji, Tommi Jaakkola

Abstract: In this paper we describe how MAP inference can be used to sample efficiently from Gibbs distributions. Specifically, we provide means for drawing either approximate or unbiased samples from Gibbs’ distributions by introducing low dimensional perturbations and solving the corresponding MAP assignments. Our approach also leads to new ways to derive lower bounds on partition functions. We demonstrate empirically that our method excels in the typical “high signal high coupling” regime. The setting results in ragged energy landscapes that are challenging for alternative approaches to sampling and/or lower bounds. 1

4 0.58252329 45 nips-2013-BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables

Author: Cho-Jui Hsieh, Matyas A. Sustik, Inderjit Dhillon, Pradeep Ravikumar, Russell Poldrack

Abstract: The 1 -regularized Gaussian maximum likelihood estimator (MLE) has been shown to have strong statistical guarantees in recovering a sparse inverse covariance matrix even under high-dimensional settings. However, it requires solving a difficult non-smooth log-determinant program with number of parameters scaling quadratically with the number of Gaussian variables. State-of-the-art methods thus do not scale to problems with more than 20, 000 variables. In this paper, we develop an algorithm B IG QUIC, which can solve 1 million dimensional 1 regularized Gaussian MLE problems (which would thus have 1000 billion parameters) using a single machine, with bounded memory. In order to do so, we carefully exploit the underlying structure of the problem. Our innovations include a novel block-coordinate descent method with the blocks chosen via a clustering scheme to minimize repeated computations; and allowing for inexact computation of specific components. In spite of these modifications, we are able to theoretically analyze our procedure and show that B IG QUIC can achieve super-linear or even quadratic convergence rates. 1

5 0.57919848 355 nips-2013-Which Space Partitioning Tree to Use for Search?

Author: Parikshit Ram, Alexander Gray

Abstract: We consider the task of nearest-neighbor search with the class of binary-spacepartitioning trees, which includes kd-trees, principal axis trees and random projection trees, and try to rigorously answer the question “which tree to use for nearestneighbor search?” To this end, we present the theoretical results which imply that trees with better vector quantization performance have better search performance guarantees. We also explore another factor affecting the search performance – margins of the partitions in these trees. We demonstrate, both theoretically and empirically, that large margin partitions can improve tree search performance. 1 Nearest-neighbor search Nearest-neighbor search is ubiquitous in computer science. Several techniques exist for nearestneighbor search, but most algorithms can be categorized into two following groups based on the indexing scheme used – (1) search with hierarchical tree indices, or (2) search with hash-based indices. Although multidimensional binary space-partitioning trees (or BSP-trees), such as kd-trees [1], are widely used for nearest-neighbor search, it is believed that their performances degrade with increasing dimensions. Standard worst-case analyses of search with BSP-trees in high dimensions usually lead to trivial guarantees (such as, an Ω(n) search time guarantee for a single nearest-neighbor query in a set of n points). This is generally attributed to the “curse of dimensionality” – in the worst case, the high dimensionality can force the search algorithm to visit every node in the BSP-tree. However, these BSP-trees are very simple and intuitive, and still used in practice with success. The occasional favorable performances of BSP-trees in high dimensions are attributed to the low “intrinsic” dimensionality of real data. However, no clear relationship between the BSP-tree search performance and the intrinsic data properties is known. We present theoretical results which link the search performance of BSP-trees to properties of the data and the tree. This allows us to identify implicit factors influencing BSP-tree search performance — knowing these driving factors allows us to develop successful heuristics for BSP-trees with improved search performance. Each node in a BSP-tree represents a region of the space and each non-leaf node has a left and right child representing a disjoint partition of this region with some separating hyperplane and threshold (w, b). A search query on this tree is usually answered with a depth-first branch-and-bound algorithm. Algorithm 1 presents a simplified version where a search query is answered with a small set of neighbor candidates of any desired size by performing a greedy depth-first tree traversal to a specified depth. This is known as defeatist tree search. We are not aware of any data-dependent analysis of the quality of the results from defeatist BSP-tree search. However, Verma et al. (2009) [2] presented adaptive data-dependent analyses of some BSP-trees for the task of vector quantization. These results show precise connections between the quantization performance of the BSP-trees and certain properties of the data (we will present these data properties in Section 2). 1 Algorithm 1 BSP-tree search Input: BSP-tree T on set S, Query q, Desired depth l Output: Candidate neighbor p current tree depth lc ← 0 current tree node Tc ← T while lc < l do if Tc .w, q + Tc .b ≤ 0 then Tc ← Tc .left child else Tc ← Tc .right child end if Increment depth lc ← lc + 1 end while p ← arg minr∈Tc ∩S q − r . (a) kd-tree (b) RP-tree (c) MM-tree Figure 1: Binary space-partitioning trees. We establish search performance guarantees for BSP-trees by linking their nearest-neighbor performance to their vector quantization performance and utilizing the recent guarantees on the BSP-tree vector quantization. Our results provide theoretical evidence, for the first time, that better quantization performance implies better search performance1 . These results also motivate the use of large margin BSP-trees, trees that hierarchically partition the data with a large (geometric) margin, for better nearest-neighbor search performance. After discussing some existing literature on nearestneighbor search and vector quantization in Section 2, we discuss our following contributions: • We present performance guarantees for Algorithm 1 in Section 3, linking search performance to vector quantization performance. Specifically, we show that for any balanced BSP-tree and a depth l, under some conditions, the worst-case search error incurred by the neighbor candidate returned by Algorithm 1 is proportional to a factor which is 2l/2 exp(−l/2β) , (n/2l )1/O(d) − 2 where β corresponds to the quantization performance of the tree (smaller β implies smaller quantization error) and d is closely related to the doubling dimension of the dataset (as opposed to the ambient dimension D of the dataset). This implies that better quantization produces better worst-case search results. Moreover, this result implies that smaller l produces improved worstcase performance (smaller l does imply more computation, hence it is intuitive to expect less error at the cost of computation). Finally, there is also the expected dependence on the intrinsic dimensionality d – increasing d implies deteriorating worst-case performance. The theoretical results are empirically verified in this section as well. • In Section 3, we also show that the worst-case search error for Algorithm 1 with a BSP-tree T is proportional to (1/γ) where γ is the smallest margin size of all the partitions in T . • We present the quantization performance guarantee of a large margin BSP tree in Section 4. O These results indicate that for a given dataset, the best BSP-tree for search is the one with the best combination of low quantization error and large partition margins. We conclude with this insight and related unanswered questions in Section 5. 2 Search and vector quantization Binary space-partitioning trees (or BSP-trees) are hierarchical data structures providing a multiresolution view of the dataset indexed. There are several space-partitioning heuristics for a BSPtree construction. A tree is constructed by recursively applying a heuristic partition. The most popular kd-tree uses axis-aligned partitions (Figure 1(a)), often employing a median split along the coordinate axis of the data in the tree node with the largest spread. The principal-axis tree (PA-tree) partitions the space at each node at the median along the principal eigenvector of the covariance matrix of the data in that node [3, 4]. Another heuristic partitions the space based on a 2-means clustering of the data in the node to form the two-means tree (2M-tree) [5, 6]. The random-projection tree (RP-tree) partitions the space by projecting the data along a random standard normal direction and choosing an appropriate splitting threshold [7] (Figure 1(b)). The max-margin tree (MM-tree) is built by recursively employing large margin partitions of the data [8] (Figure 1(c)). The unsupervised large margin splits are usually performed using max-margin clustering techniques [9]. Search. Nearest-neighbor search with a BSP-tree usually involves a depth-first branch-and-bound algorithm which guarantees the search approximation (exact search is a special case of approximate search with zero approximation) by a depth-first traversal of the tree followed by a backtrack up the tree as required. This makes the tree traversal unpredictable leading to trivial worst-case runtime 1 This intuitive connection is widely believed but never rigorously established to the best of our knowledge. 2 guarantees. On the other hand, locality-sensitive hashing [10] based methods approach search in a different way. After indexing the dataset into hash tables, a query is answered by selecting candidate points from these hash tables. The candidate set size implies the worst-case search time bound. The hash table construction guarantees the set size and search approximation. Algorithm 1 uses a BSPtree to select a candidate set for a query with defeatist tree search. For a balanced tree on n points, the candidate set size at depth l is n/2l and the search runtime is O(l + n/2l ), with l ≤ log2 n. For any choice of the depth l, we present the first approximation guarantee for this search process. Defeatist BSP-tree search has been explored with the spill tree [11], a binary tree with overlapping sibling nodes unlike the disjoint nodes in the usual BSP-tree. The search involves selecting the candidates in (all) the leaf node(s) which contain the query. The level of overlap guarantees the search approximation, but this search method lacks any rigorous runtime guarantee; it is hard to bound the number of leaf nodes that might contain any given query. Dasgupta & Sinha (2013) [12] show that the probability of finding the exact nearest neighbor with defeatist search on certain randomized partition trees (randomized spill trees and RP-trees being among them) is directly proportional to the relative contrast of the search task [13], a recently proposed quantity which characterizes the difficulty of a search problem (lower relative contrast makes exact search harder). Vector Quantization. Recent work by Verma et al., 2009 [2] has established theoretical guarantees for some of these BSP-trees for the task of vector quantization. Given a set of points S ⊂ RD of n points, the task of vector quantization is to generate a set of points M ⊂ RD of size k n with low average quantization error. The optimal quantizer for any region A is given by the mean µ(A) of the data points lying in that region. The quantization error of the region A is then given by VS (A) = 1 |A ∩ S| x − µ(A) 2 2 , (1) x∈A∩S and the average quantization error of a disjoint partition of region A into Al and Ar is given by: VS ({Al , Ar }) = (|Al ∩ S|VS (Al ) + |Ar ∩ S|VS (Ar )) /|A ∩ S|. (2) Tree-based structured vector quantization is used for efficient vector quantization – a BSP-tree of depth log2 k partitions the space containing S into k disjoint regions to produce a k-quantization of S. The theoretical results for tree-based vector quantization guarantee the improvement in average quantization error obtained by partitioning any single region (with a single quantizer) into two disjoints regions (with two quantizers) in the following form (introduced by Freund et al. (2007) [14]): Definition 2.1. For a set S ⊂ RD , a region A partitioned into two disjoint regions {Al , Ar }, and a data-dependent quantity β > 1, the quantization error improvement is characterized by: VS ({Al , Ar }) < (1 − 1/β) VS (A). (3) Tree PA-tree RP-tree kd-tree 2M-tree MM-tree∗ Definition of β . D O( 2 ) : = i=1 λi /λ1 O(dc ) × optimal (smallest possible) . D 2 O(ρ) : ρ = i=1 λi /γ The quantization performance depends inversely on the data-dependent quantity β – lower β implies bet- Table 1: β for various trees. λ1 , . . . , λD are ter quantization. We present the definition of β for the sorted eigenvalues of the covariance matrix different BSP-trees in Table 1. For the PA-tree, β of A ∩ S in descending order, and dc < D is depends on the ratio of the sum of the eigenval- the covariance dimension of A ∩ S. The results ues of the covariance matrix of data (A ∩ S) to the for PA-tree and 2M-tree are due to Verma et al. principal eigenvalue. The improvement rate β for (2009) [2]. The PA-tree result can be improved to the RP-tree depends on the covariance dimension O( ) from O( 2 ) with an additional assumption of the data in the node A (β = O(dc )) [7], which [2]. The RP-tree result is in Freund et al. (2007) roughly corresponds to the lowest dimensionality of [14], which also has the precise definition of dc . an affine plane that captures most of the data covari- We establish the result for MM-tree in Section 4. ance. The 2M-tree does not have an explicit β but γ is the margin size of the large margin partition. it has the optimal theoretical improvement rate for a No such guarantee for kd-trees is known to us. single partition because the 2-means clustering objective is equal to |Al |V(Al ) + |Ar |V(Ar ) and minimizing this objective maximizes β. The 2means problem is NP-hard and an approximate solution is used in practice. These theoretical results are valid under the condition that there are no outliers in A ∩ S. This is characterized as 2 maxx,y∈A∩S x − y ≤ ηVS (A) for a fixed η > 0. This notion of the absence of outliers was first introduced for the theoretical analysis of the RP-trees [7]. Verma et al. (2009) [2] describe outliers as “points that are much farther away from the mean than the typical distance-from-mean”. In this situation, an alternate type of partition is used to remove these outliers that are farther away 3 from the mean than expected. For η ≥ 8, this alternate partitioning is guaranteed to reduce the data diameter (maxx,y∈A∩S x − y ) of the resulting nodes by a constant fraction [7, Lemma 12], and can be used until a region contain no outliers, at which point, the usual hyperplane partition can be used with their respective theoretical quantization guarantees. The implicit assumption is that the alternate partitioning scheme is employed rarely. These results for BSP-tree quantization performance indicate that different heuristics are adaptive to different properties of the data. However, no existing theoretical result relates this performance of BSP-trees to their search performance. Making the precise connection between the quantization performance and the search performance of these BSP-trees is a contribution of this paper. 3 Approximation guarantees for BSP-tree search In this section, we formally present the data and tree dependent performance guarantees on the search with BSP-trees using Algorithm 1. The quality of nearest-neighbor search can be quantized in two ways – (i) distance error and (ii) rank of the candidate neighbor. We present guarantees for both notions of search error2 . For a query q and a set of points S and a neighbor candidate p ∈ S, q−p distance error (q) = minr∈S q−r − 1, and rank τ (q) = |{r ∈ S : q − r < q − p }| + 1. Algorithm 1 requires the query traversal depth l as an input. The search runtime is O(l + (n/2l )). The depth can be chosen based on the desired runtime. Equivalently, the depth can be chosen based on the desired number of candidates m; for a balanced binary tree on a dataset S of n points with leaf nodes containing a single point, the appropriate depth l = log2 n − log2 m . We will be building on the existing results on vector quantization error [2] to present the worst case error guarantee for Algorithm 1. We need the following definitions to precisely state our results: Definition 3.1. An ω-balanced split partitioning a region A into disjoint regions {A1 , A2 } implies ||A1 ∩ S| − |A2 ∩ S|| ≤ ω|A ∩ S|. For a balanced tree corresponding to recursive median splits, such as the PA-tree and the kd-tree, ω ≈ 0. Non-zero values of ω 1, corresponding to approximately balanced trees, allow us to potentially adapt better to some structure in the data at the cost of slightly losing the tree balance. For the MM-tree (discussed in detail in Section 4), ω-balanced splits are enforced for any specified value of ω. Approximately balanced trees have a depth bound of O(log n) [8, Theorem 3.1]. For l a tree with ω-balanced splits, the worst case runtime of Algorithm 1 is O l + 1+ω n . For the 2 2M-tree, ω-balanced splits are not enforced. Hence the actual value of ω could be high for a 2M-tree. Definition 3.2. Let B 2 (p, ∆) = {r ∈ S : p − r < ∆} denote the points in S contained in a ball of radius ∆ around some p ∈ S with respect to the 2 metric. The expansion constant of (S, 2 ) is defined as the smallest c ≥ 2 such B 2 (p, 2∆) ≤ c B 2 (p, ∆) ∀p ∈ S and ∀∆ > 0. Bounded expansion constants correspond to growth-restricted metrics [15]. The expansion constant characterizes the data distribution, and c ∼ 2O(d) where d is the doubling dimension of the set S with respect to the 2 metric. The relationship is exact for points on a D-dimensional grid (i.e., c = Θ(2D )). Equipped with these definitions, we have the following guarantee for Algorithm 1: 2 1 Theorem 3.1. Consider a dataset S ⊂ RD of n points with ψ = 2n2 x,y∈S x − y , the BSP tree T built on S and a query q ∈ RD with the following conditions : (C1) (C2) (C3) (C4) Let (A ∩ (S ∪ {q}), 2 ) have an expansion constant at most c for any convex set A ⊂ RD . ˜ Let T be complete till a depth L < log2 n /(1 − log2 (1 − ω)) with ω-balanced splits. c ˜ Let β ∗ correspond to the worst quantization error improvement rate over all splits in T . 2 For any node A in the tree T , let maxx,y∈A∩S x − y ≤ ηVS (A) for a fixed η ≥ 8. For α = 1/(1 − ω), the upper bound du on the distance of q to the neighbor candidate p returned by Algorithm 1 with depth l ≤ L is given by √ 2 ηψ · (2α)l/2 · exp(−l/2β ∗ ) q − p ≤ du = . (4) 1/ log2 c ˜ (n/(2α)l ) −2 2 The distance error corresponds to the relative error in terms of the actual distance values. The rank is one more than the number of points in S which are better neighbor candidates than p. The nearest-neighbor of q has rank 1 and distance error 0. The appropriate notion of error depends on the search application. 4 Now η is fixed, and ψ is fixed for a dataset S. Then, for a fixed ω, this result implies that between two types of BSP-trees on the same set and the same query, Algorithm 1 has a better worst-case guarantee on the candidate-neighbor distance for the tree with better quantization performance (smaller β ∗ ). Moreover, for a particular tree with β ∗ ≥ log2 e, du is non-decreasing in l. This is expected because as we traverse down the tree, we can never reduce the candidate neighbor distance. At the root level (l = 0), the candidate neighbor is the nearest-neighbor. As we descend down the tree, the candidate neighbor distance will worsen if a tree split separates the query from its closer neighbors. This behavior is implied in Equation (4). For a chosen depth l in Algorithm 1, the candidate 1/ log2 c ˜ , implying deteriorating bounds du neighbor distance is inversely proportional to n/(2α)l with increasing c. Since log2 c ∼ O(d), larger intrinsic dimensionality implies worse guarantees as ˜ ˜ expected from the curse of dimensionality. To prove Theorem 3.1, we use the following result: Lemma 3.1. Under the conditions of Theorem 3.1, for any node A at a depth l in the BSP-tree T l on S, VS (A) ≤ ψ (2/(1 − ω)) exp(−l/β ∗ ). This result is obtained by recursively applying the quantization error improvement in Definition 2.1 over l levels of the tree (the proof is in Appendix A). Proof of Theorem 3.1. Consider the node A at depth l in the tree containing q, and let m = |A ∩ S|. Let D = maxx,y∈A∩S x − y , let d = minx∈A∩S q − x , and let B 2 (q, ∆) = {x ∈ A ∩ (S ∪ {q}) : q − x < ∆}. Then, by the Definition 3.2 and condition C1, D+d D+d D+2d B (q, D + d) ≤ clog2 d |B (q, d)| = clog2 d ≤ clog2 ( d ) , ˜ ˜ ˜ 2 2 where the equality follows from the fact that B 2 (q, d) = {q}. Now B 2 (q, D + d) ≥ m. Using ˜ ˜ this above gives us m1/ log2 c ≤ (D/d) + 2. By condition C2, m1/ log2 c > 2. Hence we have 1/ log2 c ˜ d ≤ D/(m − 2). By construction and condition C4, D ≤ ηVS (A). Now m ≥ n/(2α)l . Plugging this above and utilizing Lemma 3.1 gives us the statement of Theorem 3.1. Nearest-neighbor search error guarantees. Equipped with the bound on the candidate-neighbor distance, we bound the worst-case nearest-neighbor search errors as follows: Corollary 3.1. Under the conditions of Theorem 3.1, for any query q at a desired depth l ≤ L in Algorithm 1, the distance error (q) is bounded as (q) ≤ (du /d∗ ) − 1, and the rank τ (q) is q u ∗ bounded as τ (q) ≤ c log2 (d /dq ) , where d∗ = minr∈S q − r . ˜ q Proof. The distance error bound follows from the definition of distance error. Let R = {r ∈ S : q − r < du }. By definition, τ (q) ≤ |R| + 1. Let B 2 (q, ∆) = {x ∈ (S ∪ {q}) : q − x < ∆}. Since B 2 (q, du ) contains q and R, and q ∈ S, |B 2 (q, du )| = |R| + 1 ≥ τ (q). From Definition / 3.2 and Condition C1, |B 2 (q, du )| ≤ c log2 (d ˜ |{q}| = 1 gives us the upper bound on τ (q). u /d∗ ) q |B 2 (q, d∗ )|. Using the fact that |B 2 (q, d∗ )| = q q The upper bounds on both forms of search error are directly proportional to du . Hence, the BSPtree with better quantization performance has better search performance guarantees, and increasing traversal depth l implies less computation but worse performance guarantees. Any dependence of this approximation guarantee on the ambient data dimensionality is subsumed by the dependence on β ∗ and c. While our result bounds the worst-case performance of Algorithm 1, an average case ˜ performance guarantee on the distance error is given by Eq (q) ≤ du Eq 1/d∗ −1, and on the rank q u − log d∗ is given by E τ (q) ≤ c log2 d ˜ E c ( 2 q ) , since the expectation is over the queries q and du q q does not depend on q. For the purposes of relative comparison among BSP-trees, the bounds on the expected error depend solely on du since the term within the expectation over q is tree independent. Dependence of the nearest-neighbor search error on the partition margins. The search error bounds in Corollary 3.1 depend on the true nearest-neighbor distance d∗ of any query q of which we q have no prior knowledge. However, if we partition the data with a large margin split, then we can say that either the candidate neighbor is the true nearest-neighbor of q or that d∗ is greater than the q size of the margin. We characterize the influence of the margin size with the following result: Corollary 3.2. Consider the conditions of Theorem 3.1 and a query q at a depth l ≤ L in Algorithm 1. Further assume that γ is the smallest margin size on both sides of any partition in the tree T .uThen the distance error is bounded as (q) ≤ du /γ − 1, and the rank is bounded as τ (q) ≤ c log2 (d /γ) . ˜ This result indicates that if the split margins in a BSP-tree can be increased without adversely affecting its quantization performance, the BSP-tree will have improved nearest-neighbor error guarantees 5 for the Algorithm 1. This motivated us to consider the max-margin tree [8], a BSP-tree that explicitly maximizes the margin of the split for every split in the tree. Explanation of the conditions in Theorem 3.1. Condition C1 implies that for any convex set A ⊂ RD , ((A ∩ (S ∪ {q})), 2 ) has an expansion constant at most c. A bounded c implies that no ˜ ˜ subset of (S ∪ {q}), contained in a convex set, has a very high expansion constant. This condition implies that ((S ∪ {q}), 2 ) also has an expansion constant at most c (since (S ∪ {q}) is contained in ˜ its convex hull). However, if (S ∪ {q}, 2 ) has an expansion constant c, this does not imply that the data lying within any convex set has an expansion constant at most c. Hence a bounded expansion constant assumption for (A∩(S ∪{q}), 2 ) for every convex set A ⊂ RD is stronger than a bounded expansion constant assumption for (S ∪ {q}, 2 )3 . Condition C2 ensures that the tree is complete so that for every query q and a depth l ≤ L, there exists a large enough tree node which contains q. Condition C3 gives us the worst quantization error improvement rate over all the splits in the tree. 2 Condition C4 implies that the squared data diameter of any node A (maxx,y∈A∩S x − y ) is within a constant factor of its quantization error VS (A). This refers to the assumption that the node A contains no outliers as described in Section 3 and only hyperplane partitions are used and their respective quantization improvement guarantees presented in Section 2 (Table 1) hold. By placing condition C4, we ignore the alternate partitioning scheme used to remove outliers for simplicity of analysis. If we allow a small fraction of the partitions in the tree to be this alternate split, a similar result can be obtained since the alternate split is the same for all BSP-tree. For two different kinds of hyperplane splits, if alternate split is invoked the same number of times in the tree, the difference in their worst-case guarantees for both the trees would again be governed by their worstcase quantization performance (β ∗ ). However, for any fixed η, a harder question is whether one type of hyperplane partition violates the inlier condition more often than another type of partition, resulting in more alternate partitions. And we do not yet have a theoretical answer for this4 . Empirical validation. We examine our theoretical results with 4 datasets – O PTDIGITS (D = 64, n = 3823, 1797 queries), T INY I MAGES (D = 384, n = 5000, 1000 queries), MNIST (D = 784, n = 6000, 1000 queries), I MAGES (D = 4096, n = 500, 150 queries). We consider the following BSP-trees: kd-tree, random-projection (RP) tree, principal axis (PA) tree, two-means (2M) tree and max-margin (MM) tree. We only use hyperplane partitions for the tree construction. This is because, firstly, the check for the presence of outliers (∆2 (A) > ηVS (A)) can be computationally S expensive for large n, and, secondly, the alternate partition is mostly for the purposes of obtaining theoretical guarantees. The implementation details for the different tree constructions are presented in Appendix C. The performance of these BSP-trees are presented in Figure 2. Trees with missing data points for higher depth levels (for example, kd-tree in Figure 2(a) and 2M-tree in Figures 2 (b) & (c)) imply that we were unable to grow complete BSP-trees beyond that depth. The quantization performance of the 2M-tree, PA-tree and MM-tree are significantly better than the performance of the kd-tree and RP-tree and, as suggested by Corollary 3.1, this is also reflected in their search performance. The MM-tree has comparable quantization performance to the 2M-tree and PA-tree. However, in the case of search, the MM-tree outperforms PA-tree in all datasets. This can be attributed to the large margin partitions in the MM-tree. The comparison to 2M-tree is not as apparent. The MM-tree and PA-tree have ω-balanced splits for small ω enforced algorithmically, resulting in bounded depth and bounded computation of O(l + n(1 + ω)l /2l ) for any given depth l. No such balance constraint is enforced in the 2-means algorithm, and hence, the 2M-tree can be heavily unbalanced. The absence of complete BSP 2M-tree beyond depth 4 and 6 in Figures 2 (b) & (c) respectively is evidence of the lack of balance in the 2M-tree. This implies possibly more computation and hence lower errors. Under these conditions, the MM-tree with an explicit balance constraint performs comparably to the 2M-tree (slightly outperforming in 3 of the 4 cases) while still maintaining a balanced tree (and hence returning smaller candidate sets on average). 3 A subset of a growth-restricted metric space (S, 2 ) may not be growth-restricted. However, in our case, we are not considering all subsets; we only consider subsets of the form (A ∩ S) where A ⊂ RD is a convex set. So our condition does not imply that all subsets of (S, 2 ) are growth-restricted. 4 We empirically explore the effect of the tree type on the violation of the inlier condition (C4) in Appendix B. The results imply that for any fixed value of η, almost the same number of alternate splits would be invoked for the construction of different types of trees on the same dataset. Moreover, with η ≥ 8, for only one of the datasets would a significant fraction of the partitions in the tree (of any type) need to be the alternate partition. 6 (a) O PTDIGITS (b) T INY I MAGES (c) MNIST (d) I MAGES Figure 2: Performance of BSP-trees with increasing traversal depth. The top row corresponds to quantization performance of existing trees and the bottom row presents the nearest-neighbor error (in terms of mean rank τ of the candidate neighbors (CN)) of Algorithm 1 with these trees. The nearest-neighbor search error graphs are also annotated with the mean distance-error of the CN (please view in color). 4 Large margin BSP-tree We established that the search error depends on the quantization performance and the partition margins of the tree. The MM-tree explicitly maximizes the margin of every partition and empirical results indicate that it has comparable performance to the 2M-tree and PA-tree in terms of the quantization performance. In this section, we establish a theoretical guarantee for the MM-tree quantization performance. The large margin split in the MM-tree is obtained by performing max-margin clustering (MMC) with 2 clusters. The task of MMC is to find the optimal hyperplane (w∗ , b∗ ) from the following optimization problem5 given a set of points S = {x1 , x2 , . . . , xm } ⊂ RD : min w,b,ξi s.t. 1 w 2 m 2 2 ξi +C (5) i=1 | w, xi + b| ≥ 1 − ξi , ξi ≥ 0 ∀i = 1, . . . , m (6) m sgn( w, xi + b) ≤ ωm. −ωm ≤ (7) i=1 MMC finds a soft max-margin split in the data to obtain two clusters separated by a large (soft) margin. The balance constraint (Equation (7)) avoids trivial solutions and enforces an ω-balanced split. The margin constraints (Equation (6)) enforce a robust separation of the data. Given a solution to the MMC, we establish the following quantization error improvement rate for the MM-tree: Theorem 4.1. Given a set of points S ⊂ RD and a region A containing m points, consider an ω-balanced max-margin split (w, b) of the region A into {Al , Ar } with at most αm support vectors and a split margin of size γ = 1/ w . Then the quantization error improvement is given by:  γ 2 (1 − α)2 VS ({Al , Ar }) ≤ 1 − D i=1 1−ω 1+ω λi   VS (A), (8) where λ1 , . . . , λD are the eigenvalues of the covariance matrix of A ∩ S. The result indicates that larger margin sizes (large γ values) and a smaller number of support vectors (small α) implies better quantization performance. Larger ω implies smaller improvement, but ω is √ generally restricted algorithmically in MMC. If γ = O( λ1 ) then this rate matches the best possible quantization performance of the PA-tree (Table 1). We do assume that we have a feasible solution to the MMC problem to prove this result. We use the following result to prove Theorem 4.1: Proposition 4.1. [7, Lemma 15] Give a set S, for any partition {A1 , A2 } of a set A, VS (A) − VS ({A1 , A2 }) = |A1 ∩ S||A2 ∩ S| µ(A1 ) − µ(A2 ) |A ∩ S|2 2 , (9) where µ(A) is the centroid of the points in the region A. 5 This is an equivalent formulation [16] to the original form of max-margin clustering proposed by Xu et al. (2005) [9]. The original formulation also contains the labels yi s and optimizes over it. We consider this form of the problem since it makes our analysis easier to follow. 7 This result [7] implies that the improvement in the quantization error depends on the distance between the centroids of the two regions in the partition. Proof of Theorem 4.1. For a feasible solution (w, b, ξi |i=1,...,m ) to the MMC problem, m m | w, xi + b| ≥ m − ξi . i=1 i=1 Let xi = w, xi +b and mp = |{i : xi > 0}| and mn = |{i : xi ≤ 0}| and µp = ( ˜ ˜ ˜ ˜ and µn = ( i : xi ≤0 xi )/mn . Then mp µp − mn µn ≥ m − i ξi . ˜ ˜ ˜ ˜ ˜ i : xi >0 ˜ xi )/mp ˜ Without loss of generality, we assume that mp ≥ mn . Then the balance constraint (Equation (7)) 2 tells us that mp ≤ m(1 + ω)/2 and mn ≥ m(1 − ω)/2. Then µp − µn + ω(˜p + µn ) ≥ 2 − m i ξi . ˜ ˜ µ ˜ 2 Since µp > 0 and µn ≤ 0, |˜p + µn | ≤ (˜p − µn ). Hence (1 + ω)(˜p − µn ) ≥ 2 − m i ξi . For ˜ µ ˜ µ ˜ µ ˜ an unsupervised split, the data is always separable since there is no misclassification. This implies ∗ that ξi ≤ 1∀i. Hence, µp − µn ≥ ˜ ˜ 2− 2 |{i : ξi > 0}| /(1 + ω) ≥ 2 m 1−α 1+ω , (10) since the term |{i : ξi > 0}| corresponds to the number of support vectors in the solution. Cauchy-Schwartz implies that µ(Al ) − µ(Ar ) ≥ | w, µ(Al ) − µ(Ar ) |/ w = (˜p − µn )γ, µ ˜ since µn = w, µ(Al ) + b and µp = w, µ(Ar ) + b. From Equation (10), we can say ˜ ˜ 2 2 2 that µ(Al ) − µ(Ar ) ≥ 4γ 2 (1 − α) / (1 + ω) . Also, for ω-balanced splits, |Al ||Ar | ≥ (1 − ω 2 )m2 /4. Combining these into Equation (9) from Proposition 4.1, we have VS (A) − VS ({Al , Ar }) ≥ (1 − ω 2 )γ 2 1−α 1+ω 2 = γ 2 (1 − α)2 1−ω 1+ω . (11) Let Cov(A ∩ S) be the covariance matrix of the data contained in region A and λ1 , . . . , λD be the eigenvalues of Cov(A ∩ S). Then, we have: VS (A) = 1 |A ∩ S| D x − µ(A) 2 = tr (Cov(A ∩ S)) = λi . i=1 x∈A∩S Then dividing Equation (11) by VS (A) gives us the statement of the theorem. 5 Conclusions and future directions Our results theoretically verify that BSP-trees with better vector quantization performance and large partition margins do have better search performance guarantees as one would expect. This means that the best BSP-tree for search on a given dataset is the one with the best combination of good quantization performance (low β ∗ in Corollary 3.1) and large partition margins (large γ in Corollary 3.2). The MM-tree and the 2M-tree appear to have the best empirical performance in terms of the search error. This is because the 2M-tree explicitly minimizes β ∗ while the MM-tree explicitly maximizes γ (which also implies smaller β ∗ by Theorem 4.1). Unlike the 2M-tree, the MM-tree explicitly maintains an approximately balanced tree for better worst-case search time guarantees. However, the general dimensional large margin partitions in the MM-tree construction can be quite expensive. But the idea of large margin partitions can be used to enhance any simpler space partition heuristic – for any chosen direction (such as along a coordinate axis or along the principal eigenvector of the data covariance matrix), a one dimensional large margin split of the projections of the points along the chosen direction can be obtained very efficiently for improved search performance. This analysis of search could be useful beyond BSP-trees. Various heuristics have been developed to improve locality-sensitive hashing (LSH) [10]. The plain-vanilla LSH uses random linear projections and random thresholds for the hash-table construction. The data can instead be projected along the top few eigenvectors of the data covariance matrix. This was (empirically) improved upon by learning an orthogonal rotation of the projected data to minimize the quantization error of each bin in the hash-table [17]. A nonlinear hash function can be learned using a restricted Boltzmann machine [18]. If the similarity graph of the data is based on the Euclidean distance, spectral hashing [19] uses a subset of the eigenvectors of the similarity graph Laplacian. Semi-supervised hashing [20] incorporates given pairwise semantic similarity and dissimilarity constraints. The structural SVM framework has also been used to learn hash functions [21]. Similar to the choice of an appropriate BSP-tree for search, the best hashing scheme for any given dataset can be chosen by considering the quantization performance of the hash functions and the margins between the bins in the hash tables. We plan to explore this intuition theoretically and empirically for LSH based search schemes. 8 References [1] J. H. Friedman, J. L. Bentley, and R. A. Finkel. An Algorithm for Finding Best Matches in Logarithmic Expected Time. ACM Transactions in Mathematical Software, 1977. [2] N. Verma, S. Kpotufe, and S. Dasgupta. Which Spatial Partition Trees are Adaptive to Intrinsic Dimension? In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2009. [3] R.F. Sproull. Refinements to Nearest-Neighbor Searching in k-dimensional Trees. Algorithmica, 1991. [4] J. McNames. A Fast Nearest-Neighbor Algorithm based on a Principal Axis Search Tree. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2001. [5] K. Fukunaga and P. M. Nagendra. A Branch-and-Bound Algorithm for Computing k-NearestNeighbors. IEEE Transactions on Computing, 1975. [6] D. Nister and H. Stewenius. Scalable Recognition with a Vocabulary Tree. In IEEE Conference on Computer Vision and Pattern Recognition, 2006. [7] S. Dasgupta and Y. Freund. Random Projection trees and Low Dimensional Manifolds. In Proceedings of ACM Symposium on Theory of Computing, 2008. [8] P. Ram, D. Lee, and A. G. Gray. Nearest-neighbor Search on a Time Budget via Max-Margin Trees. In SIAM International Conference on Data Mining, 2012. [9] L. Xu, J. Neufeld, B. Larson, and D. Schuurmans. Maximum Margin Clustering. Advances in Neural Information Processing Systems, 2005. [10] P. Indyk and R. Motwani. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. In Proceedings of ACM Symposium on Theory of Computing, 1998. [11] T. Liu, A. W. Moore, A. G. Gray, and K. Yang. An Investigation of Practical Approximate Nearest Neighbor Algorithms. Advances in Neural Information Proceedings Systems, 2005. [12] S. Dasgupta and K. Sinha. Randomized Partition Trees for Exact Nearest Neighbor Search. In Proceedings of the Conference on Learning Theory, 2013. [13] J. He, S. Kumar and S. F. Chang. On the Difficulty of Nearest Neighbor Search. In Proceedings of the International Conference on Machine Learning, 2012. [14] Y. Freund, S. Dasgupta, M. Kabra, and N. Verma. Learning the Structure of Manifolds using Random Projections. Advances in Neural Information Processing Systems, 2007. [15] D. R. Karger and M. Ruhl. Finding Nearest Neighbors in Growth-Restricted Metrics. In Proceedings of ACM Symposium on Theory of Computing, 2002. [16] B. Zhao, F. Wang, and C. Zhang. Efficient Maximum Margin Clustering via Cutting Plane Algorithm. In SIAM International Conference on Data Mining, 2008. [17] Y. Gong and S. Lazebnik. Iterative Quantization: A Procrustean Approach to Learning Binary Codes. In IEEE Conference on Computer Vision and Pattern Recognition, 2011. [18] R. Salakhutdinov and G. Hinton. Learning a Nonlinear Embedding by Preserving Class Neighbourhood Structure. In Artificial Intelligence and Statistics, 2007. [19] Y. Weiss, A. Torralba, and R. Fergus. Spectral Hashing. Advances of Neural Information Processing Systems, 2008. [20] J. Wang, S. Kumar, and S. Chang. Semi-Supervised Hashing for Scalable Image Retrieval. In IEEE Conference on Computer Vision and Pattern Recognition, 2010. [21] M. Norouzi and D. J. Fleet. Minimal Loss Hashing for Compact Binary Codes. In Proceedings of the International Conference on Machine Learning, 2011. [22] S. Lloyd. Least Squares Quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982. 9

6 0.57838863 350 nips-2013-Wavelets on Graphs via Deep Learning

7 0.5766651 104 nips-2013-Efficient Online Inference for Bayesian Nonparametric Relational Models

8 0.57604939 232 nips-2013-Online PCA for Contaminated Data

9 0.57562762 233 nips-2013-Online Robust PCA via Stochastic Optimization

10 0.57500136 238 nips-2013-Optimistic Concurrency Control for Distributed Unsupervised Learning

11 0.57429564 182 nips-2013-Manifold-based Similarity Adaptation for Label Propagation

12 0.57401973 204 nips-2013-Multiscale Dictionary Learning for Estimating Conditional Distributions

13 0.57380611 252 nips-2013-Predictive PAC Learning and Process Decompositions

14 0.57363439 333 nips-2013-Trading Computation for Communication: Distributed Stochastic Dual Coordinate Ascent

15 0.57321358 33 nips-2013-An Approximate, Efficient LP Solver for LP Rounding

16 0.57296592 318 nips-2013-Structured Learning via Logistic Regression

17 0.57264602 116 nips-2013-Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA

18 0.57223719 282 nips-2013-Robust Multimodal Graph Matching: Sparse Coding Meets Graph Matching

19 0.57212651 201 nips-2013-Multi-Task Bayesian Optimization

20 0.57203519 285 nips-2013-Robust Transfer Principal Component Analysis with Rank Constraints