nips nips2009 nips2009-91 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Samory Kpotufe
Abstract: It was recently shown that certain nonparametric regressors can escape the curse of dimensionality when the intrinsic dimension of data is low ([1, 2]). We prove some stronger results in more general settings. In particular, we consider a regressor which, by combining aspects of both tree-based regression and kernel regression, adapts to intrinsic dimension, operates on general metrics, yields a smooth function, and evaluates in time O(log n). We derive a tight convergence rate of the form n−2/(2+d) where d is the Assouad dimension of the input space. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Fast, smooth and adaptive regression in metric spaces Samory Kpotufe UCSD CSE Abstract It was recently shown that certain nonparametric regressors can escape the curse of dimensionality when the intrinsic dimension of data is low ([1, 2]). [sent-1, score-0.544]
2 In particular, we consider a regressor which, by combining aspects of both tree-based regression and kernel regression, adapts to intrinsic dimension, operates on general metrics, yields a smooth function, and evaluates in time O(log n). [sent-3, score-0.553]
3 We derive a tight convergence rate of the form n−2/(2+d) where d is the Assouad dimension of the input space. [sent-4, score-0.13]
4 1 Introduction Relative to parametric methods, nonparametric regressors require few structural assumptions on the function being learned. [sent-5, score-0.135]
5 This so-called curse of dimensionality is quantified by various lower bounds on the convergence rates of the form n−2/(2+D) for data in RD. [sent-7, score-0.151]
6 One approach (see e.g. [5, 6, 7]) is to embed the data into a lower dimensional space where the regressor might work well. [sent-16, score-0.287]
7 A more recent line of work considers adaptive regressors, i.e. ones that operate in RD but attain convergence rates that depend just on the intrinsic dimension of the data. [sent-19, score-0.241]
8 An initial result [1] shows that for data on a d-dimensional manifold, the asymptotic risk at a point x ∈ RD depends just on d and on the behavior of the distribution in a neighborhood of x. [sent-20, score-0.223]
9 Later, [2] showed that a regressor based on the RPtree of [8] (a hierarchical partitioning procedure) is not only fast to evaluate, but is adaptive to Assouad dimension, a measure which captures notions such as manifold dimension and data sparsity. [sent-21, score-0.5]
10 This notion of dimension (see e.g. [9]) was shown in an earlier work [10] to control the risk of nearest neighbor regression, although adaptivity was not a subject of that result. [sent-24, score-0.263]
11 This work extends the applicability of such adaptivity results to more general uses of nonparametric regression. [sent-25, score-0.081]
12 In particular, we present an adaptive regressor which, unlike RPtree, operates on a general metric space where only distances are provided, and yields a smooth function, an important property in many domains. [sent-26, score-0.414]
13 In addition, our regressor can be evaluated in time just O(log n), unlike kernel or nearest neighbor regression. [sent-29, score-0.419]
14 The evaluation time for these two forms of regression is lower bounded by the number of sample points contributing to the regression estimate. [sent-30, score-0.301]
15 For nearest neighbor regression, this number is given by a parameter kn whose optimal setting (see [12]) is O(n^{2/(2+d)}). [sent-31, score-0.079]
16 For kernel regression, given an optimal bandwidth h ≈ n^{−1/(2+d)} (see [12]), we would expect about nh^d ≈ n^{2/(2+d)} points in the ball B(x, h) around a query point x. [sent-32, score-0.27]
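As a rough numerical illustration (the concrete values here are ours, chosen only to make the gap tangible): for d = 2 and n = 10^6,

    k_n \approx n^{2/(2+d)} = (10^6)^{1/2} = 1000 \ \text{points touched per query for } k\text{-NN or kernel regression},
    \qquad \log_2 n \approx 20 \ \text{for an } O(\log n) \text{ evaluation.}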
17 We note that there exist many heuristics for speeding up kernel regression, which generally combine fast proximity search procedures with other elaborate methods for approximating the kernel weights. [sent-33, score-0.259]
18 There are no rigorous bounds on either the achievable speedup or the risk of the resulting regressor. [sent-36, score-0.14]
19 For regression, each center contributes the average Y value of the data points assigned to it (the points in its cell). [sent-39, score-0.144]
20 Right- Given an r-net and a bandwidth h, a kernel around a query point x weights the Y -contribution of each center to the regression estimate for x. [sent-40, score-0.24]
21 Our regressor integrates aspects of both tree-based regression and kernel regression. [sent-41, score-0.418]
22 It constructs partitions of the input dataset X = {Xi}_{1}^{n}, and uses a kernel to select a few sets within a given partition, each set contributing its average output Y value to the estimate. [sent-42, score-0.123]
23 We show that such a regressor achieves an excess risk of O(n^{−2/(2+d)}), where d is the Assouad dimension of the input data space. [sent-43, score-0.662]
24 This is a tighter convergence rate than the O(n^{−2/(2+O(d log d))}) of RPtree regression (see [2]). [sent-44, score-0.15]
25 Finally, the evaluation time of O(log n) is arrived at by modifying the cover tree proximity search procedure of [16]. [sent-45, score-0.22]
26 We are given n i.i.d. training data (X, Y) = {(Xi, Yi)}_{1}^{n}, where the input variable X belongs to a metric space X where the distance between points is given by the metric ρ, and the output Y belongs to a subset Y of some Euclidean space. [sent-50, score-0.15]
27 Assouad dimension: The Assouad or doubling dimension of X is defined as the smallest d such that any ball can be covered by 2^d balls of half its radius. [sent-52, score-0.168]
28 Examples: A d-dimensional affine subspace of a Euclidean space RD has Assouad dimension O(d) [9]. [sent-53, score-0.098]
29 A d-dimensional submanifold of a Euclidean space RD has Assouad dimension O(d) subject to a bound on its curvature [8]. [sent-54, score-0.124]
30 A d-sparse data set in RD, i.e. one where each data point has at most d non-zero coordinates, has Assouad dimension O(d log D) [8, 2]. [sent-57, score-0.193]
31 The algorithm has no knowledge of the dimension d, nor of ∆Y , although we assume ∆X is known (or can be upper-bounded). [sent-58, score-0.098]
32 Regression function: We assume the regression function f(x) = E[Y | X = x] is λ-Lipschitz, i.e. ‖f(x) − f(x′)‖ ≤ λ ρ(x, x′) for all x, x′ ∈ X. [sent-60, score-0.078]
33 Excess risk: Our performance criterion for a regressor fn(x) is the integrated excess l2 risk: [sent-63, score-0.485]
34 ‖fn − f‖² = E_X ‖fn(X) − f(X)‖² = E_{X,Y} ‖fn(X) − Y‖² − E_{X,Y} ‖f(X) − Y‖².   (1) [sent-64, score-0.183]
35 Algorithm overview: We’ll consider a set of partitions of the data induced by a hierarchy of r-nets of X. [sent-66, score-0.071]
36 Here an r-net Qr is understood to be both an r-cover of X (all points in X are within r of some point in Qr ), and an r-packing (the points in Qr are at least r apart). [sent-67, score-0.16]
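As a concrete (hedged) illustration of the definition, here is a minimal greedy construction of a single r-net; the function and variable names are ours, and the nested hierarchy actually used by the algorithm is built differently (see Section 4.1):

    import numpy as np

    def greedy_r_net(X, r, dist):
        # Scan the points once; a point becomes a new net center exactly when it is
        # more than r away from every center chosen so far.  The chosen centers are
        # therefore pairwise more than r apart (r-packing), and every point ends up
        # within r of some center (r-cover).
        centers = []
        for i in range(len(X)):
            if all(dist(X[i], X[j]) > r for j in centers):
                centers.append(i)
        return centers

    # Usage sketch on synthetic data with the Euclidean metric.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    net = greedy_r_net(X, r=1.0, dist=lambda a, b: float(np.linalg.norm(a - b)))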
37 For now, we’ll consider a class of regressors defined over these nets (as illustrated in Figure 1), and we’ll describe how to select a good regressor out of this class. [sent-69, score-0.454]
38 Let X(q) designate all those points in X whose closest point in Q is q. [sent-73, score-0.184]
39 We set nq = |X(q)|, and Ȳq = (1/nq) Σ_{i: Xi∈X(q)} Yi. [sent-74, score-0.976]
40 Regressors: For each Q = Qr with r ∈ {∆X/2^i}_{i=0}^{I+2}, and given a bandwidth h, we define the following regressor: fn,Q(x) = Σ_{q∈Q} wq(x) Ȳq, where wq(x) = nq (K(x, q, h) + ε) / Σ_{q′∈Q} nq′ (K(x, q′, h) + ε).   (2) [sent-77, score-1.027]
41 The positive constant ε ensures that the estimate remains well defined when K(x, q, h) = 0 for all q ∈ Q. [sent-78, score-0.512]
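A minimal sketch of the estimator in (2). The triangle kernel, the form K(x, q, h) = K(ρ(x, q)/h), scalar outputs, and all names below are our own simplifying choices, not the paper's prescription (the analysis only uses properties such as K(0), K(1/2) > 0):

    import numpy as np

    def net_summaries(X, Y, centers, dist):
        # For each net center q: n_q = |X(q)| and the average output Ybar_q of its cell.
        n_q = np.zeros(len(centers))
        y_sum = np.zeros(len(centers))
        for x, y in zip(X, Y):
            j = min(range(len(centers)), key=lambda k: dist(x, X[centers[k]]))
            n_q[j] += 1
            y_sum[j] += y
        return n_q, y_sum / np.maximum(n_q, 1)

    def f_nQ(x, X, centers, n_q, ybar_q, h, dist, eps=1e-10):
        # Equation (2): weights proportional to n_q * (K(rho(x, q)/h) + eps), then normalized.
        K = lambda u: max(0.0, 1.0 - u)   # triangle kernel: K(0) = 1 > K(1/2) = 0.5 > 0
        w = np.array([nq * (K(dist(x, X[q]) / h) + eps) for q, nq in zip(centers, n_q)])
        return float(np.dot(w / w.sum(), ybar_q))

In practice net_summaries would be computed once per net and reused across query points.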
42 Selecting the final regressor: For fixed n, K(·), and {Qr : r ∈ {∆X/2^i}_{i=0}^{I+2}}, equation (2) above defines a class of regressors parameterized by r ∈ {∆X/2^i}_{i=0}^{I+2} and the bandwidth h. [sent-84, score-0.149]
43 We want to pick a good regressor out of this class. [sent-85, score-0.287]
44 We can reduce the search space right away by noticing that we need r = Θ(h): if r > h then B(x, h) ∩ Qr is empty for most x since the points in Qr are over r apart, and if r < h then B(x, h) ∩ Qr might contain a lot of points, thus increasing evaluation time. [sent-86, score-0.192]
45 As before let I = ⌈log n⌉, and define H = {∆X/2^i}_{i=0}^{I}. [sent-92, score-0.068]
46 For every h ∈ H, pick the r-net Qh/4 and test fn,Qh/4 on the held-out sample (X′, Y′); let the empirical risk (1/n) Σ_{i=1}^{n} ‖fn,Qh/4(Xi′) − Yi′‖² be minimized at h = ho. [sent-93, score-0.168]
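A sketch of this selection step as we read it; the explicit train/validation split, the loop bounds, and the names are our assumptions rather than a transcription of the paper's procedure:

    import numpy as np

    def select_regressor(Xtr, Ytr, Xval, Yval, delta_X, build_net, fit, predict):
        # Try each bandwidth h = delta_X / 2**i, build the (h/4)-net regressor on the
        # training half, and keep the h with smallest empirical risk on the held-out half.
        I = int(np.ceil(np.log2(len(Xtr)))) if len(Xtr) > 1 else 1
        best_h, best_model, best_risk = None, None, np.inf
        for i in range(I + 1):
            h = delta_X / 2.0 ** i
            centers = build_net(Xtr, h / 4.0)          # the net Q_{h/4}
            model = fit(Xtr, Ytr, centers)             # n_q and Ybar_q summaries
            risk = np.mean([(predict(model, x, h) - y) ** 2 for x, y in zip(Xval, Yval)])
            if risk < best_risk:
                best_h, best_model, best_risk = h, model, risk
        return best_h, best_model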
47 Fast evaluation: Each regressor fn,Qh/4 (x) can be estimated quickly on points x by traversing (nested) r-nets as described in detail in section 4. [sent-98, score-0.353]
48 Computational and prediction performance: The cover property ensures that for some h, Qh/4 is a good summary of local information (for prediction performance), while the packing property ensures that few points in Qh/4 fall in B(x, h) (for fast evaluation). [sent-100, score-0.245]
49 Let d be the Assouad dimension of X and let n ≥ max{9, λ∆X/∆Y}. [sent-103, score-0.161]
50 (a) The final regressor selected satisfies E ‖fn,Q_{ho/4} − f‖² ≤ C (λ∆X)^{2d/(2+d)} (∆Y²/n)^{2/(2+d)} + 3∆Y² √( ln(n log n) / n ), where C depends on the Assouad dimension d and on K(0)/K(1/2). [sent-104, score-0.45]
51 We’ll bound the risk for fn,Q for any fixed choice of h, and then show that the final h0 selected yields a good risk. [sent-108, score-0.166]
52 The results in this section only require the fact that Q is a cover of data and thus preserves local information, while the packing property is needed in the next section for fast evaluation. [sent-109, score-0.131]
53 We have the following standard decomposition of the excess risk into variance and bias terms: for all x ∈ X, E_{Y|X} ‖fn,Q(x) − f(x)‖² = E_{Y|X} ‖fn,Q(x) − f̃n,Q(x)‖² + ‖f̃n,Q(x) − f(x)‖², where f̃n,Q(x) denotes the conditional expectation E_{Y|X} fn,Q(x). [sent-114, score-0.277]
54 Remember that for independent random vectors vi with expectation 0, E ‖Σ_i vi‖² = Σ_i E ‖vi‖². [sent-122, score-0.078]
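For completeness, the cross terms vanish precisely because the vi are independent with zero mean:

    \mathbb{E}\Big\|\sum_i v_i\Big\|^2
      = \sum_i \mathbb{E}\|v_i\|^2 + \sum_{i\neq j}\mathbb{E}\langle v_i, v_j\rangle
      = \sum_i \mathbb{E}\|v_i\|^2 + \sum_{i\neq j}\langle \mathbb{E} v_i, \mathbb{E} v_j\rangle
      = \sum_i \mathbb{E}\|v_i\|^2 .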
55 To bound the fraction in (4), namely max_{q∈Q} nq K(x, q, h) / Σ_{q′∈Q} nq′ (K(x, q′, h) + ε), we lower-bound the denominator as: Σ_{q∈Q} nq K(x, q, h) ≥ Σ_{q: ρ(x,q)≤h/2} nq K(x, q, h) ≥ K(1/2) Σ_{q: ρ(x,q)≤h/2} nq ≥ K(1/2) · n µn(B(x, h/4)). [sent-125, score-1.525]
56 The last inequality follows by remarking that, since Q is an h/4-cover of X, the ball B(x, h/4) can only contain points from ∪_{q: ρ(x,q)≤h/2} X(q). [sent-126, score-1.139]
57 We have ‖f̃n,Q(x) − f(x)‖² = ‖Σ_{q∈Q} (wq(x)/nq) Σ_{Xi∈X(q)} (f(Xi) − f(x))‖² ≤ Σ_{q∈Q} (wq(x)/nq) Σ_{Xi∈X(q)} ‖f(Xi) − f(x)‖², where we just applied Jensen’s inequality on the norm square. [sent-133, score-1.495]
58 Σ_{q∈Q} (wq(x)/nq) Σ_{Xi∈X(q)} ‖f(Xi) − f(x)‖² ≤ λ² Σ_{q∈Q} (wq(x)/nq) Σ_{Xi∈X(q)} ρ²(Xi, x). Now whenever µn(B(x, h/4)) > 0, the set B(x, h/2) ∩ Q cannot be empty (remember that Q is an h/4-cover of X). [sent-137, score-1.502]
59 We then have E_{(X,Y)} ‖fn,Q − f‖² ≤ C0 ( ∆Y² / (n · (h/∆X)^d) + 2λ²h² ), where C0 depends on the Assouad dimension d and on K(0)/K(1/2). [sent-141, score-0.123]
60 Applying Fubini’s theorem, the expected excess risk E_{(X,Y)} ‖fn,Q − f‖² can be written as E_X E_{(X,Y)} [ ‖fn,Q(X) − f(X)‖² ( 1{µn(B(X,h/4))>0} + 1{µn(B(X,h/4))=0} ) ]. [sent-143, score-0.137]
61 By lemmas 1 and 2 we have, for X = x fixed, E_{(X,Y)} [ ‖fn,Q(x) − f(x)‖² 1{µn(B(x,h/4))>0} ] ≤ E [ ∆Y² 1{µn(B(x,h/4))>0} / (n µn(B(x, h/4))) ] + 2λ²h² ≤ C1 ( 2∆Y² / (n µ(B(x, h/4))) + 2λ²h² + ∆Y²/n ), where for the last inequality we used the fact that for a binomial b(n, p), E [ 1{b(n,p)>0} / b(n,p) ] ≤ 2/(np) (see lemma 4). [sent-144, score-0.138]
62 Combining (6) and (5), we can then bound the expected excess risk as E ‖fn,Q − f‖² ≤ 3C1 ∆Y² E_X [ 1 / (n µ(B(X, h/4))) ] + 2λ²h² + ∆Y²/n. [sent-147, score-0.303]
63 The final regressor selected satisfies E ‖fn,Q_{ho/4} − f‖² ≤ C (λ∆X)^{2d/(2+d)} (∆Y²/n)^{2/(2+d)} + 3∆Y² √( ln(n log n) / n ), where C depends on the Assouad dimension d and on K(0)/K(1/2). [sent-159, score-0.45]
64 Applying McDiarmid’s inequality to the empirical risk, followed by a union bound over H, we have that, with probability at least 1 − 1/√n over the choice of (X′, Y′), for all h ∈ H: | E ‖fn,Qh/4(X) − Y‖² − (1/n) Σ_{i=1}^{n} ‖fn,Qh/4(Xi′) − Yi′‖² | ≤ ∆Y² √( ln(|H|√n) / n ). [sent-164, score-0.166]
65 4 Fast evaluation In this section we show how to modify the cover-tree procedure of [16] to enable fast evaluation of fn,Qh/4(x). [sent-167, score-0.187]
66 The cover-tree performs proximity search by navigating a hierarchy of nested r-nets of X. [sent-169, score-0.234]
67 They require additional book-keeping to enable range queries of the form X ∩ B(x, h), for a query point x. [sent-171, score-0.132]
68 Note that, for each h and Qh/4 , one could use a generic range search procedure such as [17] with the data in Qh/4 as input, but this requires building a separate data structure for each h, which is expensive. [sent-173, score-0.117]
69 4.1 The hierarchy of nets: Consider an ordering {X(i)}_{1}^{n} of the data points obtained as follows: X(1) and X(2) are the farthest points in X; inductively for 2 < i ≤ n, X(i) is the farthest point in X from {X(1), . . . , X(i−1)}. [sent-176, score-0.362]
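A sketch of this farthest-first ordering (names ours); the recorded insertion distances indicate which prefix of the ordering can serve as the net Qr:

    import numpy as np

    def farthest_first_order(X, dist):
        # Start from the two mutually farthest points, then repeatedly append the point
        # farthest from the current prefix.  radii[k] is the k-th point's distance to the
        # prefix at insertion time; the prefix of points inserted at distance > r is an r-net.
        n = len(X)
        dmax, a, b = max((dist(X[i], X[j]), i, j) for i in range(n) for j in range(i + 1, n))
        order, radii = [a, b], [float("inf"), dmax]
        d_prefix = np.array([min(dist(X[a], X[i]), dist(X[b], X[i])) for i in range(n)])
        for _ in range(2, n):
            k = int(np.argmax(d_prefix))
            order.append(k)
            radii.append(float(d_prefix[k]))
            d_prefix = np.minimum(d_prefix, [dist(X[k], X[i]) for i in range(n)])
        return order, radii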
70 Neighborhood graph: The nodes of the graph are the {X(i)}_{1}^{n}, and the edges are given by the following parent-child relationship: starting at r = ∆X/2, the parent of each node in Qr \ Q2r is the point it is closest to in Q2r. [sent-190, score-0.154]
71 The graph is implemented by maintaining an ordered list of children for each node, where the order is given by the children’s appearance in the sequence {X(i)}_{1}^{n}. [sent-191, score-0.166]
72 They define a parent-child relationship implemented by the neighborhood graph (right), the structure traversed for fast evaluation. [sent-194, score-0.104]
73 These ordered lists of children are used to implement the operation nextChildren, defined iteratively as follows. [sent-195, score-0.139]
74 Given Q ⊂ {X(i)}_{1}^{n}, let visited children denote any child of q ∈ Q that a previous call to nextChildren has already returned. [sent-196, score-0.261]
75 The call nextChildren(Q) returns children of q ∈ Q that have not yet been visited, starting with the unvisited child with lowest index in {X(i)}_{1}^{n}, say X(i), and returning all unvisited children in Qr, the first net containing X(i). [sent-197, score-0.447]
76 The children returned are then marked off as visited. [sent-200, score-0.139]
77 The time complexity of this routine is just the number of children returned. [sent-201, score-0.139]
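A sketch of the book-keeping as we understand it: given the nested nets (nets[l] at radius ∆X/2^l, each a superset of nets[l−1]), record each node's children in order of first appearance; nextChildren then just walks these lists while remembering which children it has already handed out. Names are ours:

    def build_children(X, nets, dist):
        # children[p]: nodes whose parent is p, in order of first appearance (we assume
        # each nets[l] lists its points in the farthest-first order).
        # level_of[q]: the level at which node q first enters the hierarchy.
        children = {q: [] for q in nets[-1]}
        level_of = {q: 0 for q in nets[0]}
        for l in range(1, len(nets)):
            coarser = set(nets[l - 1])
            for q in nets[l]:
                if q not in coarser:
                    parent = min(coarser, key=lambda p: dist(X[q], X[p]))
                    children[parent].append(q)
                    level_of[q] = l
        return children, level_of

An iterator over these lists that skips already-returned children then behaves like nextChildren, and its cost is just the number of children it returns, matching the statement above.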
78 Break loop; if ρ(x, Q′) ≥ h + 2r then // The set Qh/4 ∩ B(x, h) is empty. [sent-208, score-0.077]
79 Q ← ∅; Break loop; Q ← {q ∈ Q′ : ρ(x, q) < ρ(x, Q′) + 2r}; until the nets have been traversed down to Qh/4. [sent-209, score-0.077]
80 The identification of X(i) is done by going down the levels of nested nets, and discarding those points (and their descendants) that are certain to be farther from x than X(i) (we will argue that ρ(x, Q′) + 2r is an upper bound on ρ(x, X(i))). [sent-214, score-0.144]
81 Also, if x is far enough from all points at the current level (second if-clause), we can safely stop early because B(x, h) cannot contain points from Qh/4 (we’ll see that points in Qh/4 are all within 2r of their ancestor at the current level). [sent-215, score-0.239]
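Putting the two rules together, here is our reading of the descent as a plain function over the structures from the previous sketch. To keep it short it prunes at the slightly looser radius h + 2r, so a single pass returns all of Qh/4 ∩ B(x, h), whereas the paper prunes around the closest point and recovers the ball with extra range-search book-keeping; this is a sketch under our assumptions, not the paper's exact procedure:

    def candidates_in_ball(x, h, X, children, level_of, nets, radii, dist):
        # Walk the nested nets from coarse to fine.  At level l (radius r = radii[l]):
        #   * expand the frontier with the children that first appear at this level;
        #   * drop any node farther than h + 2r from x: its descendants stay within 2r
        #     of it, so none of them can land in B(x, h);
        #   * if the frontier empties out, Q_{h/4} puts no point inside B(x, h).
        # Assumes h is one of the dyadic radii, so that h/4 is itself a net radius.
        Q = list(nets[0])
        for l in range(1, len(nets)):
            r = radii[l]
            Q = Q + [c for q in Q for c in children.get(q, []) if level_of.get(c) == l]
            if r <= h / 4.0:                      # reached Q_{h/4}: final range filter
                return [q for q in Q if dist(x, X[q]) <= h]
            Q = [q for q in Q if dist(x, X[q]) <= h + 2.0 * r]
            if not Q:
                return []                          # early stop: B(x, h) misses Q_{h/4}
        return [q for q in Q if dist(x, X[q]) <= h]

The kernel average of equation (2), restricted to the returned candidates, then gives fn,Qh/4(x) up to the ε-terms of the centers outside B(x, h).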
82 The call to procedure evaluate(x, h) correctly evaluates fn,Qh/4(x) and has time complexity C log(∆X/h) + log n, where C is at most 2^{8d}. [sent-217, score-0.18]
83 We first show that the algorithm correctly returns fn,Qαh (x), and we then argue its run time. [sent-219, score-0.074]
84 The procedure works by first finding the closest point to x in Qh/4, say X(i), and then identifying all nodes in R(i),h/4 ∩ B(x, h) = Qh/4 ∩ B(x, h) (see the first if-clause). [sent-221, score-0.094]
85 We just have to show that this closest point X(i) is correctly identified. [sent-222, score-0.137]
86 We’ll argue the following loop invariant I: at the beginning of the loop, X(i) is either in Q ∪ Q′ or is a descendant of a node in Q′. [sent-223, score-0.196]
87 Now let X(j) be the ancestor in Q of X(i), or let it be X(i) itself if it’s in Q. [sent-227, score-0.097]
88 Let X(k) be a child of X(j) that has not yet been visited, or a descendant of such a child. [sent-236, score-0.099]
89 Starting from Q∆X , a different net Qr is reached at every iteration, and the loop stops when we reach past Qh/4 . [sent-242, score-0.147]
90 Therefore the loop is entered at most log(4∆X/h) times. [sent-243, score-0.206]
91 In each iteration, most of the work is done parsing through Q , besides time spent for the range search in the last iteration. [sent-244, score-0.115]
92 So the total runtime is O( log(4∆X/h) · max |Q′| ) + range search time. [sent-245, score-0.154]
93 We just need to bound max |Q ∪ Q′| ≤ max |Q| + max |Q′| and the range search time. [sent-246, score-0.222]
94 The following fact (see [9]) will come in handy: consider r1 and r2 such that r1/r2 is a power of 2, and let B ⊂ X be a ball of radius r1; since X has Assouad dimension d, the smallest r2-cover of B is of size at most (r1/r2)^d, and the largest r2-packing of B is of size at most (r1/r2)^{2d}. [sent-250, score-0.168]
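For concreteness, the cover bound follows by halving the radius log2(r1/r2) times, and the packing bound follows because each ball of an (r2/2)-cover contains at most one point of an r2-packing:

    \mathcal{N}(B, r_2) \;\le\; \big(2^{d}\big)^{\log_2(r_1/r_2)} \;=\; (r_1/r_2)^{d},
    \qquad
    \mathcal{P}(B, r_2) \;\le\; \mathcal{N}(B, r_2/2) \;\le\; (2r_1/r_2)^{d} \;\le\; (r_1/r_2)^{2d}
    \quad (\text{using } r_1/r_2 \ge 2).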
95 This is true for any metric space, and therefore holds for the data set X, which is of Assouad dimension at most d by inclusion in the input space X. [sent-251, score-0.14]
96 Let q ∈ Q; the children of q in Qr are not in Q2r and therefore are all within 2r of q; since these children form an r-packing of B(q, 2r), there are at most 2^{2d} of them. [sent-253, score-0.278]
97 Now ρ (x, Q ) ≤ h + 2r ≤ 4r + 2r since the if-clauses were not entered if we got to the end of the iteration. [sent-257, score-0.089]
98 To finish, the range search around X(i) takes time |R(i),h/4| ≤ 2^{8d}, since R(i),h/4 is an h/4-packing of B(X(i), 2h). [sent-259, score-0.091]
99 Escaping the curse of dimensionality with a tree-based regressor. [sent-273, score-0.086]
100 Rates of convergence of nearest neighbor estimation under arbitrary sampling. [sent-318, score-0.111]
wordName wordTfidf (topN-words)
[('nq', 0.488), ('qr', 0.346), ('assouad', 0.29), ('regressor', 0.287), ('yq', 0.276), ('wq', 0.244), ('risk', 0.14), ('children', 0.139), ('ll', 0.138), ('excess', 0.137), ('nextchildren', 0.111), ('dimension', 0.098), ('regressors', 0.098), ('entered', 0.089), ('rptree', 0.083), ('regression', 0.078), ('loop', 0.077), ('nets', 0.069), ('closest', 0.066), ('points', 0.066), ('proximity', 0.061), ('fn', 0.061), ('lemma', 0.058), ('descendant', 0.055), ('zi', 0.054), ('kernel', 0.053), ('rd', 0.053), ('ln', 0.052), ('bandwidth', 0.051), ('visited', 0.05), ('farthest', 0.048), ('xi', 0.048), ('curse', 0.047), ('nested', 0.047), ('fast', 0.047), ('intrinsic', 0.047), ('range', 0.046), ('search', 0.045), ('cover', 0.045), ('adaptivity', 0.044), ('navigating', 0.044), ('child', 0.044), ('correctly', 0.043), ('nearest', 0.043), ('evaluation', 0.043), ('net', 0.043), ('yi', 0.042), ('metric', 0.042), ('ball', 0.042), ('ancestor', 0.041), ('remember', 0.041), ('unvisited', 0.041), ('manifold', 0.04), ('log', 0.04), ('packing', 0.039), ('dimensionality', 0.039), ('empty', 0.038), ('nonparametric', 0.037), ('hierarchy', 0.037), ('neighbor', 0.036), ('contributing', 0.036), ('max', 0.035), ('partitions', 0.034), ('dasgupta', 0.033), ('nal', 0.033), ('node', 0.033), ('rates', 0.033), ('ho', 0.032), ('convergence', 0.032), ('break', 0.031), ('evaluates', 0.031), ('inequality', 0.031), ('argue', 0.031), ('attain', 0.031), ('neighborhood', 0.03), ('query', 0.03), ('smooth', 0.03), ('thanks', 0.028), ('covered', 0.028), ('runtime', 0.028), ('let', 0.028), ('point', 0.028), ('enable', 0.028), ('adaptive', 0.028), ('non', 0.027), ('reached', 0.027), ('graph', 0.027), ('operates', 0.027), ('iteration', 0.026), ('vi', 0.026), ('procedure', 0.026), ('bound', 0.026), ('depends', 0.025), ('lemmas', 0.025), ('ensures', 0.024), ('last', 0.024), ('designate', 0.024), ('handy', 0.024), ('alamos', 0.024), ('argminh', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 91 nips-2009-Fast, smooth and adaptive regression in metric spaces
Author: Samory Kpotufe
Abstract: It was recently shown that certain nonparametric regressors can escape the curse of dimensionality when the intrinsic dimension of data is low ([1, 2]). We prove some stronger results in more general settings. In particular, we consider a regressor which, by combining aspects of both tree-based regression and kernel regression, adapts to intrinsic dimension, operates on general metrics, yields a smooth function, and evaluates in time O(log n). We derive a tight convergence rate of the form n−2/(2+d) where d is the Assouad dimension of the input space. 1
2 0.12867059 55 nips-2009-Compressed Least-Squares Regression
Author: Odalric Maillard, Rémi Munos
Abstract: We consider the problem of learning, from K data, a regression function in a linear space of high dimension N using projections onto a random subspace of lower dimension M. From any algorithm minimizing the (possibly penalized) empirical risk, we provide bounds on the excess risk of the estimate computed in the projected subspace (compressed domain) in terms of the excess risk of the estimate built in the high-dimensional space (initial domain). We show that solving the problem in the compressed domain instead of the initial domain reduces the estimation error at the price of an increased (but controlled) approximation error. We apply the analysis to Least-Squares (LS) regression and discuss the excess risk and numerical complexity of the resulting “Compressed Least-Squares Regression” (CLSR) in terms of N, K, and M. When we choose M = O(√K), we show that CLSR has an estimation error of order O(log K/√K). 1
3 0.10740061 139 nips-2009-Linear-time Algorithms for Pairwise Statistical Problems
Author: Parikshit Ram, Dongryeol Lee, William March, Alexander G. Gray
Abstract: Several key computational bottlenecks in machine learning involve pairwise distance computations, including all-nearest-neighbors (finding the nearest neighbor(s) for each point, e.g. in manifold learning) and kernel summations (e.g. in kernel density estimation or kernel machines). We consider the general, bichromatic case for these problems, in addition to the scientific problem of N-body simulation. In this paper we show for the first time O(N) worst case runtimes for practical algorithms for these problems based on the cover tree data structure [1]. 1
4 0.083584629 144 nips-2009-Lower bounds on minimax rates for nonparametric regression with additive sparsity and smoothness
Author: Garvesh Raskutti, Bin Yu, Martin J. Wainwright
Abstract: We study minimax rates for estimating high-dimensional nonparametric regression models with sparse additive structure and smoothness constraints. More precisely, our goal is to estimate a function f∗ : Rp → R that has an additive decomposition of the form f∗(X1, . . . , Xp) = Σ_{j∈S} h∗_j(Xj), where each component function h∗_j lies in some class H of “smooth” functions, and S ⊂ {1, . . . , p} is an unknown subset with cardinality s = |S|. Given n i.i.d. observations of f∗(X) corrupted with additive white Gaussian noise where the covariate vectors (X1, X2, X3, ..., Xp) are drawn with i.i.d. components from some distribution P, we determine lower bounds on the minimax rate for estimating the regression function with respect to squared-L2(P) error. Our main result is a lower bound on the minimax rate that scales as max{ s log(p/s)/n , s ǫ²_n(H) }. The first term reflects the sample size required for performing subset selection, and is independent of the function class H. The second term s ǫ²_n(H) is an s-dimensional estimation term corresponding to the sample size required for estimating a sum of s univariate functions, each chosen from the function class H. It depends linearly on the sparsity index s but is independent of the global dimension p. As a special case, if H corresponds to functions that are m-times differentiable (an mth-order Sobolev space), then the s-dimensional estimation term takes the form s ǫ²_n(H) ≍ s n^{−2m/(2m+1)}. Either of the two terms may be dominant in different regimes, depending on the relation between the sparsity and smoothness of the additive decomposition.
5 0.07963378 142 nips-2009-Locality-sensitive binary codes from shift-invariant kernels
Author: Maxim Raginsky, Svetlana Lazebnik
Abstract: This paper addresses the problem of designing binary codes for high-dimensional data such that vectors that are similar in the original space map to similar binary strings. We introduce a simple distribution-free encoding scheme based on random projections, such that the expected Hamming distance between the binary codes of two vectors is related to the value of a shift-invariant kernel (e.g., a Gaussian kernel) between the vectors. We present a full theoretical analysis of the convergence properties of the proposed scheme, and report favorable experimental performance as compared to a recent state-of-the-art method, spectral hashing.
6 0.07938318 98 nips-2009-From PAC-Bayes Bounds to KL Regularization
7 0.072753876 37 nips-2009-Asymptotically Optimal Regularization in Smooth Parametric Models
8 0.072323889 64 nips-2009-Data-driven calibration of linear estimators with minimal penalties
9 0.071198992 69 nips-2009-Discrete MDL Predicts in Total Variation
10 0.061373889 169 nips-2009-Nonlinear Learning using Local Coordinate Coding
11 0.061092079 229 nips-2009-Statistical Analysis of Semi-Supervised Learning: The Limit of Infinite Unlabelled Data
12 0.060726665 214 nips-2009-Semi-supervised Regression using Hessian energy with an application to semi-supervised dimensionality reduction
13 0.059724376 141 nips-2009-Local Rules for Global MAP: When Do They Work ?
14 0.058814287 202 nips-2009-Regularized Distance Metric Learning:Theory and Algorithm
15 0.0574793 116 nips-2009-Information-theoretic lower bounds on the oracle complexity of convex optimization
16 0.056866448 207 nips-2009-Robust Nonparametric Regression with Metric-Space Valued Output
17 0.056634106 140 nips-2009-Linearly constrained Bayesian matrix factorization for blind source separation
18 0.056351092 71 nips-2009-Distribution-Calibrated Hierarchical Classification
19 0.055569794 74 nips-2009-Efficient Bregman Range Search
20 0.054661721 118 nips-2009-Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions
topicId topicWeight
[(0, -0.164), (1, 0.122), (2, -0.02), (3, 0.063), (4, -0.04), (5, -0.027), (6, -0.02), (7, 0.004), (8, 0.001), (9, 0.007), (10, 0.004), (11, 0.001), (12, 0.082), (13, 0.018), (14, 0.11), (15, -0.015), (16, 0.088), (17, 0.063), (18, 0.112), (19, 0.038), (20, 0.027), (21, 0.092), (22, 0.042), (23, -0.045), (24, 0.001), (25, -0.056), (26, 0.008), (27, 0.029), (28, 0.032), (29, -0.008), (30, 0.063), (31, 0.008), (32, 0.014), (33, -0.041), (34, -0.156), (35, -0.028), (36, 0.08), (37, -0.005), (38, 0.019), (39, 0.062), (40, 0.031), (41, -0.005), (42, -0.012), (43, 0.022), (44, -0.025), (45, 0.002), (46, 0.079), (47, 0.021), (48, -0.022), (49, -0.156)]
simIndex simValue paperId paperTitle
same-paper 1 0.92476892 91 nips-2009-Fast, smooth and adaptive regression in metric spaces
Author: Samory Kpotufe
Abstract: It was recently shown that certain nonparametric regressors can escape the curse of dimensionality when the intrinsic dimension of data is low ([1, 2]). We prove some stronger results in more general settings. In particular, we consider a regressor which, by combining aspects of both tree-based regression and kernel regression, adapts to intrinsic dimension, operates on general metrics, yields a smooth function, and evaluates in time O(log n). We derive a tight convergence rate of the form n−2/(2+d) where d is the Assouad dimension of the input space. 1
2 0.60599077 74 nips-2009-Efficient Bregman Range Search
Author: Lawrence Cayton
Abstract: We develop an algorithm for efficient range search when the notion of dissimilarity is given by a Bregman divergence. The range search task is to return all points in a potentially large database that are within some specified distance of a query. It arises in many learning algorithms, such as locally-weighted regression, kernel density estimation, and neighborhood graph-based algorithms, and in tasks such as outlier detection and information retrieval. In metric spaces, efficient range-search-like algorithms based on spatial data structures have been deployed on a variety of statistical tasks. Here we describe an algorithm for range search under an arbitrary Bregman divergence. This broad class of dissimilarity measures includes the relative entropy, Mahalanobis distance, Itakura-Saito divergence, and a variety of matrix divergences. Metric methods cannot be directly applied, since Bregman divergences do not in general satisfy the triangle inequality. We derive geometric properties of Bregman divergences that yield an efficient range search algorithm based on a recently proposed space decomposition for Bregman divergences. (A hedged sketch of a brute-force Bregman range query, for reference only, appears after this list.)
3 0.59399486 37 nips-2009-Asymptotically Optimal Regularization in Smooth Parametric Models
Author: Percy Liang, Guillaume Bouchard, Francis R. Bach, Michael I. Jordan
Abstract: Many types of regularization schemes have been employed in statistical learning, each motivated by some assumption about the problem domain. In this paper, we present a unified asymptotic analysis of smooth regularizers, which allows us to see how the validity of these assumptions impacts the success of a particular regularizer. In addition, our analysis motivates an algorithm for optimizing regularization parameters, which in turn can be analyzed within our framework. We apply our analysis to several examples, including hybrid generative-discriminative learning and multi-task learning. 1
4 0.58394837 207 nips-2009-Robust Nonparametric Regression with Metric-Space Valued Output
Author: Matthias Hein
Abstract: Motivated by recent developments in manifold-valued regression, we propose a family of nonparametric kernel-smoothing estimators with metric-space valued output, including several robust versions. Depending on the choice of the output space and the metric, the estimator reduces to partially well-known procedures for multi-class classification, multivariate regression in Euclidean space, regression with manifold-valued output, and even some cases of structured output learning. In this paper we focus on the case of regression with manifold-valued input and output. We show pointwise and Bayes consistency for all estimators in the family for the case of manifold-valued output, and illustrate the robustness properties of the estimators with experiments. (A rough sketch of this family of estimators appears after this list.)
5 0.57061142 64 nips-2009-Data-driven calibration of linear estimators with minimal penalties
Author: Sylvain Arlot, Francis R. Bach
Abstract: This paper tackles the problem of selecting among several linear estimators in non-parametric regression; this includes model selection for linear regression, the choice of a regularization parameter in kernel ridge regression or spline smoothing, and the choice of a kernel in multiple kernel learning. We propose a new algorithm which first consistently estimates the noise variance, based upon the concept of minimal penalty, which was previously introduced in the context of model selection. Plugging this variance estimate into Mallows' C_L penalty is then proved to yield an algorithm satisfying an oracle inequality. Simulation experiments with kernel ridge regression and multiple kernel learning show that the proposed algorithm often improves significantly on existing calibration procedures such as 10-fold cross-validation or generalized cross-validation. (A hedged sketch of the C_L selection step, given a variance estimate, appears after this list.)
6 0.56887871 55 nips-2009-Compressed Least-Squares Regression
7 0.56401771 238 nips-2009-Submanifold density estimation
8 0.53009069 139 nips-2009-Linear-time Algorithms for Pairwise Statistical Problems
9 0.49155864 198 nips-2009-Rank-Approximate Nearest Neighbor Search: Retaining Meaning and Speed in High Dimensions
10 0.48543051 69 nips-2009-Discrete MDL Predicts in Total Variation
11 0.47588295 94 nips-2009-Fast Learning from Non-i.i.d. Observations
12 0.46973521 240 nips-2009-Sufficient Conditions for Agnostic Active Learnable
13 0.43894783 98 nips-2009-From PAC-Bayes Bounds to KL Regularization
14 0.43331912 71 nips-2009-Distribution-Calibrated Hierarchical Classification
15 0.4172487 142 nips-2009-Locality-sensitive binary codes from shift-invariant kernels
16 0.41252369 124 nips-2009-Lattice Regression
17 0.38060835 214 nips-2009-Semi-supervised Regression using Hessian energy with an application to semi-supervised dimensionality reduction
18 0.38023371 67 nips-2009-Directed Regression
19 0.37990507 116 nips-2009-Information-theoretic lower bounds on the oracle complexity of convex optimization
20 0.37912402 141 nips-2009-Local Rules for Global MAP: When Do They Work?
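Entry 2 in the list above (paper 74, Efficient Bregman Range Search) works with divergences of the form D_F(x, y) = F(x) - F(y) - <grad F(y), x - y> for a strictly convex generator F. For reference only, the sketch below shows the relative-entropy instance of this definition and a brute-force range query; the paper's contribution is a data structure that avoids this linear scan, and the argument order and function names here are illustrative conventions.

```python
import numpy as np

def kl_divergence(x, y):
    """Bregman divergence generated by F(x) = sum_i x_i log x_i (relative entropy).

    D_F(x, y) = F(x) - F(y) - <grad F(y), x - y>
              = sum_i (x_i log(x_i / y_i) - x_i + y_i).
    Assumes strictly positive vectors.
    """
    return float(np.sum(x * np.log(x / y) - x + y))

def bregman_range_search_naive(query, points, radius, div=kl_divergence):
    """Brute-force range query: indices i with div(points[i], query) <= radius.

    Whether the query sits in the first or second argument is a convention;
    Bregman divergences are asymmetric in general.
    """
    return [i for i, p in enumerate(points) if div(p, query) <= radius]
```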
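Entry 4 (paper 207, Robust Nonparametric Regression with Metric-Space Valued Output) describes kernel smoothers whose prediction at x minimizes a kernel-weighted sum of squared output distances, a weighted Frechet mean. The sketch below is a rough illustration of that estimator family under the simplifying assumption that the minimizer is searched over the observed outputs only; dist_in, dist_out, and the Gaussian weighting are placeholders, not the paper's procedure.

```python
import numpy as np

def frechet_kernel_regress(x, X, Y, dist_in, dist_out, h):
    """Kernel smoother with metric-space valued output (rough illustration).

    Weights come from a Gaussian kernel on input distances; the prediction
    minimizes the weighted sum of squared output distances, here searched
    over the observed outputs only as a crude stand-in for a true Frechet
    mean on the output space.
    """
    w = np.array([np.exp(-dist_in(x, xi) ** 2 / (2.0 * h ** 2)) for xi in X])
    costs = [sum(wi * dist_out(q, yi) ** 2 for wi, yi in zip(w, Y)) for q in Y]
    return Y[int(np.argmin(costs))]
```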
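Entry 5 (paper 64) chooses among linear estimators of the form y_hat = A y by plugging an estimated noise variance into Mallows' C_L penalty, (1/n)||y - A y||^2 + (2 sigma^2 / n) tr(A). Assuming the variance estimate is already in hand (the minimal-penalty step that produces it is not reproduced), a minimal selection step might look like the sketch below; the dictionary-of-smoothing-matrices interface is an assumption for illustration.

```python
import numpy as np

def select_by_mallows_cl(y, smoothers, sigma2_hat):
    """Choose a linear estimator y_hat = A @ y by Mallows' C_L.

    `smoothers` maps each candidate parameter to its n x n smoothing matrix A;
    `sigma2_hat` is a noise-variance estimate (obtained in the paper from a
    minimal-penalty argument, not reproduced here).
    """
    n = len(y)

    def cl_score(A):
        resid = y - A @ y
        return resid @ resid / n + 2.0 * sigma2_hat * np.trace(A) / n

    return min(smoothers, key=lambda k: cl_score(smoothers[k]))
```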
topicId topicWeight
[(7, 0.011), (24, 0.093), (25, 0.091), (35, 0.06), (36, 0.097), (39, 0.057), (58, 0.064), (61, 0.055), (71, 0.027), (81, 0.02), (86, 0.056), (91, 0.022), (95, 0.262)]
simIndex simValue paperId paperTitle
same-paper 1 0.80611062 91 nips-2009-Fast, smooth and adaptive regression in metric spaces
Author: Samory Kpotufe
Abstract: It was recently shown that certain nonparametric regressors can escape the curse of dimensionality when the intrinsic dimension of data is low ([1, 2]). We prove some stronger results in more general settings. In particular, we consider a regressor which, by combining aspects of both tree-based regression and kernel regression, adapts to intrinsic dimension, operates on general metrics, yields a smooth function, and evaluates in time O(log n). We derive a tight convergence rate of the form n−2/(2+d) where d is the Assouad dimension of the input space. 1
2 0.7314738 239 nips-2009-Submodularity Cuts and Applications
Author: Yoshinobu Kawahara, Kiyohito Nagano, Koji Tsuda, Jeff A. Bilmes
Abstract: Several key problems in machine learning, such as feature selection and active learning, can be formulated as submodular set function maximization. We present herein a novel algorithm for maximizing a submodular set function under a cardinality constraint: the algorithm is based on a cutting-plane method and is implemented as an iterative small-scale binary-integer linear programming procedure. It is well known that this problem is NP-hard, and the approximation factor achieved by the greedy algorithm is the theoretical limit for polynomial time. As for (non-polynomial-time) exact algorithms that perform reasonably in practice, there has been very little in the literature, although the problem is quite important for many applications. Our algorithm is guaranteed to find the exact solution in finitely many iterations, and it converges fast in practice due to the efficiency of the cutting-plane mechanism. Moreover, we provide a method that produces successively decreasing upper bounds on the optimal value, while our algorithm provides successively increasing lower bounds. Thus, the accuracy of the current solution can be estimated at any point, and the algorithm can be stopped early once a desired degree of tolerance is met. We evaluate our algorithm on sensor placement and feature selection applications, showing good performance. (A sketch of the greedy baseline mentioned here appears after this list.)
3 0.57914436 122 nips-2009-Label Selection on Graphs
Author: Andrew Guillory, Jeff A. Bilmes
Abstract: We investigate methods for selecting sets of labeled vertices for use in predicting the labels of vertices on a graph. We specifically study methods which choose a single batch of labeled vertices (i.e., offline, non-sequential methods). In this setting, we find that common graph smoothness assumptions directly motivate simple label selection methods with interesting theoretical guarantees. These methods bound prediction error in terms of the smoothness of the true labels with respect to the graph. Some of these bounds give new motivations for previously proposed algorithms, and some suggest new algorithms, which we evaluate. We show improved performance over baseline methods on several real-world data sets.
4 0.57891977 37 nips-2009-Asymptotically Optimal Regularization in Smooth Parametric Models
Author: Percy Liang, Guillaume Bouchard, Francis R. Bach, Michael I. Jordan
Abstract: Many types of regularization schemes have been employed in statistical learning, each motivated by some assumption about the problem domain. In this paper, we present a unified asymptotic analysis of smooth regularizers, which allows us to see how the validity of these assumptions impacts the success of a particular regularizer. In addition, our analysis motivates an algorithm for optimizing regularization parameters, which in turn can be analyzed within our framework. We apply our analysis to several examples, including hybrid generative-discriminative learning and multi-task learning. 1
5 0.57590723 169 nips-2009-Nonlinear Learning using Local Coordinate Coding
Author: Kai Yu, Tong Zhang, Yihong Gong
Abstract: This paper introduces a new method for semi-supervised learning on high-dimensional nonlinear manifolds, which includes a phase of unsupervised basis learning and a phase of supervised function learning. The learned bases provide a set of anchor points to form a local coordinate system, such that each data point x on the manifold can be locally approximated by a linear combination of its nearby anchor points, and the linear weights become its local coordinate coding. We show that a high-dimensional nonlinear function can be approximated by a global linear function with respect to this coding scheme, and that the approximation quality is ensured by the locality of such coding. The method turns a difficult nonlinear learning problem into a simple global linear learning problem, which overcomes some drawbacks of traditional local learning methods. (A hedged sketch of the coding step appears after this list.)
6 0.57476681 207 nips-2009-Robust Nonparametric Regression with Metric-Space Valued Output
7 0.57420266 225 nips-2009-Sparsistent Learning of Varying-coefficient Models with Structural Changes
8 0.57378531 33 nips-2009-Analysis of SVM with Indefinite Kernels
10 0.57157618 173 nips-2009-Nonparametric Greedy Algorithms for the Sparse Learning Problem
11 0.57112014 116 nips-2009-Information-theoretic lower bounds on the oracle complexity of convex optimization
12 0.57047164 59 nips-2009-Construction of Nonparametric Bayesian Models from Parametric Bayes Equations
13 0.57027036 78 nips-2009-Efficient Moments-based Permutation Tests
14 0.57020897 12 nips-2009-A Generalized Natural Actor-Critic Algorithm
15 0.56955242 174 nips-2009-Nonparametric Latent Feature Models for Link Prediction
16 0.56892776 71 nips-2009-Distribution-Calibrated Hierarchical Classification
17 0.56828129 113 nips-2009-Improving Existing Fault Recovery Policies
18 0.56758559 94 nips-2009-Fast Learning from Non-i.i.d. Observations
19 0.56743151 250 nips-2009-Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference
20 0.56681633 97 nips-2009-Free energy score space
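Entry 2 in the list above (paper 239, Submodularity Cuts) contrasts its exact cutting-plane method with the classical greedy algorithm, whose approximation factor of (1 - 1/e) is the polynomial-time limit the abstract refers to. For orientation only, here is a sketch of that greedy baseline, not of the paper's method; f is assumed to be a monotone submodular set function evaluated as a black box.

```python
def greedy_submodular_max(f, ground_set, k):
    """Classical greedy baseline for max f(S) subject to |S| <= k.

    For monotone submodular f this achieves the (1 - 1/e) factor; it is the
    benchmark mentioned in the abstract, not the exact cutting-plane method.
    """
    S = set()
    for _ in range(min(k, len(ground_set))):
        candidates = [x for x in ground_set if x not in S]
        if not candidates:
            break
        S.add(max(candidates, key=lambda x: f(S | {x}) - f(S)))
    return S
```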
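Entry 5 (paper 169, Nonlinear Learning using Local Coordinate Coding) codes each point by the weights of a local affine combination of nearby anchor points. The sketch below shows only a simplified coding step, an LLE-style constrained least-squares fit over the k nearest anchors; the anchor-learning phase and the paper's exact regularized objective are omitted, and k and reg are illustrative parameters.

```python
import numpy as np

def local_coordinate_code(x, anchors, k=5, reg=1e-6):
    """Code x as affine weights over its k nearest anchor points.

    Solves min_w ||x - sum_j w_j c_j||^2 subject to sum_j w_j = 1, restricted
    to the k anchors closest to x (closed form via the local Gram matrix).
    """
    k = min(k, len(anchors))
    d2 = np.sum((anchors - x) ** 2, axis=1)   # squared distances to anchors
    idx = np.argsort(d2)[:k]                  # k nearest anchors
    Z = anchors[idx] - x                      # neighbours shifted to the query
    G = Z @ Z.T + reg * np.eye(k)             # regularised local Gram matrix
    w = np.linalg.solve(G, np.ones(k))
    w /= w.sum()                              # enforce the affine constraint
    code = np.zeros(len(anchors))
    code[idx] = w                             # sparse code over all anchors
    return code
```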