nips nips2009 nips2009-144 knowledge-graph by maker-knowledge-mining

144 nips-2009-Lower bounds on minimax rates for nonparametric regression with additive sparsity and smoothness


Source: pdf

Author: Garvesh Raskutti, Bin Yu, Martin J. Wainwright

Abstract: We study minimax rates for estimating high-dimensional nonparametric regression models with sparse additive structure and smoothness constraints. More precisely, our goal is to estimate a function f* : R^p → R that has an additive decomposition of the form f*(X_1, . . . , X_p) = Σ_{j∈S} h_j*(X_j), where each component function h_j* lies in some class H of “smooth” functions, and S ⊂ {1, . . . , p} is an unknown subset with cardinality s = |S|. Given n i.i.d. observations of f*(X) corrupted with additive white Gaussian noise, where the covariate vectors (X_1, X_2, X_3, ..., X_p) are drawn with i.i.d. components from some distribution P, we determine lower bounds on the minimax rate for estimating the regression function with respect to squared-L2(P) error. Our main result is a lower bound on the minimax rate that scales as max{ s log(p/s)/n , s ǫ_n²(H) }. The first term reflects the sample size required for performing subset selection, and is independent of the function class H. The second term s ǫ_n²(H) is an s-dimensional estimation term corresponding to the sample size required for estimating a sum of s univariate functions, each chosen from the function class H. It depends linearly on the sparsity index s but is independent of the global dimension p. As a special case, if H corresponds to functions that are m-times differentiable (an mth-order Sobolev space), then the s-dimensional estimation term takes the form s ǫ_n²(H) ≍ s n^(−2m/(2m+1)). Either of the two terms may be dominant in different regimes, depending on the relation between the sparsity and smoothness of the additive decomposition.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Lower bounds on minimax rates for nonparametric regression with additive sparsity and smoothness Garvesh Raskutti1 , Martin J. [sent-1, score-1.027]

2 Wainwright1,2 , Bin Yu1,2 1 UC Berkeley Department of Statistics 2 UC Berkeley Department of Electrical Engineering and Computer Science Abstract We study minimax rates for estimating high-dimensional nonparametric regression models with sparse additive structure and smoothness constraints. [sent-2, score-0.939]

3 , X_p) = Σ_{j∈S} h_j(X_j), where each component function h_j lies in some class H of “smooth” functions, and S ⊂ {1, . [sent-6, score-0.487]

4 observations of f ∗ (X) corrupted with additive white Gaussian noise where the covariate vectors (X1 , X2 , X3 , . [sent-13, score-0.261]

5 components from some distribution P, we determine lower bounds on the minimax rate for estimating the regression function with respect to squared-L2 (P) error. [sent-19, score-0.76]

6 Our main result is a lower bound on the minimax rate that scales as max{ s log(p/s)/n , s ǫ_n²(H) }. [sent-20, score-0.669]

7 The second term s ǫ_n²(H) is an s-dimensional estimation term corresponding to the sample size required for estimating a sum of s univariate functions, each chosen from the function class H. [sent-22, score-0.555]

8 As a special case, if H corresponds to functions that are m-times differentiable (an mth-order Sobolev space), then the s-dimensional estimation term takes the form s ǫ_n²(H) ≍ s n^(−2m/(2m+1)). [sent-24, score-0.225]

9 Either of the two terms may be dominant in different regimes, depending on the relation between the sparsity and smoothness of the additive decomposition. [sent-25, score-0.403]

10 , x_p) + w, where w ∼ N(0, σ²) is additive observation noise. [sent-37, score-0.335]

11 , x_p) = Σ_{j=1}^p h_j*(x_j) of univariate functions h_j*. [sent-42, score-0.442]

12 A natural sub-class of these models is the class of sparse additive models studied by Ravikumar et al. [sent-43, score-0.272]

13 [12] propose a back-fitting algorithm to recover the component functions h_j and prove consistency both in subset recovery and in the empirical L2(Pn) norm. [sent-54, score-0.44]

14 Of complementary interest to the rates achievable by practical methods are the fundamental limits of estimating sparse additive models, meaning lower bounds that apply to any algorithm. [sent-59, score-0.738]

15 Although such lower bounds are well-known under classical scaling (where p remains fixed independent of n), to the best of our knowledge, lower bounds for minimax rates on sparse additive models have not been determined. [sent-60, score-1.228]

16 In this paper, our main result is to establish a lower bound on the minimax rate in the L2(P) norm that scales as max{ s log(p/s)/n , s ǫ_n²(H) }. [sent-61, score-0.7]

17 The first term s log(p/s)/n is a subset selection term, independent of the univariate function space H in which the additive components lie, that reflects the difficulty of finding the subset S. [sent-62, score-0.737]

18 The second term s ǫ_n²(H) is an s-dimensional estimation term, which depends on the low dimension s but not the ambient dimension p, and reflects the difficulty of estimating the sum of s univariate functions, each drawn from the function class H. [sent-63, score-0.603]

19 Either the subset selection or s-dimensional estimation term dominates, depending on the relative sizes of n, p, and s as well as H. [sent-64, score-0.308]

20 Our analysis is based on information-theoretic techniques centered around the use of metric entropy, mutual information and Fano’s inequality in order to obtain lower bounds. [sent-66, score-0.316]

21 Such techniques are standard in the analysis of non-parametric procedures under classical scaling [5, 2, 17], and have also been used more recently to develop lower bounds for high-dimensional inference problems [16, 11]. [sent-67, score-0.272]

22 Given a base class H of univariate functions with norm ‖·‖_H, consider the class of functions f : R^p → R that have an additive decomposition: F := { f : R^p → R | f(x_1, x_2, . [sent-82, score-0.722]

23 , x_p) = Σ_{j=1}^p h_j(x_j), and ‖h_j‖_H ≤ 1 ∀ j = 1, . [sent-85, score-0.561]

24 The minimax rate of estimation over F0(s) is defined by the quantity min_f̂ max_{f*∈F0(s)} E‖f̂ − f*‖²_{L2(P)}, where the expectation is taken over the noise w and the randomness in the sampling, and f̂ ranges over all (measurable) functions of the observations {(y^(i), X^(i))}_{i=1}^n. [sent-93, score-0.577]

25 The goal of this paper is to determine lower bounds on this minimax rate. [sent-94, score-0.567]

26 Inner products and norms: Given a univariate function h_j ∈ H, we define the usual L2(P) inner product ⟨h_j, h'_j⟩_{L2(P)} := ∫ h_j(x) h'_j(x) dP(x). [sent-96, score-0.718]

27 Without loss of generality (re-centering the functions as needed), we may assume E[h_j(X)] = ∫_R h_j(x) dP(x) = 0 for all h_j ∈ H. [sent-98, score-0.494]

28 , X_p) has independent components, the L2(P) inner product on F has the additive decomposition ⟨f, f'⟩_{L2(P)} = Σ_{j=1}^p ⟨h_j, h'_j⟩_{L2(P)}. [sent-106, score-0.457]

29 Metric entropy for function classes: In this section, we define the notion of metric entropy, which provides a way to measure the relative sizes of different function classes with respect to some metric ρ. [sent-122, score-0.665]

30 More specifically, central to our results is the metric entropy of F0 (s) with respect to the L2 (P) norm. [sent-123, score-0.403]

31 Consider a metric space consisting of a set S and a metric ρ : S × S → R+ . [sent-125, score-0.328]

32 The covering and packing entropies (denoted by log N_ρ(ǫ) and log M_ρ(ǫ), respectively) are simply the logarithms of the covering and packing numbers. [sent-139, score-1.373]

33 It can be shown that for any convex set, the quantities log Nρ (ǫ) and log Mρ (ǫ) are of the same order (within constant factors independent of ǫ). [sent-140, score-0.336]
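
As a concrete, purely illustrative companion to these definitions, the sketch below computes a greedy ǫ-packing of a finite point cloud under the Euclidean metric; the point cloud, the radii and the greedy strategy are our own choices and are not taken from the paper.

```python
import numpy as np

def greedy_packing(points, eps):
    """Greedily build an eps-packing: keep a point only if it lies at distance
    > eps from every previously selected point.  A maximal eps-packing is also
    an eps-cover, which is one way to see N(eps) <= M(eps) <= N(eps/2)."""
    selected = []
    for x in points:
        if all(np.linalg.norm(x - y) > eps for y in selected):
            selected.append(x)
    return selected

rng = np.random.default_rng(0)
cloud = rng.uniform(-1.0, 1.0, size=(2000, 2))   # finite proxy for the square [-1, 1]^2
for eps in (0.5, 0.25, 0.125):
    M = len(greedy_packing(cloud, eps))
    print(f"eps = {eps:5.3f}   packing size ~ {M:4d}   log M ~ {np.log(M):.2f}")
```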

34 In this paper, we are interested in packing (and covering) subsets of the function class F0(s) in the L2(P) metric, and so drop the subscript ρ from here onwards. [sent-141, score-0.335]

35 En route to characterizing the metric entropy of F0(s), we need to understand the metric entropy of the unit balls of our univariate function class H—namely, the sets B_H(1) := {h ∈ H | ‖h‖_H ≤ 1}. [sent-142, score-1.112]

36 The metric entropy (both covering and packing) of many classes of functions is known. [sent-143, score-0.913]

37 We provide some concrete examples here: (i) Consider the class H = {h_β : R → R | h_β(x) = βx} of all univariate linear functions with the norm ‖h_β‖_H = |β|. [sent-144, score-0.399]

38 Then it is known [15] that the metric entropy of BH (1) scales as log M (ǫ; H) ∼ log(1/ǫ). [sent-145, score-0.624]

39 In this case, it is known [15] that the metric entropy scales as log M(ǫ; H) ∼ 1/ǫ. [sent-147, score-0.624]

40 Compared to the previous example of linear models, note that the metric entropy grows much faster as ǫ → 0, indicating that the class of Lipschitz functions is much richer. [sent-148, score-0.548]
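
A rough numerical side-by-side of these scalings (our own sketch; the constant factors are arbitrary and the third column assumes the Sobolev example mentioned below with m = 2):

```python
import numpy as np

# Illustrative only: parametric classes have entropy ~ log(1/eps),
# Lipschitz functions ~ 1/eps, and an m-th order Sobolev ball ~ (1/eps)^(1/m).
for eps in (1e-1, 1e-2, 1e-3):
    parametric = np.log(1.0 / eps)
    lipschitz = 1.0 / eps
    sobolev_m2 = (1.0 / eps) ** 0.5          # m = 2
    print(f"eps = {eps:.0e}   linear ~ {parametric:6.1f}   "
          f"Lipschitz ~ {lipschitz:8.1f}   Sobolev(m=2) ~ {sobolev_m2:7.1f}")
```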

41 Such Sobolev spaces are a particular class of functions whose packing/covering entropy grows at a rate polynomial in 1/ǫ. [sent-154, score-0.491]

42 In our analysis, we require that the metric entropy of B_H(1) satisfy the following technical condition: Assumption 1. [sent-155, score-0.403]

43 Using log M(ǫ; H) to denote the packing entropy of the unit ball B_H(1) in the L2(P)-norm, assume that there exists some α ∈ (0, 1) such that lim_{ǫ→0} log M(αǫ; H) / log M(ǫ; H) > 1. [sent-156, score-0.824]

44 ǫ→0 log M (ǫ; H) lim The condition is required to ensure that log M (cǫ)/ log M (ǫ) can be made arbitrarily small or large uniformly over small ǫ by changing c, so that a bound due to Yang and Barron [17] can be applied. [sent-157, score-0.562]
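
For entropies with polynomial growth, the limit in Assumption 1 equals α^(−1/m) > 1; the small check below (our own illustration, with an added lower-order term so that the convergence is visible) makes this concrete.

```python
import numpy as np

def entropy_with_lower_order(eps, m):
    # Leading term eps^(-1/m) plus a lower-order log term, so the Assumption-1
    # ratio only approaches its limit alpha**(-1/m) as eps -> 0.
    return eps ** (-1.0 / m) + np.log(1.0 / eps)

alpha, m = 0.5, 2.0
for eps in (1e-1, 1e-3, 1e-6):
    ratio = entropy_with_lower_order(alpha * eps, m) / entropy_with_lower_order(eps, m)
    print(f"eps = {eps:.0e}   ratio = {ratio:.4f}   limit = {alpha ** (-1.0 / m):.4f}")
```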

45 It may fail to hold for certain parametric classes, such as the set of linear functions considered in Example (i); however, we can use an alternative technique to derive bounds for the parametric case (see Corollary 2). [sent-159, score-0.356]

46 We begin with a theorem that covers the function class F0 (s) in which the univariate function classes H have metric entropy that satisfies Assumption 1. [sent-161, score-0.758]

47 We state a corollary for the special case of univariate classes H with metric entropy growing polynomially in (1/ǫ), and also a corollary for the special case of sparse linear regression. [sent-162, score-0.963]

48 Suppose that the univariate function class H that underlies F0 (s) satisfies Assumption 1. [sent-167, score-0.306]

49 For the case where H has an entropy that grows to ∞ at a polynomial rate as ǫ → 0—say log M(ǫ; H) = Θ(ǫ^(−1/m)) for some m > 1/2—we can compute the rate for the s-dimensional estimation term explicitly. [sent-173, score-0.743]

50 For the sparse additive model (2) with univariate function space H such that log M(ǫ; H) = Θ(ǫ^(−1/m)), we have min_f̂ max_{f*∈F0(s)} E‖f̂ − f*‖²_{L2(P)} ≥ max{ σ² s log(p/s) / (32n) , C s (σ²/n)^(2m/(2m+1)) }, (8) for some C > 0. [sent-175, score-0.807]
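
A minimal sketch (ours, not the paper's) of evaluating the two terms in a bound of the form (8); the constant C is unspecified in the corollary, so it is set to 1 here purely for illustration.

```python
import numpy as np

def corollary1_style_bound(n, p, s, m, sigma2=1.0, C=1.0):
    """Return (subset term, s-dimensional estimation term, their max) for
    max{ sigma^2 * s * log(p/s) / (32 n),  C * s * (sigma^2 / n)^(2m/(2m+1)) }."""
    subset_term = sigma2 * s * np.log(p / s) / (32.0 * n)
    estimation_term = C * s * (sigma2 / n) ** (2.0 * m / (2.0 * m + 1.0))
    return subset_term, estimation_term, max(subset_term, estimation_term)

for n in (100, 10_000, 1_000_000):
    sub, est, lb = corollary1_style_bound(n=n, p=10_000, s=10, m=2)
    print(f"n = {n:>9}   subset = {sub:.3e}   s-dim est. = {est:.3e}   bound = {lb:.3e}")
```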

51 , functions in the Sobolev space W^m), the minimax rate is n^(−2m/(2m+1)) (for details, see e. [sent-180, score-0.456]

52 Clearly, faster rates are obtained for larger smoothness indices m, and as m → ∞, the rate approaches the parametric rate of n−1 . [sent-183, score-0.395]

53 Since we are estimating over an s-dimensional space (under the assumption of independence), we are effectively estimating s univariate functions, each lying within the function space H. [sent-184, score-0.4]

54 Smoothness versus sparsity: It is worth noting that depending on the relative scalings of s, n and p and the metric entropy of H, it is possible for either the subset selection term or s-dimensional estimation term to dominate the lower bound. [sent-186, score-0.872]

55 In general, if log(p/s)/n = o(ǫ_n²(H)), the s-dimensional estimation term dominates, and vice versa (at the boundary, either term determines the minimax rate). [sent-187, score-0.513]

56 In the case of a univariate function class H with polynomial entropy as in Corollary 1, it can be seen that for n = o((log(p/s))^(2m+1)) the subset selection term dominates, while for n = Ω((log(p/s))^(2m+1)) the s-dimensional estimation term dominates. [sent-188, score-0.977]
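
The claimed crossover can be checked directly (our own sketch, with all constants dropped): the two terms balance when n is of order (log(p/s))^(2m+1).

```python
import numpy as np

# s*log(p/s)/n and s*n^(-2m/(2m+1)) balance at n ~ (log(p/s))^(2m+1):
# below the crossover the subset-selection term dominates, above it the
# s-dimensional estimation term does.
p, s, m = 10_000, 10, 2
n_star = np.log(p / s) ** (2 * m + 1)
for n in (n_star / 100, n_star * 100):
    subset = s * np.log(p / s) / n
    estimation = s * n ** (-2.0 * m / (2.0 * m + 1.0))
    dominant = "subset selection" if subset >= estimation else "s-dimensional estimation"
    print(f"n ~ {n:12.0f}   subset = {subset:.2e}   est. = {estimation:.2e}   dominant: {dominant}")
```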

57 Rates for linear models: Using an alternative proof technique (not the one used in this paper), it is possible [11] to derive the exact minimax rate for estimation in the sparse linear regression model, in which we observe y^(i) = Σ_{j∈S} β_j X_j^(i) + w^(i), (9) for i = 1, 2, . [sent-189, score-0.666]

58 Note that this is a special case of the general model (2) in which H corresponds to the class of univariate linear functions (see Example (i)). [sent-193, score-0.368]

59 For the sparse linear regression model (9), the minimax rate scales as max{ s log(p/s)/n , s/n }. [sent-195, score-0.64]

60 In this case, we see clearly the subset selection term dominates for p → ∞, meaning the subset selection problem is always “harder” (in a statistical sense) than the s-dimensional estimation problem. [sent-196, score-0.532]

61 [1], the rate achieved by ℓ1-regularized methods is s log p / n under suitable conditions on the covariates X. [sent-198, score-0.313]

62 Upper bounds: To show that the lower bounds are tight, matching upper bounds need to be derived. [sent-199, score-0.39]

63 , [5, 2]), which involves constructing an estimator based on a covering set and bounding the covering entropy of F0 (s). [sent-202, score-0.607]

64 While this estimation approach does not lead to an implementable algorithm, it is a simple theoretical device to demonstrate that lower bounds are tight. [sent-203, score-0.384]

65 Comparison to existing bounds: We now provide a brief comparison of the minimax lower bounds with upper bounds on rates achieved by existing implementable algorithms provided by past work [12, 7, 9]. [sent-205, score-0.899]

66 The rates derived in Koltchinskii and Yuan [7] do not match the lower bounds derived in Theorem 1. [sent-208, score-0.422]

67 [9] with our minimax lower bounds since their analysis does not explicitly track the sparsity index s. [sent-211, score-0.634]

68 The proof is based on a combination of information-theoretic techniques and the concepts of packing and covering entropy, as defined previously in Section 2. [sent-214, score-0.442]

69 The basic idea is to carefully choose two subsets T1 and T2 of the function class F0 (s) and lower bound the minimax rates over these two subsets. [sent-217, score-0.713]

70 1, application of the generalized Fano method—a technique based on Fano’s inequality—to the set T1 defined in equation (10) yields a lower bound on the subset selection term. [sent-219, score-0.365]

71 2, we apply an alternative method for obtaining lower bounds over a second set T2 defined in equation (11) that captures the difficulty of estimating the sum of s univariate functions. [sent-221, score-0.554]

72 Before proceeding, we first note that for any T ⊂ F0(s), we have min_f̂ max_{f*∈F0(s)} E‖f̂ − f*‖²_{L2(P)} ≥ min_f̂ max_{f*∈T} E‖f̂ − f*‖²_{L2(P)}. Moreover, for any subsets T1, T2 ⊂ F0(s), we have min_f̂ max_{f*∈F0(s)} E‖f̂ − f*‖²_{L2(P)} ≥ max_{i=1,2} min_f̂ max_{f*∈Ti} E‖f̂ − f*‖²_{L2(P)}. [sent-224, score-0.431]

73 Then the minimax risk over the family is lower bounded as max_j E_j d(θ(P_j), θ̂) ≥ (α_r / 2) (1 − (β_r + log 2) / log r). [sent-238, score-0.622]

74 The proof of Lemma 1 involves applying Fano’s inequality over the discrete set of parameters θ ∈ Θ indexed by the set of distributions M_r. [sent-239, score-0.358]

75 This hypercube construction is often used to prove lower bounds (see Yu [18]). [sent-247, score-0.282]

76 There exists a subset A ⊂ T1 such that: (i) log |A| ≥ (1/2) s log(p/s), (ii) ‖f − f'‖²_{L2(P)} ≥ σ² s log(p/s) / (16n) for all f, f' ∈ A, and (iii) D(f ‖ f') ≤ (1/8) s log(p/s) for all f, f' ∈ A. [sent-250, score-0.237]
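
The quantitative form of Lemma 1 is easy to wrap as a helper and feed with Lemma-2-style quantities (a sketch of our own; the particular values of s, p, n and σ² are arbitrary, and the constants only match the stated bound up to the factors absorbed in the lemma):

```python
import math

def generalized_fano_lower_bound(r, alpha_r, beta_r):
    """Evaluate (alpha_r / 2) * (1 - (beta_r + log 2) / log r): alpha_r is the
    minimum separation between the r hypotheses and beta_r an upper bound on
    their pairwise KL divergences."""
    if r < 2:
        raise ValueError("need at least two hypotheses")
    return 0.5 * alpha_r * (1.0 - (beta_r + math.log(2)) / math.log(r))

# Lemma-2 style choices: log r = (1/2) s log(p/s), separation
# sigma^2 s log(p/s) / (16 n), and KL bound (1/8) s log(p/s); the result is of
# order sigma^2 * s * log(p/s) / n, matching the subset-selection term.
s, p, n, sigma2 = 10, 10_000, 5_000, 1.0
r = math.exp(0.5 * s * math.log(p / s))
bound = generalized_fano_lower_bound(r,
                                     sigma2 * s * math.log(p / s) / (16 * n),
                                     s * math.log(p / s) / 8)
print(f"Fano-type lower bound ~ {bound:.3e}")
```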

77 For s log(p/s) ≥ 8 log 2, applying the Generalized Fano Method (Lemma 1) together with Lemma 2 yields the bound σ² s log(p/s) / (32n): [sent-253, score-0.394]

78 min_f̂ max_{f*∈F0(s)} E‖f̂ − f*‖²_{L2(P)} ≥ min_f̂ max_{f*∈A} E‖f̂ − f*‖²_{L2(P)} ≥ σ² s log(p/s) / (32n). This completes the proof for the subset selection term (s log(p/s)/n) in Theorem 1. [sent-254, score-0.422]

79 Bounding the complexity of s-dimensional estimation: Next we derive a bound for the s-dimensional estimation term by determining a lower bound over T2. [sent-256, score-0.415]

80 , p}, and define the set FS as T2 := FS := { f ∈ F : f(X) = Σ_{j∈S} h_j(X_j) }. [sent-259, score-0.216]

81 We use a technique used in Yang and Barron [17] to lower bound the minimax rate over FS . [sent-261, score-0.583]

82 The idea is to construct a maximal δn -packing set for FS and a minimal ǫn -covering set for FS , and then to apply Fano’s inequality to a carefully chosen mixture distribution involving the covering and packing sets (see the full-length version for details). [sent-262, score-0.481]

83 min_f̂ max_{f*∈FS} E‖f̂ − f*‖²_{L2(P)} ≥ (δ_n² / 4) (1 − [ log N(ǫ_n; FS) + nǫ_n²/(2σ²) + log 2 ] / log M(δ_n; FS) ). Now we have a bound with expressions involving the covering and packing entropies of the s-dimensional space FS. [sent-265, score-0.782]

84 The following lemma bounds log M(ǫ; FS) and log N(ǫ; FS) in terms of the univariate packing and covering entropies, respectively: Lemma 4. [sent-266, score-0.921]

85 Let H be a function space with a packing entropy log M(ǫ; H) that satisfies Assumption 1. [sent-267, score-0.656]

86 Then we have the bounds log M(ǫ; FS) ≥ s log M(ǫ/√s; H), and log N(ǫ; FS) ≤ s log N(ǫ/√s; H). [sent-268, score-0.817]

87 The proof involves constructing (ǫ/√s)-packing and covering sets in each of the s dimensions and showing that these yield ǫ-packing and ǫ-covering sets in FS, respectively. [sent-269, score-0.473]
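
The covering half of Lemma 4 has a simple coordinate-wise picture, which the toy sketch below (ours, with finite grids standing in for the function classes) reproduces: covering each coordinate to within ǫ/√s and rounding coordinate-wise covers the s-dimensional product to within ǫ, since the per-coordinate errors add in quadrature.

```python
import numpy as np

s, eps = 4, 0.2
delta = eps / np.sqrt(s)                        # per-coordinate covering radius
grid = np.arange(0.0, 1.0 + 1e-9, 2 * delta)    # univariate (eps/sqrt(s))-cover of [0, 1]

rng = np.random.default_rng(1)
cloud = rng.uniform(0.0, 1.0, size=(5000, s))
# snap every coordinate to its nearest univariate grid point
snapped = grid[np.argmin(np.abs(cloud[..., None] - grid), axis=-1)]
worst = np.max(np.linalg.norm(cloud - snapped, axis=1))
print(f"univariate cover size {len(grid)}, product cover size {len(grid) ** s}")
print(f"worst-case distance {worst:.3f} <= eps = {eps}")
print(f"log N_s <= s * log N_1 : {np.log(len(grid) ** s):.2f} = {s * np.log(len(grid)):.2f}")
```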

88 Combining Lemmas 3 and 4 leads to the inequality min_f̂ max_{f*∈FS} E‖f̂ − f*‖²_{L2(P)} ≥ (δ_n² / 4) (1 − [ s log N(ǫ_n/√s; H) + nǫ_n²/(2σ²) + log 2 ] / [ s log M(δ_n/√s; H) ] ). (12) [sent-270, score-0.388]

89 Now we choose ǫ_n and δ_n to meet the following constraints: nǫ_n²/(2σ²) ≤ s log N(ǫ_n/√s; H), (13a) and 4 log N(ǫ_n/√s; H) ≤ log M(δ_n/√s; H). (13b) [sent-271, score-0.788]

90 Furthermore, if we define δ̃_n = δ_n/√s, then this inequality can be re-expressed as log M(c δ̃_n; H) ≥ n s δ̃_n² / (2σ²). [sent-273, score-0.22]

91 For nǫ_n²/(2σ²) ≥ log 2, using inequalities (13a) and (13b) together with equation (12) yields the desired rate min_f̂ max_{f*∈FS} E‖f̂ − f*‖²_{L2(P)} ≥ s δ̃_n² / 16, thereby completing the proof. [sent-274, score-0.356]
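
The way the n^(−2m/(2m+1)) rate emerges from this balancing can be checked numerically (a sketch of our own, with all constants and the σ², s factors dropped): solving δ^(−1/m) = n δ² gives δ² ∝ n^(−2m/(2m+1)).

```python
import numpy as np
from scipy.optimize import brentq

m = 2.0

def balance(delta, n):
    # packing entropy ~ delta^(-1/m) set against the "statistical" term n*delta^2
    return delta ** (-1.0 / m) - n * delta ** 2

ns = np.array([1e3, 1e4, 1e5, 1e6])
deltas = np.array([brentq(balance, 1e-8, 1e3, args=(n,)) for n in ns])
slope = np.polyfit(np.log(ns), np.log(deltas ** 2), 1)[0]
print(f"empirical exponent {slope:.3f}  vs  -2m/(2m+1) = {-2 * m / (2 * m + 1):.3f}")
```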

92 Discussion: In this paper, we have derived lower bounds for the minimax risk in squared L2(P) error for estimating sparse additive models based on the sum of univariate functions from a function class H. [sent-275, score-1.327]

93 The rates show that the estimation problem effectively decomposes into a subset selection problem and an s-dimensional estimation problem, and the “harder” of the two problems (in a statistical sense) determines the rate of convergence. [sent-276, score-0.477]

94 More concretely, we demonstrated that the subset selection term scales as s log(p/s)/n, depending linearly on the number of components s and only logarithmically in the ambient dimension p. [sent-277, score-0.375]

95 This subset selection term is independent of the univariate function space H. [sent-278, score-0.462]

96 On the other hand, the s-dimensional estimation term depends on the “richness” of the univariate function class, measured by its metric entropy; it scales linearly with s and is independent of p. [sent-279, score-0.598]

97 Ongoing work suggests that our lower bounds are tight in many cases, meaning that the rates derived in Theorem 1 are minimax optimal for many function classes. [sent-280, score-0.76]

98 It would also be interesting to develop a more complete understanding of whether computationally efficient algorithms [7, 12, 9] based on regularization achieve the lower bounds on the minimax rate derived in this paper. [sent-288, score-0.669]

99 Minimax rates of estimation for high-dimensional linear regression over ℓq -balls. [sent-354, score-0.249]

100 Information-theoretic bounds for sparsity recovery in the high-dimensional and noisy setting. [sent-380, score-0.253]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('fs', 0.356), ('minimax', 0.322), ('univariate', 0.251), ('packing', 0.249), ('fano', 0.241), ('entropy', 0.239), ('hj', 0.216), ('additive', 0.206), ('log', 0.168), ('metric', 0.164), ('covering', 0.15), ('bounds', 0.145), ('xp', 0.129), ('rates', 0.117), ('sobolev', 0.104), ('lower', 0.1), ('ravikumar', 0.088), ('bh', 0.083), ('selection', 0.081), ('xj', 0.075), ('rp', 0.075), ('smoothness', 0.075), ('covariates', 0.073), ('rate', 0.072), ('implementable', 0.07), ('meier', 0.07), ('barron', 0.07), ('estimation', 0.069), ('subset', 0.069), ('sparsity', 0.067), ('corollary', 0.066), ('sparse', 0.066), ('max', 0.064), ('regression', 0.063), ('functions', 0.062), ('koltchinskii', 0.062), ('term', 0.061), ('lemma', 0.061), ('parametric', 0.059), ('estimating', 0.058), ('bound', 0.058), ('ambient', 0.057), ('consequences', 0.056), ('dominates', 0.056), ('covariate', 0.055), ('class', 0.055), ('pj', 0.053), ('scales', 0.053), ('inequality', 0.052), ('min', 0.052), ('mr', 0.05), ('classes', 0.049), ('annals', 0.049), ('yang', 0.048), ('meaning', 0.046), ('lasso', 0.046), ('divergence', 0.043), ('proof', 0.043), ('uc', 0.043), ('cardinality', 0.042), ('entropies', 0.041), ('recovery', 0.041), ('bickel', 0.038), ('bounding', 0.037), ('hypercube', 0.037), ('inner', 0.035), ('polynomial', 0.035), ('cj', 0.034), ('mth', 0.033), ('yuan', 0.033), ('assumption', 0.033), ('risk', 0.032), ('nonparametric', 0.032), ('norm', 0.031), ('involves', 0.031), ('technique', 0.031), ('subsets', 0.031), ('berkeley', 0.031), ('derived', 0.03), ('ects', 0.03), ('carefully', 0.03), ('harder', 0.03), ('lipschitz', 0.028), ('culty', 0.028), ('iii', 0.028), ('grows', 0.028), ('depending', 0.028), ('growing', 0.027), ('estimators', 0.027), ('outline', 0.027), ('classical', 0.027), ('relation', 0.027), ('dimension', 0.026), ('consistency', 0.026), ('generalized', 0.026), ('raskutti', 0.026), ('minf', 0.026), ('maxf', 0.026), ('cam', 0.026), ('espaces', 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 144 nips-2009-Lower bounds on minimax rates for nonparametric regression with additive sparsity and smoothness

Author: Garvesh Raskutti, Bin Yu, Martin J. Wainwright

Abstract: We study minimax rates for estimating high-dimensional nonparametric regression models with sparse additive structure and smoothness constraints. More precisely, our goal is to estimate a function f* : R^p → R that has an additive decomposition of the form f*(X_1, . . . , X_p) = Σ_{j∈S} h_j*(X_j), where each component function h_j* lies in some class H of “smooth” functions, and S ⊂ {1, . . . , p} is an unknown subset with cardinality s = |S|. Given n i.i.d. observations of f*(X) corrupted with additive white Gaussian noise, where the covariate vectors (X_1, X_2, X_3, ..., X_p) are drawn with i.i.d. components from some distribution P, we determine lower bounds on the minimax rate for estimating the regression function with respect to squared-L2(P) error. Our main result is a lower bound on the minimax rate that scales as max{ s log(p/s)/n , s ǫ_n²(H) }. The first term reflects the sample size required for performing subset selection, and is independent of the function class H. The second term s ǫ_n²(H) is an s-dimensional estimation term corresponding to the sample size required for estimating a sum of s univariate functions, each chosen from the function class H. It depends linearly on the sparsity index s but is independent of the global dimension p. As a special case, if H corresponds to functions that are m-times differentiable (an mth-order Sobolev space), then the s-dimensional estimation term takes the form s ǫ_n²(H) ≍ s n^(−2m/(2m+1)). Either of the two terms may be dominant in different regimes, depending on the relation between the sparsity and smoothness of the additive decomposition.

2 0.20996012 20 nips-2009-A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers

Author: Sahand Negahban, Bin Yu, Martin J. Wainwright, Pradeep K. Ravikumar

Abstract: High-dimensional statistical inference deals with models in which the the number of parameters p is comparable to or larger than the sample size n. Since it is usually impossible to obtain consistent procedures unless p/n → 0, a line of recent work has studied models with various types of structure (e.g., sparse vectors; block-structured matrices; low-rank matrices; Markov assumptions). In such settings, a general approach to estimation is to solve a regularized convex program (known as a regularized M -estimator) which combines a loss function (measuring how well the model fits the data) with some regularization function that encourages the assumed structure. The goal of this paper is to provide a unified framework for establishing consistency and convergence rates for such regularized M estimators under high-dimensional scaling. We state one main theorem and show how it can be used to re-derive several existing results, and also to obtain several new results on consistency and convergence rates. Our analysis also identifies two key properties of loss and regularization functions, referred to as restricted strong convexity and decomposability, that ensure the corresponding regularized M -estimators have fast convergence rates. 1

3 0.18852249 116 nips-2009-Information-theoretic lower bounds on the oracle complexity of convex optimization

Author: Alekh Agarwal, Martin J. Wainwright, Peter L. Bartlett, Pradeep K. Ravikumar

Abstract: Despite a large literature on upper bounds on complexity of convex optimization, relatively less attention has been paid to the fundamental hardness of these problems. Given the extensive use of convex optimization in machine learning and statistics, gaining a understanding of these complexity-theoretic issues is important. In this paper, we study the complexity of stochastic convex optimization in an oracle model of computation. We improve upon known results and obtain tight minimax complexity estimates for various function classes. We also discuss implications of these results for the understanding the inherent complexity of large-scale learning and estimation problems. 1

4 0.17237249 47 nips-2009-Boosting with Spatial Regularization

Author: Yongxin Xi, Uri Hasson, Peter J. Ramadge, Zhen J. Xiang

Abstract: By adding a spatial regularization kernel to a standard loss function formulation of the boosting problem, we develop a framework for spatially informed boosting. From this regularized loss framework we derive an efficient boosting algorithm that uses additional weights/priors on the base classifiers. We prove that the proposed algorithm exhibits a “grouping effect”, which encourages the selection of all spatially local, discriminative base classifiers. The algorithm’s primary advantage is in applications where the trained classifier is used to identify the spatial pattern of discriminative information, e.g. the voxel selection problem in fMRI. We demonstrate the algorithm’s performance on various data sets. 1

5 0.1435862 11 nips-2009-A General Projection Property for Distribution Families

Author: Yao-liang Yu, Yuxi Li, Dale Schuurmans, Csaba Szepesvári

Abstract: Surjectivity of linear projections between distribution families with fixed mean and covariance (regardless of dimension) is re-derived by a new proof. We further extend this property to distribution families that respect additional constraints, such as symmetry, unimodality and log-concavity. By combining our results with classic univariate inequalities, we provide new worst-case analyses for natural risk criteria arising in classification, optimization, portfolio selection and Markov decision processes. 1

6 0.13434656 75 nips-2009-Efficient Large-Scale Distributed Training of Conditional Maximum Entropy Models

7 0.12912895 173 nips-2009-Nonparametric Greedy Algorithms for the Sparse Learning Problem

8 0.12693462 245 nips-2009-Thresholding Procedures for High Dimensional Variable Selection and Statistical Estimation

9 0.12340339 202 nips-2009-Regularized Distance Metric Learning:Theory and Algorithm

10 0.11831281 48 nips-2009-Bootstrapping from Game Tree Search

11 0.10412663 225 nips-2009-Sparsistent Learning of Varying-coefficient Models with Structural Changes

12 0.0971426 170 nips-2009-Nonlinear directed acyclic structure learning with weakly additive noise models

13 0.094514333 15 nips-2009-A Rate Distortion Approach for Semi-Supervised Conditional Random Fields

14 0.085083209 161 nips-2009-Nash Equilibria of Static Prediction Games

15 0.08461985 207 nips-2009-Robust Nonparametric Regression with Metric-Space Valued Output

16 0.083794534 223 nips-2009-Sparse Metric Learning via Smooth Optimization

17 0.083584629 91 nips-2009-Fast, smooth and adaptive regression in metric spaces

18 0.08200267 129 nips-2009-Learning a Small Mixture of Trees

19 0.079782687 55 nips-2009-Compressed Least-Squares Regression

20 0.078355238 147 nips-2009-Matrix Completion from Noisy Entries


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.235), (1, 0.187), (2, 0.027), (3, 0.092), (4, 0.017), (5, -0.046), (6, 0.027), (7, -0.15), (8, 0.057), (9, 0.058), (10, -0.044), (11, -0.071), (12, 0.046), (13, 0.046), (14, 0.179), (15, -0.014), (16, 0.07), (17, -0.073), (18, 0.086), (19, 0.063), (20, -0.074), (21, -0.033), (22, -0.117), (23, 0.042), (24, -0.12), (25, -0.078), (26, -0.012), (27, 0.024), (28, -0.113), (29, -0.04), (30, -0.085), (31, 0.049), (32, -0.042), (33, 0.081), (34, -0.015), (35, -0.085), (36, -0.296), (37, 0.112), (38, -0.031), (39, 0.024), (40, -0.057), (41, 0.09), (42, -0.07), (43, 0.053), (44, -0.138), (45, 0.038), (46, -0.101), (47, -0.048), (48, 0.101), (49, 0.108)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96339184 144 nips-2009-Lower bounds on minimax rates for nonparametric regression with additive sparsity and smoothness

Author: Garvesh Raskutti, Bin Yu, Martin J. Wainwright

Abstract: We study minimax rates for estimating high-dimensional nonparametric regression models with sparse additive structure and smoothness constraints. More precisely, our goal is to estimate a function f* : R^p → R that has an additive decomposition of the form f*(X_1, . . . , X_p) = Σ_{j∈S} h_j*(X_j), where each component function h_j* lies in some class H of “smooth” functions, and S ⊂ {1, . . . , p} is an unknown subset with cardinality s = |S|. Given n i.i.d. observations of f*(X) corrupted with additive white Gaussian noise, where the covariate vectors (X_1, X_2, X_3, ..., X_p) are drawn with i.i.d. components from some distribution P, we determine lower bounds on the minimax rate for estimating the regression function with respect to squared-L2(P) error. Our main result is a lower bound on the minimax rate that scales as max{ s log(p/s)/n , s ǫ_n²(H) }. The first term reflects the sample size required for performing subset selection, and is independent of the function class H. The second term s ǫ_n²(H) is an s-dimensional estimation term corresponding to the sample size required for estimating a sum of s univariate functions, each chosen from the function class H. It depends linearly on the sparsity index s but is independent of the global dimension p. As a special case, if H corresponds to functions that are m-times differentiable (an mth-order Sobolev space), then the s-dimensional estimation term takes the form s ǫ_n²(H) ≍ s n^(−2m/(2m+1)). Either of the two terms may be dominant in different regimes, depending on the relation between the sparsity and smoothness of the additive decomposition.

2 0.65620536 11 nips-2009-A General Projection Property for Distribution Families

Author: Yao-liang Yu, Yuxi Li, Dale Schuurmans, Csaba Szepesvári

Abstract: Surjectivity of linear projections between distribution families with fixed mean and covariance (regardless of dimension) is re-derived by a new proof. We further extend this property to distribution families that respect additional constraints, such as symmetry, unimodality and log-concavity. By combining our results with classic univariate inequalities, we provide new worst-case analyses for natural risk criteria arising in classification, optimization, portfolio selection and Markov decision processes. 1

3 0.64965755 173 nips-2009-Nonparametric Greedy Algorithms for the Sparse Learning Problem

Author: Han Liu, Xi Chen

Abstract: This paper studies the forward greedy strategy in sparse nonparametric regression. For additive models, we propose an algorithm called additive forward regression; for general multivariate models, we propose an algorithm called generalized forward regression. Both algorithms simultaneously conduct estimation and variable selection in nonparametric settings for the high dimensional sparse learning problem. Our main emphasis is empirical: on both simulated and real data, these two simple greedy methods can clearly outperform several state-ofthe-art competitors, including LASSO, a nonparametric version of LASSO called the sparse additive model (SpAM) and a recently proposed adaptive parametric forward-backward algorithm called Foba. We also provide some theoretical justifications of specific versions of the additive forward regression. 1

4 0.61899054 20 nips-2009-A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers

Author: Sahand Negahban, Bin Yu, Martin J. Wainwright, Pradeep K. Ravikumar

Abstract: High-dimensional statistical inference deals with models in which the the number of parameters p is comparable to or larger than the sample size n. Since it is usually impossible to obtain consistent procedures unless p/n → 0, a line of recent work has studied models with various types of structure (e.g., sparse vectors; block-structured matrices; low-rank matrices; Markov assumptions). In such settings, a general approach to estimation is to solve a regularized convex program (known as a regularized M -estimator) which combines a loss function (measuring how well the model fits the data) with some regularization function that encourages the assumed structure. The goal of this paper is to provide a unified framework for establishing consistency and convergence rates for such regularized M estimators under high-dimensional scaling. We state one main theorem and show how it can be used to re-derive several existing results, and also to obtain several new results on consistency and convergence rates. Our analysis also identifies two key properties of loss and regularization functions, referred to as restricted strong convexity and decomposability, that ensure the corresponding regularized M -estimators have fast convergence rates. 1

5 0.61200154 116 nips-2009-Information-theoretic lower bounds on the oracle complexity of convex optimization

Author: Alekh Agarwal, Martin J. Wainwright, Peter L. Bartlett, Pradeep K. Ravikumar

Abstract: Despite a large literature on upper bounds on complexity of convex optimization, relatively less attention has been paid to the fundamental hardness of these problems. Given the extensive use of convex optimization in machine learning and statistics, gaining a understanding of these complexity-theoretic issues is important. In this paper, we study the complexity of stochastic convex optimization in an oracle model of computation. We improve upon known results and obtain tight minimax complexity estimates for various function classes. We also discuss implications of these results for the understanding the inherent complexity of large-scale learning and estimation problems. 1

6 0.58667725 245 nips-2009-Thresholding Procedures for High Dimensional Variable Selection and Statistical Estimation

7 0.51713854 48 nips-2009-Bootstrapping from Game Tree Search

8 0.49697381 75 nips-2009-Efficient Large-Scale Distributed Training of Conditional Maximum Entropy Models

9 0.464266 47 nips-2009-Boosting with Spatial Regularization

10 0.46055758 15 nips-2009-A Rate Distortion Approach for Semi-Supervised Conditional Random Fields

11 0.43062449 207 nips-2009-Robust Nonparametric Regression with Metric-Space Valued Output

12 0.43020737 223 nips-2009-Sparse Metric Learning via Smooth Optimization

13 0.42703477 105 nips-2009-Grouped Orthogonal Matching Pursuit for Variable Selection and Prediction

14 0.41606387 185 nips-2009-Orthogonal Matching Pursuit From Noisy Random Measurements: A New Analysis

15 0.41182196 224 nips-2009-Sparse and Locally Constant Gaussian Graphical Models

16 0.39918974 129 nips-2009-Learning a Small Mixture of Trees

17 0.3976135 94 nips-2009-Fast Learning from Non-i.i.d. Observations

18 0.39391369 161 nips-2009-Nash Equilibria of Static Prediction Games

19 0.38529924 7 nips-2009-A Data-Driven Approach to Modeling Choice

20 0.37996599 191 nips-2009-Positive Semidefinite Metric Learning with Boosting


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(24, 0.505), (25, 0.069), (35, 0.03), (36, 0.105), (39, 0.023), (58, 0.092), (61, 0.022), (71, 0.023), (81, 0.011), (86, 0.051)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.98655099 144 nips-2009-Lower bounds on minimax rates for nonparametric regression with additive sparsity and smoothness

Author: Garvesh Raskutti, Bin Yu, Martin J. Wainwright

Abstract: We study minimax rates for estimating high-dimensional nonparametric regression models with sparse additive structure and smoothness constraints. More precisely, our goal is to estimate a function f* : R^p → R that has an additive decomposition of the form f*(X_1, . . . , X_p) = Σ_{j∈S} h_j*(X_j), where each component function h_j* lies in some class H of “smooth” functions, and S ⊂ {1, . . . , p} is an unknown subset with cardinality s = |S|. Given n i.i.d. observations of f*(X) corrupted with additive white Gaussian noise, where the covariate vectors (X_1, X_2, X_3, ..., X_p) are drawn with i.i.d. components from some distribution P, we determine lower bounds on the minimax rate for estimating the regression function with respect to squared-L2(P) error. Our main result is a lower bound on the minimax rate that scales as max{ s log(p/s)/n , s ǫ_n²(H) }. The first term reflects the sample size required for performing subset selection, and is independent of the function class H. The second term s ǫ_n²(H) is an s-dimensional estimation term corresponding to the sample size required for estimating a sum of s univariate functions, each chosen from the function class H. It depends linearly on the sparsity index s but is independent of the global dimension p. As a special case, if H corresponds to functions that are m-times differentiable (an mth-order Sobolev space), then the s-dimensional estimation term takes the form s ǫ_n²(H) ≍ s n^(−2m/(2m+1)). Either of the two terms may be dominant in different regimes, depending on the relation between the sparsity and smoothness of the additive decomposition.

2 0.96938711 181 nips-2009-Online Learning of Assignments

Author: Matthew Streeter, Daniel Golovin, Andreas Krause

Abstract: Which ads should we display in sponsored search in order to maximize our revenue? How should we dynamically rank information sources to maximize the value of the ranking? These applications exhibit strong diminishing returns: Redundancy decreases the marginal utility of each ad or information source. We show that these and other problems can be formalized as repeatedly selecting an assignment of items to positions to maximize a sequence of monotone submodular functions that arrive one by one. We present an efficient algorithm for this general problem and analyze it in the no-regret model. Our algorithm possesses strong theoretical guarantees, such as a performance ratio that converges to the optimal constant of 1 − 1/e. We empirically evaluate our algorithm on two real-world online optimization problems on the web: ad allocation with submodular utilities, and dynamically ranking blogs to detect information cascades. 1

3 0.96048141 240 nips-2009-Sufficient Conditions for Agnostic Active Learnable

Author: Liwei Wang

Abstract: We study pool-based active learning in the presence of noise, i.e. the agnostic setting. Previous works have shown that the effectiveness of agnostic active learning depends on the learning problem and the hypothesis space. Although there are many cases on which active learning is very useful, it is also easy to construct examples that no active learning algorithm can have advantage. In this paper, we propose intuitively reasonable sufficient conditions under which agnostic active learning algorithm is strictly superior to passive supervised learning. We show that under some noise condition, if the Bayesian classification boundary and the underlying distribution are smooth to a finite order, active learning achieves polynomial improvement in the label complexity; if the boundary and the distribution are infinitely smooth, the improvement is exponential.

4 0.93877465 221 nips-2009-Solving Stochastic Games

Author: Liam M. Dermed, Charles L. Isbell

Abstract: Solving multi-agent reinforcement learning problems has proven difficult because of the lack of tractable algorithms. We provide the first approximation algorithm which solves stochastic games with cheap-talk to within absolute error of the optimal game-theoretic solution, in time polynomial in 1/ . Our algorithm extends Murray’s and Gordon’s (2007) modified Bellman equation which determines the set of all possible achievable utilities; this provides us a truly general framework for multi-agent learning. Further, we empirically validate our algorithm and find the computational cost to be orders of magnitude less than what the theory predicts. 1

5 0.92023104 15 nips-2009-A Rate Distortion Approach for Semi-Supervised Conditional Random Fields

Author: Yang Wang, Gholamreza Haffari, Shaojun Wang, Greg Mori

Abstract: We propose a novel information theoretic approach for semi-supervised learning of conditional random fields that defines a training objective to combine the conditional likelihood on labeled data and the mutual information on unlabeled data. In contrast to previous minimum conditional entropy semi-supervised discriminative learning methods, our approach is grounded on a more solid foundation, the rate distortion theory in information theory. We analyze the tractability of the framework for structured prediction and present a convergent variational training algorithm to defy the combinatorial explosion of terms in the sum over label configurations. Our experimental results show the rate distortion approach outperforms standard l2 regularization, minimum conditional entropy regularization as well as maximum conditional entropy regularization on both multi-class classification and sequence labeling problems. 1

6 0.8133834 116 nips-2009-Information-theoretic lower bounds on the oracle complexity of convex optimization

7 0.78721172 45 nips-2009-Beyond Convexity: Online Submodular Minimization

8 0.74149692 161 nips-2009-Nash Equilibria of Static Prediction Games

9 0.72146195 156 nips-2009-Monte Carlo Sampling for Regret Minimization in Extensive Games

10 0.70037919 239 nips-2009-Submodularity Cuts and Applications

11 0.69653755 12 nips-2009-A Generalized Natural Actor-Critic Algorithm

12 0.67836338 122 nips-2009-Label Selection on Graphs

13 0.6778881 94 nips-2009-Fast Learning from Non-i.i.d. Observations

14 0.67759234 91 nips-2009-Fast, smooth and adaptive regression in metric spaces

15 0.67709899 14 nips-2009-A Parameter-free Hedging Algorithm

16 0.66713959 55 nips-2009-Compressed Least-Squares Regression

17 0.66121721 20 nips-2009-A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers

18 0.6598888 178 nips-2009-On Stochastic and Worst-case Models for Investing

19 0.6526019 147 nips-2009-Matrix Completion from Noisy Entries

20 0.63902092 199 nips-2009-Ranking Measures and Loss Functions in Learning to Rank