nips nips2009 nips2009-37 knowledge-graph by maker-knowledge-mining

37 nips-2009-Asymptotically Optimal Regularization in Smooth Parametric Models


Source: pdf

Author: Percy Liang, Guillaume Bouchard, Francis R. Bach, Michael I. Jordan

Abstract: Many types of regularization schemes have been employed in statistical learning, each motivated by some assumption about the problem domain. In this paper, we present a unified asymptotic analysis of smooth regularizers, which allows us to see how the validity of these assumptions impacts the success of a particular regularizer. In addition, our analysis motivates an algorithm for optimizing regularization parameters, which in turn can be analyzed within our framework. We apply our analysis to several examples, including hybrid generative-discriminative learning and multi-task learning. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 edu Abstract Many types of regularization schemes have been employed in statistical learning, each motivated by some assumption about the problem domain. [sent-11, score-0.144]

2 In this paper, we present a unified asymptotic analysis of smooth regularizers, which allows us to see how the validity of these assumptions impacts the success of a particular regularizer. [sent-12, score-0.138]

3 In addition, our analysis motivates an algorithm for optimizing regularization parameters, which in turn can be analyzed within our framework. [sent-13, score-0.17]

4 Although empirical risk minimization has favorable limiting properties, it is well known that this procedure can overfit on finite data. [sent-16, score-0.359]

5 Hence, various forms of regularization have been employed to control this overfitting. [sent-17, score-0.144]

6 For example, in classification, we might use L2 regularization if we expect the data to be separable with a large margin. [sent-19, score-0.162]

7 In this paper, we address these two points by developing an asymptotic analysis of smooth regularizers for parametric problems. [sent-23, score-0.223]

8 The key idea is to derive a second-order Taylor approximation of the expected risk, yielding a simple and interpretable quadratic form which can be directly minimized with respect to the regularization parameters. [sent-24, score-0.214]

9 Our goal is to minimize the expected risk: $\theta_\infty \overset{\text{def}}{=} \operatorname{argmin}_{\theta \in \mathbb{R}^d} L(\theta)$, where $L(\theta) \overset{\text{def}}{=} \mathbb{E}_{Z \sim p^*}[\ell(Z; \theta)]$ (1), which averages the loss over some true data-generating distribution $p^*(Z)$. [sent-43, score-0.85]

10 The standard unregularized estimator minimizes the empirical risk: $\hat\theta_n^0 \overset{\text{def}}{=} \operatorname{argmin}_{\theta \in \mathbb{R}^d} L_n(\theta)$, where $L_n(\theta) \overset{\text{def}}{=} \frac{1}{n}\sum_{i=1}^n \ell(Z_i; \theta)$ (2). [sent-51, score-0.976]

11 Although $\hat\theta_n^0$ is consistent (that is, it converges in probability to $\theta_\infty$) under relatively weak conditions, it is well known that regularization can improve performance substantially for finite $n$. [sent-52, score-0.144]

12 Let $R_n(\lambda, \theta)$ be a (possibly data-dependent) regularization function, where $\lambda \in \mathbb{R}^b$ are the regularization parameters. [sent-53, score-0.144]

13 For linear regression, we might use squared regularization ($R_n(\lambda, \theta) = \frac{\lambda}{2n}\|\theta\|^2$), where $\lambda \in \mathbb{R}$ determines the strength. [sent-54, score-0.189]

14 Define the regularized estimator as follows: $\hat\theta_n^\lambda \overset{\text{def}}{=} \operatorname{argmin}_\theta \, L_n(\theta) + R_n(\lambda, \theta)$ (3). [sent-55, score-0.564]
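
To make the squared-regularization example concrete, here is a minimal sketch (not from the paper) of the regularized estimator for linear regression with $R_n(\lambda,\theta) = \frac{\lambda}{2n}\|\theta\|^2$ and squared loss $L_n(\theta) = \frac{1}{2n}\|y - X\theta\|^2$; the minimizer has the familiar ridge closed form. All variable names are illustrative, and the design matrix is assumed to have full column rank.

```python
import numpy as np

def fit_regularized(X, y, lam):
    """Minimize L_n(theta) + R_n(lam, theta) for squared loss and squared
    regularization; the closed form is (X'X + lam I)^{-1} X'y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def fit_unregularized(X, y):
    # lam = 0 recovers the empirical risk minimizer (ordinary least squares).
    return fit_regularized(X, y, lam=0.0)
```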

15 Specifically, we wish to minimize the relative risk: $L_n(\lambda) \overset{\text{def}}{=} \mathbb{E}_{Z_1,\ldots,Z_n \sim p^*}\big[L(\hat\theta_n^\lambda) - L(\hat\theta_n^0)\big]$ (4). [sent-57, score-0.413]

16 This is the difference in risk (averaged over the training data) between the regularized and unregularized estimators; $L_n(\lambda) < 0$ is desirable. [sent-60, score-0.409]

17 Clearly, argminλ Ln (λ) is the optimal regularization parameter. [sent-61, score-0.144]
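
As a rough illustration of what the relative risk in (4) measures, one can approximate $L_n(\lambda)$ by simulation when $p^*$ is known and pick the grid value of $\lambda$ that minimizes the estimate. This is a hedged sketch with synthetic Gaussian data and the ridge estimator from the previous snippet; it is not the paper's algorithm, which instead optimizes the asymptotic expansion discussed below.

```python
import numpy as np

rng = np.random.default_rng(0)

def expected_risk(theta, theta_true, sigma=1.0):
    # For y = x'theta_true + noise with E[xx'] = I, the prediction risk is
    # ||theta - theta_true||^2 plus the irreducible noise term sigma^2.
    return np.sum((theta - theta_true) ** 2) + sigma**2

def relative_risk(lam, theta_true, n=30, d=10, trials=500):
    diffs = []
    for _ in range(trials):
        X = rng.standard_normal((n, d))
        y = X @ theta_true + rng.standard_normal(n)
        th_lam = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
        th_0 = np.linalg.solve(X.T @ X, X.T @ y)
        diffs.append(expected_risk(th_lam, theta_true) - expected_risk(th_0, theta_true))
    return np.mean(diffs)  # Monte Carlo estimate of L_n(lambda)

theta_true = np.full(10, 0.3)
lams = [0.0, 0.5, 1.0, 2.0, 4.0, 8.0]
best_lam = min(lams, key=lambda l: relative_risk(l, theta_true))
```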

18 Therefore, the main focus of this work is on deriving an asymptotic expansion for Ln (λ). [sent-63, score-0.131]

19 The regularizer $R_n(\lambda, \theta)$ is thrice-differentiable with respect to $\theta$ and differentiable with respect to $\lambda$. [sent-70, score-0.318]

20 2 Rate of regularization strength: Let us establish some basic properties that the regularizer $R_n(\lambda, \theta)$ should satisfy. [sent-73, score-0.479]

21 That is, convergence to the parameters that achieve the minimum possible risk in our hypothesis class. [sent-76, score-0.337]

22 To achieve this, it suffices (and in general also necessitates) that (1) the loss class satisfies standard uniform convergence properties [22] and (2) the regularizer has a vanishing impact in the limit of infinite data ($R_n(\lambda, \theta) \xrightarrow{P} 0$). [sent-77, score-0.365]

23 As we show in [16], $R_n(\lambda, \theta) = O_p(n^{-1})$ is the rate that minimizes the relative risk $L_n$. [sent-80, score-0.41]

24 With this rate, it is natural to consider the regularizer as a prior $p(\theta \mid \lambda) \propto \exp\{-R_n(\lambda, \theta)\}$ (and $-\ell(z; \theta)$ as the log-likelihood), in which case $\hat\theta_n^\lambda$ is the maximum a posteriori (MAP) estimate. [sent-81, score-0.298]

25 3 Asymptotic expansion: Our main result is the following theorem, which provides a simple interpretable asymptotic expression for the relative risk, characterizing the impact of regularization (see [16] for proof): Theorem 1. [sent-87, score-0.396]

26 The most important equation of this paper is (6), which captures the lowest-order terms of the relative risk defined in (4). [sent-90, score-0.41]

27 increases the risk by an amount which depends on how “wrong” the regularizer is. [sent-92, score-0.656]

28 Variance reduction provided by the regularizer, $\operatorname{tr}\{\ddot I\, \ddot L^{-1} \ddot R(\lambda)\, \ddot L^{-1}\}$: the key quantity is $\ddot R(\lambda)$, the Hessian of the regularizer, whose impact on the relative risk is channeled through $\ddot L^{-1}$ and $\ddot I$. [sent-93, score-0.754]

29 The unregularized estimator errs in direction $B$; we can reduce the risk if the regularizer bias $\dot R(\lambda)$ helps correct for the estimator bias ($B^\top \dot R(\lambda) > 0$). [sent-97, score-1.019]

30 The second part carries the same intuition: the risk is reduced when the random regularizer compensates for the loss ($\operatorname{tr}\{I_r(\lambda)\, \ddot L^{-1}\} < 0$). [sent-98, score-0.676]

31 4 Oracle regularizer: The principal advantage of having a simple expression for $L(\lambda)$ is that we can minimize it with respect to $\lambda$. [sent-100, score-0.638]

32 Let $\lambda^* \overset{\text{def}}{=} \operatorname{argmin}_\lambda L(\lambda)$ and call $\hat\theta_n^{\lambda^*}$ the oracle estimator. [sent-101, score-0.138]

33 We have a closed form for $\lambda^*$ in the important special case that the regularization parameter $\lambda$ is the strength of the regularizer: Corollary 1 (Oracle regularization strength). [sent-102, score-0.325]

34 If $R_n(\lambda, \theta) = \frac{\lambda}{n} r(\theta)$ for some $r(\theta)$, then $\lambda^* = \operatorname{argmin}_\lambda L(\lambda) = \dfrac{\operatorname{tr}\{\ddot I\, \ddot L^{-1} \ddot r\, \ddot L^{-1}\} + 2 B^\top \dot r}{\dot r^\top \ddot L^{-1} \dot r} \overset{\text{def}}{=} \dfrac{C_1}{C_2}$, with $L(\lambda^*) = -\dfrac{C_1^2}{2 C_2}$. [sent-103, score-0.448]
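
A minimal numerical sketch of this corollary, assuming one already has (estimates of) the asymptotic quantities as matrices and vectors; the argument names are stand-ins for the paper's $\ddot I$, $\ddot L$, $\ddot r$, $\dot r$, and $B$, and the formula transcribed is the reconstruction above.

```python
import numpy as np

def oracle_strength(I_ddot, L_ddot, r_ddot, r_dot, B):
    """Oracle regularization strength lambda* = C1 / C2 and its relative risk
    -C1^2 / (2 C2), for the family R_n(lam, theta) = (lam / n) r(theta)."""
    L_inv = np.linalg.inv(L_ddot)
    C1 = np.trace(I_ddot @ L_inv @ r_ddot @ L_inv) + 2.0 * B @ r_dot
    C2 = r_dot @ L_inv @ r_dot
    return C1 / C2, -C1**2 / (2.0 * C2)
```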

35 Nevertheless, the oracle regularizer provides an upper bound on performance and some insight into the relevant quantities that make a regularizer useful. [sent-109, score-0.753]

36 If $\frac{\partial L(\lambda)}{\partial \lambda}\big|_{\lambda = 0} = -C_1 < 0$, then (positive) regularization helps. 5 Plugin regularizer: While the oracle regularizer $R_n(\lambda^*, \theta)$ given by (7) is asymptotically optimal, $\lambda^*$ depends on the unknown $\theta_\infty$, so $\hat\theta_n^{\lambda^*}$ is actually not implementable. [sent-115, score-0.912]

37 In this section, we develop the plugin regularizer as a way to avoid this dependence. [sent-116, score-0.858]

38 The key idea is to substitute $\lambda^*$ with an estimate $\hat\lambda_n \overset{\text{def}}{=} \lambda^* + \varepsilon_n$, where $\varepsilon_n = O_p(n^{-\frac{1}{2}})$. [sent-117, score-0.34]

39 We then use the plugin estimator $\hat\theta_n^{\hat\lambda_n} = \operatorname{argmin}_\theta \, L_n(\theta) + R_n(\hat\lambda_n, \theta)$. [sent-118, score-0.444]
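
The plugin procedure is a two-stage recipe; a hedged sketch is below. The helper names (`fit_regularized`, `estimate_lambda_star`) are placeholders for whatever model-specific routines implement the regularized fit and the oracle formula; they are not functions from the paper.

```python
def plugin_fit(data, fit_regularized, estimate_lambda_star):
    """Two-stage plugin estimator:
    1) fit with lambda = 0 (the unregularized estimator),
    2) evaluate the oracle formula at that estimate to get lambda_hat,
    3) refit with lambda_hat."""
    theta0 = fit_regularized(data, lam=0.0)
    lam_hat = estimate_lambda_star(data, theta0)
    return fit_regularized(data, lam=lam_hat)
```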

40 How well does this plugin estimator work, that is, what is its relative risk $\mathbb{E}[L(\hat\theta_n^{\hat\lambda_n}) - L(\hat\theta_n^0)]$? [sent-119, score-0.746]

41 However, we can still leverage existing machinery by defining a new plugin regularizer $R_n^\bullet(\lambda^\bullet, \theta) \overset{\text{def}}{=} \lambda^\bullet R_n(\hat\lambda_n, \theta)$ with regularization parameter $\lambda^\bullet \in \mathbb{R}$. [sent-121, score-1.002]

42 Henceforth, the superscript • will denote quantities concerning the plugin regularizer. [sent-122, score-0.239]

43 The corresponding estimator $\hat\theta_n^{\bullet\lambda^\bullet} \overset{\text{def}}{=} \operatorname{argmin}_\theta \, L_n(\theta) + R_n^\bullet(\lambda^\bullet, \theta)$ has relative risk $L^\bullet(\lambda^\bullet) = \mathbb{E}[L(\hat\theta_n^{\bullet\lambda^\bullet}) - L(\hat\theta_n^{\bullet 0})]$. [sent-123, score-0.866]

44 The key identity is $\hat\theta_n^{\hat\lambda_n} = \hat\theta_n^{\bullet 1}$, which means the asymptotic risk of the plugin estimator $\hat\theta_n^{\hat\lambda_n}$ is simply $L^\bullet(1)$. [sent-124, score-0.777]

45 We could try to squeeze more out of the plugin regularizer by further optimizing $\lambda^\bullet$ according to $\lambda^{\bullet*} \overset{\text{def}}{=} \operatorname{argmin}_{\lambda^\bullet} L^\bullet(\lambda^\bullet)$ rather than just using $\lambda^\bullet = 1$, and use the oracle plugin estimator $\hat\theta_n^{\bullet\lambda^{\bullet*}}$. [sent-125, score-1.358]

46 In general, this is not useful since λ•∗ might depend on θ∞ , and the whole point of plugin is to remove this dependence. [sent-126, score-0.238]

47 The following theorem relates the risks of all estimators we have considered (see [16] for the proof): Theorem 2 (Relative risk of plugin). [sent-131, score-0.45]

48 The relative risk of the plugin estimator is $L^\bullet(1) = L(\lambda^*) + E$, where $E \overset{\text{def}}{=} \lim_{n\to\infty} n\, \mathbb{E}\big[\operatorname{tr}\{\dot L_n\, (\dot R_n(\lambda^*)\, \varepsilon_n)^\top \ddot L^{-1}\}\big]$. [sent-132, score-1.086]

49 If $R_n(\lambda)$ is linear in $\lambda$, then the relative risk of the oracle plugin estimator is $L^\bullet(\lambda^{\bullet*}) = L^\bullet(1) + \frac{E^2}{4 L(\lambda^*)}$ with $\lambda^{\bullet*} = 1 + \frac{E}{2 L(\lambda^*)}$. [sent-133, score-0.884]
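
As a consistency check (not taken from the paper's proof), the two formulas in Theorem 2 are exactly what one obtains by assuming that, in the linear-in-$\lambda$ case, $L^\bullet$ is quadratic in $\lambda^\bullet$ with $L^\bullet(0) = 0$, curvature $-L(\lambda^*)$, and value $L(\lambda^*) + E$ at $\lambda^\bullet = 1$:

$$
\begin{aligned}
L^\bullet(\lambda^\bullet) &= -L(\lambda^*)\,(\lambda^\bullet)^2 + \big(2L(\lambda^*) + E\big)\,\lambda^\bullet
  && \text{(assumed quadratic form; gives } L^\bullet(1) = L(\lambda^*) + E\text{)}\\
\lambda^{\bullet*} &= \frac{2L(\lambda^*) + E}{2L(\lambda^*)} = 1 + \frac{E}{2L(\lambda^*)}
  && \text{(vertex of the parabola)}\\
L^\bullet(\lambda^{\bullet*}) &= \frac{\big(2L(\lambda^*) + E\big)^2}{4L(\lambda^*)}
  = L(\lambda^*) + E + \frac{E^2}{4L(\lambda^*)} = L^\bullet(1) + \frac{E^2}{4L(\lambda^*)}.
\end{aligned}
$$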

50 Having made all the asymptotic derivations in the general setting, we now only need to make a few straightforward calculations to obtain the asymptotic relative risks and regularization parameters for a given problem. [sent-140, score-0.459]

51 In his seminal 1961 paper [14], Stein showed that, surprisingly, the standard empirical risk minimizer $\hat\theta_n^0 = \bar X \overset{\text{def}}{=} \frac{1}{n}\sum_{i=1}^n X_i$ is beaten by the James-Stein estimator $\hat\theta_n^{\mathrm{JS}} \overset{\text{def}}{=} \bar X\Big(1 - \frac{d-2}{n\|\bar X\|^2}\Big)$ in the sense that $\mathbb{E}[L(\hat\theta_n^{\mathrm{JS}})] < \mathbb{E}[L(\hat\theta_n^0)]$ for all $n$ and $\theta_\infty$ if $d > 2$. [sent-150, score-1.133]

52 We will show that the James-Stein estimator is essentially equivalent to ORACLEPLUGIN with quadratic regularization ($r(\theta) = \frac{1}{2}\|\theta\|^2$). [sent-151, score-0.287]

53 By (7), the oracle regularization weight is $\lambda^* = \frac{d}{\|\theta_\infty\|^2}$, which yields a relative risk of $L(\lambda^*) = -\frac{d^2}{2\|\theta_\infty\|^2}$. [sent-153, score-0.692]
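
To see where these values come from, here is a short worked application of Corollary 1 under assumptions that hold for the well-specified Gaussian mean model with squared loss (namely $\ddot L = \ddot I = I_d$, $\dot r(\theta_\infty) = \theta_\infty$, $\ddot r = I_d$, and no estimator bias, $B = 0$); these identifications are an editorial reconstruction, not quoted from the paper:

$$
\begin{aligned}
C_1 &= \operatorname{tr}\{\ddot I\,\ddot L^{-1}\ddot r\,\ddot L^{-1}\} + 2B^\top\dot r
     = \operatorname{tr}\{I_d\} + 0 = d,\\
C_2 &= \dot r^\top \ddot L^{-1}\dot r = \|\theta_\infty\|^2,\\
\lambda^* &= \frac{C_1}{C_2} = \frac{d}{\|\theta_\infty\|^2},\qquad
L(\lambda^*) = -\frac{C_1^2}{2C_2} = -\frac{d^2}{2\|\theta_\infty\|^2}.
\end{aligned}
$$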

54 By Theorem 2, its relative risk is $L^\bullet(\lambda^{\bullet*}) = -\frac{(d-2)^2}{2\|\theta_\infty\|^2}$, which offers a small improvement over PLUGIN (and is superior to UNREGULARIZED when $d > 2$). [sent-160, score-0.41]

55 Empirically, we found that ORACLEPLUGIN generally had a lower expected risk than JAMESSTEIN when $\|\theta_\infty\|$ is large, but JAMESSTEIN was better when $\|\theta_\infty\| \le 1$. [sent-164, score-0.358]
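
The claimed near-equivalence is easy to probe numerically. The sketch below compares the sample mean, the James-Stein estimator, and a plugin quadratic-regularization shrinkage (shrinking by $1/(1 + \hat\lambda_n/n)$ with $\hat\lambda_n = d/\|\bar X\|^2$) on Gaussian data; the choice of $\hat\lambda_n$ and the shrinkage form are an editorial reading of the setup, not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def mse(est, theta):
    return np.sum((est - theta) ** 2)

def compare(theta, n=20, trials=5000):
    d = len(theta)
    tot = np.zeros(3)  # sample mean, James-Stein, plugin shrinkage
    for _ in range(trials):
        X = theta + rng.standard_normal((n, d))
        xbar = X.mean(axis=0)
        js = xbar * (1.0 - (d - 2) / (n * np.sum(xbar**2)))
        lam_hat = d / np.sum(xbar**2)        # plugin estimate of lambda* = d / ||theta||^2
        plug = xbar / (1.0 + lam_hat / n)    # minimizer of L_n + (lam/2n)||theta||^2
        tot += [mse(xbar, theta), mse(js, theta), mse(plug, theta)]
    return tot / trials

print(compare(theta=np.full(5, 2.0)))   # large ||theta||: the two shrinkage estimators nearly tie
print(compare(theta=np.full(5, 0.2)))   # small ||theta||: James-Stein tends to do better, as claimed
```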

56 Consider a regularizer $r(\theta) = \frac{1}{2}\big(\theta + 2\log(1 + e^{-\theta})\big)$, which corresponds to a $\mathrm{Beta}(\frac{\lambda}{2}, \frac{\lambda}{2})$ prior. [sent-169, score-0.298]

57 We will choose λ to minimize expected risk adaptively based on data. [sent-172, score-0.358]

58 Define $\mu \overset{\text{def}}{=} \frac{1}{1 + e^{-\theta_\infty}}$, $v \overset{\text{def}}{=} \mu(1-\mu)$, and $b \overset{\text{def}}{=} \mu - \frac{1}{2}$. [sent-176, score-1.02]

59 Note that $\lambda^* > 0$, so again (positive) regularization always helps. [sent-179, score-0.164]

60 In (9), $A(\theta; x) = \log \sum_{y} \exp\{\theta^\top \phi(x, y)\}$ is the conditional log-partition function. [Table 2 values of $\operatorname{tr}\{\ddot I\, v_x^{-1} v\, v_x^{-1}\}$, $\lambda^*$, and relative risk at misspecification levels 0%, 5%, and 50% appeared here.] [sent-187, score-0.171]

61 Table 2: The oracle regularizer for the hybrid generative-discriminative estimator. [sent-194, score-0.493]

62 As misspecification increases, we regularize less, but the relative risk is reduced more (due to more variance reduction). [sent-195, score-0.466]

63 Given these definitions, we can either use a generative estimator $\hat\theta_n^{\mathrm{gen}} \overset{\text{def}}{=} \operatorname{argmin}_\theta G_n(\theta)$, where $G_n(\theta) = -\frac{1}{n}\sum_{i=1}^n \log p_\theta(x_i, y_i)$, or a discriminative estimator $\hat\theta_n^{\mathrm{dis}} \overset{\text{def}}{=} \operatorname{argmin}_\theta D_n(\theta)$, where $D_n(\theta) = -\frac{1}{n}\sum_{i=1}^n \log p_\theta(y_i \mid x_i)$. [sent-196, score-1.042]

64 [17] showed that if the generative model is well-specified ($p^*(x, y) = p_{\theta_\infty}(x, y)$), then the generative estimator is better in the sense that $L(\hat\theta_n^{\mathrm{gen}}) \le L(\hat\theta_n^{\mathrm{dis}}) - \frac{c}{n} + O_p(n^{-\frac{3}{2}})$ for some $c \ge 0$; if the model is misspecified, the discriminative estimator is asymptotically better. [sent-198, score-0.468]

65 To create a hybrid estimator, let us treat the discriminative and generative objectives as the empirical risk and the regularizer, respectively, so $\ell((x, y); \theta) = -\log p_\theta(y \mid x)$, $L_n = D_n$, and $R_n(\lambda, \theta) = \frac{\lambda}{n} G_n(\theta)$. [sent-199, score-0.524]
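
A small end-to-end sketch of this hybrid objective for a toy joint exponential family over binary $x \in \{0,1\}^k$ and $y \in \{0,1\}$ (in the spirit of the synthetic example discussed below). The feature map `phi` and the enumeration of configurations are illustrative assumptions, not taken from the paper; in practice $\lambda$ would come from the oracle/plugin formulas or cross-validation.

```python
import itertools
import numpy as np
from scipy.optimize import minimize

def phi(x, y):
    # Hypothetical feature map for p_theta(x, y) proportional to exp(theta . phi(x, y)).
    return np.concatenate(([y], y * x, x)).astype(float)

def gen_nll(theta, X, Y, configs):
    # Generative objective G_n: normalize over all (x, y) configurations.
    logZ = np.log(sum(np.exp(phi(x, y) @ theta) for x, y in configs))
    return -np.mean([phi(x, y) @ theta - logZ for x, y in zip(X, Y)])

def dis_nll(theta, X, Y):
    # Discriminative objective D_n: normalize over y only, conditioning on x.
    def cond(x, y):
        s0, s1 = phi(x, 0) @ theta, phi(x, 1) @ theta
        return (s1 if y == 1 else s0) - np.logaddexp(s0, s1)
    return -np.mean([cond(x, y) for x, y in zip(X, Y)])

def hybrid_fit(X, Y, lam):
    # Hybrid estimator: minimize D_n(theta) + (lam / n) * G_n(theta).
    k = X.shape[1]
    configs = [(np.array(c), y) for c in itertools.product([0, 1], repeat=k) for y in (0, 1)]
    n, d = len(Y), 2 * k + 1
    obj = lambda th: dis_nll(th, X, Y) + (lam / n) * gen_nll(th, X, Y, configs)
    return minimize(obj, np.zeros(d), method="BFGS").x
```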

66 By moment-generating properties of the exponential family, we arrive at the following quantities (write $\phi$ for $\phi(X, Y)$): $\ddot L = v_x \overset{\text{def}}{=} \mathbb{E}_{p^*(X)}\big[\mathbb{V}_{p_{\theta_\infty}(Y|X)}[\phi \mid X]\big]$, $\dot R(\lambda) = \lambda(\mu - \mu_{xy}) \overset{\text{def}}{=} \lambda\big(\mathbb{E}_{p_{\theta_\infty}(X,Y)}[\phi] - \mathbb{E}_{p^*(X,Y)}[\phi]\big)$, and $\ddot R(\lambda) = \lambda v \overset{\text{def}}{=} \lambda\, \mathbb{V}_{p_{\theta_\infty}(X,Y)}[\phi]$. [sent-202, score-0.815]

67 The oracle regularization parameter is then $\lambda^* = \dfrac{\operatorname{tr}\{\ddot I\, v_x^{-1} v\, v_x^{-1}\} + 2 B^\top(\mu - \mu_{xy}) - \operatorname{tr}\{I_r\, v_x^{-1}\}}{\operatorname{tr}\{(\mu - \mu_{xy})^{\otimes 2}\, v_x^{-1}\}}$ (10). [sent-203, score-0.588]

68 The sign and magnitude of $\lambda^*$ provide insight into how generative regularization improves prediction as a function of the model and problem: specifically, a large positive $\lambda^*$ suggests regularization is helpful. [sent-204, score-0.378]

69 In this case, $\ddot I = \ddot L$, $I_r = v_x$, and so the numerator reduces to $\operatorname{tr}\{(v - v_x)\, v_x^{-1}\} + 2 B^\top(\mu - \mu_{xy})$. [sent-206, score-0.27]

70 Since $v \succeq v_x$ (the key fact used in [17]), the variance reduction (plus the random alignment term from $I_r$) is always non-negative, with magnitude equal to the fraction of missing information provided by the generative model. [sent-207, score-0.304]

71 If the generative model is almost well-specified, µ will be close to µxy , and the regularizer should be trusted more (large λ∗ ). [sent-210, score-0.37]

72 An empirical example: To provide some concrete intuition, we investigated the oracle regularizer for a synthetic binary classification problem of predicting $y \in \{0, 1\}$ from $x \in \{0, 1\}^k$. [sent-212, score-0.436]

73 Table 2 shows how the oracle regularizer changes with ε. [sent-216, score-0.436]

74 But perhaps surprisingly, the relative risk is reduced with more misspecification; this is due to the fact that the variance reduction term increases and has a quadratic effect on L(λ∗ ). [sent-218, score-0.499]

75 Figure 1(a) shows the relative risk Ln (λ) for various values of λ. [sent-219, score-0.41]

76 Define the unlabeled regularizer as $R_n(\lambda, \theta) = -\frac{\lambda}{nm}\sum_{i=1}^m \log p_\theta(X_{n+i})$. [sent-230, score-0.323]
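
In code, this semi-supervised regularizer is just a scaled average marginal negative log-likelihood over the $m$ unlabeled inputs. The sketch below assumes the caller supplies a `log_p_x(theta, x)` routine for the model's marginal likelihood; that helper is a hypothetical placeholder, not something defined in the paper.

```python
def unlabeled_reg(theta, X_unlabeled, lam, n, log_p_x):
    """R_n(lam, theta) = -(lam / (n * m)) * sum_i log p_theta(x_{n+i})."""
    m = len(X_unlabeled)
    return -(lam / (n * m)) * sum(log_p_x(theta, x) for x in X_unlabeled)
```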

77 Also, the regularizer's Hessian is proportional to $v - v_x$, and $I_r = 0$ (the regularizer doesn't depend on the labeled data). [sent-232, score-0.433]

78 If the model is conditionally well-specified, we can verify that the oracle regularization parameter $\lambda^*$ is the same as if we had regularized with $G_n$. [sent-233, score-0.306]

79 This equivalence suggests that the dominant concern asymptotically is developing an adequate generative model with small bias and not exactly how it is used in learning. [sent-234, score-0.146]

80 4 Multi-task regression: The intuition behind multi-task learning is to share statistical strength between tasks [3, 12, 2, 13]. [sent-236, score-0.136]

81 Optimizing with respect to $\Lambda$ produces the oracle regularization parameter $\Lambda^* = d(\Theta_\infty^\top \Theta_\infty)^{-1}$ and its associated relative risk $L(\Lambda^*) = -\frac{1}{2} d^2 \operatorname{tr}\{(\Theta_\infty^\top \Theta_\infty)^{-1}\}$. [sent-266, score-0.692]

82 To analyze PLUGIN, first compute $\dot f = -d(\Theta_\infty^\top \Theta_\infty)^{-1}\big(2\Theta_\infty^\top(\cdot)\big)(\Theta_\infty^\top \Theta_\infty)^{-1}$; we find that PLUGIN increases the asymptotic risk by $E = 2d\, \operatorname{tr}\{(\Theta_\infty^\top \Theta_\infty)^{-1}\}$. [sent-267, score-0.462]

83 However, the relative risk of PLUGIN is still favorable when $d > 4$, as $L^\bullet(1) = -\frac{1}{2} d(d-4)\operatorname{tr}\{(\Theta_\infty^\top \Theta_\infty)^{-1}\} < 0$ for $d > 4$. [sent-268, score-0.432]

84 We can do slightly better using ORACLEPLUGIN ($\lambda^{\bullet*} = 1 - \frac{2}{d}$), which results in a relative risk of $L^\bullet(\lambda^{\bullet*}) = -\frac{1}{2}(d-2)^2 \operatorname{tr}\{(\Theta_\infty^\top \Theta_\infty)^{-1}\}$. [sent-269, score-0.41]

85 For comparison, if we had solved the $K$ regression tasks completely independently with $K$ independent regularization parameters, our relative risk would have been $-\frac{1}{2}(d-2)^2 \sum_{k=1}^K \|\theta_\infty^k\|^{-2}$ (following similar but simpler computations). [sent-270, score-0.621]
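
For intuition about what the matrix regularization parameter $\Lambda$ does, here is a hedged sketch of a jointly regularized multi-task least-squares fit with penalty $\frac{1}{2}\operatorname{tr}\{\Theta\Lambda\Theta^\top\}$; this penalty form and the shared design matrix are editorial assumptions about the setup, not details quoted from the paper. The stationarity condition $X^\top X\,\Theta + \Theta\Lambda = X^\top Y$ is a Sylvester equation, and the oracle choice plugs in $\Lambda^* = d(\Theta_\infty^\top\Theta_\infty)^{-1}$, which is only computable in simulation (a plugin version would substitute an estimate of $\Theta_\infty$).

```python
import numpy as np
from scipy.linalg import solve_sylvester

def multitask_fit(X, Y, Lam):
    """Minimize 0.5*||Y - X Theta||_F^2 + 0.5*tr(Theta Lam Theta^T) over the
    d x K matrix Theta; the normal equations form the Sylvester equation
    (X^T X) Theta + Theta Lam = X^T Y."""
    return solve_sylvester(X.T @ X, Lam, X.T @ Y)

def oracle_Lambda(Theta_true, d):
    # Lambda* = d * (Theta^T Theta)^{-1}: a K x K task-relatedness matrix.
    return d * np.linalg.inv(Theta_true.T @ Theta_true)

# Toy usage: K related tasks whose parameters are small perturbations of a shared vector.
rng = np.random.default_rng(2)
d, K, n = 8, 5, 40
Theta_true = rng.standard_normal((d, 1)) + 0.1 * rng.standard_normal((d, K))
X = rng.standard_normal((n, d))
Y = X @ Theta_true + rng.standard_normal((n, K))
Theta_hat = multitask_fit(X, Y, oracle_Lambda(Theta_true, d))
```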

86 The difference in relative risks between joint and independent regularization is $\Delta = -\frac{1}{2}(d-2)^2\big(\sum_k D_{kk}^{-1} - \sum_k A_{kk}^{-1}\big)$ ($\Delta < 0$ means joint regularization is better). [sent-273, score-0.395]

87 The gap between joint and independent regularization is large when the tasks are non-trivial but similar (the $\theta_\infty^k$ are close to one another, but $\|\theta_\infty^k\|$ is large). [sent-274, score-0.19]

88 MHC-I binding prediction: We evaluated our multi-task regularization method on the IEDB MHC-I peptide binding dataset created by [19] and used by [13]. [sent-276, score-0.311]

89 (b) On the MHC-I binding prediction task, test risk for the four multi-task estimators; PLUGIN CV (estimating all pairwise task affinities using PLUGIN and cross-validating the strength) works best. [sent-288, score-0.395]

90 Multi-task regularization actually performs worse than independent learning (DIAG CV) if we assume all tasks are equally related (UNIFORM CV). [sent-290, score-0.216]

91 4 Related work and discussion: The subject of choosing regularization parameters has received much attention. [sent-293, score-0.144]

92 Much of the learning theory literature focuses on risk bounds, which approximate the expected risk $L(\hat\theta_n^\lambda)$ with upper bounds. [sent-294, score-0.695]

93 Though we cannot make a precise statement about the risk for any given n, exact control over the first few terms offers other advantages, e. [sent-296, score-0.337]

94 To elaborate further, risk bounds are generally based on the complexity of the hypothesis class, whereas our analysis is based on the variance of the estimator. [sent-299, score-0.402]

95 Vanilla uniform convergence bounds yield worst-case analyses, whereas our asymptotic analysis is tailored to a particular problem (p∗ and θ∞ ) and algorithm (estimator). [sent-300, score-0.148]

96 In fact, our methodology of performing a Taylor expansion of the risk is reminiscent of AIC [1]. [sent-304, score-0.364]

97 However, our aim is different: AIC is intended for model selection, whereas we are interested in optimizing regularization parameters. [sent-305, score-0.193]

98 The Stein unbiased risk estimate (SURE) is another method of estimating the expected risk for linear models [21], with generalizations to non-linear models [11]. [sent-306, score-0.695]

99 To conclude, we have developed a general asymptotic framework for analyzing regularization, along with an efficient procedure for choosing regularization parameters. [sent-311, score-0.248]

100 Although we are so far restricted to parametric problems with smooth losses and regularizers, we think that these tools provide a complementary perspective on analyzing learning algorithms to that of risk bounds, deepening our understanding of regularization. [sent-312, score-0.395]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('lugin', 0.484), ('def', 0.34), ('risk', 0.337), ('regularizer', 0.298), ('racle', 0.251), ('plugin', 0.22), ('regularization', 0.144), ('ln', 0.142), ('oracle', 0.138), ('vx', 0.135), ('misspeci', 0.126), ('cv', 0.117), ('estimator', 0.116), ('tr', 0.115), ('rn', 0.108), ('argmin', 0.108), ('asymptotic', 0.104), ('xy', 0.099), ('nregularized', 0.09), ('op', 0.074), ('relative', 0.073), ('generative', 0.072), ('unregularized', 0.072), ('tein', 0.072), ('regularizers', 0.061), ('binding', 0.058), ('estimators', 0.058), ('discriminative', 0.058), ('hybrid', 0.057), ('limn', 0.054), ('tasks', 0.046), ('gn', 0.043), ('bouchard', 0.043), ('js', 0.043), ('loss', 0.041), ('james', 0.041), ('bias', 0.04), ('rd', 0.04), ('nities', 0.038), ('strength', 0.037), ('alignment', 0.036), ('dkk', 0.036), ('iag', 0.036), ('niform', 0.036), ('stein', 0.036), ('vvx', 0.036), ('regularize', 0.035), ('smooth', 0.034), ('asymptotically', 0.034), ('risks', 0.034), ('xn', 0.033), ('intuition', 0.032), ('liang', 0.032), ('ez', 0.031), ('aic', 0.031), ('dis', 0.031), ('gen', 0.031), ('ep', 0.029), ('peptide', 0.029), ('quadratic', 0.027), ('rb', 0.027), ('expansion', 0.027), ('squared', 0.027), ('worse', 0.026), ('impact', 0.026), ('optimizing', 0.026), ('bach', 0.026), ('evgeniou', 0.025), ('dn', 0.025), ('unlabeled', 0.025), ('af', 0.025), ('vp', 0.024), ('parametric', 0.024), ('conditionally', 0.024), ('smoothing', 0.023), ('whereas', 0.023), ('favorable', 0.022), ('interpretable', 0.022), ('kd', 0.022), ('doesn', 0.022), ('multitask', 0.022), ('theorem', 0.021), ('asymptotics', 0.021), ('regression', 0.021), ('expected', 0.021), ('variance', 0.021), ('bounds', 0.021), ('increases', 0.021), ('always', 0.02), ('mahalanobis', 0.02), ('reduction', 0.02), ('differentiable', 0.02), ('quantities', 0.019), ('berkeley', 0.019), ('sections', 0.019), ('france', 0.018), ('bousquet', 0.018), ('id', 0.018), ('might', 0.018), ('sign', 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 37 nips-2009-Asymptotically Optimal Regularization in Smooth Parametric Models

Author: Percy Liang, Guillaume Bouchard, Francis R. Bach, Michael I. Jordan

Abstract: Many types of regularization schemes have been employed in statistical learning, each motivated by some assumption about the problem domain. In this paper, we present a unified asymptotic analysis of smooth regularizers, which allows us to see how the validity of these assumptions impacts the success of a particular regularizer. In addition, our analysis motivates an algorithm for optimizing regularization parameters, which in turn can be analyzed within our framework. We apply our analysis to several examples, including hybrid generative-discriminative learning and multi-task learning. 1

2 0.2244404 98 nips-2009-From PAC-Bayes Bounds to KL Regularization

Author: Pascal Germain, Alexandre Lacasse, Mario Marchand, Sara Shanian, François Laviolette

Abstract: We show that convex KL-regularized objective functions are obtained from a PAC-Bayes risk bound when using convex loss functions for the stochastic Gibbs classifier that upper-bound the standard zero-one loss used for the weighted majority vote. By restricting ourselves to a class of posteriors, that we call quasi uniform, we propose a simple coordinate descent learning algorithm to minimize the proposed KL-regularized cost function. We show that standard $\ell_p$-regularized objective functions currently used, such as ridge regression and $\ell_p$-regularized boosting, are obtained from a relaxation of the KL divergence between the quasi uniform posterior and the uniform prior. We present numerical experiments where the proposed learning algorithm generally outperforms ridge regression and AdaBoost. 1

3 0.20651801 64 nips-2009-Data-driven calibration of linear estimators with minimal penalties

Author: Sylvain Arlot, Francis R. Bach

Abstract: This paper tackles the problem of selecting among several linear estimators in non-parametric regression; this includes model selection for linear regression, the choice of a regularization parameter in kernel ridge regression or spline smoothing, and the choice of a kernel in multiple kernel learning. We propose a new algorithm which first estimates consistently the variance of the noise, based upon the concept of minimal penalty which was previously introduced in the context of model selection. Then, plugging our variance estimate in Mallows’ CL penalty is proved to lead to an algorithm satisfying an oracle inequality. Simulation experiments with kernel ridge regression and multiple kernel learning show that the proposed algorithm often improves significantly existing calibration procedures such as 10-fold cross-validation or generalized cross-validation. 1

4 0.17561083 55 nips-2009-Compressed Least-Squares Regression

Author: Odalric Maillard, Rémi Munos

Abstract: We consider the problem of learning, from K data, a regression function in a linear space of high dimension N using projections onto a random subspace of lower dimension M. From any algorithm minimizing the (possibly penalized) empirical risk, we provide bounds on the excess risk of the estimate computed in the projected subspace (compressed domain) in terms of the excess risk of the estimate built in the high-dimensional space (initial domain). We show that solving the problem in the compressed domain instead of the initial domain reduces the estimation error at the price of an increased (but controlled) approximation error. We apply the analysis to Least-Squares (LS) regression and discuss the excess risk and numerical complexity of the resulting "Compressed Least-Squares Regression" (CLSR) in terms of N, K, and M. When we choose M = O(√K), we show that CLSR has an estimation error of order O(log K/√K). 1

5 0.14981289 20 nips-2009-A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers

Author: Sahand Negahban, Bin Yu, Martin J. Wainwright, Pradeep K. Ravikumar

Abstract: High-dimensional statistical inference deals with models in which the number of parameters p is comparable to or larger than the sample size n. Since it is usually impossible to obtain consistent procedures unless p/n → 0, a line of recent work has studied models with various types of structure (e.g., sparse vectors; block-structured matrices; low-rank matrices; Markov assumptions). In such settings, a general approach to estimation is to solve a regularized convex program (known as a regularized M-estimator) which combines a loss function (measuring how well the model fits the data) with some regularization function that encourages the assumed structure. The goal of this paper is to provide a unified framework for establishing consistency and convergence rates for such regularized M-estimators under high-dimensional scaling. We state one main theorem and show how it can be used to re-derive several existing results, and also to obtain several new results on consistency and convergence rates. Our analysis also identifies two key properties of loss and regularization functions, referred to as restricted strong convexity and decomposability, that ensure the corresponding regularized M-estimators have fast convergence rates. 1

6 0.1440334 101 nips-2009-Generalization Errors and Learning Curves for Regression with Multi-task Gaussian Processes

7 0.11366671 116 nips-2009-Information-theoretic lower bounds on the oracle complexity of convex optimization

8 0.10158034 214 nips-2009-Semi-supervised Regression using Hessian energy with an application to semi-supervised dimensionality reduction

9 0.097997263 71 nips-2009-Distribution-Calibrated Hierarchical Classification

10 0.085911199 94 nips-2009-Fast Learning from Non-i.i.d. Observations

11 0.078097776 229 nips-2009-Statistical Analysis of Semi-Supervised Learning: The Limit of Infinite Unlabelled Data

12 0.077456512 207 nips-2009-Robust Nonparametric Regression with Metric-Space Valued Output

13 0.072753876 91 nips-2009-Fast, smooth and adaptive regression in metric spaces

14 0.067558855 199 nips-2009-Ranking Measures and Loss Functions in Learning to Rank

15 0.064645544 50 nips-2009-Canonical Time Warping for Alignment of Human Behavior

16 0.06277585 245 nips-2009-Thresholding Procedures for High Dimensional Variable Selection and Statistical Estimation

17 0.059137411 89 nips-2009-FACTORIE: Probabilistic Programming via Imperatively Defined Factor Graphs

18 0.056789782 118 nips-2009-Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions

19 0.055132098 130 nips-2009-Learning from Multiple Partially Observed Views - an Application to Multilingual Text Categorization

20 0.05513154 61 nips-2009-Convex Relaxation of Mixture Regression with Efficient Algorithms


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.177), (1, 0.179), (2, 0.002), (3, 0.094), (4, 0.002), (5, -0.049), (6, -0.01), (7, -0.122), (8, 0.05), (9, 0.071), (10, 0.117), (11, -0.012), (12, -0.06), (13, -0.115), (14, 0.185), (15, 0.067), (16, -0.0), (17, 0.046), (18, 0.263), (19, -0.038), (20, 0.097), (21, -0.041), (22, 0.114), (23, -0.187), (24, 0.075), (25, -0.028), (26, -0.001), (27, -0.006), (28, 0.118), (29, 0.091), (30, 0.108), (31, 0.068), (32, 0.038), (33, -0.132), (34, -0.04), (35, 0.014), (36, 0.092), (37, -0.018), (38, -0.002), (39, -0.011), (40, 0.018), (41, -0.003), (42, -0.036), (43, -0.009), (44, -0.087), (45, 0.081), (46, 0.034), (47, 0.029), (48, 0.05), (49, 0.012)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9529314 37 nips-2009-Asymptotically Optimal Regularization in Smooth Parametric Models

Author: Percy Liang, Guillaume Bouchard, Francis R. Bach, Michael I. Jordan

Abstract: Many types of regularization schemes have been employed in statistical learning, each motivated by some assumption about the problem domain. In this paper, we present a unified asymptotic analysis of smooth regularizers, which allows us to see how the validity of these assumptions impacts the success of a particular regularizer. In addition, our analysis motivates an algorithm for optimizing regularization parameters, which in turn can be analyzed within our framework. We apply our analysis to several examples, including hybrid generative-discriminative learning and multi-task learning. 1

2 0.73594093 98 nips-2009-From PAC-Bayes Bounds to KL Regularization

Author: Pascal Germain, Alexandre Lacasse, Mario Marchand, Sara Shanian, François Laviolette

Abstract: We show that convex KL-regularized objective functions are obtained from a PAC-Bayes risk bound when using convex loss functions for the stochastic Gibbs classifier that upper-bound the standard zero-one loss used for the weighted majority vote. By restricting ourselves to a class of posteriors that we call quasi-uniform, we propose a simple coordinate descent learning algorithm to minimize the proposed KL-regularized cost function. We show that standard ℓp-regularized objective functions currently used, such as ridge regression and ℓp-regularized boosting, are obtained from a relaxation of the KL divergence between the quasi-uniform posterior and the uniform prior. We present numerical experiments where the proposed learning algorithm generally outperforms ridge regression and AdaBoost. 1
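In the simplest case — a risk that is linear in the posterior with a KL penalty toward a fixed prior — the minimizer of such a KL-regularized objective is the closed-form Gibbs posterior. The sketch below shows only this textbook special case, not the paper's coordinate-descent algorithm over quasi-uniform posteriors; the voter losses and temperature C are made up for illustration.

import numpy as np

def gibbs_posterior(voter_losses, C, prior=None):
    # argmin_q  E_q[loss] + (1/C) * KL(q || prior)  over probability vectors q:
    # the Gibbs posterior q_i proportional to prior_i * exp(-C * loss_i).
    losses = np.asarray(voter_losses, dtype=float)
    if prior is None:
        prior = np.full(losses.shape, 1.0 / losses.size)   # uniform prior
    logits = np.log(prior) - C * losses
    logits -= logits.max()                                  # numerical stability
    q = np.exp(logits)
    return q / q.sum()

# empirical losses of five base classifiers (toy numbers)
losses = [0.45, 0.30, 0.25, 0.40, 0.50]
for C in (1.0, 10.0, 100.0):
    print("C =", C, "->", np.round(gibbs_posterior(losses, C), 3))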

3 0.68196553 55 nips-2009-Compressed Least-Squares Regression

Author: Odalric Maillard, Rémi Munos

Abstract: We consider the problem of learning, from K data, a regression function in a linear space of high dimension N using projections onto a random subspace of lower dimension M. From any algorithm minimizing the (possibly penalized) empirical risk, we provide bounds on the excess risk of the estimate computed in the projected subspace (compressed domain) in terms of the excess risk of the estimate built in the high-dimensional space (initial domain). We show that solving the problem in the compressed domain instead of the initial domain reduces the estimation error at the price of an increased (but controlled) approximation error. We apply the analysis to Least-Squares (LS) regression and discuss the excess risk and numerical complexity of the resulting "Compressed Least-Squares Regression" (CLSR) in terms of N, K, and M. When we choose M = O(√K), we show that CLSR has an estimation error of order O(log K / √K). 1
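A minimal numpy sketch of the compressed least-squares recipe described above: multiply the K × N feature matrix by a random M × N Gaussian matrix with M of order √K, then solve ordinary least squares in the M-dimensional compressed domain. The synthetic features and the exact choice M = ⌈√K⌉ are assumptions made for illustration, not the authors' implementation.

import numpy as np

def compressed_least_squares(Phi, y, M, rng):
    # Phi: K x N feature matrix (N may exceed K). Project the features onto a
    # random M-dimensional subspace and run ordinary least squares there.
    K, N = Phi.shape
    A = rng.standard_normal((M, N)) / np.sqrt(M)       # random projection (Johnson-Lindenstrauss style)
    Psi = Phi @ A.T                                    # K x M compressed features
    beta, *_ = np.linalg.lstsq(Psi, y, rcond=None)     # LS solution in the compressed domain
    predict = lambda Phi_new: (Phi_new @ A.T) @ beta
    return beta, predict

rng = np.random.default_rng(0)
K, N = 200, 2000                                       # more features than data points
Phi = rng.standard_normal((K, N))
alpha_star = rng.standard_normal(N) / np.sqrt(N)
y = Phi @ alpha_star + 0.1 * rng.standard_normal(K)
M = int(np.ceil(np.sqrt(K)))                           # M = O(sqrt(K)) random projections
beta, predict = compressed_least_squares(Phi, y, M, rng)
print("train RMSE:", np.sqrt(np.mean((predict(Phi) - y) ** 2)))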

4 0.60145217 64 nips-2009-Data-driven calibration of linear estimators with minimal penalties

Author: Sylvain Arlot, Francis R. Bach

Abstract: This paper tackles the problem of selecting among several linear estimators in non-parametric regression; this includes model selection for linear regression, the choice of a regularization parameter in kernel ridge regression or spline smoothing, and the choice of a kernel in multiple kernel learning. We propose a new algorithm which first consistently estimates the variance of the noise, based upon the concept of minimal penalty, which was previously introduced in the context of model selection. Then, plugging our variance estimate into Mallows' C_L penalty is proved to lead to an algorithm satisfying an oracle inequality. Simulation experiments with kernel ridge regression and multiple kernel learning show that the proposed algorithm often significantly improves on existing calibration procedures such as 10-fold cross-validation or generalized cross-validation. 1
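The second step of the procedure — plugging a noise-variance estimate into Mallows' C_L criterion to pick among linear estimators — can be sketched as follows for ridge regression. The minimal-penalty variance estimation that is the paper's main contribution is omitted here: the sketch simply assumes an estimate sigma2_hat is available, and the data and candidate grid are illustrative.

import numpy as np

def select_ridge_by_cl(X, y, sigma2_hat, lambdas):
    # Mallows' C_L for a linear smoother y_hat = H y:
    #   C_L(lambda) = ||y - y_hat||^2 + 2 * sigma2_hat * trace(H).
    n, d = X.shape
    scores = []
    for lam in lambdas:
        G = np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T)
        H = X @ G                                   # ridge smoothing matrix
        y_hat = H @ y
        df = np.trace(H)                            # effective degrees of freedom
        scores.append(np.sum((y - y_hat) ** 2) + 2.0 * sigma2_hat * df)
    return lambdas[int(np.argmin(scores))]

rng = np.random.default_rng(1)
n, d = 100, 20
X = rng.standard_normal((n, d))
w = rng.standard_normal(d)
y = X @ w + 0.5 * rng.standard_normal(n)
best_lam = select_ridge_by_cl(X, y, sigma2_hat=0.25, lambdas=np.logspace(-4, 1, 30))
print("selected lambda:", best_lam)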

5 0.55389625 94 nips-2009-Fast Learning from Non-i.i.d. Observations

Author: Ingo Steinwart, Andreas Christmann

Abstract: We prove an oracle inequality for generic regularized empirical risk minimization algorithms learning from α-mixing processes. To illustrate this oracle inequality, we use it to derive learning rates for some learning methods including least squares SVMs. Since the proof of the oracle inequality uses recent localization ideas developed for independent and identically distributed (i.i.d.) processes, it turns out that these learning rates are close to the optimal rates known in the i.i.d. case. 1

6 0.51484054 91 nips-2009-Fast, smooth and adaptive regression in metric spaces

7 0.4740808 71 nips-2009-Distribution-Calibrated Hierarchical Classification

8 0.45903033 124 nips-2009-Lattice Regression

9 0.45538965 130 nips-2009-Learning from Multiple Partially Observed Views - an Application to Multilingual Text Categorization

10 0.45507833 207 nips-2009-Robust Nonparametric Regression with Metric-Space Valued Output

11 0.44628391 101 nips-2009-Generalization Errors and Learning Curves for Regression with Multi-task Gaussian Processes

12 0.43969449 67 nips-2009-Directed Regression

13 0.40252426 20 nips-2009-A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers

14 0.40087533 108 nips-2009-Heterogeneous multitask learning with joint sparsity constraints

15 0.39327148 69 nips-2009-Discrete MDL Predicts in Total Variation

16 0.37013355 214 nips-2009-Semi-supervised Regression using Hessian energy with an application to semi-supervised dimensionality reduction

17 0.33969259 12 nips-2009-A Generalized Natural Actor-Critic Algorithm

18 0.32606566 73 nips-2009-Dual Averaging Method for Regularized Stochastic Learning and Online Optimization

19 0.32187638 14 nips-2009-A Parameter-free Hedging Algorithm

20 0.31162146 116 nips-2009-Information-theoretic lower bounds on the oracle complexity of convex optimization


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(8, 0.249), (24, 0.063), (25, 0.109), (35, 0.038), (36, 0.095), (39, 0.031), (58, 0.114), (61, 0.064), (71, 0.04), (81, 0.012), (86, 0.079), (91, 0.013)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.79503089 37 nips-2009-Asymptotically Optimal Regularization in Smooth Parametric Models

Author: Percy Liang, Guillaume Bouchard, Francis R. Bach, Michael I. Jordan

Abstract: Many types of regularization schemes have been employed in statistical learning, each motivated by some assumption about the problem domain. In this paper, we present a unified asymptotic analysis of smooth regularizers, which allows us to see how the validity of these assumptions impacts the success of a particular regularizer. In addition, our analysis motivates an algorithm for optimizing regularization parameters, which in turn can be analyzed within our framework. We apply our analysis to several examples, including hybrid generative-discriminative learning and multi-task learning. 1

2 0.7149173 72 nips-2009-Distribution Matching for Transduction

Author: Novi Quadrianto, James Petterson, Alex J. Smola

Abstract: Many transductive inference algorithms assume that distributions over training and test estimates should be related, e.g. by providing a large margin of separation on both sets. We use this idea to design a transduction algorithm which can be used without modification for classification, regression, and structured estimation. At its heart we exploit the fact that for a good learner the distributions over the outputs on training and test sets should match. This is a classical two-sample problem which can be solved efficiently in its most general form by using distance measures in Hilbert Space. It turns out that a number of existing heuristics can be viewed as special cases of our approach. 1
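The two-sample distance in Hilbert space that the abstract refers to is the maximum mean discrepancy (MMD). The sketch below only computes a (biased) MMD² estimate between a learner's outputs on training and test points with an RBF kernel; using it as a transduction penalty, as the paper does, is not shown, and the bandwidth and toy scores are assumptions.

import numpy as np

def mmd2_rbf(a, b, bandwidth=1.0):
    # Biased estimate of squared MMD between two samples of scalar outputs,
    # using an RBF kernel of the given bandwidth.
    a = np.asarray(a, dtype=float)[:, None]
    b = np.asarray(b, dtype=float)[:, None]
    k = lambda u, v: np.exp(-((u - v.T) ** 2) / (2.0 * bandwidth ** 2))
    return k(a, a).mean() + k(b, b).mean() - 2.0 * k(a, b).mean()

rng = np.random.default_rng(0)
train_scores = rng.normal(0.0, 1.0, 200)   # outputs of a learner on training points
test_scores = rng.normal(0.5, 1.0, 300)    # outputs on test points (distribution shifted)
print("MMD^2 estimate:", mmd2_rbf(train_scores, test_scores))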

3 0.65101135 64 nips-2009-Data-driven calibration of linear estimators with minimal penalties

Author: Sylvain Arlot, Francis R. Bach

Abstract: This paper tackles the problem of selecting among several linear estimators in non-parametric regression; this includes model selection for linear regression, the choice of a regularization parameter in kernel ridge regression or spline smoothing, and the choice of a kernel in multiple kernel learning. We propose a new algorithm which first consistently estimates the variance of the noise, based upon the concept of minimal penalty, which was previously introduced in the context of model selection. Then, plugging our variance estimate into Mallows' C_L penalty is proved to lead to an algorithm satisfying an oracle inequality. Simulation experiments with kernel ridge regression and multiple kernel learning show that the proposed algorithm often significantly improves on existing calibration procedures such as 10-fold cross-validation or generalized cross-validation. 1

4 0.64613771 33 nips-2009-Analysis of SVM with Indefinite Kernels

Author: Yiming Ying, Colin Campbell, Mark Girolami

Abstract: The recent introduction of indefinite SVM by Luss and d’Aspremont [15] has effectively demonstrated SVM classification with a non-positive semi-definite kernel (indefinite kernel). This paper studies the properties of the objective function introduced there. In particular, we show that the objective function is continuously differentiable and its gradient can be explicitly computed. Indeed, we further show that its gradient is Lipschitz continuous. The main idea behind our analysis is that the objective function is smoothed by the penalty term, in its saddle (min-max) representation, measuring the distance between the indefinite kernel matrix and the proxy positive semi-definite one. Our elementary result greatly facilitates the application of gradient-based algorithms. Based on our analysis, we further develop Nesterov’s smooth optimization approach [17, 18] for indefinite SVM which has an optimal convergence rate for smooth problems. Experiments on various benchmark datasets validate our analysis and demonstrate the efficiency of our proposed algorithms.
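The paper's objective measures the distance between the indefinite kernel matrix and a proxy positive semi-definite one and solves a saddle-point problem; that algorithm is not reproduced here. As a much simpler stand-in, the sketch below only computes the nearest PSD kernel in Frobenius norm by clipping negative eigenvalues, one common way to obtain a proxy before training a standard kernel method.

import numpy as np

def nearest_psd_kernel(K):
    # Project a symmetric (possibly indefinite) kernel matrix onto the PSD cone
    # in Frobenius norm by zeroing its negative eigenvalues.
    K = (K + K.T) / 2.0
    w, V = np.linalg.eigh(K)
    return (V * np.maximum(w, 0.0)) @ V.T

rng = np.random.default_rng(0)
S = rng.standard_normal((6, 6))
S = (S + S.T) / 2.0                          # toy indefinite similarity matrix
K_psd = nearest_psd_kernel(S)
print("min eigenvalue before:", round(float(np.linalg.eigvalsh(S).min()), 3))
print("min eigenvalue after: ", round(float(np.linalg.eigvalsh(K_psd).min()), 6))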

5 0.64364988 1 nips-2009-$L_1$-Penalized Robust Estimation for a Class of Inverse Problems Arising in Multiview Geometry

Author: Arnak Dalalyan, Renaud Keriven

Abstract: We propose a new approach to the problem of robust estimation in multiview geometry. Inspired by recent advances in the sparse recovery problem in statistics, we define our estimator as a Bayesian maximum a posteriori estimate with a multivariate Laplace prior on the vector describing the outliers. This leads to an estimator in which the fidelity to the data is measured by the L∞-norm while the regularization is done by the L1-norm. The proposed procedure is fairly fast since the outlier removal is done by solving one linear program (LP). An important difference from existing algorithms is that our estimator requires specifying neither the number nor the proportion of the outliers. We present strong theoretical results assessing the accuracy of our procedure, as well as a numerical example illustrating its efficiency on real data. 1
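A toy analogue of the L∞-fidelity / L1-outlier structure described above, written as a single linear program (not the paper's multiview-geometry formulation): given linear measurements b ≈ A x contaminated by a few gross errors, minimize ||w||_1 over (x, w) subject to ||A x + w − b||_∞ ≤ ε, so that large entries of w flag the outliers. Matrix sizes, ε, and the solver choice are assumptions.

import numpy as np
from scipy.optimize import linprog

def robust_l1_linf(A, b, eps):
    # Variables z = [x (d), w (n), t (n)]; minimize sum(t) with |w_i| <= t_i,
    # subject to -eps <= A x + w - b <= eps. This is a single LP.
    n, d = A.shape
    I, Z, Zd = np.eye(n), np.zeros((n, n)), np.zeros((n, d))
    c = np.concatenate([np.zeros(d + n), np.ones(n)])
    A_ub = np.vstack([
        np.hstack([ A,  I,  Z]),    #  A x + w - b <= eps
        np.hstack([-A, -I,  Z]),    # -(A x + w - b) <= eps
        np.hstack([Zd,  I, -I]),    #  w <= t
        np.hstack([Zd, -I, -I]),    # -w <= t
    ])
    b_ub = np.concatenate([b + eps, eps - b, np.zeros(2 * n)])
    bounds = [(None, None)] * (d + n) + [(0, None)] * n
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:d], res.x[d:d + n]

rng = np.random.default_rng(0)
n, d = 50, 3
A = rng.standard_normal((n, d))
x_true = np.array([1.0, -2.0, 0.5])
b = A @ x_true + 0.01 * rng.standard_normal(n)
b[:5] += 5.0                                    # five gross outliers
x_hat, w_hat = robust_l1_linf(A, b, eps=0.05)
print("estimated x:", np.round(x_hat, 2))
print("flagged outliers:", np.flatnonzero(np.abs(w_hat) > 0.5))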

6 0.63984352 229 nips-2009-Statistical Analysis of Semi-Supervised Learning: The Limit of Infinite Unlabelled Data

7 0.63401479 225 nips-2009-Sparsistent Learning of Varying-coefficient Models with Structural Changes

8 0.63381237 169 nips-2009-Nonlinear Learning using Local Coordinate Coding

9 0.63369596 254 nips-2009-Variational Gaussian-process factor analysis for modeling spatio-temporal data

10 0.63110864 207 nips-2009-Robust Nonparametric Regression with Metric-Space Valued Output

11 0.63074833 20 nips-2009-A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers

12 0.63067126 113 nips-2009-Improving Existing Fault Recovery Policies

13 0.62868583 173 nips-2009-Nonparametric Greedy Algorithms for the Sparse Learning Problem

14 0.62773663 97 nips-2009-Free energy score space

15 0.62729132 94 nips-2009-Fast Learning from Non-i.i.d. Observations

16 0.62627178 61 nips-2009-Convex Relaxation of Mixture Regression with Efficient Algorithms

17 0.62583858 214 nips-2009-Semi-supervised Regression using Hessian energy with an application to semi-supervised dimensionality reduction

18 0.62552458 71 nips-2009-Distribution-Calibrated Hierarchical Classification

19 0.62406677 36 nips-2009-Asymptotic Analysis of MAP Estimation via the Replica Method and Compressed Sensing

20 0.62347466 104 nips-2009-Group Sparse Coding