jmlr jmlr2009 jmlr2009-21 knowledge-graph by maker-knowledge-mining

21 jmlr-2009-Data-driven Calibration of Penalties for Least-Squares Regression


Source: pdf

Author: Sylvain Arlot, Pascal Massart

Abstract: Penalization procedures often suffer from their dependence on multiplying factors, whose optimal values are either unknown or hard to estimate from data. We propose a completely data-driven calibration algorithm for these parameters in the least-squares regression framework, without assuming a particular shape for the penalty. Our algorithm relies on the concept of minimal penalty, recently introduced by Birgé and Massart (2007) in the context of penalized least squares for Gaussian homoscedastic regression. On the positive side, the minimal penalty can be evaluated from the data themselves, leading to a data-driven estimation of an optimal penalty which can be used in practice; on the negative side, their approach heavily relies on the homoscedastic Gaussian nature of their stochastic framework. The purpose of this paper is twofold: stating a more general heuristics for designing a data-driven penalty (the slope heuristics) and proving that it works for penalized least-squares regression with a random design, even for heteroscedastic non-Gaussian data. For technical reasons, some exact mathematical results will be proved only for regressogram bin-width selection. This is at least a first step towards further results, since the approach and the method that we use are indeed general. Keywords: data-driven calibration, non-parametric regression, model selection by penalization, heteroscedastic data, regressogram

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 On the positive side, the minimal penalty can be evaluated from the data themselves, leading to a data-driven estimation of an optimal penalty which can be used in practice; on the negative side, their approach heavily relies on the homoscedastic Gaussian nature of their stochastic framework. [sent-9, score-0.312]

2 The purpose of this paper is twofold: stating a more general heuristics for designing a data-driven penalty (the slope heuristics) and proving that it works for penalized least-squares regression with a random design, even for heteroscedastic non-Gaussian data. [sent-10, score-0.348]

3 4, by proving that twice the minimal penalty has some optimality properties (Theorem 3), we extend the so-called slope heuristics to heteroscedastic regression with a random design. [sent-79, score-0.321]

4 Although the slope heuristics has not been formally validated in these frameworks, this article is a first step towards such a validation, by proving that the slope heuristics can be applied whatever the shape of the ideal penalty. [sent-85, score-0.367]

5 Given a particular set of predictors S_m (called a model), we define the best predictor over S_m as s_m := arg min_{t∈S_m} { Pγ(t) }, with its empirical counterpart ŝ_m := arg min_{t∈S_m} { P_n γ(t) } (when it exists and is unique), where P_n = n^{−1} ∑_{i=1}^n δ_{(X_i, Y_i)}. [sent-106, score-1.392]

6 The model selection problem consists in looking for some data-dependent m̂ ∈ M_n such that ℓ(s, ŝ_m̂) is as small as possible. [sent-110, score-0.674]

7 For instance, it would be convenient to prove some oracle inequality of the form ℓ(s, ŝ_m̂) ≤ C inf_{m∈M_n} { ℓ(s, ŝ_m) } + R_n, in expectation or on an event of large probability, with leading constant C close to 1 and R_n = o(n^{−1}). [sent-111, score-1.46]

8 Let pen : M_n → R+ be some penalty function, possibly data-dependent, and define m̂ ∈ arg min_{m∈M_n} { crit(m) } with crit(m) := P_n γ(ŝ_m) + pen(m). [sent-113, score-0.412]
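
A minimal numerical sketch of this selection rule, to make the criterion concrete: the regressogram models on a regular partition of [0, 1], the simulated data and the Mallows-type penalty below are illustrative assumptions, not the paper's experimental setup.

import numpy as np

def regressogram_risk(x, y, n_bins):
    # Empirical least-squares risk P_n gamma(hat s_m) of the piecewise
    # constant fit on a regular partition of [0, 1] into n_bins pieces.
    bins = np.minimum((x * n_bins).astype(int), n_bins - 1)
    fitted = np.empty_like(y)
    for b in range(n_bins):
        mask = bins == b
        if not mask.any():
            return np.inf   # some I_lambda contains no X_i: model cannot be chosen
        fitted[mask] = y[mask].mean()
    return float(np.mean((y - fitted) ** 2))

def select_model(x, y, models, pen):
    # hat m in argmin over m of { P_n gamma(hat s_m) + pen(m) }
    crits = [regressogram_risk(x, y, m) + pen(m) for m in models]
    return models[int(np.argmin(crits))]

rng = np.random.default_rng(0)
n = 500
x = rng.uniform(size=n)
y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=n)
sigma2 = 0.09   # noise variance, assumed known for this illustration only
m_hat = select_model(x, y, list(range(1, 50)), pen=lambda m: 2 * sigma2 * m / n)
print("selected number of bins:", m_hat)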

9 (2) Since the ideal criterion crit(m) is the true prediction error Pγ(ŝ_m), the ideal penalty is pen_id(m) := Pγ(ŝ_m) − P_n γ(ŝ_m). [sent-114, score-0.896]

10 We will show below, in a general setting, that when pen is a good estimator of the ideal penalty pen_id, then m̂ satisfies an oracle inequality with leading constant C close to 1. [sent-117, score-0.785]

11 For every m ∈ M_n, we define p1(m) = P(γ(ŝ_m) − γ(s_m)), p2(m) = P_n(γ(s_m) − γ(ŝ_m)) and δ(m) = (P_n − P)(γ(s_m)), so that, for every m ∈ M_n, pen_id(m) = p1(m) + p2(m) − δ(m) and ℓ(s, ŝ_m) = P_n γ(ŝ_m) + p1(m) + p2(m) − δ(m) − Pγ(s). [sent-119, score-0.774]

12 ℓ(s, ŝ_m̂) + (pen − pen_id)(m̂) ≤ ℓ(s, ŝ_m) + (pen − pen_id)(m). [sent-120, score-1.484]

13 3 The Slope Heuristics: If the penalty is too big, the left-hand side of (3) is larger than ℓ(s, ŝ_m̂), so that (3) implies an oracle inequality, possibly with a large leading constant C. [sent-123, score-0.844]

14 On the contrary, if the penalty is too small, the left-hand side of (3) may become negligible with respect to ℓ(s, ŝ_m̂) (which would make C explode) or, worse, may be nonpositive. [sent-124, score-0.76]

15 We shall see in the following that ℓ(s, ŝ_m̂) blows up if and only if the penalty is smaller than some "minimal penalty". [sent-126, score-0.76]

16 Then, E[crit(m)] = E[P_n γ(s_m)] = Pγ(s_m), so that m̂ approximately minimizes its bias. [sent-128, score-1.312]

17 Therefore, m̂ is one of the more complex models, and the risk of ŝ_m̂ is large. [sent-129, score-0.679]

18 Hence, the ideal penalty pen_id(m) ≈ p1(m) + p2(m) is close to 2 p2(m). [sent-136, score-0.215]

19 Since p2(m) is a "minimal penalty", the optimal penalty is close to twice the minimal penalty: pen_id(m) ≈ 2 pen_min(m). [sent-137, score-0.265]
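
In display form, the heuristic chain above reads (a restatement of sentences 11, 18 and 19, using p1(m) ≈ p2(m); no new content):

pen_min(m) := E[p2(m)],    pen_id(m) = p1(m) + p2(m) − δ(m) ≈ p1(m) + p2(m) ≈ 2 p2(m),    hence    pen_opt(m) ≈ 2 pen_min(m).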

20 Note that a formal proof of the validity of the slope heuristics has only been given for Gaussian homoscedastic least-squares regression with a fixed design (Birgé and Massart, 2007); to the best of our knowledge, the present paper yields the second theoretical result on the slope heuristics. [sent-139, score-0.314]

21 On the contrary, when the penalty is larger than pen_min, the complexity of m̂ is much smaller. [sent-142, score-0.391]

22 1 The General Algorithm: Assume that the shape pen_shape : M_n → R+ of the ideal penalty is known, from some prior knowledge or because it had first been estimated; see Section 3. [sent-147, score-0.237]

23 3, the following algorithm provides an optimal calibration of the penalty pen_shape. [sent-154, score-0.234]

24 Then, once P_n γ(ŝ_m) and pen_shape(m) are known for every m ∈ M_n, the complexity of Algorithm 1 is O(Card(M_n)²) (see Algorithm 2 and Proposition 1). [sent-165, score-0.742]

25 We start with some notation: ∀m ∈ M_n, f(m) = P_n γ(ŝ_m) and g(m) = pen_shape(m), and, ∀K ≥ 0, m̂(K) ∈ arg min_{m∈M_n} { f(m) + K g(m) }. [sent-171, score-0.766]

26 Algorithm 2 (Step 1 of Algorithm 1): For every m ∈ M_n, define f(m) = P_n γ(ŝ_m) and g(m) = pen_shape(m). [sent-183, score-0.742]

27 The penalty shape is pen_shape(m) = D_m and the dimension threshold is D_thresh = 19 ≈ n/(2 ln(n)). [sent-220, score-0.468]
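
The sketch below implements the dimension-jump step of Algorithm 1 on a grid of K values, rather than the exact O(Card(M_n)²) breakpoint computation of Algorithm 2; the grid, the jump-detection rule and all helper names are assumptions made for illustration.

import numpy as np

def dimension_jump(risks, pen_shape, K_grid):
    # risks[m]     : empirical risk P_n gamma(hat s_m) of model m
    # pen_shape[m] : known penalty shape (e.g., D_m as in the simulation above)
    # Returns (hat K_min, hat m): the grid value of K at which the complexity
    # of hat m(K) = argmin_m { risks[m] + K * pen_shape[m] } drops the most,
    # and the model selected by the final penalty 2 * hat K_min * pen_shape.
    risks = np.asarray(risks, dtype=float)
    shape = np.asarray(pen_shape, dtype=float)
    K_grid = np.asarray(K_grid, dtype=float)
    path = np.array([shape[int(np.argmin(risks + K * shape))] for K in K_grid])
    jumps = path[:-1] - path[1:]        # path is non-increasing in K
    K_min = float(K_grid[int(np.argmax(jumps)) + 1])
    m_hat = int(np.argmin(risks + 2.0 * K_min * shape))
    return K_min, m_hat

With pen_shape(m) = D_m, path records D_{m̂(K)}, and requiring the detected jump to cross a threshold such as n/(2 ln(n)) mirrors the D_thresh rule used in the simulations.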

28 With the same simulated data, we have compared the prediction errors of the two methods by estimating the constant C_or that would appear in some oracle inequality, C_or := E[ℓ(s, ŝ_m̂)] / E[inf_{m∈M_n} { ℓ(s, ŝ_m) }]. [sent-258, score-1.381]

29 In a more general framework, Algorithm 1 allows one to choose a different shape of penalty pen_shape. [sent-289, score-0.448]

30 This can be shown from computations made by Arlot (2008c, Proposition 1) when S_m is assumed to be the vector space of piecewise constant functions on a partition (I_λ)_{λ∈Λ_m} of X: E[pen_id(m)] = E[(P − P_n)γ(ŝ_m)] ≈ (2/n) ∑_{λ∈Λ_m} E[σ(X)² | X ∈ I_λ]. [sent-292, score-0.778]
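
A Monte-Carlo sketch of this penalty shape when the distribution of X and the noise level σ(·) are known (as in the next sentence); the uniform design, the particular σ and the regular partition are made-up examples.

import numpy as np

def pen_shape_from_noise(sigma, edges, n, n_mc=100_000, seed=0):
    # Monte-Carlo approximation of (2/n) * sum over lambda of
    # E[ sigma(X)^2 | X in I_lambda ], for X uniform on [0, 1] and
    # the partition of [0, 1] defined by `edges`.
    rng = np.random.default_rng(seed)
    x = rng.uniform(size=n_mc)
    s2 = sigma(x) ** 2
    labels = np.searchsorted(edges, x, side="right") - 1
    cond_means = [s2[labels == b].mean() for b in range(len(edges) - 1)]
    return 2.0 * float(np.sum(cond_means)) / n

edges = np.linspace(0.0, 1.0, 11)   # 10 regular bins
print(pen_shape_from_noise(lambda x: 0.5 + x, edges, n=1000))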

31 A first way to estimate the shape of the penalty is simply to use (7) to compute pen_shape, when both the distribution of X and the shape of the noise level σ are known. [sent-295, score-0.486]

32 Up to a multiplicative factor (automatically estimated by Algorithm 1), these penalties should estimate correctly E[pen_id(m)] in any framework. [sent-298, score-0.348]

33 In this case, the shape of the penalty pen_shape can for instance be estimated with the global or local Rademacher complexities mentioned in Section 1. [sent-305, score-0.228]

34 First, Theorem 2 provides lower bounds on D_m̂ and on the risk of ŝ_m̂ when the penalty is smaller than pen_min(m) := E[p2(m)]. [sent-318, score-0.815]

35 The empirical risk minimizer ŝ_m on S_m is called a regressogram. [sent-330, score-0.679]

36 In particular, we have s_m = ∑_{λ∈Λ_m} β_λ 1_{I_λ} and ŝ_m = ∑_{λ∈Λ_m} β̂_λ 1_{I_λ}, where β_λ := E_P[Y | X ∈ I_λ], β̂_λ := (1/(n p̂_λ)) ∑_{X_i ∈ I_λ} Y_i and p̂_λ := P_n(X ∈ I_λ). [sent-333, score-1.312]
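
These formulas translate directly into code; a sketch for a partition of [0, 1] given by edges (the partition and the function name are assumptions):

import numpy as np

def regressogram(x, y, edges):
    # Returns (p_hat, beta_hat) with p_hat[l] = hat p_lambda = P_n(X in I_l) and
    # beta_hat[l] = (1 / (n * hat p_lambda)) * sum of Y_i over X_i in I_l,
    # i.e., the bin frequencies and bin means defining hat s_m.
    labels = np.searchsorted(edges, x, side="right") - 1
    n_bins = len(edges) - 1
    p_hat = np.array([(labels == l).mean() for l in range(n_bins)])
    # hat s_m is uniquely defined only if every bin contains at least one X_i
    beta_hat = np.array([y[labels == l].mean() if p_hat[l] > 0 else np.nan
                         for l in range(n_bins)])
    return p_hat, beta_hat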

37 Note that ŝ_m is uniquely defined if and only if each I_λ contains at least one of the X_i. [sent-334, score-0.656]

38 Otherwise, ŝ_m is not uniquely defined and we consider that the model m cannot be chosen. [sent-335, score-0.656]

39 For any penalty function pen : M_n → R+, we define the following model selection procedure: m̂ ∈ arg min_{m∈M_n : min_{λ∈Λ_m}{p̂_λ} > 0} { P_n γ(ŝ_m) + pen(m) }. [sent-345, score-0.43]

40 (Apu) The bias decreases as a power of D_m: there exist some β+, C+ > 0 such that ℓ(s, s_m) ≤ C+ D_m^{−β+}. [sent-352, score-0.656]

41 On the same event, ℓ(s, ŝ_m̂) ≥ ln(n) inf_{m∈M_n} { ℓ(s, ŝ_m) }. [sent-369, score-1.331]

42 3, proving that a minimal amount of penalization is required; when the penalty is smaller, the selected dimension D_m̂ and the quadratic risk ℓ(s, ŝ_m̂) of the final estimator blow up. [sent-372, score-0.932]

43 This coupling is quite interesting, since the dimension D_m̂ is known in practice, contrary to ℓ(s, ŝ_m̂). [sent-373, score-0.691]

44 Moreover, we have a general formulation for the minimal penalty, pen_min(m) := E[P_n(γ(s_m) − γ(ŝ_m))] = E[p2(m)], which can be used in situations where it is not proportional to the dimension D_m of the models (see Section 3. [sent-378, score-0.215]

45 The following result is a formal proof of this link in the framework we consider: penalties close to twice the minimal penalty satisfy an oracle inequality with leading constant approximately equal to one. [sent-400, score-0.33]

46 2 are satisfied together with (Ap) The bias decreases like a power of D_m: there exist β− ≥ β+ > 0 and C+, C− > 0 such that C− D_m^{−β−} ≤ ℓ(s, s_m) ≤ C+ D_m^{−β+}. [sent-402, score-0.656]

47 (11) Then, for every 0 < η < min{β+; 1}/2, there exist a constant K3 and a sequence εn tending to zero at infinity such that, with probability at least 1 − K3 n^{−2}, D_m̂ ≤ n^{1−η} and ℓ(s, ŝ_m̂) ≤ ((1 + δ)/(1 − δ) + εn) inf_{m∈M_n} { ℓ(s, ŝ_m) }, (12) where m̂ is defined by (8). [sent-404, score-1.389]

48 Moreover, we have the oracle inequality E[ℓ(s, ŝ_m̂)] ≤ ((1 + δ)/(1 − δ) + εn) E[inf_{m∈M_n} { ℓ(s, ŝ_m) }] + A² K3 n^{−2}. [sent-405, score-1.419]

49 This theorem shows that twice the minimal penalty pen_min pointed out by Theorem 2 satisfies an oracle inequality with leading constant almost one. [sent-408, score-0.298]

50 The oracle inequality (12) remains valid when the penalty is only close to twice the minimal one. [sent-413, score-0.235]

51 Assuming ℓ(s, s_m) > 0 for every m ∈ M_n is classical for proving the asymptotic optimality of Mallows' C_p (Shibata, 1981; Li, 1987; Birgé and Massart, 2007). [sent-423, score-0.672]

52 Note that if there is a small model close to s, it is hopeless to obtain an oracle inequality with a penalty which estimates pen_id, simply because deviations of pen_id around its expectation would be much larger than the excess loss of the oracle. [sent-437, score-0.546]

53 As in Algorithm 1, let us define m̂(K) ∈ arg min_{m∈M_n} { P_n γ(ŝ_m) + K E[p2(m)] }. [sent-459, score-0.696]

54 Let us now explain why Algorithm 1 is correct, assuming that pen_shape(m) is close to E[p2(m)]. [sent-462, score-0.306]

55 Let γ_n be some empirical criterion (for instance the least-squares criterion as in this paper, or the log-likelihood criterion), let (S_m)_{m∈M_n} be a collection of models and, for every m ∈ M_n, let s_m be some minimizer of t ↦ E[γ_n(t)] over S_m (assuming that such a point exists). [sent-487, score-0.672]

56 Minimizing some penalized criterion γ_n(ŝ_m) + pen(m) over M_n amounts to minimizing b_m − v_m + pen(m), where, ∀m ∈ M_n, b_m = γ_n(s_m) − γ_n(s) and v_m = γ_n(s_m) − γ_n(ŝ_m). [sent-488, score-2.719]

57 The point is that b_m is an unbiased estimator of the bias term ℓ(s, s_m). [sent-489, score-0.687]

58 Having concentration arguments in mind, minimizing b_m − v_m + pen(m) can be conjectured to be approximately equivalent to minimizing ℓ(s, s_m) − E[v_m] + pen(m). [sent-490, score-0.747]

59 Since the purpose of model selection is to minimize the risk E[ℓ(s, ŝ_m)], an ideal penalty would be pen(m) = E[v_m] + E[ℓ(s_m, ŝ_m)]. [sent-491, score-2.164]
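
Collecting sentences 56-59 in display form (same notation, no new content):

γ_n(ŝ_m) + pen(m) = γ_n(s) + b_m − v_m + pen(m),    b_m = γ_n(s_m) − γ_n(s),    v_m = γ_n(s_m) − γ_n(ŝ_m),

and since E[b_m] = ℓ(s, s_m) while E[ℓ(s, ŝ_m)] = ℓ(s, s_m) + E[ℓ(s_m, ŝ_m)], an ideal penalty is pen(m) = E[v_m] + E[ℓ(s_m, ŝ_m)].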

60 In Gaussian least-squares regression with a fixed design, the models S_m are linear and E[v_m] = E[ℓ(s_m, ŝ_m)] is explicitly computable if the noise level is constant and known; this leads to Mallows' C_p penalty. [sent-492, score-1.36]

61 When γ_n is the log-likelihood, E[v_m] ≈ E[ℓ(s_m, ŝ_m)] ≈ D_m/(2n) asymptotically, where D_m stands for the number of parameters defining model S_m; this leads to Akaike's Information Criterion (AIC). [sent-493, score-0.656]

62 This paper goes further in this direction, using the fact that E[v_m] ≈ E[ℓ(s_m, ŝ_m)] remains a valid approximation in a non-asymptotic framework. [sent-495, score-1.338]

63 Since v_m is the minimal penalty, this explains the slope heuristics (Birgé and Massart, 2007) and connects it to Mallows' C_p and Akaike's heuristics. [sent-497, score-0.221]

64 The second main idea developed in this paper is that the minimal penalty can be estimated from the data; Algorithm 1 uses the jump of complexity which occurs around the minimal penalty, as shown in Sections 3. [sent-498, score-0.218]

65 Another way to estimate the minimal penalty, when it is (at least approximately) of the form αD_m, is to estimate α by the slope of the graph of γ_n(ŝ_m) for large enough values of D_m; this method can be extended to other shapes of penalties, simply by replacing D_m by some (known! [sent-502, score-0.897]
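
A sketch of this slope-based estimator: regress the empirical risk γ_n(ŝ_m) on D_m over the largest dimensions and take minus the fitted slope as the estimate of α (the cutoff defining "large enough" is an assumption).

import numpy as np

def estimate_alpha(risks, dims, dim_cutoff):
    # Fit a line to (D_m, gamma_n(hat s_m)) over models with D_m >= dim_cutoff;
    # there the empirical risk decreases roughly linearly in D_m with slope
    # -alpha, so minus the fitted slope estimates alpha.
    dims = np.asarray(dims, dtype=float)
    risks = np.asarray(risks, dtype=float)
    big = dims >= dim_cutoff             # needs at least two such models
    slope, _ = np.polyfit(dims[big], risks[big], deg=1)
    return -slope                        # 2 * alpha * D_m is then the penalty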

66 • ∀I_λ ⊂ X, p_λ := P(X ∈ I_λ) and σ_λ² := E[(Y − s_m(X))² | X ∈ I_λ]. [sent-542, score-0.656]

67 2 Proof of Proposition 1: First, since M_n is finite, the infimum in (4) is attained as soon as G(m_{i−1}) ≠ 0, so that m_i is well defined for every i ≤ i_max. [sent-550, score-0.227]

68 2 are satisfied together with (Ap) The bias decreases like a power of D_m: there exist β− ≥ β+ > 0 and C+, C− > 0 such that C− D_m^{−β−} ≤ ℓ(s, s_m) ≤ C+ D_m^{−β+}. [sent-603, score-0.656]

69 Moreover, we have the oracle inequality E[ℓ(s, ŝ_m̂)] ≤ ((1 + (C1 + C2 − 2)+)/((c1 + c2 − 1) ∧ 1) + εn) E[inf_{m∈M_n} { ℓ(s, ŝ_m) }] + A² K3 n^{−2}. [sent-606, score-1.419]

70 The particular form of condition (22) on the penalty is motivated by the fact that the ideal shape of penalty E[pen_id(m)] (or, equivalently, E[2 p2(m)]) is unknown in general. [sent-609, score-0.357]

71 The rationale behind Theorem 5 is that if pen(m) is close to c1 p1(m) + c2 p2(m), then crit(m) ≈ ℓ(s, s_m) + c1 p1(m) + (c2 − 1) p2(m). [sent-613, score-0.656]

72 When c1 = c2 = 1, this is exactly the ideal criterion ℓ(s, ŝ_m). [sent-614, score-0.681]

73 From (3), we have for each m ∈ M_n such that A_n(m) := min_{λ∈Λ_m} { n p̂_λ } > 0: ℓ(s, ŝ_m̂) + pen(m̂) − pen_id(m̂) ≤ ℓ(s, ŝ_m) + pen(m) − pen_id(m). [sent-621, score-1.484]

74 (25) with pen_id'(m) := p1(m) + p2(m) − δ'(m) = pen_id(m) + (P_n − P)γ(s) and δ'(m) := (P_n − P)(γ(s_m) − γ(s)). [sent-622, score-0.742]

75 It is sufficient to control pen − pen_id' for every m ∈ M_n. [sent-623, score-0.37]

76-77 Combined with (25), this gives: if n ≥ L_{(SH5)}, then, on the event where ln(n)^5 ≤ D_m̂ ≤ L_{c_X,r} n ln(n)^{−1}, ℓ(s, ŝ_m̂) ≤ ((1 + (C1 + C2 − 2)+)/((c1 + c2 − 1) ∧ 1) + L_{(SH5)} ln(n)^{−1/4}) inf { ℓ(s, ŝ_m) : m ∈ M_n, ln(n)^7 ≤ D_m ≤ L_{α_X,M,c,r} n ln(n)^{−1} }. [sent-631, score-0.656] [sent-633, score-0.675]

78 We now use Lemmas 6 and 7 below to control, on Ω_n, the dimensions of the selected model m̂ and of the oracle model m* ∈ arg min_{m∈M_n} { ℓ(s, ŝ_m) }. [sent-634, score-0.746]

79 1 Classical Oracle Inequality: Since (23) holds true on Ω_n, E[ℓ(s, ŝ_m̂)] = E[ℓ(s, ŝ_m̂) 1_{Ω_n}] + E[ℓ(s, ŝ_m̂) 1_{Ω_n^c}] ≤ [2η − 1 + εn] E[inf_{m∈M_n} { ℓ(s, ŝ_m) }] + A² K3 P(Ω_n^c), which proves (24). [sent-639, score-2.643]

80 Lemma 7 (Control on the dimension of the oracle model): Define the oracle model m* ∈ arg min_{m∈M_n} { ℓ(s, ŝ_m) }. [sent-642, score-0.835]

81 It thus also minimizes crit'(m) = crit(m) − P_n γ(s) = ℓ(s, s_m) − p2(m) + δ(m) + pen(m) over M_n. [sent-648, score-0.774]

82 We then have ℓ(s, s_m) ≥ C− (ln(n))^{−7β−} from (Ap), pen(m) ≥ 0, p2(m) ≤ L_{(SH5)} D_m ln(n)/n + L_{(SH5)} ln(n)/n from (29), and, from (28) (in Proposition 8), δ(m) ≥ −L_A (√(ℓ(s, s_m) ln(n)/n) + ln(n)/n). [sent-651, score-1.312]

83 3 Proof of Lemma 7: Recall that m* minimizes ℓ(s, ŝ_m) = ℓ(s, s_m) + p1(m) over m ∈ M_n, with the convention ℓ(s, ŝ_m) = ∞ if A_n(m) = 0. [sent-665, score-1.968]

84 Lower bound on ℓ(s, ŝ_m) for small models: let m ∈ M_n be such that D_m < (ln(n))^7. [sent-667, score-0.656]

85 From (Ap), we have ℓ(s, ŝ_m) ≥ ℓ(s, s_m) ≥ C− (ln(n))^{−7β−}. [sent-668, score-1.312]

86 Lower bound on ℓ(s, ŝ_m) for large models: let m ∈ M_n be such that D_m > n^{1/2+α}. [sent-670, score-0.656]

87 From (33), for n ≥ L_{(SH5),α}, p1(m) ≥ (1/2 − (γ + 1)(c_{X,r} ln(n))^{−1} − L_{(SH5),α} n^{−1/4}) E[p2(m)], so that ℓ(s, ŝ_m) ≥ p1(m) ≥ L_{(SH5),α} n^{−1/2+α}. [sent-671, score-0.656]

88 There exists a better model for ℓ(s, ŝ_m): let m0 ∈ M_n be as in the proof of Lemma 6 and assume that n ≥ L_{c_rich, α}. [sent-673, score-0.656]

89 1 Lower bound on D_m̂: By definition, m̂ minimizes crit'(m) = crit(m) − P_n γ(s) = ℓ(s, s_m) − p2(m) + δ(m) + pen(m) over m ∈ M_n such that A_n(m) ≥ 1. [sent-680, score-0.774]

90 Then, ℓ(s, s_m) + pen(m) ≥ 0 and, from (26), ... ln(n). [sent-685, score-0.656]

91 First, for every model m ∈ M_n such that A_n(m) ≥ 1 and D_m ≥ K2 n ln(n)^{−1}, we have ℓ(s, ŝ_m) ≥ p1(m) ≥ L_{(SH2)} K2 ln(n)^{−2} by (33). [sent-700, score-0.672]

92 Then for all x ≥ 0, on an event of probability at least 1 − 2e^{−x}, δ(m) ≤ η ℓ(s, s_m) + (4/η + 8/3) A² x / n. [sent-707, score-0.682]

93 (26) If moreover Q_m^{(p)} := n E[p2(m)]/D_m > 0, (27) then on the same event, δ(m) ≤ ℓ(s, s_m)/√(Q_m^{(p)}) + 20 A² √(E[p2(m)] x). [sent-708, score-0.656]

94 Then, for every x ≥ 0, there exists an event of probability at least 1 − e^{1−x} on which, for every θ ∈ (0; 1), |p2(m) − E[p2(m)]| ≤ L (θ ℓ(s, s_m) + √(A² D_m x)/n + A² x/(θn)) (29) for some absolute constant L. [sent-713, score-0.714]

95 ..., we have on the same event: |p2(m) − E[p2(m)]| ≤ (L/√(D_m)) (ℓ(s, s_m) + A² E[p2(m)]/σ_min²) (√x + x). [sent-716, score-0.675]

96 4 We obtain that, with probability at least 1 − 2e^{−x}, δ(m) ≤ √(2vx) + c with 2vx = 16 A² ℓ(s, s_m) x/n and c = 8 A² x/(3n), and (26) follows since 2√(ab) ≤ aη + bη^{−1} for all η > 0. [sent-743, score-0.656]

97 Then, there exists some absolute constant L such that, for every real number q ≥ 2, one has ‖p2(m) − E[p2(m)]‖_q ≤ (L/√n) (√(2q (ℓ(s, s_m) ∨ ε_{·,m}²)) + q/√n). [sent-755, score-0.672]

98 2 Computation of P(γ(t,·) − γ(s_m,·)) for some general t ∈ S_m: P(γ(t,·) − γ(s_m,·)) = E[(t(X) − Y)² − (s_m(X) − Y)²] = E[(t(X) − s_m(X))²] + 2 E[(t(X) − s_m(X))(s_m(X) − s(X))] = E[(t(X) − s_m(X))²] = ∑_{λ∈Λ_m} p_λ (t_λ − β_λ)², since for every λ ∈ Λ_m, E[s(X) | X ∈ I_λ] = β_λ. [sent-773, score-1.984]
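
Under the empirical measure P_n (with β̂_λ and p̂_λ in place of β_λ and p_λ) the same identity holds exactly, because the bin means make the cross term vanish; a quick numerical check, with arbitrary simulation settings:

import numpy as np

rng = np.random.default_rng(1)
n_mc, n_bins = 100_000, 5
x = rng.uniform(size=n_mc)
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=n_mc)
labels = np.minimum((x * n_bins).astype(int), n_bins - 1)

p = np.array([(labels == b).mean() for b in range(n_bins)])      # hat p_lambda
beta = np.array([y[labels == b].mean() for b in range(n_bins)])  # hat beta_lambda
t = beta + rng.normal(size=n_bins)        # an arbitrary element t of S_m

lhs = np.mean((t[labels] - y) ** 2) - np.mean((beta[labels] - y) ** 2)
rhs = float(np.sum(p * (t - beta) ** 2))
print(lhs, rhs)   # equal up to floating-point error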

99 3 Computation of P_n(γ(t,·) − γ(u,·)) for some general t ∈ S_m: with η_i = Y_i − u(X_i), we have P_n(γ(t,·) − γ(u,·)) = (1/n) ∑_{i=1}^n [(t(X_i) − Y_i)² − (u(X_i) − Y_i)²] = (1/n) ∑_{i=1}^n (t(X_i) − u(X_i))² − (2/n) ∑_{i=1}^n (t(X_i) − u(X_i)) η_i = ∑_{λ∈Λ_m} p̂_λ (t_λ − u_λ)² − (2/n) ∑_{i=1}^n (t(X_i) − u(X_i)) η_i. [sent-776, score-0.656]

100 As already noticed, we have to multiply this bound by 2 so that it is valid for every u ∈ S_m and not only u = ŝ_m. [sent-790, score-0.672]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('sm', 0.656), ('dm', 0.341), ('pen', 0.268), ('mn', 0.263), ('kmin', 0.166), ('ln', 0.166), ('arlot', 0.14), ('massart', 0.123), ('crit', 0.118), ('mi', 0.115), ('birg', 0.113), ('penalty', 0.104), ('enalties', 0.097), ('slope', 0.094), ('alibration', 0.091), ('assart', 0.091), ('rlot', 0.091), ('pn', 0.089), ('penid', 0.086), ('imax', 0.08), ('penalties', 0.08), ('ki', 0.078), ('penshape', 0.07), ('penalization', 0.069), ('oracle', 0.069), ('mallows', 0.068), ('calibration', 0.06), ('heuristics', 0.058), ('heteroscedastic', 0.055), ('driven', 0.048), ('homoscedastic', 0.046), ('kg', 0.045), ('ap', 0.043), ('crich', 0.043), ('minimal', 0.043), ('card', 0.043), ('shape', 0.038), ('dthresh', 0.038), ('sylvain', 0.038), ('piecewise', 0.036), ('arx', 0.033), ('penmin', 0.032), ('regressogram', 0.032), ('proposition', 0.032), ('bn', 0.028), ('jump', 0.028), ('resampling', 0.028), ('lcx', 0.027), ('vm', 0.026), ('event', 0.026), ('asymptotically', 0.026), ('rademacher', 0.025), ('akaike', 0.025), ('ideal', 0.025), ('concentration', 0.025), ('ke', 0.024), ('risk', 0.023), ('cor', 0.023), ('cn', 0.023), ('jumps', 0.023), ('tending', 0.023), ('np', 0.022), ('la', 0.022), ('aic', 0.022), ('regression', 0.022), ('apu', 0.021), ('csisz', 0.021), ('lbn', 0.021), ('occured', 0.021), ('xi', 0.021), ('arg', 0.021), ('pascal', 0.021), ('proved', 0.02), ('dimension', 0.02), ('inf', 0.019), ('inequality', 0.019), ('min', 0.019), ('boucheron', 0.018), ('selection', 0.018), ('estimator', 0.017), ('frameworks', 0.016), ('complexities', 0.016), ('baraud', 0.016), ('kdm', 0.016), ('lcrich', 0.016), ('lebarbier', 0.016), ('lucien', 0.016), ('minm', 0.016), ('regressograms', 0.016), ('yannick', 0.016), ('theorem', 0.016), ('soon', 0.016), ('every', 0.016), ('contrary', 0.015), ('penalized', 0.015), ('leading', 0.015), ('paris', 0.015), ('gaussian', 0.014), ('inequalities', 0.014), ('bm', 0.014)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9999997 21 jmlr-2009-Data-driven Calibration of Penalties for Least-Squares Regression


2 0.069819942 57 jmlr-2009-Multi-task Reinforcement Learning in Partially Observable Stochastic Environments

Author: Hui Li, Xuejun Liao, Lawrence Carin

Abstract: We consider the problem of multi-task reinforcement learning (MTRL) in multiple partially observable stochastic environments. We introduce the regionalized policy representation (RPR) to characterize the agent’s behavior in each environment. The RPR is a parametric model of the conditional distribution over current actions given the history of past actions and observations; the agent’s choice of actions is directly based on this conditional distribution, without an intervening model to characterize the environment itself. We propose off-policy batch algorithms to learn the parameters of the RPRs, using episodic data collected when following a behavior policy, and show their linkage to policy iteration. We employ the Dirichlet process as a nonparametric prior over the RPRs across multiple environments. The intrinsic clustering property of the Dirichlet process imposes sharing of episodes among similar environments, which effectively reduces the number of episodes required for learning a good policy in each environment, when data sharing is appropriate. The number of distinct RPRs and the associated clusters (the sharing patterns) are automatically discovered by exploiting the episodic data as well as the nonparametric nature of the Dirichlet process. We demonstrate the effectiveness of the proposed RPR as well as the RPR-based MTRL framework on various problems, including grid-world navigation and multi-aspect target classification. The experimental results show that the RPR is a competitive reinforcement learning algorithm in partially observable domains, and the MTRL consistently achieves better performance than single task reinforcement learning. Keywords: reinforcement learning, partially observable Markov decision processes, multi-task learning, Dirichlet processes, regionalized policy representation

3 0.053485122 79 jmlr-2009-Reinforcement Learning in Finite MDPs: PAC Analysis

Author: Alexander L. Strehl, Lihong Li, Michael L. Littman

Abstract: We study the problem of learning near-optimal behavior in finite Markov Decision Processes (MDPs) with a polynomial number of samples. These “PAC-MDP” algorithms include the well-known E3 and R-MAX algorithms as well as the more recent Delayed Q-learning algorithm. We summarize the current state-of-the-art by presenting bounds for the problem in a unified theoretical framework. A more refined analysis for upper and lower bounds is presented to yield insight into the differences between the model-free Delayed Q-learning and the model-based R-MAX. Keywords: reinforcement learning, Markov decision processes, PAC-MDP, exploration, sample complexity

4 0.045343906 94 jmlr-2009-The Nonparanormal: Semiparametric Estimation of High Dimensional Undirected Graphs

Author: Han Liu, John Lafferty, Larry Wasserman

Abstract: Recent methods for estimating sparse undirected graphs for real-valued data in high dimensional problems rely heavily on the assumption of normality. We show how to use a semiparametric Gaussian copula—or “nonparanormal”—for high dimensional inference. Just as additive models extend linear models by replacing linear functions with a set of one-dimensional smooth functions, the nonparanormal extends the normal by transforming the variables by smooth functions. We derive a method for estimating the nonparanormal, study the method’s theoretical properties, and show that it works well in many examples. Keywords: graphical models, Gaussian copula, high dimensional inference, sparsity, ℓ1 regularization, graphical lasso, paranormal, occult

5 0.037665699 73 jmlr-2009-Prediction With Expert Advice For The Brier Game

Author: Vladimir Vovk, Fedor Zhdanov

Abstract: We show that the Brier game of prediction is mixable and find the optimal learning rate and substitution function for it. The resulting prediction algorithm is applied to predict results of football and tennis matches, with well-known bookmakers playing the role of experts. The theoretical performance guarantee is not excessively loose on the football data set and is rather tight on the tennis data set. Keywords: Brier game, classification, on-line prediction, strong aggregating algorithm, weighted average algorithm

6 0.034758486 66 jmlr-2009-On the Consistency of Feature Selection using Greedy Least Squares Regression

7 0.033210669 40 jmlr-2009-Identification of Recurrent Neural Networks by Bayesian Interrogation Techniques

8 0.031519197 2 jmlr-2009-A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization

9 0.030981591 52 jmlr-2009-Margin-based Ranking and an Equivalence between AdaBoost and RankBoost

10 0.029434899 49 jmlr-2009-Learning Permutations with Exponential Weights

11 0.025467385 95 jmlr-2009-The P-Norm Push: A Simple Convex Ranking Algorithm that Concentrates at the Top of the List

12 0.024000674 30 jmlr-2009-Estimation of Sparse Binary Pairwise Markov Networks using Pseudo-likelihoods

13 0.023931403 24 jmlr-2009-Distance Metric Learning for Large Margin Nearest Neighbor Classification

14 0.023090227 67 jmlr-2009-Online Learning with Sample Path Constraints

15 0.022258166 32 jmlr-2009-Exploiting Product Distributions to Identify Relevant Variables of Correlation Immune Functions

16 0.021619283 88 jmlr-2009-Stable and Efficient Gaussian Process Calculations

17 0.020947866 83 jmlr-2009-SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent

18 0.020734446 28 jmlr-2009-Entropy Inference and the James-Stein Estimator, with Application to Nonlinear Gene Association Networks

19 0.020261303 97 jmlr-2009-Ultrahigh Dimensional Feature Selection: Beyond The Linear Model

20 0.020054214 29 jmlr-2009-Estimating Labels from Label Proportions


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.107), (1, -0.056), (2, -0.011), (3, 0.009), (4, -0.042), (5, 0.058), (6, -0.064), (7, 0.033), (8, -0.077), (9, -0.008), (10, 0.067), (11, -0.037), (12, -0.095), (13, 0.04), (14, 0.048), (15, 0.077), (16, -0.125), (17, -0.048), (18, -0.023), (19, 0.0), (20, 0.005), (21, 0.018), (22, -0.077), (23, 0.034), (24, 0.164), (25, 0.193), (26, 0.051), (27, 0.375), (28, 0.05), (29, -0.233), (30, -0.096), (31, 0.43), (32, -0.107), (33, -0.043), (34, -0.253), (35, 0.135), (36, -0.132), (37, 0.064), (38, -0.028), (39, -0.092), (40, 0.008), (41, 0.113), (42, 0.057), (43, -0.048), (44, 0.058), (45, -0.1), (46, 0.043), (47, -0.141), (48, 0.042), (49, 0.206)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97642326 21 jmlr-2009-Data-driven Calibration of Penalties for Least-Squares Regression


2 0.38528946 57 jmlr-2009-Multi-task Reinforcement Learning in Partially Observable Stochastic Environments


3 0.24465448 94 jmlr-2009-The Nonparanormal: Semiparametric Estimation of High Dimensional Undirected Graphs


4 0.22285879 79 jmlr-2009-Reinforcement Learning in Finite MDPs: PAC Analysis


5 0.20067523 66 jmlr-2009-On the Consistency of Feature Selection using Greedy Least Squares Regression

Author: Tong Zhang

Abstract: This paper studies the feature selection problem using a greedy least squares regression algorithm. We show that under a certain irrepresentable condition on the design matrix (but independent of the sparse target), the greedy algorithm can select features consistently when the sample size approaches infinity. The condition is identical to a corresponding condition for Lasso. Moreover, under a sparse eigenvalue condition, the greedy algorithm can reliably identify features as long as each nonzero coefficient is larger than a constant times the noise level. In comparison, Lasso may require the coefficients to be larger than O(√s) times the noise level in the worst case, where s is the number of nonzero coefficients. Keywords: greedy algorithm, feature selection, sparsity

6 0.19112203 73 jmlr-2009-Prediction With Expert Advice For The Brier Game

7 0.15853104 42 jmlr-2009-Incorporating Functional Knowledge in Neural Networks

8 0.15133084 88 jmlr-2009-Stable and Efficient Gaussian Process Calculations

9 0.14668056 8 jmlr-2009-An Anticorrelation Kernel for Subsystem Training in Multiple Classifier Systems

10 0.14653359 2 jmlr-2009-A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization

11 0.14596327 32 jmlr-2009-Exploiting Product Distributions to Identify Relevant Variables of Correlation Immune Functions

12 0.1435678 28 jmlr-2009-Entropy Inference and the James-Stein Estimator, with Application to Nonlinear Gene Association Networks

13 0.13903788 40 jmlr-2009-Identification of Recurrent Neural Networks by Bayesian Interrogation Techniques

14 0.11927227 30 jmlr-2009-Estimation of Sparse Binary Pairwise Markov Networks using Pseudo-likelihoods

15 0.11510398 15 jmlr-2009-Cautious Collective Classification

16 0.10900602 98 jmlr-2009-Universal Kernel-Based Learning with Applications to Regular Languages    (Special Topic on Mining and Learning with Graphs and Relations)

17 0.10877415 95 jmlr-2009-The P-Norm Push: A Simple Convex Ranking Algorithm that Concentrates at the Top of the List

18 0.10796164 49 jmlr-2009-Learning Permutations with Exponential Weights

19 0.10503811 52 jmlr-2009-Margin-based Ranking and an Equivalence between AdaBoost and RankBoost

20 0.10097134 6 jmlr-2009-An Algorithm for Reading Dependencies from the Minimal Undirected Independence Map of a Graphoid that Satisfies Weak Transitivity


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(8, 0.532), (11, 0.012), (19, 0.029), (21, 0.014), (26, 0.011), (38, 0.024), (47, 0.017), (52, 0.028), (55, 0.052), (58, 0.019), (66, 0.088), (68, 0.017), (90, 0.04), (96, 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.95250279 56 jmlr-2009-Model Monitor (M2): Evaluating, Comparing, and Monitoring Models    (Machine Learning Open Source Software Paper)

Author: Troy Raeder, Nitesh V. Chawla

Abstract: This paper presents Model Monitor (M2), a Java toolkit for robustly evaluating machine learning algorithms in the presence of changing data distributions. M2 provides a simple and intuitive framework in which users can evaluate classifiers under hypothesized shifts in distribution and therefore determine the best model (or models) for their data under a number of potential scenarios. Additionally, M2 is fully integrated with the WEKA machine learning environment, so that a variety of commodity classifiers can be used if desired. Keywords: machine learning, open-source software, distribution shift, scenario analysis

same-paper 2 0.78457004 21 jmlr-2009-Data-driven Calibration of Penalties for Least-Squares Regression


3 0.34947833 5 jmlr-2009-Adaptive False Discovery Rate Control under Independence and Dependence

Author: Gilles Blanchard, Étienne Roquain

Abstract: In the context of multiple hypothesis testing, the proportion π0 of true null hypotheses in the pool of hypotheses to test often plays a crucial role, although it is generally unknown a priori. A testing procedure using an implicit or explicit estimate of this quantity in order to improve its efficiency is called adaptive. In this paper, we focus on the issue of false discovery rate (FDR) control and we present new adaptive multiple testing procedures with control of the FDR. In a first part, assuming independence of the p-values, we present two new procedures and give a unified review of other existing adaptive procedures that have provably controlled FDR. We report extensive simulation results comparing these procedures and testing their robustness when the independence assumption is violated. The new proposed procedures appear competitive with existing ones. The overall best, though, is reported to be Storey’s estimator, albeit for a specific parameter setting that does not appear to have been considered before. In a second part, we propose adaptive versions of step-up procedures that have provably controlled FDR under positive dependence and unspecified dependence of the p-values, respectively. In the latter case, while simulations only show an improvement over non-adaptive procedures in limited situations, these are to our knowledge among the first theoretically founded adaptive multiple testing procedures that control the FDR when the p-values are not independent. Keywords: multiple testing, false discovery rate, adaptive procedure, positive regression dependence, p-values

4 0.3327041 48 jmlr-2009-Learning Nondeterministic Classifiers

Author: Juan José del Coz, Jorge Díez, Antonio Bahamonde

Abstract: Nondeterministic classifiers are defined as those allowed to predict more than one class for some entries from an input space. Given that the true class should be included in predictions and the number of classes predicted should be as small as possible, this kind of classifier can be considered as an Information Retrieval (IR) procedure. In this paper, we propose a family of IR loss functions to measure the performance of nondeterministic learners. After discussing such measures, we derive an algorithm for learning optimal nondeterministic hypotheses. Given an entry from the input space, the algorithm requires the posterior probabilities to compute the subset of classes with the lowest expected loss. From a general point of view, nondeterministic classifiers provide an improvement in the proportion of predictions that include the true class compared to their deterministic counterparts; the price to be paid for this increase is usually a tiny proportion of predictions with more than one class. The paper includes an extensive experimental study using three deterministic learners to estimate posterior probabilities: a multiclass Support Vector Machine (SVM), a Logistic Regression, and a Naïve Bayes. The data sets considered comprise both UCI multi-class learning tasks and microarray expressions of different kinds of cancer. We successfully compare nondeterministic classifiers with other alternative approaches. Additionally, we shall see how the quality of posterior probabilities (measured by the Brier score) determines the goodness of nondeterministic predictions. Keywords: nondeterministic, multiclassification, reject option, multi-label classification, posterior probabilities

5 0.33033913 12 jmlr-2009-Bi-Level Path Following for Cross Validated Solution of Kernel Quantile Regression

Author: Saharon Rosset

Abstract: We show how to follow the path of cross validated solutions to families of regularized optimization problems, defined by a combination of a parameterized loss function and a regularization term. A primary example is kernel quantile regression, where the parameter of the loss function is the quantile being estimated. Even though the bi-level optimization problem we encounter for every quantile is non-convex, the manner in which the optimal cross-validated solution evolves with the parameter of the loss function allows tracking of this solution. We prove this property, construct the resulting algorithm, and demonstrate it on real and artificial data. This algorithm allows us to efficiently solve the whole family of bi-level problems. We show how it can be extended to cover other modeling problems, like support vector regression, and alternative in-sample model selection approaches.

6 0.32624045 4 jmlr-2009-A Survey of Accuracy Evaluation Metrics of Recommendation Tasks

7 0.32449639 43 jmlr-2009-Java-ML: A Machine Learning Library    (Machine Learning Open Source Software Paper)

8 0.32139668 19 jmlr-2009-Controlling the False Discovery Rate of the Association Causality Structure Learned with the PC Algorithm    (Special Topic on Mining and Learning with Graphs and Relations)

9 0.31643337 82 jmlr-2009-Robustness and Regularization of Support Vector Machines

10 0.31087786 76 jmlr-2009-Python Environment for Bayesian Learning: Inferring the Structure of Bayesian Networks from Knowledge and Data    (Machine Learning Open Source Software Paper)

11 0.3102507 26 jmlr-2009-Dlib-ml: A Machine Learning Toolkit    (Machine Learning Open Source Software Paper)

12 0.30753115 62 jmlr-2009-Nonlinear Models Using Dirichlet Process Mixtures

13 0.30689511 94 jmlr-2009-The Nonparanormal: Semiparametric Estimation of High Dimensional Undirected Graphs

14 0.29976693 85 jmlr-2009-Settable Systems: An Extension of Pearl's Causal Model with Optimization, Equilibrium, and Learning

15 0.29540354 70 jmlr-2009-Particle Swarm Model Selection    (Special Topic on Model Selection)

16 0.29316065 91 jmlr-2009-Subgroup Analysis via Recursive Partitioning

17 0.2925981 57 jmlr-2009-Multi-task Reinforcement Learning in Partially Observable Stochastic Environments

18 0.29178554 99 jmlr-2009-Using Local Dependencies within Batches to Improve Large Margin Classifiers

19 0.29098472 32 jmlr-2009-Exploiting Product Distributions to Identify Relevant Variables of Correlation Immune Functions

20 0.28894576 10 jmlr-2009-Application of Non Parametric Empirical Bayes Estimation to High Dimensional Classification