nips nips2007 nips2007-101 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Andreas Christmann, Ingo Steinwart
Abstract: We investigate quantile regression based on the pinball loss and the ǫ-insensitive loss. For the pinball loss a condition on the data-generating distribution P is given that ensures that the conditional quantiles are approximated with respect to ‖·‖_1. This result is then used to derive an oracle inequality for an SVM based on the pinball loss. Moreover, we show that SVMs based on the ǫ-insensitive loss estimate the conditional median only under certain conditions on P.
Reference: text
sentIndex sentText sentNum sentScore
1 How SVMs can estimate quantiles and the median Ingo Steinwart Information Sciences Group CCS-3 Los Alamos National Laboratory Los Alamos, NM 87545, USA ingo@lanl. [sent-1, score-0.099]
2 Abstract We investigate quantile regression based on the pinball loss and the ǫ-insensitive loss. [sent-5, score-0.351]
3 For the pinball loss a condition on the data-generating distribution P is given that ensures that the conditional quantiles are approximated with respect to ‖·‖_1. [sent-6, score-0.28]
4 This result is then used to derive an oracle inequality for an SVM based on the pinball loss. [sent-7, score-0.248]
5 Moreover, we show that SVMs based on the ǫ-insensitive loss estimate the conditional median only under certain conditions on P . [sent-8, score-0.16]
6 The goal of quantile regression is to estimate the conditional quantile, i. [sent-10, score-0.18]
7 Let us now consider the so-called τ-pinball loss L_τ : R × R → [0, ∞) defined by L_τ(y, t) := ψ_τ(y − t), where ψ_τ(r) = (τ − 1)r if r < 0 and ψ_τ(r) = τr if r ≥ 0. [sent-16, score-0.089]
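The pinball loss is simple to implement. As a minimal illustration (our Python sketch, not from the paper; all names are ours), the following evaluates ψ_τ and checks numerically that minimizing the empirical pinball risk over constant predictions recovers the sample τ-quantile:

```python
import numpy as np

def pinball_loss(y, t, tau):
    """L_tau(y, t) = psi_tau(y - t) with psi_tau(r) = (tau - 1) r for r < 0
    and psi_tau(r) = tau * r for r >= 0."""
    r = np.asarray(y) - t
    return np.where(r < 0, (tau - 1.0) * r, tau * r)

rng = np.random.default_rng(0)
y = rng.standard_normal(10_000)
tau = 0.8

# Minimize the empirical pinball risk over constant predictions t on a grid.
grid = np.linspace(-3.0, 3.0, 2001)
risks = np.array([pinball_loss(y, t, tau).mean() for t in grid])
print(grid[np.argmin(risks)], np.quantile(y, tau))  # both near the 0.8-quantile of N(0,1)
```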
8 R_{L_τ,P}(f^*_{τ,P}) = inf R_{L_τ,P}(f) =: R^*_{L_τ,P}, where the infimum is taken over all f : X → R. [sent-20, score-0.106]
9 Empirical methods estimating quantiles with the help of the pinball loss typically obtain functions f_D for which R_{L_τ,P}(f_D) is close to R^*_{L_τ,P} with high probability. [sent-31, score-0.225]
10 18]), and hence there is so far little justification for using f_D as an estimate of the quantile function. [sent-33, score-0.169]
11 Our goal is to address this issue by showing that under certain realistic assumptions on P we have an inequality of the form ‖f − f^*_{τ,P}‖_{L_1(P_X)} ≤ c_P ( R_{L_τ,P}(f) − R^*_{L_τ,P} )^{1/2}. (2) [sent-34, score-0.067]
12 We then use this inequality to establish an oracle inequality for SVMs defined by (1). [sent-35, score-0.234]
13 In addition, we illustrate how this oracle inequality can be used to obtain learning rates and to justify a data-dependent method for finding the hyper-parameters λ and H. [sent-36, score-0.131]
14 Finally, we generalize the methods for establishing (2) to investigate the role of ǫ in the ǫ-insensitive loss used in standard SVM regression. [sent-37, score-0.089]
15 Given a distribution P on X × Y we further assume throughout this paper that the σ-algebra on X is complete with respect to the marginal distribution PX of P, i. [sent-39, score-0.081]
16 Definition 2.1. A distribution Q on R is said to have a τ-quantile of type α > 0 if there exist a τ-quantile t^* ∈ R and a constant c_Q > 0 such that for all s ∈ [0, α] we have Q((t^*, t^* + s)) ≥ c_Q s and Q((t^* − s, t^*)) ≥ c_Q s. (3) [sent-44, score-0.096]
17 It is not difficult to see that a distribution Q having a τ-quantile of some type α has a unique τ-quantile t^*. [sent-45, score-0.241]
18 Moreover, if Q has a Lebesgue density h_Q then Q has a τ-quantile of type α if h_Q is bounded away from zero on [t^* − α, t^* + α], since we can use c_Q := inf{ h_Q(t) : t ∈ [t^* − α, t^* + α] } in (3). [sent-46, score-0.076]
19 This assumption is general enough to cover many distributions used in parametric statistics such as Gaussian, Student’s t, and logistic distributions (with Y = R), Gamma and log-normal distributions (with Y = [0, ∞)), and uniform and Beta distributions (with Y = [0, 1]). [sent-47, score-0.108]
20 The following definition describes distributions on X × Y whose conditional distributions P(·|x), x ∈ X, have a τ-quantile of the same type α. [sent-48, score-0.165]
21 Definition 2.2. A distribution P on X × Y is said to have a τ-quantile of p-average type α if Q_x := P(·|x) has P_X-almost surely a τ-quantile of type α and the function b : X → (0, ∞) defined by b(x) := c_{P(·|x)}, where c_{P(·|x)} is the constant in (3), satisfies b^{−1} ∈ L_p(P_X). [sent-51, score-0.201]
22 Let us now give some examples of distributions having τ-quantiles of p-average type α. [sent-52, score-0.125]
23 Example 2.3. Let P be a distribution on X × R with marginal distribution P_X and regular conditional probability Q_x((−∞, y]) := 1/(1 + e^{−z}), y ∈ R, where z := (y − m(x))/σ(x), m : X → R describes a location shift, and σ : X → [β, 1/β] describes a scale modification for some constant β ∈ (0, 1]. [sent-54, score-0.13]
24 Some calculations show b(x) = min{ h_{Q_x}(t^*(x) − α), h_{Q_x}(t^*(x) + α) } = min{ u_1(x)/(1 + u_1(x))², u_2(x)/(1 + u_2(x))² } ∈ [c_{α,β}, 1/4], where u_1(x) := ((1 − τ)/τ) e^{−α/σ(x)}, u_2(x) := ((1 − τ)/τ) e^{α/σ(x)}, and c_{α,β} > 0 can be chosen independently of x because σ(x) ∈ [β, 1/β]. [sent-59, score-0.05]
25 Hence b−1 ∈ L∞ (PX ) and P has a τ -quantile of ∞-average type α. [sent-60, score-0.076]
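The claimed expression for b(x) can be sanity-checked numerically. The sketch below (our code, with illustrative values for τ, α, m(x), σ(x)) differentiates the conditional logistic cdf and compares it with the u_1, u_2 expression; note that our density carries an explicit 1/σ(x) factor, which only shifts the constant c_{α,β} since σ(x) ∈ [β, 1/β]:

```python
import numpy as np

tau, alpha = 0.7, 0.5          # illustrative values (ours)
m_x, sigma_x = 1.3, 0.8        # location shift m(x) and scale sigma(x) in [beta, 1/beta]

def cdf(y):                    # Q_x((-inf, y]) = 1 / (1 + e^{-z}), z = (y - m(x)) / sigma(x)
    return 1.0 / (1.0 + np.exp(-(y - m_x) / sigma_x))

def density(y, h=1e-6):        # numerical derivative of the conditional cdf
    return (cdf(y + h) - cdf(y - h)) / (2.0 * h)

t_star = m_x + sigma_x * np.log(tau / (1.0 - tau))   # tau-quantile: cdf(t_star) = tau

u1 = (1.0 - tau) / tau * np.exp(-alpha / sigma_x)
u2 = (1.0 - tau) / tau * np.exp(alpha / sigma_x)

# h_{Q_x}(t* + alpha) = u1 / (sigma(x) (1 + u1)^2), and analogously with u2 at t* - alpha.
print(density(t_star + alpha), u1 / (sigma_x * (1.0 + u1) ** 2))
print(density(t_star - alpha), u2 / (sigma_x * (1.0 + u2) ** 2))
```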
26 Example 2.4. Let P̃ be a distribution on X × Y with marginal distribution P_X and regular conditional probability Q̃_x := P̃(·|x) on Y. [sent-62, score-0.095]
27 Furthermore, assume that Q̃_x has P_X-almost surely a τ-quantile of type α. [sent-63, score-0.123]
28 Let us now consider the family of distributions P with marginal distribution P_X and regular conditional distributions Q_x(A) := Q̃_x((A − m(x))/σ(x)), x ∈ X, where m : X → R and σ : X → (β, 1/β) are as in the previous example. [sent-64, score-0.166]
29 Then Q_x has the τ-quantile f^*_{τ,Q_x} = m(x) + σ(x) f^*_{τ,Q̃_x} of type αβ, because for s ∈ [0, αβ] we obtain the inequality Q_x((f^*_{τ,Q_x}, f^*_{τ,Q_x} + s)) = Q̃_x((f^*_{τ,Q̃_x}, f^*_{τ,Q̃_x} + s/σ(x))) ≥ b(x) s/σ(x) ≥ b(x) β s. [sent-65, score-0.143]
30 Consequently, P has a τ-quantile of p-average type αβ if and only if P̃ has a τ-quantile of p-average type α. [sent-66, score-0.152]
31 The following theorem shows that for distributions having a quantile of p-average type the conditional quantile can be estimated by functions that approximately minimize the pinball risk. [sent-67, score-0.597]
32 Moreover, let P be a distribution on X × Y that has a τ -quantile of p-average type α. [sent-70, score-0.131]
33 Then for all f : X → R satisfying R_{L_τ,P}(f) − R^*_{L_τ,P} ≤ 2^{−(p+2)/(p+1)} α^{2p/(p+1)} we have, with q := 2p/(p+1), ‖f − f^*_{τ,P}‖_{L_q(P_X)} ≤ √2 ‖b^{−1}‖^{1/2}_{L_p(P_X)} ( R_{L_τ,P}(f) − R^*_{L_τ,P} )^{1/2}. [sent-71, score-0.048]
34 Our next goal is to establish an oracle inequality for SVMs defined by (1). [sent-72, score-0.167]
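The optimization problem (1) itself is not reproduced in this extraction; in the usual formulation it is regularized empirical pinball risk minimization over an RKHS H, i.e. f_{D,λ} := argmin_{f∈H} λ‖f‖²_H + (1/n) Σ_i L_τ(y_i, f(x_i)). A minimal sketch under that assumption (our code, using a Gaussian kernel and cvxpy as an off-the-shelf convex solver; the paper prescribes no solver):

```python
import numpy as np
import cvxpy as cp

def gauss_kernel(A, B, gamma=10.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, (150, 1))
y = np.sin(3.0 * X[:, 0]) + 0.3 * rng.standard_normal(150)

tau, lam = 0.9, 1e-3
K = gauss_kernel(X, X) + 1e-8 * np.eye(150)   # small jitter keeps the PSD check happy
a = cp.Variable(150)                          # f = sum_j a_j k(x_j, .), ||f||_H^2 = a' K a
r = y - K @ a
pinball = cp.sum(cp.maximum(tau * r, (tau - 1.0) * r)) / 150   # empirical pinball risk
cp.Problem(cp.Minimize(lam * cp.quad_form(a, K) + pinball)).solve()

print(float(np.mean(y <= K @ a.value)))       # should be roughly tau = 0.9
```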
35 Then we have L_τ(y, t̄) ≤ L_τ(y, t) for all y ∈ Y and t ∈ R, where t̄ denotes t clipped to the interval [−1, 1], i.e. t̄ := max{−1, min{1, t}}. [sent-74, score-0.062]
36 Since this yields R_{L_τ,P}(f̄) ≤ R_{L_τ,P}(f) for all functions f : X → R, we will focus on clipped functions f̄ in the following. [sent-77, score-0.082]
37 To describe the approximation error of SVMs we need the approximation error function A(λ) := inf_{f∈H} ( λ‖f‖²_H + R_{L_τ,P}(f) − R^*_{L_τ,P} ), λ > 0. [sent-78, score-0.07]
38 We also need the covering numbers, which for ε > 0 are defined by N(B_H, ε, L_2(µ)) := min{ n ≥ 1 : ∃ x_1, …, x_n ∈ L_2(µ) such that B_H ⊂ ⋃_{i=1}^{n} (x_i + ε B_{L_2(µ)}) }. [sent-80, score-0.046]
39 With these preparations we can now recall the following oracle inequality shown in more generality in [9]. [sent-92, score-0.211]
40 Theorem 2.6. Let P be a distribution on X × [−1, 1] for which there exist constants v ≥ 1 and ϑ ∈ [0, 1] with E_P (L_τ ∘ f̄ − L_τ ∘ f^*_{τ,P})² ≤ v ( E_P (L_τ ∘ f̄ − L_τ ∘ f^*_{τ,P}) )^ϑ (5) for all f : X → R. [sent-94, score-0.069]
41 Moreover, let H be an RKHS over X for which there exist ϱ ∈ (0, 1) and a ≥ 1 with sup log N(B_H, ε, L_2(D_X)) ≤ a ε^{−2ϱ}, ε > 0. [sent-95, score-0.064]
42 Moreover, [9] showed that oracle inequalities of the above type can be used to establish learning rates and to investigate data-dependent parameter selection strategies. [sent-97, score-0.198]
43 For example, if we assume that there exist constants c > 0 and β ∈ (0, 1] such that A(λ) ≤ c λ^β for all λ > 0, then R_{L_τ,P}(f̄_{T,λ_n}) converges to R^*_{L_τ,P} with rate n^{−γ}, where γ := min{ β / (β(2 − ϑ + ϱ(ϑ − 1)) + ϱ), 2β/(β + 1) } and λ_n = n^{−γ/β}. [sent-98, score-0.092]
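For concreteness, the exponent γ is easy to evaluate; the helper below (our code, using the formula as reconstructed above) computes γ for given β, ϑ, ϱ:

```python
def rate_exponent(beta, theta, rho):
    """gamma = min{ beta / (beta (2 - theta + rho (theta - 1)) + rho), 2 beta / (beta + 1) },
    as reconstructed above; the regularization schedule is lambda_n = n^{-gamma/beta}."""
    return min(beta / (beta * (2.0 - theta + rho * (theta - 1.0)) + rho),
               2.0 * beta / (beta + 1.0))

# Example: beta = 1, theta = 1 (variance bound exponent), rho = 0.25.
gamma = rate_exponent(1.0, 1.0, 0.25)
print(gamma)   # min{1/1.25, 1} = 0.8, i.e. risk rate n^{-0.8}, lambda_n = n^{-0.8}
```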
44 Let us now consider how these learning rates in terms of risks translate into rates for ‖f̄_{T,λ} − f^*_{τ,P}‖_{L_q(P_X)}. [sent-100, score-0.118]
45 To this end we assume that P has a τ -quantile of p-average type α for τ ∈ (0, 1). [sent-101, score-0.114]
46 By Theorem 2.5 we then obtain E_P (L_τ ∘ f̄ − L_τ ∘ f^*_{τ,P})² ≤ E_P |f̄ − f^*_{τ,P}|² ≤ ‖f̄ − f^*_{τ,P}‖_∞^{2−q} E_P |f̄ − f^*_{τ,P}|^q ≤ c ( R_{L_τ,P}(f̄) − R^*_{L_τ,P} )^{q/2} for all f satisfying R_{L_τ,P}(f̄) − R^*_{L_τ,P} ≤ 2^{−(p+2)/(p+1)} α^{2p/(p+1)}. [sent-103, score-0.048]
47 That is, we have a variance bound (5) for ϑ := q/2 and clipped functions with small excess risk. [sent-105, score-0.108]
48 To illustrate the latter, let us assume that H is a Sobolev space W^m(X) of order m ∈ N over X, where X is the unit ball in R^d. [sent-107, score-0.075]
49 Then the learning rate for ‖f̄_{T,λ} − f^*_{τ,P}‖_{L_q(P_X)} becomes n^{−1/(4−q(1−ϱ))}, which for ∞-average type distributions reduces to n^{−2m/(6m+d)} ≈ n^{−1/3}. [sent-110, score-0.103]
50 Let us finally investigate whether the ǫ-insensitive loss, defined by L(y, t) := max{0, |y − t| − ǫ} for y, t ∈ R and fixed ǫ > 0, can be used to estimate the median, i.e. the 1/2-quantile. [sent-111, score-0.111]
51 Theorem 2.7. Let L be the ǫ-insensitive loss for some ǫ > 0 and let P be a distribution on X × R which has a unique median f^*_{1/2,P}. [sent-115, score-0.145]
52 Furthermore, assume that all conditional distributions P(·|x), x ∈ X, are atom-free, i.e. P({y} | x) = 0 for all y ∈ R. [sent-116, score-0.08]
53 Assume moreover that they are symmetric, i.e. P(h(x) + A | x) = P(h(x) − A | x) for all measurable A ⊂ R and a suitable function h : X → R. [sent-120, score-0.075]
54 If the conditional distributions have positive mass concentrated around f^*_{1/2,P}(x) ± ǫ, then f^*_{1/2,P} is the only minimizer of R_{L,P}. [sent-121, score-0.092]
55 Note that using [7] one can show that for the distributions specified in the above theorem the SVM using the ǫ-insensitive loss approximates f^*_{1/2,P} whenever the SVM is R_{L,P}-consistent, i. [sent-122, score-0.168]
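The contrast between the two losses is visible empirically. The sketch below (our code, with an arbitrary asymmetric sample) minimizes the empirical ǫ-insensitive risk over constant predictions: for ǫ = 0 the minimizer is the sample median, while for larger ǫ it can move away from it, consistent with the extra conditions on P in the theorem:

```python
import numpy as np

def eps_insensitive(y, t, eps):
    """L(y, t) = max{0, |y - t| - eps}."""
    return np.maximum(0.0, np.abs(np.asarray(y) - t) - eps)

rng = np.random.default_rng(2)
# Asymmetric sample: a tight left component plus a wide right component,
# so the symmetry condition of the theorem fails.
y = np.concatenate([rng.normal(-1.0, 0.1, 5000), rng.normal(2.0, 1.5, 5000)])

grid = np.linspace(-3.0, 5.0, 4001)
for eps in (0.0, 0.5, 1.5):
    risks = np.array([eps_insensitive(y, t, eps).mean() for t in grid])
    print(eps, grid[np.argmin(risks)], np.median(y))
# eps = 0 recovers the sample median; larger eps can move the minimizer away from it.
```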
56 3 Proofs. Let us first recall some notions from [7], which investigated surrogate losses in general and, in particular, the question of how approximate risk minimizers approximate exact risk minimizers. [sent-127, score-0.366]
57 To this end let L : X × Y × R → [0, ∞) be a measurable function which we call a loss in the following. [sent-128, score-0.197]
58 For a distribution P and an f : X → R the L-risk is then defined by R_{L,P}(f) := E_{(x,y)∼P} L(x, y, f(x)), and, as usual, the Bayes L-risk is denoted by R^*_{L,P} := inf R_{L,P}(f), where the infimum is taken over all (measurable) f : X → R. [sent-129, score-0.09]
59 In addition, given a distribution Q on Y, the inner L-risks were defined by C_{L,Q,x}(t) := ∫_Y L(x, y, t) dQ(y), x ∈ X, t ∈ R, and the minimal inner L-risks were denoted by C^*_{L,Q,x} := inf_{t∈R} C_{L,Q,x}(t), x ∈ X. [sent-130, score-0.172]
60 2] further shows that x ↦ C^*_{L,P(·|x),x} is measurable if the σ-algebra on X is complete. [sent-133, score-0.075]
61 In other words, the Bayes L-risk is obtained by minimizing the inner risks and subsequently integrating with respect to the marginal distribution P_X. [sent-136, score-0.18]
62 In particular, it turned out that the sets of ε-approximate minimizers M_{L,Q,x}(ε) := { t ∈ R : C_{L,Q,x}(t) < C^*_{L,Q,x} + ε }, ε ∈ [0, ∞], and the set of exact minimizers M_{L,Q,x}(0⁺) := ⋂_{ε>0} M_{L,Q,x}(ε) play a crucial role. [sent-138, score-0.216]
63 Now assume we have two losses L_tar : X × Y × R → [0, ∞] and L_sur : X × Y × R → [0, ∞], and that our goal is to estimate the excess L_tar-risk by the excess L_sur-risk. [sent-140, score-0.134]
64 This issue was investigated in [7], where the main device was the so-called calibration function δ_max(·, Q, x), defined for all ε ∈ [0, ∞] by δ_max(ε, Q, x) := inf_{t ∈ R \ M_{L_tar,Q,x}(ε)} ( C_{L_sur,Q,x}(t) − C^*_{L_sur,Q,x} ) if C^*_{L_sur,Q,x} < ∞, and δ_max(ε, Q, x) := ∞ if C^*_{L_sur,Q,x} = ∞. [sent-141, score-0.096]
65 In the following we sometimes write δ_max,L_tar,L_sur(ε, Q, x) := δ_max(ε, Q, x) whenever we need to explicitly mention the target and surrogate losses. [sent-142, score-0.065]
66 Before we use (8) to establish an inequality between the excess risks of L_tar and L_sur, we finally recall that the Fenchel-Legendre bi-conjugate g** : I → [0, ∞] of a function g : I → [0, ∞] defined on an interval I is the largest convex function h : I → [0, ∞] satisfying h ≤ g. [sent-146, score-0.334]
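The bi-conjugate can be approximated on a grid by applying the discrete Legendre transform twice. A small numerical sketch (our code, an approximation only):

```python
import numpy as np

def biconjugate(g_vals, t_grid, s_grid=None):
    """Grid approximation of g**: g*(s) = max_t (s t - g(t)) and
    g**(t) = max_s (s t - g*(s)); g** is the largest convex minorant of g."""
    if s_grid is None:
        s_grid = np.linspace(-50.0, 50.0, 2001)
    g_star = np.max(s_grid[:, None] * t_grid[None, :] - g_vals[None, :], axis=1)
    return np.max(s_grid[None, :] * t_grid[:, None] - g_star[None, :], axis=1)

t = np.linspace(0.0, 2.0, 401)
g = np.minimum(t ** 2, 0.3 + 0.1 * t)     # an arbitrary non-convex example
gss = biconjugate(g, t)
print(np.all(gss <= g + 1e-9))            # True: g** <= g everywhere
```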
67 With these preparations we can now establish the following generalization of [7, Theorem 2. [sent-148, score-0.075]
68 Theorem 3.1. Let P be a distribution on X × Y with R^*_{L_tar,P} < ∞ and R^*_{L_sur,P} < ∞, and assume that there exist p ∈ (0, ∞] and functions b : X → [0, ∞] and δ : [0, ∞) → [0, ∞) such that δ_max(ε, P(·|x), x) ≥ b(x) δ(ε), ε ≥ 0, x ∈ X, (9) and b^{−1} ∈ L_p(P_X). [sent-151, score-0.227]
69 Then for q := p/(p+1), δ̄ := δ^q : [0, ∞) → [0, ∞), and all f : X → R we have δ̄**( R_{L_tar,P}(f) − R^*_{L_tar,P} ) ≤ ‖b^{−1}‖^q_{L_p(P_X)} ( R_{L_sur,P}(f) − R^*_{L_sur,P} )^q. [sent-152, score-0.16]
70 13] we then see that there exist constants c_1, c_2 ∈ (0, ∞) satisfying t ≤ c_1 δ̄**(t) + c_2 for all t ∈ [0, ∞]. [sent-159, score-0.139]
71 From this we obtain ∞ = R_{L_tar,P}(f) − R^*_{L_tar,P} ≤ c_1 ∫_X δ̄**( C_{L_tar,P(·|x),x}(f(x)) − C^*_{L_tar,P(·|x),x} ) dP_X(x) + c_2 ≤ c_1 ∫_X b(x)^{−q} ( C_{L_sur,P(·|x),x}(f(x)) − C^*_{L_sur,P(·|x),x} )^q dP_X(x) + c_2, where the last step is analogous to our considerations for R_{L_tar,P}(f) < ∞. [sent-160, score-0.117]
72 By b^{−1} ∈ L_p(P_X) and Hölder's inequality we then conclude R_{L_sur,P}(f) − R^*_{L_sur,P} = ∞. [sent-161, score-0.11]
73 Our next goal is to determine the inner risks and their minimizers for the pinball loss. [sent-162, score-0.362]
74 8]) that given a distribution Q on R and a non-negative function g : R → [0, ∞) we have ∫_R g dQ = ∫_0^∞ Q(g ≥ s) ds. [sent-166, score-0.668]
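This layer-cake identity is easy to check by Monte Carlo; a quick sketch (our code, with Q = N(0, 1) and g(y) = y² as arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.standard_normal(200_000)
gy = np.sort(y ** 2)                       # g(y) = y^2 >= 0 under Q = N(0, 1)

lhs = gy.mean()                            # Monte Carlo estimate of int g dQ
s = np.linspace(0.0, 30.0, 30_001)
tail = 1.0 - np.searchsorted(gy, s, side="left") / len(gy)   # Q(g >= s)
rhs = np.trapz(tail, s)                    # int_0^infinity Q(g >= s) ds
print(lhs, rhs)                            # both close to E[y^2] = 1
```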
75 Then there exist q_+, q_− ∈ [0, ∞) with q_+ + q_− = Q({t^*}), and for all t ≥ 0 we have C_{L_τ,Q}(t^* + t) − C_{L_τ,Q}(t^*) = t q_+ + ∫_0^t Q((t^*, t^* + s)) ds (12) and C_{L_τ,Q}(t^* − t) − C_{L_τ,Q}(t^*) = t q_− + ∫_0^t Q((t^* − s, t^*)) ds. (13) [sent-169, score-1.325]
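Before the proof, formulas (12) and (13) can be verified numerically for an atom-free Q, where q_+ = q_− = 0. A sketch (our code, with Q = N(0, 1) and τ = 0.7 as arbitrary choices):

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

tau = 0.7
t_star = norm.ppf(tau)        # tau-quantile of Q = N(0, 1); Q({t*}) = 0, so q+ = q- = 0

def inner_risk(t):
    """C_{L_tau,Q}(t) = int psi_tau(y - t) dQ(y)."""
    psi = lambda y: (tau - 1.0) * (y - t) if y < t else tau * (y - t)
    return quad(lambda y: psi(y) * norm.pdf(y), -10, 10, limit=200)[0]

for t in (0.2, 0.5, 1.0):
    lhs = inner_risk(t_star + t) - inner_risk(t_star)
    rhs = quad(lambda s: norm.cdf(t_star + s) - norm.cdf(t_star), 0, t)[0]  # (12) with q+ = 0
    print(t, lhs, rhs)        # the two columns agree up to quadrature error
```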
76 Proof: Let us consider the distribution Q^{(t^*)} defined by Q^{(t^*)}(A) := Q(t^* + A) for all measurable A ⊂ R. [sent-170, score-0.117]
77 Moreover, we obviously have C_{L_τ,Q}(t^* + t) = C_{L_τ,Q^{(t^*)}}(t), and hence we may assume without loss of generality that t^* = 0. [sent-172, score-0.127]
78 Hence there exists a q_+ satisfying 0 ≤ q_+ ≤ Q({0}) and Q((−∞, 0]) = τ + q_+. (14) [sent-175, score-0.048]
79 Let us now compute the inner risks of L_τ. [sent-176, score-0.159]
80 Following [7] we now define the self-calibration loss of L by L̆(Q, t) := |t − t^*_{L,Q}|, Q ∈ Q_min(L), t ∈ R. (15) [sent-179, score-0.067]
81 This loss is a so-called template loss in the sense of [7], i. [sent-180, score-0.134]
82 e., for a given distribution P on X × Y, where X has a complete σ-algebra and P(·|x) ∈ Q_min(L) for P_X-almost all x ∈ X, the P-instance L̆_P(x, t) := |t − t^*_{L,P(·|x)}| is measurable and hence a loss. [sent-182, score-0.119]
83 [7] extended the definition of inner risks to the self-calibration loss by setting C_{L̆,Q}(t) := L̆(Q, t), and based on this the minimal inner risks and their (approximate) minimizers were defined in the obvious way. [sent-183, score-0.449]
84 Moreover, the self-calibration function was defined by δ_max,L̆,L(ε, Q) = inf_{t∈R: |t−t^*_{L,Q}|≥ε} ( C_{L,Q}(t) − C^*_{L,Q} ). [sent-184, score-0.07]
85 First, it measures how well approximate L-risk minimizers t approximate the true minimizer t^*_{L,Q}; second, it equals the calibration function of the P-instance L̆_P, i. [sent-187, score-0.164]
86 2 show δ_max,L̆,L(ε, Q) = min{ ε q_+ + ∫_0^ε Q((t^*, t^* + s)) ds, ε q_− + ∫_0^ε Q((t^* − s, t^*)) ds }, ε ≥ 0, where q_+ and q_− are the real numbers defined in Proposition 3.2. [sent-195, score-0.694]
87 Let us additionally assume that the τ -quantile t∗ is of type α. [sent-197, score-0.116]
88 For the Huber-type function δ(ε) := ε²/2 if ε ∈ [0, α] and δ(ε) := αε − α²/2 if ε > α, a simple calculation then yields δ_max,L̆,L(ε, Q) ≥ c_Q δ(ε), where c_Q is the constant satisfying (3). [sent-198, score-0.144]
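This lower bound can also be checked numerically. The sketch below (our code, with Q = N(0, 1), τ = 0.7, α = 0.5 as arbitrary choices; q_+ = q_− = 0 since Q is atom-free) compares the min-formula above with c_Q δ(ε):

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

tau, alpha = 0.7, 0.5
t_star = norm.ppf(tau)        # Q = N(0, 1) is atom-free, so q+ = q- = 0

# c_Q from the density bound: here the N(0,1) density is monotone on
# [t* - alpha, t* + alpha] (the mode 0 lies outside), so its infimum is at an endpoint.
c_Q = min(norm.pdf(t_star - alpha), norm.pdf(t_star + alpha))

def delta_huber(e):
    return e * e / 2.0 if e <= alpha else alpha * e - alpha * alpha / 2.0

def delta_max(e):
    plus = quad(lambda s: norm.cdf(t_star + s) - norm.cdf(t_star), 0, e)[0]
    minus = quad(lambda s: norm.cdf(t_star) - norm.cdf(t_star - s), 0, e)[0]
    return min(plus, minus)

for e in (0.1, 0.3, 0.5, 1.0, 2.0):
    print(e, delta_max(e) >= c_Q * delta_huber(e))   # True for every e
```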
89 To this end we define δ̂(ε) := s_p ε² if ε ∈ [0, s_p a_p] and δ̂(ε) := a_p ε − s_p^{p+2} a_p if ε > s_p a_p, where a_p := α^q and s_p := 2^{−q/p}. [sent-202, score-0.62]
90 Proposition 3.3. Let Q be a symmetric, atom-free distribution on R with median t^* = 0. [sent-210, score-0.078]
91 Moreover, the definition of L implies C_{L,Q}(t) = ∫_{−∞}^{t−ǫ} (t − y − ǫ) dQ(y) + ∫_{t+ǫ}^{∞} (y − ǫ − t) dQ(y). (19) [sent-215, score-0.071]
92 Using the symmetry of Q and (18), both integrals can be expressed through the tail function s ↦ Q([s, ∞)); let us first consider the case t ≥ ǫ. [sent-216, score-0.046]
93 Then the symmetry of Q and (18) yield, after splitting and recombining the resulting integrals, C_{L,Q}(t) = 2 ∫_0^{t−ǫ} Q([0, ∞)) ds + 2 ∫_{t+ǫ}^{∞} Q([s, ∞)) ds + ∫_{t−ǫ}^{t+ǫ} Q([s, ∞)) ds if t ≥ ǫ. [sent-217, score-2.039] [sent-220, score-1.296] [sent-221, score-1.296] [sent-222, score-5.204] [sent-223, score-4.536]
98 Analogously we obtain from (19) a corresponding formula for the case t ∈ (0, ǫ], which in particular gives C_{L,Q}(t) − C_{L,Q}(0) = ∫_0^t Q([ǫ − t + s, ǫ + s)) ds ≥ 0. [sent-225, score-3.888]
99 Hence one exact minimizer of CL,Q (·) is the median t∗ = 0. [sent-230, score-0.088]
100 The last assertion is a direct consequence of the formula for CL,Q (t) − CL,Q (0) in the case t ∈ (0, ǫ]. [sent-231, score-0.045]
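As a closing sanity check, the reconstructed formula for C_{L,Q}(t) in the case t ≥ ǫ can be compared with a direct evaluation of the risk; a sketch (our code, with Q = N(0, 1) and ǫ = 0.5 as arbitrary choices):

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

eps = 0.5

def risk_direct(t):
    """C_{L,Q}(t) = int max{0, |y - t| - eps} dQ(y) for Q = N(0, 1)."""
    return quad(lambda y: max(0.0, abs(y - t) - eps) * norm.pdf(y), -12, 12, limit=200)[0]

def risk_formula(t):
    """Tail-integral formula for t >= eps (Q symmetric, atom-free, median 0)."""
    tail = lambda s: 1.0 - norm.cdf(s)   # Q([s, infinity))
    first = t - eps                      # = 2 int_0^{t-eps} Q([0, inf)) ds, since Q([0, inf)) = 1/2
    second = 2.0 * quad(tail, t + eps, 12)[0]
    third = quad(tail, t - eps, t + eps)[0]
    return first + second + third

for t in (0.5, 1.0, 2.0):
    print(t, risk_direct(t), risk_formula(t))   # columns agree up to quadrature error
```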
wordName wordTfidf (topN-words)
[('ds', 0.648), ('clsur', 0.292), ('px', 0.248), ('dq', 0.201), ('cltar', 0.175), ('rl', 0.145), ('quantile', 0.145), ('hqx', 0.136), ('rltar', 0.136), ('dpx', 0.136), ('qx', 0.136), ('cq', 0.117), ('pinball', 0.117), ('tar', 0.117), ('minimizers', 0.108), ('risks', 0.096), ('bh', 0.087), ('fd', 0.087), ('tq', 0.078), ('cl', 0.078), ('type', 0.076), ('measurable', 0.075), ('lp', 0.074), ('inf', 0.07), ('ap', 0.068), ('inequality', 0.067), ('loss', 0.067), ('sp', 0.065), ('oracle', 0.064), ('lq', 0.062), ('clipped', 0.062), ('lsur', 0.058), ('ltar', 0.058), ('qmin', 0.058), ('rlsur', 0.058), ('median', 0.058), ('ft', 0.052), ('moreover', 0.052), ('theorem', 0.052), ('steinwart', 0.051), ('satisfying', 0.048), ('hq', 0.046), ('excess', 0.046), ('sur', 0.043), ('ep', 0.043), ('svms', 0.042), ('inner', 0.041), ('recall', 0.041), ('quantiles', 0.041), ('alamos', 0.039), ('christmann', 0.039), ('preparations', 0.039), ('cp', 0.039), ('mum', 0.036), ('establish', 0.036), ('conditional', 0.035), ('let', 0.035), ('rkhs', 0.035), ('ingo', 0.034), ('dx', 0.034), ('svm', 0.033), ('regular', 0.032), ('hush', 0.031), ('minimizer', 0.03), ('symmetry', 0.029), ('lder', 0.029), ('surely', 0.029), ('exist', 0.029), ('lebesgue', 0.027), ('los', 0.027), ('distributions', 0.027), ('nally', 0.026), ('assertion', 0.026), ('calibration', 0.026), ('min', 0.025), ('surrogate', 0.025), ('omit', 0.024), ('losses', 0.024), ('hence', 0.024), ('proposition', 0.023), ('marginal', 0.023), ('us', 0.022), ('proof', 0.022), ('whenever', 0.022), ('implies', 0.022), ('investigate', 0.022), ('numbers', 0.021), ('distribution', 0.02), ('constants', 0.02), ('yields', 0.02), ('end', 0.02), ('risk', 0.019), ('nition', 0.019), ('formula', 0.019), ('bayes', 0.019), ('max', 0.018), ('obviously', 0.018), ('write', 0.018), ('assume', 0.018), ('xn', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 101 nips-2007-How SVMs can estimate quantiles and the median
Author: Andreas Christmann, Ingo Steinwart
Abstract: We investigate quantile regression based on the pinball loss and the ǫ-insensitive loss. For the pinball loss a condition on the data-generating distribution P is given that ensures that the conditional quantiles are approximated with respect to ‖·‖_1. This result is then used to derive an oracle inequality for an SVM based on the pinball loss. Moreover, we show that SVMs based on the ǫ-insensitive loss estimate the conditional median only under certain conditions on P.
2 0.11845171 110 nips-2007-Learning Bounds for Domain Adaptation
Author: John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, Jennifer Wortman
Abstract: Empirical risk minimization offers well-known learning guarantees when training and test data come from the same domain. In the real world, though, we often wish to adapt a classifier from a source domain with a large amount of training data to different target domain with very little training data. In this work we give uniform convergence bounds for algorithms that minimize a convex combination of source and target empirical risk. The bounds explicitly model the inherent trade-off between training on a large but inaccurate source data set and a small but accurate target training set. Our theory also gives results when we have multiple source domains, each of which may have a different number of instances, and we exhibit cases in which minimizing a non-uniform combination of source risks can achieve much lower target error than standard empirical risk minimization. 1
3 0.084267214 21 nips-2007-Adaptive Online Gradient Descent
Author: Elad Hazan, Alexander Rakhlin, Peter L. Bartlett
Abstract: We study the rates of growth of the regret in online convex optimization. First, we show that a simple extension of the algorithm of Hazan et al eliminates the need for a priori knowledge of the lower bound on the second derivatives of the observed functions. We then provide an algorithm, Adaptive Online Gradient Descent, which interpolates between the results of Zinkevich for linear functions and of Hazan et al for strongly convex functions, achieving intermediate rates √ between T and log T . Furthermore, we show strong optimality of the algorithm. Finally, we provide an extension of our results to general norms. 1
4 0.067532137 118 nips-2007-Learning with Transformation Invariant Kernels
Author: Christian Walder, Olivier Chapelle
Abstract: This paper considers kernels invariant to translation, rotation and dilation. We show that no non-trivial positive definite (p.d.) kernels exist which are radial and dilation invariant, only conditionally positive definite (c.p.d.) ones. Accordingly, we discuss the c.p.d. case and provide some novel analysis, including an elementary derivation of a c.p.d. representer theorem. On the practical side, we give a support vector machine (s.v.m.) algorithm for arbitrary c.p.d. kernels. For the thinplate kernel this leads to a classifier with only one parameter (the amount of regularisation), which we demonstrate to be as effective as an s.v.m. with the Gaussian kernel, even though the Gaussian involves a second parameter (the length scale). 1
5 0.064728506 108 nips-2007-Kernel Measures of Conditional Dependence
Author: Kenji Fukumizu, Arthur Gretton, Xiaohai Sun, Bernhard Schölkopf
Abstract: We propose a new measure of conditional dependence of random variables, based on normalized cross-covariance operators on reproducing kernel Hilbert spaces. Unlike previous kernel dependence measures, the proposed criterion does not depend on the choice of kernel in the limit of infinite data, for a wide class of kernels. At the same time, it has a straightforward empirical estimate with good convergence behaviour. We discuss the theoretical properties of the measure, and demonstrate its application in experiments. 1
6 0.063649744 82 nips-2007-Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization
7 0.056900889 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning
8 0.048629411 7 nips-2007-A Kernel Statistical Test of Independence
9 0.046578165 42 nips-2007-CPR for CSPs: A Probabilistic Relaxation of Constraint Propagation
10 0.044699885 205 nips-2007-Theoretical Analysis of Learning with Reward-Modulated Spike-Timing-Dependent Plasticity
11 0.04440248 199 nips-2007-The Price of Bandit Information for Online Optimization
12 0.043490943 22 nips-2007-Agreement-Based Learning
13 0.042904142 12 nips-2007-A Spectral Regularization Framework for Multi-Task Structure Learning
14 0.042112578 124 nips-2007-Managing Power Consumption and Performance of Computing Systems Using Reinforcement Learning
15 0.041712444 192 nips-2007-Testing for Homogeneity with Kernel Fisher Discriminant Analysis
16 0.041336522 23 nips-2007-An Analysis of Convex Relaxations for MAP Estimation
17 0.041307814 211 nips-2007-Unsupervised Feature Selection for Accurate Recommendation of High-Dimensional Image Data
18 0.039667778 123 nips-2007-Loop Series and Bethe Variational Bounds in Attractive Graphical Models
19 0.039288409 120 nips-2007-Linear programming analysis of loopy belief propagation for weighted matching
20 0.037910968 159 nips-2007-Progressive mixture rules are deviation suboptimal
topicId topicWeight
[(0, -0.106), (1, -0.036), (2, -0.036), (3, 0.063), (4, 0.016), (5, -0.045), (6, 0.024), (7, 0.001), (8, -0.035), (9, 0.011), (10, 0.07), (11, -0.02), (12, -0.012), (13, 0.004), (14, -0.029), (15, -0.112), (16, 0.137), (17, 0.024), (18, -0.008), (19, -0.006), (20, 0.05), (21, -0.045), (22, 0.035), (23, -0.152), (24, -0.069), (25, -0.011), (26, 0.059), (27, -0.066), (28, 0.048), (29, -0.0), (30, -0.093), (31, -0.066), (32, 0.001), (33, -0.0), (34, 0.007), (35, 0.089), (36, -0.156), (37, 0.026), (38, -0.1), (39, -0.107), (40, -0.154), (41, -0.021), (42, -0.033), (43, 0.0), (44, 0.037), (45, 0.024), (46, -0.019), (47, -0.075), (48, -0.06), (49, -0.092)]
simIndex simValue paperId paperTitle
same-paper 1 0.96634793 101 nips-2007-How SVMs can estimate quantiles and the median
Author: Andreas Christmann, Ingo Steinwart
Abstract: We investigate quantile regression based on the pinball loss and the ǫ-insensitive loss. For the pinball loss a condition on the data-generating distribution P is given that ensures that the conditional quantiles are approximated with respect to ‖·‖_1. This result is then used to derive an oracle inequality for an SVM based on the pinball loss. Moreover, we show that SVMs based on the ǫ-insensitive loss estimate the conditional median only under certain conditions on P.
2 0.6279667 110 nips-2007-Learning Bounds for Domain Adaptation
Author: John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, Jennifer Wortman
Abstract: Empirical risk minimization offers well-known learning guarantees when training and test data come from the same domain. In the real world, though, we often wish to adapt a classifier from a source domain with a large amount of training data to different target domain with very little training data. In this work we give uniform convergence bounds for algorithms that minimize a convex combination of source and target empirical risk. The bounds explicitly model the inherent trade-off between training on a large but inaccurate source data set and a small but accurate target training set. Our theory also gives results when we have multiple source domains, each of which may have a different number of instances, and we exhibit cases in which minimizing a non-uniform combination of source risks can achieve much lower target error than standard empirical risk minimization. 1
3 0.51560563 82 nips-2007-Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization
Author: Xuanlong Nguyen, Martin J. Wainwright, Michael I. Jordan
Abstract: We develop and analyze an algorithm for nonparametric estimation of divergence functionals and the density ratio of two probability distributions. Our method is based on a variational characterization of f -divergences, which turns the estimation into a penalized convex risk minimization problem. We present a derivation of our kernel-based estimation algorithm and an analysis of convergence rates for the estimator. Our simulation results demonstrate the convergence behavior of the method, which compares favorably with existing methods in the literature. 1
4 0.45001656 108 nips-2007-Kernel Measures of Conditional Dependence
Author: Kenji Fukumizu, Arthur Gretton, Xiaohai Sun, Bernhard Schölkopf
Abstract: We propose a new measure of conditional dependence of random variables, based on normalized cross-covariance operators on reproducing kernel Hilbert spaces. Unlike previous kernel dependence measures, the proposed criterion does not depend on the choice of kernel in the limit of infinite data, for a wide class of kernels. At the same time, it has a straightforward empirical estimate with good convergence behaviour. We discuss the theoretical properties of the measure, and demonstrate its application in experiments. 1
5 0.43765432 178 nips-2007-Simulated Annealing: Rigorous finite-time guarantees for optimization on continuous domains
Author: Andrea Lecchini-visintini, John Lygeros, Jan Maciejowski
Abstract: Simulated annealing is a popular method for approaching the solution of a global optimization problem. Existing results on its performance apply to discrete combinatorial optimization where the optimization variables can assume only a finite set of possible values. We introduce a new general formulation of simulated annealing which allows one to guarantee finite-time performance in the optimization of functions of continuous variables. The results hold universally for any optimization problem on a bounded domain and establish a connection between simulated annealing and up-to-date theory of convergence of Markov chain Monte Carlo methods on continuous domains. This work is inspired by the concept of finite-time learning with known accuracy and confidence developed in statistical learning theory. Optimization is the general problem of finding a value of a vector of variables θ that maximizes (or minimizes) some scalar criterion U (θ). The set of all possible values of the vector θ is called the optimization domain. The elements of θ can be discrete or continuous variables. In the first case the optimization domain is usually finite, such as in the well-known traveling salesman problem; in the second case the optimization domain is a continuous set. An important example of a continuous optimization domain is the set of 3-D configurations of a sequence of amino-acids in the problem of finding the minimum energy folding of the corresponding protein [1]. In principle, any optimization problem on a finite domain can be solved by an exhaustive search. However, this is often beyond computational capacity: the optimization domain of the traveling salesman problem with 100 cities contains more than 10155 possible tours. An efficient algorithm to solve the traveling salesman and many similar problems has not yet been found and such problems remain reliably solvable only in principle [2]. Statistical mechanics has inspired widely used methods for finding good approximate solutions in hard discrete optimization problems which defy efficient exact solutions [3, 4, 5, 6]. Here a key idea has been that of simulated annealing [3]: a random search based on the Metropolis-Hastings algorithm, such that the distribution of the elements of the domain visited during the search converges to an equilibrium distribution concentrated around the global optimizers. Convergence and finite-time performance of simulated annealing on finite domains has been evaluated in many works, e.g. [7, 8, 9, 10]. On continuous domains, most popular optimization methods perform a local gradient-based search and in general converge to local optimizers; with the notable exception of convex criteria where convergence to the unique global optimizer occurs [11]. Simulated annealing performs a global search and can be easily implemented on continuous domains. Hence it can be considered a powerful complement to local methods. In this paper, we introduce for the first time rigorous guarantees on the finite-time performance of simulated annealing on continuous domains. We will show that it is possible to derive simulated annealing algorithms which, with an arbitrarily high level of confidence, find an approximate solution to the problem of optimizing a function of continuous variables, within a specified tolerance to the global optimal solution after a known finite number of steps. 
Rigorous guarantees on the finite-time performance of simulated annealing in the optimization of functions of continuous variables have never been obtained before; the only results available state that simulated annealing converges to a global optimizer as the number of steps grows to infinity, e.g. [12, 13, 14, 15]. The background of our work is twofold. On the one hand, our notion of approximate solution to a global optimization problem is inspired by the concept of finite-time learning with known accuracy and confidence developed in statistical learning theory [16, 17]. We actually maintain an important aspect of statistical learning theory which is that we do not introduce any particular assumption on the optimization criterion, i.e. our results hold regardless of what U is. On the other hand, we ground our results on the theory of convergence, with quantitative bounds on the distance to the target distribution, of the Metropolis-Hastings algorithm and Markov Chain Monte Carlo (MCMC) methods, which has been one of the main achievements of recent research in statistics [18, 19, 20, 21]. In this paper, we will not develop any ready-to-use optimization algorithm. We will instead introduce a general formulation of the simulated annealing method which allows one to derive new simulated annealing algorithms with rigorous finite-time guarantees on the basis of existing theory. The Metropolis-Hastings algorithm and the general family of MCMC methods have many degrees of freedom. The choice and comparison of specific algorithms goes beyond the scope of the paper. The paper is organized in the following sections. In Simulated annealing we introduce the method and fix the notation. In Convergence we recall the reasons why finite-time guarantees for simulated annealing on continuous domains have not been obtained before. In Finite-time guarantees we present the main result of the paper. In Conclusions we state our findings and conclude the paper. 1 Simulated annealing The original formulation of simulated annealing was inspired by the analogy between the stochastic evolution of the thermodynamic state of an annealing material towards the configurations of minimal energy and the search for the global minimum of an optimization criterion [3]. In the procedure, the optimization criterion plays the role of the energy and the state of the annealed material is simulated by the evolution of the state of an inhomogeneous Markov chain. The state of the chain evolves according to the Metropolis-Hastings algorithm in order to simulate the Boltzmann distribution of thermodynamic equilibrium. The Boltzmann distribution is simulated for a decreasing sequence of temperatures (“cooling”). The target distribution of the cooling procedure is the limiting Boltzmann distribution, for the temperature that tends to zero, which takes non-zero values only on the set of global minimizers [7]. The original formulation of the method was for a finite domain. However, simulated annealing can be generalized straightforwardly to a continuous domain because the Metropolis-Hastings algorithm can be used with almost no differences on discrete and continuous domains The main difference is that on a continuous domain the equilibrium distributions are specified by probability densities. 
On a continuous domain, Markov transition kernels in which the distribution of the elements visited by the chain converges to an equilibrium distribution with the desired density can be constructed using the Metropolis-Hastings algorithm and the general family of MCMC methods [22]. We point out that Boltzmann distributions are not the only distributions which can be adopted as equilibrium distributions in simulated annealing [7]. In this paper it is convenient for us to adopt a different type of equilibrium distribution in place of Boltzmann distributions. 1.1 Our setting The optimization criterion is U : Θ → [0, 1], with Θ ⊂ RN . The assumption that U takes values in the interval [0, 1] is a technical one. It does not imply any serious loss of generality. In general, any bounded optimization criterion can be scaled to take values in [0, 1]. We assume that the optimization task is to find a global maximizer; this can be done without loss of generality. We also assume that Θ is a bounded set. We consider equilibrium distributions defined by probability density functions proportional to [U (θ) + δ]J where J and δ are two strictly positive parameters. We use π (J) to denote an equilibrium distribution, i.e. π (J) (dθ) ∝ [U (θ) + δ]J πLeb (dθ) where πLeb is the standard Lebesgue measure. Here, J −1 plays the role of the temperature: if the function U (θ) plus δ is taken to a positive power J then as J increases (i.e. as J −1 decreases) [U (θ) + δ]J becomes increasingly peaked around the global maximizers. The parameter δ is an offset which guarantees that the equilibrium densities are always strictly positive, even if U takes zero values on some elements of the domain. The offset δ is chosen by the user and we show later that our results allow one to make an optimal selection of δ. The zero-temperature distribution is the limiting distribution, for J → ∞, which takes non-zero values only on the set of global maximizers. It is denoted by π (∞) . In the generic formulation of the method, the Markov transition kernel of the k-th step of the inhomogeneous chain has equilibrium distribution π (Jk ) where {Jk }k=1,2,... is the “cooling schedule”. The cooling schedule is a non-decreasing sequence of positive numbers according to which the equilibrium distribution become increasingly sharpened during the evolution of the chain. We use θk to denote the state of the chain and Pθk to denote its probability distribution. The distribution Pθk obviously depends on the initial condition θ0 . However, in this work, we don’t need to make this dependence explicit in the notation. Remark 1: If, given an element θ in Θ, the value U (θ) can be computed directly, we say that U is a deterministic criterion, e.g. the energy landscape in protein structure prediction [1]. In problems involving random variables, the value U (θ) may be the expected value U (θ) = g(x, θ)px (x; θ)dx of some function g which depends on both the optimization variable θ, and on some random variable x which has probability density px (x; θ) (which may itself depend on θ). In such problems it is usually not possible to compute U (θ) directly, either because evaluation of the integral requires too much computation, or because no analytical expression for px (x; θ) is available. Typically one must perform stochastic simulations in order to obtain samples of x for a given θ, hence obtain sample values of g(x, θ), and thus construct a Monte Carlo estimate of U (θ). 
The Bayesian design of clinical trials is an important application area where such expected-value criteria arise [23]. The authors of this paper investigate the optimization of expected-value criteria motivated by problems of aircraft routing [24]. In the particular case that px (x; θ) does not depend on θ, the optimization task is often called “empirical risk minimization”, and is studied extensively in statistical learning theory [16, 17]. The results of this paper apply in the same way to the optimization of both deterministic and expected-value criteria. The MCMC method developed by M¨ ller [25, 26] allows one u to construct simulated annealing algorithms for the optimization of expected-value criteria. M¨ ller u [25, 26] employs the same equilibrium distributions as those described in our setting; in his context J is restricted to integer values. 2 Convergence The rationale of simulated annealing is as follows: if the temperature is kept constant, say Jk = J, then the distribution of the state of the chain Pθk tends to the equilibrium distribution π (J) ; if J → ∞ then the equilibrium distribution π (J) tends to the zero-temperature distribution π (∞) ; as a result, if the cooling schedule Jk tends to infinity, one obtains that Pθk “follows” π (Jk ) and that π (Jk ) tends to π (∞) and eventually that the distribution of the state of the chain Pθk tends to π (∞) . The theory shows that, under conditions on the cooling schedule and the Markov transition kernels, the distribution of the state of the chain Pθk actually converges to the target zero-temperature distribution π (∞) as k → ∞ [12, 13, 14, 15]. Convergence to the zero-temperature distribution implies that asymptotically the state of the chain eventually coincides with a global optimizer with probability one. The difficulty which must be overcome in order to obtain finite step results on simulated annealing algorithms on a continuous domain is that usually, in an optimization problem defined over continuous variables, the set of global optimizers has zero Lebesgue measure (e.g. a set of isolated points). If the set of global optimizers has zero measure then the set of global optimizers has null probability according to the equilibrium distributions π (J) for any finite J and, as a consequence, according to the distributions Pθk for any finite k. Put another way, the probability that the state of the chain visits the set of global optimizers is constantly zero after any finite number of steps. Hence the confidence of the fact that the solution provided by the algorithm in finite time coincides with a global optimizer is also constantly zero. Notice that this is not the case for a finite domain, where the set of global optimizers is of non-null measure with respect to the reference counting measure [7, 8, 9, 10]. It is instructive to look at the issue also in terms of the rate of convergence to the target zerotemperature distribution. On a discrete domain, the distribution of the state of the chain at each step and the zero-temperature distribution are both standard discrete distributions. It is then possible to define a distance between them and study the rate of convergence of this distance to zero. This analysis allows one to obtain results on the finite-time behavior of simulated annealing [7, 8]. On a continuous domain and for a set of global optimizers of measure zero, the target zero-temperature distribution π (∞) ends up being a mixture of probability masses on the set of global optimizers. 
In this situation, although the distribution of the state of the chain Pθk still converges asymptotically to π (∞) , it is not possible to introduce a sensible distance between the two distributions and a rate of convergence to the target distribution cannot even be defined (weak convergence), see [12, Theorem 3.3]. This is the reason that until now there have been no guarantees on the performance of simulated annealing on a continuous domain after a finite number of computations: by adopting the zero-temperature distribution π (∞) as the target distribution it is only possible to prove asymptotic convergence in infinite time to a global optimizer. Remark 2: The standard distance between two distributions, say µ1 and µ2 , on a continuous support is the total variation norm µ1 − µ2 T V = supA |µ1 (A) − µ2 (A)|, see e.g. [21]. In simulated annealing on a continuous domain the distribution of the state of the chain Pθk is absolutely continuous with respect to the Lebesgue measure (i.e. πLeb (A) = 0 ⇒ Pθk (A) = 0), by construction for any finite k. Hence if the set of global optimizers has zero Lebesgue measure then it has zero measure also according to Pθk . The set of global optimizers has however measure 1 according to π (∞) . The distance Pθk − π (∞) T V is then constantly 1 for any finite k. It is also worth mentioning that if the set of global optimizers has zero measure then asymptotic convergence to the zero-temperature distribution π (∞) can be proven only under the additional assumptions of continuity and differentiability of U [12, 13, 14, 15]. 3 Finite-time guarantees In general, optimization algorithms for problems defined on continuous variables can only find approximate solutions in finite time [27]. Given an element θ of a continuous domain how can we assess how good it is as an approximate solution to an optimization problem? Here we introduce the concept of approximate global optimizer to answer this question. The definition is given for a maximization problem in a continuous but bounded domain. We use two parameters: the value imprecision ǫ (greater than or equal to 0) and the residual domain α (between 0 and 1) which together determine the level of approximation. We say that θ is an approximate global optimizer of U with value imprecision ǫ and residual domain α if the function U takes values strictly greater than U (θ) + ǫ only on a subset of values of θ no larger than an α portion of the optimization domain. The formal definition is as follows. Definition 1 Let U : Θ → R be an optimization criterion where Θ ⊂ RN is bounded. Let πLeb denote the standard Lebesgue measure. Let ǫ ≥ 0 and α ∈ [0, 1] be given numbers. Then θ is an approximate global optimizer of U with value imprecision ǫ and residual domain α if πLeb {θ′ ∈ Θ : U (θ′ ) > U (θ) + ǫ} ≤ α πLeb (Θ) . In other words, the value U (θ) is within ǫ of a value which is greater than the values that U takes on at least a 1 − α portion of the domain. The smaller ǫ and α are, the better is the approximation of a true global optimizer. If both α and ǫ are equal to zero then U (θ) coincides with the essential supremum of U . Our definition of approximate global optimizer carries an important property, which holds regardless of what the criterion U is: if ǫ and α have non-zero values then the set of approximate global optimizers always has non-zero Lebesgue measure. It follows that the probability that the chain visits the set of approximate global optimizers can be non-zero. 
Hence, it is sensible to study the confidence of the fact that the solution found by simulated annealing in finite time is an approximate global optimizer. Remark 3: The intuition that our notion of approximate global optimizer can be used to obtain formal guarantees on the finite-time performance of optimization methods based on a stochastic search of the domain is already apparent in the work of Vidyasagar [17, 28]. Vidyasagar [17, 28] introduces a similar definition and obtains rigorous finite-time guarantees in the optimization of ex- pected value criteria based on uniform independent sampling of the domain. Notably, the number of independent samples required to guarantee some desired accuracy and confidence turns out to be polynomial in the values of the desired imprecision, residual domain and confidence. Although the method of Vidyasagar is not highly sophisticated, it has had considerable success in solving difficult control system design applications [28, 29]. Its appeal stems from its rigorous finite-time guarantees which exist without the need for any particular assumption on the optimization criterion. Here we show that finite-time guarantees for simulated annealing can be obtained by selecting a distribution π (J) with a finite J as the target distribution in place of the zero-temperature distribution π (∞) . The fundamental result is the following theorem which allows one to select in a rigorous way δ and J in the target distribution π (J) . It is important to stress that the result holds universally for any optimization criterion U on a bounded domain. The only minor requirement is that U takes values in [0, 1]. Theorem 1 Let U : Θ → [0, 1] be an optimization criterion where Θ ⊂ RN is bounded. Let J ≥ 1 and δ > 0 be given numbers. Let θ be a multivariate random variable with distribution π (J) (dθ) ∝ [U (θ) + δ]J πLeb (dθ). Let α ∈ (0, 1] and ǫ ∈ [0, 1] be given numbers and define σ= 1 1+δ 1+ ǫ+1+δ J 1 1+δ 1+δ −1 α ǫ+δ δ . (1) Then the statement “θ is an approximate global optimizer of U with value imprecision ǫ and residual domain α” holds with probability at least σ. Proof. See Appendix A. The importance of the choice of a target distribution π (J) with a finite J is that π (J) is absolutely continuous with respect to the Lebesgue measure. Hence, the distance Pθk − π (J) TV between the distribution of the state of the chain Pθk and the target distribution π (J) is a meaningful quantity. Convergence of the Metropolis-Hastings algorithm and MCMC methods in total variation norm is a well studied problem. The theory provides simple conditions under which one derives upper bounds on the distance to the target distribution which are known at each step of the chain and decrease monotonically to zero as the number of steps of the chain grows. The theory has been developed mainly for homogeneous chains [18, 19, 20, 21]. In the case of simulated annealing, the factor that enables us to employ these results is the absolute continuity of the target distribution π (J) with respect to the Lebesgue measure. However, simulated annealing involves the simulation of inhomogeneous chains. In this respect, another important fact is that the choice of a target distribution π (J) with a finite J implies that the inhomogeneous Markov chain can in fact be formed by a finite sequence of homogeneous chains (i.e. the cooling schedule {Jk }k=1,2,... can be chosen to be a sequence that takes only a finite set of values). 
In turn, this allows one to apply the theory of homogeneous MCMC methods to study the convergence of Pθk to π (J) in total variation norm. On a bounded domain, simple conditions on the ‘proposal distribution’ in the iteration of the simulated annealing algorithm allows one to obtain upper bounds on Pθk − π (J) TV that decrease geometrically to zero as k → ∞, without the need for any additional assumption on U [18, 19, 20, 21]. It is then appropriate to introduce the following finite-time result. Theorem 2 Let the notation and assumptions of Theorem 1 hold. Let θk , with distribution Pθk , be the state of the inhomogeneous chain of a simulated annealing algorithm with target distribution π (J) . Then the statement “θk is an approximate global optimizer of U with value imprecision ǫ and residual domain α” holds with probability at least σ − Pθk − π (J) TV . The proof of the theorem follows directly from the definition of the total variation norm. It follows that if simulated annealing is implemented with an algorithm which converges in total variation distance to a target distribution π (J) with a finite J, then one can state with confidence arbitrarily close to 1 that the solution found by the algorithm after the known appropriate finite number of steps is an approximate global optimizer with the desired approximation level. For given non-zero values of ǫ, α the value of σ given by (1) can be made arbitrarily close to 1 by choice of J; while the distance Pθk − π (J) TV can be made arbitrarily small by taking the known sufficient number of steps. It can be shown that there exists the possibility of making an optimal choice of δ and J in the target distribution π (J) . In fact, for given ǫ and α and a given value of J there exists an optimal choice of δ which maximizes the value of σ given by (1). Hence, it is possible to obtain a desired σ with the smallest possible J. The advantage of choosing the smallest J, consistent with the required approximation and confidence, is that it will decrease the number of steps required to achieve the desired reduction of Pθk − π (J) TV . 4 Conclusions We have introduced a new formulation of simulated annealing which admits rigorous finite-time guarantees in the optimization of functions of continuous variables. First, we have introduced the notion of approximate global optimizer. Then, we have shown that simulated annealing is guaranteed to find approximate global optimizers, with the desired confidence and the desired level of accuracy, in a known finite number of steps, if a proper choice of the target distribution is made and conditions for convergence in total variation norm are met. The results hold for any optimization criterion on a bounded domain with the only minor requirement that it takes values between 0 and 1. In this framework, simulated annealing algorithms with rigorous finite-time guarantees can be derived by studying the choice of the proposal distribution and of the cooling schedule, in the generic iteration of simulated annealing, in order to ensure convergence to the target distribution in total variation norm. To do this, existing theory of convergence of the Metropolis-Hastings algorithm and MCMC methods on continuous domains can be used [18, 19, 20, 21]. 
Vidyasagar [17, 28] has introduced a similar definition of approximate global optimizer and has shown that approximate optimizers with desired accuracy and confidence can be obtained with a number of uniform independent samples of the domain which is polynomial in the accuracy and confidence parameters. In general, algorithms developed with the MCMC methodology can be expected to be equally or more efficient than uniform independent sampling. Acknowledgments Work supported by EPSRC, Grant EP/C014006/1, and by the European Commission under projects HYGEIA FP6-NEST-4995 and iFly FP6-TREN-037180. We thank S. Brooks, M. Vidyasagar and D. M. Wolpert for discussions and useful comments on the paper. A Proof of Theorem 1 Let α ∈ (0, 1] and ρ ∈ (0, 1] be given numbers. Let Uδ (θ) := U (θ) + δ. Let πδ be a normalized ¯ measure such that πδ (dθ) ∝ Uδ (θ)πLeb (dθ). In the first part of the proof we find a lower bound on the probability that θ belongs to the set {θ ∈ Θ : πδ {θ′ ∈ Θ : ρ Uδ (θ′ ) > Uδ (θ)} ≤ α} . ¯ Let yα := inf{y : πδ {θ ∈ Θ : Uδ (θ) ≤ y} ≥ 1 − α}. To start with we show that the set ¯ ¯ {θ ∈ Θ : πδ {θ′ ∈ Θ : ρ Uδ (θ′ ) > Uδ (θ)} ≤ α} coincides with {θ ∈ Θ : Uδ (θ) ≥ ρ yα }. Notice ¯ ¯ that the quantity πδ {θ ∈ Θ : Uδ (θ) ≤ y} is a right-continuous non-decreasing function of y because it has the form of a distribution function (see e.g. [30, p.162] and [17, Lemma 11.1]). Therefore we have πδ {θ ∈ Θ : Uδ (θ) ≤ yα } ≥ 1 − α and ¯ ¯ y ≥ ρ yα ⇒ πδ {θ′ ∈ Θ : ρ Uδ (θ′ ) ≤ y} ≥ 1 − α ⇒ πδ {θ′ ∈ Θ : ρ Uδ (θ′ ) > y} ≤ α . ¯ ¯ ¯ Moreover, y < ρ yα ⇒ πδ {θ′ ∈ Θ : ρ Uδ (θ′ ) ≤ y} < 1 − α ⇒ πδ {θ′ ∈ Θ : ρ Uδ (θ′ ) > y} > α ¯ ¯ ¯ and taking the contrapositive one obtains πδ {θ′ ∈ Θ : ρ Uδ (θ′ ) > y} ≤ α ⇒ y ≥ ρ yα . ¯ ¯ ′ Therefore {θ ∈ Θ : Uδ (θ) ≥ ρ yα } ≡ {θ ∈ Θ : πδ {θ ∈ Θ : ρ Uδ (θ′ ) > Uδ (θ)} ≤ α}. ¯ ¯ We now derive a lower bound on π (J) {θ ∈ Θ : Uδ (θ) ≥ ρ yα }. Let us introduce the notation ¯ ¯¯ Aα := {θ ∈ Θ : Uδ (θ) < yα }, Aα := {θ ∈ Θ : Uδ (θ) ≥ yα }, Bα,ρ := {θ ∈ Θ : Uδ (θ) < ρ yα } ¯ ¯ ¯ ¯ ¯ ¯α,ρ := {θ ∈ Θ : Uδ (θ) ≥ ρ yα }. Notice that Bα,ρ ⊆ Aα and Aα ⊆ Bα,ρ . The quantity ¯¯ ¯¯ and B ¯ ¯ ¯ ¯ πδ {θ ∈ Θ : Uδ (θ) < y} as a function of y is the left-continuous version of πδ {θ ∈ Θ : Uδ (θ) ≤ ¯¯ y}[30, p.162]. Hence, the definition of yα implies πδ (Aα ) ≤ 1 − α and πδ (Aα ) ≥ α. Notice that ¯ ¯ ¯ ¯ δπLeb (Aα ) ¯ ≤ 1−α, ¯ πδ (Aα ) ≤ 1 − α ⇒ ¯ ¯ Uδ (θ)πLeb (dθ) Θ ¯¯ πδ (Aα ) ≥ α ¯ ¯¯ (1 + δ)πLeb (Aα ) ≥ α. ¯ U (θ)πLeb (dθ) Θ δ ⇒ ¯¯ Hence, πLeb (Aα ) > 0 and πLeb (Aα ) 1−α1+δ ¯ ¯ . ¯α ) ≤ α ¯ δ πLeb (A ¯ ¯¯ ¯¯ Notice that πLeb (Aα ) > 0 implies πLeb (Bα,ρ ) > 0. We obtain π (J) {θ ∈ Θ : Uδ (θ) ≥ ρ yα } = ¯ 1+ 1 ≥ Uδ (θ)J πLeb (dθ) Bα,ρ ¯ ¯¯ Bα,ρ ≥ 1+ J ρ J yα ¯ J yα ¯ Uδ (θ)J πLeb (dθ) 1+ 1 Uδ (θ)J πLeb (dθ) Bα,ρ ¯ ¯¯ Aα Uδ (θ)J πLeb (dθ) 1 1 1 ≥ ≥ . 1−α1+δ ¯ πLeb (Aα ) πLeb (Bα,ρ ) ¯ ¯ 1 + ρJ 1 + ρJ ¯¯ ¯¯ α ¯ δ πLeb (Aα ) πLeb (Aα ) Since {θ ∈ Θ : Uδ (θ) ≥ ρ yα } ≡ {θ ∈ Θ : πδ {θ′ ∈ Θ : ρ Uδ (θ′ ) > Uδ (θ)} ≤ α} the first part of ¯ ¯ the proof is complete. In the second part of the proof we show that the set {θ ∈ Θ : πδ {θ′ ∈ Θ : ρ Uδ (θ′ ) > Uδ (θ)} ≤ α} is contained in the set of approximate global optimizers of U with value imprecision ¯ ǫ := (ρ−1 − 1)(1 + δ) and residual domain α := 1+δ α. Hence, we show that {θ ∈ Θ : πδ {θ′ ∈ ˜ ˜ ǫ+δ ¯ ˜ Θ : ρ Uδ (θ′ ) > Uδ (θ)} ≤ α} ⊆ {θ ∈ Θ : πLeb {θ′ ∈ Θ : U (θ′ ) > U (θ) + ǫ} ≤ α πLeb (Θ)}. 
We ¯ ˜ ˜ have U (θ′ ) > U (θ) + ǫ ⇔ ρ Uδ (θ′ ) > ρ [Uδ (θ) + ǫ] ⇒ ρ Uδ (θ′ ) > Uδ (θ) ˜ ˜ which is proven by noticing that ρ [Uδ (θ) + ǫ] ≥ Uδ (θ) ⇔ 1 − ρ ≥ U (θ)(1 − ρ) ˜ and U (θ) ∈ [0, 1]. Hence {θ′ ∈ Θ : ρ Uδ (θ′ ) > Uδ (θ)} ⊇ {θ′ ∈ Θ : U (θ′ ) > U (θ) + ǫ} . ˜ Therefore πδ {θ′ ∈ Θ : ρ Uδ (θ′ ) > Uδ (θ)} ≤ α ⇒ πδ {θ′ ∈ Θ : U (θ′ ) > U (θ) + ǫ} ≤ α . Let ¯ ˜ ¯ Qθ,˜ := {θ′ ∈ Θ : U (θ′ ) > U (θ) + ǫ} and notice that ˜ ǫ U (θ′ )πLeb (dθ′ ) + δπLeb (Qθ,˜) ǫ πδ {θ′ ∈ Θ : U (θ′ ) > U (θ) + ǫ} = ˜ Qθ,˜ ǫ . U (θ′ )πLeb (dθ′ ) + δπLeb (Θ) Θ We obtain πδ {θ′ ∈ Θ : U (θ′ ) > U (θ) + ǫ} ≤ α ⇒ ǫ πLeb (Qθ,˜) + δπLeb (Qθ,˜) ≤ α(1 + δ)πLeb (Θ) ˜ ¯ ˜ ¯ ǫ ǫ ⇒ πLeb {θ′ ∈ Θ : U (θ′ ) > U (θ) + ǫ} ≤ α πLeb (Θ) . ˜ ˜ Hence we can conclude that πδ {θ′ ∈ Θ : ρ Uδ (θ′ ) > Uδ (θ)} ≤ α ⇒ πLeb {θ′ ∈ Θ : U (θ′ ) > U (θ) + ǫ} ≤ α πLeb (Θ) ¯ ˜ ˜ and the second part of the proof is complete. We have shown that given α ∈ (0, 1], ρ ∈ (0, 1], ǫ := (ρ−1 − 1)(1 + δ), α := ¯ ˜ ˜ σ := 1 = 1−α1+δ ¯ 1+δ 1+ρ 1+ α ¯ δ ǫ+1+δ ˜ J 1+δ ǫ+δ ˜ 1 J 1 1+δ 1+δ −1 α ǫ+δ ˜˜ δ α and ¯ , the statement “θ is an approximate global optimizer of U with value imprecision ǫ and residual ˜ domain α” holds with probability at least σ. Notice that ǫ ∈ [0, 1] and α ∈ (0, 1] are linked through ˜ ˜ ˜ ǫ+δ ˜ a bijective relation to ρ ∈ [ 1+δ , 1] and α ∈ (0, 1+δ ]. The statement of the theorem is eventually ¯ 2+δ obtained by expressing σ as a function of desired ǫ = ǫ and α = α. ˜ ˜ References [1] D. J. Wales. Energy Landscapes. Cambridge University Press, Cambridge, UK, 2003. [2] D. Achlioptas, A. Naor, and Y. Peres. Rigorous location of phase transitions in hard optimization problems. Nature, 435:759–764, 2005. [3] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. 220(4598):671–680, 1983. Optimization by Simulated Annealing. Science, [4] E. Bonomi and J. Lutton. The N -city travelling salesman problem: statistical mechanics and the Metropolis algorithm. SIAM Rev., 26(4):551–568, 1984. [5] Y. Fu and P. W. Anderson. Application of statistical mechanics to NP-complete problems in combinatorial optimization. J. Phys. A: Math. Gen., 19(9):1605–1620, 1986. [6] M. M´ zard, G. Parisi, and R. Zecchina. Analytic and Algorithmic Solution of Random Satisfiability e Problems. Science, 297:812–815, 2002. [7] P. M. J. van Laarhoven and E. H. L. Aarts. Simulated Annealing: Theory and Applications. D. Reidel Publishing Company, Dordrecht, Holland, 1987. [8] D. Mitra, F. Romeo, and A. Sangiovanni-Vincentelli. Convergence and finite-time behavior of simulated annealing. Adv. Appl. Prob., 18:747–771, 1986. [9] B. Hajek. Cooling schedules for optimal annealing. Math. Oper. Res., 13:311–329, 1988. [10] J. Hannig, E. K. P. Chong, and S. R. Kulkarni. Relative Frequencies of Generalized Simulated Annealing. Math. Oper. Res., 31(1):199–216, 2006. [11] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, UK, 2004. [12] H. Haario and E. Saksman. Simulated annealing process in general state space. Adv. Appl. Prob., 23:866– 893, 1991. [13] S. B. Gelfand and S. K. Mitter. Simulated Annealing Type Algorithms for Multivariate Optimization. Algorithmica, 6:419–436, 1991. [14] C. Tsallis and D. A. Stariolo. Generalized simulated annealing. Physica A, 233:395–406, 1996. [15] M. Locatelli. Simulated Annealing Algorithms for Continuous Global Optimization: Convergence Conditions. J. Optimiz. Theory App., 104(1):121–133, 2000. [16] V. N. Vapnik. The Nature of Statistical Learning Theory. Cambridge University Press, Springer, New York, US, 1995. [17] M. 
References

[1] D. J. Wales. Energy Landscapes. Cambridge University Press, Cambridge, UK, 2003.
[2] D. Achlioptas, A. Naor, and Y. Peres. Rigorous location of phase transitions in hard optimization problems. Nature, 435:759–764, 2005.
[3] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by Simulated Annealing. Science, 220(4598):671–680, 1983.
[4] E. Bonomi and J. Lutton. The N-city travelling salesman problem: statistical mechanics and the Metropolis algorithm. SIAM Rev., 26(4):551–568, 1984.
[5] Y. Fu and P. W. Anderson. Application of statistical mechanics to NP-complete problems in combinatorial optimization. J. Phys. A: Math. Gen., 19(9):1605–1620, 1986.
[6] M. Mézard, G. Parisi, and R. Zecchina. Analytic and Algorithmic Solution of Random Satisfiability Problems. Science, 297:812–815, 2002.
[7] P. M. J. van Laarhoven and E. H. L. Aarts. Simulated Annealing: Theory and Applications. D. Reidel Publishing Company, Dordrecht, Holland, 1987.
[8] D. Mitra, F. Romeo, and A. Sangiovanni-Vincentelli. Convergence and finite-time behavior of simulated annealing. Adv. Appl. Prob., 18:747–771, 1986.
[9] B. Hajek. Cooling schedules for optimal annealing. Math. Oper. Res., 13:311–329, 1988.
[10] J. Hannig, E. K. P. Chong, and S. R. Kulkarni. Relative Frequencies of Generalized Simulated Annealing. Math. Oper. Res., 31(1):199–216, 2006.
[11] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, UK, 2004.
[12] H. Haario and E. Saksman. Simulated annealing process in general state space. Adv. Appl. Prob., 23:866–893, 1991.
[13] S. B. Gelfand and S. K. Mitter. Simulated Annealing Type Algorithms for Multivariate Optimization. Algorithmica, 6:419–436, 1991.
[14] C. Tsallis and D. A. Stariolo. Generalized simulated annealing. Physica A, 233:395–406, 1996.
[15] M. Locatelli. Simulated Annealing Algorithms for Continuous Global Optimization: Convergence Conditions. J. Optimiz. Theory App., 104(1):121–133, 2000.
[16] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1995.
[17] M. Vidyasagar. Learning and Generalization: With Application to Neural Networks. Springer-Verlag, London, second edition, 2003.
[18] S. P. Meyn and R. L. Tweedie. Markov Chains and Stochastic Stability. Springer-Verlag, London, 1993.
[19] J. S. Rosenthal. Minorization Conditions and Convergence Rates for Markov Chain Monte Carlo. J. Am. Stat. Assoc., 90(430):558–566, 1995.
[20] K. L. Mengersen and R. L. Tweedie. Rates of convergence of the Hastings and Metropolis algorithm. Ann. Stat., 24(1):101–121, 1996.
[21] G. O. Roberts and J. S. Rosenthal. General state space Markov chains and MCMC algorithms. Prob. Surv., 1:20–71, 2004.
[22] C. P. Robert and G. Casella. Monte Carlo Statistical Methods. Springer-Verlag, New York, second edition, 2004.
[23] D. J. Spiegelhalter, K. R. Abrams, and J. P. Myles. Bayesian approaches to clinical trials and health-care evaluation. John Wiley & Sons, Chichester, UK, 2004.
[24] A. Lecchini-Visintini, W. Glover, J. Lygeros, and J. M. Maciejowski. Monte Carlo Optimization for Conflict Resolution in Air Traffic Control. IEEE Trans. Intell. Transp. Syst., 7(4):470–482, 2006.
[25] P. Müller. Simulation based optimal design. In J. O. Berger, J. M. Bernardo, A. P. Dawid, and A. F. M. Smith, editors, Bayesian Statistics 6: proceedings of the Sixth Valencia International Meeting, pages 459–474. Oxford: Clarendon Press, 1999.
[26] P. Müller, B. Sansó, and M. De Iorio. Optimal Bayesian design by Inhomogeneous Markov Chain Simulation. J. Am. Stat. Assoc., 99(467):788–798, 2004.
[27] L. Blum, F. Cucker, M. Shub, and S. Smale. Complexity and Real Computation. Springer-Verlag, New York, 1998.
[28] M. Vidyasagar. Randomized algorithms for robust controller synthesis using statistical learning theory. Automatica, 37(10):1515–1528, 2001.
[29] R. Tempo, G. Calafiore, and F. Dabbene. Randomized Algorithms for Analysis and Control of Uncertain Systems. Springer-Verlag, London, 2005.
[30] B. V. Gnedenko. Theory of Probability. Chelsea, New York, fourth edition, 1968.
6 0.43051377 159 nips-2007-Progressive mixture rules are deviation suboptimal
7 0.41469508 7 nips-2007-A Kernel Statistical Test of Independence
8 0.40055335 192 nips-2007-Testing for Homogeneity with Kernel Fisher Discriminant Analysis
9 0.38509938 44 nips-2007-Catching Up Faster in Bayesian Model Selection and Model Averaging
10 0.38341457 118 nips-2007-Learning with Transformation Invariant Kernels
11 0.36565948 67 nips-2007-Direct Importance Estimation with Model Selection and Its Application to Covariate Shift Adaptation
12 0.3554796 62 nips-2007-Convex Learning with Invariances
13 0.35276639 184 nips-2007-Stability Bounds for Non-i.i.d. Processes
14 0.31247357 193 nips-2007-The Distribution Family of Similarity Distances
15 0.3069016 129 nips-2007-Mining Internet-Scale Software Repositories
16 0.3064957 200 nips-2007-The Tradeoffs of Large Scale Learning
17 0.29843587 21 nips-2007-Adaptive Online Gradient Descent
18 0.28531986 66 nips-2007-Density Estimation under Independent Similarly Distributed Sampling Assumptions
19 0.28052002 120 nips-2007-Linear programming analysis of loopy belief propagation for weighted matching
20 0.28020436 42 nips-2007-CPR for CSPs: A Probabilistic Relaxation of Constraint Propagation
topicId topicWeight
[(5, 0.021), (13, 0.031), (16, 0.032), (19, 0.013), (21, 0.044), (34, 0.024), (35, 0.05), (47, 0.047), (49, 0.033), (67, 0.403), (83, 0.13), (90, 0.053)]
simIndex simValue paperId paperTitle
same-paper 1 0.69156289 101 nips-2007-How SVMs can estimate quantiles and the median
Author: Andreas Christmann, Ingo Steinwart
Abstract: We investigate quantile regression based on the pinball loss and the ǫ-insensitive loss. For the pinball loss a condition on the data-generating distribution P is given that ensures that the conditional quantiles are approximated with respect to · 1 . This result is then used to derive an oracle inequality for an SVM based on the pinball loss. Moreover, we show that SVMs based on the ǫ-insensitive loss estimate the conditional median only under certain conditions on P . 1
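A quick numeric illustration of the abstract's starting point — that minimizing the expected pinball loss over constants recovers the τ-quantile — can be given in a few lines. This is a toy sketch, not the paper's SVM estimator; the sample size, distribution and search grid are arbitrary choices.

```python
import numpy as np

def pinball(y, t, tau):
    """tau-pinball loss L_tau(y, t) = psi_tau(y - t), with
    psi_tau(r) = tau * r for r >= 0 and (tau - 1) * r for r < 0."""
    r = y - t
    return np.where(r >= 0, tau * r, (tau - 1) * r)

# The constant minimizing the empirical pinball risk should be close to the
# empirical tau-quantile; check on a skewed (exponential) sample.
rng = np.random.default_rng(0)
y = rng.exponential(scale=1.0, size=50_000)
tau = 0.8
grid = np.linspace(0.0, 5.0, 1001)
risks = np.array([pinball(y, t, tau).mean() for t in grid])
print(grid[risks.argmin()], np.quantile(y, tau))  # both near -ln(0.2) ~ 1.609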
2 0.48703894 209 nips-2007-Ultrafast Monte Carlo for Statistical Summations
Author: Charles L. Isbell, Michael P. Holmes, Alexander G. Gray
Abstract: Machine learning contains many computational bottlenecks in the form of nested summations over datasets. Kernel estimators and other methods are burdened by these expensive computations. Exact evaluation is typically O(n^2) or higher, which severely limits application to large datasets. We present a multi-stage stratified Monte Carlo method for approximating such summations with probabilistic relative error control. The essential idea is fast approximation by sampling in trees. This method differs from many previous scalability techniques (such as standard multi-tree methods) in that its error is stochastic, but we derive conditions for error control and demonstrate that they work. Further, we give a theoretical sample complexity for the method that is independent of dataset size, and show that this appears to hold in experiments, where speedups reach as high as 10^14, many orders of magnitude beyond the previous state of the art.
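To see the flavor of the approach, here is a deliberately naive single-stage sketch: plain Monte Carlo over the inner summation of a kernel sum, with a CLT-based relative-error stopping rule. It is an assumption-laden baseline only — the kernel, bandwidth and stopping constants are made up, and the paper's contribution is the stratified, multi-stage sampling in trees that makes this idea scale far beyond it.

```python
import numpy as np

def mc_kernel_sum(q, x, bw=0.5, eps=0.01, z=2.576, batch=2048, seed=0):
    """Naive Monte Carlo estimate of S = sum_j exp(-(q - x[j])^2 / (2 bw^2)),
    the inner summation of a kernel density estimate. Sampling stops once the
    CLT half-width is below a relative tolerance eps (z = 2.576 gives roughly
    99% confidence)."""
    rng = np.random.default_rng(seed)
    n = len(x)
    vals = np.empty(0)
    while True:
        idx = rng.integers(0, n, size=batch)
        vals = np.concatenate([vals, np.exp(-(q - x[idx]) ** 2 / (2 * bw ** 2))])
        est = n * vals.mean()                              # unbiased estimate of S
        half = z * n * vals.std(ddof=1) / np.sqrt(len(vals))
        if half <= eps * est:
            return est

x = np.random.default_rng(1).normal(size=1_000_000)
print(mc_kernel_sum(0.0, x), np.exp(-x ** 2 / 0.5).sum())  # estimate vs. exact
```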
3 0.37630016 186 nips-2007-Statistical Analysis of Semi-Supervised Regression
Author: Larry Wasserman, John D. Lafferty
Abstract: Semi-supervised methods use unlabeled data in addition to labeled data to construct predictors. While existing semi-supervised methods have shown some promising empirical performance, their development has been largely based on heuristics. In this paper we study semi-supervised learning from the viewpoint of minimax theory. Our first result shows that some common methods based on regularization using graph Laplacians do not lead to faster minimax rates of convergence. Thus, the estimators that use the unlabeled data do not have smaller risk than the estimators that use only labeled data. We then develop several new approaches that provably lead to improved performance. The statistical tools of minimax analysis are thus used to offer some new perspective on the problem of semi-supervised learning.
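For concreteness, a minimal version of the kind of graph-Laplacian regularized estimator the abstract refers to: fit the labeled points while penalizing roughness of the fitted function over a similarity graph built on labeled and unlabeled points together. This is a generic sketch with made-up data, bandwidth and regularization strength, not the authors' exact formulation.

```python
import numpy as np

# Solve  min_f  sum_{i<=l} (f_i - y_i)^2 + lam * f' L f,  L = D - W,
# where W is a similarity graph over all n points and only l carry labels.
rng = np.random.default_rng(0)
n, l, lam = 200, 20, 0.1
x = rng.uniform(0, 1, n)
y_true = np.sin(2 * np.pi * x)
y = y_true[:l] + 0.1 * rng.normal(size=l)            # only l noisy labels

W = np.exp(-(x[:, None] - x[None, :]) ** 2 / 0.01)   # Gaussian similarity graph
L = np.diag(W.sum(axis=1)) - W                       # combinatorial Laplacian

S = np.zeros((n, n)); S[:l, :l] = np.eye(l)          # selects the labeled points
f = np.linalg.solve(S + lam * L, np.concatenate([y, np.zeros(n - l)]))
print(np.abs(f - y_true).mean())                     # error on all n points
```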
4 0.3731409 40 nips-2007-Bundle Methods for Machine Learning
Author: Quoc V. Le, Alex J. Smola, S.v.n. Vishwanathan
Abstract: We present a globally convergent method for regularized risk minimization problems. Our method applies to Support Vector estimation, regression, Gaussian Processes, and any other regularized risk minimization setting which leads to a convex optimization problem. SVMPerf can be shown to be a special case of our approach. In addition to the unified framework we present tight convergence bounds, which show that our algorithm converges in O(1/ǫ) steps to precision ǫ for general convex problems and in O(log(1/ǫ)) steps for continuously differentiable problems. We demonstrate in experiments the performance of our approach.
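A stripped-down cutting-plane iteration of the kind bundle methods build on can be sketched as follows. The toy below is one-dimensional and solves the master problem by grid search purely to keep the sketch short; the data, parameters and tolerance are invented, and the paper's O(1/ǫ) and O(log(1/ǫ)) guarantees concern the proper bundle solver, not this simplification.

```python
import numpy as np

# Minimize f(w) = (lam/2) w^2 + R(w), R = mean hinge loss on toy 1-D data.
# Each step adds the linearization (cut) of R at the current iterate and
# minimizes the resulting piecewise-linear-plus-quadratic lower bound.
rng = np.random.default_rng(0)
X = rng.normal(size=500); Y = np.sign(X + 0.3 * rng.normal(size=500))
lam = 0.1

def R(w):
    """Empirical hinge risk and one subgradient at w."""
    m = 1 - Y * (w * X)
    return np.maximum(m, 0).mean(), (-(Y * X) * (m > 0)).mean()

cuts, w, grid = [], 0.0, np.linspace(-10, 10, 20001)
for it in range(30):
    r, g = R(w)
    cuts.append((g, r - g * w))                      # cut: R(v) >= g*v + b
    lower = lam / 2 * grid ** 2 + np.max([a * grid + b for a, b in cuts], axis=0)
    gap = (lam / 2 * w ** 2 + r) - lower.min()       # upper minus lower bound
    if gap < 1e-4:
        break
    w = grid[lower.argmin()]                         # minimize the master problem
print(w, lam / 2 * w ** 2 + R(w)[0])
```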
5 0.37262341 156 nips-2007-Predictive Matrix-Variate t Models
Author: Shenghuo Zhu, Kai Yu, Yihong Gong
Abstract: It is becoming increasingly important to learn from a partially-observed random matrix and predict its missing elements. We assume that the entire matrix is a single sample drawn from a matrix-variate t distribution and suggest a matrix-variate t model (MVTM) to predict those missing elements. We show that MVTM generalizes a range of known probabilistic models, and automatically performs model selection to encourage sparse predictive models. Due to the non-conjugacy of its prior, it is difficult to make predictions by computing the mode or mean of the posterior distribution. We suggest an optimization method that sequentially minimizes a convex upper-bound of the log-likelihood, which is very efficient and scalable. Experiments on toy data and the EachMovie dataset show good predictive accuracy of the model.
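The "sequentially minimize a convex upper bound" step is the classic majorize-minimize pattern. A scalar toy example (illustrative only, not the MVTM updates) shows the mechanics: majorize the non-convex part by a convex bound that is tight at the current point, minimize the bound exactly, and repeat.

```python
import numpy as np

# Toy objective: F(t) = (t - a)^2 + log(1 + t^2), non-convex in its second term.
# Since log is concave, log(1 + t^2) <= log(1 + tk^2) + (t^2 - tk^2)/(1 + tk^2),
# a convex quadratic bound tight at tk; each step minimizes it in closed form.
a, t = 2.0, 5.0
for k in range(50):
    t_new = a * (1 + t ** 2) / (2 + t ** 2)   # argmin of the quadratic bound
    if abs(t_new - t) < 1e-10:
        break
    t = t_new

F = lambda s: (s - a) ** 2 + np.log(1 + s ** 2)
print(t, F(t))  # F decreases monotonically along the iterates
```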
6 0.37231919 16 nips-2007-A learning framework for nearest neighbor search
7 0.37198457 190 nips-2007-Support Vector Machine Classification with Indefinite Kernels
8 0.37111121 21 nips-2007-Adaptive Online Gradient Descent
9 0.37045565 63 nips-2007-Convex Relaxations of Latent Variable Training
10 0.37042826 65 nips-2007-DIFFRAC: a discriminative and flexible framework for clustering
11 0.37028018 49 nips-2007-Colored Maximum Variance Unfolding
12 0.3692086 180 nips-2007-Sparse Feature Learning for Deep Belief Networks
13 0.36890098 134 nips-2007-Multi-Task Learning via Conic Programming
14 0.36865592 178 nips-2007-Simulated Annealing: Rigorous finite-time guarantees for optimization on continuous domains
15 0.3683154 116 nips-2007-Learning the structure of manifolds using random projections
16 0.36813876 177 nips-2007-Simplified Rules and Theoretical Analysis for Information Bottleneck Optimization and PCA with Spiking Neurons
17 0.36788404 175 nips-2007-Semi-Supervised Multitask Learning
18 0.36755246 46 nips-2007-Cluster Stability for Finite Samples
19 0.36744964 185 nips-2007-Stable Dual Dynamic Programming
20 0.36739159 200 nips-2007-The Tradeoffs of Large Scale Learning