nips nips2007 nips2007-101 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Andreas Christmann, Ingo Steinwart
Abstract: We investigate quantile regression based on the pinball loss and the ǫ-insensitive loss. For the pinball loss a condition on the data-generating distribution P is given that ensures that the conditional quantiles are approximated with respect to ‖·‖_1. This result is then used to derive an oracle inequality for an SVM based on the pinball loss. Moreover, we show that SVMs based on the ǫ-insensitive loss estimate the conditional median only under certain conditions on P.
Reference: text
sentIndex sentText sentNum sentScore
1 How SVMs can estimate quantiles and the median Ingo Steinwart Information Sciences Group CCS-3 Los Alamos National Laboratory Los Alamos, NM 87545, USA ingo@lanl. [sent-1, score-0.099]
2 Abstract We investigate quantile regression based on the pinball loss and the ǫ-insensitive loss. [sent-5, score-0.351]
3 For the pinball loss a condition on the data-generating distribution P is given that ensures that the conditional quantiles are approximated with respect to ‖·‖_1. [sent-6, score-0.28]
4 This result is then used to derive an oracle inequality for an SVM based on the pinball loss. [sent-7, score-0.248]
5 Moreover, we show that SVMs based on the ǫ-insensitive loss estimate the conditional median only under certain conditions on P . [sent-8, score-0.16]
6 The goal of quantile regression is to estimate the conditional quantile, i. [sent-10, score-0.18]
7 Let us now consider the so-called τ-pinball loss L_τ : R × R → [0, ∞) defined by L_τ(y, t) := ψ_τ(y − t), where ψ_τ(r) = (τ − 1)r if r < 0 and ψ_τ(r) = τr if r ≥ 0. [sent-16, score-0.089]
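The pinball loss is simple to implement. As a minimal illustration (our Python sketch, not from the paper; all names are ours), the following evaluates ψ_τ and checks numerically that minimizing the empirical pinball risk over constant predictions recovers the sample τ-quantile:

```python
import numpy as np

def pinball_loss(y, t, tau):
    """L_tau(y, t) = psi_tau(y - t) with psi_tau(r) = (tau - 1) r for r < 0
    and psi_tau(r) = tau * r for r >= 0."""
    r = np.asarray(y) - t
    return np.where(r < 0, (tau - 1.0) * r, tau * r)

rng = np.random.default_rng(0)
y = rng.standard_normal(10_000)
tau = 0.8

# Minimize the empirical pinball risk over constant predictions t on a grid.
grid = np.linspace(-3.0, 3.0, 2001)
risks = np.array([pinball_loss(y, t, tau).mean() for t in grid])
print(grid[np.argmin(risks)], np.quantile(y, tau))  # both near the 0.8-quantile of N(0,1)
```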
8 R_{L_τ,P}(f^*_{τ,P}) = inf R_{L_τ,P}(f) =: R^*_{L_τ,P}, where the infimum is taken over all f : X → R. [sent-20, score-0.106]
9 Empirical methods estimating quantiles with the help of the pinball loss typically obtain functions f_D for which R_{L_τ,P}(f_D) is close to R^*_{L_τ,P} with high probability. [sent-31, score-0.225]
10 18]), and hence there is so far little justification for using f_D as an estimate of the quantile function. [sent-33, score-0.169]
11 Our goal is to address this issue by showing that under certain realistic assumptions on P we have an inequality of the form ‖f − f^*_{τ,P}‖_{L_1(P_X)} ≤ c_P ( R_{L_τ,P}(f) − R^*_{L_τ,P} )^{1/2}. (2) [sent-34, score-0.067]
12 We then use this inequality to establish an oracle inequality for SVMs defined by (1). [sent-35, score-0.234]
13 In addition, we illustrate how this oracle inequality can be used to obtain learning rates and to justify a data-dependent method for finding the hyper-parameters λ and H. [sent-36, score-0.131]
14 Finally, we generalize the methods for establishing (2) to investigate the role of ǫ in the ǫ-insensitive loss used in standard SVM regression. [sent-37, score-0.089]
15 Given a distribution P on X × Y we further assume throughout this paper that the σ-algebra on X is complete with respect to the marginal distribution PX of P, i. [sent-39, score-0.081]
16 Definition 2.1. A distribution Q on R is said to have a τ-quantile of type α > 0 if there exist a τ-quantile t^* ∈ R and a constant c_Q > 0 such that for all s ∈ [0, α] we have Q((t^*, t^* + s)) ≥ c_Q s and Q((t^* − s, t^*)) ≥ c_Q s. (3) [sent-44, score-0.096]
17 It is not difficult to see that a distribution Q having a τ-quantile of some type α has a unique τ-quantile t^*. [sent-45, score-0.241]
18 Moreover, if Q has a Lebesgue density h_Q then Q has a τ-quantile of type α if h_Q is bounded away from zero on [t^* − α, t^* + α], since we can use c_Q := inf{ h_Q(t) : t ∈ [t^* − α, t^* + α] } in (3). [sent-46, score-0.076]
19 This assumption is general enough to cover many distributions used in parametric statistics such as Gaussian, Student’s t, and logistic distributions (with Y = R), Gamma and log-normal distributions (with Y = [0, ∞)), and uniform and Beta distributions (with Y = [0, 1]). [sent-47, score-0.108]
20 The following definition describes distributions on X × Y whose conditional distributions P(·|x), x ∈ X, have a τ-quantile of the same type α. [sent-48, score-0.165]
21 Definition 2.2. A distribution P on X × Y is said to have a τ-quantile of p-average type α if Q_x := P(·|x) has P_X-almost surely a τ-quantile of type α and the function b : X → (0, ∞) defined by b(x) := c_{P(·|x)}, where c_{P(·|x)} is the constant in (3), satisfies b^{−1} ∈ L_p(P_X). [sent-51, score-0.201]
22 Let us now give some examples of distributions having τ-quantiles of p-average type α. [sent-52, score-0.125]
23 Example 2.3. Let P be a distribution on X × R with marginal distribution P_X and regular conditional probability Q_x((−∞, y]) := 1/(1 + e^{−z}), y ∈ R, where z := (y − m(x))/σ(x), m : X → R describes a location shift, and σ : X → [β, 1/β] describes a scale modification for some constant β ∈ (0, 1]. [sent-54, score-0.13]
24 Some calculations show b(x) = min{ h_{Q_x}(t^*(x) − α), h_{Q_x}(t^*(x) + α) } = min{ u_1(x)/(1 + u_1(x))², u_2(x)/(1 + u_2(x))² } ∈ [c_{α,β}, 1/4], where u_1(x) := ((1 − τ)/τ) e^{−α/σ(x)}, u_2(x) := ((1 − τ)/τ) e^{α/σ(x)}, and c_{α,β} > 0 can be chosen independently of x because σ(x) ∈ [β, 1/β]. [sent-59, score-0.05]
25 Hence b−1 ∈ L∞ (PX ) and P has a τ -quantile of ∞-average type α. [sent-60, score-0.076]
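The claimed expression for b(x) can be sanity-checked numerically. The sketch below (our code, with illustrative values for τ, α, m(x), σ(x)) differentiates the conditional logistic cdf and compares it with the u_1, u_2 expression; note that our density carries an explicit 1/σ(x) factor, which only shifts the constant c_{α,β} since σ(x) ∈ [β, 1/β]:

```python
import numpy as np

tau, alpha = 0.7, 0.5          # illustrative values (ours)
m_x, sigma_x = 1.3, 0.8        # location shift m(x) and scale sigma(x) in [beta, 1/beta]

def cdf(y):                    # Q_x((-inf, y]) = 1 / (1 + e^{-z}), z = (y - m(x)) / sigma(x)
    return 1.0 / (1.0 + np.exp(-(y - m_x) / sigma_x))

def density(y, h=1e-6):        # numerical derivative of the conditional cdf
    return (cdf(y + h) - cdf(y - h)) / (2.0 * h)

t_star = m_x + sigma_x * np.log(tau / (1.0 - tau))   # tau-quantile: cdf(t_star) = tau

u1 = (1.0 - tau) / tau * np.exp(-alpha / sigma_x)
u2 = (1.0 - tau) / tau * np.exp(alpha / sigma_x)

# h_{Q_x}(t* + alpha) = u1 / (sigma(x) (1 + u1)^2), and analogously with u2 at t* - alpha.
print(density(t_star + alpha), u1 / (sigma_x * (1.0 + u1) ** 2))
print(density(t_star - alpha), u2 / (sigma_x * (1.0 + u2) ** 2))
```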
26 Example 2.4. Let P̃ be a distribution on X × Y with marginal distribution P_X and regular conditional probability Q̃_x := P̃(·|x) on Y. [sent-62, score-0.095]
27 Furthermore, assume that Q̃_x has P_X-almost surely a τ-quantile of type α. [sent-63, score-0.123]
28 Let us now consider the family of distributions P with marginal distribution P_X and regular conditional distributions Q_x(A) := Q̃_x((A − m(x))/σ(x)), x ∈ X, where m : X → R and σ : X → (β, 1/β) are as in the previous example. [sent-64, score-0.166]
29 Then Q_x has the τ-quantile f^*_{τ,Q_x} = m(x) + σ(x) f^*_{τ,Q̃_x} of type αβ, because for s ∈ [0, αβ] we obtain the inequality Q_x((f^*_{τ,Q_x}, f^*_{τ,Q_x} + s)) = Q̃_x((f^*_{τ,Q̃_x}, f^*_{τ,Q̃_x} + s/σ(x))) ≥ b(x) s/σ(x) ≥ b(x) β s. [sent-65, score-0.143]
30 Consequently, P has a τ-quantile of p-average type αβ if and only if P̃ has a τ-quantile of p-average type α. [sent-66, score-0.152]
31 The following theorem shows that for distributions having a quantile of p-average type the conditional quantile can be estimated by functions that approximately minimize the pinball risk. [sent-67, score-0.597]
32 Moreover, let P be a distribution on X × Y that has a τ -quantile of p-average type α. [sent-70, score-0.131]
33 Then for all f : X → R satisfying R_{L_τ,P}(f) − R^*_{L_τ,P} ≤ 2^{−(p+2)/(p+1)} α^{2p/(p+1)} we have, with q := 2p/(p+1), ‖f − f^*_{τ,P}‖_{L_q(P_X)} ≤ √2 ‖b^{−1}‖^{1/2}_{L_p(P_X)} ( R_{L_τ,P}(f) − R^*_{L_τ,P} )^{1/2}. [sent-71, score-0.048]
34 Our next goal is to establish an oracle inequality for SVMs defined by (1). [sent-72, score-0.167]
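The optimization problem (1) itself is not reproduced in this extraction; in the usual formulation it is regularized empirical pinball risk minimization over an RKHS H, i.e. f_{D,λ} := argmin_{f∈H} λ‖f‖²_H + (1/n) Σ_i L_τ(y_i, f(x_i)). A minimal sketch under that assumption (our code, using a Gaussian kernel and cvxpy as an off-the-shelf convex solver; the paper prescribes no solver):

```python
import numpy as np
import cvxpy as cp

def gauss_kernel(A, B, gamma=10.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, (150, 1))
y = np.sin(3.0 * X[:, 0]) + 0.3 * rng.standard_normal(150)

tau, lam = 0.9, 1e-3
K = gauss_kernel(X, X) + 1e-8 * np.eye(150)   # small jitter keeps the PSD check happy
a = cp.Variable(150)                          # f = sum_j a_j k(x_j, .), ||f||_H^2 = a' K a
r = y - K @ a
pinball = cp.sum(cp.maximum(tau * r, (tau - 1.0) * r)) / 150   # empirical pinball risk
cp.Problem(cp.Minimize(lam * cp.quad_form(a, K) + pinball)).solve()

print(float(np.mean(y <= K @ a.value)))       # should be roughly tau = 0.9
```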
35 Then we have L_τ(y, t̄) ≤ L_τ(y, t) for all y ∈ Y and t ∈ R, where t̄ denotes t clipped to the interval [−1, 1], i.e. t̄ := max{−1, min{1, t}}. [sent-74, score-0.062]
36 Since this yields R_{L_τ,P}(f̄) ≤ R_{L_τ,P}(f) for all functions f : X → R, we will focus on clipped functions f̄ in the following. [sent-77, score-0.082]
37 To describe the approximation error of SVMs we need the approximation error function A(λ) := inf_{f∈H} ( λ‖f‖²_H + R_{L_τ,P}(f) − R^*_{L_τ,P} ), λ > 0. [sent-78, score-0.07]
38 We also need the covering numbers, which for ε > 0 are defined by N(B_H, ε, L_2(µ)) := min{ n ≥ 1 : ∃ x_1, …, x_n ∈ L_2(µ) such that B_H ⊂ ⋃_{i=1}^{n} (x_i + ε B_{L_2(µ)}) }. [sent-80, score-0.046]
39 With these preparations we can now recall the following oracle inequality shown in more generality in [9]. [sent-92, score-0.211]
40 Theorem 2.6. Let P be a distribution on X × [−1, 1] for which there exist constants v ≥ 1 and ϑ ∈ [0, 1] with E_P (L_τ ∘ f̄ − L_τ ∘ f^*_{τ,P})² ≤ v ( E_P (L_τ ∘ f̄ − L_τ ∘ f^*_{τ,P}) )^ϑ (5) for all f : X → R. [sent-94, score-0.069]
41 Moreover, let H be an RKHS over X for which there exist ϱ ∈ (0, 1) and a ≥ 1 with sup log N(B_H, ε, L_2(D_X)) ≤ a ε^{−2ϱ}, ε > 0. [sent-95, score-0.064]
42 Moreover, [9] showed that oracle inequalities of the above type can be used to establish learning rates and to investigate data-dependent parameter selection strategies. [sent-97, score-0.198]
43 For example, if we assume that there exist constants c > 0 and β ∈ (0, 1] such that A(λ) ≤ c λ^β for all λ > 0, then R_{L_τ,P}(f̄_{T,λ_n}) converges to R^*_{L_τ,P} with rate n^{−γ}, where γ := min{ β / (β(2 − ϑ + ϱ(ϑ − 1)) + ϱ), 2β/(β + 1) } and λ_n = n^{−γ/β}. [sent-98, score-0.092]
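For concreteness, the exponent γ is easy to evaluate; the helper below (our code, using the formula as reconstructed above) computes γ for given β, ϑ, ϱ:

```python
def rate_exponent(beta, theta, rho):
    """gamma = min{ beta / (beta (2 - theta + rho (theta - 1)) + rho), 2 beta / (beta + 1) },
    as reconstructed above; the regularization schedule is lambda_n = n^{-gamma/beta}."""
    return min(beta / (beta * (2.0 - theta + rho * (theta - 1.0)) + rho),
               2.0 * beta / (beta + 1.0))

# Example: beta = 1, theta = 1 (variance bound exponent), rho = 0.25.
gamma = rate_exponent(1.0, 1.0, 0.25)
print(gamma)   # min{1/1.25, 1} = 0.8, i.e. risk rate n^{-0.8}, lambda_n = n^{-0.8}
```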
44 Let us now consider how these learning rates in terms of risks translate into rates for ‖f̄_{T,λ} − f^*_{τ,P}‖_{L_q(P_X)}. [sent-100, score-0.118]
45 To this end we assume that P has a τ -quantile of p-average type α for τ ∈ (0, 1). [sent-101, score-0.114]
46 By Theorem 2.5 we then obtain E_P (L_τ ∘ f̄ − L_τ ∘ f^*_{τ,P})² ≤ E_P |f̄ − f^*_{τ,P}|² ≤ ‖f̄ − f^*_{τ,P}‖_∞^{2−q} E_P |f̄ − f^*_{τ,P}|^q ≤ c ( R_{L_τ,P}(f̄) − R^*_{L_τ,P} )^{q/2} for all f satisfying R_{L_τ,P}(f̄) − R^*_{L_τ,P} ≤ 2^{−(p+2)/(p+1)} α^{2p/(p+1)}. [sent-103, score-0.048]
47 That is, we have a variance bound (5) for ϑ := q/2 and clipped functions with small excess risk. [sent-105, score-0.108]
48 To illustrate the latter, let us assume that H is a Sobolev space W^m(X) of order m ∈ N over X, where X is the unit ball in R^d. [sent-107, score-0.075]
49 Then the learning rate for ‖f̄_{T,λ} − f^*_{τ,P}‖_{L_q(P_X)} becomes n^{−1/(4−q(1−ϱ))}, which for ∞-average type distributions reduces to n^{−2m/(6m+d)} ≈ n^{−1/3}. [sent-110, score-0.103]
50 Let us finally investigate whether the ǫ-insensitive loss, defined by L(y, t) := max{0, |y − t| − ǫ} for y, t ∈ R and fixed ǫ > 0, can be used to estimate the median, i.e. the 1/2-quantile. [sent-111, score-0.111]
51 Theorem 2.7. Let L be the ǫ-insensitive loss for some ǫ > 0 and let P be a distribution on X × R which has a unique median f^*_{1/2,P}. [sent-115, score-0.145]
52 Furthermore, assume that all conditional distributions P(·|x), x ∈ X, are atom-free, i.e. P({y} | x) = 0 for all y ∈ R. [sent-116, score-0.08]
53 Assume moreover that they are symmetric, i.e. P(h(x) + A | x) = P(h(x) − A | x) for all measurable A ⊂ R and a suitable function h : X → R. [sent-120, score-0.075]
54 If the conditional distributions have positive mass concentrated around f^*_{1/2,P}(x) ± ǫ, then f^*_{1/2,P} is the only minimizer of R_{L,P}. [sent-121, score-0.092]
55 Note that using [7] one can show that for the distributions specified in the above theorem the SVM using the ǫ-insensitive loss approximates f^*_{1/2,P} whenever the SVM is R_{L,P}-consistent, i. [sent-122, score-0.168]
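The contrast between the two losses is visible empirically. The sketch below (our code, with an arbitrary asymmetric sample) minimizes the empirical ǫ-insensitive risk over constant predictions: for ǫ = 0 the minimizer is the sample median, while for larger ǫ it can move away from it, consistent with the extra conditions on P in the theorem:

```python
import numpy as np

def eps_insensitive(y, t, eps):
    """L(y, t) = max{0, |y - t| - eps}."""
    return np.maximum(0.0, np.abs(np.asarray(y) - t) - eps)

rng = np.random.default_rng(2)
# Asymmetric sample: a tight left component plus a wide right component,
# so the symmetry condition of the theorem fails.
y = np.concatenate([rng.normal(-1.0, 0.1, 5000), rng.normal(2.0, 1.5, 5000)])

grid = np.linspace(-3.0, 5.0, 4001)
for eps in (0.0, 0.5, 1.5):
    risks = np.array([eps_insensitive(y, t, eps).mean() for t in grid])
    print(eps, grid[np.argmin(risks)], np.median(y))
# eps = 0 recovers the sample median; larger eps can move the minimizer away from it.
```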
56 3 Proofs. Let us first recall some notions from [7], which investigated surrogate losses in general and, in particular, the question of how approximate risk minimizers approximate exact risk minimizers. [sent-127, score-0.366]
57 To this end let L : X × Y × R → [0, ∞) be a measurable function which we call a loss in the following. [sent-128, score-0.197]
58 For a distribution P and an f : X → R the L-risk is then defined by R_{L,P}(f) := E_{(x,y)∼P} L(x, y, f(x)), and, as usual, the Bayes L-risk is denoted by R^*_{L,P} := inf R_{L,P}(f), where the infimum is taken over all (measurable) f : X → R. [sent-129, score-0.09]
59 In addition, given a distribution Q on Y, the inner L-risks were defined by C_{L,Q,x}(t) := ∫_Y L(x, y, t) dQ(y), x ∈ X, t ∈ R, and the minimal inner L-risks were denoted by C^*_{L,Q,x} := inf_{t∈R} C_{L,Q,x}(t), x ∈ X. [sent-130, score-0.172]
60 2] further shows that x ↦ C^*_{L,P(·|x),x} is measurable if the σ-algebra on X is complete. [sent-133, score-0.075]
61 In other words, the Bayes L-risk is obtained by minimizing the inner risks and subsequently integrating with respect to the marginal distribution P_X. [sent-136, score-0.18]
62 In particular, it turned out that the sets of ε-approximate minimizers M_{L,Q,x}(ε) := { t ∈ R : C_{L,Q,x}(t) < C^*_{L,Q,x} + ε }, ε ∈ [0, ∞], and the set of exact minimizers M_{L,Q,x}(0⁺) := ⋂_{ε>0} M_{L,Q,x}(ε) play a crucial role. [sent-138, score-0.216]
63 Now assume we have two losses L_tar : X × Y × R → [0, ∞] and L_sur : X × Y × R → [0, ∞], and that our goal is to estimate the excess L_tar-risk by the excess L_sur-risk. [sent-140, score-0.134]
64 This issue was investigated in [7], where the main device was the so-called calibration function δ_max(·, Q, x), defined for all ε ∈ [0, ∞] by δ_max(ε, Q, x) := inf_{t ∈ R \ M_{L_tar,Q,x}(ε)} ( C_{L_sur,Q,x}(t) − C^*_{L_sur,Q,x} ) if C^*_{L_sur,Q,x} < ∞, and δ_max(ε, Q, x) := ∞ if C^*_{L_sur,Q,x} = ∞. [sent-141, score-0.096]
65 In the following we sometimes write δ_max,L_tar,L_sur(ε, Q, x) := δ_max(ε, Q, x) whenever we need to explicitly mention the target and surrogate losses. [sent-142, score-0.065]
66 Before we use (8) to establish an inequality between the excess risks of L_tar and L_sur, we finally recall that the Fenchel-Legendre bi-conjugate g** : I → [0, ∞] of a function g : I → [0, ∞] defined on an interval I is the largest convex function h : I → [0, ∞] satisfying h ≤ g. [sent-146, score-0.334]
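The bi-conjugate can be approximated on a grid by applying the discrete Legendre transform twice. A small numerical sketch (our code, an approximation only):

```python
import numpy as np

def biconjugate(g_vals, t_grid, s_grid=None):
    """Grid approximation of g**: g*(s) = max_t (s t - g(t)) and
    g**(t) = max_s (s t - g*(s)); g** is the largest convex minorant of g."""
    if s_grid is None:
        s_grid = np.linspace(-50.0, 50.0, 2001)
    g_star = np.max(s_grid[:, None] * t_grid[None, :] - g_vals[None, :], axis=1)
    return np.max(s_grid[None, :] * t_grid[:, None] - g_star[None, :], axis=1)

t = np.linspace(0.0, 2.0, 401)
g = np.minimum(t ** 2, 0.3 + 0.1 * t)     # an arbitrary non-convex example
gss = biconjugate(g, t)
print(np.all(gss <= g + 1e-9))            # True: g** <= g everywhere
```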
67 With these preparations we can now establish the following generalization of [7, Theorem 2. [sent-148, score-0.075]
68 Theorem 3.1. Let P be a distribution on X × Y with R^*_{L_tar,P} < ∞ and R^*_{L_sur,P} < ∞, and assume that there exist p ∈ (0, ∞] and functions b : X → [0, ∞] and δ : [0, ∞) → [0, ∞) such that δ_max(ε, P(·|x), x) ≥ b(x) δ(ε), ε ≥ 0, x ∈ X, (9) and b^{−1} ∈ L_p(P_X). [sent-151, score-0.227]
69 Then for q := p/(p+1), δ̄ := δ^q : [0, ∞) → [0, ∞), and all f : X → R we have δ̄**( R_{L_tar,P}(f) − R^*_{L_tar,P} ) ≤ ‖b^{−1}‖^q_{L_p(P_X)} ( R_{L_sur,P}(f) − R^*_{L_sur,P} )^q. [sent-152, score-0.16]
70 13] we then see that there exist constants c_1, c_2 ∈ (0, ∞) satisfying t ≤ c_1 δ̄**(t) + c_2 for all t ∈ [0, ∞]. [sent-159, score-0.139]
71 From this we obtain ∞ = R_{L_tar,P}(f) − R^*_{L_tar,P} ≤ c_1 ∫_X δ̄**( C_{L_tar,P(·|x),x}(f(x)) − C^*_{L_tar,P(·|x),x} ) dP_X(x) + c_2 ≤ c_1 ∫_X b(x)^{−q} ( C_{L_sur,P(·|x),x}(f(x)) − C^*_{L_sur,P(·|x),x} )^q dP_X(x) + c_2, where the last step is analogous to our considerations for R_{L_tar,P}(f) < ∞. [sent-160, score-0.117]
72 By b^{−1} ∈ L_p(P_X) and Hölder's inequality we then conclude R_{L_sur,P}(f) − R^*_{L_sur,P} = ∞. [sent-161, score-0.11]
73 Our next goal is to determine the inner risks and their minimizers for the pinball loss. [sent-162, score-0.362]
74 8]) that given a distribution Q on R and a non-negative function g : R → [0, ∞) we have ∫_R g dQ = ∫_0^∞ Q(g ≥ s) ds. [sent-166, score-0.668]
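This layer-cake identity is easy to check by Monte Carlo; a quick sketch (our code, with Q = N(0, 1) and g(y) = y² as arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.standard_normal(200_000)
gy = np.sort(y ** 2)                       # g(y) = y^2 >= 0 under Q = N(0, 1)

lhs = gy.mean()                            # Monte Carlo estimate of int g dQ
s = np.linspace(0.0, 30.0, 30_001)
tail = 1.0 - np.searchsorted(gy, s, side="left") / len(gy)   # Q(g >= s)
rhs = np.trapz(tail, s)                    # int_0^infinity Q(g >= s) ds
print(lhs, rhs)                            # both close to E[y^2] = 1
```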
75 Then there exist q_+, q_− ∈ [0, ∞) with q_+ + q_− = Q({t^*}), and for all t ≥ 0 we have C_{L_τ,Q}(t^* + t) − C_{L_τ,Q}(t^*) = t q_+ + ∫_0^t Q((t^*, t^* + s)) ds (12) and C_{L_τ,Q}(t^* − t) − C_{L_τ,Q}(t^*) = t q_− + ∫_0^t Q((t^* − s, t^*)) ds. (13) [sent-169, score-1.325]
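Before the proof, formulas (12) and (13) can be verified numerically for an atom-free Q, where q_+ = q_− = 0. A sketch (our code, with Q = N(0, 1) and τ = 0.7 as arbitrary choices):

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

tau = 0.7
t_star = norm.ppf(tau)        # tau-quantile of Q = N(0, 1); Q({t*}) = 0, so q+ = q- = 0

def inner_risk(t):
    """C_{L_tau,Q}(t) = int psi_tau(y - t) dQ(y)."""
    psi = lambda y: (tau - 1.0) * (y - t) if y < t else tau * (y - t)
    return quad(lambda y: psi(y) * norm.pdf(y), -10, 10, limit=200)[0]

for t in (0.2, 0.5, 1.0):
    lhs = inner_risk(t_star + t) - inner_risk(t_star)
    rhs = quad(lambda s: norm.cdf(t_star + s) - norm.cdf(t_star), 0, t)[0]  # (12) with q+ = 0
    print(t, lhs, rhs)        # the two columns agree up to quadrature error
```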
76 Proof: Let us consider the distribution Q^{(t^*)} defined by Q^{(t^*)}(A) := Q(t^* + A) for all measurable A ⊂ R. [sent-170, score-0.117]
77 Moreover, we obviously have C_{L_τ,Q}(t^* + t) = C_{L_τ,Q^{(t^*)}}(t), and hence we may assume without loss of generality that t^* = 0. [sent-172, score-0.127]
78 Hence there exists a q_+ satisfying 0 ≤ q_+ ≤ Q({0}) and Q((−∞, 0]) = τ + q_+. (14) [sent-175, score-0.048]
79 Let us now compute the inner risks of L_τ. [sent-176, score-0.159]
80 Following [7] we now define the self-calibration loss of L by L̆(Q, t) := |t − t^*_{L,Q}|, Q ∈ Q_min(L), t ∈ R. (15) [sent-179, score-0.067]
81 This loss is a so-called template loss in the sense of [7], i. [sent-180, score-0.134]
82 e., for a given distribution P on X × Y, where X has a complete σ-algebra and P(·|x) ∈ Q_min(L) for P_X-almost all x ∈ X, the P-instance L̆_P(x, t) := |t − t^*_{L,P(·|x)}| is measurable and hence a loss. [sent-182, score-0.119]
83 [7] extended the definition of inner risks to the self-calibration loss by setting C_{L̆,Q}(t) := L̆(Q, t), and based on this the minimal inner risks and their (approximate) minimizers were defined in the obvious way. [sent-183, score-0.449]
84 Moreover, the self-calibration function was defined by δ_max,L̆,L(ε, Q) = inf_{t∈R: |t−t^*_{L,Q}|≥ε} ( C_{L,Q}(t) − C^*_{L,Q} ). [sent-184, score-0.07]
85 First, it measures how well approximate L-risk minimizers t approximate the true minimizer t^*_{L,Q}; second, it equals the calibration function of the P-instance L̆_P, i. [sent-187, score-0.164]
86 2 show δ_max,L̆,L(ε, Q) = min{ ε q_+ + ∫_0^ε Q((t^*, t^* + s)) ds, ε q_− + ∫_0^ε Q((t^* − s, t^*)) ds }, ε ≥ 0, where q_+ and q_− are the real numbers defined in Proposition 3.2. [sent-195, score-0.694]
87 Let us additionally assume that the τ -quantile t∗ is of type α. [sent-197, score-0.116]
88 For the Huber-type function δ(ε) := ε²/2 if ε ∈ [0, α] and δ(ε) := αε − α²/2 if ε > α, a simple calculation then yields δ_max,L̆,L(ε, Q) ≥ c_Q δ(ε), where c_Q is the constant satisfying (3). [sent-198, score-0.144]
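This lower bound can also be checked numerically. The sketch below (our code, with Q = N(0, 1), τ = 0.7, α = 0.5 as arbitrary choices; q_+ = q_− = 0 since Q is atom-free) compares the min-formula above with c_Q δ(ε):

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

tau, alpha = 0.7, 0.5
t_star = norm.ppf(tau)        # Q = N(0, 1) is atom-free, so q+ = q- = 0

# c_Q from the density bound: here the N(0,1) density is monotone on
# [t* - alpha, t* + alpha] (the mode 0 lies outside), so its infimum is at an endpoint.
c_Q = min(norm.pdf(t_star - alpha), norm.pdf(t_star + alpha))

def delta_huber(e):
    return e * e / 2.0 if e <= alpha else alpha * e - alpha * alpha / 2.0

def delta_max(e):
    plus = quad(lambda s: norm.cdf(t_star + s) - norm.cdf(t_star), 0, e)[0]
    minus = quad(lambda s: norm.cdf(t_star) - norm.cdf(t_star - s), 0, e)[0]
    return min(plus, minus)

for e in (0.1, 0.3, 0.5, 1.0, 2.0):
    print(e, delta_max(e) >= c_Q * delta_huber(e))   # True for every e
```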
89 To this end we define δ̂(ε) := s_p ε² if ε ∈ [0, s_p a_p] and δ̂(ε) := a_p ε − s_p^{p+2} a_p if ε > s_p a_p, where a_p := α^q and s_p := 2^{−q/p}. [sent-202, score-0.62]
90 Proposition 3.3. Let Q be a symmetric, atom-free distribution on R with median t^* = 0. [sent-210, score-0.078]
91 Moreover, the definition of L implies C_{L,Q}(t) = ∫_{−∞}^{t−ǫ} (t − y − ǫ) dQ(y) + ∫_{t+ǫ}^{∞} (y − ǫ − t) dQ(y). (19) [sent-215, score-0.071]
92 Using the symmetry of Q and (18), both integrals can be expressed through the tail function s ↦ Q([s, ∞)); let us first consider the case t ≥ ǫ. [sent-216, score-0.046]
93 Then the symmetry of Q and (18) yield, after splitting and recombining the resulting integrals, C_{L,Q}(t) = 2 ∫_0^{t−ǫ} Q([0, ∞)) ds + 2 ∫_{t+ǫ}^{∞} Q([s, ∞)) ds + ∫_{t−ǫ}^{t+ǫ} Q([s, ∞)) ds if t ≥ ǫ. [sent-217, score-2.039] [sent-220, score-1.296] [sent-221, score-1.296] [sent-222, score-5.204] [sent-223, score-4.536]
98 Analogously we obtain from (19) a corresponding formula for the case t ∈ (0, ǫ], which in particular gives C_{L,Q}(t) − C_{L,Q}(0) = ∫_0^t Q([ǫ − t + s, ǫ + s)) ds ≥ 0. [sent-225, score-3.888]
99 Hence one exact minimizer of CL,Q (·) is the median t∗ = 0. [sent-230, score-0.088]
100 The last assertion is a direct consequence of the formula for CL,Q (t) − CL,Q (0) in the case t ∈ (0, ǫ]. [sent-231, score-0.045]
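As a closing sanity check, the reconstructed formula for C_{L,Q}(t) in the case t ≥ ǫ can be compared with a direct evaluation of the risk; a sketch (our code, with Q = N(0, 1) and ǫ = 0.5 as arbitrary choices):

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

eps = 0.5

def risk_direct(t):
    """C_{L,Q}(t) = int max{0, |y - t| - eps} dQ(y) for Q = N(0, 1)."""
    return quad(lambda y: max(0.0, abs(y - t) - eps) * norm.pdf(y), -12, 12, limit=200)[0]

def risk_formula(t):
    """Tail-integral formula for t >= eps (Q symmetric, atom-free, median 0)."""
    tail = lambda s: 1.0 - norm.cdf(s)   # Q([s, infinity))
    first = t - eps                      # = 2 int_0^{t-eps} Q([0, inf)) ds, since Q([0, inf)) = 1/2
    second = 2.0 * quad(tail, t + eps, 12)[0]
    third = quad(tail, t - eps, t + eps)[0]
    return first + second + third

for t in (0.5, 1.0, 2.0):
    print(t, risk_direct(t), risk_formula(t))   # columns agree up to quadrature error
```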
wordName wordTfidf (topN-words)
[('ds', 0.648), ('clsur', 0.292), ('px', 0.248), ('dq', 0.201), ('cltar', 0.175), ('rl', 0.145), ('quantile', 0.145), ('hqx', 0.136), ('rltar', 0.136), ('dpx', 0.136), ('qx', 0.136), ('cq', 0.117), ('pinball', 0.117), ('tar', 0.117), ('minimizers', 0.108), ('risks', 0.096), ('bh', 0.087), ('fd', 0.087), ('tq', 0.078), ('cl', 0.078), ('type', 0.076), ('measurable', 0.075), ('lp', 0.074), ('inf', 0.07), ('ap', 0.068), ('inequality', 0.067), ('loss', 0.067), ('sp', 0.065), ('oracle', 0.064), ('lq', 0.062), ('clipped', 0.062), ('lsur', 0.058), ('ltar', 0.058), ('qmin', 0.058), ('rlsur', 0.058), ('median', 0.058), ('ft', 0.052), ('moreover', 0.052), ('theorem', 0.052), ('steinwart', 0.051), ('satisfying', 0.048), ('hq', 0.046), ('excess', 0.046), ('sur', 0.043), ('ep', 0.043), ('svms', 0.042), ('inner', 0.041), ('recall', 0.041), ('quantiles', 0.041), ('alamos', 0.039), ('christmann', 0.039), ('preparations', 0.039), ('cp', 0.039), ('mum', 0.036), ('establish', 0.036), ('conditional', 0.035), ('let', 0.035), ('rkhs', 0.035), ('ingo', 0.034), ('dx', 0.034), ('svm', 0.033), ('regular', 0.032), ('hush', 0.031), ('minimizer', 0.03), ('symmetry', 0.029), ('lder', 0.029), ('surely', 0.029), ('exist', 0.029), ('lebesgue', 0.027), ('los', 0.027), ('distributions', 0.027), ('nally', 0.026), ('assertion', 0.026), ('calibration', 0.026), ('min', 0.025), ('surrogate', 0.025), ('omit', 0.024), ('losses', 0.024), ('hence', 0.024), ('proposition', 0.023), ('marginal', 0.023), ('us', 0.022), ('proof', 0.022), ('whenever', 0.022), ('implies', 0.022), ('investigate', 0.022), ('numbers', 0.021), ('distribution', 0.02), ('constants', 0.02), ('yields', 0.02), ('end', 0.02), ('risk', 0.019), ('nition', 0.019), ('formula', 0.019), ('bayes', 0.019), ('max', 0.018), ('obviously', 0.018), ('write', 0.018), ('assume', 0.018), ('xn', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 101 nips-2007-How SVMs can estimate quantiles and the median
Author: Andreas Christmann, Ingo Steinwart
Abstract: We investigate quantile regression based on the pinball loss and the ǫ-insensitive loss. For the pinball loss a condition on the data-generating distribution P is given that ensures that the conditional quantiles are approximated with respect to ‖·‖_1. This result is then used to derive an oracle inequality for an SVM based on the pinball loss. Moreover, we show that SVMs based on the ǫ-insensitive loss estimate the conditional median only under certain conditions on P.
2 0.11845171 110 nips-2007-Learning Bounds for Domain Adaptation
Author: John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, Jennifer Wortman
Abstract: Empirical risk minimization offers well-known learning guarantees when training and test data come from the same domain. In the real world, though, we often wish to adapt a classifier from a source domain with a large amount of training data to different target domain with very little training data. In this work we give uniform convergence bounds for algorithms that minimize a convex combination of source and target empirical risk. The bounds explicitly model the inherent trade-off between training on a large but inaccurate source data set and a small but accurate target training set. Our theory also gives results when we have multiple source domains, each of which may have a different number of instances, and we exhibit cases in which minimizing a non-uniform combination of source risks can achieve much lower target error than standard empirical risk minimization. 1
3 0.084267214 21 nips-2007-Adaptive Online Gradient Descent
Author: Elad Hazan, Alexander Rakhlin, Peter L. Bartlett
Abstract: We study the rates of growth of the regret in online convex optimization. First, we show that a simple extension of the algorithm of Hazan et al eliminates the need for a priori knowledge of the lower bound on the second derivatives of the observed functions. We then provide an algorithm, Adaptive Online Gradient Descent, which interpolates between the results of Zinkevich for linear functions and of Hazan et al for strongly convex functions, achieving intermediate rates √ between T and log T . Furthermore, we show strong optimality of the algorithm. Finally, we provide an extension of our results to general norms. 1
4 0.067532137 118 nips-2007-Learning with Transformation Invariant Kernels
Author: Christian Walder, Olivier Chapelle
Abstract: This paper considers kernels invariant to translation, rotation and dilation. We show that no non-trivial positive definite (p.d.) kernels exist which are radial and dilation invariant, only conditionally positive definite (c.p.d.) ones. Accordingly, we discuss the c.p.d. case and provide some novel analysis, including an elementary derivation of a c.p.d. representer theorem. On the practical side, we give a support vector machine (s.v.m.) algorithm for arbitrary c.p.d. kernels. For the thinplate kernel this leads to a classifier with only one parameter (the amount of regularisation), which we demonstrate to be as effective as an s.v.m. with the Gaussian kernel, even though the Gaussian involves a second parameter (the length scale). 1
5 0.064728506 108 nips-2007-Kernel Measures of Conditional Dependence
Author: Kenji Fukumizu, Arthur Gretton, Xiaohai Sun, Bernhard Schölkopf
Abstract: We propose a new measure of conditional dependence of random variables, based on normalized cross-covariance operators on reproducing kernel Hilbert spaces. Unlike previous kernel dependence measures, the proposed criterion does not depend on the choice of kernel in the limit of infinite data, for a wide class of kernels. At the same time, it has a straightforward empirical estimate with good convergence behaviour. We discuss the theoretical properties of the measure, and demonstrate its application in experiments. 1
6 0.063649744 82 nips-2007-Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization
7 0.056900889 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning
8 0.048629411 7 nips-2007-A Kernel Statistical Test of Independence
9 0.046578165 42 nips-2007-CPR for CSPs: A Probabilistic Relaxation of Constraint Propagation
10 0.044699885 205 nips-2007-Theoretical Analysis of Learning with Reward-Modulated Spike-Timing-Dependent Plasticity
11 0.04440248 199 nips-2007-The Price of Bandit Information for Online Optimization
12 0.043490943 22 nips-2007-Agreement-Based Learning
13 0.042904142 12 nips-2007-A Spectral Regularization Framework for Multi-Task Structure Learning
14 0.042112578 124 nips-2007-Managing Power Consumption and Performance of Computing Systems Using Reinforcement Learning
15 0.041712444 192 nips-2007-Testing for Homogeneity with Kernel Fisher Discriminant Analysis
16 0.041336522 23 nips-2007-An Analysis of Convex Relaxations for MAP Estimation
17 0.041307814 211 nips-2007-Unsupervised Feature Selection for Accurate Recommendation of High-Dimensional Image Data
18 0.039667778 123 nips-2007-Loop Series and Bethe Variational Bounds in Attractive Graphical Models
19 0.039288409 120 nips-2007-Linear programming analysis of loopy belief propagation for weighted matching
20 0.037910968 159 nips-2007-Progressive mixture rules are deviation suboptimal
topicId topicWeight
[(0, -0.106), (1, -0.036), (2, -0.036), (3, 0.063), (4, 0.016), (5, -0.045), (6, 0.024), (7, 0.001), (8, -0.035), (9, 0.011), (10, 0.07), (11, -0.02), (12, -0.012), (13, 0.004), (14, -0.029), (15, -0.112), (16, 0.137), (17, 0.024), (18, -0.008), (19, -0.006), (20, 0.05), (21, -0.045), (22, 0.035), (23, -0.152), (24, -0.069), (25, -0.011), (26, 0.059), (27, -0.066), (28, 0.048), (29, -0.0), (30, -0.093), (31, -0.066), (32, 0.001), (33, -0.0), (34, 0.007), (35, 0.089), (36, -0.156), (37, 0.026), (38, -0.1), (39, -0.107), (40, -0.154), (41, -0.021), (42, -0.033), (43, 0.0), (44, 0.037), (45, 0.024), (46, -0.019), (47, -0.075), (48, -0.06), (49, -0.092)]
simIndex simValue paperId paperTitle
same-paper 1 0.96634793 101 nips-2007-How SVMs can estimate quantiles and the median
Author: Andreas Christmann, Ingo Steinwart
Abstract: We investigate quantile regression based on the pinball loss and the ǫ-insensitive loss. For the pinball loss a condition on the data-generating distribution P is given that ensures that the conditional quantiles are approximated with respect to ‖·‖_1. This result is then used to derive an oracle inequality for an SVM based on the pinball loss. Moreover, we show that SVMs based on the ǫ-insensitive loss estimate the conditional median only under certain conditions on P.
2 0.6279667 110 nips-2007-Learning Bounds for Domain Adaptation
Author: John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, Jennifer Wortman
Abstract: Empirical risk minimization offers well-known learning guarantees when training and test data come from the same domain. In the real world, though, we often wish to adapt a classifier from a source domain with a large amount of training data to different target domain with very little training data. In this work we give uniform convergence bounds for algorithms that minimize a convex combination of source and target empirical risk. The bounds explicitly model the inherent trade-off between training on a large but inaccurate source data set and a small but accurate target training set. Our theory also gives results when we have multiple source domains, each of which may have a different number of instances, and we exhibit cases in which minimizing a non-uniform combination of source risks can achieve much lower target error than standard empirical risk minimization. 1
3 0.51560563 82 nips-2007-Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization
Author: Xuanlong Nguyen, Martin J. Wainwright, Michael I. Jordan
Abstract: We develop and analyze an algorithm for nonparametric estimation of divergence functionals and the density ratio of two probability distributions. Our method is based on a variational characterization of f -divergences, which turns the estimation into a penalized convex risk minimization problem. We present a derivation of our kernel-based estimation algorithm and an analysis of convergence rates for the estimator. Our simulation results demonstrate the convergence behavior of the method, which compares favorably with existing methods in the literature. 1
4 0.45001656 108 nips-2007-Kernel Measures of Conditional Dependence
Author: Kenji Fukumizu, Arthur Gretton, Xiaohai Sun, Bernhard Schölkopf
Abstract: We propose a new measure of conditional dependence of random variables, based on normalized cross-covariance operators on reproducing kernel Hilbert spaces. Unlike previous kernel dependence measures, the proposed criterion does not depend on the choice of kernel in the limit of infinite data, for a wide class of kernels. At the same time, it has a straightforward empirical estimate with good convergence behaviour. We discuss the theoretical properties of the measure, and demonstrate its application in experiments. 1
5 0.43765432 178 nips-2007-Simulated Annealing: Rigorous finite-time guarantees for optimization on continuous domains
Author: Andrea Lecchini-visintini, John Lygeros, Jan Maciejowski
Abstract: Simulated annealing is a popular method for approaching the solution of a global optimization problem. Existing results on its performance apply to discrete combinatorial optimization where the optimization variables can assume only a finite set of possible values. We introduce a new general formulation of simulated annealing which allows one to guarantee finite-time performance in the optimization of functions of continuous variables. The results hold universally for any optimization problem on a bounded domain and establish a connection between simulated annealing and up-to-date theory of convergence of Markov chain Monte Carlo methods on continuous domains. This work is inspired by the concept of finite-time learning with known accuracy and confidence developed in statistical learning theory. Optimization is the general problem of finding a value of a vector of variables θ that maximizes (or minimizes) some scalar criterion U (θ). The set of all possible values of the vector θ is called the optimization domain. The elements of θ can be discrete or continuous variables. In the first case the optimization domain is usually finite, such as in the well-known traveling salesman problem; in the second case the optimization domain is a continuous set. An important example of a continuous optimization domain is the set of 3-D configurations of a sequence of amino-acids in the problem of finding the minimum energy folding of the corresponding protein [1]. In principle, any optimization problem on a finite domain can be solved by an exhaustive search. However, this is often beyond computational capacity: the optimization domain of the traveling salesman problem with 100 cities contains more than 10155 possible tours. An efficient algorithm to solve the traveling salesman and many similar problems has not yet been found and such problems remain reliably solvable only in principle [2]. Statistical mechanics has inspired widely used methods for finding good approximate solutions in hard discrete optimization problems which defy efficient exact solutions [3, 4, 5, 6]. Here a key idea has been that of simulated annealing [3]: a random search based on the Metropolis-Hastings algorithm, such that the distribution of the elements of the domain visited during the search converges to an equilibrium distribution concentrated around the global optimizers. Convergence and finite-time performance of simulated annealing on finite domains has been evaluated in many works, e.g. [7, 8, 9, 10]. On continuous domains, most popular optimization methods perform a local gradient-based search and in general converge to local optimizers; with the notable exception of convex criteria where convergence to the unique global optimizer occurs [11]. Simulated annealing performs a global search and can be easily implemented on continuous domains. Hence it can be considered a powerful complement to local methods. In this paper, we introduce for the first time rigorous guarantees on the finite-time performance of simulated annealing on continuous domains. We will show that it is possible to derive simulated annealing algorithms which, with an arbitrarily high level of confidence, find an approximate solution to the problem of optimizing a function of continuous variables, within a specified tolerance to the global optimal solution after a known finite number of steps. 
Rigorous guarantees on the finite-time performance of simulated annealing in the optimization of functions of continuous variables have never been obtained before; the only results available state that simulated annealing converges to a global optimizer as the number of steps grows to infinity, e.g. [12, 13, 14, 15]. The background of our work is twofold. On the one hand, our notion of approximate solution to a global optimization problem is inspired by the concept of finite-time learning with known accuracy and confidence developed in statistical learning theory [16, 17]. We actually maintain an important aspect of statistical learning theory which is that we do not introduce any particular assumption on the optimization criterion, i.e. our results hold regardless of what U is. On the other hand, we ground our results on the theory of convergence, with quantitative bounds on the distance to the target distribution, of the Metropolis-Hastings algorithm and Markov Chain Monte Carlo (MCMC) methods, which has been one of the main achievements of recent research in statistics [18, 19, 20, 21]. In this paper, we will not develop any ready-to-use optimization algorithm. We will instead introduce a general formulation of the simulated annealing method which allows one to derive new simulated annealing algorithms with rigorous finite-time guarantees on the basis of existing theory. The Metropolis-Hastings algorithm and the general family of MCMC methods have many degrees of freedom. The choice and comparison of specific algorithms goes beyond the scope of the paper. The paper is organized in the following sections. In Simulated annealing we introduce the method and fix the notation. In Convergence we recall the reasons why finite-time guarantees for simulated annealing on continuous domains have not been obtained before. In Finite-time guarantees we present the main result of the paper. In Conclusions we state our findings and conclude the paper. 1 Simulated annealing The original formulation of simulated annealing was inspired by the analogy between the stochastic evolution of the thermodynamic state of an annealing material towards the configurations of minimal energy and the search for the global minimum of an optimization criterion [3]. In the procedure, the optimization criterion plays the role of the energy and the state of the annealed material is simulated by the evolution of the state of an inhomogeneous Markov chain. The state of the chain evolves according to the Metropolis-Hastings algorithm in order to simulate the Boltzmann distribution of thermodynamic equilibrium. The Boltzmann distribution is simulated for a decreasing sequence of temperatures (“cooling”). The target distribution of the cooling procedure is the limiting Boltzmann distribution, for the temperature that tends to zero, which takes non-zero values only on the set of global minimizers [7]. The original formulation of the method was for a finite domain. However, simulated annealing can be generalized straightforwardly to a continuous domain because the Metropolis-Hastings algorithm can be used with almost no differences on discrete and continuous domains The main difference is that on a continuous domain the equilibrium distributions are specified by probability densities. 
On a continuous domain, Markov transition kernels in which the distribution of the elements visited by the chain converges to an equilibrium distribution with the desired density can be constructed using the Metropolis-Hastings algorithm and the general family of MCMC methods [22]. We point out that Boltzmann distributions are not the only distributions which can be adopted as equilibrium distributions in simulated annealing [7]. In this paper it is convenient for us to adopt a different type of equilibrium distribution in place of Boltzmann distributions. 1.1 Our setting The optimization criterion is U : Θ → [0, 1], with Θ ⊂ RN . The assumption that U takes values in the interval [0, 1] is a technical one. It does not imply any serious loss of generality. In general, any bounded optimization criterion can be scaled to take values in [0, 1]. We assume that the optimization task is to find a global maximizer; this can be done without loss of generality. We also assume that Θ is a bounded set. We consider equilibrium distributions defined by probability density functions proportional to [U (θ) + δ]J where J and δ are two strictly positive parameters. We use π (J) to denote an equilibrium distribution, i.e. π (J) (dθ) ∝ [U (θ) + δ]J πLeb (dθ) where πLeb is the standard Lebesgue measure. Here, J −1 plays the role of the temperature: if the function U (θ) plus δ is taken to a positive power J then as J increases (i.e. as J −1 decreases) [U (θ) + δ]J becomes increasingly peaked around the global maximizers. The parameter δ is an offset which guarantees that the equilibrium densities are always strictly positive, even if U takes zero values on some elements of the domain. The offset δ is chosen by the user and we show later that our results allow one to make an optimal selection of δ. The zero-temperature distribution is the limiting distribution, for J → ∞, which takes non-zero values only on the set of global maximizers. It is denoted by π (∞) . In the generic formulation of the method, the Markov transition kernel of the k-th step of the inhomogeneous chain has equilibrium distribution π (Jk ) where {Jk }k=1,2,... is the “cooling schedule”. The cooling schedule is a non-decreasing sequence of positive numbers according to which the equilibrium distribution become increasingly sharpened during the evolution of the chain. We use θk to denote the state of the chain and Pθk to denote its probability distribution. The distribution Pθk obviously depends on the initial condition θ0 . However, in this work, we don’t need to make this dependence explicit in the notation. Remark 1: If, given an element θ in Θ, the value U (θ) can be computed directly, we say that U is a deterministic criterion, e.g. the energy landscape in protein structure prediction [1]. In problems involving random variables, the value U (θ) may be the expected value U (θ) = g(x, θ)px (x; θ)dx of some function g which depends on both the optimization variable θ, and on some random variable x which has probability density px (x; θ) (which may itself depend on θ). In such problems it is usually not possible to compute U (θ) directly, either because evaluation of the integral requires too much computation, or because no analytical expression for px (x; θ) is available. Typically one must perform stochastic simulations in order to obtain samples of x for a given θ, hence obtain sample values of g(x, θ), and thus construct a Monte Carlo estimate of U (θ). 
The Bayesian design of clinical trials is an important application area where such expected-value criteria arise [23]. The authors of this paper investigate the optimization of expected-value criteria motivated by problems of aircraft routing [24]. In the particular case that px (x; θ) does not depend on θ, the optimization task is often called “empirical risk minimization”, and is studied extensively in statistical learning theory [16, 17]. The results of this paper apply in the same way to the optimization of both deterministic and expected-value criteria. The MCMC method developed by M¨ ller [25, 26] allows one u to construct simulated annealing algorithms for the optimization of expected-value criteria. M¨ ller u [25, 26] employs the same equilibrium distributions as those described in our setting; in his context J is restricted to integer values. 2 Convergence The rationale of simulated annealing is as follows: if the temperature is kept constant, say Jk = J, then the distribution of the state of the chain Pθk tends to the equilibrium distribution π (J) ; if J → ∞ then the equilibrium distribution π (J) tends to the zero-temperature distribution π (∞) ; as a result, if the cooling schedule Jk tends to infinity, one obtains that Pθk “follows” π (Jk ) and that π (Jk ) tends to π (∞) and eventually that the distribution of the state of the chain Pθk tends to π (∞) . The theory shows that, under conditions on the cooling schedule and the Markov transition kernels, the distribution of the state of the chain Pθk actually converges to the target zero-temperature distribution π (∞) as k → ∞ [12, 13, 14, 15]. Convergence to the zero-temperature distribution implies that asymptotically the state of the chain eventually coincides with a global optimizer with probability one. The difficulty which must be overcome in order to obtain finite step results on simulated annealing algorithms on a continuous domain is that usually, in an optimization problem defined over continuous variables, the set of global optimizers has zero Lebesgue measure (e.g. a set of isolated points). If the set of global optimizers has zero measure then the set of global optimizers has null probability according to the equilibrium distributions π (J) for any finite J and, as a consequence, according to the distributions Pθk for any finite k. Put another way, the probability that the state of the chain visits the set of global optimizers is constantly zero after any finite number of steps. Hence the confidence of the fact that the solution provided by the algorithm in finite time coincides with a global optimizer is also constantly zero. Notice that this is not the case for a finite domain, where the set of global optimizers is of non-null measure with respect to the reference counting measure [7, 8, 9, 10]. It is instructive to look at the issue also in terms of the rate of convergence to the target zerotemperature distribution. On a discrete domain, the distribution of the state of the chain at each step and the zero-temperature distribution are both standard discrete distributions. It is then possible to define a distance between them and study the rate of convergence of this distance to zero. This analysis allows one to obtain results on the finite-time behavior of simulated annealing [7, 8]. On a continuous domain and for a set of global optimizers of measure zero, the target zero-temperature distribution π (∞) ends up being a mixture of probability masses on the set of global optimizers. 
In this situation, although the distribution of the state of the chain Pθk still converges asymptotically to π (∞) , it is not possible to introduce a sensible distance between the two distributions and a rate of convergence to the target distribution cannot even be defined (weak convergence), see [12, Theorem 3.3]. This is the reason that until now there have been no guarantees on the performance of simulated annealing on a continuous domain after a finite number of computations: by adopting the zero-temperature distribution π (∞) as the target distribution it is only possible to prove asymptotic convergence in infinite time to a global optimizer. Remark 2: The standard distance between two distributions, say µ1 and µ2 , on a continuous support is the total variation norm µ1 − µ2 T V = supA |µ1 (A) − µ2 (A)|, see e.g. [21]. In simulated annealing on a continuous domain the distribution of the state of the chain Pθk is absolutely continuous with respect to the Lebesgue measure (i.e. πLeb (A) = 0 ⇒ Pθk (A) = 0), by construction for any finite k. Hence if the set of global optimizers has zero Lebesgue measure then it has zero measure also according to Pθk . The set of global optimizers has however measure 1 according to π (∞) . The distance Pθk − π (∞) T V is then constantly 1 for any finite k. It is also worth mentioning that if the set of global optimizers has zero measure then asymptotic convergence to the zero-temperature distribution π (∞) can be proven only under the additional assumptions of continuity and differentiability of U [12, 13, 14, 15]. 3 Finite-time guarantees In general, optimization algorithms for problems defined on continuous variables can only find approximate solutions in finite time [27]. Given an element θ of a continuous domain how can we assess how good it is as an approximate solution to an optimization problem? Here we introduce the concept of approximate global optimizer to answer this question. The definition is given for a maximization problem in a continuous but bounded domain. We use two parameters: the value imprecision ǫ (greater than or equal to 0) and the residual domain α (between 0 and 1) which together determine the level of approximation. We say that θ is an approximate global optimizer of U with value imprecision ǫ and residual domain α if the function U takes values strictly greater than U (θ) + ǫ only on a subset of values of θ no larger than an α portion of the optimization domain. The formal definition is as follows. Definition 1 Let U : Θ → R be an optimization criterion where Θ ⊂ RN is bounded. Let πLeb denote the standard Lebesgue measure. Let ǫ ≥ 0 and α ∈ [0, 1] be given numbers. Then θ is an approximate global optimizer of U with value imprecision ǫ and residual domain α if πLeb {θ′ ∈ Θ : U (θ′ ) > U (θ) + ǫ} ≤ α πLeb (Θ) . In other words, the value U (θ) is within ǫ of a value which is greater than the values that U takes on at least a 1 − α portion of the domain. The smaller ǫ and α are, the better is the approximation of a true global optimizer. If both α and ǫ are equal to zero then U (θ) coincides with the essential supremum of U . Our definition of approximate global optimizer carries an important property, which holds regardless of what the criterion U is: if ǫ and α have non-zero values then the set of approximate global optimizers always has non-zero Lebesgue measure. It follows that the probability that the chain visits the set of approximate global optimizers can be non-zero. 
Hence, it is sensible to study the confidence of the fact that the solution found by simulated annealing in finite time is an approximate global optimizer. Remark 3: The intuition that our notion of approximate global optimizer can be used to obtain formal guarantees on the finite-time performance of optimization methods based on a stochastic search of the domain is already apparent in the work of Vidyasagar [17, 28]. Vidyasagar [17, 28] introduces a similar definition and obtains rigorous finite-time guarantees in the optimization of ex- pected value criteria based on uniform independent sampling of the domain. Notably, the number of independent samples required to guarantee some desired accuracy and confidence turns out to be polynomial in the values of the desired imprecision, residual domain and confidence. Although the method of Vidyasagar is not highly sophisticated, it has had considerable success in solving difficult control system design applications [28, 29]. Its appeal stems from its rigorous finite-time guarantees which exist without the need for any particular assumption on the optimization criterion. Here we show that finite-time guarantees for simulated annealing can be obtained by selecting a distribution π (J) with a finite J as the target distribution in place of the zero-temperature distribution π (∞) . The fundamental result is the following theorem which allows one to select in a rigorous way δ and J in the target distribution π (J) . It is important to stress that the result holds universally for any optimization criterion U on a bounded domain. The only minor requirement is that U takes values in [0, 1]. Theorem 1 Let U : Θ → [0, 1] be an optimization criterion where Θ ⊂ RN is bounded. Let J ≥ 1 and δ > 0 be given numbers. Let θ be a multivariate random variable with distribution π (J) (dθ) ∝ [U (θ) + δ]J πLeb (dθ). Let α ∈ (0, 1] and ǫ ∈ [0, 1] be given numbers and define σ= 1 1+δ 1+ ǫ+1+δ J 1 1+δ 1+δ −1 α ǫ+δ δ . (1) Then the statement “θ is an approximate global optimizer of U with value imprecision ǫ and residual domain α” holds with probability at least σ. Proof. See Appendix A. The importance of the choice of a target distribution π (J) with a finite J is that π (J) is absolutely continuous with respect to the Lebesgue measure. Hence, the distance Pθk − π (J) TV between the distribution of the state of the chain Pθk and the target distribution π (J) is a meaningful quantity. Convergence of the Metropolis-Hastings algorithm and MCMC methods in total variation norm is a well studied problem. The theory provides simple conditions under which one derives upper bounds on the distance to the target distribution which are known at each step of the chain and decrease monotonically to zero as the number of steps of the chain grows. The theory has been developed mainly for homogeneous chains [18, 19, 20, 21]. In the case of simulated annealing, the factor that enables us to employ these results is the absolute continuity of the target distribution π (J) with respect to the Lebesgue measure. However, simulated annealing involves the simulation of inhomogeneous chains. In this respect, another important fact is that the choice of a target distribution π (J) with a finite J implies that the inhomogeneous Markov chain can in fact be formed by a finite sequence of homogeneous chains (i.e. the cooling schedule {Jk }k=1,2,... can be chosen to be a sequence that takes only a finite set of values). 
In turn, this allows one to apply the theory of homogeneous MCMC methods to study the convergence of Pθk to π (J) in total variation norm. On a bounded domain, simple conditions on the ‘proposal distribution’ in the iteration of the simulated annealing algorithm allows one to obtain upper bounds on Pθk − π (J) TV that decrease geometrically to zero as k → ∞, without the need for any additional assumption on U [18, 19, 20, 21]. It is then appropriate to introduce the following finite-time result. Theorem 2 Let the notation and assumptions of Theorem 1 hold. Let θk , with distribution Pθk , be the state of the inhomogeneous chain of a simulated annealing algorithm with target distribution π (J) . Then the statement “θk is an approximate global optimizer of U with value imprecision ǫ and residual domain α” holds with probability at least σ − Pθk − π (J) TV . The proof of the theorem follows directly from the definition of the total variation norm. It follows that if simulated annealing is implemented with an algorithm which converges in total variation distance to a target distribution π (J) with a finite J, then one can state with confidence arbitrarily close to 1 that the solution found by the algorithm after the known appropriate finite number of steps is an approximate global optimizer with the desired approximation level. For given non-zero values of ǫ, α the value of σ given by (1) can be made arbitrarily close to 1 by choice of J; while the distance Pθk − π (J) TV can be made arbitrarily small by taking the known sufficient number of steps. It can be shown that there exists the possibility of making an optimal choice of δ and J in the target distribution π (J) . In fact, for given ǫ and α and a given value of J there exists an optimal choice of δ which maximizes the value of σ given by (1). Hence, it is possible to obtain a desired σ with the smallest possible J. The advantage of choosing the smallest J, consistent with the required approximation and confidence, is that it will decrease the number of steps required to achieve the desired reduction of Pθk − π (J) TV . 4 Conclusions We have introduced a new formulation of simulated annealing which admits rigorous finite-time guarantees in the optimization of functions of continuous variables. First, we have introduced the notion of approximate global optimizer. Then, we have shown that simulated annealing is guaranteed to find approximate global optimizers, with the desired confidence and the desired level of accuracy, in a known finite number of steps, if a proper choice of the target distribution is made and conditions for convergence in total variation norm are met. The results hold for any optimization criterion on a bounded domain with the only minor requirement that it takes values between 0 and 1. In this framework, simulated annealing algorithms with rigorous finite-time guarantees can be derived by studying the choice of the proposal distribution and of the cooling schedule, in the generic iteration of simulated annealing, in order to ensure convergence to the target distribution in total variation norm. To do this, existing theory of convergence of the Metropolis-Hastings algorithm and MCMC methods on continuous domains can be used [18, 19, 20, 21]. 
Vidyasagar [17, 28] has introduced a similar definition of approximate global optimizer and has shown that approximate optimizers with desired accuracy and confidence can be obtained with a number of uniform independent samples of the domain which is polynomial in the accuracy and confidence parameters. In general, algorithms developed with the MCMC methodology can be expected to be equally or more efficient than uniform independent sampling. Acknowledgments Work supported by EPSRC, Grant EP/C014006/1, and by the European Commission under projects HYGEIA FP6-NEST-4995 and iFly FP6-TREN-037180. We thank S. Brooks, M. Vidyasagar and D. M. Wolpert for discussions and useful comments on the paper. A Proof of Theorem 1 Let α ∈ (0, 1] and ρ ∈ (0, 1] be given numbers. Let Uδ (θ) := U (θ) + δ. Let πδ be a normalized ¯ measure such that πδ (dθ) ∝ Uδ (θ)πLeb (dθ). In the first part of the proof we find a lower bound on the probability that θ belongs to the set {θ ∈ Θ : πδ {θ′ ∈ Θ : ρ Uδ (θ′ ) > Uδ (θ)} ≤ α} . ¯ Let yα := inf{y : πδ {θ ∈ Θ : Uδ (θ) ≤ y} ≥ 1 − α}. To start with we show that the set ¯ ¯ {θ ∈ Θ : πδ {θ′ ∈ Θ : ρ Uδ (θ′ ) > Uδ (θ)} ≤ α} coincides with {θ ∈ Θ : Uδ (θ) ≥ ρ yα }. Notice ¯ ¯ that the quantity πδ {θ ∈ Θ : Uδ (θ) ≤ y} is a right-continuous non-decreasing function of y because it has the form of a distribution function (see e.g. [30, p.162] and [17, Lemma 11.1]). Therefore we have πδ {θ ∈ Θ : Uδ (θ) ≤ yα } ≥ 1 − α and ¯ ¯ y ≥ ρ yα ⇒ πδ {θ′ ∈ Θ : ρ Uδ (θ′ ) ≤ y} ≥ 1 − α ⇒ πδ {θ′ ∈ Θ : ρ Uδ (θ′ ) > y} ≤ α . ¯ ¯ ¯ Moreover, y < ρ yα ⇒ πδ {θ′ ∈ Θ : ρ Uδ (θ′ ) ≤ y} < 1 − α ⇒ πδ {θ′ ∈ Θ : ρ Uδ (θ′ ) > y} > α ¯ ¯ ¯ and taking the contrapositive one obtains πδ {θ′ ∈ Θ : ρ Uδ (θ′ ) > y} ≤ α ⇒ y ≥ ρ yα . ¯ ¯ ′ Therefore {θ ∈ Θ : Uδ (θ) ≥ ρ yα } ≡ {θ ∈ Θ : πδ {θ ∈ Θ : ρ Uδ (θ′ ) > Uδ (θ)} ≤ α}. ¯ ¯ We now derive a lower bound on π (J) {θ ∈ Θ : Uδ (θ) ≥ ρ yα }. Let us introduce the notation ¯ ¯¯ Aα := {θ ∈ Θ : Uδ (θ) < yα }, Aα := {θ ∈ Θ : Uδ (θ) ≥ yα }, Bα,ρ := {θ ∈ Θ : Uδ (θ) < ρ yα } ¯ ¯ ¯ ¯ ¯ ¯α,ρ := {θ ∈ Θ : Uδ (θ) ≥ ρ yα }. Notice that Bα,ρ ⊆ Aα and Aα ⊆ Bα,ρ . The quantity ¯¯ ¯¯ and B ¯ ¯ ¯ ¯ πδ {θ ∈ Θ : Uδ (θ) < y} as a function of y is the left-continuous version of πδ {θ ∈ Θ : Uδ (θ) ≤ ¯¯ y}[30, p.162]. Hence, the definition of yα implies πδ (Aα ) ≤ 1 − α and πδ (Aα ) ≥ α. Notice that ¯ ¯ ¯ ¯ δπLeb (Aα ) ¯ ≤ 1−α, ¯ πδ (Aα ) ≤ 1 − α ⇒ ¯ ¯ Uδ (θ)πLeb (dθ) Θ ¯¯ πδ (Aα ) ≥ α ¯ ¯¯ (1 + δ)πLeb (Aα ) ≥ α. ¯ U (θ)πLeb (dθ) Θ δ ⇒ ¯¯ Hence, πLeb (Aα ) > 0 and πLeb (Aα ) 1−α1+δ ¯ ¯ . ¯α ) ≤ α ¯ δ πLeb (A ¯ ¯¯ ¯¯ Notice that πLeb (Aα ) > 0 implies πLeb (Bα,ρ ) > 0. We obtain π (J) {θ ∈ Θ : Uδ (θ) ≥ ρ yα } = ¯ 1+ 1 ≥ Uδ (θ)J πLeb (dθ) Bα,ρ ¯ ¯¯ Bα,ρ ≥ 1+ J ρ J yα ¯ J yα ¯ Uδ (θ)J πLeb (dθ) 1+ 1 Uδ (θ)J πLeb (dθ) Bα,ρ ¯ ¯¯ Aα Uδ (θ)J πLeb (dθ) 1 1 1 ≥ ≥ . 1−α1+δ ¯ πLeb (Aα ) πLeb (Bα,ρ ) ¯ ¯ 1 + ρJ 1 + ρJ ¯¯ ¯¯ α ¯ δ πLeb (Aα ) πLeb (Aα ) Since {θ ∈ Θ : Uδ (θ) ≥ ρ yα } ≡ {θ ∈ Θ : πδ {θ′ ∈ Θ : ρ Uδ (θ′ ) > Uδ (θ)} ≤ α} the first part of ¯ ¯ the proof is complete. In the second part of the proof we show that the set {θ ∈ Θ : πδ {θ′ ∈ Θ : ρ Uδ (θ′ ) > Uδ (θ)} ≤ α} is contained in the set of approximate global optimizers of U with value imprecision ¯ ǫ := (ρ−1 − 1)(1 + δ) and residual domain α := 1+δ α. Hence, we show that {θ ∈ Θ : πδ {θ′ ∈ ˜ ˜ ǫ+δ ¯ ˜ Θ : ρ Uδ (θ′ ) > Uδ (θ)} ≤ α} ⊆ {θ ∈ Θ : πLeb {θ′ ∈ Θ : U (θ′ ) > U (θ) + ǫ} ≤ α πLeb (Θ)}. 
We ¯ ˜ ˜ have U (θ′ ) > U (θ) + ǫ ⇔ ρ Uδ (θ′ ) > ρ [Uδ (θ) + ǫ] ⇒ ρ Uδ (θ′ ) > Uδ (θ) ˜ ˜ which is proven by noticing that ρ [Uδ (θ) + ǫ] ≥ Uδ (θ) ⇔ 1 − ρ ≥ U (θ)(1 − ρ) ˜ and U (θ) ∈ [0, 1]. Hence {θ′ ∈ Θ : ρ Uδ (θ′ ) > Uδ (θ)} ⊇ {θ′ ∈ Θ : U (θ′ ) > U (θ) + ǫ} . ˜ Therefore πδ {θ′ ∈ Θ : ρ Uδ (θ′ ) > Uδ (θ)} ≤ α ⇒ πδ {θ′ ∈ Θ : U (θ′ ) > U (θ) + ǫ} ≤ α . Let ¯ ˜ ¯ Qθ,˜ := {θ′ ∈ Θ : U (θ′ ) > U (θ) + ǫ} and notice that ˜ ǫ U (θ′ )πLeb (dθ′ ) + δπLeb (Qθ,˜) ǫ πδ {θ′ ∈ Θ : U (θ′ ) > U (θ) + ǫ} = ˜ Qθ,˜ ǫ . U (θ′ )πLeb (dθ′ ) + δπLeb (Θ) Θ We obtain πδ {θ′ ∈ Θ : U (θ′ ) > U (θ) + ǫ} ≤ α ⇒ ǫ πLeb (Qθ,˜) + δπLeb (Qθ,˜) ≤ α(1 + δ)πLeb (Θ) ˜ ¯ ˜ ¯ ǫ ǫ ⇒ πLeb {θ′ ∈ Θ : U (θ′ ) > U (θ) + ǫ} ≤ α πLeb (Θ) . ˜ ˜ Hence we can conclude that πδ {θ′ ∈ Θ : ρ Uδ (θ′ ) > Uδ (θ)} ≤ α ⇒ πLeb {θ′ ∈ Θ : U (θ′ ) > U (θ) + ǫ} ≤ α πLeb (Θ) ¯ ˜ ˜ and the second part of the proof is complete. We have shown that given α ∈ (0, 1], ρ ∈ (0, 1], ǫ := (ρ−1 − 1)(1 + δ), α := ¯ ˜ ˜ σ := 1 = 1−α1+δ ¯ 1+δ 1+ρ 1+ α ¯ δ ǫ+1+δ ˜ J 1+δ ǫ+δ ˜ 1 J 1 1+δ 1+δ −1 α ǫ+δ ˜˜ δ α and ¯ , the statement “θ is an approximate global optimizer of U with value imprecision ǫ and residual ˜ domain α” holds with probability at least σ. Notice that ǫ ∈ [0, 1] and α ∈ (0, 1] are linked through ˜ ˜ ˜ ǫ+δ ˜ a bijective relation to ρ ∈ [ 1+δ , 1] and α ∈ (0, 1+δ ]. The statement of the theorem is eventually ¯ 2+δ obtained by expressing σ as a function of desired ǫ = ǫ and α = α. ˜ ˜ References [1] D. J. Wales. Energy Landscapes. Cambridge University Press, Cambridge, UK, 2003. [2] D. Achlioptas, A. Naor, and Y. Peres. Rigorous location of phase transitions in hard optimization problems. Nature, 435:759–764, 2005. [3] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. 220(4598):671–680, 1983. Optimization by Simulated Annealing. Science, [4] E. Bonomi and J. Lutton. The N -city travelling salesman problem: statistical mechanics and the Metropolis algorithm. SIAM Rev., 26(4):551–568, 1984. [5] Y. Fu and P. W. Anderson. Application of statistical mechanics to NP-complete problems in combinatorial optimization. J. Phys. A: Math. Gen., 19(9):1605–1620, 1986. [6] M. M´ zard, G. Parisi, and R. Zecchina. Analytic and Algorithmic Solution of Random Satisfiability e Problems. Science, 297:812–815, 2002. [7] P. M. J. van Laarhoven and E. H. L. Aarts. Simulated Annealing: Theory and Applications. D. Reidel Publishing Company, Dordrecht, Holland, 1987. [8] D. Mitra, F. Romeo, and A. Sangiovanni-Vincentelli. Convergence and finite-time behavior of simulated annealing. Adv. Appl. Prob., 18:747–771, 1986. [9] B. Hajek. Cooling schedules for optimal annealing. Math. Oper. Res., 13:311–329, 1988. [10] J. Hannig, E. K. P. Chong, and S. R. Kulkarni. Relative Frequencies of Generalized Simulated Annealing. Math. Oper. Res., 31(1):199–216, 2006. [11] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, UK, 2004. [12] H. Haario and E. Saksman. Simulated annealing process in general state space. Adv. Appl. Prob., 23:866– 893, 1991. [13] S. B. Gelfand and S. K. Mitter. Simulated Annealing Type Algorithms for Multivariate Optimization. Algorithmica, 6:419–436, 1991. [14] C. Tsallis and D. A. Stariolo. Generalized simulated annealing. Physica A, 233:395–406, 1996. [15] M. Locatelli. Simulated Annealing Algorithms for Continuous Global Optimization: Convergence Conditions. J. Optimiz. Theory App., 104(1):121–133, 2000. [16] V. N. Vapnik. The Nature of Statistical Learning Theory. Cambridge University Press, Springer, New York, US, 1995. [17] M. 
References

[1] D. J. Wales. Energy Landscapes. Cambridge University Press, Cambridge, UK, 2003.
[2] D. Achlioptas, A. Naor, and Y. Peres. Rigorous location of phase transitions in hard optimization problems. Nature, 435:759–764, 2005.
[3] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by Simulated Annealing. Science, 220(4598):671–680, 1983.
[4] E. Bonomi and J. Lutton. The N-city travelling salesman problem: statistical mechanics and the Metropolis algorithm. SIAM Rev., 26(4):551–568, 1984.
[5] Y. Fu and P. W. Anderson. Application of statistical mechanics to NP-complete problems in combinatorial optimization. J. Phys. A: Math. Gen., 19(9):1605–1620, 1986.
[6] M. Mézard, G. Parisi, and R. Zecchina. Analytic and Algorithmic Solution of Random Satisfiability Problems. Science, 297:812–815, 2002.
[7] P. M. J. van Laarhoven and E. H. L. Aarts. Simulated Annealing: Theory and Applications. D. Reidel Publishing Company, Dordrecht, Holland, 1987.
[8] D. Mitra, F. Romeo, and A. Sangiovanni-Vincentelli. Convergence and finite-time behavior of simulated annealing. Adv. Appl. Prob., 18:747–771, 1986.
[9] B. Hajek. Cooling schedules for optimal annealing. Math. Oper. Res., 13:311–329, 1988.
[10] J. Hannig, E. K. P. Chong, and S. R. Kulkarni. Relative Frequencies of Generalized Simulated Annealing. Math. Oper. Res., 31(1):199–216, 2006.
[11] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, UK, 2004.
[12] H. Haario and E. Saksman. Simulated annealing process in general state space. Adv. Appl. Prob., 23:866–893, 1991.
[13] S. B. Gelfand and S. K. Mitter. Simulated Annealing Type Algorithms for Multivariate Optimization. Algorithmica, 6:419–436, 1991.
[14] C. Tsallis and D. A. Stariolo. Generalized simulated annealing. Physica A, 233:395–406, 1996.
[15] M. Locatelli. Simulated Annealing Algorithms for Continuous Global Optimization: Convergence Conditions. J. Optimiz. Theory App., 104(1):121–133, 2000.
[16] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1995.
[17] M. Vidyasagar. Learning and Generalization: With Application to Neural Networks. Springer-Verlag, London, second edition, 2003.
[18] S. P. Meyn and R. L. Tweedie. Markov Chains and Stochastic Stability. Springer-Verlag, London, 1993.
[19] J. S. Rosenthal. Minorization Conditions and Convergence Rates for Markov Chain Monte Carlo. J. Am. Stat. Assoc., 90(430):558–566, 1995.
[20] K. L. Mengersen and R. L. Tweedie. Rates of convergence of the Hastings and Metropolis algorithm. Ann. Stat., 24(1):101–121, 1996.
[21] G. O. Roberts and J. S. Rosenthal. General state space Markov chains and MCMC algorithms. Prob. Surv., 1:20–71, 2004.
[22] C. P. Robert and G. Casella. Monte Carlo Statistical Methods. Springer-Verlag, New York, second edition, 2004.
[23] D. J. Spiegelhalter, K. R. Abrams, and J. P. Myles. Bayesian approaches to clinical trials and health-care evaluation. John Wiley & Sons, Chichester, UK, 2004.
[24] A. Lecchini-Visintini, W. Glover, J. Lygeros, and J. M. Maciejowski. Monte Carlo Optimization for Conflict Resolution in Air Traffic Control. IEEE Trans. Intell. Transp. Syst., 7(4):470–482, 2006.
[25] P. Müller. Simulation based optimal design. In J. O. Berger, J. M. Bernardo, A. P. Dawid, and A. F. M. Smith, editors, Bayesian Statistics 6: proceedings of the Sixth Valencia International Meeting, pages 459–474. Oxford: Clarendon Press, 1999.
[26] P. Müller, B. Sansó, and M. De Iorio. Optimal Bayesian design by Inhomogeneous Markov Chain Simulation. J. Am. Stat. Assoc., 99(467):788–798, 2004.
[27] L. Blum, F. Cucker, M. Shub, and S. Smale. Complexity and Real Computation. Springer-Verlag, New York, 1998.
[28] M. Vidyasagar. Randomized algorithms for robust controller synthesis using statistical learning theory. Automatica, 37(10):1515–1528, 2001.
[29] R. Tempo, G. Calafiore, and F. Dabbene. Randomized Algorithms for Analysis and Control of Uncertain Systems. Springer-Verlag, London, 2005.
[30] B. V. Gnedenko. Theory of Probability. Chelsea, New York, fourth edition, 1968.
6 0.43051377 159 nips-2007-Progressive mixture rules are deviation suboptimal
7 0.41469508 7 nips-2007-A Kernel Statistical Test of Independence
8 0.40055335 192 nips-2007-Testing for Homogeneity with Kernel Fisher Discriminant Analysis
9 0.38509938 44 nips-2007-Catching Up Faster in Bayesian Model Selection and Model Averaging
10 0.38341457 118 nips-2007-Learning with Transformation Invariant Kernels
11 0.36565948 67 nips-2007-Direct Importance Estimation with Model Selection and Its Application to Covariate Shift Adaptation
12 0.3554796 62 nips-2007-Convex Learning with Invariances
13 0.35276639 184 nips-2007-Stability Bounds for Non-i.i.d. Processes
14 0.31247357 193 nips-2007-The Distribution Family of Similarity Distances
15 0.3069016 129 nips-2007-Mining Internet-Scale Software Repositories
16 0.3064957 200 nips-2007-The Tradeoffs of Large Scale Learning
17 0.29843587 21 nips-2007-Adaptive Online Gradient Descent
18 0.28531986 66 nips-2007-Density Estimation under Independent Similarly Distributed Sampling Assumptions
19 0.28052002 120 nips-2007-Linear programming analysis of loopy belief propagation for weighted matching
20 0.28020436 42 nips-2007-CPR for CSPs: A Probabilistic Relaxation of Constraint Propagation
topicId topicWeight
[(5, 0.021), (13, 0.031), (16, 0.032), (19, 0.013), (21, 0.044), (34, 0.024), (35, 0.05), (47, 0.047), (49, 0.033), (67, 0.403), (83, 0.13), (90, 0.053)]
simIndex simValue paperId paperTitle
same-paper 1 0.69156289 101 nips-2007-How SVMs can estimate quantiles and the median
Author: Andreas Christmann, Ingo Steinwart
Abstract: We investigate quantile regression based on the pinball loss and the ǫ-insensitive loss. For the pinball loss a condition on the data-generating distribution P is given that ensures that the conditional quantiles are approximated with respect to · 1 . This result is then used to derive an oracle inequality for an SVM based on the pinball loss. Moreover, we show that SVMs based on the ǫ-insensitive loss estimate the conditional median only under certain conditions on P . 1
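A quick numeric illustration of the abstract's starting point — that minimizing the expected pinball loss over constants recovers the τ-quantile — can be given in a few lines. This is a toy sketch, not the paper's SVM estimator; the sample size, distribution and search grid are arbitrary choices.

```python
import numpy as np

def pinball(y, t, tau):
    """tau-pinball loss L_tau(y, t) = psi_tau(y - t), with
    psi_tau(r) = tau * r for r >= 0 and (tau - 1) * r for r < 0."""
    r = y - t
    return np.where(r >= 0, tau * r, (tau - 1) * r)

# The constant minimizing the empirical pinball risk should be close to the
# empirical tau-quantile; check on a skewed (exponential) sample.
rng = np.random.default_rng(0)
y = rng.exponential(scale=1.0, size=50_000)
tau = 0.8
grid = np.linspace(0.0, 5.0, 1001)
risks = np.array([pinball(y, t, tau).mean() for t in grid])
print(grid[risks.argmin()], np.quantile(y, tau))  # both near -ln(0.2) ~ 1.609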
2 0.48703894 209 nips-2007-Ultrafast Monte Carlo for Statistical Summations
Author: Charles L. Isbell, Michael P. Holmes, Alexander G. Gray
Abstract: Machine learning contains many computational bottlenecks in the form of nested summations over datasets. Kernel estimators and other methods are burdened by these expensive computations. Exact evaluation is typically O(n^2) or higher, which severely limits application to large datasets. We present a multi-stage stratified Monte Carlo method for approximating such summations with probabilistic relative error control. The essential idea is fast approximation by sampling in trees. This method differs from many previous scalability techniques (such as standard multi-tree methods) in that its error is stochastic, but we derive conditions for error control and demonstrate that they work. Further, we give a theoretical sample complexity for the method that is independent of dataset size, and show that this appears to hold in experiments, where speedups reach as high as 10^14, many orders of magnitude beyond the previous state of the art.
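To see the flavor of the approach, here is a deliberately naive single-stage sketch: plain Monte Carlo over the inner summation of a kernel sum, with a CLT-based relative-error stopping rule. It is an assumption-laden baseline only — the kernel, bandwidth and stopping constants are made up, and the paper's contribution is the stratified, multi-stage sampling in trees that makes this idea scale far beyond it.

```python
import numpy as np

def mc_kernel_sum(q, x, bw=0.5, eps=0.01, z=2.576, batch=2048, seed=0):
    """Naive Monte Carlo estimate of S = sum_j exp(-(q - x[j])^2 / (2 bw^2)),
    the inner summation of a kernel density estimate. Sampling stops once the
    CLT half-width is below a relative tolerance eps (z = 2.576 gives roughly
    99% confidence)."""
    rng = np.random.default_rng(seed)
    n = len(x)
    vals = np.empty(0)
    while True:
        idx = rng.integers(0, n, size=batch)
        vals = np.concatenate([vals, np.exp(-(q - x[idx]) ** 2 / (2 * bw ** 2))])
        est = n * vals.mean()                              # unbiased estimate of S
        half = z * n * vals.std(ddof=1) / np.sqrt(len(vals))
        if half <= eps * est:
            return est

x = np.random.default_rng(1).normal(size=1_000_000)
print(mc_kernel_sum(0.0, x), np.exp(-x ** 2 / 0.5).sum())  # estimate vs. exact
```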
3 0.37630016 186 nips-2007-Statistical Analysis of Semi-Supervised Regression
Author: Larry Wasserman, John D. Lafferty
Abstract: Semi-supervised methods use unlabeled data in addition to labeled data to construct predictors. While existing semi-supervised methods have shown some promising empirical performance, their development has been largely based on heuristics. In this paper we study semi-supervised learning from the viewpoint of minimax theory. Our first result shows that some common methods based on regularization using graph Laplacians do not lead to faster minimax rates of convergence. Thus, the estimators that use the unlabeled data do not have smaller risk than the estimators that use only labeled data. We then develop several new approaches that provably lead to improved performance. The statistical tools of minimax analysis are thus used to offer some new perspective on the problem of semi-supervised learning.
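For concreteness, a minimal version of the kind of graph-Laplacian regularized estimator the abstract refers to: fit the labeled points while penalizing roughness of the fitted function over a similarity graph built on labeled and unlabeled points together. This is a generic sketch with made-up data, bandwidth and regularization strength, not the authors' exact formulation.

```python
import numpy as np

# Solve  min_f  sum_{i<=l} (f_i - y_i)^2 + lam * f' L f,  L = D - W,
# where W is a similarity graph over all n points and only l carry labels.
rng = np.random.default_rng(0)
n, l, lam = 200, 20, 0.1
x = rng.uniform(0, 1, n)
y_true = np.sin(2 * np.pi * x)
y = y_true[:l] + 0.1 * rng.normal(size=l)            # only l noisy labels

W = np.exp(-(x[:, None] - x[None, :]) ** 2 / 0.01)   # Gaussian similarity graph
L = np.diag(W.sum(axis=1)) - W                       # combinatorial Laplacian

S = np.zeros((n, n)); S[:l, :l] = np.eye(l)          # selects the labeled points
f = np.linalg.solve(S + lam * L, np.concatenate([y, np.zeros(n - l)]))
print(np.abs(f - y_true).mean())                     # error on all n points
```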
4 0.3731409 40 nips-2007-Bundle Methods for Machine Learning
Author: Quoc V. Le, Alex J. Smola, S.v.n. Vishwanathan
Abstract: We present a globally convergent method for regularized risk minimization problems. Our method applies to Support Vector estimation, regression, Gaussian Processes, and any other regularized risk minimization setting which leads to a convex optimization problem. SVMPerf can be shown to be a special case of our approach. In addition to the unified framework we present tight convergence bounds, which show that our algorithm converges in O(1/ǫ) steps to precision ǫ for general convex problems and in O(log(1/ǫ)) steps for continuously differentiable problems. We demonstrate in experiments the performance of our approach.
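A stripped-down cutting-plane iteration of the kind bundle methods build on can be sketched as follows. The toy below is one-dimensional and solves the master problem by grid search purely to keep the sketch short; the data, parameters and tolerance are invented, and the paper's O(1/ǫ) and O(log(1/ǫ)) guarantees concern the proper bundle solver, not this simplification.

```python
import numpy as np

# Minimize f(w) = (lam/2) w^2 + R(w), R = mean hinge loss on toy 1-D data.
# Each step adds the linearization (cut) of R at the current iterate and
# minimizes the resulting piecewise-linear-plus-quadratic lower bound.
rng = np.random.default_rng(0)
X = rng.normal(size=500); Y = np.sign(X + 0.3 * rng.normal(size=500))
lam = 0.1

def R(w):
    """Empirical hinge risk and one subgradient at w."""
    m = 1 - Y * (w * X)
    return np.maximum(m, 0).mean(), (-(Y * X) * (m > 0)).mean()

cuts, w, grid = [], 0.0, np.linspace(-10, 10, 20001)
for it in range(30):
    r, g = R(w)
    cuts.append((g, r - g * w))                      # cut: R(v) >= g*v + b
    lower = lam / 2 * grid ** 2 + np.max([a * grid + b for a, b in cuts], axis=0)
    gap = (lam / 2 * w ** 2 + r) - lower.min()       # upper minus lower bound
    if gap < 1e-4:
        break
    w = grid[lower.argmin()]                         # minimize the master problem
print(w, lam / 2 * w ** 2 + R(w)[0])
```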
5 0.37262341 156 nips-2007-Predictive Matrix-Variate t Models
Author: Shenghuo Zhu, Kai Yu, Yihong Gong
Abstract: It is becoming increasingly important to learn from a partially-observed random matrix and predict its missing elements. We assume that the entire matrix is a single sample drawn from a matrix-variate t distribution and suggest a matrix-variate t model (MVTM) to predict those missing elements. We show that MVTM generalizes a range of known probabilistic models, and automatically performs model selection to encourage sparse predictive models. Due to the non-conjugacy of its prior, it is difficult to make predictions by computing the mode or mean of the posterior distribution. We suggest an optimization method that sequentially minimizes a convex upper-bound of the log-likelihood, which is very efficient and scalable. Experiments on toy data and the EachMovie dataset show good predictive accuracy of the model.
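The "sequentially minimize a convex upper bound" step is the classic majorize-minimize pattern. A scalar toy example (illustrative only, not the MVTM updates) shows the mechanics: majorize the non-convex part by a convex bound that is tight at the current point, minimize the bound exactly, and repeat.

```python
import numpy as np

# Toy objective: F(t) = (t - a)^2 + log(1 + t^2), non-convex in its second term.
# Since log is concave, log(1 + t^2) <= log(1 + tk^2) + (t^2 - tk^2)/(1 + tk^2),
# a convex quadratic bound tight at tk; each step minimizes it in closed form.
a, t = 2.0, 5.0
for k in range(50):
    t_new = a * (1 + t ** 2) / (2 + t ** 2)   # argmin of the quadratic bound
    if abs(t_new - t) < 1e-10:
        break
    t = t_new

F = lambda s: (s - a) ** 2 + np.log(1 + s ** 2)
print(t, F(t))  # F decreases monotonically along the iterates
```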
6 0.37231919 16 nips-2007-A learning framework for nearest neighbor search
7 0.37198457 190 nips-2007-Support Vector Machine Classification with Indefinite Kernels
8 0.37111121 21 nips-2007-Adaptive Online Gradient Descent
9 0.37045565 63 nips-2007-Convex Relaxations of Latent Variable Training
10 0.37042826 65 nips-2007-DIFFRAC: a discriminative and flexible framework for clustering
11 0.37028018 49 nips-2007-Colored Maximum Variance Unfolding
12 0.3692086 180 nips-2007-Sparse Feature Learning for Deep Belief Networks
13 0.36890098 134 nips-2007-Multi-Task Learning via Conic Programming
14 0.36865592 178 nips-2007-Simulated Annealing: Rigorous finite-time guarantees for optimization on continuous domains
15 0.3683154 116 nips-2007-Learning the structure of manifolds using random projections
16 0.36813876 177 nips-2007-Simplified Rules and Theoretical Analysis for Information Bottleneck Optimization and PCA with Spiking Neurons
17 0.36788404 175 nips-2007-Semi-Supervised Multitask Learning
18 0.36755246 46 nips-2007-Cluster Stability for Finite Samples
19 0.36744964 185 nips-2007-Stable Dual Dynamic Programming
20 0.36739159 200 nips-2007-The Tradeoffs of Large Scale Learning