jmlr jmlr2007 jmlr2007-20 knowledge-graph by maker-knowledge-mining

20 jmlr-2007-Combining PAC-Bayesian and Generic Chaining Bounds


Source: pdf

Author: Jean-Yves Audibert, Olivier Bousquet

Abstract: There exist many different generalization error bounds in statistical learning theory. Each of these bounds contains an improvement over the others for certain situations or algorithms. Our goal is, first, to underline the links between these bounds, and second, to combine the different improvements into a single bound. In particular we combine the PAC-Bayes approach introduced by McAllester (1998), which is interesting for randomized predictions, with the optimal union bound provided by the generic chaining technique developed by Fernique and Talagrand (see Talagrand, 1996), in a way that also takes into account the variance of the combined functions. We also show how this connects to Rademacher based bounds. Keywords: statistical learning theory, PAC-Bayes theorems, generalization error bounds

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 The way to perform this union bound optimally is now well mastered in empirical process theory. [sent-20, score-0.142]

2 Another direction in which the error bounds can be improved is by using the variance of the functions in order to control the fluctuations of the empirical error. [sent-30, score-0.152]

3 Our goal is to combine several of these improvements, bringing together the power of the majorizing measures as an optimal union bound technique and the power of the PAC-Bayesian bounds to handle randomized predictions efficiently in a way that is sensitive to the variance (localization) effect. [sent-34, score-0.418]

4 Such functions g, L and f will be respectively called prediction function, loss function and regret function. [sent-61, score-0.164]

5 To each pair of prediction functions g1 and g2, we associate the relative regret function f : Z → R defined as f(z) = L(g1(x), y) − L(g2(x), y). [sent-63, score-0.164]

6 For (relative) regret functions, P f is often called the (relative) risk. [sent-67, score-0.142]

7 Typical examples of such a class are based on regret functions or relative regret functions and are derived from a set of prediction functions. [sent-70, score-0.306]

8 By slight extension, any set of measurable real-valued functions defined on Z will be called a regret class. [sent-71, score-0.192]

9 Given a sequence of nested partitions (A j ), we can build a collection (S j ) j∈N of approximating subsets of F in the following way: for each j ∈ N, for each element A of A j , choose a unique element of F contained in A and define S j as the set of all chosen elements. [sent-77, score-0.166]

10 For two such measures, K(ρ, π) will denote their Kullback-Leibler divergence, defined as K(ρ, π) = ρ log(dρ/dπ) := ∫_F log[(dρ/dπ)(f)] ρ(df) when ρ is absolutely continuous with respect to π, and K(ρ, π) = +∞ otherwise. [sent-82, score-0.274]
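
As a concrete illustration of this definition, here is a minimal numerical sketch of K(ρ, π) in the discrete case, where ρ and π are probability vectors over a finite regret class; the class size and the example distributions are hypothetical.

```python
import numpy as np

def kl_divergence(rho, pi):
    """K(rho, pi) = sum_f rho(f) log(rho(f)/pi(f)); +infinity when rho is not
    absolutely continuous with respect to pi (discrete illustration only)."""
    rho, pi = np.asarray(rho, dtype=float), np.asarray(pi, dtype=float)
    if np.any((pi == 0) & (rho > 0)):
        return np.inf
    mask = rho > 0
    return float(np.sum(rho[mask] * np.log(rho[mask] / pi[mask])))

pi = np.full(8, 1 / 8)                             # uniform prior over 8 functions
rho = np.array([0.5, 0.3, 0.2, 0, 0, 0, 0, 0])     # posterior concentrated on 3 of them
print(kl_divergence(rho, pi))                      # positive
print(kl_divergence(pi, pi))                       # 0.0
```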

11 (Section 2: The Setting for Error Bounds) Generalization error bounds give upper bounds on the true (i.e., expected) error. [sent-87, score-0.14]

12 Another type of result that can be obtained is a relative risk bound where one compares the risk of the algorithm to the risk of a fixed prediction function g. [sent-102, score-0.172]

13 With the notation introduced above, denoting by fn the (relative) regret function associated to the prediction gn, we can rewrite both previous inequalities as follows: P fn − Pn fn ≤ B(n, β). (2) [sent-107, score-0.361]

14 The corresponding regret function is not the average of the regrets of the individual functions, which makes the analysis difficult. [sent-113, score-0.164]

15 In order to avoid this difficulty, it is common to first study randomized estimators and then relate the randomized estimators to the aggregated ones. [sent-114, score-0.155]

16 Hence, bounds for randomized estimators will have the form ρn P f ≤ ρn Pn f + B(n, β), (4) where ρn is the specific distribution chosen by the algorithm based on the data. [sent-118, score-0.158]

17 For example, if the algorithm picks functions in a class G whose associated regret class is F, statement (3) can be deduced from sup_{f∈F} (P f − Pn f) ≤ B(n, β). [sent-123, score-0.351]

18 From now on, the functions in the regret class F are assumed to be bounded. [sent-133, score-0.164]

19 The simplest form of the union bound gives that, with probability at least 1 − β, sup_{f∈F} (P f − Pn f) ≤ C √( [log |F| + log(β^{-1})] / n ). [sent-148, score-0.431]
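
A toy Monte Carlo check of this finite union bound (a sketch: the class, sample size, and the explicit 1/2 constant in the Hoeffding exponent are assumptions, and for simplicity each function's empirical mean is simulated independently, which the union bound does not require):

```python
import numpy as np

rng = np.random.default_rng(0)
n, card_F, beta, trials = 200, 50, 0.05, 2000

true_means = rng.uniform(0.2, 0.8, size=card_F)    # P f for each f in a finite class
# Hoeffding + union bound: sup_f (P f - Pn f) <= sqrt((log|F| + log(1/beta)) / (2n))
bound = np.sqrt((np.log(card_F) + np.log(1 / beta)) / (2 * n))

violations = 0
for _ in range(trials):
    Pn = rng.binomial(n, true_means) / n           # empirical means on a sample of size n
    violations += np.max(true_means - Pn) > bound
print(violations / trials, "<=", beta)             # observed failure rate vs. beta
```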

20 When the functions have values in {0; 1}, this is a finite set and the above union bound applies. [sent-159, score-0.141]

21 This idea was first used in Vapnik and Chervonenkis (1971) to obtain that, with probability at least 1 − β, sup_{f∈F} (P f − Pn f) ≤ C √( [log E2n N(F, 1/2n, d2n) + log(2β^{-1})] / n ). [sent-160, score-0.312]

22 The union bound can be refined by considering finite covers of the set of functions at different scales. [sent-175, score-0.142]

23 This is called the chaining technique, pioneered by Dudley (1984), since one constructs a chain of functions that approximate a given function more and more closely. [sent-176, score-0.189]

24 One has, with probability at least 1 − β, sup_{f∈F} (P f − Pn f) ≤ C [ (1/√n) En ∫_0^∞ √(log N(F, ε, dn)) dε + √(log(β^{-1})/n) ]. [sent-178, score-0.378]

25 Chained bounds are particularly useful when one has a tight control of the entropy numbers of the set F (see, for example, van der Vaart 1998 and van de Geer 2000). [sent-182, score-0.149]

26 It has been noticed by Fernique and Talagrand that it is possible to capture the complexity in a better way than using minimal covers by considering majorizing measures or generic chaining. [sent-184, score-0.177]

27 We can prove that, with probability at least 1 − β, sup_{f∈F} (P f − Pn f) ≤ C [ (1/√n) En sup_{f∈F} ∑_{j=1}^∞ r^{-j} √(log{1/π^{(j)}[A_j(f)]}) + √(log(β^{-1})/n) ]. [sent-191, score-0.33]

28 If one takes partitions induced by minimal covers of F at radii r^{-j}, and uniform measures concentrated at the centers of the balls, one has 1/π^{(j)}[A_j(f)] = N(F, r^{-j}, dn), so that (9) leads to a sum of terms of the form r^{-j} √(log N(F, r^{-j}, dn)), which allows one to recover (8). [sent-196, score-0.415]
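
A small numerical sketch of this recovery for toy covering numbers N(F, ε) = (1/ε)^d; the dimension d, the radius ratio r = 2, and the integration range are assumptions made for illustration.

```python
import numpy as np

d = 5
log_N = lambda eps: d * np.log(1.0 / eps)          # toy covering numbers N(F, eps) = eps^{-d}

# Dudley-type entropy integral  int_0^1 sqrt(log N(F, eps)) d(eps), via a Riemann sum.
eps = np.linspace(1e-4, 1.0, 100_000)
integral = float(np.sum(np.sqrt(log_N(eps[:-1])) * np.diff(eps)))

# Generic chaining sum at radii r^{-j} with r = 2 and uniform weights on ball centers,
# i.e. sum_j r^{-j} sqrt(log N(F, r^{-j})), as in the recovery of (8) from (9).
j = np.arange(1, 60)
chain_sum = float(np.sum(2.0 ** (-j) * np.sqrt(log_N(2.0 ** (-j)))))

print(integral, chain_sum)                         # same order of magnitude, as expected
```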

29 From the nested partitions (A_j), we can build a collection (S_j)_{j∈N} of approximating subsets of F in the following way: for each j ∈ N and each element A of A_j, choose a unique element of F contained in A, and define S_j as the set of all chosen elements. [sent-199, score-0.166]

30 It turns out that there are many ways to state the majorizing measure/generic chaining bound. [sent-208, score-0.248]

31 Each way involves a different geometric construction on the regret class F and a different notion of size, but most can be shown to be equivalent up to constant factors. [sent-209, score-0.142]

32 The general form of such bounds is sup_{f∈F} ∑_{i≥i0} F(f, i) G(f, i), where F(f, i) is a measure of the scale of the geometric object of order i containing f, while G(f, i) is a measure of the size of this object. [sent-210, score-0.225]

33 This would give the standard majorizing measure bound (where all the π^{(j)} are the same). [sent-219, score-0.142]

34 Using such an approach actually allows one to get rid of the measures π^{(j)}, the sum in the complexity term becoming ∑_j d(A_j(f)) √(log |A_j|). [sent-221, score-0.202]

35 The interested reader is referred to Talagrand (2005) for more information about the variants of the generic chaining bounds. [sent-223, score-0.216]

36 As we said before, the generic chaining bound is optimal up to constant factors. [sent-232, score-0.277]

37 More precisely, it is a tight upper bound for the quantity En sup_{f∈F} (P f − Pn f). [sent-233, score-0.251]

38 It is also a lower bound for this quantity but up to a log n factor. [sent-234, score-0.233]

39 Hence the generic chaining bound is as good (up to this log) as another well-known quantity in the learning theory community, the Rademacher average of F: (1/n) En Eσ sup_{f∈F} ∑_i σi f(Zi), where the σi are independent random signs (+1 or −1, each with probability 1/2). [sent-235, score-0.487]

40 One has, with probability at least 1 − β, sup_{f∈F} (P f − Pn f) ≤ C [ (1/n) En Eσ sup_{f∈F} ∑_{i=1}^n σi f(Zi) + √(log(β^{-1})/n) ]. [sent-237, score-0.33]
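
A Monte Carlo sketch of the empirical Rademacher average (1/n) Eσ sup_f ∑_i σi f(Zi) for a toy class of threshold functions f_t(z) = 1{z ≤ t}; the class, the sample, and the number of sign draws are illustrative assumptions.

```python
import numpy as np

def empirical_rademacher(Z, thresholds, n_mc=2000, seed=0):
    """Monte Carlo estimate of (1/n) E_sigma sup_f sum_i sigma_i f(Z_i)
    for the toy class f_t(z) = 1{z <= t}, t in `thresholds`."""
    rng = np.random.default_rng(seed)
    n = len(Z)
    F = (Z[:, None] <= thresholds[None, :]).astype(float)   # column j = values of f_{t_j} on the sample
    sigma = rng.choice([-1.0, 1.0], size=(n_mc, n))          # random signs
    sups = np.max(sigma @ F, axis=1)                         # sup_f sum_i sigma_i f(Z_i), per sign draw
    return float(sups.mean() / n)

Z = np.random.default_rng(1).uniform(size=200)
print(empirical_rademacher(Z, np.linspace(0, 1, 101)))       # roughly of order sqrt(log n / n) here
```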

41 Their result is that, with probability at least 1 − β, ∀ f ∈ F, P f − Pn f ≤ C √( P f [log E2n N(F, 1/2n, d2n) + log(4β^{-1})] / n ), (12) which also gives, with probability at least 1 − β, ∀ f ∈ F, P f − Pn f ≤ C [ √( Pn f [log E2n N(F, 1/2n, d2n) + log(4β^{-1})] / n ) + [log E2n N(F, 1/2n, d2n) + log(4β^{-1})] / n ]. [sent-252, score-0.451]

42 For a loss function L taking its values in [0; 1], the variance of a regret function f can be bounded successively by P f² and P f. [sent-260, score-0.179]

43 The variance bounds already have this property, but the bound depends only on Var f and not on where the function is in the space F. [sent-272, score-0.168]

44 The finite union bound can be directly extended to the countable case by introducing a probability distribution π over F which weights each function (McAllester, 1998) and gives, with probability at least 1 − β, ∀ f ∈ F, P f − Pn f ≤ C √( [log(1/π(f)) + log(β^{-1})] / n ). [sent-275, score-0.34]

45 The following PAC-Bayes bound (McAllester, 1999) refines the above bound since it has the form (for possibly uncountable F) ρn(P f − Pn f) ≤ C √( [K(ρn, π) + log n + log(β^{-1})] / n ). [sent-295, score-0.259]
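
A sketch of how the complexity term of this (reconstructed) PAC-Bayes bound behaves for a concentrated versus a spread-out posterior over a finite class; the constant C, the uniform prior, and the Gibbs-style posterior are illustrative assumptions.

```python
import numpy as np

def pac_bayes_rhs(rho, pi, n, beta, C=1.0):
    """C * sqrt((K(rho, pi) + log n + log(1/beta)) / n), the right-hand side of the
    reconstructed McAllester-style bound above (C is an unspecified constant)."""
    rho, pi = np.asarray(rho, dtype=float), np.asarray(pi, dtype=float)
    mask = rho > 0
    kl = float(np.sum(rho[mask] * np.log(rho[mask] / pi[mask])))
    return C * np.sqrt((kl + np.log(n) + np.log(1 / beta)) / n)

n, beta = 1000, 0.05
pi = np.full(100, 1 / 100)                                    # uniform prior on 100 functions
dirac = np.eye(100)[0]                                        # posterior on a single function
gibbs = np.exp(-np.arange(100) / 10.0); gibbs /= gibbs.sum()  # spread-out Gibbs-style posterior
print(pac_bayes_rhs(dirac, pi, n, beta))                      # KL = log 100: union-bound regime
print(pac_bayes_rhs(gibbs, pi, n, beta))                      # smaller KL: the PAC-Bayes gain
```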

46 To some extent, one can consider that the PAC-Bayes bound is a refined union bound where the gain happens when ρn is not concentrated on a single function (or, more precisely, when ρn has entropy larger than log n). [sent-297, score-0.364]

47 For example, if one defines a function Π : Z^{2n} → M+(F) that is almost exchangeable, in the sense that exchanging the values of zi and zi+n does not change the value of the function, then one gets, with probability at least 1 − β (over the random choice of a double sample), a bound involving log 1/Π(Z1, . [sent-308, score-0.365]

48 Converting this into an induction statement (comparing the empirical with the expected errors) gives, with probability 1 − β, for all f ∈ F, a bound on P f − Pn f of the form C E^n(·), again involving log 1/Π(Z1, . [sent-318, score-0.234]

49 If the following condition holds: lim_{j→+∞} sup_{f∈F} ∆n,j(f) = 0, P^{2n}-a.s. [sent-389, score-0.155]

50 Theorem 3 (Induction) With the above notation, if the following condition holds: limsup_{j→+∞} sup_{f∈F} E^n ∆n,j(f) ≤ 0, P^n-a.s. [sent-403, score-0.31]

51 Consider the class of regret functions F = {z → L[g(x), y] : g ∈ G ∪ {g}} . [sent-417, score-0.164]

52 (Supremum Bounds) We first show that we can derive from Theorem 3 a result similar to the generic chaining bound (9). [sent-458, score-0.278]

53 Then we have [ρn]_j = δ_{p_j(fn)} and K_j, defined in (15), reduces to log{1/π^{(j)}[A_j(fn)]} + log[j(j + 1)β^{-1}]. [sent-461, score-0.14]

54 We can take the partitions A_j to have diameter 2^{-j}, so that sup_{f∈F} d2n[f, p_j(f)] ≤ 2^{-j} and thus d_j ≤ 2^{-j+1}. [sent-462, score-0.229]

55 Proof As before, choose for ρn a distribution concentrated at a single function fn; then [ρn]_j = δ_{p_j(fn)} and thus K_j = log 1/π^{(j)}[A_j(fn)]. [sent-476, score-0.324]

56 It contains an optimal union bound, both in the sense of optimally taking into account the metric structure of the set of functions (via the majorizing measure approach) and in the sense of taking into account the averaging distribution. [sent-508, score-0.161]

57 (Proof of Our Main Results) The proof of our main results is inspired by previous work on PAC-Bayesian bounds (Catoni, 2003; Audibert, 2004b) and on generic chaining (Talagrand, 1996). [sent-515, score-0.322]

58 The additional step that is required to obtain the generic chaining aspect is to decompose the functions into a chain (as explained in Section 2). [sent-521, score-0.238]

59 This chaining part is actually done in two steps: in the first one (Section A. [sent-524, score-0.167]

60 4) we perform a non-uniform union bound with appropriately chosen weights in order to make the bound independent of λ, while in the second one (Section A. [sent-526, score-0.18]

61 Namely, one has the following Lemma 7 (Legendre transform of the KL-divergence): For any π-measurable function h : F → R and any probability distribution ρ on F, ρh ≤ log πe^h + K(ρ, π). [sent-534, score-0.157]

62 Proof We have ρh − K(ρ, π) − log πe^h = −K(ρ, (e^h / πe^h) · π) ≤ 0, with equality when ρ = (e^h / πe^h) · π. [sent-535, score-0.242]
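
A quick numerical sanity check of Lemma 7 in the finite case (the class size and the random draws are arbitrary): the inequality holds for any ρ, with equality at the Gibbs distribution ρ ∝ e^h π.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 20
pi = rng.dirichlet(np.ones(m))                     # prior pi on a finite class
h = rng.normal(size=m)                             # an arbitrary function h on the class
rho = rng.dirichlet(np.ones(m))                    # an arbitrary distribution rho

lhs = np.sum(rho * h)                                                  # rho h
rhs = np.log(np.sum(pi * np.exp(h))) + np.sum(rho * np.log(rho / pi))  # log(pi e^h) + K(rho, pi)
assert lhs <= rhs + 1e-12

gibbs = pi * np.exp(h); gibbs /= gibbs.sum()       # rho* = (e^h / pi e^h) . pi
print(np.sum(gibbs * h),
      np.log(np.sum(pi * np.exp(h))) + np.sum(gibbs * np.log(gibbs / pi)))
# the two printed numbers coincide: the equality case of the lemma
```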

63 . . . , Zn in a measurable way, we have for any β > 0, with probability at least 1 − β with respect to the sample distribution, ρh ≤ log E2n πe^h + K(ρ, π) + log(β^{-1}). [sent-542, score-0.185]

64 Also, with probability at least 1 − β with respect to the first sample distribution, E^n ρh ≤ log E2n πe^h + E^n K(ρ, π) + log(β^{-1}). [sent-543, score-0.157]

65 Proof From Markov’s inequality applied to the non-negative random variable πe^h, we obtain that for any t > 0, P(πe^h > t) ≤ (1/t) E2n πe^h; hence for any β > 0, with probability at least 1 − β with respect to the sample distribution, log πe^h ≤ log E2n πe^h + log(β^{-1}). [sent-544, score-0.368]

66 The second assertion can be proved in a similar way (applying Markov’s inequality conditionally to the second sample) and using the inequality E^n log ≤ log E^n. [sent-546, score-0.285]

67 Using a weighted union bound of (18), precisely using (18) for (λ, β) = (λ_k, w_k β) for k ∈ N, we get that, with probability at least 1 − β, ρn ∆n,j ≤ B. [sent-580, score-0.187]

68 Taking λ_k = m e^{k/2} and w_k = 1/[k(k + 1)], we are going to prove that we achieve this target up to a multiplicative constant and an additive log log term. [sent-584, score-0.302]

69 The inequality λ_{k*} e^{-1/2} ≤ λ* implies k* − 1 ≤ 2 log(λ*/m), and we have proved that, with probability at least 1 − β, ρn ∆n,j ≤ 2√2 e^{1/4} √(ρn d_j² K_j / n). [sent-590, score-0.231]

70 (Chaining the Inequalities) Hence k* + 1 ≤ log(4e² n ρn d_j² / K_j), which yields a term of the form √(x log log(4e²/x)). [sent-594, score-0.411]

71 By simply using a union bound with weights equal to 1/[j(j + 1)], the previous inequality holds uniformly in j ∈ N*, provided that β is replaced with β/[j(j + 1)]; hence, to apply the result of the previous section, we need that β/2 ≤ e^{-1}. [sent-595, score-0.193]

72 (Additional Material) Lemma 11 For any random variable Z_f which is π ⊗ P measurable, we have log πe^{E[Z_f]} ≤ E log πe^{Z_f} ≤ log πE[e^{Z_f}]. [sent-605, score-0.439]

73 Proof By duality (Lemma 7), we have E log πe^{Z_f} = E sup_{ρn} [ρn Z_f − K(ρn, π)] ≥ sup_{ρn} [ρn E[Z_f] − K(ρn, π)] = log πe^{E[Z_f]}. [sent-608, score-0.292]

74 E log πe^{Z_f} = E sup_{ρn} [ρn Z_f − K(ρn, π)]; this gives the first inequality. [sent-609, score-0.292]

75 The second inequality follows from Jensen’s inequality (applied to the convex function − log) and Fubini’s theorem. [sent-610, score-0.148]

76 We say that a function φ has the bounded differences property if, for some constant c > 0, sup_{x1,...,xn,x'_i} |φ(x1, . . . , xi, . . . , xn) − φ(x1, . . . , x'_i, . . . , xn)| ≤ c. [sent-630, score-0.155]

77 For any t > 0 such that nt² ≥ 2(b − a)², P^n( sup_{f∈F} (P f − Pn f) ≥ t ) ≤ 2 P^{2n}( sup_{f∈F} (P′n f − Pn f) ≥ t/2 ). [sent-657, score-0.352]

78 We have 1{(P − Pn) fn > t} · 1{(P − P′n) fn < t/2} ≤ 1{(P′n − Pn) fn > t/2}. [sent-659, score-0.14]

79 By Chebyshev’s inequality, P((P − P′n) fn ≥ t/2) ≤ 4 Var fn / (nt²) ≤ (b − a)²/(nt²) ≤ 1/2; hence 1{(P − Pn) fn > t} ≤ 2 P[(P′n − Pn) f > t/2], and we conclude by taking expectations w.r.t. the first sample. [sent-666, score-0.245]

80 One can also slightly generalize the previous result to obtain: for any positive reals η and t, P^n( sup_{f∈F} (P f − Pn f) ≥ t ) ≤ P^{2n}( sup_{f∈F} (P′n f − Pn f) ≥ (1 − η)t ) / [1 − e^{−nη²t²/(2 Var f + 2(b−a)ηt/3)}]. [sent-670, score-0.31]

81 Lemma 17 (Symmetrization for expectations, Giné and Zinn, 1984) For any set F of functions, En sup_{f∈F} (P f − Pn f) ≤ (2/n) En Eσ sup_{f∈F} ∑_{i=1}^n σi f(Zi). [sent-671, score-0.183]

82 n f ∈F f ∈F i=1 Proof We have = En sup E n [Pn f ] − Pn f En sup P f − Pn f f ∈F f ∈F ≤ E2n sup Pn f − Pn f = ≤ f ∈F 2n E sup 1 σ ( f (Z ) − E σ i n∑ i f ∈F 2 n n E Eσ sup ∑ σi f (Zi ). [sent-672, score-0.775]

83 Putting β = 2 E2n|F_{Z,Z′}| e^{−nt²/8}, and using that Pσ(·) ≤ e^{−nt²/8} uniformly over b ∈ {−1; 0; +1}^n, we get that with probability at least 1 − β, ∀ f ∈ F, P f − Pn f ≤ √( 8 [log E2n N(F, 1/n, d2n) + log(2β^{-1})] / n ). [sent-703, score-0.396]

84 The probability that (13) does not hold is upper bounded by ∑_{f∈F} π(f) β = β. [sent-707, score-0.157]

85 (Proof of Inequality (11)) To prove (11), we replace Hoeffding’s inequality with Bernstein’s inequality (see Section B). [sent-710, score-0.148]

86 Using Jensen’s inequality and Lemma 7, we get [ρn(P f − Pn f)_+]² ≤ ρn[(P f − Pn f)_+]² ≤ (c/n) [ log πe^{(n/c)[(P f − Pn f)_+]²} + K(ρn, π) ]. [sent-717, score-0.231]

87 Taking β = [(1 − e^{−n(2c−1)/c})/(2c − 1)] e^{−t}, we obtain, with probability at least 1 − β, ρn(P f − Pn f) ≤ √2 √( [K(ρn, π) + log(2n + 1) + log(1/β)] / n ), and for any c > 1/2, with probability at least 1 − β, ρn(P f − Pn f) ≤ √(2c) √( [K(ρn, π) + log(c/(2c − 1)) + log(1/β)] / n ). [sent-719, score-0.314]

88 (Proof of Inequality (10)) From Theorem 15 and Lemma 17, we obtain that, with probability at least 1 − β, sup_{f∈F} (P f − Pn f) ≤ (2/n) En Eσ sup_{f∈F} ∑_i σi f(Zi) + √(log(β^{-1})/(2n)). [sent-722, score-0.33]

89 We have Eσ sup_{f∈F} ∑_i σi f(Zi) = Eσ sup [ ∑_i σi h*_i + ∑_{i,k} σi (h_i^{(k)} − h_i^{(k−1)}) ] ≤ 1 + Eσ sup ∑_{i,k} σi (h_i^{(k)} − h_i^{(k−1)}) ≤ 1 + ∑_k Eσ sup ∑_i σi (h_i^{(k)} − h_i^{(k−1)}) ≤ 1 + ∑_k e_k. [sent-738, score-0.183]

90 From the following lemma, using that for any (h, h̄) ∈ N_k the distance between h and h̄ is at most 3 × 2^{-k}, we get e_k ≤ 3 × 2^{-k} √(4n log N(F, 2^{-k}, dn)). [sent-739, score-0.251]

91 Optimizing the parameter λ, the upper bound becomes c √(2n log J). [sent-758, score-0.198]

92 Then we have ∑_k e_k ≤ 12√n ∑_k 2^{−(k+1)} √(log N(F, 2^{−k}, dn)) ≤ 12√n ∫_0^1 √(log N(F, r, dn)) dr. [sent-759, score-0.434]

93 To conclude, the chaining trick showed that Rademacher averages are bounded by the Koltchinskii-Pollard integral. [sent-760, score-0.167]

94 We want to prove that the Rademacher averages are bounded by C sup_{f∈F} ∑_{j=1}^∞ r^{−j} √(log{1/π^{(j)}[A_j(f)]}). [sent-765, score-0.155]

95 From the Cauchy-Schwarz inequality, we have sup σ,Z1 ,. [sent-771, score-0.155]

96 Define the quantities a_j(f) := r^{−j+1} √(2n log[2/π^{(j)}(f)]) and M := sup_{f∈F} ∑_{j≥1} a_j[p_j(f)]. [sent-776, score-0.155]

97 We have Eσ exp[λ(X_f − X_g)] = ∏_{i=1}^n Eσ exp{σi λ[f(Zi) − g(Zi)]} = ∏_{i=1}^n cosh{λ[f(Zi) − g(Zi)]} ≤ ∏_{i=1}^n exp{λ²[f(Zi) − g(Zi)]²/2} = e^{nλ² dn²(f,g)/2}; hence for any u > 0, Pσ(X_f − X_g ≥ u) ≤ e^{−u²/[2n dn²(f,g)]}, and Pσ(sup_{f∈F} X_f − X_{f0} ≥ uM) ≤ ∑_{j≥1, v∈S_j} Pσ(X_v − X_{p_{j−1}(v)} ≥ u a_j(v)). [sent-777, score-0.221]
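
A numerical sketch of the two steps above: the elementary bound cosh x ≤ exp(x²/2) and the resulting sub-Gaussian tail of X_f − X_g = ∑_i σi[f(Zi) − g(Zi)]; the sample of differences f(Zi) − g(Zi) used here is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

x = np.linspace(-10, 10, 2001)                     # cosh(x) <= exp(x^2 / 2) for all real x
assert np.all(np.cosh(x) <= np.exp(x ** 2 / 2) + 1e-9)

n = 500
diff = rng.uniform(-0.2, 0.2, size=n)              # f(Z_i) - g(Z_i) on a fixed sample
dn2 = float(np.mean(diff ** 2))                    # d_n^2(f, g)
sigma = rng.choice([-1.0, 1.0], size=(20_000, n))
X = sigma @ diff                                   # draws of X_f - X_g over the random signs
u = 2.0 * np.sqrt(n * dn2)                         # a two-standard-deviation threshold
print((X >= u).mean(), "<=", np.exp(-u ** 2 / (2 * n * dn2)))   # empirical tail vs. the bound
```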

98 Then for any u ≥ 1, we get ∑_{j≥1, v∈S_j} Pσ(X_v − X_{p_{j−1}(v)} ≥ u a_j(v)) ≤ ∑_{j≥1, v∈S_j} e^{−u² a_j²(v)/[2n Diam² A_{j−1}(v)]} ≤ ∑_{j≥1, v∈S_j} e^{−u² log[2/π^{(j)}(v)]} ≤ 2^{1−u²}, since [π^{(j)}(v)]^{u²} ≤ π^{(j)}(v) for u ≥ 1. [sent-778, score-0.157]

99 We obtain Eσ sup_{f∈F} X_f = Eσ sup_{f∈F} (X_f − X_{f0}) ≤ M ∫_0^{+∞} 2^{1−u²} du ≤ 2M. [sent-779, score-0.31]

100 By plugging in the definitions of X_f and M, we obtain Eσ sup_{f∈F} ∑_i σi f(Zi) ≤ 4√n sup_{f∈F} ∑_{j≥1} r^{−j+1} √(log[2/π^{(j)}(p_j(f))]); from (19) and by using the above, we obtain the desired result. [sent-781, score-0.31]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('pn', 0.72), ('chaining', 0.167), ('haining', 0.163), ('sup', 0.155), ('ounds', 0.153), ('eneric', 0.151), ('ombining', 0.151), ('ousquet', 0.151), ('audibert', 0.145), ('regret', 0.142), ('log', 0.137), ('nequality', 0.116), ('zi', 0.113), ('eh', 0.105), ('rademacher', 0.105), ('zn', 0.1), ('talagrand', 0.089), ('roof', 0.089), ('majorizing', 0.081), ('inequality', 0.074), ('kj', 0.072), ('fn', 0.07), ('bounds', 0.07), ('randomized', 0.067), ('dn', 0.066), ('en', 0.065), ('nk', 0.063), ('bound', 0.061), ('exchangeable', 0.059), ('transduction', 0.059), ('union', 0.058), ('nested', 0.056), ('symmetrization', 0.056), ('concentration', 0.054), ('partitions', 0.052), ('generic', 0.049), ('concentrated', 0.047), ('lemma', 0.046), ('var', 0.045), ('mcallester', 0.044), ('countable', 0.044), ('nt', 0.042), ('inequalities', 0.041), ('boucheron', 0.04), ('probabilit', 0.039), ('radius', 0.039), ('gn', 0.038), ('risk', 0.037), ('variance', 0.037), ('paris', 0.037), ('balls', 0.036), ('double', 0.036), ('proof', 0.036), ('chervonenkis', 0.035), ('quantity', 0.035), ('der', 0.033), ('statement', 0.032), ('hoeffding', 0.031), ('bartlett', 0.031), ('xn', 0.031), ('atoires', 0.03), ('catoni', 0.03), ('massart', 0.03), ('element', 0.029), ('wk', 0.028), ('supremum', 0.028), ('localized', 0.028), ('ek', 0.028), ('measurable', 0.028), ('ee', 0.026), ('panchenko', 0.026), ('cosh', 0.026), ('sj', 0.026), ('logarithmic', 0.025), ('laboratoire', 0.024), ('xa', 0.024), ('measures', 0.024), ('van', 0.023), ('ependent', 0.023), ('fernique', 0.023), ('koltchinskiipollard', 0.023), ('llester', 0.023), ('empirical', 0.023), ('covers', 0.023), ('vapnik', 0.023), ('diameter', 0.022), ('vaart', 0.022), ('induction', 0.022), ('functions', 0.022), ('theorem', 0.022), ('coming', 0.021), ('rid', 0.021), ('nets', 0.021), ('expectations', 0.021), ('estimators', 0.021), ('remark', 0.021), ('probability', 0.02), ('get', 0.02), ('combine', 0.02), ('expectation', 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 20 jmlr-2007-Combining PAC-Bayesian and Generic Chaining Bounds

Author: Jean-Yves Audibert, Olivier Bousquet

Abstract: There exist many different generalization error bounds in statistical learning theory. Each of these bounds contains an improvement over the others for certain situations or algorithms. Our goal is, first, to underline the links between these bounds, and second, to combine the different improvements into a single bound. In particular we combine the PAC-Bayes approach introduced by McAllester (1998), which is interesting for randomized predictions, with the optimal union bound provided by the generic chaining technique developed by Fernique and Talagrand (see Talagrand, 1996), in a way that also takes into account the variance of the combined functions. We also show how this connects to Rademacher based bounds. Keywords: statistical learning theory, PAC-Bayes theorems, generalization error bounds

2 0.22951017 7 jmlr-2007-A Stochastic Algorithm for Feature Selection in Pattern Recognition

Author: Sébastien Gadat, Laurent Younes

Abstract: We introduce a new model addressing feature selection from a large dictionary of variables that can be computed from a signal or an image. Features are extracted according to an efficiency criterion, on the basis of specified classification or recognition tasks. This is done by estimating a probability distribution P on the complete dictionary, which distributes its mass over the more efficient, or informative, components. We implement a stochastic gradient descent algorithm, using the probability as a state variable and optimizing a multi-task goodness of fit criterion for classifiers based on variable randomly chosen according to P. We then generate classifiers from the optimal distribution of weights learned on the training set. The method is first tested on several pattern recognition problems including face detection, handwritten digit recognition, spam classification and micro-array analysis. We then compare our approach with other step-wise algorithms like random forests or recursive feature elimination. Keywords: stochastic learning algorithms, Robbins-Monro application, pattern recognition, classification algorithm, feature selection

3 0.13690047 73 jmlr-2007-Revised Loss Bounds for the Set Covering Machine and Sample-Compression Loss Bounds for Imbalanced Data

Author: Zakria Hussain, François Laviolette, Mario Marchand, John Shawe-Taylor, Spencer Charles Brubaker, Matthew D. Mullin

Abstract: Marchand and Shawe-Taylor (2002) have proposed a loss bound for the set covering machine that has the property to depend on the observed fraction of positive examples and on what the classifier achieves on the positive training examples. We show that this loss bound is incorrect. We then propose a loss bound, valid for any sample-compression learning algorithm (including the set covering machine), that depends on the observed fraction of positive examples and on what the classifier achieves on them. We also compare numerically the loss bound proposed in this paper with the incorrect bound, the original SCM bound and a recently proposed loss bound of Marchand and Sokolova (2005) (which does not depend on the observed fraction of positive examples) and show that the latter loss bounds can be substantially larger than the new bound in the presence of imbalanced misclassifications. Keywords: set covering machines, sample-compression, loss bounds

4 0.10211773 34 jmlr-2007-From External to Internal Regret     (Special Topic on the Conference on Learning Theory 2005)

Author: Avrim Blum, Yishay Mansour

Abstract: External regret compares the performance of an online algorithm, selecting among N actions, to the performance of the best of those actions in hindsight. Internal regret compares the loss of an online algorithm to the loss of a modified online algorithm, which consistently replaces one action by another. In this paper we give a simple generic reduction that, given an algorithm for the external regret problem, converts it to an efficient online algorithm for the internal regret problem. We provide methods that work both in the full information model, in which the loss of every action is observed at each time step, and the partial information (bandit) model, where at each time step only the loss of the selected action is observed. The importance of internal regret in game theory is due to the fact that in a general game, if each player has sublinear internal regret, then the empirical frequencies converge to a correlated equilibrium. For external regret we also derive a quantitative regret bound for a very general setting of regret, which includes an arbitrary set of modification rules (that possibly modify the online algorithm) and an arbitrary set of time selection functions (each giving different weight to each time step). The regret for a given time selection and modification rule is the difference between the cost of the online algorithm and the cost of the modified online algorithm, where the costs are weighted by the time selection function. This can be viewed as a generalization of the previously-studied sleeping experts setting. Keywords: online learning, internal regret, external regret, multi-arm bandit, sleeping experts, reductions

5 0.091224894 45 jmlr-2007-Learnability of Gaussians with Flexible Variances

Author: Yiming Ying, Ding-Xuan Zhou

Abstract: Gaussian kernels with flexible variances provide a rich family of Mercer kernels for learning algorithms. We show that the union of the unit balls of reproducing kernel Hilbert spaces generated by Gaussian kernels with flexible variances is a uniform Glivenko-Cantelli (uGC) class. This result confirms a conjecture concerning learnability of Gaussian kernels and verifies the uniform convergence of many learning algorithms involving Gaussians with changing variances. Rademacher averages and empirical covering numbers are used to estimate sample errors of multi-kernel regularization schemes associated with general loss functions. It is then shown that the regularization error associated with the least square loss and the Gaussian kernels can be greatly improved when flexible variances are allowed. Finally, for regularization schemes generated by Gaussian kernels with flexible variances we present explicit learning rates for regression with least square loss and classification with hinge loss. Keywords: Gaussian kernel, flexible variances, learning theory, Glivenko-Cantelli class, regularization scheme, empirical covering number

6 0.069418952 9 jmlr-2007-AdaBoost is Consistent

7 0.06772083 89 jmlr-2007-VC Theory of Large Margin Multi-Category Classifiers     (Special Topic on Model Selection)

8 0.06560351 61 jmlr-2007-On the Consistency of Multiclass Classification Methods     (Special Topic on the Conference on Learning Theory 2005)

9 0.060130626 31 jmlr-2007-Estimating High-Dimensional Directed Acyclic Graphs with the PC-Algorithm

10 0.059936892 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning

11 0.054747779 38 jmlr-2007-Graph Laplacians and their Convergence on Random Neighborhood Graphs     (Special Topic on the Conference on Learning Theory 2005)

12 0.053272147 70 jmlr-2007-Ranking the Best Instances

13 0.052614637 36 jmlr-2007-Generalization Error Bounds in Semi-supervised Classification Under the Cluster Assumption

14 0.048724685 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time

15 0.043855857 60 jmlr-2007-Nonlinear Estimators and Tail Bounds for Dimension Reduction inl1Using Cauchy Random Projections

16 0.040912431 4 jmlr-2007-A New Probabilistic Approach in Rank Regression with Optimal Bayesian Partitioning     (Special Topic on Model Selection)

17 0.040730081 83 jmlr-2007-The On-Line Shortest Path Problem Under Partial Monitoring

18 0.039308771 55 jmlr-2007-Minimax Regret Classifier for Imprecise Class Distributions

19 0.038147151 75 jmlr-2007-Sparseness vs Estimating Conditional Probabilities: Some Asymptotic Results

20 0.035540473 90 jmlr-2007-Value Regularization and Fenchel Duality


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.242), (1, -0.173), (2, -0.068), (3, -0.13), (4, 0.092), (5, 0.083), (6, 0.077), (7, -0.184), (8, 0.363), (9, 0.289), (10, -0.084), (11, 0.187), (12, 0.227), (13, 0.044), (14, -0.056), (15, 0.07), (16, 0.208), (17, 0.039), (18, -0.079), (19, 0.134), (20, 0.026), (21, -0.094), (22, -0.049), (23, 0.021), (24, -0.024), (25, -0.072), (26, -0.016), (27, -0.06), (28, 0.043), (29, -0.077), (30, -0.019), (31, -0.036), (32, -0.066), (33, -0.055), (34, 0.058), (35, 0.005), (36, -0.007), (37, 0.036), (38, 0.057), (39, -0.019), (40, 0.021), (41, -0.043), (42, 0.082), (43, -0.071), (44, -0.061), (45, 0.117), (46, -0.072), (47, -0.004), (48, -0.018), (49, -0.02)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96911162 20 jmlr-2007-Combining PAC-Bayesian and Generic Chaining Bounds

Author: Jean-Yves Audibert, Olivier Bousquet

Abstract: There exist many different generalization error bounds in statistical learning theory. Each of these bounds contains an improvement over the others for certain situations or algorithms. Our goal is, first, to underline the links between these bounds, and second, to combine the different improvements into a single bound. In particular we combine the PAC-Bayes approach introduced by McAllester (1998), which is interesting for randomized predictions, with the optimal union bound provided by the generic chaining technique developed by Fernique and Talagrand (see Talagrand, 1996), in a way that also takes into account the variance of the combined functions. We also show how this connects to Rademacher based bounds. Keywords: statistical learning theory, PAC-Bayes theorems, generalization error bounds

2 0.71708894 7 jmlr-2007-A Stochastic Algorithm for Feature Selection in Pattern Recognition

Author: Sébastien Gadat, Laurent Younes

Abstract: We introduce a new model addressing feature selection from a large dictionary of variables that can be computed from a signal or an image. Features are extracted according to an efficiency criterion, on the basis of specified classification or recognition tasks. This is done by estimating a probability distribution P on the complete dictionary, which distributes its mass over the more efficient, or informative, components. We implement a stochastic gradient descent algorithm, using the probability as a state variable and optimizing a multi-task goodness of fit criterion for classifiers based on variable randomly chosen according to P. We then generate classifiers from the optimal distribution of weights learned on the training set. The method is first tested on several pattern recognition problems including face detection, handwritten digit recognition, spam classification and micro-array analysis. We then compare our approach with other step-wise algorithms like random forests or recursive feature elimination. Keywords: stochastic learning algorithms, Robbins-Monro application, pattern recognition, classification algorithm, feature selection

3 0.51734233 73 jmlr-2007-Revised Loss Bounds for the Set Covering Machine and Sample-Compression Loss Bounds for Imbalanced Data

Author: Zakria Hussain, François Laviolette, Mario Marchand, John Shawe-Taylor, Spencer Charles Brubaker, Matthew D. Mullin

Abstract: Marchand and Shawe-Taylor (2002) have proposed a loss bound for the set covering machine that has the property to depend on the observed fraction of positive examples and on what the classifier achieves on the positive training examples. We show that this loss bound is incorrect. We then propose a loss bound, valid for any sample-compression learning algorithm (including the set covering machine), that depends on the observed fraction of positive examples and on what the classifier achieves on them. We also compare numerically the loss bound proposed in this paper with the incorrect bound, the original SCM bound and a recently proposed loss bound of Marchand and Sokolova (2005) (which does not depend on the observed fraction of positive examples) and show that the latter loss bounds can be substantially larger than the new bound in the presence of imbalanced misclassifications. Keywords: set covering machines, sample-compression, loss bounds

4 0.37268692 34 jmlr-2007-From External to Internal Regret     (Special Topic on the Conference on Learning Theory 2005)

Author: Avrim Blum, Yishay Mansour

Abstract: External regret compares the performance of an online algorithm, selecting among N actions, to the performance of the best of those actions in hindsight. Internal regret compares the loss of an online algorithm to the loss of a modified online algorithm, which consistently replaces one action by another. In this paper we give a simple generic reduction that, given an algorithm for the external regret problem, converts it to an efficient online algorithm for the internal regret problem. We provide methods that work both in the full information model, in which the loss of every action is observed at each time step, and the partial information (bandit) model, where at each time step only the loss of the selected action is observed. The importance of internal regret in game theory is due to the fact that in a general game, if each player has sublinear internal regret, then the empirical frequencies converge to a correlated equilibrium. For external regret we also derive a quantitative regret bound for a very general setting of regret, which includes an arbitrary set of modification rules (that possibly modify the online algorithm) and an arbitrary set of time selection functions (each giving different weight to each time step). The regret for a given time selection and modification rule is the difference between the cost of the online algorithm and the cost of the modified online algorithm, where the costs are weighted by the time selection function. This can be viewed as a generalization of the previously-studied sleeping experts setting. Keywords: online learning, internal regret, external regret, multi-arm bandit, sleeping experts, reductions

5 0.27542073 61 jmlr-2007-On the Consistency of Multiclass Classification Methods     (Special Topic on the Conference on Learning Theory 2005)

Author: Ambuj Tewari, Peter L. Bartlett

Abstract: Binary classification is a well studied special case of the classification problem. Statistical properties of binary classifiers, such as consistency, have been investigated in a variety of settings. Binary classification methods can be generalized in many ways to handle multiple classes. It turns out that one can lose consistency in generalizing a binary classification method to deal with multiple classes. We study a rich family of multiclass methods and provide a necessary and sufficient condition for their consistency. We illustrate our approach by applying it to some multiclass methods proposed in the literature. Keywords: multiclass classification, consistency, Bayes risk

6 0.26096192 45 jmlr-2007-Learnability of Gaussians with Flexible Variances

7 0.24633557 36 jmlr-2007-Generalization Error Bounds in Semi-supervised Classification Under the Cluster Assumption

8 0.22160842 60 jmlr-2007-Nonlinear Estimators and Tail Bounds for Dimension Reduction inl1Using Cauchy Random Projections

9 0.21520759 9 jmlr-2007-AdaBoost is Consistent

10 0.20667595 12 jmlr-2007-Attribute-Efficient and Non-adaptive Learning of Parities and DNF Expressions     (Special Topic on the Conference on Learning Theory 2005)

11 0.2026509 31 jmlr-2007-Estimating High-Dimensional Directed Acyclic Graphs with the PC-Algorithm

12 0.1972979 75 jmlr-2007-Sparseness vs Estimating Conditional Probabilities: Some Asymptotic Results

13 0.18942563 89 jmlr-2007-VC Theory of Large Margin Multi-Category Classifiers     (Special Topic on Model Selection)

14 0.18580541 70 jmlr-2007-Ranking the Best Instances

15 0.1856835 53 jmlr-2007-Maximum Entropy Density Estimation with Generalized Regularization and an Application to Species Distribution Modeling

16 0.18456972 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time

17 0.18397821 38 jmlr-2007-Graph Laplacians and their Convergence on Random Neighborhood Graphs     (Special Topic on the Conference on Learning Theory 2005)

18 0.17736636 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning

19 0.15689832 74 jmlr-2007-Separating Models of Learning from Correlated and Uncorrelated Data     (Special Topic on the Conference on Learning Theory 2005)

20 0.15306069 21 jmlr-2007-Comments on the "Core Vector Machines: Fast SVM Training on Very Large Data Sets"


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(4, 0.011), (8, 0.038), (10, 0.022), (12, 0.036), (13, 0.322), (15, 0.013), (22, 0.011), (28, 0.082), (40, 0.047), (45, 0.017), (48, 0.037), (60, 0.083), (80, 0.011), (85, 0.05), (98, 0.144)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.7358892 20 jmlr-2007-Combining PAC-Bayesian and Generic Chaining Bounds

Author: Jean-Yves Audibert, Olivier Bousquet

Abstract: There exist many different generalization error bounds in statistical learning theory. Each of these bounds contains an improvement over the others for certain situations or algorithms. Our goal is, first, to underline the links between these bounds, and second, to combine the different improvements into a single bound. In particular we combine the PAC-Bayes approach introduced by McAllester (1998), which is interesting for randomized predictions, with the optimal union bound provided by the generic chaining technique developed by Fernique and Talagrand (see Talagrand, 1996), in a way that also takes into account the variance of the combined functions. We also show how this connects to Rademacher based bounds. Keywords: statistical learning theory, PAC-Bayes theorems, generalization error bounds

2 0.51187629 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time

Author: Roland Nilsson, José M. Peña, Johan Björkegren, Jesper Tegnér

Abstract: We analyze two different feature selection problems: finding a minimal feature set optimal for classification (MINIMAL - OPTIMAL) vs. finding all features relevant to the target variable (ALL RELEVANT ). The latter problem is motivated by recent applications within bioinformatics, particularly gene expression analysis. For both problems, we identify classes of data distributions for which there exist consistent, polynomial-time algorithms. We also prove that ALL - RELEVANT is much harder than MINIMAL - OPTIMAL and propose two consistent, polynomial-time algorithms. We argue that the distribution classes considered are reasonable in many practical cases, so that our results simplify feature selection in a wide range of machine learning tasks. Keywords: learning theory, relevance, classification, Markov blanket, bioinformatics

3 0.50273901 36 jmlr-2007-Generalization Error Bounds in Semi-supervised Classification Under the Cluster Assumption

Author: Philippe Rigollet

Abstract: We consider semi-supervised classification when part of the available data is unlabeled. These unlabeled data can be useful for the classification problem when we make an assumption relating the behavior of the regression function to that of the marginal distribution. Seeger (2000) proposed the well-known cluster assumption as a reasonable one. We propose a mathematical formulation of this assumption and a method based on density level sets estimation that takes advantage of it to achieve fast rates of convergence both in the number of unlabeled examples and the number of labeled examples. Keywords: semi-supervised learning, statistical learning theory, classification, cluster assumption, generalization bounds

4 0.50139314 63 jmlr-2007-On the Representer Theorem and Equivalent Degrees of Freedom of SVR

Author: Francesco Dinuzzo, Marta Neve, Giuseppe De Nicolao, Ugo Pietro Gianazza

Abstract: Support Vector Regression (SVR) for discrete data is considered. An alternative formulation of the representer theorem is derived. This result is based on the newly introduced notion of pseudoresidual and the use of subdifferential calculus. The representer theorem is exploited to analyze the sensitivity properties of ε-insensitive SVR and introduce the notion of approximate degrees of freedom. The degrees of freedom are shown to play a key role in the evaluation of the optimism, that is the difference between the expected in-sample error and the expected empirical risk. In this way, it is possible to define a C p -like statistic that can be used for tuning the parameters of SVR. The proposed tuning procedure is tested on a simulated benchmark problem and on a real world problem (Boston Housing data set). Keywords: statistical learning, reproducing kernel Hilbert spaces, support vector machines, representer theorem, regularization theory

5 0.50131297 53 jmlr-2007-Maximum Entropy Density Estimation with Generalized Regularization and an Application to Species Distribution Modeling

Author: Miroslav Dudík, Steven J. Phillips, Robert E. Schapire

Abstract: We present a unified and complete account of maximum entropy density estimation subject to constraints represented by convex potential functions or, alternatively, by convex regularization. We provide fully general performance guarantees and an algorithm with a complete convergence proof. As special cases, we easily derive performance guarantees for many known regularization types, including 1 , 2 , 2 , and 1 + 2 style regularization. We propose an algorithm solving a large and 2 2 general subclass of generalized maximum entropy problems, including all discussed in the paper, and prove its convergence. Our approach generalizes and unifies techniques based on information geometry and Bregman divergences as well as those based more directly on compactness. Our work is motivated by a novel application of maximum entropy to species distribution modeling, an important problem in conservation biology and ecology. In a set of experiments on real-world data, we demonstrate the utility of maximum entropy in this setting. We explore effects of different feature types, sample sizes, and regularization levels on the performance of maxent, and discuss interpretability of the resulting models. Keywords: maximum entropy, density estimation, regularization, iterative scaling, species distribution modeling

6 0.50026333 37 jmlr-2007-GiniSupport Vector Machine: Quadratic Entropy Based Robust Multi-Class Probability Regression

7 0.49951369 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning

8 0.49849176 89 jmlr-2007-VC Theory of Large Margin Multi-Category Classifiers     (Special Topic on Model Selection)

9 0.49843171 61 jmlr-2007-On the Consistency of Multiclass Classification Methods     (Special Topic on the Conference on Learning Theory 2005)

10 0.49789172 5 jmlr-2007-A Nonparametric Statistical Approach to Clustering via Mode Identification

11 0.49621823 49 jmlr-2007-Learning to Classify Ordinal Data: The Data Replication Method

12 0.48994878 76 jmlr-2007-Spherical-Homoscedastic Distributions: The Equivalency of Spherical and Normal Distributions in Classification

13 0.48732215 64 jmlr-2007-Online Learning of Multiple Tasks with a Shared Loss

14 0.48623693 81 jmlr-2007-The Locally Weighted Bag of Words Framework for Document Representation

15 0.48576269 43 jmlr-2007-Integrating Naïve Bayes and FOIL

16 0.48477462 71 jmlr-2007-Refinable Kernels

17 0.48376679 46 jmlr-2007-Learning Equivariant Functions with Matrix Valued Kernels

18 0.48220316 58 jmlr-2007-Noise Tolerant Variants of the Perceptron Algorithm

19 0.48216623 12 jmlr-2007-Attribute-Efficient and Non-adaptive Learning of Parities and DNF Expressions     (Special Topic on the Conference on Learning Theory 2005)

20 0.48028189 55 jmlr-2007-Minimax Regret Classifier for Imprecise Class Distributions