jmlr jmlr2006 jmlr2006-84 knowledge-graph by maker-knowledge-mining

84 jmlr-2006-Stability Properties of Empirical Risk Minimization over Donsker Classes


Source: pdf

Author: Andrea Caponnetto, Alexander Rakhlin

Abstract: We study some stability properties of algorithms which minimize (or almost-minimize) empirical error over Donsker classes of functions. We show that, as the number n of samples grows, the L2 diameter of the set of almost-minimizers of empirical error with tolerance ξ(n) = o(n−1/2) converges to zero in probability. Hence, even in the case of multiple minimizers of expected error, as n increases it becomes less and less likely that adding a sample (or a number of samples) to the training set will result in a large jump to a new hypothesis. Moreover, under some assumptions on the entropy of the class, along with an assumption of Komlos-Major-Tusnady type, we derive a power rate of decay for the diameter of almost-minimizers. This rate, through an application of a uniform ratio limit inequality, is shown to govern the closeness of the expected errors of the almost-minimizers. In fact, under the above assumptions, the expected errors of almost-minimizers become closer with a rate strictly faster than n−1/2. Keywords: empirical risk minimization, empirical processes, stability, Donsker classes

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Center for Biological and Computational Learning, Massachusetts Institute of Technology, Cambridge, MA 02139, USA. Editor: Leslie Pack Kaelbling. Abstract: We study some stability properties of algorithms which minimize (or almost-minimize) empirical error over Donsker classes of functions. [sent-4, score-0.237]

2 We show that, as the number n of samples grows, the L2 diameter of the set of almost-minimizers of empirical error with tolerance ξ(n) = o(n−1/2) converges to zero in probability. [sent-5, score-0.192]

3 Hence, even in the case of multiple minimizers of expected error, as n increases it becomes less and less likely that adding a sample (or a number of samples) to the training set will result in a large jump to a new hypothesis. [sent-6, score-0.159]

4 Moreover, under some assumptions on the entropy of the class, along with an assumption of Komlos-Major-Tusnady type, we derive a power rate of decay for the diameter of almost-minimizers. [sent-7, score-0.243]

5 This rate, through an application of a uniform ratio limit inequality, is shown to govern the closeness of the expected errors of the almost-minimizers. [sent-8, score-0.106]

6 Keywords: empirical risk minimization, empirical processes, stability, Donsker classes. [sent-10, score-0.201]

7 Introduction: The empirical risk minimization (ERM) algorithm has been studied extensively in learning theory. [sent-11, score-0.099]

8 Tools from empirical process theory have been successfully applied, and, in particular, it has been shown that the localized Rademacher averages play an important role in studying the behavior of the ERM algorithm. [sent-15, score-0.107]

9 Algorithmic stability has been studied in recent years as an alternative to the classical (complexity-oriented) approach to deriving generalization bounds (Bousquet and Elisseeff, 2002; Kutin and Niyogi, 2002; Mukherjee et al.). [sent-19, score-0.135]

10 Motivation for studying algorithmic stability comes, in part, from the work of Devroye and Wagner (1979). [sent-23, score-0.183]

11 In fact we had to turn to empirical process theory for the mathematical tools necessary for studying stability of ERM. [sent-32, score-0.242]

12 (1998): the difficult learning problems seem to be the ones where two minimizers of the expected error exist and are far apart. [sent-35, score-0.121]

13 Although the present paper does not address the question of performance rates, it does shed some light on the behavior of ERM when two (or more) minimizers of expected error exist. [sent-36, score-0.121]

14 Our results imply that, under a certain weak condition on the class, as the expected performance of empirical minimizers approaches the best in the class with the addition of new samples, a jump to a different part of the function class becomes less and less likely. [sent-37, score-0.221]

15 Since ERM minimizes empirical error instead of expected error, it is reasonable to require that the two quantities become close uniformly over the class, as the number of examples grows. [sent-38, score-0.113]

16 Hence, ERM is a sound strategy only if the function class is uniform Glivenko-Cantelli, that is, it satisfies the uniform law of large numbers. [sent-39, score-0.102]

17 In particular, uniform Donsker and uniform Glivenko-Cantelli properties are equivalent in the case of binary-valued functions (and also equivalent to finiteness of VC dimension). [sent-43, score-0.102]

18 The central limit theorem for Donsker classes states a form of convergence of the empirical process to a Gaussian process with a specific covariance structure (see, for example, Dudley, 1999; van der Vaart and Wellner, 1996). [sent-44, score-0.392]

19 This structure is used in the proof of the main result of the paper to control the correlation of the empirical errors of ERM minimizers on similar samples. [sent-45, score-0.177]

20 Section 6 combines the results of Sections 4 and 5 and uses a uniform ratio limit theorem to obtain fast rates of decay on the deviations of expected errors of almost-ERM solutions, thus establishing strong expected error stability of ERM (see Mukherjee et al. [sent-51, score-0.404]

21 Definition 1 Given a sample S, fS := argmin_{f∈F} Pn f = argmin_{f∈F} (1/n) ∑_{i=1}^n f(Zi) is a minimizer of the empirical risk (empirical error), if the minimum exists. [sent-68, score-0.099]

22 Since an exact minimizer of the empirical risk might not exist, and also for algorithmic reasons, we consider the set of almost-minimizers of the empirical risk. [sent-69, score-0.187]

23 Definition 2 Given ξ ≥ 0 and S, define the set of almost empirical minimizers M_S^ξ = { f ∈ F : Pn f − inf_{g∈F} Pn g ≤ ξ } and define its diameter as diam M_S^ξ = sup_{f,g∈M_S^ξ} ‖f − g‖. [sent-70, score-0.624]
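
As a concrete, purely illustrative companion to Definitions 1 and 2 (not code from the paper), the sketch below computes an empirical minimizer, the set of almost-minimizers M_S^ξ, and its diameter for a small finite class; the threshold-loss class, the uniform sample, and the use of the L1 distance P|f_s − f_t| = |s − t| as the diameter's norm are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative finite class F: f_t(z) = |1{z >= t} - 1{z >= 0.5}|, the 0-1 loss
# of the threshold classifier at t against the target threshold 0.5.
thresholds = np.linspace(0.0, 1.0, 101)

def loss_matrix(z):
    """Row t, column i: value f_t(Z_i)."""
    return np.abs((z[None, :] >= thresholds[:, None]).astype(float)
                  - (z[None, :] >= 0.5).astype(float))

n = 1000
Z = rng.uniform(size=n)                    # the sample S = (Z_1, ..., Z_n)
Pn = loss_matrix(Z).mean(axis=1)           # empirical risks Pn f_t for every t

f_S = thresholds[np.argmin(Pn)]            # an empirical risk minimizer (Definition 1)

xi = n ** (-0.6)                           # tolerance xi(n) = o(n^{-1/2})
almost = thresholds[Pn <= Pn.min() + xi]   # the set M_S^xi (Definition 2)

# For this class under Uniform[0,1], P|f_s - f_t| = |s - t|, so the L1 diameter
# of M_S^xi is simply the spread of the retained thresholds.
diam = almost.max() - almost.min()
print(f"ERM threshold {f_S:.3f}, |M_S^xi| = {almost.size}, diameter ~ {diam:.3f}")
```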

24 Here we mention a few known results (see van der Vaart and Wellner 1996, Equation 2. [sent-81, score-0.137]

25 3) in terms of entropy and entropy with bracketing, which we define below (see van der Vaart and Wellner, 1996). [sent-84, score-0.257]

26 Definition 5 The covering number N(ε, F, ‖·‖) is the minimal number of balls {g : ‖g − f‖ < ε} of radius ε needed to cover the set F. [sent-85, score-0.113]

27 The entropy is the logarithm of the covering number. [sent-87, score-0.105]

28 The entropy with bracketing is the logarithm of the bracketing number. [sent-92, score-0.144]
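
For intuition only (this is not an algorithm from the paper), the covering number of Definition 5 can be upper-bounded numerically by a greedy construction once the functions are represented by their values on a grid; the grid, the threshold-loss class, and the empirical L2 norm below are assumptions of this sketch.

```python
import numpy as np

def greedy_covering_number(F, eps, norm=lambda v: np.sqrt(np.mean(v ** 2))):
    """Greedy upper bound on the covering number N(eps, F, ||.||): repeatedly
    declare an uncovered function a ball center and drop everything within eps."""
    remaining = list(range(len(F)))
    centers = []
    while remaining:
        c = remaining[0]
        centers.append(c)
        remaining = [j for j in remaining if norm(F[j] - F[c]) >= eps]
    return len(centers)

# Example class: threshold losses evaluated on a grid of z-values.
z = np.linspace(0.0, 1.0, 500)
thresholds = np.linspace(0.0, 1.0, 200)
F = np.abs((z[None, :] >= thresholds[:, None]).astype(float)
           - (z[None, :] >= 0.5).astype(float))

for eps in (0.4, 0.2, 0.1, 0.05):
    N = greedy_covering_number(F, eps)
    print(f"eps = {eps:4.2f}:  N <= {N:3d},  entropy log N ~ {np.log(N):.2f}")
```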

29 Proposition 8 If the envelope F of F is square integrable and ∫_0^∞ sup_Q √(log N(ε ‖F‖_{Q,2}, F, L2(Q))) dε < ∞, then F is P-Donsker for every P, that is, F is a universal Donsker class. [sent-94, score-0.361]
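
The entropy-integral condition of Proposition 8 is easy to check numerically for VC-type classes, whose covering numbers grow polynomially, say N(ε) ≤ (A/ε)^V; the values of V and A below are arbitrary choices for the illustration, and only the behaviour of the integrand near ε = 0 matters for finiteness, so the upper limit is truncated at 1.

```python
import numpy as np

def entropy_integral(V, A, lo=1e-8, num=200_000):
    """Trapezoid approximation of  int_lo^1 sqrt(V * log(A / eps)) d(eps)."""
    eps = np.linspace(lo, 1.0, num)
    g = np.sqrt(V * np.log(A / eps))
    return float(np.sum((g[1:] + g[:-1]) / 2.0 * np.diff(eps)))

for V in (1, 5, 20):
    # The integrand blows up only like sqrt(log(1/eps)) near zero, which is
    # integrable, so the value stabilizes as the lower cutoff shrinks.
    vals = [entropy_integral(V, A=3.0, lo=cut) for cut in (1e-4, 1e-6, 1e-8)]
    print(f"V = {V:2d}: integral ~ {vals[-1]:.3f} (cutoffs give {np.round(vals, 3)})")
```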

30 Giné and Zinn (1991) show that under Pollard's entropy condition, the {0,1}-valued class F is in fact uniform Donsker. [sent-102, score-0.111]

31 Definitions and results on various types of convergence, as well as ways to deal with measurability issues arising in the proofs, are based on the rigorous book of van der Vaart and Wellner (1996). [sent-109, score-0.163]

32 Corollary 11 The result of Theorem 10 holds if the diameter is defined with respect to the L2(P) norm. [sent-111, score-0.107]

33 1 in van der Vaart and Wellner (1996)) Let (Z , A , P) be a probability space. [sent-128, score-0.137]

34 The proof of Theorem 10 relies on the almost sure representation theorem (van der Vaart and Wellner, 1996, Theorem 1. [sent-132, score-0.137]

35 Then, if F is P-Donsker, the following inequality holds: Pr∗( diam M_S^ξ > C ) ≤ N(ε, F, ‖·‖)² · 128δ/C³ + Pr∗( sup_F |νn − ν| ≥ δ/2 ). [sent-147, score-0.277]

36 3 in van der Vaart and Wellner (1996) shows that when the limiting process is Borel measurable, almost uniform convergence implies convergence in outer probability. [sent-151, score-0.211]

37 Therefore, the first implication of Proposition 13 states that for any δ > 0, Pr∗( sup_F |νn − ν| > δ ) → 0. [sent-152, score-0.277]

38 By Lemma 14, Pr∗( diam M_S^{ξ(n)} > C ) ≤ N(ε, F, ‖·‖)² · 128δ/C³ + Pr∗( sup_F |νn − ν| ≥ δ/2 ) for any C > 0, ε = min(C³/128, C/4), and any δ ≥ ξ(n)√n. [sent-153, score-0.277]

39 In fact, since binary-valued function classes are uniform Donsker if and only if the VC dimension is finite, Corollary 15 proves that almost-ERM over binary VC classes possesses L1-stability. [sent-160, score-0.131]
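
A toy Monte Carlo experiment (constructed for this summary, not taken from the paper) suggests what this L1-stability looks like: with pure-noise labels every threshold classifier minimizes the expected error, yet the L1 diameter of the empirical almost-minimizers with tolerance ξ(n) = n^{−0.6} = o(n^{−1/2}) still tends to shrink as n grows.

```python
import numpy as np

rng = np.random.default_rng(1)
thresholds = np.linspace(0.0, 1.0, 401)

def almost_minimizer_diameter(n, xi_exponent=0.6):
    """L1 diameter of the almost-empirically-minimizing thresholds on one sample."""
    z = rng.uniform(size=n)
    y = rng.integers(0, 2, size=n)                 # pure noise: every threshold is optimal
    preds = z[None, :] >= thresholds[:, None]      # predictions of each threshold classifier
    emp_risk = (preds != y[None, :]).mean(axis=1)  # Pn f_t
    xi = n ** (-xi_exponent)                       # tolerance o(n^{-1/2})
    keep = thresholds[emp_risk <= emp_risk.min() + xi]
    # For thresholds and Uniform[0,1] inputs, P|f_s - f_t| = |s - t|, so the L1
    # diameter of the almost-minimizer set is the spread of the kept thresholds.
    return keep.max() - keep.min()

for n in (1_000, 10_000, 100_000):
    diam = np.mean([almost_minimizer_diameter(n) for _ in range(20)])
    print(f"n = {n:7d}:  average L1 diameter of M_S^xi ~ {diam:.3f}")
```

The observed decay is slow, which is consistent with the theme of the later sections: quantitative rates require additional assumptions such as the entropy and KMT-type conditions.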

40 Use of L1 -stability goes back to Devroye and Wagner (1979), who showed that this stability is sufficient to bound the difference between the leave-one-out error and the expected error of a learning algorithm. [sent-162, score-0.192]

41 1 Mn 1 ∑ fn (Zi ) ≤ n + n ∑ fn (Zi ) n i∈[n] i∈In ≤ Mn |In | + n n ≤ Mn |In | 1 + ξ (n) + ∑ fn (Zi ) n n n i∈In 1 ∑ g(Zi ) g∈F |In | i∈I n ξ (n) + inf ≤2 Mn |In | 1 + ξ (n) + ∑ fn (Zi ) n n n i∈[n] ≤2 Mn |In | 1 + ξ (n) + ξ(n) + inf ∑ g(Zi ). [sent-177, score-0.344]

42 Rates of Decay of diam M_S^{ξ(n)}: The statement of Lemma 14 reveals that the rate of decay of the diameter diam M_S^{ξ(n)} is related to the rate at which Pr∗( sup_F |ν − νn| ≥ δ ) → 0 for a fixed δ. [sent-182, score-0.183]

43 The following theorem shows that for KMT classes fulfilling a suitable entropy condition, it is possible to give explicit rates of decay for the diameter of ERM almost-minimizers. [sent-188, score-0.341]

44 Since F ∈ KMT(P; n^{−α}), Pr∗( sup_F |ν(n) − νn| ≥ n^{−α}(t + K log n) ) ≤ Λ e^{−θt} for any t > 0; choosing t = n^{α} δ(n)/2 − K log n we obtain Pr∗( sup_F |ν(n) − νn| ≥ δ(n)/2 ) ≤ Λ e^{−θ(n^{α−β}/2 − K log n)}. [sent-201, score-0.629]

45 The entropy condition in Theorem 17 is clearly verified by VC-subgraph classes of dimension V . [sent-204, score-0.1]

46 In fact, since the L2 norm dominates the ‖·‖ seminorm, upper bounds on L2 covering numbers of VC-subgraph classes induce analogous bounds on ‖·‖ covering numbers. [sent-205, score-0.13]

47 Expected Error Stability of almost-ERM: In the previous section, we proved bounds on the rate of decay of the diameter of almost-minimizers. [sent-212, score-0.183]

48 In this section, we show that given such a bound, as well as some additional conditions on the class, the differences between expected errors of almost-minimizers decay faster than n−1/2. [sent-213, score-0.105]

49 This implies a form of strong expected error stability for ERM. [sent-214, score-0.164]

50 Then Pr∗( sup_{f∈G} [ |Pn f − P f| − 26 ε (Pn|f| + P|f|) ] > 5γ ) ≤ 32 N(γ, G) exp(−nεγ). [sent-218, score-0.277]

51 The next theorem gives explicit rates for expected error stability of ERM over VC-subgraph classes fulfilling a KMT type condition. [sent-219, score-0.262]

52 Theorem 20 If F is a VC-subgraph class with VC-dimension V, F ∈ KMT(P; n^{−α}) and √n ξ(n) = o(n^{−η}), then for any κ < min( min(α, η)/(6(2V+1)), 1/2 ), n^{1/2+κ} sup_{f,f′∈M_S^{ξ(n)}} |P(f − f′)| → 0 in outer probability. [sent-220, score-0.277]

53 Conclusions: We presented some new results establishing stability properties of ERM over certain classes of functions. [sent-222, score-0.175]

54 This property follows directly from the main result of the paper, which shows decay (in probability) of the diameter of the set of solutions of almost-ERM with tolerance function ξ(n) = o(n−1/2). [sent-227, score-0.206]

55 With a view toward possible algorithmic applications, we analyzed some additional assumptions implying uniform rates of decay of the L1 diameter of almost-minimizers. [sent-232, score-0.283]

56 Results of this paper can be used to analyze stability of a class of clustering algorithms by casting them in the empirical risk minimization framework (see Rakhlin and Caponnetto, 2006). [sent-236, score-0.234]

57 For example, in the context of on-line learning, when a point is added to the training set, with high probability one would only have to search for empirical minimizers in a small L1 -ball around the current hypothesis, which might be a tractable problem. [sent-238, score-0.154]

58 Then for any ε ≤ 128 f0 inf h − sup h ≥ B ( f0 ,ε) B ( f1 ,ε) C2 . [sent-268, score-0.363]

59 16 Proof ∆ := inf h − sup h B ( f0 ,ε) B ( f1 ,ε) = h( f0 ) − h( f1 ) + inf{h( f − f0 ) + h( f1 − f )| f ∈ B ( f0 , ε), f ∈ B ( f1 , ε)} ≥ h( f0 ) − h( f1 ) − 2ε 8ε ≥ h( f0 ) − h( f1 ) − , C f0 since f0 ≥ C/4. [sent-269, score-0.363]

60 Then for all δ > 0, Pr∗( | sup_{B(f0,ε)} νµ − sup_{B(f1,ε)} νµ | ≤ δ ) ≤ 128δ/C³. [sent-275, score-0.554]

61 Define Γi(z) = sup_{B(fi,ε)} {Y(·) + h(·)z} with i = 0, 1. [sent-280, score-0.277]

62 Notice that Pr∗( | sup_{B(f0,ε)} νµ − sup_{B(f1,ε)} νµ | ≤ δ | Y ) = Pr∗( |Γ0(νµ(f0)) − Γ1(νµ(f0))| ≤ δ ). [sent-281, score-0.554]

63 Moreover, Γ0 and Γ1 are convex and inf ∂−Γ0 − sup ∂+Γ1 ≥ inf_{B(f0,ε)} h − sup_{B(f1,ε)} h ≥ C²/16, by Lemma 21. [sent-282, score-0.726]

64 The reasoning in the proof of the next lemma goes as follows. [sent-285, score-0.099]

65 Two far-apart almost-minimizers belong to two covering balls with centers far apart. [sent-288, score-0.135]

66 Because the two almost-minimizers belong to these balls, the infima of the empirical risks over these two balls are close. [sent-289, score-0.13]

67 This is translated into the event that the suprema of the shifted empirical process over these two balls are close. [sent-290, score-0.183]

68 By looking at the Gaussian limit process, we are able to exploit the covariance structure to show that the suprema of the Gaussian process over balls with centers far apart are unlikely to be close. [sent-291, score-0.195]
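
A small simulation (again only an illustration, not an argument from the paper) conveys this anti-concentration intuition: for an approximate Brownian bridge on a grid, the suprema over two well-separated index sets land within a small δ of each other noticeably less often than the suprema over two overlapping sets.

```python
import numpy as np

rng = np.random.default_rng(2)

def brownian_bridges(num_points, reps):
    """Sample `reps` approximate Brownian bridges on a grid inside (0, 1)."""
    t = np.linspace(0.0, 1.0, num_points + 2)[1:-1]
    dW = rng.normal(scale=np.sqrt(1.0 / (num_points + 1)), size=(reps, num_points))
    W = np.cumsum(dW, axis=1)
    return t, W - t[None, :] * W[:, -1:]    # B(t) = W(t) - t*W(1), up to grid effects

t, B = brownian_bridges(1000, reps=5_000)
delta = 0.05

far_a, far_b = (t > 0.05) & (t < 0.15), (t > 0.85) & (t < 0.95)    # far-apart sets
near_a, near_b = (t > 0.45) & (t < 0.55), (t > 0.50) & (t < 0.60)  # overlapping sets

def prob_suprema_close(mask_a, mask_b):
    gap = np.abs(B[:, mask_a].max(axis=1) - B[:, mask_b].max(axis=1))
    return (gap <= delta).mean()

print("far apart  :", prob_suprema_close(far_a, far_b))
print("overlapping:", prob_suprema_close(near_a, near_b))
```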

69 Such a covering exists because F is totally bounded in the ‖·‖ norm (see page 89, van der Vaart and Wellner, 1996). [sent-296, score-0.182]

70 Since ‖f − f′‖ > C, there exist k and l such that ‖f − fk‖ ≤ ε ≤ C/4 and ‖f′ − fl‖ ≤ ε ≤ C/4. [sent-299, score-0.24]

71 By the triangle inequality it follows that ‖fk − fl‖ ≥ C/2. [sent-300, score-0.24]

72 Moreover, inf_F Pn ≤ inf_{B(fk,ε)} Pn ≤ Pn f ≤ inf_F Pn + ξ and inf_F Pn ≤ inf_{B(fl,ε)} Pn ≤ Pn f′ ≤ inf_F Pn + ξ. [sent-301, score-0.631]

73 Therefore, | inf_{B(fk,ε)} Pn − inf_{B(fl,ε)} Pn | ≤ ξ. [sent-302, score-0.297]

74 The last relation can be restated in terms of the empirical process νn: | sup_{B(fk,ε)} {−νn − √n P} − sup_{B(fl,ε)} {−νn − √n P} | ≤ ξ√n ≤ δ. [sent-303, score-0.879]

75 Pr∗( diam M_S^ξ > C ) = Pr∗( ∃ f, f′ ∈ M_S^ξ, ‖f − f′‖ > C ) ≤ Pr∗( ∃ l, k s. [sent-304, score-0.24]

76 t. ‖fk − fl‖ ≥ C/2 and | sup_{B(fk,ε)} {−νn − √n P} − sup_{B(fl,ε)} {−νn − √n P} | ≤ δ ). [sent-306, score-0.794]

77 By the union bound, Pr∗( diam M_S^ξ > C ) ≤ ∑_{k,l=1,…,N(ε,F,‖·‖); ‖fk−fl‖≥C/2} Pr∗( | sup_{B(fk,ε)} {−νn − √n P} − sup_{B(fl,ε)} {−νn − √n P} | ≤ δ ). [sent-307, score-1.034]

78 We now want to bound the terms in the sum above. [sent-308, score-0.24]

79 The expected errors of almost-minimizers over a Glivenko-Cantelli (and therefore over Donsker) class are close because empirical averages uniformly converge to the expectations. [sent-314, score-0.113]

80 Let M_S^{ξ(n)} be defined as above with ξ(n) = o(n−1/2) and assume that for some sequence of positive numbers λ(n) = o(n1/2), λ(n) sup_{f,f′∈M_S^{ξ(n)}} P|f − f′| → 0 in outer probability. [sent-329, score-0.277]

81 Then Pr∗( √n sup_{f,f′∈M_S^{ξ(n)}} |P(f − f′)| > √n ξ(n) + 131 λ(n)^{ρ−1} ) → 0. (2) [sent-331, score-0.277]

82 7 of van der Vaart and Wellner (1996), G = (F ) + (−F ) and G = |G | ⊆ (G ∧ 0) ∨ (−G ∧ 0) are Donsker as well. [sent-335, score-0.137]

83 Applying Proposition 19 to the class G, we obtain Pr∗( sup_{f,f′∈F} [ |Pn(f − f′) − P(f − f′)| − 26 ε (Pn|f − f′| + P|f − f′|) ] > 5γ ) ≤ 32 N(γ/2, F)² exp(−nεγ). [sent-337, score-0.277]

84 The inequality therefore holds if the sup is taken over a smaller (random) subclass M_S^{ξ(n)}. [sent-338, score-0.277]

85 Since sup_x A(x)/B(x) ≥ sup_x A(x) / sup_x B(x), it follows that Pr∗( sup_{f,f′∈M_S^{ξ(n)}} |P(f − f′)| − ξ(n) > 26 sup_{f,f′∈M_S^{ξ(n)}} ε (Pn|f − f′| + P|f − f′|) + 5γ ) ≤ 32 N(γ/2, F)² exp(−nεγ). (3) [sent-340, score-1.276]

86 By assumption, λ(n) sup_{f,f′∈M_S^{ξ(n)}} P|f − f′| → 0 in outer probability. [sent-341, score-0.277]

87 Because G is Donsker and λ(n) = o(n1/2), λ(n) sup_{f,f′∈M_S^{ξ(n)}} | Pn|f − f′| − P|f − f′| | → 0 in outer probability. [sent-342, score-0.277]

88 Thus, λ(n) sup_{f,f′∈M_S^{ξ(n)}} ( Pn|f − f′| + P|f − f′| ) → 0 in outer probability. [sent-343, score-0.277]

89 Letting ε = ε(n) := n^{−1/2} λ(n)^{ρ}, this implies that for any δ > 0 there exists Nδ such that for all n > Nδ, Pr∗( √n sup_{f,f′∈M_S^{ξ(n)}} 26 ε(n) ( Pn|f − f′| + P|f − f′| ) > λ(n)^{ρ−1} ) < δ. [sent-344, score-0.277]

90 ξ(n) f , f ∈MS Now, choose γ = γ(n) := n−1/2 λ(n)ρ−1 (note that since ρ < 1, eventually 0 < γ(n) < 1), the last inequality can be rewritten in the following form  √ Pr∗  n sup f,f ξ(n) ∈MS  26 ε(n) Pn | f − f | + P| f − f | + 5γ(n) > 131λ(n)ρ−1  < δ. [sent-345, score-0.277]

91 Combining the relation above with Equation 3, Pr∗( √n sup_{f,f′∈M_S^{ξ(n)}} |P(f − f′)| ≤ √n ξ(n) + 131 λ(n)^{ρ−1} ) ≥ 1 − 32 N( (1/2) n^{−1/2} λ(n)^{ρ−1}, F )² exp(−λ(n)^{2ρ−1}) − δ. [sent-346, score-0.277]

92 First, we show that a power decay of the ‖·‖ diameter implies the same rate of decay of the L1 diameter, hence verifying condition (1) in Lemma 23. [sent-351, score-0.259]

93 Pr∗( ∃ f, f′ ∈ M_S^{ξ(n)}, P f − P f′ > C λ(n)^{−1}/√2 ) ≤ Pr∗( ∃ f, f′ ∈ F, Pn f − Pn f′ ≤ ξ(n), P f − P f′ > C λ(n)^{−1}/√2 ) ≤ Pr∗( sup_{f,f′∈F} |P(f − f′) − Pn(f − f′)| > (C/√2) λ(n)^{−1} − ξ(n) ) = Pr∗( λ(n) sup_{g∈G} |P g − Pn g| > C/√2 − ξ(n)λ(n) ) → 0, proving condition (1) in Lemma 23. [sent-361, score-0.554]

94 Since F is a VC-subgraph class of dimension V, its entropy numbers log N(ε, F) behave like V log(A/ε) (A is a constant), that is, log N( (1/2) n^{−1/2} λ(n)^{ρ−1}, F ) ≤ const + V log n + (1 − ρ)V log λ(n). [sent-363, score-0.185]

95 Hence, by Lemma 23, Pr∗( √n sup_{f,f′∈M_S^{ξ(n)}} |P(f − f′)| > √n ξ(n) + 131 n^{γ(ρ−1)} ) → 0. [sent-366, score-0.277]

96 We obtain Pr∗( n^{κ} √n sup_{f,f′∈M_S^{ξ(n)}} |P(f − f′)| > √n ξ(n) n^{κ} + 131 n^{γ(ρ−1)+κ} ) → 0. [sent-368, score-0.277]

97 Hence, n^{1/2+κ} sup_{f,f′∈M_S^{ξ(n)}} |P(f − f′)| → 0 in outer probability, for any κ < min( min(α, η)/(6(2V+1)), 1/2 ). [sent-374, score-0.277]
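
As a quick numerical sanity check of how these pieces fit together (with V, ρ, the growth exponent of λ(n), and the additive constant all being arbitrary illustrative choices, not values from the paper), one can evaluate the logarithm of the bound 32 N(γ(n)/2, F)² exp(−λ(n)^{2ρ−1}) using the entropy bound above; it becomes negative, and eventually plunges, once n is large, provided λ(n) grows polynomially and ρ > 1/2.

```python
import numpy as np

V, rho, c, const = 5, 0.9, 0.45, 10.0     # illustrative choices; lambda(n) = n**c, c < 1/2

def log_prob_bound(n):
    """log of  32 * N(gamma(n)/2, F)^2 * exp(-lambda(n)^(2*rho - 1)),  using the
    VC-subgraph entropy bound  log N <= const + V*log(n) + (1 - rho)*V*log(lambda(n))."""
    lam = n ** c
    log_N = const + V * np.log(n) + (1.0 - rho) * V * np.log(lam)
    return np.log(32.0) + 2.0 * log_N - lam ** (2.0 * rho - 1.0)

for n in (10**4, 10**6, 10**8, 10**10):
    print(f"n = 1e{int(np.log10(n)):2d}:  log of the bound ~ {log_prob_bound(n):10.1f}")
```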

98 Komlós-Major-Tusnády approximation for the general empirical process and Haar expansion of classes of functions. [sent-445, score-0.125]

99 Statistical learning: Stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization. [sent-467, score-0.099]

100 The necessary and sufficient conditions for consistency in the empirical risk minimization method. [sent-527, score-0.099]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('donsker', 0.392), ('pr', 0.291), ('sup', 0.277), ('diamms', 0.271), ('erm', 0.24), ('ms', 0.235), ('pn', 0.225), ('np', 0.183), ('kmt', 0.151), ('akhlin', 0.136), ('aponnetto', 0.136), ('onsker', 0.136), ('stability', 0.135), ('fl', 0.125), ('vaart', 0.115), ('lasses', 0.115), ('isk', 0.115), ('tability', 0.115), ('fk', 0.115), ('diameter', 0.107), ('rakhlin', 0.105), ('wellner', 0.105), ('inimization', 0.103), ('mpirical', 0.103), ('roperties', 0.103), ('minimizers', 0.092), ('zn', 0.086), ('inf', 0.086), ('der', 0.079), ('mukherjee', 0.078), ('decay', 0.076), ('zi', 0.07), ('balls', 0.068), ('empirical', 0.062), ('entropy', 0.06), ('dudley', 0.06), ('kutin', 0.06), ('pollard', 0.06), ('mn', 0.06), ('envelope', 0.059), ('van', 0.058), ('contract', 0.052), ('uniform', 0.051), ('lemma', 0.048), ('seminorm', 0.045), ('corollary', 0.045), ('covering', 0.045), ('devroye', 0.045), ('fn', 0.043), ('supx', 0.042), ('bracketing', 0.042), ('classes', 0.04), ('vc', 0.039), ('jump', 0.038), ('caponnetto', 0.038), ('risk', 0.037), ('niyogi', 0.037), ('theorem', 0.035), ('bridge', 0.034), ('poggio', 0.031), ('characterizations', 0.03), ('crcns', 0.03), ('rudelson', 0.03), ('supq', 0.03), ('suprema', 0.03), ('expected', 0.029), ('wagner', 0.029), ('measurable', 0.029), ('goes', 0.028), ('koltchinskii', 0.027), ('proposition', 0.027), ('limit', 0.026), ('covariance', 0.026), ('perturbations', 0.026), ('measurability', 0.026), ('brownian', 0.026), ('bracket', 0.026), ('algorithmic', 0.026), ('log', 0.025), ('tolerance', 0.023), ('gin', 0.023), ('naval', 0.023), ('andrea', 0.023), ('process', 0.023), ('rates', 0.023), ('proof', 0.023), ('studying', 0.022), ('centers', 0.022), ('bartlett', 0.022), ('uniformly', 0.022), ('borel', 0.021), ('pf', 0.021), ('raised', 0.021), ('central', 0.02), ('darpa', 0.02), ('mendelson', 0.02), ('eq', 0.02), ('possess', 0.02), ('lling', 0.02), ('au', 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999982 84 jmlr-2006-Stability Properties of Empirical Risk Minimization over Donsker Classes

Author: Andrea Caponnetto, Alexander Rakhlin

Abstract: We study some stability properties of algorithms which minimize (or almost-minimize) empirical error over Donsker classes of functions. We show that, as the number n of samples grows, the L2 diameter of the set of almost-minimizers of empirical error with tolerance ξ(n) = o(n−1/2) converges to zero in probability. Hence, even in the case of multiple minimizers of expected error, as n increases it becomes less and less likely that adding a sample (or a number of samples) to the training set will result in a large jump to a new hypothesis. Moreover, under some assumptions on the entropy of the class, along with an assumption of Komlos-Major-Tusnady type, we derive a power rate of decay for the diameter of almost-minimizers. This rate, through an application of a uniform ratio limit inequality, is shown to govern the closeness of the expected errors of the almost-minimizers. In fact, under the above assumptions, the expected errors of almost-minimizers become closer with a rate strictly faster than n−1/2. Keywords: empirical risk minimization, empirical processes, stability, Donsker classes

2 0.091062523 24 jmlr-2006-Consistency of Multiclass Empirical Risk Minimization Methods Based on Convex Loss

Author: Di-Rong Chen, Tao Sun

Abstract: The consistency of classification algorithm plays a central role in statistical learning theory. A consistent algorithm guarantees us that taking more samples essentially suffices to roughly reconstruct the unknown distribution. We consider the consistency of ERM scheme over classes of combinations of very simple rules (base classifiers) in multiclass classification. Our approach is, under some mild conditions, to establish a quantitative relationship between classification errors and convex risks. In comparison with the related previous work, the feature of our result is that the conditions are mainly expressed in terms of the differences between some values of the convex function. Keywords: multiclass classification, classifier, consistency, empirical risk minimization, constrained comparison method, Tsybakov noise condition

3 0.077830993 73 jmlr-2006-Pattern Recognition for Conditionally Independent Data

Author: Daniil Ryabko

Abstract: In this work we consider the task of relaxing the i.i.d. assumption in pattern recognition (or classification), aiming to make existing learning algorithms applicable to a wider range of tasks. Pattern recognition is guessing a discrete label of some object based on a set of given examples (pairs of objects and labels). We consider the case of deterministically defined labels. Traditionally, this task is studied under the assumption that examples are independent and identically distributed. However, it turns out that many results of pattern recognition theory carry over a weaker assumption. Namely, under the assumption of conditional independence and identical distribution of objects, while the only assumption on the distribution of labels is that the rate of occurrence of each label should be above some positive threshold. We find a broad class of learning algorithms for which estimations of the probability of the classification error achieved under the classical i.i.d. assumption can be generalized to the similar estimates for case of conditionally i.i.d. examples.

4 0.076871224 81 jmlr-2006-Some Discriminant-Based PAC Algorithms

Author: Paul W. Goldberg

Abstract: A classical approach in multi-class pattern classification is the following. Estimate the probability distributions that generated the observations for each label class, and then label new instances by applying the Bayes classifier to the estimated distributions. That approach provides more useful information than just a class label; it also provides estimates of the conditional distribution of class labels, in situations where there is class overlap. We would like to know whether it is harder to build accurate classifiers via this approach, than by techniques that may process all data with distinct labels together. In this paper we make that question precise by considering it in the context of PAC learnability. We propose two restrictions on the PAC learning framework that are intended to correspond with the above approach, and consider their relationship with standard PAC learning. Our main restriction of interest leads to some interesting algorithms that show that the restriction is not stronger (more restrictive) than various other well-known restrictions on PAC learning. An alternative slightly milder restriction turns out to be almost equivalent to unrestricted PAC learning. Keywords: computational learning theory, computational complexity, pattern classification

5 0.071558617 58 jmlr-2006-Lower Bounds and Aggregation in Density Estimation

Author: Guillaume Lecué

Abstract: In this paper we prove the optimality of an aggregation procedure. We prove lower bounds for aggregation of model selection type of M density estimators for the Kullback-Leibler divergence (KL), the Hellinger’s distance and the L1 -distance. The lower bound, with respect to the KL distance, can be achieved by the on-line type estimate suggested, among others, by Yang (2000a). Combining these results, we state that log M/n is an optimal rate of aggregation in the sense of Tsybakov (2003), where n is the sample size. Keywords: aggregation, optimal rates, Kullback-Leibler divergence

6 0.066193104 40 jmlr-2006-Infinite-σ Limits For Tikhonov Regularization

7 0.065127388 82 jmlr-2006-Some Theory for Generalized Boosting Algorithms

8 0.063774057 48 jmlr-2006-Learning Minimum Volume Sets

9 0.063100882 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification

10 0.050806887 23 jmlr-2006-Consistency and Convergence Rates of One-Class SVMs and Related Algorithms

11 0.047901548 9 jmlr-2006-Accurate Error Bounds for the Eigenvalues of the Kernel Matrix

12 0.044309776 35 jmlr-2006-Geometric Variance Reduction in Markov Chains: Application to Value Function and Gradient Estimation

13 0.040247049 79 jmlr-2006-Second Order Cone Programming Approaches for Handling Missing and Uncertain Data     (Special Topic on Machine Learning and Optimization)

14 0.03612889 28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting

15 0.035498753 5 jmlr-2006-A Robust Procedure For Gaussian Graphical Model Search From Microarray Data WithpLarger Thann

16 0.032846916 45 jmlr-2006-Learning Coordinate Covariances via Gradients

17 0.03006814 26 jmlr-2006-Efficient Learning of Label Ranking by Soft Projections onto Polyhedra     (Special Topic on Machine Learning and Optimization)

18 0.029847624 13 jmlr-2006-Adaptive Prototype Learning Algorithms: Theoretical and Experimental Studies

19 0.027493376 76 jmlr-2006-QP Algorithms with Guaranteed Accuracy and Run Time for Support Vector Machines

20 0.027157608 16 jmlr-2006-Bounds for Linear Multi-Task Learning


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.154), (1, -0.045), (2, -0.065), (3, -0.242), (4, -0.044), (5, 0.19), (6, -0.044), (7, 0.085), (8, 0.009), (9, 0.068), (10, -0.064), (11, -0.048), (12, 0.078), (13, -0.073), (14, -0.075), (15, -0.039), (16, 0.078), (17, 0.031), (18, -0.063), (19, -0.01), (20, -0.088), (21, -0.103), (22, -0.059), (23, 0.028), (24, 0.112), (25, -0.316), (26, -0.072), (27, 0.226), (28, -0.02), (29, 0.104), (30, -0.1), (31, 0.048), (32, -0.13), (33, -0.073), (34, 0.079), (35, 0.031), (36, 0.033), (37, 0.094), (38, -0.073), (39, -0.088), (40, 0.026), (41, 0.013), (42, 0.155), (43, -0.113), (44, -0.055), (45, -0.078), (46, -0.103), (47, 0.106), (48, -0.167), (49, -0.146)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96606529 84 jmlr-2006-Stability Properties of Empirical Risk Minimization over Donsker Classes

Author: Andrea Caponnetto, Alexander Rakhlin

Abstract: We study some stability properties of algorithms which minimize (or almost-minimize) empirical error over Donsker classes of functions. We show that, as the number n of samples grows, the L2 diameter of the set of almost-minimizers of empirical error with tolerance ξ(n) = o(n−1/2) converges to zero in probability. Hence, even in the case of multiple minimizers of expected error, as n increases it becomes less and less likely that adding a sample (or a number of samples) to the training set will result in a large jump to a new hypothesis. Moreover, under some assumptions on the entropy of the class, along with an assumption of Komlos-Major-Tusnady type, we derive a power rate of decay for the diameter of almost-minimizers. This rate, through an application of a uniform ratio limit inequality, is shown to govern the closeness of the expected errors of the almost-minimizers. In fact, under the above assumptions, the expected errors of almost-minimizers become closer with a rate strictly faster than n−1/2. Keywords: empirical risk minimization, empirical processes, stability, Donsker classes

2 0.47937796 24 jmlr-2006-Consistency of Multiclass Empirical Risk Minimization Methods Based on Convex Loss

Author: Di-Rong Chen, Tao Sun

Abstract: The consistency of classification algorithm plays a central role in statistical learning theory. A consistent algorithm guarantees us that taking more samples essentially suffices to roughly reconstruct the unknown distribution. We consider the consistency of ERM scheme over classes of combinations of very simple rules (base classifiers) in multiclass classification. Our approach is, under some mild conditions, to establish a quantitative relationship between classification errors and convex risks. In comparison with the related previous work, the feature of our result is that the conditions are mainly expressed in terms of the differences between some values of the convex function. Keywords: multiclass classification, classifier, consistency, empirical risk minimization, constrained comparison method, Tsybakov noise condition

3 0.40021992 48 jmlr-2006-Learning Minimum Volume Sets

Author: Clayton D. Scott, Robert D. Nowak

Abstract: Given a probability measure P and a reference measure µ, one is often interested in the minimum µ-measure set with P-measure at least α. Minimum volume sets of this type summarize the regions of greatest probability mass of P, and are useful for detecting anomalies and constructing confidence regions. This paper addresses the problem of estimating minimum volume sets based on independent samples distributed according to P. Other than these samples, no other information is available regarding P, but the reference measure µ is assumed to be known. We introduce rules for estimating minimum volume sets that parallel the empirical risk minimization and structural risk minimization principles in classification. As in classification, we show that the performances of our estimators are controlled by the rate of uniform convergence of empirical to true probabilities over the class from which the estimator is drawn. Thus we obtain finite sample size performance bounds in terms of VC dimension and related quantities. We also demonstrate strong universal consistency, an oracle inequality, and rates of convergence. The proposed estimators are illustrated with histogram and decision tree set estimation rules. Keywords: minimum volume sets, anomaly detection, statistical learning theory, uniform deviation bounds, sample complexity, universal consistency

4 0.34986049 81 jmlr-2006-Some Discriminant-Based PAC Algorithms

Author: Paul W. Goldberg

Abstract: A classical approach in multi-class pattern classification is the following. Estimate the probability distributions that generated the observations for each label class, and then label new instances by applying the Bayes classifier to the estimated distributions. That approach provides more useful information than just a class label; it also provides estimates of the conditional distribution of class labels, in situations where there is class overlap. We would like to know whether it is harder to build accurate classifiers via this approach, than by techniques that may process all data with distinct labels together. In this paper we make that question precise by considering it in the context of PAC learnability. We propose two restrictions on the PAC learning framework that are intended to correspond with the above approach, and consider their relationship with standard PAC learning. Our main restriction of interest leads to some interesting algorithms that show that the restriction is not stronger (more restrictive) than various other well-known restrictions on PAC learning. An alternative slightly milder restriction turns out to be almost equivalent to unrestricted PAC learning. Keywords: computational learning theory, computational complexity, pattern classification

5 0.32556066 73 jmlr-2006-Pattern Recognition for Conditionally Independent Data

Author: Daniil Ryabko

Abstract: In this work we consider the task of relaxing the i.i.d. assumption in pattern recognition (or classification), aiming to make existing learning algorithms applicable to a wider range of tasks. Pattern recognition is guessing a discrete label of some object based on a set of given examples (pairs of objects and labels). We consider the case of deterministically defined labels. Traditionally, this task is studied under the assumption that examples are independent and identically distributed. However, it turns out that many results of pattern recognition theory carry over a weaker assumption. Namely, under the assumption of conditional independence and identical distribution of objects, while the only assumption on the distribution of labels is that the rate of occurrence of each label should be above some positive threshold. We find a broad class of learning algorithms for which estimations of the probability of the classification error achieved under the classical i.i.d. assumption can be generalized to the similar estimates for case of conditionally i.i.d. examples.

6 0.32411748 40 jmlr-2006-Infinite-σ Limits For Tikhonov Regularization

7 0.25242707 82 jmlr-2006-Some Theory for Generalized Boosting Algorithms

8 0.24602769 58 jmlr-2006-Lower Bounds and Aggregation in Density Estimation

9 0.24099961 15 jmlr-2006-Bayesian Network Learning with Parameter Constraints     (Special Topic on Machine Learning and Optimization)

10 0.21028475 35 jmlr-2006-Geometric Variance Reduction in Markov Chains: Application to Value Function and Gradient Estimation

11 0.20945108 23 jmlr-2006-Consistency and Convergence Rates of One-Class SVMs and Related Algorithms

12 0.2039808 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification

13 0.19506797 13 jmlr-2006-Adaptive Prototype Learning Algorithms: Theoretical and Experimental Studies

14 0.18717915 16 jmlr-2006-Bounds for Linear Multi-Task Learning

15 0.1714523 26 jmlr-2006-Efficient Learning of Label Ranking by Soft Projections onto Polyhedra     (Special Topic on Machine Learning and Optimization)

16 0.17077914 79 jmlr-2006-Second Order Cone Programming Approaches for Handling Missing and Uncertain Data     (Special Topic on Machine Learning and Optimization)

17 0.16325051 9 jmlr-2006-Accurate Error Bounds for the Eigenvalues of the Kernel Matrix

18 0.16055186 72 jmlr-2006-Parallel Software for Training Large Scale Support Vector Machines on Multiprocessor Systems     (Special Topic on Machine Learning and Optimization)

19 0.15515032 28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting

20 0.15502231 2 jmlr-2006-A Graphical Representation of Equivalence Classes of AMP Chain Graphs


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(8, 0.015), (36, 0.036), (45, 0.015), (50, 0.07), (63, 0.018), (76, 0.011), (81, 0.649), (84, 0.015), (90, 0.022), (91, 0.019), (96, 0.034)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.9287371 80 jmlr-2006-Segmental Hidden Markov Models with Random Effects for Waveform Modeling

Author: Seyoung Kim, Padhraic Smyth

Abstract: This paper proposes a general probabilistic framework for shape-based modeling and classification of waveform data. A segmental hidden Markov model (HMM) is used to characterize waveform shape and shape variation is captured by adding random effects to the segmental model. The resulting probabilistic framework provides a basis for learning of waveform models from data as well as parsing and recognition of new waveforms. Expectation-maximization (EM) algorithms are derived and investigated for fitting such models to data. In particular, the “expectation conditional maximization either” (ECME) algorithm is shown to provide significantly faster convergence than a standard EM procedure. Experimental results on two real-world data sets demonstrate that the proposed approach leads to improved accuracy in classification and segmentation when compared to alternatives such as Euclidean distance matching, dynamic time warping, and segmental HMMs without random effects. Keywords: waveform recognition, random effects, segmental hidden Markov models, EM algorithm, ECME

same-paper 2 0.91999322 84 jmlr-2006-Stability Properties of Empirical Risk Minimization over Donsker Classes

Author: Andrea Caponnetto, Alexander Rakhlin

Abstract: We study some stability properties of algorithms which minimize (or almost-minimize) empirical error over Donsker classes of functions. We show that, as the number n of samples grows, the L2 diameter of the set of almost-minimizers of empirical error with tolerance ξ(n) = o(n−1/2) converges to zero in probability. Hence, even in the case of multiple minimizers of expected error, as n increases it becomes less and less likely that adding a sample (or a number of samples) to the training set will result in a large jump to a new hypothesis. Moreover, under some assumptions on the entropy of the class, along with an assumption of Komlos-Major-Tusnady type, we derive a power rate of decay for the diameter of almost-minimizers. This rate, through an application of a uniform ratio limit inequality, is shown to govern the closeness of the expected errors of the almost-minimizers. In fact, under the above assumptions, the expected errors of almost-minimizers become closer with a rate strictly faster than n−1/2. Keywords: empirical risk minimization, empirical processes, stability, Donsker classes

3 0.80983448 22 jmlr-2006-Considering Cost Asymmetry in Learning Classifiers

Author: Francis R. Bach, David Heckerman, Eric Horvitz

Abstract: Receiver Operating Characteristic (ROC) curves are a standard way to display the performance of a set of binary classifiers for all feasible ratios of the costs associated with false positives and false negatives. For linear classifiers, the set of classifiers is typically obtained by training once, holding constant the estimated slope and then varying the intercept to obtain a parameterized set of classifiers whose performances can be plotted in the ROC plane. We consider the alternative of varying the asymmetry of the cost function used for training. We show that the ROC curve obtained by varying both the intercept and the asymmetry, and hence the slope, always outperforms the ROC curve obtained by varying only the intercept. In addition, we present a path-following algorithm for the support vector machine (SVM) that can compute efficiently the entire ROC curve, and that has the same computational complexity as training a single classifier. Finally, we provide a theoretical analysis of the relationship between the asymmetric cost model assumed when training a classifier and the cost model assumed in applying the classifier. In particular, we show that the mismatch between the step function used for testing and its convex upper bounds, usually used for training, leads to a provable and quantifiable difference around extreme asymmetries. Keywords: support vector machines, receiver operating characteristic (ROC) analysis, linear classification

4 0.58156234 41 jmlr-2006-Kernel-Based Learning of Hierarchical Multilabel Classification Models     (Special Topic on Machine Learning and Optimization)

Author: Juho Rousu, Craig Saunders, Sandor Szedmak, John Shawe-Taylor

Abstract: We present a kernel-based algorithm for hierarchical text classification where the documents are allowed to belong to more than one category at a time. The classification model is a variant of the Maximum Margin Markov Network framework, where the classification hierarchy is represented as a Markov tree equipped with an exponential family defined on the edges. We present an efficient optimization algorithm based on incremental conditional gradient ascent in single-example subspaces spanned by the marginal dual variables. The optimization is facilitated with a dynamic programming based algorithm that computes best update directions in the feasible set. Experiments show that the algorithm can feasibly optimize training sets of thousands of examples and classification hierarchies consisting of hundreds of nodes. Training of the full hierarchical model is as efficient as training independent SVM-light classifiers for each node. The algorithm’s predictive accuracy was found to be competitive with other recently introduced hierarchical multicategory or multilabel classification learning algorithms. Keywords: kernel methods, hierarchical classification, text categorization, convex optimization, structured outputs

5 0.54228634 66 jmlr-2006-On Model Selection Consistency of Lasso

Author: Peng Zhao, Bin Yu

Abstract: Sparsity or parsimony of statistical models is crucial for their proper interpretations, as in sciences and social sciences. Model selection is a commonly used method to find such models, but usually involves a computationally heavy combinatorial search. Lasso (Tibshirani, 1996) is now being used as a computationally feasible alternative to model selection. Therefore it is important to study Lasso for model selection purposes. In this paper, we prove that a single condition, which we call the Irrepresentable Condition, is almost necessary and sufficient for Lasso to select the true model both in the classical fixed p setting and in the large p setting as the sample size n gets large. Based on these results, sufficient conditions that are verifiable in practice are given to relate to previous works and help applications of Lasso for feature selection and sparse representation. This Irrepresentable Condition, which depends mainly on the covariance of the predictor variables, states that Lasso selects the true model consistently if and (almost) only if the predictors that are not in the true model are “irrepresentable” (in a sense to be clarified) by predictors that are in the true model. Furthermore, simulations are carried out to provide insights and understanding of this result. Keywords: Lasso, regularization, sparsity, model selection, consistency

6 0.43526238 58 jmlr-2006-Lower Bounds and Aggregation in Density Estimation

7 0.43177262 82 jmlr-2006-Some Theory for Generalized Boosting Algorithms

8 0.42259955 73 jmlr-2006-Pattern Recognition for Conditionally Independent Data

9 0.38824084 87 jmlr-2006-Stochastic Complexities of Gaussian Mixtures in Variational Bayesian Approximation

10 0.38541672 34 jmlr-2006-Generalized Bradley-Terry Models and Multi-Class Probability Estimates

11 0.37945265 46 jmlr-2006-Learning Factor Graphs in Polynomial Time and Sample Complexity

12 0.37435281 24 jmlr-2006-Consistency of Multiclass Empirical Risk Minimization Methods Based on Convex Loss

13 0.37107554 28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting

14 0.3516494 9 jmlr-2006-Accurate Error Bounds for the Eigenvalues of the Kernel Matrix

15 0.34730294 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification

16 0.34687307 49 jmlr-2006-Learning Parts-Based Representations of Data

17 0.34417564 10 jmlr-2006-Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems

18 0.34390366 37 jmlr-2006-Incremental Algorithms for Hierarchical Classification

19 0.34315765 52 jmlr-2006-Learning Spectral Clustering, With Application To Speech Separation

20 0.32941914 16 jmlr-2006-Bounds for Linear Multi-Task Learning