jmlr jmlr2006 jmlr2006-24 knowledge-graph by maker-knowledge-mining

24 jmlr-2006-Consistency of Multiclass Empirical Risk Minimization Methods Based on Convex Loss


Source: pdf

Author: Di-Rong Chen, Tao Sun

Abstract: The consistency of classification algorithms plays a central role in statistical learning theory. A consistent algorithm guarantees us that taking more samples essentially suffices to roughly reconstruct the unknown distribution. We consider the consistency of the ERM scheme over classes of combinations of very simple rules (base classifiers) in multiclass classification. Our approach is, under some mild conditions, to establish a quantitative relationship between classification errors and convex risks. In comparison with related previous work, a feature of our result is that the conditions are mainly expressed in terms of the differences between some values of the convex function. Keywords: multiclass classification, classifier, consistency, empirical risk minimization, constrained comparison method, Tsybakov noise condition

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 The consistency of classification algorithms plays a central role in statistical learning theory. [sent-10, score-0.082]

2 A consistent algorithm guarantees us that taking more samples essentially suffices to roughly reconstruct the unknown distribution. [sent-11, score-0.039]

3 We consider the consistency of the ERM scheme over classes of combinations of very simple rules (base classifiers) in multiclass classification. [sent-12, score-0.179]

4 Our approach is, under some mild conditions, to establish a quantitative relationship between classification errors and convex risks. [sent-13, score-0.122]

5 In comparison with related previous work, a feature of our result is that the conditions are mainly expressed in terms of the differences between some values of the convex function. [sent-14, score-0.04]

6 Keywords: multiclass classification, classifier, consistency, empirical risk minimization, constrained comparison method, Tsybakov noise condition [sent-15, score-0.206]

7 Introduction. We consider the consistency of the empirical risk minimization (ERM) algorithm in multiclass classification. [sent-16, score-0.236]

8 For this purpose, a very successful method used in binary classification is to minimize a risk based on a convex loss φ. [sent-39, score-0.118]

9 Main examples of φ include the exponential loss φ(x) = e^{−x} used in AdaBoost, the logit loss φ(x) = ln(1 + e^{−x}), and the hinge loss φ(x) = (1 − x)_+ used in support vector machines, where (u)_+ = max{0, u} for a number u ∈ R. [sent-40, score-0.063]
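
For concreteness, here is a minimal sketch (ours, not from the paper) of these three losses as Python functions; all three are convex and decreasing, which is what the results below assume of φ.

```python
import numpy as np

def exp_loss(x):
    # Exponential loss phi(x) = e^{-x}, as in AdaBoost.
    return np.exp(-x)

def logit_loss(x):
    # Logit loss phi(x) = ln(1 + e^{-x}).
    return np.log1p(np.exp(-x))

def hinge_loss(x):
    # Hinge loss phi(x) = (1 - x)_+, as in support vector machines.
    return np.maximum(0.0, 1.0 - x)

print(exp_loss(0.0), logit_loss(0.0), hinge_loss(0.0))  # 1.0, ln 2 ~ 0.693, 1.0
```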

10 Probably because one can solve a multiclass classification problem (K > 2) by solving several binary classification problems, there are far fewer studies of multiclass classification algorithms based directly on minimizing an empirical risk with a convex loss. [sent-41, score-0.276]

11 Recently, Zhang (2004b) proposed a natural version of the ERM scheme for solving a multiclass problem: f̂ = arg min_{f∈F} (1/n) Σ_{i=1}^n Ψ_{Y_i}(f(X_i)), (1) where Ψ_c is a mapping from R^K to R, usually constructed from some convex loss function φ. [sent-42, score-0.187]

12 In the following, we use bold symbols such as f and q to denote vectors, and f_c and q_c to denote their c-th components. [sent-43, score-0.733]

13 Once f̂ is obtained, we have a classifier C(f̂), where C(f) is defined by C(f)(x) = arg max_c f_c(x), for all f = (f_1(x), ..., f_K(x)). [sent-45, score-0.335]
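
A hedged code sketch of scheme (1) for the constrained comparison method: Ψ_c(f) = Σ_{k≠c} φ(−f_k), the empirical risk it defines, and the argmax classifier C(f). The function names are ours, and the snippet only evaluates the objective; it does not perform the minimization over F.

```python
import numpy as np

def psi(c, f, phi):
    # Psi_c(f) = sum over k != c of phi(-f_k), for a score vector f in R^K.
    mask = np.ones(len(f), dtype=bool)
    mask[c] = False
    return phi(-f[mask]).sum()

def empirical_risk(f_of_x, X, Y, phi):
    # (1/n) sum_i Psi_{Y_i}(f(X_i)): the objective minimized over F in (1).
    return np.mean([psi(y, f_of_x(x), phi) for x, y in zip(X, Y)])

def classify(f_of_x, x):
    # C(f)(x) = argmax_c f_c(x).
    return int(np.argmax(f_of_x(x)))
```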

14 A very desirable property is the consistency of the algorithm: the excess error R(C(f̂)) − R* → 0 in some sense, as the sample size n increases to ∞. [sent-50, score-0.218]

15 A consistent algorithm guarantees us that taking more samples essentially suffices to roughly reconstruct the unknown distribution. [sent-51, score-0.039]

16 In different versions of boosting, bagging and arcing algorithms, the output classifiers are constructed by weighted voting schemes. [sent-74, score-0.046]

17 Their consistency is established in Lugosi and Vayatis (2004) under the assumption that the Bayes classifier can be approximated by F_λ and H has a finite VC dimension. [sent-75, score-0.117]

18 (2005) is extended to the multiclass classification problem and is used to characterize the above implication in Tewari and Bartlett (2005). [sent-84, score-0.123]

19 Such an implication is also established under the so-called infinite-sample-consistency (ISC) condition on {Ψ_c}_c (see Zhang, 2004b). [sent-85, score-0.069]

20 Moreover, a quantitative relation between the excess error and the excess {Ψ_c}-risk is obtained for the One-versus-All method in Zhang (2004b). [sent-86, score-0.26]

21 In this paper we consider the constrained comparison method in the multiclass classification problem. [sent-87, score-0.112]

22 One of our goals is to generalize the results of consistency for weighted voting schemes in Lugosi and Vayatis (2004) to the multiclass case. [sent-88, score-0.249]

23 We first establish an inequality relating the excess error and the excess {Ψ_c}-risk. [sent-89, score-0.312]

24 In Section 2, we upper bound the excess error by the excess {Ψ_c}-risk under some mild conditions. [sent-92, score-0.251]

25 On the other hand, the sufficient conditions ensuring the quantitative relationships, even in the case K = 2, were previously expressed in terms of the infimum inf_f E(Ψ_Y(f(X)) | X = x). [sent-94, score-0.102]

26 In Section 3, we apply the results of Section 2 to establish a consistency result in the multiclass case, similar to that of Lugosi and Vayatis (2004). [sent-95, score-0.218]

27 Bounding Classification Error by Convexity. In this section, we upper bound, under some conditions on the convex loss φ, the excess classification error R(C(f)) − R* by the excess {Ψ_c}-risk E(f) − E* for the constrained comparison method. [sent-97, score-0.31]

28 Let q(x) = (q_c(x))_{c=1}^K, where q_c(x) = P{Y = c | X = x}. [sent-103, score-0.733]

29 Ψ_c(f) = Σ_{k=1, k≠c}^K φ(−f_k), f ∈ Ω := {f ∈ R^K : Σ_{c=1}^K f_c = 0}. [sent-106, score-0.36]

30 Then the risk E(f) may be expressed as E(f) = E_X W(q(X), f(X)), (2) with W(q, f) = Σ_{c=1}^K (1 − q_c) φ(−f_c). [sent-107, score-1.081]

31 Note that we use q_c to denote the c-th component of a K-dimensional vector q ∈ Λ_K, where Λ_K is the set of possible conditional probability vectors: Λ_K = {q ∈ R^K : Σ_{c=1}^K q_c = 1, q_c ≥ 0}. [sent-108, score-2.199]
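
An illustrative numerical check (ours) of the representation (2): W(q, f) = Σ_c (1 − q_c) φ(−f_c) for q in the simplex Λ_K and f in the sum-zero class Ω, using the exponential loss as an example.

```python
import numpy as np

def W(q, f, phi):
    assert np.isclose(q.sum(), 1.0) and np.all(q >= 0)  # q in Lambda_K
    assert np.isclose(f.sum(), 0.0)                     # f in Omega
    return float(np.sum((1.0 - q) * phi(-f)))

q = np.array([0.5, 0.3, 0.2])
f = np.array([1.0, 0.0, -1.0])
print(W(q, f, lambda x: np.exp(-x)))  # 0.5*e + 0.7 + 0.8/e, about 2.35
```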

32 Lemma 2.1 Assume that φ is a decreasing and convex function on R. [sent-114, score-0.063]

33 Suppose that q ∈ Λ_K and f ∈ B_Ω satisfy that there are i, j such that q_i < q_j and f_j < f_i. [sent-116, score-0.252]

34 Then W(q, f̃) ≤ W(q, f), where f̃ = (f̃_1, · · · , f̃_K) is given by f̃_i = f̃_j = (f_i + f_j)/2, and f̃_c = f_c, c ≠ i, j. [sent-117, score-0.612]

35 Lemma 2.2 Assume that φ is a decreasing and convex function on R. [sent-124, score-0.063]

36 Suppose that there exist positive constants k > 0 and α ≥ 1 such that for any q ∈ Λ_K, k(q_j − q_i)^α ≤ W*(q̃) − W*(q), (3) where j = arg max_c q_c and q_i < q_j, and q̃ = (q̃_1, · · · , q̃_K) is given by q̃_i = q̃_j = (q_i + q_j)/2, and q̃_c = q_c, c ≠ i, j. [sent-125, score-3.064]

37 Then for any f ∈ B_Ω, R(C(f)) − R* ≤ ((E(f) − E*)/k)^{1/α}. [sent-126, score-0.252]

38 Recall that q_c(x) is the conditional probability P{Y = c | X = x}. [sent-128, score-0.733]

39 Therefore by (3), k(q_j(x) − q_i(x))^α ≤ W(q(x), f(x)) − W*(q(x)). [sent-139, score-0.252]

40 φ is a differentiable, convex and decreasing function on R such that lim_{x→+∞} φ(x) = 0 and lim_{x→−∞} φ(x) = +∞. [sent-147, score-0.113]

41 For any q = (q_c)_{c=1}^K ∈ Λ_K with all q_c < 1, c ∈ {1, ..., K}, [sent-149, score-0.733]

42 there is a minimizer f* = (f_c*)_{c=1}^K of W(q, ·) over Ω. [sent-152, score-0.342]

43 For any q = (q_c)_{c=1}^K, let j = arg max_c q_c and i ∈ {1, ..., K} with q_i < q_j. [sent-160, score-0.842]

44 We introduce q^t = (q^t_c)_{c=1}^K ∈ Λ_K for 0 ≤ t ≤ (q_j − q_i)/2 as follows. [sent-164, score-0.113]

45 q^t_i = q_i + t, q^t_j = q_j − t, and q^t_c = q_c, c ≠ i, j. [sent-165, score-1.08]

46 Clearly, q^t_c < 1 for 0 < t < (q_j − q_i)/2 and any 1 ≤ c ≤ K. [sent-166, score-0.095]

47 Therefore, for any t, there is an f^{t,*} = (f^{t,*}_c)_{c=1}^K minimizing W(q^t, f), that is, W*(q^t) = W(q^t, f^{t,*}). [sent-167, score-0.306]

48 Zhang (2004b) proves that the excess error R(C(f)) − R* is small whenever the excess {Ψ_c}-risk E(f) − E* is small. [sent-169, score-0.234]

49 We proceed to establish an inequality between the above two quantities. [sent-171, score-0.078]

50 Suppose that there exist positive constants k_1 > 0 and β ≥ 0 such that for any q ∈ Λ_K, k_1(q_j − q_i − 2t)^β ≤ φ(−f^{t,*}_j) − φ(−f^{t,*}_i), 0 < t < (q_j − q_i)/2, (6) whenever j = arg max_c q_c, q_i < q_j and f^{t,*} = (f^{t,*}_c)_{c=1}^K is a minimizer of W(q^t, f). [sent-179, score-1.94]

51 We establish condition (3) with α = β + 1 and k = k_1/(2(β + 1)). [sent-182, score-0.065]

52 As above, let f^{t,*} = (f^{t,*}_c)_{c=1}^K be the minimizer of W(q^t, f). [sent-183, score-0.342]

53 The first-order optimality condition is the set of equations (1 − q^t_c) φ′(−f^{t,*}_c) = µ, c = 1, ..., K. [sent-184, score-0.427]

54 Moreover, the constraint Σ_{c=1}^K f^{t,*}_c = 0 (for all t ∈ (0, (q_j − q_i)/2)) yields Σ_{c=1}^K d f^{t,*}_c / dt = 0. [sent-193, score-0.645]

55 Consequently, dW*(q^t)/dt = φ(−f^{t,*}_j) − φ(−f^{t,*}_i) − Σ_{c=1}^K (1 − q^t_c) φ′(−f^{t,*}_c) (d f^{t,*}_c / dt) = φ(−f^{t,*}_j) − φ(−f^{t,*}_i). [sent-194, score-0.773]

56 Therefore, we have by (6): dW*(q^t)/dt ≥ k_1(q_j − q_i − 2t)^β, 0 < t < (q_j − q_i)/2. [sent-195, score-0.537]
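
The step from (6) to (3) is an integration over t; a worked version of that step (our reconstruction, consistent with α = β + 1 and k = k_1/(2(β + 1)) above) reads:

```latex
W^*(\tilde q) - W^*(q)
  = \int_0^{(q_j - q_i)/2} \frac{dW^*(q^t)}{dt}\, dt
  \;\ge\; k_1 \int_0^{(q_j - q_i)/2} (q_j - q_i - 2t)^{\beta}\, dt
  \;=\; \frac{k_1}{2(\beta + 1)}\,(q_j - q_i)^{\beta + 1},
```

so condition (3) holds with α = β + 1 and k = k_1/(2(β + 1)), using that q^t equals q at t = 0 and q̃ at t = (q_j − q_i)/2.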

57 For q = (q_c)_{c=1}^K ∈ Λ_K with all q_c < 1, the unique minimizer f* = (f_c*)_{c=1}^K is determined by (1 − q_c) exp(f_c*) = µ, c = 1, · · · , K, with µ the Lagrange multiplier. [sent-205, score-1.075]

58 By Σ_c f_c* = 0 we have µ = (Π_{c=1}^K (1 − q_c))^{1/K}. [sent-208, score-2.078]

59 Therefore φ(−f_k*) = (Π_{c=1}^K (1 − q_c))^{1/K} / (1 − q_k), k = 1, ..., K. [sent-209, score-0.844]

60 We apply the above equality and obtain φ(−f^{t,*}_j) − φ(−f^{t,*}_i) = (Π_{c≠i,j} (1 − q_c))^{1/K} ((1 − q_j + t)(1 − q_i − t))^{−(K−1)/K} (q_j − q_i − 2t). [sent-218, score-1.237]
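
A numerical sanity check (ours, not from the paper) of this closed form for the exponential loss: f_k* = (1/K) Σ_c ln(1 − q_c) − ln(1 − q_k), which satisfies both the sum-zero constraint and the first-order condition (1 − q_c) e^{f_c*} = µ.

```python
import numpy as np

q = np.array([0.4, 0.3, 0.2, 0.1])             # a point of Lambda_K with all q_c < 1
f_star = np.log(1 - q).mean() - np.log(1 - q)  # closed-form minimizer

print(np.isclose(f_star.sum(), 0.0))           # True: sum_c f_c* = 0
mu = (1 - q) * np.exp(f_star)                  # (1 - q_c) e^{f_c*}
print(np.allclose(mu, mu[0]))                  # True: the same mu for every c
print(np.isclose(mu[0], np.prod(1 - q) ** (1 / len(q))))  # True: mu = (prod (1-q_c))^{1/K}
```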

61 If K = 2, Π_{c≠i,j} (1 − q_c) is understood as 1. [sent-219, score-0.733]

62 Therefore Π_{c≠i,j} (1 − q_c) ≥ Π_{c=2}^{K−1} (1 − 1/c) = 1/(K − 1). On the other hand, (1 − q_j + t)(1 − q_i − t) ≤ (1 − (q_i + q_j)/2)^2 ≤ 1 for 0 ≤ t ≤ (q_j − q_i)/2. Consequently, φ(−f^{t,*}_j) − φ(−f^{t,*}_i) ≥ (K − 1)^{−1/K} (q_j − q_i − 2t). [sent-224, score-1.741]

63 This is (6) with β = 1 and k_1 = (K − 1)^{−1/K}. The resulting risk is just the one used in the p-norm support vector machine (SVM). [sent-232, score-0.042]

64 Chen and Xiang (2004) have established the inequality for p = 1: R(C(f)) − R* ≤ E(f) − E*. [sent-233, score-0.056]

65 For q = (q_c)_{c=1}^K with all q_c < 1, by the method of Lagrange multipliers we conclude that the minimizer f* = (f_c*)_{c=1}^K satisfies −f_c* < 1/(K − 1), c ∈ {1, ..., K}. [sent-238, score-1.381]

66 Moreover, we have φ(−f_k*) = (K/(K − 1))^2 / [(1 − q_k)^2 Σ_{c=1}^K 1/(1 − q_c)^2], k = 1, ..., K. [sent-244, score-0.111]

67 An application of the above equality to q^t and f^{t,*} yields φ(−f^{t,*}_j) − φ(−f^{t,*}_i) = (K/(K − 1))^2 (q_j − q_i − 2t)(2 − q_i − q_j) / [(1 − q_i − t)^2 (1 − q_j + t)^2 (1/(1 − q_i − t)^2 + 1/(1 − q_j + t)^2 + Σ_{c≠i,j} 1/(1 − q_c)^2)], where Σ_{c≠i,j} 1/(1 − q_c)^2 is understood as 0 for K = 2. [sent-253, score-0.869]

68 It is easily seen that (1 − q_i − t)^2 (1 − q_j + t)^2 [1/(1 − q_i − t)^2 + 1/(1 − q_j + t)^2] ≤ 2(1 − q_i − t)^2 ≤ 2, and that (1 − q_j + t)^2 ≤ (1 − (q_j + q_i)/2)^2 ≤ ((2K − 1)/(2K))^2 for all t ∈ [0, (q_j − q_i)/2], where the last inequality holds by 1/K ≤ q_j. [sent-254, score-1.047]

69 As before, we again arrange q_c, c ≠ i, j, in decreasing order, so that they are not larger than 1/2, ..., 1/(K − 1), respectively. [sent-256, score-0.756]

70 It follows that, for 0 ≤ t ≤ (q_j − q_i)/2, (1 − q_i − t)^2 (1 − q_j + t)^2 Σ_{c≠i,j} 1/(1 − q_c)^2 ≤ 4(K − 2)((2K − 1)/(2K))^2. Therefore, the condition (6) holds with β = 1 and a constant k_1 depending only on K. [sent-260, score-1.011]

71 We point out that −f_c* < 1/(K − 1) for any q = (q_c)_{c=1}^K with all q_c < 1, c = 1, ..., K. [sent-269, score-0.733]

72 Moreover, by simple computation, φ(−f_k*) = (K/(K − 1))^{p/(p−1)} / Σ_{c=1}^K ((1 − q_k)/(1 − q_c))^{p/(p−1)}, k = 1, ..., K. [sent-274, score-0.054]
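
As a quick consistency check of this reconstruction (ours): setting p = 2 gives p/(p − 1) = 2, and the formula reduces to the earlier one,

```latex
\varphi(-f_k^*)
  = \Big(\tfrac{K}{K-1}\Big)^{2}
    \Big[\textstyle\sum_{c=1}^K \big(\tfrac{1-q_k}{1-q_c}\big)^{2}\Big]^{-1}
  = \Big(\tfrac{K}{K-1}\Big)^{2}
    \frac{1}{(1-q_k)^2 \sum_{c=1}^K (1-q_c)^{-2}}\,.
```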

73 For any x ∈ X, let m(x) = q_{φ_B(x)}(x) − max{q_i(x) : q_i(x) < q_{φ_B(x)}(x), i = 1, ..., K} [sent-281, score-0.252]

74 if the set {q_i(x) : q_i(x) < q_{φ_B(x)}(x), i = 1, ..., K} is not empty. [sent-284, score-0.252]

75 We say that P satisfies the Tsybakov noise condition with exponent s if there is a constant c such that P{X ∈ X : 0 < m(X) < t} ≤ c t^{s/(1−s)}, 0 < t ≤ 1. [sent-290, score-0.105]
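
One simple instance (our illustration, not from the paper): if the margin is bounded away from zero, m(X) ≥ m_0 > 0 almost surely, then

```latex
P\{X \in \mathcal{X} : 0 < m(X) < t\} = 0 \quad (t \le m_0),
\qquad
P\{X \in \mathcal{X} : 0 < m(X) < t\} \le 1 \le m_0^{-s/(1-s)}\, t^{s/(1-s)} \quad (t > m_0),
```

so the condition holds for every exponent s ∈ [0, 1) with c = m_0^{−s/(1−s)}.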

76 As in binary classification (see Bartlett and Mendelson, 2002), the Tsybakov noise condition with exponent s implies that there is a constant c such that, for any f ∈ B_Ω, P{x : x ∈ X, q_{φ_B(x)}(x) ≠ q_{C(f)(x)}(x)} ≤ c(R(C(f)) − R*)^s. [sent-291, score-0.105]

77 In fact, the Tsybakov noise condition and (4) tell us R(C(f)) − R* ≥ ∫_X (q_{φ_B(x)}(x) − q_{C(f)(x)}(x)) I_{t≤m(x)} dρ_X ≥ t P{x : x ∈ X, q_{φ_B(x)}(x) ≠ q_{C(f)(x)}(x)} − c t^{1/(1−s)}. Minimizing the last term over t establishes (7). [sent-292, score-0.052]

78 Suppose that P satisfies the Tsybakov noise condition with exponent s. [sent-296, score-0.105]

79 Consequently, under the Tsybakov noise condition with exponent s and the conditions of Theorem 2. [sent-299, score-0.105]

80 On the other hand, ∫_{X_2} (q_{φ_B(x)}(x) − q_{C(f)(x)}(x)) dρ_X ≤ t^{−α+1} ∫_X (q_{φ_B(x)}(x) − q_{C(f)(x)}(x))^α dρ_X ≤ (1/(k t^{α−1})) (E(f) − E*), where the last inequality follows from the proof of Lemma 2.2. [sent-304, score-0.06]

81 Minimizing the right-hand side of the above inequality over t ∈ (0, 1] yields the inequality (8) for some constant c_φ. [sent-307, score-0.078]

82 Consistency of Weighted Voting Schemes. In this section, we consider the consistency of weighted voting schemes using the results of Section 2. [sent-312, score-0.152]

83 Suppose that the set H of classifiers satisfies lim_{λ→∞} inf_{f∈F_λ} E(f) = E*. [sent-317, score-0.052]

84 Then for any n and λ > 0 we have E sup_{f∈F_λ} |E(f) − E_n(f)| ≤ 4K^2 λ |φ′(−λ(K − 1))| √(2V ln(4n + 2)/n), (9) where V = max_{1≤c≤K} V_{A_c}. [sent-332, score-0.137]

85 Also, for any δ > 0, with probability at least 1 − e^{−nδ^2/(2(K−1)^2 φ(−λK)^2)}, sup_{f∈F_λ} |E(f) − E_n(f)| ≤ 4K^2 λ |φ′(−λ(K − 1))| √(2V ln(4n + 2)/n) + δ. (10) [sent-333, score-0.137]

86 For c ∈ {1, ..., K}, E sup_{f∈F_1} (1/n) Σ_{i=1}^n σ_i (φ(−λ f_c(X_i)) − φ(0)) ≤ 2L E sup_{f∈F_1} (1/n) Σ_{i=1}^n σ_i f_c(X_i), and consequently, E sup_{f∈F_λ} (1/n) Σ_{i=1}^n σ_i (Ψ_{Y_i}(f(X_i)) − 1) ≤ 2L Σ_{c=1}^K E sup_{f∈F_1} (1/n) Σ_{i=1}^n σ_i f_c(X_i). (11) [sent-350, score-1.203]

87 Since any f_c = Σ_j α_j T_c(h_j) is a convex combination of T_c(h_j) with h_j ∈ H, it follows that sup_{f∈F_1} (1/n) Σ_{i=1}^n σ_i f_c(X_i) = sup_{h∈H} (1/n) Σ_{i=1}^n σ_i T_c(h(X_i)). [sent-351, score-0.536]

88 By a version of the Vapnik-Chervonenkis inequality we conclude E sup_{h∈H} (1/n) Σ_{i=1}^n σ_i T_c(h(X_i)) ≤ (K − 1) √(2V_{A_c} ln(4n + 2)/n), c = 1, ..., K. [sent-357, score-0.176]
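
A hedged Monte Carlo sketch (ours) of the quantity this inequality controls: the Rademacher average E sup_h (1/n) Σ_i σ_i h(X_i), here for a finite base class represented simply as a matrix of ±1 predictions on the sample.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 50
H = rng.choice([-1.0, 1.0], size=(m, n))  # row j holds the predictions h_j(X_1..X_n)

def rademacher_average(H, n_mc=2000):
    sups = []
    for _ in range(n_mc):
        sigma = rng.choice([-1.0, 1.0], size=H.shape[1])  # Rademacher signs
        sups.append(np.max(H @ sigma) / H.shape[1])       # sup_h (1/n) sum_i sigma_i h(X_i)
    return float(np.mean(sups))

print(rademacher_average(H))  # roughly sqrt(2 ln m / n) for a finite class of size m
```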

89 It is easily seen that the random variable sup_{f∈F_λ} |E(f) − E_n(f)| satisfies the bounded difference assumption with constants c_i = 2(K − 1)φ(−λK)/n, 1 ≤ i ≤ n. [sent-366, score-0.113]

90 Now inequality (10) follows from (9) and McDiarmid’s bounded difference inequality (see Lugosi, 2002; McDiarmid, 1989). [sent-367, score-0.078]

91 Choose λ_n such that λ_n → ∞ and λ_n φ′(−λ_n(K − 1)) √(ln n / n) → 0 as n → ∞. [sent-376, score-0.042]

92 Then we have the consistency lim_{n→∞} E R(C(f̂_n)) = R*. [sent-382, score-0.107]

93 By (13) we have E(f̂_n) − E(f_{λ_n}) = E(f̂_n) − E_n(f̂_n) + E_n(f̂_n) − E_n(f_{λ_n}) + E_n(f_{λ_n}) − E(f_{λ_n}) ≤ 2 sup_{f∈F_{λ_n}} |E(f) − E_n(f)| + ε_n. [sent-385, score-0.095]

94 Therefore, E E(f̂_n) ≤ 2 E sup_{f∈F_{λ_n}} |E(f) − E_n(f)| + E(f_{λ_n}) + ε_n. [sent-386, score-0.095]

95 In this case, we thus choose λ_n such that λ_n → ∞ and λ_n e^{λ_n(K−1)} √(ln n / n) → 0. [sent-396, score-0.042]
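
For the exponential loss, one concrete schedule meeting this condition (our example, under the stated condition) is λ_n = ln n / (4(K − 1)), since then

```latex
\lambda_n\, e^{\lambda_n (K-1)} \sqrt{\tfrac{\ln n}{n}}
  = \frac{\ln n}{4(K-1)}\; n^{1/4} \sqrt{\tfrac{\ln n}{n}}
  = \frac{(\ln n)^{3/2}}{4(K-1)}\; n^{-1/4} \;\longrightarrow\; 0,
```

while λ_n → ∞.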

96 If the set H has a finite VC dimension and, for any samples {(X_i, Y_i)}_{i=1}^n, (13) holds, then we have the consistency stated in Theorem 3.1. [sent-397, score-0.101]

97 Rademacher and Gaussian complexities: risk bounds and structural results. [sent-422, score-0.042]

98 Bartlett: On the consistency of multiclass classification methods, In Proc. [sent-474, score-0.179]

99 Statistical behavior and consistency of classification methods based on convex risk minimization. [sent-483, score-0.164]

100 Statistical analysis of some multiclass large margin classification methods. [sent-487, score-0.097]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('qc', 0.733), ('fc', 0.306), ('qi', 0.252), ('tc', 0.189), ('en', 0.125), ('excess', 0.117), ('ethods', 0.113), ('qt', 0.113), ('tsybakov', 0.098), ('multiclass', 0.097), ('hen', 0.096), ('isk', 0.096), ('ulticlass', 0.096), ('sup', 0.095), ('qtc', 0.095), ('fit', 0.086), ('inimization', 0.086), ('mpirical', 0.086), ('consistency', 0.082), ('maxc', 0.08), ('vayatis', 0.079), ('inff', 0.076), ('ft', 0.072), ('lugosi', 0.072), ('beijing', 0.064), ('classi', 0.058), ('qk', 0.057), ('china', 0.057), ('vac', 0.057), ('fk', 0.054), ('exponent', 0.053), ('bartlett', 0.05), ('un', 0.05), ('vc', 0.049), ('chen', 0.046), ('voting', 0.046), ('ee', 0.043), ('erm', 0.043), ('risk', 0.042), ('ln', 0.042), ('convex', 0.04), ('establish', 0.039), ('inequality', 0.039), ('zhang', 0.038), ('minimizer', 0.036), ('dt', 0.033), ('tao', 0.032), ('tewari', 0.032), ('aeronautics', 0.032), ('astronautics', 0.032), ('va', 0.032), ('arg', 0.029), ('dw', 0.029), ('devroye', 0.028), ('inf', 0.027), ('mcdiarmid', 0.026), ('noise', 0.026), ('quantitative', 0.026), ('implication', 0.026), ('condition', 0.026), ('er', 0.026), ('lim', 0.025), ('lemma', 0.025), ('schemes', 0.024), ('suppose', 0.023), ('ers', 0.023), ('decreasing', 0.023), ('xi', 0.022), ('sun', 0.022), ('satis', 0.021), ('kt', 0.021), ('mum', 0.021), ('integrating', 0.021), ('loss', 0.021), ('yi', 0.02), ('moreover', 0.02), ('reconstruct', 0.02), ('theorem', 0.019), ('samples', 0.019), ('cation', 0.018), ('consequently', 0.018), ('lipschitz', 0.018), ('assumption', 0.018), ('rk', 0.017), ('established', 0.017), ('mild', 0.017), ('cn', 0.017), ('reseach', 0.016), ('wien', 0.016), ('xiang', 0.016), ('constrained', 0.015), ('minimization', 0.015), ('base', 0.015), ('boosting', 0.014), ('hj', 0.014), ('ables', 0.014), ('varii', 0.014), ('boucheron', 0.014), ('ew', 0.014), ('ez', 0.014)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 24 jmlr-2006-Consistency of Multiclass Empirical Risk Minimization Methods Based on Convex Loss

Author: Di-Rong Chen, Tao Sun

Abstract: The consistency of classification algorithms plays a central role in statistical learning theory. A consistent algorithm guarantees us that taking more samples essentially suffices to roughly reconstruct the unknown distribution. We consider the consistency of the ERM scheme over classes of combinations of very simple rules (base classifiers) in multiclass classification. Our approach is, under some mild conditions, to establish a quantitative relationship between classification errors and convex risks. In comparison with related previous work, a feature of our result is that the conditions are mainly expressed in terms of the differences between some values of the convex function. Keywords: multiclass classification, classifier, consistency, empirical risk minimization, constrained comparison method, Tsybakov noise condition

2 0.15625776 34 jmlr-2006-Generalized Bradley-Terry Models and Multi-Class Probability Estimates

Author: Tzu-Kuo Huang, Ruby C. Weng, Chih-Jen Lin

Abstract: The Bradley-Terry model for obtaining individual skill from paired comparisons has been popular in many areas. In machine learning, this model is related to multi-class probability estimates by coupling all pairwise classification results. Error correcting output codes (ECOC) are a general framework to decompose a multi-class problem to several binary problems. To obtain probability estimates under this framework, this paper introduces a generalized Bradley-Terry model in which paired individual comparisons are extended to paired team comparisons. We propose a simple algorithm with convergence proofs to solve the model and obtain individual skill. Experiments on synthetic and real data demonstrate that the algorithm is useful for obtaining multi-class probability estimates. Moreover, we discuss four extensions of the proposed model: 1) weighted individual skill, 2) home-field advantage, 3) ties, and 4) comparisons with more than two teams. Keywords: Bradley-Terry model, probability estimates, error correcting output codes, support vector machines

3 0.10403949 40 jmlr-2006-Infinite-σ Limits For Tikhonov Regularization

Author: Ross A. Lippert, Ryan M. Rifkin

Abstract: We consider the problem of Tikhonov regularization with a general convex loss function: this formalism includes support vector machines and regularized least squares. For a family of kernels that includes the Gaussian, parameterized by a “bandwidth” parameter σ, we characterize the limiting solution as σ → ∞. In particular, we show that if we set the regularization parameter λ = λ̃ σ^{−2p}, the regularization term of the Tikhonov problem tends to an indicator function on polynomials of degree ⌊p⌋ (with residual regularization in the case where p ∈ Z). The proof rests on two key ideas: epi-convergence, a notion of functional convergence under which limits of minimizers converge to minimizers of limits, and a value-based formulation of learning, where we work with regularization on the function output values (y) as opposed to the function expansion coefficients in the RKHS. Our result generalizes and unifies previous results in this area. Keywords: Tikhonov regularization, Gaussian kernel, theory, kernel machines

4 0.091062523 84 jmlr-2006-Stability Properties of Empirical Risk Minimization over Donsker Classes

Author: Andrea Caponnetto, Alexander Rakhlin

Abstract: We study some stability properties of algorithms which minimize (or almost-minimize) empirical error over Donsker classes of functions. We show that, as the number n of samples grows, the L_2 diameter of the set of almost-minimizers of empirical error with tolerance ξ(n) = o(n^{−1/2}) converges to zero in probability. Hence, even in the case of multiple minimizers of expected error, as n increases it becomes less and less likely that adding a sample (or a number of samples) to the training set will result in a large jump to a new hypothesis. Moreover, under some assumptions on the entropy of the class, along with an assumption of Komlos-Major-Tusnady type, we derive a power rate of decay for the diameter of almost-minimizers. This rate, through an application of a uniform ratio limit inequality, is shown to govern the closeness of the expected errors of the almost-minimizers. In fact, under the above assumptions, the expected errors of almost-minimizers become closer with a rate strictly faster than n^{−1/2}. Keywords: empirical risk minimization, empirical processes, stability, Donsker classes

5 0.072596923 23 jmlr-2006-Consistency and Convergence Rates of One-Class SVMs and Related Algorithms

Author: Régis Vert, Jean-Philippe Vert

Abstract: We determine the asymptotic behaviour of the function computed by support vector machines (SVM) and related algorithms that minimize a regularized empirical convex loss function in the reproducing kernel Hilbert space of the Gaussian RBF kernel, in the situation where the number of examples tends to infinity, the bandwidth of the Gaussian kernel tends to 0, and the regularization parameter is held fixed. Non-asymptotic convergence bounds to this limit in the L2 sense are provided, together with upper bounds on the classification error that is shown to converge to the Bayes risk, therefore proving the Bayes-consistency of a variety of methods although the regularization term does not vanish. These results are particularly relevant to the one-class SVM, for which the regularization can not vanish by construction, and which is shown for the first time to be a consistent density level set estimator. Keywords: regularization, Gaussian kernel RKHS, one-class SVM, convex loss functions, kernel density estimation

6 0.06033808 58 jmlr-2006-Lower Bounds and Aggregation in Density Estimation

7 0.043089963 48 jmlr-2006-Learning Minimum Volume Sets

8 0.042439729 86 jmlr-2006-Step Size Adaptation in Reproducing Kernel Hilbert Space

9 0.04220365 82 jmlr-2006-Some Theory for Generalized Boosting Algorithms

10 0.03917788 46 jmlr-2006-Learning Factor Graphs in Polynomial Time and Sample Complexity

11 0.037213616 55 jmlr-2006-Linear Programming Relaxations and Belief Propagation -- An Empirical Study     (Special Topic on Machine Learning and Optimization)

12 0.034802191 4 jmlr-2006-A Linear Non-Gaussian Acyclic Model for Causal Discovery

13 0.033399034 73 jmlr-2006-Pattern Recognition for Conditionally Independent Data

14 0.032745466 5 jmlr-2006-A Robust Procedure For Gaussian Graphical Model Search From Microarray Data WithpLarger Thann

15 0.03273084 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification

16 0.031954382 20 jmlr-2006-Collaborative Multiagent Reinforcement Learning by Payoff Propagation

17 0.026025452 79 jmlr-2006-Second Order Cone Programming Approaches for Handling Missing and Uncertain Data     (Special Topic on Machine Learning and Optimization)

18 0.023427604 17 jmlr-2006-Bounds for the Loss in Probability of Correct Classification Under Model Based Approximation

19 0.022551827 6 jmlr-2006-A Scoring Function for Learning Bayesian Networks based on Mutual Information and Conditional Independence Tests

20 0.021551974 13 jmlr-2006-Adaptive Prototype Learning Algorithms: Theoretical and Experimental Studies


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.147), (1, -0.01), (2, -0.036), (3, -0.169), (4, -0.033), (5, 0.259), (6, 0.023), (7, 0.086), (8, 0.178), (9, -0.127), (10, 0.058), (11, -0.159), (12, -0.154), (13, -0.309), (14, -0.094), (15, -0.075), (16, 0.114), (17, 0.078), (18, 0.035), (19, 0.079), (20, -0.2), (21, -0.044), (22, -0.017), (23, 0.027), (24, 0.041), (25, -0.243), (26, 0.017), (27, 0.036), (28, 0.016), (29, 0.093), (30, 0.109), (31, -0.037), (32, -0.063), (33, 0.129), (34, 0.038), (35, 0.007), (36, 0.002), (37, -0.043), (38, 0.112), (39, 0.037), (40, 0.107), (41, 0.051), (42, -0.025), (43, -0.085), (44, 0.024), (45, 0.025), (46, 0.042), (47, 0.019), (48, 0.014), (49, -0.146)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95627546 24 jmlr-2006-Consistency of Multiclass Empirical Risk Minimization Methods Based on Convex Loss

Author: Di-Rong Chen, Tao Sun

Abstract: The consistency of classification algorithms plays a central role in statistical learning theory. A consistent algorithm guarantees us that taking more samples essentially suffices to roughly reconstruct the unknown distribution. We consider the consistency of the ERM scheme over classes of combinations of very simple rules (base classifiers) in multiclass classification. Our approach is, under some mild conditions, to establish a quantitative relationship between classification errors and convex risks. In comparison with related previous work, a feature of our result is that the conditions are mainly expressed in terms of the differences between some values of the convex function. Keywords: multiclass classification, classifier, consistency, empirical risk minimization, constrained comparison method, Tsybakov noise condition

2 0.63505352 34 jmlr-2006-Generalized Bradley-Terry Models and Multi-Class Probability Estimates

Author: Tzu-Kuo Huang, Ruby C. Weng, Chih-Jen Lin

Abstract: The Bradley-Terry model for obtaining individual skill from paired comparisons has been popular in many areas. In machine learning, this model is related to multi-class probability estimates by coupling all pairwise classification results. Error correcting output codes (ECOC) are a general framework to decompose a multi-class problem to several binary problems. To obtain probability estimates under this framework, this paper introduces a generalized Bradley-Terry model in which paired individual comparisons are extended to paired team comparisons. We propose a simple algorithm with convergence proofs to solve the model and obtain individual skill. Experiments on synthetic and real data demonstrate that the algorithm is useful for obtaining multi-class probability estimates. Moreover, we discuss four extensions of the proposed model: 1) weighted individual skill, 2) home-field advantage, 3) ties, and 4) comparisons with more than two teams. Keywords: Bradley-Terry model, probability estimates, error correcting output codes, support vector machines

3 0.48146155 84 jmlr-2006-Stability Properties of Empirical Risk Minimization over Donsker Classes

Author: Andrea Caponnetto, Alexander Rakhlin

Abstract: We study some stability properties of algorithms which minimize (or almost-minimize) empirical error over Donsker classes of functions. We show that, as the number n of samples grows, the L_2 diameter of the set of almost-minimizers of empirical error with tolerance ξ(n) = o(n^{−1/2}) converges to zero in probability. Hence, even in the case of multiple minimizers of expected error, as n increases it becomes less and less likely that adding a sample (or a number of samples) to the training set will result in a large jump to a new hypothesis. Moreover, under some assumptions on the entropy of the class, along with an assumption of Komlos-Major-Tusnady type, we derive a power rate of decay for the diameter of almost-minimizers. This rate, through an application of a uniform ratio limit inequality, is shown to govern the closeness of the expected errors of the almost-minimizers. In fact, under the above assumptions, the expected errors of almost-minimizers become closer with a rate strictly faster than n^{−1/2}. Keywords: empirical risk minimization, empirical processes, stability, Donsker classes

4 0.46810937 40 jmlr-2006-Infinite-σ Limits For Tikhonov Regularization

Author: Ross A. Lippert, Ryan M. Rifkin

Abstract: We consider the problem of Tikhonov regularization with a general convex loss function: this formalism includes support vector machines and regularized least squares. For a family of kernels that includes the Gaussian, parameterized by a “bandwidth” parameter σ, we characterize the limiting solution as σ → ∞. In particular, we show that if we set the regularization parameter λ = λ̃ σ^{−2p}, the regularization term of the Tikhonov problem tends to an indicator function on polynomials of degree ⌊p⌋ (with residual regularization in the case where p ∈ Z). The proof rests on two key ideas: epi-convergence, a notion of functional convergence under which limits of minimizers converge to minimizers of limits, and a value-based formulation of learning, where we work with regularization on the function output values (y) as opposed to the function expansion coefficients in the RKHS. Our result generalizes and unifies previous results in this area. Keywords: Tikhonov regularization, Gaussian kernel, theory, kernel machines

5 0.27461556 23 jmlr-2006-Consistency and Convergence Rates of One-Class SVMs and Related Algorithms

Author: Régis Vert, Jean-Philippe Vert

Abstract: We determine the asymptotic behaviour of the function computed by support vector machines (SVM) and related algorithms that minimize a regularized empirical convex loss function in the reproducing kernel Hilbert space of the Gaussian RBF kernel, in the situation where the number of examples tends to infinity, the bandwidth of the Gaussian kernel tends to 0, and the regularization parameter is held fixed. Non-asymptotic convergence bounds to this limit in the L2 sense are provided, together with upper bounds on the classification error that is shown to converge to the Bayes risk, therefore proving the Bayes-consistency of a variety of methods although the regularization term does not vanish. These results are particularly relevant to the one-class SVM, for which the regularization can not vanish by construction, and which is shown for the first time to be a consistent density level set estimator. Keywords: regularization, Gaussian kernel RKHS, one-class SVM, convex loss functions, kernel density estimation

6 0.21892086 46 jmlr-2006-Learning Factor Graphs in Polynomial Time and Sample Complexity

7 0.20772386 58 jmlr-2006-Lower Bounds and Aggregation in Density Estimation

8 0.19695 55 jmlr-2006-Linear Programming Relaxations and Belief Propagation -- An Empirical Study     (Special Topic on Machine Learning and Optimization)

9 0.18095563 4 jmlr-2006-A Linear Non-Gaussian Acyclic Model for Causal Discovery

10 0.17778873 48 jmlr-2006-Learning Minimum Volume Sets

11 0.14905818 27 jmlr-2006-Ensemble Pruning Via Semi-definite Programming     (Special Topic on Machine Learning and Optimization)

12 0.14512832 86 jmlr-2006-Step Size Adaptation in Reproducing Kernel Hilbert Space

13 0.14223945 82 jmlr-2006-Some Theory for Generalized Boosting Algorithms

14 0.1414154 17 jmlr-2006-Bounds for the Loss in Probability of Correct Classification Under Model Based Approximation

15 0.14094211 20 jmlr-2006-Collaborative Multiagent Reinforcement Learning by Payoff Propagation

16 0.13998696 73 jmlr-2006-Pattern Recognition for Conditionally Independent Data

17 0.1377923 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification

18 0.13775705 6 jmlr-2006-A Scoring Function for Learning Bayesian Networks based on Mutual Information and Conditional Independence Tests

19 0.11913314 1 jmlr-2006-A Direct Method for Building Sparse Kernel Learning Algorithms

20 0.11878043 53 jmlr-2006-Learning a Hidden Hypergraph


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(8, 0.019), (36, 0.044), (45, 0.02), (50, 0.14), (63, 0.018), (67, 0.406), (76, 0.011), (78, 0.011), (81, 0.087), (84, 0.02), (90, 0.044), (91, 0.02), (96, 0.047)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.91135061 75 jmlr-2006-Policy Gradient in Continuous Time

Author: Rémi Munos

Abstract: Policy search is a method for approximately solving an optimal control problem by performing a parametric optimization search in a given class of parameterized policies. In order to process a local optimization technique, such as a gradient method, we wish to evaluate the sensitivity of the performance measure with respect to the policy parameters, the so-called policy gradient. This paper is concerned with the estimation of the policy gradient for continuous-time, deterministic state dynamics, in a reinforcement learning framework, that is, when the decision maker does not have a model of the state dynamics. We show that usual likelihood ratio methods used in discrete-time, fail to proceed the gradient because they are subject to variance explosion when the discretization time-step decreases to 0. We describe an alternative approach based on the approximation of the pathwise derivative, which leads to a policy gradient estimate that converges almost surely to the true gradient when the timestep tends to 0. The underlying idea starts with the derivation of an explicit representation of the policy gradient using pathwise derivation. This derivation makes use of the knowledge of the state dynamics. Then, in order to estimate the gradient from the observable data only, we use a stochastic policy to discretize the continuous deterministic system into a stochastic discrete process, which enables to replace the unknown coefficients by quantities that solely depend on known data. We prove the almost sure convergence of this estimate to the true policy gradient when the discretization time-step goes to zero. The method is illustrated on two target problems, in discrete and continuous control spaces. Keywords: optimal control, reinforcement learning, policy search, sensitivity analysis, parametric optimization, gradient estimate, likelihood ratio method, pathwise derivation

same-paper 2 0.74244124 24 jmlr-2006-Consistency of Multiclass Empirical Risk Minimization Methods Based on Convex Loss

Author: Di-Rong Chen, Tao Sun

Abstract: The consistency of classification algorithms plays a central role in statistical learning theory. A consistent algorithm guarantees us that taking more samples essentially suffices to roughly reconstruct the unknown distribution. We consider the consistency of the ERM scheme over classes of combinations of very simple rules (base classifiers) in multiclass classification. Our approach is, under some mild conditions, to establish a quantitative relationship between classification errors and convex risks. In comparison with related previous work, a feature of our result is that the conditions are mainly expressed in terms of the differences between some values of the convex function. Keywords: multiclass classification, classifier, consistency, empirical risk minimization, constrained comparison method, Tsybakov noise condition

3 0.44778085 35 jmlr-2006-Geometric Variance Reduction in Markov Chains: Application to Value Function and Gradient Estimation

Author: Rémi Munos

Abstract: We study a variance reduction technique for Monte Carlo estimation of functionals in Markov chains. The method is based on designing sequential control variates using successive approximations of the function of interest V. Regular Monte Carlo estimates have a variance of O(1/N), where N is the number of sample trajectories of the Markov chain. Here, we obtain a geometric variance reduction O(ρ^N) (with ρ < 1) up to a threshold that depends on the approximation error V − A V, where A is an approximation operator linear in the values. Thus, if V belongs to the right approximation space (i.e. A V = V), the variance decreases geometrically to zero. An immediate application is value function estimation in Markov chains, which may be used for policy evaluation in a policy iteration algorithm for solving Markov Decision Processes. Another important domain, for which variance reduction is highly needed, is gradient estimation, that is computing the sensitivity ∂_α V of the performance measure V with respect to some parameter α of the transition probabilities. For example, in policy parametric optimization, computing an estimate of the policy gradient is required to perform a gradient optimization method. We show that, using two approximations for the value function and the gradient, a geometric variance reduction is also achieved, up to a threshold that depends on the approximation errors of both of those representations.

4 0.44139409 19 jmlr-2006-Causal Graph Based Decomposition of Factored MDPs

Author: Anders Jonsson, Andrew Barto

Abstract: We present Variable Influence Structure Analysis, or VISA, an algorithm that performs hierarchical decomposition of factored Markov decision processes. VISA uses a dynamic Bayesian network model of actions, and constructs a causal graph that captures relationships between state variables. In tasks with sparse causal graphs VISA exploits structure by introducing activities that cause the values of state variables to change. The result is a hierarchy of activities that together represent a solution to the original task. VISA performs state abstraction for each activity by ignoring irrelevant state variables and lower-level activities. In addition, we describe an algorithm for constructing compact models of the activities introduced. State abstraction and compact activity models enable VISA to apply efficient algorithms to solve the stand-alone subtask associated with each activity. Experimental results show that the decomposition introduced by VISA can significantly accelerate construction of an optimal, or near-optimal, policy. Keywords: Markov decision processes, hierarchical decomposition, state abstraction

5 0.40619999 58 jmlr-2006-Lower Bounds and Aggregation in Density Estimation

Author: Guillaume Lecué

Abstract: In this paper we prove the optimality of an aggregation procedure. We prove lower bounds for aggregation of model selection type of M density estimators for the Kullback-Leibler divergence (KL), the Hellinger’s distance and the L1 -distance. The lower bound, with respect to the KL distance, can be achieved by the on-line type estimate suggested, among others, by Yang (2000a). Combining these results, we state that log M/n is an optimal rate of aggregation in the sense of Tsybakov (2003), where n is the sample size. Keywords: aggregation, optimal rates, Kullback-Leibler divergence

6 0.38694453 10 jmlr-2006-Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems

7 0.37636098 23 jmlr-2006-Consistency and Convergence Rates of One-Class SVMs and Related Algorithms

8 0.36892653 6 jmlr-2006-A Scoring Function for Learning Bayesian Networks based on Mutual Information and Conditional Independence Tests

9 0.36478138 67 jmlr-2006-On Representing and Generating Kernels by Fuzzy Equivalence Relations

10 0.36210212 74 jmlr-2006-Point-Based Value Iteration for Continuous POMDPs

11 0.36159423 66 jmlr-2006-On Model Selection Consistency of Lasso

12 0.36112899 9 jmlr-2006-Accurate Error Bounds for the Eigenvalues of the Kernel Matrix

13 0.35363179 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification

14 0.35276455 87 jmlr-2006-Stochastic Complexities of Gaussian Mixtures in Variational Bayesian Approximation

15 0.35150707 46 jmlr-2006-Learning Factor Graphs in Polynomial Time and Sample Complexity

16 0.34953606 84 jmlr-2006-Stability Properties of Empirical Risk Minimization over Donsker Classes

17 0.34707472 82 jmlr-2006-Some Theory for Generalized Boosting Algorithms

18 0.34209859 22 jmlr-2006-Considering Cost Asymmetry in Learning Classifiers

19 0.33858168 17 jmlr-2006-Bounds for the Loss in Probability of Correct Classification Under Model Based Approximation

20 0.3319521 41 jmlr-2006-Kernel-Based Learning of Hierarchical Multilabel Classification Models     (Special Topic on Machine Learning and Optimization)