jmlr jmlr2009 jmlr2009-73 knowledge-graph by maker-knowledge-mining

73 jmlr-2009-Prediction With Expert Advice For The Brier Game


Source: pdf

Author: Vladimir Vovk, Fedor Zhdanov

Abstract: We show that the Brier game of prediction is mixable and find the optimal learning rate and substitution function for it. The resulting prediction algorithm is applied to predict results of football and tennis matches, with well-known bookmakers playing the role of experts. The theoretical performance guarantee is not excessively loose on the football data set and is rather tight on the tennis data set. Keywords: Brier game, classification, on-line prediction, strong aggregating algorithm, weighted average algorithm

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 UK Computer Learning Research Centre Department of Computer Science Royal Holloway, University of London Egham, Surrey TW20 0EX, England Editor: Yoav Freund Abstract We show that the Brier game of prediction is mixable and find the optimal learning rate and substitution function for it. [sent-7, score-0.264]

2 The resulting prediction algorithm is applied to predict results of football and tennis matches, with well-known bookmakers playing the role of experts. [sent-8, score-0.56]

3 The theoretical performance guarantee is not excessively loose on the football data set and is rather tight on the tennis data set. [sent-9, score-0.389]

4 Prediction Algorithm and Loss Bound A game of prediction consists of three components: the observation space Ω, the decision space Γ, and the loss function λ : Ω × Γ → R. [sent-39, score-0.243]
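
To make the triple (Ω, Γ, λ) concrete, here is a minimal sketch of the Brier game as studied in the paper: a finite outcome set, the probability distributions on it as decision space, and the quadratic (Brier) loss. The outcome labels and function names are illustrative; the paper itself publishes no code.

```python
# Minimal sketch of the Brier game (Omega, Gamma, lambda): Omega is a finite
# outcome set, Gamma is the set of probability distributions on Omega, and the
# loss is the Brier (quadratic) score.

def brier_loss(omega, gamma):
    """lambda(omega, gamma) = sum over o of (gamma[o] - 1{o == omega})^2."""
    return sum((p - (1.0 if o == omega else 0.0)) ** 2 for o, p in gamma.items())

# Three football outcomes: home win, draw, away win.
gamma = {"home": 0.5, "draw": 0.3, "away": 0.2}
print(brier_loss("home", gamma))   # (0.5-1)^2 + 0.3^2 + 0.2^2 = 0.38
```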

5 ) The game of prediction is being played repeatedly by a learner having access to decisions made by a pool of experts, which leads to the following prediction protocol: Protocol 1 Prediction with expert advice L0 := 0. [sent-42, score-0.441]

6 At each step of Protocol 1, Learner is given the K experts' advice and is required to come up with his own decision; L_N is his cumulative loss over the first N steps, and L_N^k is the kth expert's cumulative loss over the first N steps. [sent-61, score-0.251]
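
The protocol in sentence 6 can be sketched as a plain loop that tracks the learner's and the experts' cumulative losses. The decision rule below is only a placeholder (a uniform average of the experts' advice), not the paper's strong aggregating algorithm; `loss` can be the brier_loss function from the sketch above.

```python
# Sketch of Protocol 1 (prediction with expert advice).  The learner's decision
# rule here is a placeholder (uniform average of the advice), not Algorithm 1.

def run_protocol(rounds, loss):
    """rounds: iterable of (advice, outcome); advice is a list of probability
    dicts, one per expert.  Returns (L_N, [L_N^k for each expert k])."""
    learner_loss, expert_losses = 0.0, None
    for advice, outcome in rounds:
        if expert_losses is None:
            expert_losses = [0.0] * len(advice)
        outcomes = advice[0].keys()
        decision = {o: sum(g[o] for g in advice) / len(advice) for o in outcomes}
        learner_loss += loss(outcome, decision)
        for k, g in enumerate(advice):
            expert_losses[k] += loss(outcome, g)
    return learner_loss, expert_losses
```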

7 An optimal (in the sense of Theorem 1 below) strategy for Learner in prediction with expert advice for the Brier game is given by the strong aggregating algorithm (see Algorithm 1). [sent-63, score-0.4]

8 Algorithm 1 Strong aggregating algorithm for the Brier game: w_k := 1, k = 1, . [sent-68, score-0.247]

9 Theorem 1 Using Algorithm 1 as Learner's strategy in Protocol 1 for the Brier game guarantees that L_N ≤ min_{k=1,...,K} L_N^k + ln K (1). [sent-85, score-0.349]
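
Algorithm 1 combines exponential weighting at learning rate η = 1 with a Brier-specific substitution function derived later in the paper. The sketch below shows only the weight-update step; the decision step is replaced by a weighted average of the advice, so it illustrates the weighting but is not the paper's algorithm verbatim.

```python
import math

# Exponential-weighting core of the strong aggregating algorithm (eta = 1).
# The decision step below (weighted average of the advice) is a stand-in for
# the paper's Brier-specific substitution function.

def aa_decision(weights, advice):
    total = sum(weights)
    outcomes = advice[0].keys()
    return {o: sum(w * g[o] for w, g in zip(weights, advice)) / total
            for o in outcomes}

def aa_update(weights, advice, outcome, loss, eta=1.0):
    # w_k := w_k * exp(-eta * lambda(outcome, gamma_k))
    return [w * math.exp(-eta * loss(outcome, g))
            for w, g in zip(weights, advice)]
```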

10 If A < ln K, Learner does not have a strategy guaranteeing L_N ≤ min_{k=1,...,K} L_N^k + A. [sent-92, score-0.196]

11 (And the seasons mentioned above were chosen because the forecasts of these bookmakers are available for them. [sent-106, score-0.222]

12 The bookmakers do not announce these numbers directly; instead, they quote three betting odds, a1 , a2 , and a3 . [sent-108, score-0.225]

13 Each number ai > 1 is the total amount which the bookmaker undertakes to pay out to a client betting on outcome i per unit stake in the event that i happens (if the bookmaker wishes to return the stake to the bettor, it should be included in ai ; i. [sent-109, score-0.342]

14 In this respect γ − 1 is similar to the overround; indeed, the approximate value of the overround is (γ − 1) ∑_{i=1}^{3} a_i^{-1} ln a_i, assuming that the overround is small and none of the a_i is too close to 0. [sent-123, score-0.382]

15 The coefficient of proportionality ∑_{i=1}^{3} a_i^{-1} ln a_i can be interpreted as the entropy of the quoted betting odds. [sent-124, score-0.338]
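
Sentences 12-15 describe how quoted betting odds a_1, a_2, a_3 are turned into probability forecasts. Two natural readings are sketched below: a power-law calibration p_i = a_i^{-γ} with γ chosen so that the probabilities sum to 1 (an assumption about formula (3), but one consistent with the overround approximation (γ − 1) ∑ a_i^{-1} ln a_i above and with the multiplicativity property in sentence 71), and the plain normalisation p_i ∝ 1/a_i (an assumption about formula (26)).

```python
from math import log

def probs_power_law(odds, tol=1e-12):
    """p_i = a_i^(-gamma), gamma >= 0 chosen so the probabilities sum to 1.
    Assumes all quoted odds are > 1 (bisection over gamma)."""
    lo, hi = 0.0, 50.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if sum(a ** (-mid) for a in odds) > 1.0:
            lo = mid            # probabilities still too large: raise gamma
        else:
            hi = mid
    gamma = (lo + hi) / 2.0
    return [a ** (-gamma) for a in odds], gamma

def probs_naive(odds):
    """p_i proportional to 1/a_i."""
    inv = [1.0 / a for a in odds]
    s = sum(inv)
    return [x / s for x in inv]

odds = [1.5, 4.0, 7.0]              # overround = sum(1/a_i) - 1 > 0
p, gamma = probs_power_law(odds)
print(p, gamma)
print(probs_naive(odds))
print((gamma - 1) * sum(log(a) / a for a in odds))   # overround approximation
```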

16 The results of applying Algorithm 1 to the football data, with 8 experts and 3 possible observations, are shown in Figure 1. [sent-125, score-0.346]

17 The excess loss can be negative, but from the first part of Theorem 1 (Equation (1)) we know that it cannot be less than − ln 8; this lower bound is also shown in Figure 1. [sent-133, score-0.295]

18 We can see that at each moment in time the algorithm’s cumulative loss is fairly close to the cumulative loss of the best expert (at that time; the best expert keeps changing over time). [sent-135, score-0.36]

19 The data contain information about the winner of each match and the betting odds of 4 bookmakers for his/her win and for the opponent’s win. [sent-144, score-0.297]

20 In both Figure 1 and Figure 3 the cumulative loss of Algorithm 1 is close to the cumulative loss of the best expert. [sent-150, score-0.206]

21 The theoretical bound is not hopelessly loose for the football data and is rather tight for the tennis data. [sent-151, score-0.418]

22 The theoretical lower bound − ln 8 from Theorem 1 is also shown. [sent-156, score-0.241]

23 A vector f ∈ R^Ω (understood to be a function f : Ω → R) is a superprediction if there is γ ∈ Γ such that, for all ω ∈ Ω, λ(ω, γ) ≤ f(ω); the set Σ of all superpredictions is the superprediction set. [sent-167, score-0.372]

24 (4) The image Φη (Σ) of the superprediction set will be called the η-exponential superprediction set. [sent-169, score-0.372]

25 ,K η can be guaranteed if and only if the η-exponential superprediction set is convex (part “if” for all K and part “only if” for K → ∞ are proved in Vovk, 1998; part “only if” for all K is proved by Chris Watkins, and the details can be found in Appendix A). [sent-176, score-0.186]

26 Figure 2: The overround distribution histogram for the football data, with 200 bins of equal size between the minimum and maximum values of the overround. [sent-185, score-0.303]

27 Define the η-exponential superprediction surface to be the part of the boundary of the η-exponential superprediction set Φη(Σ) lying inside (0, ∞)^Ω. [sent-187, score-0.412]

28 Now, since the η-exponential superprediction set is convex for all η < 1, it is also convex for η = 1. [sent-192, score-0.186]

29 Let us now check that the Gauss-Kronecker curvature of the η-exponential superprediction surface is always positive when η < 1 and is sometimes negative when η > 1 (the rest of the proof, an elaboration of the above argument, will be easy). [sent-193, score-0.309]
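
For the binary case n = 2, the claim in sentence 29 can be probed numerically: the boundary curve of the η-exponential superprediction set, u ↦ (e^{-2η(1-u)^2}, e^{-2ηu^2}) (obtained from the Brier loss with two outcomes), bounds a convex region for η ≤ 1 and bulges outward somewhere when η > 1. The sketch below is an illustration added here, not part of the paper's proof; it checks the sign of d²y/dx² along that curve by finite differences.

```python
import math

def max_second_derivative(eta, num=2001):
    """Largest value of d^2 y / d x^2 along the curve
    (x(u), y(u)) = (exp(-2*eta*(1-u)^2), exp(-2*eta*u^2)), u in (0, 1).
    A positive value means the region below the curve is not convex."""
    us = [0.01 + 0.98 * i / (num - 1) for i in range(num)]
    x = [math.exp(-2 * eta * (1 - u) ** 2) for u in us]
    y = [math.exp(-2 * eta * u ** 2) for u in us]
    h = us[1] - us[0]
    worst = float("-inf")
    for i in range(1, num - 1):
        xp = (x[i + 1] - x[i - 1]) / (2 * h)        # dx/du
        yp = (y[i + 1] - y[i - 1]) / (2 * h)        # dy/du
        xpp = (x[i + 1] - 2 * x[i] + x[i - 1]) / h ** 2
        ypp = (y[i + 1] - 2 * y[i] + y[i - 1]) / h ** 2
        worst = max(worst, (xp * ypp - yp * xpp) / xp ** 3)
    return worst

print(max_second_derivative(1.0))   # ~0: eta = 1 is the critical learning rate
print(max_second_derivative(1.5))   # clearly positive: convexity fails
```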

30 A convenient parametric representation of the η-exponential superprediction surface is (x_1, . . . , x_n) with x_i = e^{−η((u_i − 1)^2 + ∑_{j≠i} u_j^2)}, i = 1, . . . , n, [sent-198, score-0.226]

31 where u_1, . . . , u_{n−1} ∈ (0, 1) subject to u_1 + · · · + u_{n−1} < 1, and u_n is a shorthand for 1 − u_1 − · · · − u_{n−1}. [sent-212, score-0.659]

32 Figure 4: The overround distribution histogram for the tennis data. [sent-241, score-0.241]

33–39 [Sentences 33–39 are fragments of displayed determinants from the curvature computation. The recoverable content: in the expansion used in the proof, the coefficient in front of e_n is proportional to e^{−2ηu_n} times the determinant of the (n−1)×(n−1) matrix with entries (u_n − u_i) + δ_{ij}, and this determinant equals n u_n; hence that coefficient is n u_n e^{−2ηu_n} up to the coefficient of proportionality e^{2η}(−1)^{n−1}. (sent-294 through sent-382)]
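
The determinant value quoted in the recovered fragment above can be checked directly with the matrix determinant lemma; this short verification is added for the reader and is not text from the paper.

```latex
% The matrix in the fragment has entries (u_n - u_i) + \delta_{ij},
% i, j = 1, \dots, n-1, i.e. it is the identity plus a rank-one matrix.
\[
\det\!\bigl(I_{n-1} + a\,\mathbf{1}^{\mathsf T}\bigr)
  = 1 + \sum_{i=1}^{n-1} (u_n - u_i)
  = 1 + (n-1)\,u_n - (1 - u_n)
  = n\,u_n ,
  \qquad a_i := u_n - u_i ,
\]
% using u_1 + \cdots + u_{n-1} = 1 - u_n; hence the coefficient in front of
% e_n is proportional to n u_n e^{-2\eta u_n}.
```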

40 If η > 1, set u1 = u2 := 1/2 and u3 = · · · = un := 0. [sent-399, score-0.659]

41 This reduces (10) to 1/t_1 + · · · + 1/t_n > n (11) if t_1 · · · t_n > 0, and to 1/t_1 + · · · + 1/t_n < n (12) if t_1 · · · t_n < 0. [sent-414, score-0.202]

42 Then t_{n−1} + t_n > 0, and so 1/t_{n−1} + 1/t_n < 0; therefore, 1/t_1 + · · · + 1/t_{n−2} + 1/t_{n−1} + 1/t_n < n − 2 < n. [sent-437, score-0.202]

43 The η-exponential superprediction surface will be oriented by choosing the normal vector field directed towards the origin. [sent-439, score-0.226]

44 [Fragment of a displayed pair of vectors, with last components x_n e^{2ηu_n} and u_n e^{−2ηu_n},] with both coefficients of proportionality positive (cf. [sent-449, score-0.726]

45 In the case η > 1, the Gauss-Kronecker curvature is negative at some point, and so the η-exponential superprediction set is not convex (Thorpe, 1979, Chapter 13, Theorem 1 and its proof). [sent-456, score-0.269]

46 Because of the continuity of the η-exponential superprediction surface in η we can and will assume, without loss of generality, that η < 1. [sent-458, score-0.277]

47 , un−1 , η) of the η-exponential superprediction surface is always positive (among the arguments of k1 we list not only the coordinates u1 , . [sent-462, score-0.226]

48 Suppose there are two points A and B on the η-exponential superprediction surface such that the interval [A, B] contains points outside the η-exponential superprediction set. [sent-487, score-0.412]

49 In this section we will find a substitution function for the strong aggregating algorithm for the Brier game with η ≤ 1, which is the only component of the algorithm not described explicitly in Vovk (2001). [sent-494, score-0.272]

50 , ln )T computed by the aggregating pseudo-algorithm from a normalized distribution on the experts. [sent-502, score-0.265]

51 . . . , l_n)^T is a superprediction (remember that we are assuming η ≤ 1), we are only required to find a permitted prediction (λ_1, λ_2, . . .)^T with λ_1 = (u_1 − 1)^2 + u_2^2 + · · · + u_n^2 and λ_2 = u_1^2 + (u_2 − 1)^2 + · · · + u_n^2, etc. (15) [sent-506, score-0.421]

52 , un = un ; in the latter case, however, we can, and will, also choose ui > 0) for which εi := ui − ui is maximal. [sent-561, score-1.435]

53 ) In this appendix we will use a slightly more general notion of a game of prediction (Ω, Γ, λ): namely, the loss function λ : Ω × Γ → R is now allowed to take values in the extended real line R := R ∪ {−∞, ∞} (although the value −∞ will be later disallowed). [sent-583, score-0.243]

54 For each K we will be interested in the set of those a > 0 for which Learner has a winning strategy in the game GK (a) (we will denote this by L ⌣ GK (a)). [sent-597, score-0.193]

55 We say that the game of prediction (Ω, Γ, λ) is η-mixable, where η > 0, if ∀γ_1 ∈ Γ, γ_2 ∈ Γ, α ∈ [0, 1] ∃δ ∈ Γ ∀ω ∈ Ω : e^{−ηλ(ω,δ)} ≥ αe^{−ηλ(ω,γ_1)} + (1 − α)e^{−ηλ(ω,γ_2)}. [sent-601, score-0.192]

56 The game of prediction is mixable if it is η-mixable for some η > 0. [sent-603, score-0.231]

57 (1952, Theorem 92, applied to the means Mφ with φ(x) = e−ηx ) that if the prediction game is η-mixable it will remain η′ -mixable for any positive η′ < η. [sent-605, score-0.192]

58 ) Let η∗ be the supremum of the η for which the prediction game is η-mixable (with η∗ := 0 when the game is not mixable). [sent-607, score-0.345]

59 The compactness of Γ implies that the prediction game is η∗ -mixable. [sent-608, score-0.192]

60 In view of the fact that L ⌣ GK (ln K/η∗ ), we only need to show that L ⌣ GK (a) does not hold for a < ln K/η∗ . [sent-619, score-0.196]

61 On the other hand, the point (1, a/ ln K) is Southwest and outside of the separation curve (use Lemmas 8–12 of Vovk, 1998). [sent-623, score-0.196]

62 , Environment) has a winning strategy in the game G (1, a/ ln K). [sent-626, score-0.389]

63 It is easy to see from the proof of Theorem 1 in Vovk (1998) that the definition of the game G can be modified, without changing the conclusion about G (1, a/ ln K), by replacing the line E chooses n ≥ 1 {size of the pool} in the protocol on p. [sent-627, score-0.376]

64 Table 1: The bookmakers' cumulative Brier losses over the football data set when their probability forecasts are computed using formula (3) and formula (26). [sent-655, score-0.366]

65 When the number K^m of experts exceeds n^∗, we obtain a contradiction: Learner can guarantee L_N ≤ L_N^k + ma for all N and all K^m experts k, and Environment can guarantee that L_N > L_N^k + a ln(K^m)/ln K = L_N^k + ma for some N and k. [sent-657, score-0.468]

66 The improvement of each bookmaker’s total loss over the football data set is in the range 0. [sent-662, score-0.261]

67 Table 2: The bookmakers' cumulative Brier losses over the tennis data set when their probability forecasts are computed using formula (3) and formula (26). [sent-682, score-0.304]

68 Different bookmakers (and the same bookmaker at different times) can use different functions f . [sent-683, score-0.287]

69 Therefore, different bookmakers may quote different odds because they may use different f and because they may assign different probabilities to the same event. [sent-684, score-0.208]

70 For example, introducing a new function g : (0, ∞) → (0, ∞) by g(u) := ln f (e−u ) and new variables x, y ∈ (0, ∞) by x := − ln p and y := − ln q, we transform (27) to the most standard Cauchy equation g(x + y) = g(x) + g(y). [sent-689, score-0.588]
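
For completeness, the standard one-line solution of this Cauchy equation (added here, not quoted from the paper): under the usual regularity assumption its solutions are linear, which makes f a power function; this is consistent with the multiplicative behaviour of quoted odds in sentence 71 and with the single parameter γ appearing in sentence 14.

```latex
% Regular (e.g. measurable) solutions of Cauchy's equation are linear:
\[
g(x + y) = g(x) + g(y) \;\Longrightarrow\; g(x) = c\,x
\quad\text{for some constant } c > 0 .
\]
% Undoing the substitutions g(u) = \ln f(e^{-u}) and x = -\ln p:
\[
\ln f(p) = g(-\ln p) = -c \ln p
\;\Longrightarrow\;
f(p) = p^{-c},
\]
% i.e. f is a power function of its argument.
```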

71 If there are two events with quoted odds a and b that the bookmaker considers independent, his quoted odds on the conjunction of the two events will be ab. [sent-697, score-0.28]

72 An important advantage of (3) over (26) is that (3) does not impose any upper limits on the overround that the bookmaker may charge (Khutsishvili, 2009). [sent-702, score-0.217]

73 We can see that for the football data the maximal difference between the cumulative loss of the WdAA and the cumulative loss of the best expert is slightly larger than that for Algorithm 1 but still well within the optimal bound ln K given by (1). [sent-721, score-0.762]

74 For the tennis data the maximal difference is almost twice as large as for Algorithm 1, violating the optimal bound ln K. [sent-722, score-0.417]

75 {(u_1, . . . , u_n) ∈ [0, ∞)^n | ∑_{i=1}^{n} u_i = 1}, and Reality's move is known to be one of the vertices of this simplex. [sent-738, score-0.698]

76 Figure 6: The difference between the cumulative loss of each of the 4 bookmakers and of the WdAA for c := 4 on the tennis data (plot legend: theoretical bound for Algorithm 1, Weighted Average Algorithm, experts). [sent-746, score-0.595]

77 Table 3: The maximal difference between the loss of each algorithm in the selected set and the loss of the best expert for the football data (second column); the theoretical upper bound on this difference (third column). [sent-752, score-0.478]

78 As described in Kivinen and Warmuth (1999), the WdAA is parameterized by c := 1/η instead of η, and the optimal value of c is c = 8R^2, leading to the guaranteed loss bound L_N ≤ min_{k=1,...,K} L_N^k + 8R^2 ln K. [sent-756, score-0.276]
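
As its name suggests, the Weighted Average Algorithm predicts with a weighted average of the experts' predictions, the weight of an expert decaying exponentially in its cumulative loss at rate 1/c. The sketch below is a generic reconstruction under that reading (not code from Kivinen and Warmuth, 1999), shown only to fix ideas for the comparison that follows.

```python
import math

# WdAA-style decision: weighted average of the experts' advice with weight
# exp(-L_k / c) for an expert whose cumulative loss so far is L_k.
# This is a generic reconstruction, not the authors' implementation.

def wdaa_decision(cum_losses, advice, c):
    weights = [math.exp(-L / c) for L in cum_losses]
    total = sum(weights)
    outcomes = advice[0].keys()
    return {o: sum(w * g[o] for w, g in zip(weights, advice)) / total
            for o in outcomes}
```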

79 . . . (28), where L_N(c) is the loss of the WdAA with parameter c on the football data over the first N steps and L_N^k is the analogous loss of the kth expert, as a function of c. [sent-771, score-0.312]

80 Figure 7: The maximal difference (28) for the WdAA as a function of the parameter c on the football data. [sent-793, score-0.254]

81 The theoretical guarantee ln 8 for the maximal difference for Algorithm 1 is also shown (the theoretical guarantee for the WdAA, 11. [sent-794, score-0.272]

82 Figure 8: The maximal difference (29) for the WdAA as a function of the parameter c on the tennis data. [sent-799, score-0.192]

83 Figure 9: The maximal difference ((28) with η in place of c) for Algorithm 1 as a function of the parameter η on the football data. [sent-809, score-0.254]

84 Figure 10: The maximal difference ((29) with η in place of c) for Algorithm 1 as a function of the parameter η on the tennis data. [sent-819, score-0.192]

85 Figure 12: The difference between the cumulative loss of each of the 4 bookmakers and of the WdAA for c = 1 on the tennis data (plot legend: theoretical bound for Algorithm 1, Weighted Average Algorithm, experts). [sent-821, score-0.595]

86 Table 4: The maximal difference between the loss of each algorithm in the selected set and the loss of the best expert for the tennis data (second column); the theoretical upper bound on this difference (third column). [sent-827, score-0.416]

87 The following two algorithms, the weak aggregating algorithm (WkAA) and the Hedge algorithm (HA), make increasingly weaker assumptions about the prediction game being played. [sent-828, score-0.261]

88 In the notation of (1), a simple loss bound for the WkAA is L_N ≤ min_{k=1,...,K} L_N^k + 2L√(N ln K) (30) [sent-839, score-0.276]

89 (Kalnishkan and Vyugin, 2008, Corollary 14); this is quite different from (1) as the "regret term" 2L√(N ln K) in (30) depends on N. [sent-842, score-0.196]

90 For c = 8 ln K/L, Cesa-Bianchi and Lugosi (2006, Theorem 2. [sent-844, score-0.196]

91 3) prove the stronger bound L_N ≤ min_{k=1,...,K} L_N^k + L√(2N ln K) + L. [sent-845, score-0.225]

92 Moreover, the WkAA violates the bound for Algorithm 1 for all reasonable values of c on some natural subsets of the football data set: for example, when prediction starts from the second (2006/2007) season. [sent-850, score-0.278]

93 The loss bound for the HA is E L_N ≤ (L_N^∗ ln(1/β) + L ln K)/(1 − β) (31) (Freund and Schapire, 1997, Theorem 2), where E L_N stands for Learner's expected loss (the HA is a randomized algorithm) and L_N^∗ stands for min_{k=1,...,K} L_N^k. [sent-852, score-0.523]

94 In the same framework, the strong aggregating algorithm attains the stronger bound E L_N ≤ (L_N^∗ ln(1/β) + L ln K)/(K ln(K/(K + β − 1))) (32) (Vovk, 1998, Example 7). [sent-856, score-0.703]
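
Bound (32) is indeed at least as strong as bound (31): the two share the numerator L_N^∗ ln(1/β) + L ln K, and the denominator of (32) satisfies K ln(K/(K + β − 1)) ≥ 1 − β because ln(1 − x) ≤ −x. A quick numerical confirmation (added here, not from the paper):

```python
import math

# Check that the denominator of (32), K * ln(K / (K + beta - 1)), dominates the
# denominator of (31), 1 - beta, for a few values of K and beta.

for K in (2, 8, 100):
    for beta in (0.1, 0.5, 0.9):
        d31 = 1.0 - beta
        d32 = K * math.log(K / (K + beta - 1.0))
        assert d32 >= d31 - 1e-12
        print(f"K={K:3d}  beta={beta:.1f}  1-beta={d31:.4f}  "
              f"K*ln(K/(K+beta-1))={d32:.4f}")
```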

95 The losses suffered by the HA and the SAA-HA on our data sets are very close and violate Algorithm 1’s regret term ln K for all values of β. [sent-875, score-0.219]

96 It is interesting that, for both football and tennis data, the loss of the HA is almost minimized by setting its parameter β to 0 (the qualification “almost” is necessary only in the case of the tennis data). [sent-876, score-0.557]

97 The HA with β = 0 coincides with the Follow the Leader Algorithm (FLA), which chooses the same decision as the best (with the smallest loss up to now) expert; if there are several best experts (which almost never happens after the first step), their predictions are averaged with equal weights. [sent-877, score-0.187]

98 Its empirical performance on the football data set is not so bad: it violates the loss bound for Algorithm 1 only slightly; however, on the tennis data set the bound is violated badly. [sent-882, score-0.467]

99 The decent performance of the Follow the Leader Algorithm on the football data set suggests checking the empirical performance of other similarly naive algorithms, such as the following two. [sent-883, score-0.21]

100 The Bayes Mixture Algorithm (BMA) is the strong aggregating algorithm applied to the log loss function; this algorithm is in fact optimal, but not for the Brier loss function. [sent-885, score-0.188]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('un', 0.659), ('vovk', 0.25), ('wdaa', 0.241), ('football', 0.21), ('ln', 0.196), ('superprediction', 0.186), ('bookmakers', 0.163), ('game', 0.153), ('brier', 0.148), ('tennis', 0.148), ('experts', 0.136), ('bookmaker', 0.124), ('ame', 0.101), ('hdanov', 0.101), ('rier', 0.101), ('xpert', 0.101), ('tn', 0.101), ('overround', 0.093), ('ha', 0.089), ('dvice', 0.086), ('wkaa', 0.085), ('curvature', 0.083), ('khutsishvili', 0.078), ('rediction', 0.077), ('expert', 0.077), ('aggregating', 0.069), ('betting', 0.062), ('gk', 0.061), ('kalnishkan', 0.054), ('cumulative', 0.052), ('loss', 0.051), ('proportionality', 0.047), ('league', 0.047), ('learner', 0.046), ('odds', 0.045), ('advice', 0.045), ('maximal', 0.044), ('forecasts', 0.043), ('surface', 0.04), ('winning', 0.04), ('saa', 0.04), ('prediction', 0.039), ('ui', 0.039), ('fedor', 0.039), ('mixable', 0.039), ('vyugin', 0.039), ('zhdanov', 0.039), ('substitution', 0.033), ('quoted', 0.033), ('determinant', 0.032), ('ith', 0.032), ('kivinen', 0.03), ('vladimir', 0.03), ('bound', 0.029), ('protocol', 0.027), ('win', 0.027), ('thorpe', 0.026), ('victor', 0.026), ('lugosi', 0.026), ('wk', 0.025), ('ak', 0.025), ('desantis', 0.023), ('mixability', 0.023), ('decisions', 0.023), ('losses', 0.023), ('reality', 0.022), ('haussler', 0.022), ('matches', 0.022), ('warmuth', 0.021), ('xn', 0.02), ('announces', 0.02), ('hedge', 0.02), ('formula', 0.019), ('excess', 0.019), ('front', 0.019), ('pool', 0.019), ('watkins', 0.018), ('england', 0.018), ('strong', 0.017), ('positivity', 0.017), ('leader', 0.016), ('gn', 0.016), ('theorem', 0.016), ('theoretical', 0.016), ('merge', 0.016), ('acz', 0.016), ('bets', 0.016), ('bulging', 0.016), ('darboux', 0.016), ('hardy', 0.016), ('nun', 0.016), ('pinnacle', 0.016), ('rhul', 0.016), ('seasons', 0.016), ('snowberg', 0.016), ('stake', 0.016), ('tournaments', 0.016), ('yuri', 0.016), ('pi', 0.015), ('loose', 0.015)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000014 73 jmlr-2009-Prediction With Expert Advice For The Brier Game

Author: Vladimir Vovk, Fedor Zhdanov

Abstract: We show that the Brier game of prediction is mixable and find the optimal learning rate and substitution function for it. The resulting prediction algorithm is applied to predict results of football and tennis matches, with well-known bookmakers playing the role of experts. The theoretical performance guarantee is not excessively loose on the football data set and is rather tight on the tennis data set. Keywords: Brier game, classification, on-line prediction, strong aggregating algorithm, weighted average algorithm

2 0.11004494 53 jmlr-2009-Marginal Likelihood Integrals for Mixtures of Independence Models

Author: Shaowei Lin, Bernd Sturmfels, Zhiqiang Xu

Abstract: Inference in Bayesian statistics involves the evaluation of marginal likelihood integrals. We present algebraic algorithms for computing such integrals exactly for discrete data of small sample size. Our methods apply to both uniform priors and Dirichlet priors. The underlying statistical models are mixtures of independent distributions, or, in geometric language, secant varieties of SegreVeronese varieties. Keywords: marginal likelihood, exact integration, mixture of independence model, computational algebra

3 0.081003204 49 jmlr-2009-Learning Permutations with Exponential Weights

Author: David P. Helmbold, Manfred K. Warmuth

Abstract: We give an algorithm for the on-line learning of permutations. The algorithm maintains its uncertainty about the target permutation as a doubly stochastic weight matrix, and makes predictions using an efficient method for decomposing the weight matrix into a convex combination of permutations. The weight matrix is updated by multiplying the current matrix entries by exponential factors, and an iterative procedure is needed to restore double stochasticity. Even though the result of this procedure does not have a closed form, a new analysis approach allows us to prove an optimal (up to small constant factors) bound on the regret of our algorithm. This regret bound is significantly better than that of either Kalai and Vempala’s more efficient Follow the Perturbed Leader algorithm or the computationally expensive method of explicitly representing each permutation as an expert. Keywords: permutation, ranking, on-line learning, Hedge algorithm, doubly stochastic matrix, relative entropy projection, Sinkhorn balancing

4 0.048865084 67 jmlr-2009-Online Learning with Sample Path Constraints

Author: Shie Mannor, John N. Tsitsiklis, Jia Yuan Yu

Abstract: We study online learning where a decision maker interacts with Nature with the objective of maximizing her long-term average reward subject to some sample path average constraints. We define the reward-in-hindsight as the highest reward the decision maker could have achieved, while satisfying the constraints, had she known Nature’s choices in advance. We show that in general the reward-in-hindsight is not attainable. The convex hull of the reward-in-hindsight function is, however, attainable. For the important case of a single constraint, the convex hull turns out to be the highest attainable function. Using a calibrated forecasting rule, we provide an explicit strategy that attains this convex hull. We also measure the performance of heuristic methods based on non-calibrated forecasters in experiments involving a CPU power management problem. Keywords: online learning, calibration, regret minimization, approachability

5 0.039519597 65 jmlr-2009-On Uniform Deviations of General Empirical Risks with Unboundedness, Dependence, and High Dimensionality

Author: Wenxin Jiang

Abstract: The statistical learning theory of risk minimization depends heavily on probability bounds for uniform deviations of the empirical risks. Classical probability bounds using Hoeffding’s inequality cannot accommodate more general situations with unbounded loss and dependent data. The current paper introduces an inequality that extends Hoeffding’s inequality to handle these more general situations. We will apply this inequality to provide probability bounds for uniform deviations in a very general framework, which can involve discrete decision rules, unbounded loss, and a dependence structure that can be more general than either martingale or strong mixing. We will consider two examples with high dimensional predictors: autoregression (AR) with ℓ1 -loss, and ARX model with variable selection for sign classification, which uses both lagged responses and exogenous predictors. Keywords: dependence, empirical risk, probability bound, unbounded loss, uniform deviation

6 0.037665699 21 jmlr-2009-Data-driven Calibration of Penalties for Least-Squares Regression

7 0.036960602 79 jmlr-2009-Reinforcement Learning in Finite MDPs: PAC Analysis

8 0.036561687 48 jmlr-2009-Learning Nondeterministic Classifiers

9 0.035721164 66 jmlr-2009-On the Consistency of Feature Selection using Greedy Least Squares Regression

10 0.035637274 52 jmlr-2009-Margin-based Ranking and an Equivalence between AdaBoost and RankBoost

11 0.027808195 95 jmlr-2009-The P-Norm Push: A Simple Convex Ranking Algorithm that Concentrates at the Top of the List

12 0.023748726 59 jmlr-2009-Nearest Neighbor Clustering: A Baseline Method for Consistent Clustering with Arbitrary Objective Functions

13 0.02319243 45 jmlr-2009-Learning Approximate Sequential Patterns for Classification

14 0.021724371 57 jmlr-2009-Multi-task Reinforcement Learning in Partially Observable Stochastic Environments

15 0.021236518 32 jmlr-2009-Exploiting Product Distributions to Identify Relevant Variables of Correlation Immune Functions

16 0.019836791 40 jmlr-2009-Identification of Recurrent Neural Networks by Bayesian Interrogation Techniques

17 0.018558921 85 jmlr-2009-Settable Systems: An Extension of Pearl's Causal Model with Optimization, Equilibrium, and Learning

18 0.017725131 50 jmlr-2009-Learning When Concepts Abound

19 0.016929183 42 jmlr-2009-Incorporating Functional Knowledge in Neural Networks

20 0.0161066 91 jmlr-2009-Subgroup Analysis via Recursive Partitioning


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.105), (1, -0.045), (2, -0.01), (3, 0.002), (4, -0.051), (5, 0.047), (6, -0.082), (7, 0.076), (8, -0.017), (9, 0.006), (10, 0.067), (11, -0.042), (12, -0.131), (13, 0.036), (14, -0.108), (15, 0.276), (16, 0.005), (17, -0.285), (18, -0.265), (19, -0.289), (20, 0.153), (21, 0.114), (22, 0.116), (23, 0.161), (24, 0.023), (25, 0.19), (26, -0.048), (27, 0.188), (28, 0.088), (29, 0.038), (30, 0.017), (31, -0.005), (32, -0.074), (33, -0.105), (34, 0.09), (35, 0.072), (36, -0.011), (37, -0.031), (38, 0.054), (39, 0.067), (40, 0.099), (41, -0.014), (42, -0.013), (43, -0.027), (44, -0.1), (45, 0.053), (46, 0.075), (47, 0.158), (48, -0.033), (49, -0.221)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.98134452 73 jmlr-2009-Prediction With Expert Advice For The Brier Game

Author: Vladimir Vovk, Fedor Zhdanov

Abstract: We show that the Brier game of prediction is mixable and find the optimal learning rate and substitution function for it. The resulting prediction algorithm is applied to predict results of football and tennis matches, with well-known bookmakers playing the role of experts. The theoretical performance guarantee is not excessively loose on the football data set and is rather tight on the tennis data set. Keywords: Brier game, classification, on-line prediction, strong aggregating algorithm, weighted average algorithm

2 0.5669114 53 jmlr-2009-Marginal Likelihood Integrals for Mixtures of Independence Models

Author: Shaowei Lin, Bernd Sturmfels, Zhiqiang Xu

Abstract: Inference in Bayesian statistics involves the evaluation of marginal likelihood integrals. We present algebraic algorithms for computing such integrals exactly for discrete data of small sample size. Our methods apply to both uniform priors and Dirichlet priors. The underlying statistical models are mixtures of independent distributions, or, in geometric language, secant varieties of SegreVeronese varieties. Keywords: marginal likelihood, exact integration, mixture of independence model, computational algebra

3 0.35986963 49 jmlr-2009-Learning Permutations with Exponential Weights

Author: David P. Helmbold, Manfred K. Warmuth

Abstract: We give an algorithm for the on-line learning of permutations. The algorithm maintains its uncertainty about the target permutation as a doubly stochastic weight matrix, and makes predictions using an efficient method for decomposing the weight matrix into a convex combination of permutations. The weight matrix is updated by multiplying the current matrix entries by exponential factors, and an iterative procedure is needed to restore double stochasticity. Even though the result of this procedure does not have a closed form, a new analysis approach allows us to prove an optimal (up to small constant factors) bound on the regret of our algorithm. This regret bound is significantly better than that of either Kalai and Vempala’s more efficient Follow the Perturbed Leader algorithm or the computationally expensive method of explicitly representing each permutation as an expert. Keywords: permutation, ranking, on-line learning, Hedge algorithm, doubly stochastic matrix, relative entropy projection, Sinkhorn balancing

4 0.22310019 66 jmlr-2009-On the Consistency of Feature Selection using Greedy Least Squares Regression

Author: Tong Zhang

Abstract: This paper studies the feature selection problem using a greedy least squares regression algorithm. We show that under a certain irrepresentable condition on the design matrix (but independent of the sparse target), the greedy algorithm can select features consistently when the sample size approaches infinity. The condition is identical to a corresponding condition for Lasso. Moreover, under a sparse eigenvalue condition, the greedy algorithm can reliably identify features as long as each nonzero coefficient is larger than a constant times the noise level. In compar√ ison, Lasso may require the coefficients to be larger than O( s) times the noise level in the worst case, where s is the number of nonzero coefficients. Keywords: greedy algorithm, feature selection, sparsity

5 0.19440302 21 jmlr-2009-Data-driven Calibration of Penalties for Least-Squares Regression

Author: Sylvain Arlot, Pascal Massart

Abstract: Penalization procedures often suffer from their dependence on multiplying factors, whose optimal values are either unknown or hard to estimate from data. We propose a completely data-driven calibration algorithm for these parameters in the least-squares regression framework, without assuming a particular shape for the penalty. Our algorithm relies on the concept of minimal penalty, recently introduced by Birg´ and Massart (2007) in the context of penalized least squares for Gaussian hoe moscedastic regression. On the positive side, the minimal penalty can be evaluated from the data themselves, leading to a data-driven estimation of an optimal penalty which can be used in practice; on the negative side, their approach heavily relies on the homoscedastic Gaussian nature of their stochastic framework. The purpose of this paper is twofold: stating a more general heuristics for designing a datadriven penalty (the slope heuristics) and proving that it works for penalized least-squares regression with a random design, even for heteroscedastic non-Gaussian data. For technical reasons, some exact mathematical results will be proved only for regressogram bin-width selection. This is at least a first step towards further results, since the approach and the method that we use are indeed general. Keywords: data-driven calibration, non-parametric regression, model selection by penalization, heteroscedastic data, regressogram

6 0.19404987 48 jmlr-2009-Learning Nondeterministic Classifiers

7 0.17217331 67 jmlr-2009-Online Learning with Sample Path Constraints

8 0.15971871 79 jmlr-2009-Reinforcement Learning in Finite MDPs: PAC Analysis

9 0.13253522 52 jmlr-2009-Margin-based Ranking and an Equivalence between AdaBoost and RankBoost

10 0.13047044 95 jmlr-2009-The P-Norm Push: A Simple Convex Ranking Algorithm that Concentrates at the Top of the List

11 0.12178914 65 jmlr-2009-On Uniform Deviations of General Empirical Risks with Unboundedness, Dependence, and High Dimensionality

12 0.11436684 57 jmlr-2009-Multi-task Reinforcement Learning in Partially Observable Stochastic Environments

13 0.10391856 63 jmlr-2009-On Efficient Large Margin Semisupervised Learning: Method and Theory

14 0.099077456 46 jmlr-2009-Learning Halfspaces with Malicious Noise

15 0.091088928 98 jmlr-2009-Universal Kernel-Based Learning with Applications to Regular Languages    (Special Topic on Mining and Learning with Graphs and Relations)

16 0.091054514 69 jmlr-2009-Optimized Cutting Plane Algorithm for Large-Scale Risk Minimization

17 0.088071175 45 jmlr-2009-Learning Approximate Sequential Patterns for Classification

18 0.085602283 59 jmlr-2009-Nearest Neighbor Clustering: A Baseline Method for Consistent Clustering with Arbitrary Objective Functions

19 0.083604254 55 jmlr-2009-Maximum Entropy Discrimination Markov Networks

20 0.082448378 85 jmlr-2009-Settable Systems: An Extension of Pearl's Causal Model with Optimization, Equilibrium, and Learning


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(11, 0.021), (19, 0.013), (38, 0.03), (47, 0.048), (52, 0.027), (55, 0.043), (58, 0.029), (66, 0.152), (68, 0.011), (80, 0.444), (90, 0.045), (96, 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.72501802 73 jmlr-2009-Prediction With Expert Advice For The Brier Game

Author: Vladimir Vovk, Fedor Zhdanov

Abstract: We show that the Brier game of prediction is mixable and find the optimal learning rate and substitution function for it. The resulting prediction algorithm is applied to predict results of football and tennis matches, with well-known bookmakers playing the role of experts. The theoretical performance guarantee is not excessively loose on the football data set and is rather tight on the tennis data set. Keywords: Brier game, classification, on-line prediction, strong aggregating algorithm, weighted average algorithm

2 0.35543448 14 jmlr-2009-CarpeDiem: Optimizing the Viterbi Algorithm and Applications to Supervised Sequential Learning

Author: Roberto Esposito, Daniele P. Radicioni

Abstract: The growth of information available to learning systems and the increasing complexity of learning tasks determine the need for devising algorithms that scale well with respect to all learning parameters. In the context of supervised sequential learning, the Viterbi algorithm plays a fundamental role, by allowing the evaluation of the best (most probable) sequence of labels with a time complexity linear in the number of time events, and quadratic in the number of labels. In this paper we propose CarpeDiem, a novel algorithm allowing the evaluation of the best possible sequence of labels with a sub-quadratic time complexity.1 We provide theoretical grounding together with solid empirical results supporting two chief facts. CarpeDiem always finds the optimal solution requiring, in most cases, only a small fraction of the time taken by the Viterbi algorithm; meantime, CarpeDiem is never asymptotically worse than the Viterbi algorithm, thus confirming it as a sound replacement. Keywords: Viterbi algorithm, sequence labeling, conditional models, classifiers optimization, exact inference

3 0.35421503 55 jmlr-2009-Maximum Entropy Discrimination Markov Networks

Author: Jun Zhu, Eric P. Xing

Abstract: The standard maximum margin approach for structured prediction lacks a straightforward probabilistic interpretation of the learning scheme and the prediction rule. Therefore its unique advantages such as dual sparseness and kernel tricks cannot be easily conjoined with the merits of a probabilistic model such as Bayesian regularization, model averaging, and ability to model hidden variables. In this paper, we present a new general framework called maximum entropy discrimination Markov networks (MaxEnDNet, or simply, MEDN), which integrates these two approaches and combines and extends their merits. Major innovations of this approach include: 1) It extends the conventional max-entropy discrimination learning of classification rules to a new structural maxentropy discrimination paradigm of learning a distribution of Markov networks. 2) It generalizes the extant Markov network structured-prediction rule based on a point estimator of model coefficients to an averaging model akin to a Bayesian predictor that integrates over a learned posterior distribution of model coefficients. 3) It admits flexible entropic regularization of the model during learning. By plugging in different prior distributions of the model coefficients, it subsumes the wellknown maximum margin Markov networks (M3 N) as a special case, and leads to a model similar to an L1 -regularized M3 N that is simultaneously primal and dual sparse, or other new types of Markov networks. 4) It applies a modular learning algorithm that combines existing variational inference techniques and convex-optimization based M3 N solvers as subroutines. Essentially, MEDN can be understood as a jointly maximum likelihood and maximum margin estimate of Markov network. It represents the first successful attempt to combine maximum entropy learning (a dual form of maximum likelihood learning) with maximum margin learning of Markov network for structured input/output problems; and the basic principle can be generalized to learning arbi

4 0.34822246 16 jmlr-2009-Classification with Gaussians and Convex Loss

Author: Dao-Hong Xiang, Ding-Xuan Zhou

Abstract: This paper considers binary classification algorithms generated from Tikhonov regularization schemes associated with general convex loss functions and varying Gaussian kernels. Our main goal is to provide fast convergence rates for the excess misclassification error. Allowing varying Gaussian kernels in the algorithms improves learning rates measured by regularization error and sample error. Special structures of Gaussian kernels enable us to construct, by a nice approximation scheme with a Fourier analysis technique, uniformly bounded regularizing functions achieving polynomial decays of the regularization error under a Sobolev smoothness condition. The sample error is estimated by using a projection operator and a tight bound for the covering numbers of reproducing kernel Hilbert spaces generated by Gaussian kernels. The convexity of the general loss function plays a very important role in our analysis. Keywords: reproducing kernel Hilbert space, binary classification, general convex loss, varying Gaussian kernels, covering number, approximation

5 0.34479815 82 jmlr-2009-Robustness and Regularization of Support Vector Machines

Author: Huan Xu, Constantine Caramanis, Shie Mannor

Abstract: We consider regularized support vector machines (SVMs) and show that they are precisely equivalent to a new robust optimization formulation. We show that this equivalence of robust optimization and regularization has implications for both algorithms, and analysis. In terms of algorithms, the equivalence suggests more general SVM-like algorithms for classification that explicitly build in protection to noise, and at the same time control overfitting. On the analysis front, the equivalence of robustness and regularization provides a robust optimization interpretation for the success of regularized SVMs. We use this new robustness interpretation of SVMs to give a new proof of consistency of (kernelized) SVMs, thus establishing robustness as the reason regularized SVMs generalize well. Keywords: robustness, regularization, generalization, kernel, support vector machine

6 0.34425005 9 jmlr-2009-Analysis of Perceptron-Based Active Learning

7 0.34265777 66 jmlr-2009-On the Consistency of Feature Selection using Greedy Least Squares Regression

8 0.34132406 72 jmlr-2009-Polynomial-Delay Enumeration of Monotonic Graph Classes

9 0.34049967 49 jmlr-2009-Learning Permutations with Exponential Weights

10 0.34023851 79 jmlr-2009-Reinforcement Learning in Finite MDPs: PAC Analysis

11 0.33617103 80 jmlr-2009-Reproducing Kernel Banach Spaces for Machine Learning

12 0.33614179 78 jmlr-2009-Refinement of Reproducing Kernels

13 0.33459458 52 jmlr-2009-Margin-based Ranking and an Equivalence between AdaBoost and RankBoost

14 0.33368319 19 jmlr-2009-Controlling the False Discovery Rate of the Association Causality Structure Learned with the PC Algorithm    (Special Topic on Mining and Learning with Graphs and Relations)

15 0.33269006 18 jmlr-2009-Consistency and Localizability

16 0.3322593 22 jmlr-2009-Deterministic Error Analysis of Support Vector Regression and Related Regularized Kernel Methods

17 0.33156735 37 jmlr-2009-Generalization Bounds for Ranking Algorithms via Algorithmic Stability

18 0.33151913 51 jmlr-2009-Low-Rank Kernel Learning with Bregman Matrix Divergences

19 0.33051616 93 jmlr-2009-The Hidden Life of Latent Variables: Bayesian Learning with Mixed Graph Models

20 0.32991707 94 jmlr-2009-The Nonparanormal: Semiparametric Estimation of High Dimensional Undirected Graphs