jmlr jmlr2005 jmlr2005-15 knowledge-graph by maker-knowledge-mining

15 jmlr-2005-Asymptotic Model Selection for Naive Bayesian Networks


Source: pdf

Author: Dmitry Rusakov, Dan Geiger

Abstract: We develop a closed form asymptotic formula to compute the marginal likelihood of data given a naive Bayesian network model with two hidden states and binary features. This formula deviates from the standard BIC score. Our work provides a concrete example that the BIC score is generally incorrect for statistical models that belong to stratified exponential families. This claim stands in contrast to linear and curved exponential families, where the BIC score has been proven to provide a correct asymptotic approximation for the marginal likelihood. Keywords: Bayesian networks, asymptotic model selection, Bayesian information criterion (BIC)

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 We develop a closed form asymptotic formula to compute the marginal likelihood of data given a naive Bayesian network model with two hidden states and binary features. [sent-7, score-0.509]

2 This claim stands in contrast to linear and curved exponential families, where the BIC score has been proven to provide a correct asymptotic approximation for the marginal likelihood. [sent-10, score-0.364]

3 The quantity represented by S(N,YD , M) ≡ ln I[N,YD , M] is called the BIC score of model M. [sent-27, score-0.445]

4 It was shown that S(N, YD, M) = N · ln P(YD|ωML) − (d/2) ln N + R (Eq. 2), where ln P(YD|ωML) is the log-likelihood of YD given the maximum likelihood parameters of the model and d is the model dimension. [sent-31, score-1.109]
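As a concrete illustration of the score in Eq. 2, here is a minimal Python sketch (ours, not from the paper) that evaluates the standard BIC approximation for a naive Bayesian network with one binary hidden variable and n binary features, where d = 2n + 1; the function name and the placeholder log-likelihood value are assumptions.

```python
import numpy as np

def bic_score(total_log_likelihood_ml, d, n_samples):
    """Standard BIC approximation of ln I[N, Y_D]:
    N * ln P(Y_D | omega_ML) - (d / 2) * ln N, up to an O(1) remainder.
    `total_log_likelihood_ml` is assumed to already equal N * ln P(Y_D | omega_ML)."""
    return total_log_likelihood_ml - 0.5 * d * np.log(n_samples)

# Naive Bayesian network: hidden binary class weight t plus per-feature
# parameters a_1..a_n and b_1..b_n, hence d = 2n + 1 parameters.
n_features, N = 4, 10_000
d = 2 * n_features + 1
loglik_ml = -2.31 * N                 # placeholder value, not computed here
print(bic_score(loglik_ml, d, N))
```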

5 Moreover, no uniform score formula exists for such models; our adjusted BIC score changes depending on the different types of singularities of the sufficient statistics, namely, the coefficient of the ln N term (Eq. [sent-61, score-0.622]

6 An additional result presented in Theorem 5 describes the asymptotic marginal likelihood given a degenerate (missing links) naive Bayesian model; it complements the main result presented by Theorem 4. [sent-63, score-0.414]

7 Section 3 reviews naive Bayesian models and explicates the relevant marginal likelihood integrals for these models. [sent-66, score-0.371]

8 The objective of this paper is to derive an asymptotic approximation of marginal likelihood integrals as represented by Eq. [sent-83, score-0.349]

9 The asymptotic form (Eq. 3) can be determined by the poles of J_w0(λ) = ∫_Wε [f(w) − f(w0)]^λ µ(w) dw, evaluated in the neighborhoods Wε of points w0 on which f attains its minimum. [sent-146, score-0.347]

10 The resolution of singularities in algebraic geometry transforms the integral J(λ) into a direct product of integrals of a single variable. [sent-151, score-0.363]

11 Theorems 2 and 3 provide an approach for computing the leading terms in the asymptotic expansion of ln I[N,YD ]: 1. [sent-187, score-0.461]

12 Note that in the process of resolution of singularities Uα may be further divided into subregions Uαβ by neighborhoods of different points p ∈ Uα , as specified by Theorem 3. [sent-192, score-0.348]

13 The largest pole and multiplicity of J(λ) are λ(αβ)∗ = max(αβ) λαβ and the corresponding multiplicity m(αβ)∗ . [sent-205, score-0.451]
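For orientation, a one-dimensional illustration (ours, not a formula from the paper) of how the largest pole λ* and its multiplicity m* translate into the leading asymptotics of the Laplace-type integral:

```latex
\[
J(\lambda) = \int_0^{\varepsilon} (w^2)^{\lambda}\, dw
           = \frac{\varepsilon^{2\lambda+1}}{2\lambda+1}
  \quad\Longrightarrow\quad
  \lambda^{*} = -\tfrac{1}{2},\; m^{*} = 1,
  \qquad
  \int_0^{\varepsilon} e^{-N w^2}\, dw
    \sim \tfrac{1}{2}\sqrt{\pi/N}
    = c\, N^{\lambda^{*}} (\ln N)^{m^{*}-1}.
\]
```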

14 To find the poles of J(λ) we transform the integrand function into a more convenient form by changing to new coordinates via the process of resolution of singularities. [sent-212, score-0.337]

15 The following mapping, named T, relates these two sets of parameters: θ_x = t ∏_{i=1}^n a_i^{x_i} (1 − a_i)^{1−x_i} + (1 − t) ∏_{i=1}^n b_i^{x_i} (1 − b_i)^{1−x_i} (Eq. 10), and the marginal likelihood integral (Eq. [sent-253, score-0.418]
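The mapping T is easy to write down explicitly; the following sketch (our illustration, with hypothetical names) computes θ_x for every binary feature vector x from the parameters ω = (a, b, t):

```python
import itertools
import numpy as np

def theta(x, a, b, t):
    """P(x) under the naive Bayesian model with two hidden states:
    mixture weight t, feature parameters a_i (state 1) and b_i (state 2)."""
    pa = np.prod([ai if xi else 1.0 - ai for xi, ai in zip(x, a)])
    pb = np.prod([bi if xi else 1.0 - bi for xi, bi in zip(x, b)])
    return t * pa + (1.0 - t) * pb

n = 3
a, b, t = [0.9, 0.2, 0.6], [0.1, 0.7, 0.5], 0.3
probs = {x: theta(x, a, b, t) for x in itertools.product((0, 1), repeat=n)}
assert abs(sum(probs.values()) - 1.0) < 1e-12  # the 2^n values sum to one
```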

16 Let I[N, YD] (Eqs. 10 and 11) be the marginal likelihood of data with averaged sufficient statistics YD given the naive Bayesian model with binary variables and two hidden states with parameters ω = (a, b, t). [sent-319, score-0.401]
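For small n, the marginal likelihood integral I[N, YD] can also be checked numerically; below is a crude Monte Carlo sketch under a uniform prior on ω = (a, b, t) (our illustration; the function names, the uniform prior, and the sample counts are assumptions).

```python
import numpy as np

rng = np.random.default_rng(0)

def theta_all(a, b, t):
    """All 2^n pattern probabilities of the two-hidden-state naive Bayesian model."""
    n = len(a)
    xs = ((np.arange(2 ** n)[:, None] >> np.arange(n)) & 1).astype(float)
    pa = np.prod(np.where(xs == 1, a, 1.0 - a), axis=1)
    pb = np.prod(np.where(xs == 1, b, 1.0 - b), axis=1)
    return t * pa + (1.0 - t) * pb

def log_marginal_mc(Y, N, n, samples=5000):
    """Monte Carlo estimate of ln I[N, Y_D] for averaged statistics Y,
    assuming a uniform prior over (a, b, t) in (0,1)^(2n+1)."""
    logliks = np.empty(samples)
    for i in range(samples):
        a, b, t = rng.random(n), rng.random(n), rng.random()
        logliks[i] = N * np.dot(Y, np.log(theta_all(a, b, t)))
    m = logliks.max()
    return m + np.log(np.mean(np.exp(logliks - m)))  # log-mean-exp for stability

Y = theta_all(np.array([0.5, 0.5]), np.array([0.5, 0.5]), 0.5)  # statistics of an independent (product) distribution
print(log_marginal_mc(Y, N=500, n=2))
```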

17 Moreover, for n = 2, S = ϒ0 = ϒ and (d) if Y ∈ S \ S′, ln I[N, YD] = N ln P(Y|ωML) − (3/2) ln N + O(1) (Eq. 17); (e) if Y ∈ S′, ln I[N, YD] = N ln P(Y|ωML) − (3/2) ln N + 2 ln ln N + O(1) (Eq. 18); and for n = 1, (f) ln I[N, YD] = N ln P(Y|ωML) − (1/2) ln N + O(1) (Eq. 19), as N → ∞. [sent-340, score-3.586]
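To make the n = 2 and n = 1 cases above concrete, here is a small sketch (ours) of the leading terms of ln I[N, YD] as read off Eqs. 17-19; the `deep_singularity` flag is our shorthand for the case that carries the extra 2 ln ln N term, and the O(1) terms are dropped.

```python
import math

def adjusted_log_marginal(n, N, log_p_ml, deep_singularity=False):
    """Leading terms of ln I[N, Y_D] for the cases quoted above (Eqs. 17-19).
    `log_p_ml` stands for ln P(Y | omega_ML); O(1) terms are omitted."""
    if n == 1:
        return N * log_p_ml - 0.5 * math.log(N)
    if n == 2:
        value = N * log_p_ml - 1.5 * math.log(N)
        if deep_singularity:                      # the case with + 2 ln ln N
            value += 2.0 * math.log(math.log(N))
        return value
    raise ValueError("only the n = 1 and n = 2 cases are spelled out here")

# Compare with the standard BIC penalty (2n + 1)/2 * ln N = 2.5 * ln N for n = 2:
print(adjusted_log_marginal(n=2, N=10**6, log_p_ml=-1.2, deep_singularity=True))
```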

18 In contrast to the standard BIC score, which is uniform for all points YD , the asymptotic approximation given by our adjusted BIC score depends on the value of Y = YD through the coefficient of ln N. [sent-350, score-0.617]

19 Another attempt is to claim that the coefficient of − ln N is half the number of parameters in the naive Bayesian model minus the number of redundant parameters in the model that generates YD . [sent-356, score-0.605]

20 The next theorem specifies the asymptotic behavior of marginal likelihood integrals for degenerate naive Bayesian models, namely, when some of the links are missing. [sent-361, score-0.585]

21 Namely, I[N, YD] = ∫_{(0,1)^{2n−m+1}} e^{N ∑_x Y_x ln θ_x(ω)} µ(ω) dω (Eq. 20), where θ_x = [t ∏_{i=1}^{n−m} a_i^{x_i}(1 − a_i)^{1−x_i} + (1 − t) ∏_{i=1}^{n−m} b_i^{x_i}(1 − b_i)^{1−x_i}] ∏_{i=n−m+1}^{n} c_i^{x_i}(1 − c_i)^{1−x_i}, where x = (x1, . [sent-374, score-0.567]

22 Furthermore, for m = n − 2, (d) if k = 2 (type 1 singularity), ln I[N, YD] = N ln P(Y|ωML) − ((n+1)/2) ln N + O(1). [sent-393, score-0.978]

23 (e) If k = 0 or k = 1 (type 2 singularity), ln I[N, YD] = N ln P(Y|ωML) − ((n+1)/2) ln N + 2 ln ln N + O(1), and for m = n − 1 or m = n, (f) ln I[N, YD] = N ln P(Y|ωML) − (n/2) ln N + O(1), regardless of k, as N → ∞. [sent-395, score-2.608]

24 An adversary may argue that evaluating the marginal likelihood on singular points is not needed because one could exclude from the model all singular points which only have measure zero. [sent-396, score-0.342]

25 Second, the sets of extremum points of the exponent (maximum log-likelihood points) are found, and then the new integral is computed in the neighborhoods of extremum points. [sent-434, score-0.465]
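As a reminder of the step being used here, the textbook one-dimensional Laplace approximation around a nondegenerate interior minimum w0 of f (our illustration, not a formula from the paper):

```latex
\[
\int e^{-N f(w)}\, dw
  \;\sim\; e^{-N f(w_0)} \sqrt{\frac{2\pi}{N f''(w_0)}},
\qquad\text{so}\qquad
\ln \int e^{-N f(w)}\, dw = -N f(w_0) - \tfrac{1}{2}\ln N + O(1).
\]
```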

26 24 obtaining νi = xi , νi j = xi x j + (1 − s2 )ui u j , νi jk = xi x j xk + (1 − s2 )(xi u j uk + ui x j uk + ui u j xk ) − 2s(1 − s2 )ui u j uk ν12. [sent-493, score-0.842]

27 Then we subtract products of the first n coordinates with one of the new coordinates to remove the second terms, namely, zi jk = νi jk − νi ν j νk − zi j νk − zik ν j − z jk νi , and so forth. [sent-499, score-0.84]

28 We end up with the transformation T32 : Λ → Λ defined by zi = νi , zi j = νi j − νi ν j , zi jk = νi jk − νi ν j νk − zi j νk − zik ν j − z jk νi , etc. [sent-500, score-1.03]

29 Transformation T2: We define T2 : U ⊂ R^{2n+1} → Λ ⊂ R^{2^n−1} via z_i = x_i, z_ij = p_2(s) u_i u_j, . [sent-515, score-0.397]

30 The first lemma states that under Assumptions A1 and A3, the integral I[N,YD ] can be asymptotically evaluated in the (x, u, s) coordinates for a limiting statistics Y , while dismissing the contribution of the density function µ. [sent-536, score-0.391]

31 Then, f_Y = ln P(Y|ωML) and ln I[N, YD] = N f_Y + ln Ĩ[N, Y] + O(1) (Eq. 30) for all N > 1. [sent-541, score-0.652]

32 ) If Y is positive (A2) and Y ∈ ϒ0 , then ˜ ln I[N,Y ] = ln J[N,Y ] + O(1) for all N > 1. [sent-552, score-0.652]

33 Lemmas 6 and 7 jointly state that the asymptotic forms of ln J[N,Y ] and ln I[N,YD ] are identical up to an additive term N fY and a constant provided that Y is the limiting statistics of YD (Assumption A3). [sent-556, score-0.814]

34 The zero set U0 can be written as the union of n + 2 sets: U0 = Ū0− ∪ Ū0+ ∪ ⋃_{j=1}^{n} Ū0j (Eq. 34), where U0− = {(x = γ, u, s = −1) | u_i ∈ [−γ_i/2, (1 − γ_i)/2]} and U0+ = {(x = γ, u, s = 1) | u_i ∈ [(γ_i − 1)/2, γ_i/2]}, i = 1, . [sent-585, score-0.398]

35 Consequently, one could perhaps guess from the classical Laplace approximation analysis that because the zero subsets U0− , U0+ have dimension n, the coefficient of the ln N term would be at least −(2n + 1 − n)/2 = −(n + 1)/2. [sent-603, score-0.357]

36 27), we obtain φ(x, u, s) = ∑I [zI (x + x, u + u, s + s) − zI ]2 = ∑i [zi − zi ]2 + ∑i j,i= j [zi j − zi j ]2 + ∑i jk,i= j=k [zi jk − zi jk ]2 + . [sent-626, score-0.699]

37 For example, third terms are of form (zi jk − zi jk )2 = 4(s + s)2 (1 − (s + s)2 )2 u2 u2 u2 ≤ 5ε2 u2 u2 for all s, ui , u j , uk < ε for ε small enough. [sent-632, score-0.624]

38 Consequently, due to Lemmas 6 and 7, ln I[N, YD] = N ln P(Y|ωML) − ((n+1)/2) ln N + O(1), as claimed by Theorem 4c. [sent-650, score-0.652]

39 Discussion This paper presents an asymptotic approximation of the marginal likelihood of data given a naive Bayesian model with binary variables (Theorem 4). [sent-652, score-0.448]

40 This theorem proves that the classical BIC score, which penalizes the log-likelihood of a model by (d/2) ln N, is incorrect for Bayesian networks with hidden variables, and it suggests an adjusted BIC score. [sent-653, score-0.572]

41 Nevertheless, this Theorem is an essential advance towards developing asymptotic Bayesian methods for model selection among naive Bayesian models in particular, and for Bayesian networks with hidden variables in general. [sent-661, score-0.387]

42 Develop a closed form asymptotic formula for marginal likelihood integrals for all types of statistics Y given an arbitrary latent Bayesian model. [sent-664, score-0.376]

43 Develop an algorithm that, given a Bayesian network with hidden variables and a data set with statistics YD , determines the possible singularity types of the limit statistics Y and applies the appropriate asymptotic formula developed in step 2. [sent-669, score-0.378]

44 Comparing Eqs. 15 and 16, one can see that a naive Bayesian network with all links present is somewhat under-evaluated using the classical BIC score for singular statistics Y, because the penalty term reduces from (2n + 1)/2 in the classical score to (2n − 1)/2 (or (n + 1)/2) in the adjusted score. [sent-673, score-0.55]

45 This results in complex surfaces of maximum likelihood points in the model parameter space and, consequently, a non-standard behavior of marginal likelihood integrals which we have started to explore in this paper. [sent-685, score-0.347]

46 Lemma 8 Let f(x, u, s) = f_Y − ∑_{i=1}^{2^n} Y_i ln θ_i[x, u, s] (Eq. 38), θ[x, u, s] = (T3 ∘ T2)[x, u, s], θ_{2^n}[x, u, s] = 1 − ∑_{i=1}^{2^n−1} θ_i[x, u, s], where f_Y = max_{(x,u,s)∈U} ∑_{i=1}^{2^n} Y_i ln θ_i[x, u, s] and Y = (Y1, . [sent-698, score-0.652]

47 Then, ˜ ln I p0 [N,Y ] = ln J p0 [N] + O(1) for all N > 1. [sent-710, score-0.652]

48 Eq. 38, namely, f(x, u, s) = f_Y − ∑_{i=1}^{2^n} Y_i ln θ_i[x, u, s], θ[x, u, s] = (T3 ∘ T2)[x, u, s], θ_{2^n}[x, u, s] = 1 − ∑_{i=1}^{2^n−1} θ_i[x, u, s], where f_Y = max_{(x,u,s)∈U} ∑_{i=1}^{2^n} Y_i ln θ_i[x, u, s] and Y = (Y1, . [sent-722, score-0.652]

49 The neighborhoods that do not contain minimum points can be discarded since their contribution to the integral is exponentially small, i. [sent-775, score-0.396]

50 , 2n and Y ∈ ϒ0 \S, then asymptotic approximation of ln I[N,YD ] (Eq. [sent-792, score-0.461]

51 13) equals N ln P(Y|wML) − ((2n+1)/2) ln N + O(1) (Eq. [sent-793, score-0.652]

52 That is because the integrand function e^{N ∑_x Y_x ln θ_x(w)} = ∏_x θ_x(w)^{N Y_x} satisfies 0 ≤ θ_x(w)^{N Y_x} ≤ 1 for all N, YD, and w = (a, b, t) ∈ Ω, and because µ(a, b, t) is a probability density function on Ω; thus the integral I[N, YD] is finite (and less than 1). [sent-797, score-0.527]

53 Since the value of e^{−N f(x,u,s)} outside the small neighborhoods of the minima of f is exponentially small, the asymptotic behavior of Ĩ[N, Y] on U is actually described by integration of Ĩ[N, Y] in the small neighborhoods of the minima of f (Lemma 7). [sent-800, score-0.58]

54 Since I[N,Y ] converges and Claims B1, B3 and B4 of Lemma 9 hold, it follows ˜ that in sufficiently small neighborhoods of the two internal minimum points of f , the integral I[N,Y ] can be computed by Lemma 1 (Laplace Approximation). [sent-801, score-0.383]

55 Consequently, integrating Ĩ[N, Y] in the full neighborhoods of the maximum likelihood points (x′, u′, s′) ∈ U0, which lie on the border of U, introduces only a constant multiplicative error to the approximation. [sent-802, score-0.367]

56 13) is asymptotically equal to N ln P(Y|wML) − ((2n−1)/2) ln N + O(1) (Eq. [sent-817, score-0.363]

57 , n, i = l, k, ul , uk , s, such that (1 − s2 )ul uk = zlk       . [sent-835, score-0.488]

58 Note that z′_lk ≠ 0 and u′_l, u′_k ≠ 0, s′ ≠ ±1 for (x′, u′, s′) ∈ U0, because Y ∈ S. [sent-836, score-0.396]

59 The coordinates of z′ = T2(x′, u′, s′) are z′_i = x′_i for all i, z′_lk = (1 − s′²) u′_l u′_k, and all other z′_I's are zero. [sent-840, score-0.485]

60 origin, yields φ(x, u, s) = ∑I [zI (x + x, u + u, s + s) − zI ]2 = ∑i [zi (x + x, u + u, s + s) − zi ]2 2 + zlk (x + x, u + u, s + s) − zlk + ∑i=l,k zil (x + x, u + u, s + s) − zil 2 + ∑i, j=l,k zi j (x + x, u + u, s + s) − zi j + zik (x + x, u + u, s + s) − zik 2 2 +. [sent-866, score-0.877]

61 = ∑n [(xi + xi ) − xi ]2 i=1 2 + (1 − (s + s)2 )(ul + ul )(uk + uk ) − (1 − s 2 )ul uk 2 + ∑i=l,k (1 − (s + s)2 )(ul + ul )ui − 0 + (1 − (s + s)2 )(uk + uk )ui − 0 2 + ∑i, j=l,k (1 − (s + s)2 )ui u j − 0 + . [sent-869, score-0.764]

62 (44) 2 = ∑n xi2 i=1 2 + −2s ul uk s + (1 − s 2 )uk ul + (1 − s 2 )ul uk + “smaller terms 2 2 + ∑i=l,k (1 − s 2 )ul ui + “smaller terms + (1 − s 2 )uk ui + . [sent-872, score-1.008]

63 In particular, the term z_lk(x, u, s) − z′_lk is rewritten via z_lk(x′ + x, u′ + u, s′ + s) − z′_lk = (1 − (s′ + s)²)(u′_l + u_l)(u′_k + u_k) − (1 − s′²)u′_l u′_k = −(2s′ + s) u′_l u′_k s + ((1 − s′²) − 2s′s − s²)(u′_k + u_k) u_l + ((1 − s′²) − 2s′s − s²) u′_l u_k. [sent-880, score-1.037]

64 (Eq. 31) For p0 = (x′, u′, s′) with s′ ≠ 0, it remains to approximate the integral J̃1[N] = ∫ e^{−N φ̃1(x,u,s)} dx du ds (Eq. 45), where φ̃1(x, u, s) = ∑_i x_i² + (C̃1 s + C̃2 u_l + C̃3 u_k)² + ∑_{i≠l,k} c̃_i u_i², and where C̃1, C̃2, C̃3 and the c̃_i are non-zero constants. [sent-883, score-0.515]

65 From Eq. 45, by changing the coordinates to v = C̃1 s + C̃2 u_l + C̃3 u_k, we obtain that in the neighborhoods of the points in U0 with s′ ≠ 0, f can be described by a quadratic form in 2n − 1 variables, so their contribution to J[N, Y] is cN^{−(2n−1)/2}. [sent-888, score-0.655]

66 Integrating out the x_i and u_i variables yields an N^{−(2n−2)/2} multiplicative factor to the asymptotic approximation of J̃2[N], leaving us to compute the contribution of ∫ e^{−N[Ĉ1 s² + Ĉ2 u_i + Ĉ3 u_j]²} ds du_i du_j. [sent-890, score-0.555]

67 Thus, the first integral contributes a pole at λ = −1/4 with multiplicity 1. [sent-903, score-0.441]

68 Hence, the largest pole of J(λ) is λ = −1/2, with multiplicity m = 1, and the overall contribution of the neighborhoods of points p0 with s′ = 0 to J[N, Y] is again cN^{−(2n−1)/2}, the same as for points p0 for which s′ ≠ 0. [sent-909, score-0.605]

69 We have φ(x, u, s) = ∑I [zI (x + x, u + u, s + s) − zI ]2 = ∑i [zi (x + x, u + u, s + s) − zi ]2 + ∑i, j [zi j (x + x, u + u, s + s) − zi j ]2 + ∑i, j,k [zi jk (x + x, u + u, s + s) − zi jk ]2 + . [sent-947, score-0.699]

70 = ∑i [(xi + xi ) − xi ]2 + ∑i, j (1 − (s + s)2 )(ui + ui )(u j + u j ) − 0 2 + ∑i, j,k −2(s + s)(1 − (s + s)2 )(ui + ui )(u j + u j )(uk + uk ) − 0 2 = ∑i xi + ∑i, j −2s ui u j s + “smaller terms + ∑i, j,k 4ui u j uk s + “smaller terms 2 2 +. [sent-950, score-0.874]

71 i=1 The standard change of variables to ui = u1 ui for i = 2, . [sent-964, score-0.398]

72 Thus the largest pole of J(λ) (for n > 2) is λ = −1/2 with multiplicity m = 1, and the contribution of the neighborhood of this (x′, u′, s′) is cN^{−(n+1)/2}. [sent-968, score-0.424]

73 We have φ(x, u, s) = ∑I [zI (x + x, u + u, s + s) − zI ]2 = ∑i [zi (x + x, u + u, s + s) − zi ]2 + ∑i, j [zi j (x + x, u + u, s + s) − zi j ]2 + ∑i, j,k [zi jk (x + x, u + u, s + s) − zi jk ]2 + . [sent-975, score-0.699]

74 2 = ∑i [(xi + xi ) − xi ]2 + ∑i, j (1 − (s + s)2 )ui u j − 0 2 + ∑i, j,k −2(s + s)(1 − (s + s)2 )ui u j uk − 0 + “higher order terms (49) 2 2 = ∑i xi + ∑i, j 2sui u j − s2 ui u j + ∑i, j,k [4sui u j uk + “smaller terms ]2 + “higher order terms 2 ≈ ∑i xi + s2 ∑i, j u2 u2 . [sent-978, score-0.507]

75 The multiplicity of the pole λ = −1/2 is one, and so the contribution of the neighborhoods of (x′, 0, ±1) is cN^{−(n+1)/2}. [sent-1004, score-0.533]

76 The interesting fact about the last two cases is that in the neighborhood of U0− and U0+ the growth of the function φ is dominated by s2 and thus the multiplicity of the maximal pole of J(λ) is always one and the ln ln N terms do not appear in the approximation of ln J p0 [N]. [sent-1006, score-1.351]

77 Thus, ln J[N, Y] = −((n+1)/2) ln N + O(1) and, due to Lemmas 6 and 7, ln I[N, YD] = N ln P(Y|wML) − ((n+1)/2) ln N + O(1), as claimed. [sent-1010, score-1.304]

78 13) is asymptotically equal to N ln P(Y|wML) − (3/2) ln N + O(1) (Eq. [sent-1016, score-0.363]

79 17) for Y ∈ S and asymptotically equal to N ln P(Y|wML) − (3/2) ln N + 2 ln ln N + O(1) (Eq. [sent-1017, score-1.015]

80 Generally, the intersection points of zero surfaces of the same dimension are expected to give rise to a ln ln N term. [sent-1040, score-0.688]

81 , see example in Section 2, the ln ln N term does appear now. [sent-1043, score-0.652]

82 Integrating out the x_i² terms, we obtain through the analysis of the poles of J(λ) = ∫ u_1^{2λ} u_2^{2λ} du_1 du_2 that the largest pole of J(λ) is λ = −1/2 with multiplicity m = 2. [sent-1046, score-0.443]

83 Thus the contribution of this region to J[N,Y ] is cN −3/2 ln N. [sent-1047, score-0.377]
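The extra ln N factor can be traced to the double pole; a short check (our derivation, under the same change of variables) of how a pole of multiplicity two produces one logarithm, which together with the N^{-1} factor from the two x_i² directions gives the cN^{-3/2} ln N contribution cited above:

```latex
\[
J(\lambda) = \int_0^{\varepsilon}\!\!\int_0^{\varepsilon} u_1^{2\lambda}\, u_2^{2\lambda}\, du_1\, du_2
           = \frac{\varepsilon^{2(2\lambda+1)}}{(2\lambda+1)^{2}},
\qquad \lambda^{*} = -\tfrac{1}{2},\; m^{*} = 2,
\]
\[
\int_0^{\varepsilon}\!\!\int_0^{\varepsilon} e^{-N u_1^{2} u_2^{2}}\, du_1\, du_2
      \;\sim\; c\, N^{-1/2}\ln N
  \quad\Longrightarrow\quad
  J[N,Y] \sim c\, N^{-1}\cdot N^{-1/2}\ln N = c\, N^{-3/2}\ln N .
\]
```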

84 Similarly to the case C2, the contribution of this region to J[N, Y] is cN^{−3/2} ln N. [sent-1050, score-0.377]

85 The largest pole is λ = −1/2 with multiplicity m = 3, and thus the contribution of this region to J[N, Y], including the factors from integrating out the x_i's, is cN^{−3/2} ln² N. [sent-1054, score-0.389]

86 Summarizing the contributions of the neighborhoods of the various critical points for Y ∈ S′, we see that J[N, Y] ∼ cN^{−3/2} ln² N and, consequently, ln I[N, Y] = N f_Y − (3/2) ln N + 2 ln ln N + O(1). [sent-1055, score-1.515]

87 13) is asymptotically equal to N ln P(Y|wML) − (1/2) ln N + O(1) (Eq. [sent-1058, score-0.363]

88 Moreover, according to Theorem 2, the asymptotic form of the integral J_p0[N] = ∫_Uε e^{−N(z_1(x,u,s) − z′_1)²} dx du ds is determined by the poles of J(λ) = ∫_Uε (z_1(x,u,s) − z′_1)^{2λ} dx du ds, where, in this case, z_1(x,u,s) − z′_1 = x_1. [sent-1066, score-0.501]

89 Thus, the largest pole of J(λ) is λ = −1/2 with multiplicity m = 1, and ln I[N, YD] is asymptotically equal to N f_Y − (1/2) ln N + O(1). [sent-1068, score-0.996]

90 Ignoring the density µ(a, b, t) by using the assumption of bounded density (A1) and changing the variables to x = at + b(1 − t), we rewrite I[N, Y] in the asymptotically equivalent form Ĩ[N, Y] = ∫_0^1 ∫_0^1 (1/(b − a)) ∫_a^b e^{N(Y_0 ln x + Y_1 ln(1−x))} dx da db. [sent-1072, score-0.397]

91 Consider now I1[N, Y] = ∫_a^b e^{N(Y_0 ln x + Y_1 ln(1−x))} dx for some 0 ≤ a < b ≤ 1 (the case a > b is symmetric). [sent-1073, score-0.36]

92 In this case f(Y_0 + x) = f(Y_0) + Y_0 ln(1 + x/Y_0) + (1 − Y_0) ln(1 − x/(1 − Y_0)) = f(Y_0) − x²/(2Y_0(1 − Y_0)) + O(x³). [sent-1080, score-0.652]

93 Since the latter region has a non-zero Lebesgue measure, it follows that I[N, Y] ∼ c e^{N f_Y} N^{−1/2} and ln I[N, Y] = N f_Y − (1/2) ln N + O(1). [sent-1099, score-0.652]

94 Proof of Theorem 5 Theorem 5 states the asymptotic approximation for the marginal likelihood given a degenerate binary naive Bayesian model M that has m missing links. [sent-1101, score-0.448]

95 We have ∑x Yx ln θx (ω) 1 N ψ(a, b,t, c) = = ∑x Yx ln θ(x1 ,. [sent-1107, score-0.652]

96 ,xn−m ) (a, b,t) + ∑n i=n−m+1 (xi ln ci + (1 − xi ) ln(1 − ci )) = ∑(x1 ,. [sent-1110, score-0.433]

97 ,xn ) Yx + ∑n i=n−m+1 (∑x Yx xi ln ci + ∑x Yx (1 − xi ) ln(1 − ci )) = ∑(x1 ,. [sent-1119, score-0.464]

98 ,xn−m ) (a, b,t) + ∑n i=n−m+1 (Yi ln ci + (1 −Yi ) ln(1 − ci )) where (x1 , . [sent-1128, score-0.402]

99 (Eq. 20) can be rewritten as I[N, YD] ∼ Î[N, Y] = ∏_{i=n−m+1}^{n} ∫_0^1 c_i^{N Y_i} (1 − c_i)^{N(1−Y_i)} dc_i · ∫_{(0,1)^{2n−2m+1}} e^{N ∑_x̃ Y_x̃ ln θ_x̃(ω)} dω. [sent-1152, score-0.364]

100 Hence, the contribution of the first m integrals to ln Î[N, Y] is N ln p(Y_{n−m+1}, . [sent-1159, score-0.795]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ln', 0.326), ('yd', 0.303), ('bic', 0.274), ('ul', 0.213), ('rusakov', 0.199), ('ui', 0.199), ('zi', 0.183), ('neighborhoods', 0.175), ('pole', 0.163), ('naive', 0.157), ('eiger', 0.154), ('symptotic', 0.154), ('bayesian', 0.148), ('multiplicity', 0.144), ('poles', 0.136), ('asymptotic', 0.135), ('integral', 0.134), ('fy', 0.13), ('election', 0.129), ('singularity', 0.128), ('odel', 0.104), ('geiger', 0.098), ('ml', 0.097), ('etworks', 0.096), ('uk', 0.092), ('integrals', 0.092), ('watanabe', 0.091), ('singularities', 0.091), ('zlk', 0.091), ('coordinates', 0.088), ('yx', 0.088), ('score', 0.085), ('jk', 0.075), ('zik', 0.073), ('integrand', 0.067), ('neighborhood', 0.066), ('wml', 0.063), ('likelihood', 0.063), ('cn', 0.063), ('hidden', 0.061), ('du', 0.061), ('transformations', 0.059), ('marginal', 0.059), ('singular', 0.057), ('en', 0.056), ('border', 0.055), ('lemma', 0.054), ('claim', 0.054), ('diffeomorphism', 0.053), ('laplace', 0.051), ('contribution', 0.051), ('bi', 0.05), ('resolution', 0.046), ('dsdt', 0.045), ('isosurfaces', 0.045), ('claims', 0.043), ('jacobian', 0.043), ('links', 0.042), ('consequently', 0.041), ('integration', 0.041), ('extremum', 0.04), ('exponent', 0.04), ('dan', 0.04), ('ds', 0.04), ('ci', 0.038), ('internal', 0.038), ('multiplicative', 0.038), ('planes', 0.038), ('theorem', 0.037), ('asymptotically', 0.037), ('haughton', 0.036), ('settimi', 0.036), ('ud', 0.036), ('lemmas', 0.036), ('points', 0.036), ('adjusted', 0.035), ('yi', 0.035), ('model', 0.034), ('nyi', 0.034), ('dx', 0.034), ('xi', 0.031), ('curved', 0.031), ('gn', 0.031), ('classical', 0.031), ('analytic', 0.031), ('ai', 0.03), ('namely', 0.029), ('heckerman', 0.028), ('np', 0.028), ('mt', 0.028), ('relates', 0.028), ('statistics', 0.027), ('coef', 0.027), ('axi', 0.027), ('bxi', 0.027), ('dudv', 0.027), ('dxduds', 0.027), ('explicated', 0.027), ('minimums', 0.027), ('technion', 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999923 15 jmlr-2005-Asymptotic Model Selection for Naive Bayesian Networks

Author: Dmitry Rusakov, Dan Geiger

Abstract: We develop a closed form asymptotic formula to compute the marginal likelihood of data given a naive Bayesian network model with two hidden states and binary features. This formula deviates from the standard BIC score. Our work provides a concrete example that the BIC score is generally incorrect for statistical models that belong to stratified exponential families. This claim stands in contrast to linear and curved exponential families, where the BIC score has been proven to provide a correct asymptotic approximation for the marginal likelihood. Keywords: Bayesian networks, asymptotic model selection, Bayesian information criterion (BIC)

2 0.15226749 37 jmlr-2005-Generalization Bounds and Complexities Based on Sparsity and Clustering for Convex Combinations of Functions from Random Classes

Author: Savina Andonova Jaeger

Abstract: A unified approach is taken for deriving new generalization data dependent bounds for several classes of algorithms explored in the existing literature by different approaches. This unified approach is based on an extension of Vapnik’s inequality for VC classes of sets to random classes of sets - that is, classes depending on the random data, invariant under permutation of the data and possessing the increasing property. Generalization bounds are derived for convex combinations of functions from random classes with certain properties. Algorithms, such as SVMs (support vector machines), boosting with decision stumps, radial basis function networks, some hierarchies of kernel machines or convex combinations of indicator functions over sets with finite VC dimension, generate classifier functions that fall into the above category. We also explore the individual complexities of the classifiers, such as sparsity of weights and weighted variance over clusters from the convex combination introduced by Koltchinskii and Panchenko (2004), and show sparsity-type and cluster-variance-type generalization bounds for random classes. Keywords: complexities of classifiers, generalization bounds, SVM, voting classifiers, random classes

3 0.14197236 14 jmlr-2005-Assessing Approximate Inference for Binary Gaussian Process Classification

Author: Malte Kuss, Carl Edward Rasmussen

Abstract: Gaussian process priors can be used to define flexible, probabilistic classification models. Unfortunately exact Bayesian inference is analytically intractable and various approximation techniques have been proposed. In this work we review and compare Laplace’s method and Expectation Propagation for approximate Bayesian inference in the binary Gaussian process classification model. We present a comprehensive comparison of the approximations, their predictive performance and marginal likelihood estimates to results obtained by MCMC sampling. We explain theoretically and corroborate empirically the advantages of Expectation Propagation compared to Laplace’s method. Keywords: Gaussian process priors, probabilistic classification, Laplace’s approximation, expectation propagation, marginal likelihood, evidence, MCMC

4 0.11395214 50 jmlr-2005-Learning with Decision Lists of Data-Dependent Features

Author: Mario Marchand, Marina Sokolova

Abstract: We present a learning algorithm for decision lists which allows features that are constructed from the data and allows a trade-off between accuracy and complexity. We provide bounds on the generalization error of this learning algorithm in terms of the number of errors and the size of the classifier it finds on the training data. We also compare its performance on some natural data sets with the set covering machine and the support vector machine. Furthermore, we show that the proposed bounds on the generalization error provide effective guides for model selection. Keywords: decision list machines, set covering machines, sparsity, data-dependent features, sample compression, model selection, learning theory

5 0.10987709 36 jmlr-2005-Gaussian Processes for Ordinal Regression

Author: Wei Chu, Zoubin Ghahramani

Abstract: We present a probabilistic kernel approach to ordinal regression based on Gaussian processes. A threshold model that generalizes the probit function is used as the likelihood function for ordinal variables. Two inference techniques, based on the Laplace approximation and the expectation propagation algorithm respectively, are derived for hyperparameter learning and model selection. We compare these two Gaussian process approaches with a previous ordinal regression method based on support vector machines on some benchmark and real-world data sets, including applications of ordinal regression to collaborative filtering and gene expression analysis. Experimental results on these data sets verify the usefulness of our approach. Keywords: Gaussian processes, ordinal regression, approximate Bayesian inference, collaborative filtering, gene expression analysis, feature selection

6 0.10242669 71 jmlr-2005-Variational Message Passing

7 0.098380722 22 jmlr-2005-Concentration Bounds for Unigram Language Models

8 0.082204141 40 jmlr-2005-Inner Product Spaces for Bayesian Networks

9 0.08126609 43 jmlr-2005-Learning Hidden Variable Networks: The Information Bottleneck Approach

10 0.078876197 32 jmlr-2005-Expectation Consistent Approximate Inference

11 0.071597613 31 jmlr-2005-Estimation of Non-Normalized Statistical Models by Score Matching

12 0.068649992 16 jmlr-2005-Asymptotics in Empirical Risk Minimization

13 0.058177367 34 jmlr-2005-Feature Selection for Unsupervised and Supervised Inference: The Emergence of Sparsity in a Weight-Based Approach

14 0.052653268 19 jmlr-2005-Clustering on the Unit Hypersphere using von Mises-Fisher Distributions

15 0.052627269 2 jmlr-2005-A Bayesian Model for Supervised Clustering with the Dirichlet Process Prior

16 0.052258763 62 jmlr-2005-Probabilistic Non-linear Principal Component Analysis with Gaussian Process Latent Variable Models

17 0.051365897 67 jmlr-2005-Stability of Randomized Learning Algorithms

18 0.046213564 69 jmlr-2005-Tutorial on Practical Prediction Theory for Classification

19 0.043055609 26 jmlr-2005-Diffusion Kernels on Statistical Manifolds

20 0.041864626 64 jmlr-2005-Semigroup Kernels on Measures


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.311), (1, -0.302), (2, -0.198), (3, 0.169), (4, 0.013), (5, -0.083), (6, -0.025), (7, 0.013), (8, 0.052), (9, -0.154), (10, -0.029), (11, -0.036), (12, 0.007), (13, 0.012), (14, 0.15), (15, 0.181), (16, -0.032), (17, -0.058), (18, -0.057), (19, 0.04), (20, 0.1), (21, -0.104), (22, -0.095), (23, -0.138), (24, -0.002), (25, -0.051), (26, -0.038), (27, 0.104), (28, -0.031), (29, -0.019), (30, 0.058), (31, -0.09), (32, 0.023), (33, 0.013), (34, 0.014), (35, -0.04), (36, 0.004), (37, -0.008), (38, 0.057), (39, 0.05), (40, 0.017), (41, -0.061), (42, -0.035), (43, 0.159), (44, -0.05), (45, -0.082), (46, -0.013), (47, 0.028), (48, -0.121), (49, 0.113)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96925199 15 jmlr-2005-Asymptotic Model Selection for Naive Bayesian Networks

Author: Dmitry Rusakov, Dan Geiger

Abstract: We develop a closed form asymptotic formula to compute the marginal likelihood of data given a naive Bayesian network model with two hidden states and binary features. This formula deviates from the standard BIC score. Our work provides a concrete example that the BIC score is generally incorrect for statistical models that belong to stratified exponential families. This claim stands in contrast to linear and curved exponential families, where the BIC score has been proven to provide a correct asymptotic approximation for the marginal likelihood. Keywords: Bayesian networks, asymptotic model selection, Bayesian information criterion (BIC)

2 0.69653547 37 jmlr-2005-Generalization Bounds and Complexities Based on Sparsity and Clustering for Convex Combinations of Functions from Random Classes

Author: Savina Andonova Jaeger

Abstract: A unified approach is taken for deriving new generalization data dependent bounds for several classes of algorithms explored in the existing literature by different approaches. This unified approach is based on an extension of Vapnik’s inequality for VC classes of sets to random classes of sets - that is, classes depending on the random data, invariant under permutation of the data and possessing the increasing property. Generalization bounds are derived for convex combinations of functions from random classes with certain properties. Algorithms, such as SVMs (support vector machines), boosting with decision stumps, radial basis function networks, some hierarchies of kernel machines or convex combinations of indicator functions over sets with finite VC dimension, generate classifier functions that fall into the above category. We also explore the individual complexities of the classifiers, such as sparsity of weights and weighted variance over clusters from the convex combination introduced by Koltchinskii and Panchenko (2004), and show sparsity-type and cluster-variance-type generalization bounds for random classes. Keywords: complexities of classifiers, generalization bounds, SVM, voting classifiers, random classes

3 0.49611381 14 jmlr-2005-Assessing Approximate Inference for Binary Gaussian Process Classification

Author: Malte Kuss, Carl Edward Rasmussen

Abstract: Gaussian process priors can be used to define flexible, probabilistic classification models. Unfortunately exact Bayesian inference is analytically intractable and various approximation techniques have been proposed. In this work we review and compare Laplace’s method and Expectation Propagation for approximate Bayesian inference in the binary Gaussian process classification model. We present a comprehensive comparison of the approximations, their predictive performance and marginal likelihood estimates to results obtained by MCMC sampling. We explain theoretically and corroborate empirically the advantages of Expectation Propagation compared to Laplace’s method. Keywords: Gaussian process priors, probabilistic classification, Laplace’s approximation, expectation propagation, marginal likelihood, evidence, MCMC

4 0.48871198 22 jmlr-2005-Concentration Bounds for Unigram Language Models

Author: Evgeny Drukh, Yishay Mansour

Abstract: We show several high-probability concentration bounds for learning unigram language models. One interesting quantity is the probability of all words appearing exactly k times in a sample of size m. A standard estimator for this quantity is the Good-Turing estimator. The existing analysis of its error shows a high-probability bound of approximately O(k/√m). We improve its dependency on k to O(k^{1/4}/√m + √k/m). We also analyze the empirical frequencies estimator, showing that with high probability its error is bounded by approximately O(1/k + √k/m). We derive a combined estimator, which has an error of approximately O(m^{−2/5}), for any k. A standard measure for the quality of a learning algorithm is its expected per-word log-loss. The leave-one-out method can be used for estimating the log-loss of the unigram model. We show that its error has a high-probability bound of approximately O(1/√m), for any underlying distribution. We also bound the log-loss a priori, as a function of various parameters of the distribution. Keywords: Good-Turing estimators, logarithmic loss, leave-one-out estimation, Chernoff bounds
As it has been observed empirically on many occasions (Chen, 1996; Curran and Osborne, 2002), there are always many rare words and a considerable number of unseen words, regardless of the sample size. Given this observation, a fundamental issue is to estimate the distribution the best way possible. 1.1 Good-Turing Estimators An important quantity, given a sample, is the probability mass of unseen words (also called “the missing mass”). Several methods exist for smoothing the probability and assigning probability mass to unseen items. The almost standard method for estimating the missing probability mass is the Good-Turing estimator. It estimates the missing mass as the total number of unique items, divided by the sample size. In the names example above, the Good-Turing missing mass estimator is equal 0.6, meaning that the list of the class names does not reflect the true distribution, to put it mildly. The Good-Turing estimator can be extended for higher orders, that is, estimating the probability of all names appearing exactly k times. Such estimators can also be used for estimating the probability of individual words. The Good-Turing estimators dates back to World War II, and were published first in 1953 (Good, 1953, 2000). It has been extensively used in language modeling applications since then (Katz, 1987; Church and Gale, 1991; Chen, 1996; Chen and Goodman, 1998). However, their theoretical convergence rate in various models has been studied only in the recent years (McAllester and Schapire, 2000, 2001; Kutin, 2002; McAllester and Ortiz, 2003; Orlitsky et al., 2003). For estimation of the probability of all words appearing exactly k times in a sample of size m, McAllester and Schapire k (2000) derive a high probability bound on Good-Turing estimator error of approximately O √m . One of our main results improves the dependency on k of this bound to approximately O √ 4 √k + k m m √ k 1 k+ m , We also show that the empirical frequencies estimator has an error of approximately O for large values of k. Based on the two estimators, we derive a combined estimator with an error of √ 4 2 k approximately O m− 5 , for any k. We also derive a weak lower bound of Ω √m for an error of any estimator based on an independent sample. Our results give theoretical justification for using the Good-Turing estimator for small values of k, and the empirical frequencies estimator for large values of k. Though in most applications the Good-Turing estimator is used for very small values of k, for example k ≤ 5, as by Katz (1987) or Chen (1996), we show that it is fairly accurate in a much wider range. 1232 . C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS 1.2 Logarithmic Loss The Good-Turing estimators are used to approximate the probability mass of all the words with a certain frequency. For many applications, estimating this probability mass is not the main optimization criteria. Instead, a certain distance measure between the true and the estimated distributions needs to be minimized. The most popular distance measure used in NLP applications is the Kullback-Leibler (KL) divergence. For a true distribution P = {px }, and an estimated distribution Q = {qx }, both over some px set X, this measure is defined as ∑x px ln qx . An equivalent measure, up to the entropy of P, is the 1 logarithmic loss (log-loss), which equals ∑x px ln qx . Many NLP applications use the value of log-loss to evaluate the quality of the estimated distribution. 
However, the log-loss cannot be directly calculated, since it depends on the underlying distribution, which is unknown. Therefore, estimating log-loss using the sample is important, although the sample cannot be independently used for both estimating the distribution and testing it. The hold-out estimation splits the sample into two parts: training and testing. The training part is used for learning the distribution, whereas the testing sample is used for evaluating the average per-word log-loss. The main disadvantage of this method is the fact that it uses only part of the available information for learning, whereas in practice one would like to use all the sample. A widely used general estimation method is called leave-one-out. Basically, it performs averaging all the possible estimations, where a single item is chosen for testing, and the rest are used for training. This procedure has an advantage of using the entire sample, and in addition it is rather simple and usually can be easily implemented. The existing theoretical analysis of the leave-oneout method (Holden, 1996; Kearns and Ron, 1999) shows general high probability concentration bounds for the generalization error. However, these techniques are not applicable in our setting. 1 We show that the leave-one-out estimation error for the log-loss is approximately O √m , for any underlying distribution and a general family of learning algorithms. It gives a theoretical justification for effective use of leave-one-out estimation for the log-loss. We also analyze the concentration of the log-loss itself, not based of an empirical measure. We address the characteristics of the underlying distribution affecting the log-loss. We find such a characteristic, defining a tight bound for the log-loss value. 1.3 Model and Semantics We denote the set of all words as V , and N = |V |. Let P be a distribution over V , where pw is the probability of a word w ∈ V . Given a sample S of size m, drawn i.i.d. using P, we denote the number of appearances of a word w in S as cS , or simply cw , when a sample S is clear from the context.1 We w define Sk = {w ∈ V : cS = k}, and nk = |Sk |. w For a claim Φ regarding a sample S, we write ∀δ S Φ[S] for P(Φ[S]) ≥ 1 − δ. For some error c ˜ bound function f (·), which holds with probability 1 − δ, we write O( f (·)) for O f (·) ln m , δ where c > 0 is some constant. 1.4 Paper Organization Section 2 shows several standard concentration inequalities, together with their technical applications regarding the maximum-likelihood approximation. Section 3 shows the error bounds for the 1. Unless mentioned otherwise, all further sample-dependent definitions depend on the sample S. 1233 D RUKH AND M ANSOUR k-hitting mass estimation. Section 4 bounds the error for the leave-one-out estimation of the logarithmic loss. Section 5 shows the bounds for the a priori logarithmic loss. Appendix A includes the technical proofs. 2. Concentration Inequalities In this section we state several standard Chernoff-style concentration inequalities. We also show some of their corollaries regarding the maximum-likelihood approximation of pw by pw = cw . ˆ m Lemma 1 (Hoeffding, 1963) Let Y = Y1 , . . . ,Yn be a set of n independent random variables, such that Yi ∈ [bi , bi + di ]. Then, for any ε > 0, P ∑ Yi − E ∑ Yi i >ε i ≤ 2 exp − 2ε2 ∑i di2 . The next lemma is a variant of an extension of Hoeffding’s inequality, by McDiarmid (1989). Lemma 2 Let Y = Y1 , . . . 
,Yn be a set of n independent random variables, and f (Y ) such that any change of Yi value changes f (Y ) by at most di , that is sup ∀ j=i,Y j =Y j (| f (Y ) − f (Y )|) ≤ di . Let d = maxi di . Then, ∀δY : | f (Y ) − E[ f (Y )]| ≤ d n ln 2 δ . 2 Lemma 3 (Angluin and Valiant, 1979) Let Y = Y1 , . . . ,Yn be a set of n independent random variables, where Yi ∈ [0, B]. Let µ = E [∑i Yi ]. Then, for any ε > 0, P ∑ Yi < µ − ε ≤ exp − ε2 , 2µB ∑ Yi > µ + ε ≤ exp − ε2 . (2µ + ε)B i P i Definition 4 (Dubhashi and Ranjan, 1998) A set of random variables Y1 , . . . ,Yn is called “negatively associated”, if it satisfies for any two disjoint subsets I and J of {1, . . . , n}, and any two non-decreasing, or any two non-increasing, functions f from R|I| to R and g from R|J| to R: E[ f (Yi : i ∈ I)g(Y j : j ∈ J)] ≤ E[ f (Yi : i ∈ I)]E[g(Y j : j ∈ J)]. The next lemma is based on the negative association analysis. It follows directly from Theorem 14 and Proposition 7 of Dubhashi and Ranjan (1998). 1234 C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS Lemma 5 For any set of N non-decreasing, or N non-increasing functions { fw : w ∈ V }, any Chernoff-style bound on ∑w∈V fw (cw ), pretending that cw are independent, is valid. In particular, Lemmas 1 and 2 apply for {Y1 , ...,Yn } = { fw (cw ) : w ∈ V }. The next lemma shows an explicit upper bound on the binomial distribution probability.2 Lemma 6 Let X ∼ Bin(n, p) be a sum of n i.i.d. Bernoulli random variables with p ∈ (0, 1). Let 1 µ = E[X] = np. For x ∈ (0, n], there exists some function Tx = exp 12x + O x12 , such that ∀k ∈ Tn 1 {0, . . . , n}, we have P(X = k) ≤ √ Tµ Tn−µ . For integral values of µ, the equality is achieved 2πµ(1−p) at k = µ. (Note that for x ≥ 1, we have Tx = Θ(1).) The next lemma deals with the number of successes in independent trials. Lemma 7 (Hoeffding, 1956) Let Y1 , . . . ,Yn ∈ {0, 1} be a sequence of independent trials, with pi = 1 E[Yi ]. Let X = ∑i Yi be the number of successes, and p = n ∑i pi be the average trial success probability. For any integers b and c such that 0 ≤ b ≤ np ≤ c ≤ n, we have c ∑ k=b n k p (1 − p)n−k ≤ P (b ≤ X ≤ c) ≤ 1. k Using the above lemma, the next lemma shows a general concentration bound for a sum of arbitrary real-valued functions of a multinomial distribution components. We show that with a small penalty, any Chernoff-style bound pretending the components being independent is valid.3 We recall that cS , or equivalently cw , is the number of appearances of the word w in a sample S of w size m. Lemma 8 Let {cw ∼ Bin(m, pw ) : w ∈ V } be independent binomial random variables. Let { fw (x) : w ∈ V } be a set of real valued functions. Let F = ∑w fw (cw ) and F = ∑w fw (cw ). For any ε > 0, √ P (|F − E [F]| > ε) ≤ 3 m P F − E F >ε . The following lemmas provide concentration bounds for maximum-likelihood estimation of pw by pw = cw . The first lemma shows that words with “high” probability have a “high” count in the ˆ m sample. Lemma 9 Let δ > 0, and λ ≥ 3. We have ∀δ S: ∀w ∈ V, s.t. mpw ≥ 3 ln 2m , δ |mpw − cw | ≤ ∀w ∈ V, s.t. mpw > λ ln 2m , δ cw > 1− 3mpw ln 3 λ 2m ; δ mpw . 2. Its proof is based on Stirling approximation directly, though local limit theorems could be used. This form of bound is needed for the proof of Theorem 30. 3. The negative association analysis (Lemma 5) shows that a sum of monotone functions of multinomial distribution components must obey Chernoff-style bounds pretending that the components are independent. 
In some sense, our result extends this notion, since it does not require the functions to be monotone. 1235 D RUKH AND M ANSOUR The second lemma shows that words with “low” probability have a “low” count in the sample. Lemma 10 Let δ ∈ (0, 1), and m > 1. Then, ∀δ S: ∀w ∈ V such that mpw ≤ 3 ln m , we have cw ≤ δ 6 ln m . δ The following lemma derives the bound as a function of the count in the sample (and not as a function of the unknown probability). Lemma 11 Let δ > 0. Then, ∀δ S: cw > 18 ln 4m , δ ∀w ∈ V, s.t. |mpw − cw | ≤ 6cw ln 4m . δ The following is a general concentration bound. Lemma 12 For any δ > 0, and any word w ∈ V , we have ∀δ S, cw − pw < m 3 ln 2 δ . m The following lemma bounds the probability of words that do not appear in the sample. Lemma 13 Let δ > 0. Then, ∀δ S: ∀w ∈ S, / mpw < ln m . δ 3. K-Hitting Mass Estimation In this section our goal is to estimate the probability of the set of words appearing exactly k times in the sample, which we call “the k-hitting mass”. We analyze the Good-Turing estimator, the empirical frequencies estimator, and a combined estimator. ˆ Definition 14 We define the k-hitting mass Mk , its empirical frequencies estimator Mk , and its 4 Good-Turing estimator Gk as Mk = ∑ w∈Sk pw ˆ Mk = k nk m Gk = k+1 nk+1 . m−k The outline of this section is as follows. Definition 16 slightly redefines the k-hitting mass and its estimators. Lemma 17 shows that this redefinition has a negligible influence. Then, we analyze the estimation errors using the concentration inequalities from Section 2. Lemmas 20 and 21 bound the expectation of the Good-Turing estimator error, following McAllester and Schapire (2000). Lemma 23 bounds the deviation of the error, using the negative association analysis. A tighter bound, based on Lemma 8, is achieved at Theorem 25. Theorem 26 analyzes the error of the empirical frequencies estimator. Theorem 29 refers to the combined estimator. Finally, Theorem 30 shows a weak lower bound for the k-hitting mass estimation. 4. The Good-Turing estimator is usually defined as ( k+1 )nk+1 . The two definitions are almost identical for small values m k of k, as their quotient equals 1 − m . Following McAllester and Schapire (2000), our definition makes the calculations slightly simpler. 1236 C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS Definition 15 For any w ∈ V and i ∈ {0, · · · , m}, we define Xw,i as a random variable equal 1 if cw = i, and 0 otherwise. The following definition concentrates on words whose frequencies are close to their probabilities. Definition 16 Let α > 0 and k > 3α2 . We define Ik,α = pw ∈ Ik,α }. We define: Mk,α = ∑ pw = w∈Sk ∩Vk,α ∑ √ √ k−α k k+1+α k+1 m , m , and Vk,α = {w ∈ V : pw Xw,k , w∈Vk,α Gk,α = k+1 k+1 |Sk+1 ∩Vk,α | = ∑ Xw,k+1 , m−k m − k w∈Vk,α ˆ Mk,α = k k |Sk ∩Vk,α | = ∑ Xw,k . m m w∈Vk,α By Lemma 11, for large values of k the redefinition coincides with the original definition with high probability: Lemma 17 For δ > 0, let α = ˆ ˆ and Mk = Mk,α . 6 ln 4m . For k > 18 ln 4m , we have ∀δ S: Mk = Mk,α , Gk = Gk,α , δ δ Proof By Lemma 11, we have ∀δ S, ∀w : cw > 18 ln 4m , δ |mpw − cw | ≤ 6cw ln √ 4m = α cw . δ This means that any word w with cw = k has √ √ √ k−α k k+α k k+1+α k+1 ≤ pw ≤ < . m m m √ ˆ Therefore w ∈ Vk,α , completing the proof for Mk and Mk . Since α < k, any word w with cw = k + 1 has √ √ √ k−α k k+1−α k+1 k+1+α k+1 < ≤ pw ≤ , m m m which yields w ∈ Vk,α , completing the proof for Gk . Since the minimal probability of a word in Vk,α is Ω Lemma 18 Let α > 0 and k > 3α2 . 
Then, |Vk,α | = O 1237 m k k m . , we derive: D RUKH AND M ANSOUR Proof We have α < √ √k . 3 Any word w ∈ Vk,α has pw ≥ |Vk,α | < √ k−α k m > k m 1 1 − √3 . Therefore, m m 1 , =O 1 k 1− √ k 3 which completes the proof. Using Lemma 6, we derive: Lemma 19 Let α > 0 and 3α2 < k ≤ m . Let w ∈ Vk,α . Then, E[Xw,k ] = P(cw = k) = O 2 1 √ k . Proof Since cw ∼ Bin(m, pw ) is a binomial random variable, we use Lemma 6: E[Xw,k ] = P(cw = k) ≤ Tm 1 . 2πmpw (1 − pw ) Tmpw Tm(1−pw ) For w ∈ Vk,α , we have mpw = Ω(k), which implies 3α2 < k ≤ m 2, Tm Tmpw Tm(1−pw ) we have 1 2πmpw (1 − pw ) ≤ = O(1). Since pw ∈ Ik,α and 1 √ 2π k − α k √ k+1+α k+1 m 1− 1 < 1 2πk 1 − √3 1 1 − k+1 1 + √3 m 1 < 1 2πk 1 − √3 1− 1 2 1 +m 1 1 + √3 1 = O √ , k which completes the proof. 3.1 Good-Turing Estimator The following lemma, directly based on the definition of the binomial distribution, was shown in Theorem 1 of McAllester and Schapire (2000). Lemma 20 For any k < m, and w ∈ V , we have pw P(cw = k) = k+1 P(cw = k + 1)(1 − pw ). m−k The following lemma bounds the expectations of the redefined k-hitting mass, its Good-Turing estimator, and their difference. 1238 C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS Lemma 21 Let α > 0 and 3α2 < k < |E[Gk,α ] − E[Mk,α ]| = O √ k m m 2. We have E[Mk,α ] = O 1 √ k , E[Gk,α ] = O 1 √ k , and . Lemma 22 Let δ > 0, k ∈ {1, . . . , m}. Let U ⊆ V , such that |U| = O m . Let {bw : w ∈ U}, such k k that ∀w ∈ U, bw ≥ 0 and maxw∈U bw = O m . Let Xk = ∑w∈U bw Xw,k . We have ∀δ S:  |Xk − E[Xk ]| = O   k ln 1 δ . m Proof We define Yw,k = ∑i≤k Xw,i be random variable indicating cw ≤ k and Zw,k = ∑i < k. Let Yk = ∑w∈U bwYw,k and Zk = ∑w∈U bw Zw,k . We have ∑ bw Xw,k = ∑ bw [Yw,k − Zw,k ] = Yk − Zk . Xk = w∈U w∈U Both Yk and Zk , can be bounded using the Hoeffding inequality. Since {bwYw,k } and {bw Zw,k } are monotone with respect to {cw }, Lemma 5 applies for them. This means that the concentration of their sum is at least as tight as if they were independent. Recalling that |U| = O m and k k maxw∈U bw = O m , and using Lemma 2 for Yk and Zk , we have δ ∀ 2 S, |Yk − E[Yk ]| = O δ ∀ 2 S, |Zk − E[Zk ]| = O k m m k ln 1 , δ k m m k ln 1 . δ Therefore, |Xk − E[Xk ]| = |Yk − Zk − E[Yk − Zk ]| which completes the proof.  ≤ |Yk − E[Yk ]| + |Zk − E[Zk ]| = O   k ln 1 δ , m Using the negative association notion, we can show a preliminary bound for Good-Turing estimation error: Lemma 23 For δ > 0 and 18 ln 8m < k < m , we have ∀δ S: 2 δ   k ln 1 δ |Gk − Mk | = O  . m 1239 D RUKH AND M ANSOUR Proof Let α = 6 ln 8m . By Lemma 17, we have δ δ ∀ 2 S, Gk = Gk,α ∧ Mk = Mk,α . (1) By Lemma 21, |E[Gk − Mk ]| = |E[Gk,α − Mk,α ]| = O √ k m . (2) k+1 By Definition 16, Mk,α = ∑w∈Vk,α pw Xw,k and Gk,α = ∑w∈Vk,α m−k Xw,k+1 . By Lemma 18, we have |Vk,α | = O m . Therefore, using Lemma 22 with k for Mk,α , and with k + 1 for Gk,α , we have k δ ∀ 4 S, |Mk,α − E[Mk,α ]| = O δ ∀ 4 S, |Gk,α − E[Gk,α ]| = O k ln 1 δ m , (3) k ln 1 δ m . (4) Combining Equations (1), (2), (3), and (4), we have ∀δ S: |Gk − Mk | = |Gk,α − Mk,α | ≤ |Gk,α − E[Gk,α ]| + |Mk,α − E[Mk,α ]| + |E[Gk,α ] − E[Mk,α ]|     √ k ln 1 k ln 1 k δ δ = O +O = O , m m m which completes the proof. Lemma 24 Let δ > 0, k > 0. Let U ⊆ V . Let {bw : w ∈ U} be a set of weights, such that bw ∈ [0, B]. Let Xk = ∑w∈U bw Xw,k , and µ = E[Xk ]. We have δ ∀ S, |Xk − µ| ≤ max √ √ 6 m 6 m 4Bµ ln , 2B ln δ δ . 
Proof By Lemma 8, combined with Lemma 3, we have ε2 B(2µ + ε) √ √ ε2 ε ≤ max 6 m exp − , 6 m exp − 4Bµ 2B √ P(|Xk − µ| > ε) ≤ 6 m exp − 1240 , (5) C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS where Equation (5) follows by considering ε ≤ 2µ and ε > 2µ separately. The lemma follows sub- stituting ε = max 4Bµ ln √ 6 m δ , 2B ln √ 6 m δ . We now derive the concentration bound on the error of the Good-Turing estimator. Theorem 25 For δ > 0 and 18 ln 8m < k < m , we have ∀δ S: 2 δ √  |Gk − Mk | = O  Proof Let α = k ln m m δ + k ln m  m δ  . δ 6 ln 8m . Using Lemma 17, we have ∀ 2 S: Gk = Gk,α , and Mk = Mk,α . Recall that δ k+1 Mk,α = ∑w∈Vk,α pw Xw,k and Gk,α = ∑w∈Vk,α m−k Xw,k+1 . Both Mk,α and Gk,α are linear combinations k of Xw,k and Xw,k+1 , respectively, where the coefficients’ magnitude is O m , and the expectation, by Lemma 21, is O 1 √ k . By Lemma 24, we have δ √ k ln m δ m + k ln m δ m , (6) δ 4 √ k ln m δ m + k ln m δ m . (7) ∀ 4 S, |Mk,α − E[Mk,α ]| = O ∀ S, |Gk,α − E[Gk,α ]| = O Combining Equations (6), (7), and Lemma 21, we have ∀δ S: |Gk − Mk | = |Gk,α − Mk,α | ≤ |Gk,α − E[Gk,α ]| + |Mk,α − E[Mk,α ]| + |E[Gk,α ] − E[Mk,α ]|  √  √  √  k ln m k ln m k ln m k ln m k δ δ δ δ  + + + = O = O , m m m m m which completes the proof. 3.2 Empirical Frequencies Estimator ˆ In this section we bound the error of the empirical frequencies estimator Mk . Theorem 26 For δ > 0 and 18 ln 8m < k < m , we have 2 δ ∀δ S, √ k ln m δ ˆ |Mk − Mk | = O  m 1241 3 2 +  ln m δ  . k D RUKH AND M ANSOUR Proof Let α = δ − ˆ ˆ 6 ln 8m . By Lemma 17, we have ∀ 2 S: Mk = Mk,α , and Mk = Mk,α . Let Vk,α = δ + k k {w ∈ Vk,α : pw < m }, and Vk,α = {w ∈ Vk,α : pw > m }. Let X− = k − pw Xw,k , m ∑ − w∈Vk,α X+ = ∑ + w∈Vk,α pw − k Xw,k , m and let X? specify either X− or X+ . By the definition, for w ∈ Vk,α we have By Lemma 18, |Vk,α | = O m k |E[X? ]| ≤ k m − pw = O . By Lemma 19, for w ∈ Vk,α we have E[Xw,k ] = O ∑ w∈Vk,α k − pw E[Xw,k ] = O m √ mα k 1 √ k m k =O 1 √ k expectation is O α k . . Therefore, α . k Both X− and X+ are linear combinations of Xw,k , where the coefficients are O √ α k m (8) √ α k m and the . Therefore, by Lemma 24, we have δ 4 ∀ S: |X? − E[X? ]| = O √ α3 k α4 √ + m m k . (9) ˆ By the definition of X− and X+ , Mk,α − Mk,α = X+ − X− . Combining Equations (8) and (9), we δ S: have ∀ ˆ ˆ |Mk − Mk | = |Mk,α − Mk,α | = |X+ − X− | ≤ |X+ − E[X+ ]| + |E[X+ ]| + |X− − E[X− ]| + |E[X− ]| √ 3 √ 3 k 4 k ln m 2 α α α δ √ + = O = O + + m k m m k since √ ab = O(a + b), and we use a = √ α3 k m and b = α . k  ln m δ  , k 3.3 Combined Estimator In this section we combine the Good-Turing estimator with the empirical frequencies to derive a combined estimator, which is uniformly accurate for all values of k. ˜ Definition 27 We define Mk , a combined estimator for Mk , by ˜ Mk = Gk ˆ Mk 1242 2 k ≤ m5 2 k > m5 . C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS Lemma 28 (McAllester and Schapire, 2000) Let k ∈ {0, . . . , m}. For any δ > 0, we have  ln 1 m  δ . k + ln m δ  ∀δ S : |Gk − Mk | = O  2 ˜ ˜ The following theorem shows that Mk has an error bounded by O m− 5 , for any k. For small k, 2 2 we use Lemma 28. Theorem 25 is used for 18 ln 8m < k ≤ m 5 . Theorem 26 is used for m 5 < k < m . 2 δ ˜ The complete proof also handles k ≥ m . The theorem refers to Mk as a probability estimator, and 2 does not show that it is a probability distribution by itself. Theorem 29 Let δ > 0. For any k ∈ {0, . . . , m}, we have ˜ ˜ ∀δ S, |Mk − Mk | = O m− 5 . 
2 The following theorem shows a weak lower bound for approximating Mk . It applies to estimating Mk based on a different independent sample. This is a very “weak” notation, since Gk , as well ˆ as Mk , are based on the same sample as Mk . Theorem 30 Suppose that the vocabulary consists of k m ), where 1 k √ m. The variance of Mk is Θ k m m k words distributed uniformly (that is pw = . 4. Leave-One-Out Estimation of Log-Loss Many NLP applications use log-loss as the learning performance criteria. Since the log-loss depends on the underlying probability P, its value cannot be explicitly calculated, and must be approximated. The main result of this section, Theorem 32, is an upper bound on the leave-one-out estimation of the log-loss, assuming a general family of learning algorithms. Given a sample S = {s1 , . . . , sm }, the goal of a learning algorithm is to approximate the true probability P by some probability Q. We denote the probability assigned by the learning algorithm to a word w by qw . Definition 31 We assume that any two words with equal sample frequency are assigned equal probabilities in Q, and therefore denote qw by q(cw ). Let the log-loss of a distribution Q be L = 1 1 ∑ pw ln qw = ∑ Mk ln q(k) . w∈V k≥0 Let the leave-one-out estimation, qw , be the probability assigned to w, when one of its instances is removed. We assume that any two words with equal sample frequency are assigned equal leaveone-out probability estimation, and therefore denote qw by q (cw ). We define the leave-one-out estimation of the log-loss as averaging the loss of each sample word, when it is extracted from the sample and pretended to be the test sample: 1243 D RUKH AND M ANSOUR ∑ Lleave−one = w∈V cw knk 1 1 =∑ ln ln . m qw k>0 m q (k) 1 1 Let Lw = L(cw ) = ln q(cw ) , and Lw = L (cw ) = ln q (cw ) . Let the maximal loss be Lmax = max max L(k), L (k + 1) . k In this section we discuss a family of learning algorithms, that receive the sample as an input. Assuming an accuracy parameter δ, we require the following properties to hold: 1. Starting from a certain number of appearances, the estimation is close to the sample frequency. Specifically, for some α, β ∈ [0, 1], ∀k ≥ ln 4m , δ q(k) = k−α . m−β (10) 2. The algorithm is stable when a single word is extracted from the sample: ∀m, 1 , m 1 L (k + 1) − L(k) = O S . n1 2 ≤ k ≤ 10 ln 4m , δ L (k + 1) − L(k) = O ∀m, ∀S s.t. nS > 0, k ∈ {0, 1}, 1 (11) (12) An example of such an algorithm is the following leave-one-out algorithm (we assume that the vocabulary is large enough so that n0 + n1 > 0): qw = N−n0 −1 (n0 +n1 )(m−1) cw −1 m−1 cw ≤ 1 cw ≥ 2. Equation (10) is satisfied by α = β = 1. Equation (11) is satisfied for k ≥ 2 by L(k) − L (k + 1) = 1 = O m . Equation (12) is satisfied for k ≤ 1: ln m−1 m−2 N − n0 − 1 m − 2 N − n0 − 2 m − 1 n0 + n1 + 1 m − 2 |L (2) − L(1)| = ln n0 + n1 m − 1 |L (1) − L(0)| = ln 1 1 1 =O + , N − n0 m n1 1 1 1 =O + . =O n0 + n1 m n1 =O The following is the main theorem of this section. It bounds the deviation between the between the true loss and the leave one out estimate. This bound shows that for a general family of learning algorithms, leave-one-out technique can be effectively used to estimate the logarithmic loss, given the sample only. The estimation error bound decreases roughly in proportion to the square root of the sample size, regardless of the underlying distribution. 
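Theorem 32, stated next, makes this claim precise. The sketch below implements the example algorithm given above and the leave-one-out estimate of Definition 31 directly from their definitions. It is a minimal sketch, assuming that the vocabulary size N is known, that at least two distinct words appear, that n0 + n1 > 0 on every reduced sample (as the text assumes), and that tokens arrive as a Python list; the function names are illustrative.

```python
from collections import Counter
from math import log

def example_q(counts, n, N):
    """Probabilities of the example algorithm above, for a sample of size n
    with word counts `counts` and vocabulary size N: q(c) = (c-1)/(n-1) for
    c >= 2, and the leftover mass spread over words seen at most once."""
    n0 = N - len(counts)                               # unseen words
    n1 = sum(1 for c in counts.values() if c == 1)     # words seen exactly once
    low = (N - n0 - 1) / ((n0 + n1) * (n - 1))         # mass per low-count word
    return lambda c: (c - 1) / (n - 1) if c >= 2 else low

def loo_log_loss(sample, N):
    """Leave-one-out estimate of the log-loss (Definition 31): remove one
    occurrence of each sample token, retrain on the reduced sample, and
    average -ln of the probability assigned to the removed word."""
    m = len(sample)
    counts = Counter(sample)
    total = 0.0
    for w in sample:
        reduced = counts.copy()
        reduced[w] -= 1
        if reduced[w] == 0:
            del reduced[w]
        q = example_q(reduced, m - 1, N)
        total += -log(q(counts[w] - 1))   # count of w within the reduced sample
    return total / m
```

The probabilities returned by example_q sum to one over the vocabulary, so the algorithm defines a genuine distribution; the loop above simply mirrors the definition and makes no attempt at efficiency.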
1244 C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS Theorem 32 For a learning algorithm satisfying Equations (10), (11), and (12), and δ > 0, we have: δ ∀ S, |L − Lleave−one | = O Lmax (ln m )4 ln δ m ln m δ δ . The proof of Theorem 32 bounds the estimation error separately for the high-probability and low-probability words. We use Lemma 20 (McAllester and Schapire, 2000) to bound the estimation error for low-probability words. The expected estimation error for the high-probability words is bounded elementarily using the definition of the binomial distribution (Lemma 33). Finally, we use McDiarmid’s inequality (Lemma 2) to bound its deviation. The next lemma shows that the expectation of the leave-one-out method is a good approximation for the per-word expectation of the logarithmic loss. Lemma 33 Let 0 ≤ α ≤ 1, and y ≥ 1. Let Bn ∼ Bin(n, p) be a binomial random variable. Let fy (x) = ln(max(x, y)). Then, 0 ≤ E p fy (Bn − α) − 3p Bn fy (Bn − α − 1) ≤ . n n Proof For a real valued function F (here F(x) = fy (x − α)), we have: E Bn F(Bn − 1) n n = ∑ x=0 n = p∑ x=1 n x x p (1 − p)n−x F(x − 1) x n n − 1 x−1 p (1 − p)(n−1)−(x−1) F(x − 1) x−1 = pE[F(Bn−1 )] , where we used n x x n = E p fy (Bn − α) − n−1 x−1 . Since Bn ∼ Bn−1 + B1 , we have: Bn fy (Bn − α − 1) n = p(E[ fy (Bn−1 + B1 − α)] − E[ fy (Bn−1 − α)]) max(Bn−1 + B1 − α, y) max(Bn−1 − α, y) max(Bn−1 − α + B1 , y + B1 ) ≤ pE ln max(Bn−1 − α, y) B1 = pE ln(1 + ) max(Bn−1 − α, y) B1 ≤ pE . max(Bn−1 − α, y) = pE ln Since B1 and Bn−1 are independent, we get 1245 D RUKH AND M ANSOUR pE B1 1 = pE[B1 ]E max(Bn−1 − α, y) max(Bn−1 − α, y) 1 = p2 E max(Bn−1 − α, y) n−1 = p2 ∑ n−1 x 1 p (1 − p)n−1−x x max(x − α, y) = p2 ∑ x+1 1 n−1 x p (1 − p)n−1−x x + 1 max(x − α, y) x x=0 n−1 x=0 ≤ ≤ n−1 p x+1 n max px+1 (1 − p)n−(x+1) ∑ n x max(x − α, y) x=0 x + 1 3p 3p (1 − (1 − p)n ) < . (13) n n Equation (13) follows by the following observation: x + 1 ≤ 3(x − α) for x ≥ 2, and x + 1 ≤ 2y for x ≤ 1. Finally, pE ln max(Bn−1 −α+B1 ,y) ≥ 0, which implies the lower bound of the lemma. max(Bn−1 −α,y) The following lemma bounds n2 as a function of n1 . Lemma 34 Let δ > 0. We have ∀δ S: n2 = O 3 5 Theorem 32 Proof Let yw = 1 − m ln 1 + n1 ln m . δ δ δ pw m − 2. By Lemma 9, with λ = 5, we have ∀ 2 S: 3 ln 4m 3pw ln 4m δ δ pw − cw ≤ , m m m √ 5 ln 4m δ ∀w ∈ V : pw > , cw > yw + 2 ≥ (5 − 15) ln 4m > ln 4m . δ δ m ∀w ∈ V : pw > Let VH = w ∈ V : pw > 5 ln 4m δ m |L − Lleave−one | ≤ (14) (15) and VL = V \VH . We have ∑ w∈VH pw Lw − cw L m w + ∑ w∈VL pw Lw − cw L m w . (16) We start by bounding the first term of Equation (16). By Equation (15), we have ∀w ∈ VH , cw > m−β w −α yw + 2 > ln 4m . Equation (10) implies that qw = cm−β , therefore Lw = ln cm−β = ln max(cw −α,yw ) , and δ w −α m−1−β Lw = ln cm−1−β = ln max(cw −1−α,yw ) . Let w −1−α H Errw = cw m−β m−β ln − pw ln . m max(cw − 1 − α, yw ) max(cw − α, yw ) We have 1246 C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS ∑ w∈VH cw L − pw Lw m w m−1−β cw ∑ m − β w∈VH m ∑ H Errw + ln ∑ = H Errw + O w∈VH ≤ w∈VH 1 . m (17) H We bound ∑w∈VH Errw using McDiarmid’s inequality. As in Lemma 33, let fw (x) = ln(max(x, yw )). We have H E Errw = ln(m − β)E cw cw − pw + E pw fw (cw − α) − fw (cw − 1 − α) . m m The first expectation equals 0, the second can be bounded using Lemma 33: w∈VH H E Errw ≤ ∑ w∈VH E pw fw (cw − α) − ≤ ∑ ∑ cw fw (cw − 1 − α) m 3pw 1 =O . m m w∈VH (18) H In order to use McDiarmid’s inequality, we bound the change of ∑w∈VH Errw as a function of a single change in the sample. 
Suppose that a word u is replaced by a word v. This results in decrease H for cu , and increase for cv . Recalling that yw = Ω(mpw ), the change of Erru , as well as the change H , is bounded by O ln m , as follows: of Errv m m−β The change of pu ln max(cu −α,yu ) would be 0 if cu − α ≤ yu . Otherwise, pu ln m−β m−β − pu ln max(cu − 1 − α, yu ) max(cu − α, yu ) ≤ pu [ln(cu − α) − ln(cu − 1 − α)] = pu ln 1 + 1 cu − 1 − α =O pu cu . m−β pu 1 Since cu ≥ yu = Ω(mpu ), the change is bounded by O( cu ) = O( m ). The change of cu ln max(cu −1−α,yu ) m would be O( ln m ) if cu − 1 − α ≤ yu . Otherwise, m m−β cu m−β cu − 1 ln − ln m max(cu − 2 − α, yu ) m max(cu − 1 − α, yu ) m−β 1 m−β m−β cu − 1 ln + ln − ln ≤ m max(cu − 2 − α, yu ) max(cu − 1 − α, yu ) m max(cu − 1 − α, yu ) ln m cu − 1 1 ln m ln 1 + + =O . ≤ m cu − 2 − α m m H The change of Errv is bounded in a similar way. 1247 D RUKH AND M ANSOUR δ By Equations (17) and (18), and Lemma 2, we have ∀ 16 S: ∑ w∈VH ≤ cw L − pw Lw m w ∑ w∈VH ≤ O ∑ H Errw − E ln m m H Errw ∑ + E H Errw +O w∈VH w∈VH  1 1 1 m ln + + δ m m = O 1 m  (ln m)2 ln 1 δ . m (19) δ Next, we bound the second term of Equation (16). By Lemma 10, we have ∀ 4 S: ∀w ∈ V s.t. pw ≤ 3 ln 4m δ , cw ≤ 6 ln 4m . δ m (20) b Let b = 5 ln 4m . By Equations (14) and (20), for any w such that pw ≤ m , we have δ   cw ≤ max pw +  m 3pw ln m 4m δ √ (5 + 3 ∗ 5) ln 4m 2b δ ≤ < .  m m  4m  δ 6 ln , m k L Therefore ∀w ∈ VL , we have cw < 2b. Let nL = |VL ∩Sk |, GL = m−k+1 nL , and Mk = ∑w∈VL ∩Sk pw . k k k−1 We have ∑ w∈VL cw L − pw Lw m w 2b = 2b−1 knL L k L (k) − ∑ Mk L(k) ∑ m k=1 k=0 ≤ ∑ m − kk+ 1 L (k) − ∑ MkL L(k) 2b k=1 2b = k=0 2b−1 ∑ GL L (k) − ∑ MkL L(k) k−1 k=1 2b−1 = 2b−1 knL +O k=0 2b−1 k=0 k=0 1 1 − m−k+1 m bLmax m +O k=0 2b−1 ≤ k=1 2b−1 ∑ GL L (k + 1) − ∑ MkL L(k) k k=0 2b + ∑ knL L (k) k bLmax m ∑ GL |L (k + 1) − L(k)| + ∑ |GL − MkL |L(k) + O k k bLmax m . The first sum of Equation (21) is bounded using Equations (11) and (12), and Lemma 34: 1248 (21) C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS 2b−1 ∑ GL |L (k + 1) − L(k)| k k=0 2b−1 = ∑ GL |L (k + 1) − L(k)| + G0|L (1) − L(0)| + G1|L (2) − L(1)|. k (22) k=2 The first term of Equation (22) is bounded by Equation (11): 2b−1 ∑ k=2 2b−1 ∑ GL · O k GL |L (k + 1) − L(k)| ≤ k k=2 1 m =O 1 . m δ The other two terms are bounded using Lemma 34. For n1 > 0, we have ∀ 16 S, n2 = O b By Equation (12), we have G0 |L (1) − L(0)| + G1 |L (2) − L(1)| ≤ n1 1 ·O m n1  2n2 1 + ·O m−1 n1  ln 1 δ . m = O b (23) m ln 1 + n1 δ (24) For n1 = 0, Lemma 34 results in n2 = O b m ln 1 , and Equation (24) transforms into δ G1 |L (2) − L(1)| ≤  2n2 Lmax = O bLmax m−1 Equations (22), (23), (24), and (25) sum up to  2b−1 ∑ GL |L (k + 1) − L(k)| k k=0 = O bLmax  ln 1 δ . m  ln 1 δ . m (25) (26) The second sum of Equation (21) is bounded using Lemma 28 separately for every k < 2b with δ L accuracy 16b . Since the proof of Lemma 28 also holds for GL and Mk (instead of Gk and Mk ), we k δ L have ∀ 8 S, for every k < 2b, |GL − Mk | = O b k ln b δ m . Therefore, together with Equations (21) and (26), we have ∑ w∈VL cw L − pw Lw m w  ≤ O bLmax  = O Lmax   2b−1 ln 1 δ + ∑ L(k)O b m k=0  b4 ln b δ . m 1249  ln b bLmax δ +O m m (27) . D RUKH AND M ANSOUR The proof follows by combining Equations (16), (19), and (27). 5. Log-Loss A Priori Section 4 bounds the error of the leave-one-out estimation of the log-loss. It shows that the log-loss can be effectively estimated, for a general family of learning algorithms. 
Another question to be considered is the log-loss distribution itself, without the empirical estimation. That is, how large (or low) is it expected to be, and which parameters of the distribution affect it. We denote the learning error (equivalent to the log-loss) as the KL-divergence between the true and the estimated distribution. We refer to a general family of learning algorithms, and show lower and upper bounds for the learning error. The upper bound (Theorem 39) can be divided to three parts. The first part is the missing mass. The other two build a trade-off between a threshold (lower thresholds leads to a lower bound), and the number of words with probability exceeding this threshold (fewer words lead to a lower bound). It seems that this number of words is a necessary lower bound, as we show at Theorem 35. 1 Theorem 35 Let the distribution be uniform: ∀w ∈ V : pw = N , with N m. Also, suppose that the learning algorithm just uses maximum-likelihood approximation, meaning qw = cw . Then, a typical m learning error would be Ω( N ). m The proof of Theorem 35 bases on the Pinsker inequality (Lemma 36). It first shows a lower bound for L1 norm between the true and the expected distributions, and then transforms it to the form of the learning error. Lemma 36 (Pinsker Inequality) Given any two distributions P and Q, we have 1 KL(P||Q) ≥ (L1 (P, Q))2 . 2 Theorem 35 Proof We first show that L1 (P, Q) concentrates near Ω N m . Then, we use Pinsker inequality to show lower bound5 of KL(P||Q). First we find a lower bound for E[|pw − qw |]. Since cw is a binomial random variable, σ2 [cw ] = m mpw (1 − pw ) = Ω N , and with some constant probability, |cw − mpw | > σ[cw ]. Therefore, we have E[|qw − pw |] = ≥ E 1 E[|cw − mpw |] m 1 1 σ[cw ]P(|cw − mpw | > σ[cw ]) = Ω m m ∑ |pw − qw | w∈V = Ω N√ 1 mN =Ω 1 =Ω √ mN m N N m . 5. This proof does not optimize the constants. Asymptotic analysis of logarithmic transform of binomial variables by Flajolet (1999) can be used to achieve explicit values for KL(P||Q). 1250 C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS 2 A single change in the sample changes L1 (P, Q) by at most m . Using McDiarmid inequality 1 (Lemma 2) on L1 (P, Q) as a function of sample words, we have ∀ 2 S: L1 (P, Q) ≥ E[L1 (P, Q)] − |L1 (P, Q) − E[L1 (P, Q)]| √ N m N −O =Ω . = Ω m m m Using Pinsker inequality (Lemma 36), we have 1 2 ∀ S, pw 1 ∑ pw ln qw ≥ 2 w∈V 2 ∑ |pw − qw | =Ω w∈V N , m which completes the proof. Definition 37 Let α ∈ (0, 1) and τ ≥ 1. We define an (absolute discounting) algorithm Aα,τ , which α “removes” m probability mass from words appearing at most τ times, and uniformly spreads it among the unseen words. We denote by n1...τ = ∑τ ni the number of words with count between 1 i=1 and τ. The learned probability Q is defined by :  αn1...τ cw = 0  mn0 cw −α qw = 1 ≤ cw ≤ τ m  cw τ < cw . m The α parameter can be set to some constant, or to make the missing mass match the Good1...τ Turing missing mass estimator, that is αnm = n1 . m Definition 38 Given a distribution P, and x ∈ [0, 1], let Fx = ∑w∈V :pw ≤x pw , and Nx = |{w ∈ V : pw > x}|. Clearly, for any distribution P, Fx is a monotone function of x, varying from 0 to 1, and Nx is a monotone function of x, varying from N to 0. Note that Nx is bounded by 1 . x The next theorem shows an upper bound for the learning error. 
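Before that theorem (Theorem 39, stated below), the following sketch spells out the absolute-discounting estimator A_{α,τ} of Definition 37. The default values of alpha and tau and the function signature are illustrative assumptions, and the sketch assumes at least one unseen word so that the reserved mass has somewhere to go.

```python
from collections import Counter

def absolute_discounting(sample, vocab, alpha=0.5, tau=3):
    """A_{alpha,tau} of Definition 37: subtract alpha from every count in
    {1, ..., tau} and spread the removed mass uniformly over the unseen
    words.  `vocab` is the full vocabulary; alpha in (0,1), tau >= 1."""
    m = len(sample)
    counts = Counter(sample)
    n0 = sum(1 for w in vocab if counts[w] == 0)                 # unseen words
    n_1_tau = sum(1 for c in counts.values() if 1 <= c <= tau)   # n_{1...tau}
    q = {}
    for w in vocab:
        c = counts[w]
        if c == 0:
            q[w] = alpha * n_1_tau / (m * n0)
        elif c <= tau:
            q[w] = (c - alpha) / m
        else:
            q[w] = c / m
    return q   # the values sum to one over the vocabulary
```

Setting alpha = n_1 / n_{1...τ} makes the reserved mass equal the Good-Turing missing-mass estimate n_1/m, which is the data-driven choice mentioned after Definition 37.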
√ Theorem 39 For any δ > 0 and λ > 3, such that τ < (λ − 3λ) ln 8m , the learning error of Aα,τ is δ bounded ∀δ S by pw 0 ≤ ∑ pw ln qw w∈V ≤ M0 ln + n0 ln 4m δ αn1...τ α F 8m + 1 − α λ lnm δ 1251  λ ln 8m δ  + 1−α  3 ln 8 δ + M0  m 3 ln 8 3λ ln 8m δ δ + √ N λ ln 8m . √ δ m 2( λ − 3)2 m m D RUKH AND M ANSOUR The proof of Theorem 39 bases directly on Lemmas 40, 41, and 43. We can rewrite this bound roughly as Nλ λ ˜ ≤ O M0 + √ + m m m pw qw ∑ pw ln w∈V . This bound implies the characteristics of the distribution influencing the log-loss. It shows that a “good” distribution can involve many low-probability words, given that the missing mass is low. However, the learning error would increase if the dictionary included many mid-range3 probability words. For example, if a typical word’s probability were m− 4 , the bound would become 1 ˜ O M0 + m− 4 . Lemma 40 For any δ > 0, the learning error for non-appearing words can be bounded with high probability by ∑ ∀δ S, pw ln w∈S / pw qw ≤ M0 ln n0 ln m δ αn1...τ . Proof By Lemma 13, we have ∀δ S, the real probability of any non-appearing word does not exceed ln m δ m . Therefore, ∑ pw ln w∈S / pw qw m n0 α n1...τ ∑ pw ln pw ∑ pw ln = ln m m n0 δ m α n1...τ w∈S / ≤ w∈S / = M0 ln n0 ln m δ αn1...τ , which completes the proof. Lemma 41 Let δ > 0, λ > 0. Let VL = w ∈ V : pw ≤ λ ln 2m δ m for VL can be bounded with high probability by ∀δ S, ∑ pw ln w∈VL pw qw Proof We use ln(1 + x) ≤ x. ∑  λ ln 2m δ  ≤ 1−α pw ln w∈VL For any appearing word w, qw ≥ 1−α m . pw qw ≤ ∑ w∈VL Therefore, 1252 , and VL = VL ∩ S. The learning error  3 ln 2 α δ + M0  + F 2m . m 1 − α λ lnm δ pw pw − qw . qw C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS pw − qw qw pw w∈VL ≤ m ∑ pw (pw − qw ) 1 − α w∈V = ∑ m 1−α L ∑ w∈VL ≤ m 1−α ≤ m λ ln 1−α m ≤ λ ln 1−α pw pw − ∑ pw pw − w∈VL 2m δ 2m δ ∑ w∈VL L cw m + pw − cw m α m ∑ pw m 1 − α w∈V L pw − w∈VL ∑ cw cw + ∑ pw − qw m m w∈V cw m + α ∑ pw 1 − α w∈V L + α F 2m . 1 − α λ lnm δ (28) We apply Lemma 12 on vL , the union of words in VL . Let pvL = ∑w∈VL pw and cvL = ∑w∈VL cw . We have ∀δ S: ∑ w∈VL pw − cw m w∈VL = pw − cw cw − ∑ pw − m m w∈V \S cw m ∑ L ≤ ∑ w∈VL pw − ≤ pvL − cvL + M0 m ≤ + ∑ pw w∈VL \S 3 ln 2 δ + M0 . m (29) The proof follows combining Equations (28) and (29). 2 x Lemma 42 Let 0 < ∆ < 1. For any x ∈ [−∆, ∆], we have ln(1 + x) ≥ x − 2(1−∆)2 . √ Lemma 43 Let δ > 0, λ > 3, such that τ < (λ − 3λ) ln 4m . Let the high-probability words set be δ VH = w ∈ V : pw > probability by λ ln 4m δ m ∀δ S, , and VH = VH ∩ S. The learning error for VH can be bounded with high ∑ w∈VH pw ln pw qw ≤ 3 ln 4 3λ ln 4m δ δ + √ N λ ln 4m . √ δ m 2( λ − 3)2 m m 1253 D RUKH AND M ANSOUR Proof ∑ w∈VH pw pw ln qw ∑ pw ln ∑ = pw = pw ln + cw m w∈VH mpw cw w∈VH ∑ pw ln w∈VH ∑ + cw m qw cw . cw − α pw ln w∈VH ,cw ≤τ (30) δ Using Lemma 9 with λ, we have ∀ 2 S: 3pw ln 4m cw δ ≤ , (31) m m √ 4m ∀w ∈ VH , cw ≥ (λ − 3λ) ln . δ √ This means that for a reasonable choice of τ (meaning τ < (λ − 3λ) ln 4m ), the second term of δ Equation (30) is 0, and VH = VH . Also, pw − ∀w ∈ VH , cw m 3pw ln 4m δ ≤ m − pw 1 ≤ pw pw Therefore, we can use Lemma 42 with ∆ = ∑ w∈VH pw ln mpw cw = − ≤ − = ∑ m 3 ln 2m δ = 2m m λ ln δ 3 λ: pw ln 1 + cw m w∈VH ∑ w∈VH ∑ w∈VH 3 . λ  cw m pw  − pw pw − pw − pw cw + pw − m cw m 1 2 1− 3 λ λ 2 √ √ λ− 3 2 2 ∑ w∈VH − pw pw cw m − pw pw 2    2 . (32) We apply Lemma 12 on the vH , the union of all words in VH . Let pvH = ∑w∈VH pw and cvH = ∑w∈VH cw . 
The bound on the first term of Equation (32) is: δ ∀ 2 S, ∑ w∈VH pw − cw m = pvH − cvH ≤ m 3 ln 4 δ . m Assuming that Equation (31) holds, the second term of Equation (32) can also be bounded: 1254 (33) C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS ∑ cw m w∈VH − pw pw 2 ≤ ∑ w∈VH 3 ln 4m 1 3pw ln 4m δ δ = N λ ln 4m . δ pw m m m (34) The proof follows by combining Equations (30), (32), (33) and (34). Acknowledgments This work was supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778, by a grant from the Israel Science Foundation and an IBM faculty award. This publication only reflects the authors’ views. We are grateful to David McAllester for his important contributions in the early stages of this research. Appendix A. Technical Proofs A.1 Concentration Inequalities Lemma 6 Proof We use Stirling approximation Γ(x + 1) = Tx = exp P(X = k) = ≤ = = = 1 1 +O 2 12x x √ 2πx x x e Tx , where . n k p (1 − p)n−k k Γ(n + 1) µ µ n − µ n−µ Γ(µ + 1)Γ(n − µ + 1) n n √ n µµ (n − µ)n−µ Tn 2πn n √ 2πµ 2π(n − µ) µµ (n − µ)n−µ nµ nn−µ Tµ Tn−µ √ 1 2πµ n Tn n − µ Tµ Tn−µ Tn 1 . 2πµ(1 − p) Tµ Tn−µ Clearly, for integral values of µ, the equality is achieved at k = µ. Lemma 8 Proof Let m = ∑w∈V cw . Using Lemma 7 for m with b = c = E[m ] = m, the prob1 ability P(m = m) achieves its minimum when ∀w ∈ V, pw = N . Under this assumption, we have 1 m ∼ Bin(mN, N ). Using Lemma 6, we have 1255 D RUKH AND M ANSOUR P m =m = 1 1 1 2πmN N 1 − N TmN 1 ≥ √ . Tm TmN−m 3 m Therefore, for any distribution {pw : w ∈ V }, we have 1 P(m = m) ≥ √ . 3 m Obviously, E[F ] = ∑w E[ fw (cw )] = E[F]. Also, the distribution of {cw } given that m = m is identical to the distribution of {cw }, therefore the distribution of F given that m = m is identical to the distribution of F. We have P(|F − E[F ]| > ε) = ∑ P(m i = i)P(|F − E[F ]| > ε|m = i) ≥ P(m = m)P(|F − E[F ]| > ε|m = m) = P(m = m)P(|F − E[F]| > ε) 1 √ P(|F − E[F]| > ε), ≥ 3 m which completes the proof. Lemma 44 For any δ > 0, and a word w ∈ V , such that pw ≥  P  pw − cw > m 3 ln 2 δ m  3pw ln 2 δ ≤ δ. m Proof The proof follows by applying Lemma 3, substituting ε = mpw we have ε ≤ mpw :  cw P  pw − ≥ m , we have 3mpw ln 2 . Note that for 3 ln 2 ≤ δ δ  3pw ln 2 δ = P (|mpw − cw | ≥ ε) m ≤ 2 exp − ε2 2E[cw ] + ε ≤ 2 exp − 3mpw ln 2 δ 3mpw which completes the proof. 1256 = δ, C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS 3 ln 2m Lemma 9 Proof There are at most m words with probability pw ≥ m δ . The first claim follows δ using Lemma 44 together with union bound over all these words (with accuracy m for each word). Using the first claim, we derive the second. We show a lower bound for 3pw ln 2m δ > pw − pw m cw ≥ pw − m 3 = λ 1− 3 λ cw m, using ln 2m δ m 1 < λ pw : pw . The final inequality follows from simple algebra. Lemma 10 Proof Let b = 3 ln( m ). Note that δ ∈ [0, 1] and m > 1 yield b > 2. First, suppose that δ b there are up to m words with pw ≤ m . For each such word, we apply Lemma 3 on cw , with ε = b. We have: P cw > 6 ln m b2 ≤ P (cw > mpw + ε) ≤ exp − δ 2mpw + b ≤ δ . m Since we assume that there are up to m such words, the total mistake probability is δ. Now we assume the general case, that is, without any assumption on the number of words. Our goal is to reduce the problem to the former conditions, that is, to create a set of size m of words with b probability smaller than m . We first create m empty sets v1 , . . . , vm . 
Let the probability of each set vi , pvi , be the sum of the probabilities of all the words it includes. Let the actual count of vi , cvi , be the sum of the sample counts of all the words w it includes. We divide all the words w between these sets in a bin-packing-approximation manner. We sort the words w in decreasing probability order. Then, we do the following loop: insert the next word w to the set vi with the currently smallest pvi . b We claim that pvi ≤ m for each vi at the end of the loop. If this inequality does not hold, then b some word w made this “overflow” first. Obviously, pw must be smaller than 2m , otherwise it would b be one of the first 2m < m words ordered, and would enter an empty set. If pw < 2m and it made b b an “overflow”, then the probability of each set at the moment w was entered must exceed 2m , since w must have entered the lightest set available. This means that the total probability of all words b entered by that moment was greater than m 2m > 1. Applying the case of m words to the sets v1 , . . . , vm , we have ∀δ S: for every vi , cvi ≤ 2b. Also, if the count of each set vi does not exceed 2b, so does the count of each word w ∈ vi . That is, P ∃w : pw ≤ b b , cw > 2b ≤ P ∃vi : pvi ≤ , cvi > 2b ≤ δ, m m which completes the proof. 1257 D RUKH AND M ANSOUR δ Lemma 11 Proof By Lemma 9 with some λ > 3 (which will be set later), we have ∀ 2 S: 3 ln 4m δ , m λ ln 4m δ ∀w : pw > , m ∀w : pw ≥ By Equation (35), for any word w such that √ λ + 3λ ln 4m . By Lemma 10, we have δ |mpw − cw | ≤ cw > 3 ln 4m δ m 3mpw ln 3 λ 1− ≤ pw ≤ 4m , δ mpw . λ ln 4m δ m (35) (36) , we have cw ≤ mpw + 3mpw ln 4m ≤ δ 4m . δ √ It means that for any w : mpw ≤ λ ln 4m , we have cw ≤ λ + 3λ ln 4m . This means that for δ δ √ 4m 4m any w such that cw > λ + 3λ ln δ , we have mpw > λ ln δ . By Equation (36), this means δ ∀ 2 S, ∀w s.t. pw ≤ mpw ≤ 1− 3 ln 4m δ m , cw ≤ 6 ln 1 √ 3 cw , and by Equation (35): λ |mpw − cw | ≤ 4m 3mpw ln ≤ δ 3cw ln 4m δ 1− 3 λ = √ 3cw λ ln 4m √ √δ . λ− 3 Substituting λ = 12 results in ∀δ S : ∀w s.t. cw > 18 ln 4m , δ |mpw − cw | ≤ 6cw ln 4m , δ which completes the proof. Lemma 12 Proof If pw ≥ 3 ln 2 δ m ∀δ S, , we can apply Lemma 44. We have cw − pw ≤ m 3pw ln 2 δ ≤ m 3 ln 2 δ . m Otherwise, we can apply Lemma 10. We have: ∀δ S, cw cw − pw ≤ max pw , m m which completes the proof. 1258 ≤ 6 ln m δ ≤ m 3 ln 2 δ , m C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS Lemma 13 Proof Let b = ln m . We note that there are at most δ P ∃w : cw = 0, pw ≥ b m ≤ ∑ m b b words with probability pw ≥ m . P(cw = 0) b w:pw ≥ m ∑ = m b (1 − pw ) ≤ 1− b m b m m < me−b = δ, w:pw ≥ m which completes the proof. A.2 K-Hitting Mass Estimation Lemma 21 Proof We have ∑w∈Vk,α pw ≤ 1. Using Lemma 19, we bound P(cw = k) and P(cw = k + 1): E[Mk,α ] = 1 pw P(cw = k) = O √ k w∈Vk,α ∑ k+1 P(cw = k + 1) − pw P(cw = k) w∈Vk,α m − k √ k+1 k = ∑ pw . P(cw = k + 1) = O m−k m w∈Vk,α |E[Gk,α ] − E[Mk,α ]| = ∑ Equation (37) follows by Lemma 20. By Lemma 18, we have |Vk,α | = O E[Gk,α ] = km 1 k+1 ∑ P(cw = k + 1) = O m k √k m − k w∈Vk,α m k (37) : 1 =O √ , k which completes the proof. Theorem 29 Proof The proof is done by examining four cases of k. For k ≤ 18 ln 8m , we can use δ Lemma 28. We have ˜ ∀δ S, |Mk − Mk | = |Gk − Mk | = O ln 1 δ m k + ln m δ ˜ =O 1 √ m . 2 For 18 ln 8m < k ≤ m 5 , we can use Theorem 25. We have δ ˜ ∀δ S, |Mk − Mk | = |Gk − Mk | = O 1259 √ k ln m δ m + k ln m δ m 2 ˜ = O m− 5 . D RUKH AND M ANSOUR 2 For m 5 < k < m , we can use Theorem 26. 
We have 2 δ ˜ ˆ ∀ S, |Mk − Mk | = |Mk − Mk | = O For k ≥ m , let α = 2 √ 3 k(ln m ) 2 δ m + √ ln m δ k 2 ˜ = O m− 5 . δ ˆ ˆ 6 ln 8m . By Lemma 17, we have ∀ 2 S, Mk = Mk,α ∧ Mk = Mk,α . By Lemma δ 18, |Vk,α | = O m = O(1). Let c be the bound on |Vk,α |. Using Lemma 12 for each w ∈ Vk,α with k δ accuracy 2c , we have  cw − pw = O  m δ ∀ 2 S, ∀w ∈ Vk,α , Therefore, we have ∀δ S: ˜ ˆ |Mk − Mk | = |Mk,α − Mk,α | ≤ ∑ w∈Vk,α which completes the proof. 1 δ  ln . m  ln 1 1 δ ˜ =O √ , m m  k − pw Xw,k = O  m Theorem 30 Proof First, we show that for any two words u and v, Cov(Xu,k , Xv,k ) = Θ k that {cv |cu = k} ∼ Bin m − k, m−k . By Lemma 6, we have: P(cu = k) = P(cv = k) = 1 2πk 1 − P(cv = k|cu = k) = k m Tm , Tk Tm−k 1 k 2πk 1 − m−k Tm−k . Tk Tm−2k Cov(Xu,k , Xv,k ) = E[Xu,k Xv,k ] − E[Xu,k ]E[Xv,k ] m−k m 1260 1 k 1− m . Note (38) Using Tx = Θ(1) for x ≥ k, we have = P(cu = k)[P(cv = k|cu = k) − P(cv = k)]  1 Tm  Tm−k 1 − = Tk Tm−k Tk Tm−2k k k 1 − m−k 2πk 1 − m   1  Tm−k Tm  1 1 = Θ − k 1 − k Tm−2k 1 − k Tm−k k m2  Tm  Tk Tm−k C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS   1 1  Tm−k  k Tm−2k = Θ 1− 1 + k m−k 1 − 1−  k m   Tm  Tm−k . − Tm−2k Tm−k k 1− m (39) We can bound the first term of Equation (39): 1 1− k m−k  1 − 1−  =  k m k 1 − m−k = Θ 1− Since Tx = exp 1 12x +O Tm Tm−k − Tm−2k Tm−k 1 x2  k 1 − m−k    k 1− m k 1− m − k k −1+ m m−k 1 = 1 + 12x + O 1 x2 =Θ k 1− m + k 1− m + k2 m2 .  k 1 − m−k   k 1 − m−k (40) for x ≥ m − 2k (note that k m), we have 2 Tm−k − Tm Tm−2k Tm−2k Tm−k 1 1 1 1 1 − − +O = Tm−2k Tm−k 6(m − k) 12m 12(m − 2k) m2 k2 1 = −Θ +O . 3 m m2 = Combining Equations (39), (40), and (41), we have Cov(Xu,k , Xv,k ) = Θ 1 k Θ Now we show that σ2 [Xw,k ] = Θ 1 √ k k2 m2 k2 m3 −Θ +O 1 m2 =Θ k m2 . By Equation (38), we have 1 σ2 [Xw,k ] = P(cw = k)(1 − P(cw = k)) = Θ √ k 1 1−Θ √ k 1 =Θ √ . k Now we find a bound for σ2 [Mk ]. σ2 [Mk ] = σ2 = ∑ w = m k = Θ . ∑ pw Xw,k w 2 2 pw σ [Xw,k ] + ∑ pu pvCov(Xu,k , Xv,k ) u=v k 2 1 Θ √ m k √ k , m + 1261 m m −1 k k k m 2 Θ k m2 (41) D RUKH AND M ANSOUR which completes the proof. A.3 Leave-One-Out Estimation of Log-Loss δ Lemma 34 Proof Using Lemma 9, we have ∀ 2 : n2 = |U ∩ S2 | and n1 = |U ∩ S1 |, where U = {w ∈ V : mpw ≤ c ln m }, for some c > 0. Let n2 = |U ∩ S2 | and n1 = |U ∩ S1 |. Let b = ln m . δ δ First, we show that E[n2 ] = O(bE[n1 ]). E[n2 ] = m 2 p (1 − pw )m−2 2 w ∑ w∈U = m − 1 pw 2 1 − pw ∑ mpw (1 − pw )m−1 w∈U = ∑ mpw (1 − pw )m−1 O(b) = O(bE[n1 ]). w∈U Next, we bound the deviation of n1 and n2 . A single change in the sample changes n1 , as well as n2 , by at most 1. Therefore, using Lemma 2 for n1 and n2 , we have n1 ≥ E[n1 ] − O δ ∀4 S : m ln 1 δ , n2 ≤ E[n2 ] + O δ ∀4 S : m ln 1 δ . Therefore, n2 ≤ E[n2 ] + O m ln 1 δ = O bE[n1 ] + m ln 1 δ = O b n1 + m ln 1 δ , which completes the proof. A.4 Log-Loss A Priori Theorem 39 Proof The KL-divergence is of course non-negative. By Lemma 40, we have δ ∀ 4 S, ∑ pw ln w∈S / pw qw δ By Lemma 41 with λ, we have ∀ 4 S: 1262 ≤ M0 ln n0 ln 4m δ αn1...τ . (42) C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS ∑ pw ln λ ln 8m w∈S:pw ≤ m δ pw qw  λ ln  1−α 8m δ ≤  3 ln α + M0  + F 8m . m 1 − α λ lnm δ 8 δ (43) δ By Lemma 43 with λ, we have ∀ 2 S: ∑ pw ln λ ln 8m w∈S:pw > m δ pw qw ≤ 3 ln 8 3λ ln 8m δ δ N λ ln 8m . + √ √ δ m 2( λ − 3)2 m m (44) The proof follows by combining Equations (42), (43), and (44). Lemma 42 Proof Let f (x) = x2 2(1−∆)2 − x + ln(1 + x). Then, f (x) = f (x) = x 1 , −1+ (1 − ∆)2 1+x 1 1 − (1 + x)2 2 (1 − ∆) . 
Clearly, f(0) = f′(0) = 0. Also, f″(x) ≥ 0 for any x ∈ [−∆, ∆]. Therefore, f(x) is non-negative in the range above, and the lemma follows.

References
D. Angluin and L. G. Valiant. Fast probabilistic algorithms for Hamiltonian circuits and matchings. Journal of Computer and System Sciences, 18:155–193, 1979.
S. F. Chen. Building Probabilistic Models for Natural Language. PhD thesis, Harvard University, 1996.
S. F. Chen and J. Goodman. An empirical study of smoothing techniques for language modeling. Technical Report TR-10-98, Harvard University, 1998.
K. W. Church and W. A. Gale. A comparison of the enhanced Good-Turing and deleted estimation methods for estimating probabilities of English bigrams. Computer Speech and Language, 5:19–54, 1991.
J. R. Curran and M. Osborne. A very very large corpus doesn't always yield reliable estimates. In Proceedings of the Sixth Conference on Natural Language Learning, pages 126–131, 2002.
D. P. Dubhashi and D. Ranjan. Balls and bins: A study in negative dependence. Random Structures and Algorithms, 13(2):99–124, 1998.
P. Flajolet. Singularity analysis and asymptotics of Bernoulli sums. Theoretical Computer Science, 215:371–381, 1999.
I. J. Good. The population frequencies of species and the estimation of population parameters. Biometrika, 40(16):237–264, 1953.
I. J. Good. Turing's anticipation of empirical Bayes in connection with the cryptanalysis of the naval Enigma. Journal of Statistical Computation and Simulation, 66(2):101–112, 2000.
W. Hoeffding. On the distribution of the number of successes in independent trials. Annals of Mathematical Statistics, 27:713–721, 1956.
W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:13–30, 1963.
S. B. Holden. PAC-like upper bounds for the sample complexity of leave-one-out cross-validation. In Proceedings of the Ninth Annual ACM Workshop on Computational Learning Theory, pages 41–50, 1996.
S. M. Katz. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech and Signal Processing, 35(3):400–401, 1987.
M. Kearns and D. Ron. Algorithmic stability and sanity-check bounds for leave-one-out cross-validation. Neural Computation, 11(6):1427–1453, 1999.
S. Kutin. Algorithmic Stability and Ensemble-Based Learning. PhD thesis, University of Chicago, 2002.
D. McAllester and L. Ortiz. Concentration inequalities for the missing mass and for histogram rule error. Journal of Machine Learning Research, Special Issue on Learning Theory, 4(Oct):895–911, 2003.
D. McAllester and R. E. Schapire. On the convergence rate of Good-Turing estimators. In Proceedings of the Thirteenth Annual Conference on Computational Learning Theory, pages 1–6, 2000.
D. McAllester and R. E. Schapire. Learning theory and language modeling. In Seventeenth International Joint Conference on Artificial Intelligence, 2001.
C. McDiarmid. On the method of bounded differences. In Surveys in Combinatorics, pages 148–188. London Math. Soc. Lecture Notes 141, Cambridge University Press, 1989.
A. Orlitsky, N. P. Santhanam, and J. Zhang. Always Good Turing: Asymptotically optimal probability estimation. Science, 302(Oct):427–431, 2003.

5 0.48764667 50 jmlr-2005-Learning with Decision Lists of Data-Dependent Features

Author: Mario Marchand, Marina Sokolova

Abstract: We present a learning algorithm for decision lists which allows features that are constructed from the data and allows a trade-off between accuracy and complexity. We provide bounds on the generalization error of this learning algorithm in terms of the number of errors and the size of the classifier it finds on the training data. We also compare its performance on some natural data sets with the set covering machine and the support vector machine. Furthermore, we show that the proposed bounds on the generalization error provide effective guides for model selection. Keywords: decision list machines, set covering machines, sparsity, data-dependent features, sample compression, model selection, learning theory

6 0.43678162 36 jmlr-2005-Gaussian Processes for Ordinal Regression

7 0.3858045 71 jmlr-2005-Variational Message Passing

8 0.34088579 40 jmlr-2005-Inner Product Spaces for Bayesian Networks

9 0.31358683 16 jmlr-2005-Asymptotics in Empirical Risk Minimization

10 0.30699754 43 jmlr-2005-Learning Hidden Variable Networks: The Information Bottleneck Approach

11 0.30616996 31 jmlr-2005-Estimation of Non-Normalized Statistical Models by Score Matching

12 0.2796976 19 jmlr-2005-Clustering on the Unit Hypersphere using von Mises-Fisher Distributions

13 0.26885939 67 jmlr-2005-Stability of Randomized Learning Algorithms

14 0.22504258 1 jmlr-2005-A Bayes Optimal Approach for Partitioning the Values of Categorical Attributes

15 0.21742223 58 jmlr-2005-Multiclass Classification with Multi-Prototype Support Vector Machines

16 0.21137773 29 jmlr-2005-Efficient Margin Maximizing with Boosting

17 0.21067633 32 jmlr-2005-Expectation Consistent Approximate Inference

18 0.20961732 2 jmlr-2005-A Bayesian Model for Supervised Clustering with the Dirichlet Process Prior

19 0.20724739 64 jmlr-2005-Semigroup Kernels on Measures

20 0.2059103 70 jmlr-2005-Universal Algorithms for Learning Theory Part I : Piecewise Constant Functions


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(13, 0.015), (17, 0.528), (19, 0.018), (36, 0.033), (37, 0.019), (43, 0.027), (47, 0.012), (52, 0.074), (59, 0.015), (70, 0.035), (88, 0.119), (90, 0.014), (94, 0.012)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.89159441 15 jmlr-2005-Asymptotic Model Selection for Naive Bayesian Networks

Author: Dmitry Rusakov, Dan Geiger

Abstract: We develop a closed form asymptotic formula to compute the marginal likelihood of data given a naive Bayesian network model with two hidden states and binary features. This formula deviates from the standard BIC score. Our work provides a concrete example that the BIC score is generally incorrect for statistical models that belong to stratified exponential families. This claim stands in contrast to linear and curved exponential families, where the BIC score has been proven to provide a correct asymptotic approximation for the marginal likelihood. Keywords: Bayesian networks, asymptotic model selection, Bayesian information criterion (BIC)

2 0.89133704 53 jmlr-2005-Machine Learning Methods for Predicting Failures in Hard Drives: A Multiple-Instance Application

Author: Joseph F. Murray, Gordon F. Hughes, Kenneth Kreutz-Delgado

Abstract: We compare machine learning methods applied to a difficult real-world problem: predicting computer hard-drive failure using attributes monitored internally by individual drives. The problem is one of detecting rare events in a time series of noisy and nonparametrically-distributed data. We develop a new algorithm based on the multiple-instance learning framework and the naive Bayesian classifier (mi-NB) which is specifically designed for the low false-alarm case, and is shown to have promising performance. Other methods compared are support vector machines (SVMs), unsupervised clustering, and non-parametric statistical tests (rank-sum and reverse arrangements). The failure-prediction performance of the SVM, rank-sum and mi-NB algorithm is considerably better than the threshold method currently implemented in drives, while maintaining low false alarm rates. Our results suggest that nonparametric statistical tests should be considered for learning problems involving detecting rare events in time series data. An appendix details the calculation of rank-sum significance probabilities in the case of discrete, tied observations, and we give new recommendations about when the exact calculation should be used instead of the commonly-used normal approximation. These normal approximations may be particularly inaccurate for rare event problems like hard drive failures. Keywords: hard drive failure prediction, rank-sum test, support vector machines (SVM), exact nonparametric statistics, multiple instance naive-Bayes

3 0.41290972 72 jmlr-2005-What's Strange About Recent Events (WSARE): An Algorithm for the Early Detection of Disease Outbreaks

Author: Weng-Keen Wong, Andrew Moore, Gregory Cooper, Michael Wagner

Abstract: Traditional biosurveillance algorithms detect disease outbreaks by looking for peaks in a univariate time series of health-care data. Current health-care surveillance data, however, are no longer simply univariate data streams. Instead, a wealth of spatial, temporal, demographic and symptomatic information is available. We present an early disease outbreak detection algorithm called What’s Strange About Recent Events (WSARE), which uses a multivariate approach to improve its timeliness of detection. WSARE employs a rule-based technique that compares recent health-care data against data from a baseline distribution and finds subgroups of the recent data whose proportions have changed the most from the baseline data. In addition, health-care data also pose difficulties for surveillance algorithms because of inherent temporal trends such as seasonal effects and day of week variations. WSARE approaches this problem using a Bayesian network to produce a baseline distribution that accounts for these temporal trends. The algorithm itself incorporates a wide range of ideas, including association rules, Bayesian networks, hypothesis testing and permutation tests to produce a detection algorithm that is careful to evaluate the significance of the alarms that it raises. Keywords: anomaly detection, syndromic surveillance, biosurveillance, Bayesian networks, applications

4 0.40125048 44 jmlr-2005-Learning Module Networks

Author: Eran Segal, Dana Pe'er, Aviv Regev, Daphne Koller, Nir Friedman

Abstract: Methods for learning Bayesian networks can discover dependency structure between observed variables. Although these methods are useful in many applications, they run into computational and statistical problems in domains that involve a large number of variables. In this paper, we consider a solution that is applicable when many variables have similar behavior. We introduce a new class of models, module networks, that explicitly partition the variables into modules, so that the variables in each module share the same parents in the network and the same conditional probability distribution. We define the semantics of module networks, and describe an algorithm that learns the modules' composition and their dependency structure from data. Evaluation on real data in the domains of gene expression and the stock market shows that module networks generalize better than Bayesian networks, and that the learned module network structure reveals regularities that are obscured in learned Bayesian networks. (A preliminary version of this paper appeared in the Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence, 2003 (UAI '03).)

5 0.39932573 39 jmlr-2005-Information Bottleneck for Gaussian Variables

Author: Gal Chechik, Amir Globerson, Naftali Tishby, Yair Weiss

Abstract: The problem of extracting the relevant aspects of data was previously addressed through the information bottleneck (IB) method, through (soft) clustering one variable while preserving information about another - relevance - variable. The current work extends these ideas to obtain continuous representations that preserve relevant information, rather than discrete clusters, for the special case of multivariate Gaussian variables. While the general continuous IB problem is difficult to solve, we provide an analytic solution for the optimal representation and tradeoff between compression and relevance for the this important case. The obtained optimal representation is a noisy linear projection to eigenvectors of the normalized regression matrix Σx|y Σ−1 , which is also the basis obtained x in canonical correlation analysis. However, in Gaussian IB, the compression tradeoff parameter uniquely determines the dimension, as well as the scale of each eigenvector, through a cascade of structural phase transitions. This introduces a novel interpretation where solutions of different ranks lie on a continuum parametrized by the compression level. Our analysis also provides a complete analytic expression of the preserved information as a function of the compression (the “information-curve”), in terms of the eigenvalue spectrum of the data. As in the discrete case, the information curve is concave and smooth, though it is made of different analytic segments for each optimal dimension. Finally, we show how the algorithmic theory developed in the IB framework provides an iterative algorithm for obtaining the optimal Gaussian projections. Keywords: information bottleneck, Gaussian processes, dimensionality reduction, canonical correlation analysis

6 0.38641348 31 jmlr-2005-Estimation of Non-Normalized Statistical Models by Score Matching

7 0.38606545 20 jmlr-2005-Clustering with Bregman Divergences

8 0.36696675 46 jmlr-2005-Learning a Mahalanobis Metric from Equivalence Constraints

9 0.36653262 3 jmlr-2005-A Classification Framework for Anomaly Detection

10 0.3631942 71 jmlr-2005-Variational Message Passing

11 0.35854957 48 jmlr-2005-Learning the Kernel Function via Regularization

12 0.35597786 19 jmlr-2005-Clustering on the Unit Hypersphere using von Mises-Fisher Distributions

13 0.35540086 38 jmlr-2005-Generalization Bounds for the Area Under the ROC Curve

14 0.35124075 36 jmlr-2005-Gaussian Processes for Ordinal Regression

15 0.34915909 64 jmlr-2005-Semigroup Kernels on Measures

16 0.34233537 52 jmlr-2005-Loopy Belief Propagation: Convergence and Effects of Message Errors

17 0.34165695 22 jmlr-2005-Concentration Bounds for Unigram Language Models

18 0.34110758 63 jmlr-2005-Quasi-Geodesic Neural Learning Algorithms Over the Orthogonal Group: A Tutorial

19 0.33847657 49 jmlr-2005-Learning the Kernel with Hyperkernels     (Kernel Machines Section)

20 0.33700711 56 jmlr-2005-Maximum Margin Algorithms with Boolean Kernels