
Concentration Bounds for Unigram Language Models


Source: pdf

Author: Evgeny Drukh, Yishay Mansour

Abstract: We show several high-probability concentration bounds for learning unigram language models. One interesting quantity is the probability of all words appearing exactly k times in a sample of size m. A standard estimator for this quantity is the Good-Turing estimator. The existing analysis of its error shows a high-probability bound of approximately O(k/√m). We improve its dependency on k to O(k^{1/4}/√m + k/m). We also analyze the empirical frequencies estimator, showing that with high probability its error is bounded by approximately O(1/k + √k/m). We derive a combined estimator, which has an error of approximately O(m^{-2/5}), for any k. A standard measure for the quality of a learning algorithm is its expected per-word log-loss. The leave-one-out method can be used for estimating the log-loss of the unigram model. We show that its error has a high-probability bound of approximately O(1/√m), for any underlying distribution. We also bound the log-loss a priori, as a function of various parameters of the distribution.

Keywords: Good-Turing estimators, logarithmic loss, leave-one-out estimation, Chernoff bounds

1. Introduction and Overview

Natural language processing (NLP) has developed rapidly over the last decades. It has a wide range of applications, including speech recognition, optical character recognition, text categorization, and many more. The theoretical analysis has also advanced significantly, though many fundamental questions remain unanswered. One clear challenge, both practical and theoretical, concerns deriving stochastic models for natural languages.

Consider a simple language model, where the distribution of each word in the text is assumed to be independent. Even for such a simplistic model, fundamental questions relating sample size to learning accuracy are already challenging. This is mainly due to the fact that the sample size is almost always insufficient, regardless of how large it is.

To demonstrate this phenomenon, consider the following example. We would like to estimate the distribution of first names in the university. For that, we are given the names list of a graduate seminar: Alice, Bob, Charlie, Dan, Eve, Frank, two Georges, and two Henries. How can we use this sample to estimate the distribution of students' first names? An empirical frequency estimator would assign Alice the probability of 0.1, since there is one Alice in the list of 10 names, while George, appearing twice, would get an estimate of 0.2. Unfortunately, unseen names, such as Michael, would get an estimate of 0.

Clearly, in this simple example the empirical frequencies are unlikely to estimate the desired distribution well. In general, the empirical frequencies estimate well the probabilities of popular names, but are rather inaccurate for rare names. Is there a sample size which assures us that all the names (or most of them) will appear enough times to allow accurate probability estimation? The distribution of first names can be conjectured to follow Zipf's law. In such distributions, there will be a significant fraction of rare items, as well as a considerable number of non-appearing items, in any sample of reasonable size. The same holds for language unigram models, which try to estimate the distribution of single words.
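To make the names example concrete, the following is a minimal Python sketch of the empirical frequency (maximum-likelihood) estimator. The name list and the function name are illustrative and not part of the paper.

```python
from collections import Counter

def empirical_frequencies(sample):
    """Empirical (maximum-likelihood) estimator: q_w = c_w / m."""
    m = len(sample)
    return {w: c / m for w, c in Counter(sample).items()}

names = ["Alice", "Bob", "Charlie", "Dan", "Eve", "Frank",
         "George", "George", "Henry", "Henry"]
q = empirical_frequencies(names)
print(q["Alice"])             # 0.1
print(q["George"])            # 0.2
print(q.get("Michael", 0.0))  # 0.0 -- unseen names get probability zero
```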
As has been observed empirically on many occasions (Chen, 1996; Curran and Osborne, 2002), there are always many rare words and a considerable number of unseen words, regardless of the sample size. Given this observation, a fundamental issue is to estimate the distribution the best way possible.

1.1 Good-Turing Estimators

An important quantity, given a sample, is the probability mass of unseen words (also called "the missing mass"). Several methods exist for smoothing the probability and assigning probability mass to unseen items. The almost standard method for estimating the missing probability mass is the Good-Turing estimator. It estimates the missing mass as the number of items appearing exactly once, divided by the sample size. In the names example above, the Good-Turing missing mass estimator equals 0.6, meaning that the list of the class names does not reflect the true distribution, to put it mildly. The Good-Turing estimator can be extended to higher orders, that is, to estimating the probability of all names appearing exactly k times. Such estimators can also be used for estimating the probability of individual words.

The Good-Turing estimators date back to World War II, and were first published in 1953 (Good, 1953, 2000). They have been used extensively in language modeling applications since then (Katz, 1987; Church and Gale, 1991; Chen, 1996; Chen and Goodman, 1998). However, their theoretical convergence rate in various models has been studied only in recent years (McAllester and Schapire, 2000, 2001; Kutin, 2002; McAllester and Ortiz, 2003; Orlitsky et al., 2003). For the estimation of the probability of all words appearing exactly k times in a sample of size m, McAllester and Schapire (2000) derive a high-probability bound on the Good-Turing estimator error of approximately O(k/√m). One of our main results improves the dependency on k of this bound to approximately O(k^{1/4}/√m + k/m). We also show that the empirical frequencies estimator has an error of approximately O(1/k + √k/m) for large values of k. Based on the two estimators, we derive a combined estimator with an error of approximately O(m^{-2/5}), for any k. We also derive a weak lower bound of Ω(k^{1/4}/√m) for the error of any estimator based on an independent sample.

Our results give theoretical justification for using the Good-Turing estimator for small values of k, and the empirical frequencies estimator for large values of k. Though in most applications the Good-Turing estimator is used for very small values of k, for example k ≤ 5, as by Katz (1987) or Chen (1996), we show that it is fairly accurate in a much wider range.

1.2 Logarithmic Loss

The Good-Turing estimators are used to approximate the probability mass of all the words with a certain frequency. For many applications, estimating this probability mass is not the main optimization criterion. Instead, a certain distance measure between the true and the estimated distributions needs to be minimized.

The most popular distance measure used in NLP applications is the Kullback-Leibler (KL) divergence. For a true distribution P = {p_x} and an estimated distribution Q = {q_x}, both over some set X, this measure is defined as ∑_x p_x ln(p_x/q_x). An equivalent measure, up to the entropy of P, is the logarithmic loss (log-loss), which equals ∑_x p_x ln(1/q_x). Many NLP applications use the value of the log-loss to evaluate the quality of the estimated distribution.
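The quantities just introduced are simple to compute. The sketch below implements the Good-Turing estimator in the variant used later in the paper, G_k = (k+1) n_{k+1} / (m − k) (Definition 14; the classical form divides by m instead of m − k), along with the KL divergence and the log-loss. Applied to the names example, G_0 recovers the 0.6 missing-mass estimate; the helper names are my own.

```python
import math
from collections import Counter

def good_turing(sample, k):
    """Good-Turing estimate of the total probability of words appearing exactly
    k times: G_k = (k + 1) * n_{k+1} / (m - k), where n_j is the number of
    distinct words appearing exactly j times in the sample."""
    m = len(sample)
    freq_of_freq = Counter(Counter(sample).values())   # n_j = |{w : c_w = j}|
    return (k + 1) * freq_of_freq.get(k + 1, 0) / (m - k)

def kl_divergence(p, q):
    """KL(P||Q) = sum_x p(x) * ln(p(x) / q(x))."""
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

def log_loss(p, q):
    """Log-loss sum_x p(x) * ln(1 / q(x)) = KL(P||Q) + entropy(P)."""
    return sum(px * math.log(1.0 / q[x]) for x, px in p.items() if px > 0)

names = ["Alice", "Bob", "Charlie", "Dan", "Eve", "Frank",
         "George", "George", "Henry", "Henry"]
print(good_turing(names, 0))   # missing mass: n_1 / m = 6 / 10 = 0.6
print(good_turing(names, 1))   # mass of once-seen words: 2 * n_2 / 9 ≈ 0.44
```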
However, the log-loss cannot be directly calculated, since it depends on the underlying distribution, which is unknown. Therefore, estimating the log-loss using the sample is important, although the sample cannot be independently used both for estimating the distribution and for testing it.

The hold-out estimation splits the sample into two parts: training and testing. The training part is used for learning the distribution, whereas the testing sample is used for evaluating the average per-word log-loss. The main disadvantage of this method is the fact that it uses only part of the available information for learning, whereas in practice one would like to use the whole sample.

A widely used general estimation method is called leave-one-out. Basically, it averages all the possible estimations in which a single item is chosen for testing and the rest are used for training. This procedure has the advantage of using the entire sample, and in addition it is rather simple and can usually be easily implemented. The existing theoretical analysis of the leave-one-out method (Holden, 1996; Kearns and Ron, 1999) shows general high-probability concentration bounds for the generalization error. However, these techniques are not applicable in our setting.

We show that the leave-one-out estimation error for the log-loss is approximately O(1/√m), for any underlying distribution and a general family of learning algorithms. This gives a theoretical justification for the effective use of leave-one-out estimation for the log-loss. We also analyze the concentration of the log-loss itself, not based on an empirical measure. We address the characteristics of the underlying distribution affecting the log-loss, and find such a characteristic, defining a tight bound for the log-loss value.

1.3 Model and Semantics

We denote the set of all words as V, and N = |V|. Let P be a distribution over V, where p_w is the probability of a word w ∈ V. Given a sample S of size m, drawn i.i.d. using P, we denote the number of appearances of a word w in S as c_w^S, or simply c_w, when the sample S is clear from the context.1 We define S_k = {w ∈ V : c_w^S = k}, and n_k = |S_k|.

For a claim Φ regarding a sample S, we write ∀^δ S: Φ[S] for P(Φ[S]) ≥ 1 − δ. For some error bound function f(·), which holds with probability 1 − δ, we write Õ(f(·)) for O(f(·) ln^c(m/δ)), where c > 0 is some constant.

1. Unless mentioned otherwise, all further sample-dependent definitions depend on the sample S.

1.4 Paper Organization

Section 2 shows several standard concentration inequalities, together with their technical applications regarding the maximum-likelihood approximation. Section 3 shows the error bounds for the k-hitting mass estimation. Section 4 bounds the error for the leave-one-out estimation of the logarithmic loss. Section 5 shows the bounds for the a priori logarithmic loss. Appendix A includes the technical proofs.

2. Concentration Inequalities

In this section we state several standard Chernoff-style concentration inequalities. We also show some of their corollaries regarding the maximum-likelihood approximation of p_w by p̂_w = c_w/m.

Lemma 1 (Hoeffding, 1963) Let Y = Y1, . . . , Yn be a set of n independent random variables, such that Yi ∈ [bi, bi + di]. Then, for any ε > 0, P(|∑i Yi − E[∑i Yi]| > ε) ≤ 2 exp(−2ε² / ∑i di²).

The next lemma is a variant of an extension of Hoeffding's inequality, by McDiarmid (1989).

Lemma 2 Let Y = Y1 , . . .
,Yn be a set of n independent random variables, and f (Y ) such that any change of Yi value changes f (Y ) by at most di , that is sup ∀ j=i,Y j =Y j (| f (Y ) − f (Y )|) ≤ di . Let d = maxi di . Then, ∀δY : | f (Y ) − E[ f (Y )]| ≤ d n ln 2 δ . 2 Lemma 3 (Angluin and Valiant, 1979) Let Y = Y1 , . . . ,Yn be a set of n independent random variables, where Yi ∈ [0, B]. Let µ = E [∑i Yi ]. Then, for any ε > 0, P ∑ Yi < µ − ε ≤ exp − ε2 , 2µB ∑ Yi > µ + ε ≤ exp − ε2 . (2µ + ε)B i P i Definition 4 (Dubhashi and Ranjan, 1998) A set of random variables Y1 , . . . ,Yn is called “negatively associated”, if it satisfies for any two disjoint subsets I and J of {1, . . . , n}, and any two non-decreasing, or any two non-increasing, functions f from R|I| to R and g from R|J| to R: E[ f (Yi : i ∈ I)g(Y j : j ∈ J)] ≤ E[ f (Yi : i ∈ I)]E[g(Y j : j ∈ J)]. The next lemma is based on the negative association analysis. It follows directly from Theorem 14 and Proposition 7 of Dubhashi and Ranjan (1998). 1234 C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS Lemma 5 For any set of N non-decreasing, or N non-increasing functions { fw : w ∈ V }, any Chernoff-style bound on ∑w∈V fw (cw ), pretending that cw are independent, is valid. In particular, Lemmas 1 and 2 apply for {Y1 , ...,Yn } = { fw (cw ) : w ∈ V }. The next lemma shows an explicit upper bound on the binomial distribution probability.2 Lemma 6 Let X ∼ Bin(n, p) be a sum of n i.i.d. Bernoulli random variables with p ∈ (0, 1). Let 1 µ = E[X] = np. For x ∈ (0, n], there exists some function Tx = exp 12x + O x12 , such that ∀k ∈ Tn 1 {0, . . . , n}, we have P(X = k) ≤ √ Tµ Tn−µ . For integral values of µ, the equality is achieved 2πµ(1−p) at k = µ. (Note that for x ≥ 1, we have Tx = Θ(1).) The next lemma deals with the number of successes in independent trials. Lemma 7 (Hoeffding, 1956) Let Y1 , . . . ,Yn ∈ {0, 1} be a sequence of independent trials, with pi = 1 E[Yi ]. Let X = ∑i Yi be the number of successes, and p = n ∑i pi be the average trial success probability. For any integers b and c such that 0 ≤ b ≤ np ≤ c ≤ n, we have c ∑ k=b n k p (1 − p)n−k ≤ P (b ≤ X ≤ c) ≤ 1. k Using the above lemma, the next lemma shows a general concentration bound for a sum of arbitrary real-valued functions of a multinomial distribution components. We show that with a small penalty, any Chernoff-style bound pretending the components being independent is valid.3 We recall that cS , or equivalently cw , is the number of appearances of the word w in a sample S of w size m. Lemma 8 Let {cw ∼ Bin(m, pw ) : w ∈ V } be independent binomial random variables. Let { fw (x) : w ∈ V } be a set of real valued functions. Let F = ∑w fw (cw ) and F = ∑w fw (cw ). For any ε > 0, √ P (|F − E [F]| > ε) ≤ 3 m P F − E F >ε . The following lemmas provide concentration bounds for maximum-likelihood estimation of pw by pw = cw . The first lemma shows that words with “high” probability have a “high” count in the ˆ m sample. Lemma 9 Let δ > 0, and λ ≥ 3. We have ∀δ S: ∀w ∈ V, s.t. mpw ≥ 3 ln 2m , δ |mpw − cw | ≤ ∀w ∈ V, s.t. mpw > λ ln 2m , δ cw > 1− 3mpw ln 3 λ 2m ; δ mpw . 2. Its proof is based on Stirling approximation directly, though local limit theorems could be used. This form of bound is needed for the proof of Theorem 30. 3. The negative association analysis (Lemma 5) shows that a sum of monotone functions of multinomial distribution components must obey Chernoff-style bounds pretending that the components are independent. 
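The maximum-likelihood concentration statements above lend themselves to a quick numerical sanity check. The sketch below probes the per-word bound of Lemma 9, |c_w/m − p_w| ≤ sqrt(3 p_w ln(2m/δ) / m) for a word with m p_w ≥ 3 ln(2m/δ); the particular p_w, m, δ, and trial count are arbitrary choices of mine, and only a single word is checked rather than the whole vocabulary.

```python
import math
import random

def check_word_concentration(p_w=0.05, m=2000, delta=0.05, trials=500, seed=0):
    """Fraction of simulated samples in which |c_w/m - p_w| exceeds the
    Lemma 9-style bound sqrt(3 * p_w * ln(2*m/delta) / m) for a single word
    with m * p_w >= 3 * ln(2*m/delta); it should stay far below delta."""
    rng = random.Random(seed)
    assert m * p_w >= 3.0 * math.log(2.0 * m / delta), "word not 'high probability'"
    bound = math.sqrt(3.0 * p_w * math.log(2.0 * m / delta) / m)
    failures = 0
    for _ in range(trials):
        c_w = sum(rng.random() < p_w for _ in range(m))  # c_w ~ Bin(m, p_w)
        if abs(c_w / m - p_w) > bound:
            failures += 1
    return failures / trials

print(check_word_concentration())  # typically 0.0
```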
In some sense, our result extends this notion, since it does not require the functions to be monotone. 1235 D RUKH AND M ANSOUR The second lemma shows that words with “low” probability have a “low” count in the sample. Lemma 10 Let δ ∈ (0, 1), and m > 1. Then, ∀δ S: ∀w ∈ V such that mpw ≤ 3 ln m , we have cw ≤ δ 6 ln m . δ The following lemma derives the bound as a function of the count in the sample (and not as a function of the unknown probability). Lemma 11 Let δ > 0. Then, ∀δ S: cw > 18 ln 4m , δ ∀w ∈ V, s.t. |mpw − cw | ≤ 6cw ln 4m . δ The following is a general concentration bound. Lemma 12 For any δ > 0, and any word w ∈ V , we have ∀δ S, cw − pw < m 3 ln 2 δ . m The following lemma bounds the probability of words that do not appear in the sample. Lemma 13 Let δ > 0. Then, ∀δ S: ∀w ∈ S, / mpw < ln m . δ 3. K-Hitting Mass Estimation In this section our goal is to estimate the probability of the set of words appearing exactly k times in the sample, which we call “the k-hitting mass”. We analyze the Good-Turing estimator, the empirical frequencies estimator, and a combined estimator. ˆ Definition 14 We define the k-hitting mass Mk , its empirical frequencies estimator Mk , and its 4 Good-Turing estimator Gk as Mk = ∑ w∈Sk pw ˆ Mk = k nk m Gk = k+1 nk+1 . m−k The outline of this section is as follows. Definition 16 slightly redefines the k-hitting mass and its estimators. Lemma 17 shows that this redefinition has a negligible influence. Then, we analyze the estimation errors using the concentration inequalities from Section 2. Lemmas 20 and 21 bound the expectation of the Good-Turing estimator error, following McAllester and Schapire (2000). Lemma 23 bounds the deviation of the error, using the negative association analysis. A tighter bound, based on Lemma 8, is achieved at Theorem 25. Theorem 26 analyzes the error of the empirical frequencies estimator. Theorem 29 refers to the combined estimator. Finally, Theorem 30 shows a weak lower bound for the k-hitting mass estimation. 4. The Good-Turing estimator is usually defined as ( k+1 )nk+1 . The two definitions are almost identical for small values m k of k, as their quotient equals 1 − m . Following McAllester and Schapire (2000), our definition makes the calculations slightly simpler. 1236 C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS Definition 15 For any w ∈ V and i ∈ {0, · · · , m}, we define Xw,i as a random variable equal 1 if cw = i, and 0 otherwise. The following definition concentrates on words whose frequencies are close to their probabilities. Definition 16 Let α > 0 and k > 3α2 . We define Ik,α = pw ∈ Ik,α }. We define: Mk,α = ∑ pw = w∈Sk ∩Vk,α ∑ √ √ k−α k k+1+α k+1 m , m , and Vk,α = {w ∈ V : pw Xw,k , w∈Vk,α Gk,α = k+1 k+1 |Sk+1 ∩Vk,α | = ∑ Xw,k+1 , m−k m − k w∈Vk,α ˆ Mk,α = k k |Sk ∩Vk,α | = ∑ Xw,k . m m w∈Vk,α By Lemma 11, for large values of k the redefinition coincides with the original definition with high probability: Lemma 17 For δ > 0, let α = ˆ ˆ and Mk = Mk,α . 6 ln 4m . For k > 18 ln 4m , we have ∀δ S: Mk = Mk,α , Gk = Gk,α , δ δ Proof By Lemma 11, we have ∀δ S, ∀w : cw > 18 ln 4m , δ |mpw − cw | ≤ 6cw ln √ 4m = α cw . δ This means that any word w with cw = k has √ √ √ k−α k k+α k k+1+α k+1 ≤ pw ≤ < . m m m √ ˆ Therefore w ∈ Vk,α , completing the proof for Mk and Mk . Since α < k, any word w with cw = k + 1 has √ √ √ k−α k k+1−α k+1 k+1+α k+1 < ≤ pw ≤ , m m m which yields w ∈ Vk,α , completing the proof for Gk . Since the minimal probability of a word in Vk,α is Ω Lemma 18 Let α > 0 and k > 3α2 . 
Then, |Vk,α | = O 1237 m k k m . , we derive: D RUKH AND M ANSOUR Proof We have α < √ √k . 3 Any word w ∈ Vk,α has pw ≥ |Vk,α | < √ k−α k m > k m 1 1 − √3 . Therefore, m m 1 , =O 1 k 1− √ k 3 which completes the proof. Using Lemma 6, we derive: Lemma 19 Let α > 0 and 3α2 < k ≤ m . Let w ∈ Vk,α . Then, E[Xw,k ] = P(cw = k) = O 2 1 √ k . Proof Since cw ∼ Bin(m, pw ) is a binomial random variable, we use Lemma 6: E[Xw,k ] = P(cw = k) ≤ Tm 1 . 2πmpw (1 − pw ) Tmpw Tm(1−pw ) For w ∈ Vk,α , we have mpw = Ω(k), which implies 3α2 < k ≤ m 2, Tm Tmpw Tm(1−pw ) we have 1 2πmpw (1 − pw ) ≤ = O(1). Since pw ∈ Ik,α and 1 √ 2π k − α k √ k+1+α k+1 m 1− 1 < 1 2πk 1 − √3 1 1 − k+1 1 + √3 m 1 < 1 2πk 1 − √3 1− 1 2 1 +m 1 1 + √3 1 = O √ , k which completes the proof. 3.1 Good-Turing Estimator The following lemma, directly based on the definition of the binomial distribution, was shown in Theorem 1 of McAllester and Schapire (2000). Lemma 20 For any k < m, and w ∈ V , we have pw P(cw = k) = k+1 P(cw = k + 1)(1 − pw ). m−k The following lemma bounds the expectations of the redefined k-hitting mass, its Good-Turing estimator, and their difference. 1238 C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS Lemma 21 Let α > 0 and 3α2 < k < |E[Gk,α ] − E[Mk,α ]| = O √ k m m 2. We have E[Mk,α ] = O 1 √ k , E[Gk,α ] = O 1 √ k , and . Lemma 22 Let δ > 0, k ∈ {1, . . . , m}. Let U ⊆ V , such that |U| = O m . Let {bw : w ∈ U}, such k k that ∀w ∈ U, bw ≥ 0 and maxw∈U bw = O m . Let Xk = ∑w∈U bw Xw,k . We have ∀δ S:  |Xk − E[Xk ]| = O   k ln 1 δ . m Proof We define Yw,k = ∑i≤k Xw,i be random variable indicating cw ≤ k and Zw,k = ∑i < k. Let Yk = ∑w∈U bwYw,k and Zk = ∑w∈U bw Zw,k . We have ∑ bw Xw,k = ∑ bw [Yw,k − Zw,k ] = Yk − Zk . Xk = w∈U w∈U Both Yk and Zk , can be bounded using the Hoeffding inequality. Since {bwYw,k } and {bw Zw,k } are monotone with respect to {cw }, Lemma 5 applies for them. This means that the concentration of their sum is at least as tight as if they were independent. Recalling that |U| = O m and k k maxw∈U bw = O m , and using Lemma 2 for Yk and Zk , we have δ ∀ 2 S, |Yk − E[Yk ]| = O δ ∀ 2 S, |Zk − E[Zk ]| = O k m m k ln 1 , δ k m m k ln 1 . δ Therefore, |Xk − E[Xk ]| = |Yk − Zk − E[Yk − Zk ]| which completes the proof.  ≤ |Yk − E[Yk ]| + |Zk − E[Zk ]| = O   k ln 1 δ , m Using the negative association notion, we can show a preliminary bound for Good-Turing estimation error: Lemma 23 For δ > 0 and 18 ln 8m < k < m , we have ∀δ S: 2 δ   k ln 1 δ |Gk − Mk | = O  . m 1239 D RUKH AND M ANSOUR Proof Let α = 6 ln 8m . By Lemma 17, we have δ δ ∀ 2 S, Gk = Gk,α ∧ Mk = Mk,α . (1) By Lemma 21, |E[Gk − Mk ]| = |E[Gk,α − Mk,α ]| = O √ k m . (2) k+1 By Definition 16, Mk,α = ∑w∈Vk,α pw Xw,k and Gk,α = ∑w∈Vk,α m−k Xw,k+1 . By Lemma 18, we have |Vk,α | = O m . Therefore, using Lemma 22 with k for Mk,α , and with k + 1 for Gk,α , we have k δ ∀ 4 S, |Mk,α − E[Mk,α ]| = O δ ∀ 4 S, |Gk,α − E[Gk,α ]| = O k ln 1 δ m , (3) k ln 1 δ m . (4) Combining Equations (1), (2), (3), and (4), we have ∀δ S: |Gk − Mk | = |Gk,α − Mk,α | ≤ |Gk,α − E[Gk,α ]| + |Mk,α − E[Mk,α ]| + |E[Gk,α ] − E[Mk,α ]|     √ k ln 1 k ln 1 k δ δ = O +O = O , m m m which completes the proof. Lemma 24 Let δ > 0, k > 0. Let U ⊆ V . Let {bw : w ∈ U} be a set of weights, such that bw ∈ [0, B]. Let Xk = ∑w∈U bw Xw,k , and µ = E[Xk ]. We have δ ∀ S, |Xk − µ| ≤ max √ √ 6 m 6 m 4Bµ ln , 2B ln δ δ . 
Proof By Lemma 8, combined with Lemma 3, we have ε2 B(2µ + ε) √ √ ε2 ε ≤ max 6 m exp − , 6 m exp − 4Bµ 2B √ P(|Xk − µ| > ε) ≤ 6 m exp − 1240 , (5) C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS where Equation (5) follows by considering ε ≤ 2µ and ε > 2µ separately. The lemma follows sub- stituting ε = max 4Bµ ln √ 6 m δ , 2B ln √ 6 m δ . We now derive the concentration bound on the error of the Good-Turing estimator. Theorem 25 For δ > 0 and 18 ln 8m < k < m , we have ∀δ S: 2 δ √  |Gk − Mk | = O  Proof Let α = k ln m m δ + k ln m  m δ  . δ 6 ln 8m . Using Lemma 17, we have ∀ 2 S: Gk = Gk,α , and Mk = Mk,α . Recall that δ k+1 Mk,α = ∑w∈Vk,α pw Xw,k and Gk,α = ∑w∈Vk,α m−k Xw,k+1 . Both Mk,α and Gk,α are linear combinations k of Xw,k and Xw,k+1 , respectively, where the coefficients’ magnitude is O m , and the expectation, by Lemma 21, is O 1 √ k . By Lemma 24, we have δ √ k ln m δ m + k ln m δ m , (6) δ 4 √ k ln m δ m + k ln m δ m . (7) ∀ 4 S, |Mk,α − E[Mk,α ]| = O ∀ S, |Gk,α − E[Gk,α ]| = O Combining Equations (6), (7), and Lemma 21, we have ∀δ S: |Gk − Mk | = |Gk,α − Mk,α | ≤ |Gk,α − E[Gk,α ]| + |Mk,α − E[Mk,α ]| + |E[Gk,α ] − E[Mk,α ]|  √  √  √  k ln m k ln m k ln m k ln m k δ δ δ δ  + + + = O = O , m m m m m which completes the proof. 3.2 Empirical Frequencies Estimator ˆ In this section we bound the error of the empirical frequencies estimator Mk . Theorem 26 For δ > 0 and 18 ln 8m < k < m , we have 2 δ ∀δ S, √ k ln m δ ˆ |Mk − Mk | = O  m 1241 3 2 +  ln m δ  . k D RUKH AND M ANSOUR Proof Let α = δ − ˆ ˆ 6 ln 8m . By Lemma 17, we have ∀ 2 S: Mk = Mk,α , and Mk = Mk,α . Let Vk,α = δ + k k {w ∈ Vk,α : pw < m }, and Vk,α = {w ∈ Vk,α : pw > m }. Let X− = k − pw Xw,k , m ∑ − w∈Vk,α X+ = ∑ + w∈Vk,α pw − k Xw,k , m and let X? specify either X− or X+ . By the definition, for w ∈ Vk,α we have By Lemma 18, |Vk,α | = O m k |E[X? ]| ≤ k m − pw = O . By Lemma 19, for w ∈ Vk,α we have E[Xw,k ] = O ∑ w∈Vk,α k − pw E[Xw,k ] = O m √ mα k 1 √ k m k =O 1 √ k expectation is O α k . . Therefore, α . k Both X− and X+ are linear combinations of Xw,k , where the coefficients are O √ α k m (8) √ α k m and the . Therefore, by Lemma 24, we have δ 4 ∀ S: |X? − E[X? ]| = O √ α3 k α4 √ + m m k . (9) ˆ By the definition of X− and X+ , Mk,α − Mk,α = X+ − X− . Combining Equations (8) and (9), we δ S: have ∀ ˆ ˆ |Mk − Mk | = |Mk,α − Mk,α | = |X+ − X− | ≤ |X+ − E[X+ ]| + |E[X+ ]| + |X− − E[X− ]| + |E[X− ]| √ 3 √ 3 k 4 k ln m 2 α α α δ √ + = O = O + + m k m m k since √ ab = O(a + b), and we use a = √ α3 k m and b = α . k  ln m δ  , k 3.3 Combined Estimator In this section we combine the Good-Turing estimator with the empirical frequencies to derive a combined estimator, which is uniformly accurate for all values of k. ˜ Definition 27 We define Mk , a combined estimator for Mk , by ˜ Mk = Gk ˆ Mk 1242 2 k ≤ m5 2 k > m5 . C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS Lemma 28 (McAllester and Schapire, 2000) Let k ∈ {0, . . . , m}. For any δ > 0, we have  ln 1 m  δ . k + ln m δ  ∀δ S : |Gk − Mk | = O  2 ˜ ˜ The following theorem shows that Mk has an error bounded by O m− 5 , for any k. For small k, 2 2 we use Lemma 28. Theorem 25 is used for 18 ln 8m < k ≤ m 5 . Theorem 26 is used for m 5 < k < m . 2 δ ˜ The complete proof also handles k ≥ m . The theorem refers to Mk as a probability estimator, and 2 does not show that it is a probability distribution by itself. Theorem 29 Let δ > 0. For any k ∈ {0, . . . , m}, we have ˜ ˜ ∀δ S, |Mk − Mk | = O m− 5 . 
2 The following theorem shows a weak lower bound for approximating Mk . It applies to estimating Mk based on a different independent sample. This is a very “weak” notation, since Gk , as well ˆ as Mk , are based on the same sample as Mk . Theorem 30 Suppose that the vocabulary consists of k m ), where 1 k √ m. The variance of Mk is Θ k m m k words distributed uniformly (that is pw = . 4. Leave-One-Out Estimation of Log-Loss Many NLP applications use log-loss as the learning performance criteria. Since the log-loss depends on the underlying probability P, its value cannot be explicitly calculated, and must be approximated. The main result of this section, Theorem 32, is an upper bound on the leave-one-out estimation of the log-loss, assuming a general family of learning algorithms. Given a sample S = {s1 , . . . , sm }, the goal of a learning algorithm is to approximate the true probability P by some probability Q. We denote the probability assigned by the learning algorithm to a word w by qw . Definition 31 We assume that any two words with equal sample frequency are assigned equal probabilities in Q, and therefore denote qw by q(cw ). Let the log-loss of a distribution Q be L = 1 1 ∑ pw ln qw = ∑ Mk ln q(k) . w∈V k≥0 Let the leave-one-out estimation, qw , be the probability assigned to w, when one of its instances is removed. We assume that any two words with equal sample frequency are assigned equal leaveone-out probability estimation, and therefore denote qw by q (cw ). We define the leave-one-out estimation of the log-loss as averaging the loss of each sample word, when it is extracted from the sample and pretended to be the test sample: 1243 D RUKH AND M ANSOUR ∑ Lleave−one = w∈V cw knk 1 1 =∑ ln ln . m qw k>0 m q (k) 1 1 Let Lw = L(cw ) = ln q(cw ) , and Lw = L (cw ) = ln q (cw ) . Let the maximal loss be Lmax = max max L(k), L (k + 1) . k In this section we discuss a family of learning algorithms, that receive the sample as an input. Assuming an accuracy parameter δ, we require the following properties to hold: 1. Starting from a certain number of appearances, the estimation is close to the sample frequency. Specifically, for some α, β ∈ [0, 1], ∀k ≥ ln 4m , δ q(k) = k−α . m−β (10) 2. The algorithm is stable when a single word is extracted from the sample: ∀m, 1 , m 1 L (k + 1) − L(k) = O S . n1 2 ≤ k ≤ 10 ln 4m , δ L (k + 1) − L(k) = O ∀m, ∀S s.t. nS > 0, k ∈ {0, 1}, 1 (11) (12) An example of such an algorithm is the following leave-one-out algorithm (we assume that the vocabulary is large enough so that n0 + n1 > 0): qw = N−n0 −1 (n0 +n1 )(m−1) cw −1 m−1 cw ≤ 1 cw ≥ 2. Equation (10) is satisfied by α = β = 1. Equation (11) is satisfied for k ≥ 2 by L(k) − L (k + 1) = 1 = O m . Equation (12) is satisfied for k ≤ 1: ln m−1 m−2 N − n0 − 1 m − 2 N − n0 − 2 m − 1 n0 + n1 + 1 m − 2 |L (2) − L(1)| = ln n0 + n1 m − 1 |L (1) − L(0)| = ln 1 1 1 =O + , N − n0 m n1 1 1 1 =O + . =O n0 + n1 m n1 =O The following is the main theorem of this section. It bounds the deviation between the between the true loss and the leave one out estimate. This bound shows that for a general family of learning algorithms, leave-one-out technique can be effectively used to estimate the logarithmic loss, given the sample only. The estimation error bound decreases roughly in proportion to the square root of the sample size, regardless of the underlying distribution. 
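The leave-one-out estimator is easy to state in code. The sketch below retrains the section's example learning algorithm on each reduced sample, under my reading of Definition 31 and the example above; the helper names and the vocabulary size are illustrative assumptions, not the paper's notation.

```python
import math
from collections import Counter

def example_learner(counts, m, vocab_size):
    """One reading of the example algorithm in Section 4: a word seen c >= 2
    times gets q = (c - 1) / (m - 1); the remaining mass is shared uniformly by
    the n0 unseen and n1 once-seen words (the text assumes n0 + n1 > 0).
    Returns a map from a word's count to its assigned probability."""
    n1 = sum(1 for c in counts.values() if c == 1)
    n0 = vocab_size - len(counts)
    low = (vocab_size - n0 - 1) / ((n0 + n1) * (m - 1))
    return lambda c: (c - 1) / (m - 1) if c >= 2 else low

def leave_one_out_log_loss(sample, vocab_size):
    """L_leave-one-out = (1/m) * sum over sample occurrences of ln(1 / q'_w),
    where q'_w is the probability assigned to w by the learner trained on the
    sample with that single occurrence of w removed."""
    m = len(sample)
    counts = Counter(sample)
    total = 0.0
    for w, c in counts.items():
        reduced = Counter(counts)
        reduced[w] -= 1
        if reduced[w] == 0:
            del reduced[w]
        q = example_learner(reduced, m - 1, vocab_size)
        total += c * math.log(1.0 / q(c - 1))  # all c occurrences give the same term
    return total / m

names = ["Alice", "Bob", "Charlie", "Dan", "Eve", "Frank",
         "George", "George", "Henry", "Henry"]
print(leave_one_out_log_loss(names, vocab_size=1000))  # per-word log-loss estimate
```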
1244 C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS Theorem 32 For a learning algorithm satisfying Equations (10), (11), and (12), and δ > 0, we have: δ ∀ S, |L − Lleave−one | = O Lmax (ln m )4 ln δ m ln m δ δ . The proof of Theorem 32 bounds the estimation error separately for the high-probability and low-probability words. We use Lemma 20 (McAllester and Schapire, 2000) to bound the estimation error for low-probability words. The expected estimation error for the high-probability words is bounded elementarily using the definition of the binomial distribution (Lemma 33). Finally, we use McDiarmid’s inequality (Lemma 2) to bound its deviation. The next lemma shows that the expectation of the leave-one-out method is a good approximation for the per-word expectation of the logarithmic loss. Lemma 33 Let 0 ≤ α ≤ 1, and y ≥ 1. Let Bn ∼ Bin(n, p) be a binomial random variable. Let fy (x) = ln(max(x, y)). Then, 0 ≤ E p fy (Bn − α) − 3p Bn fy (Bn − α − 1) ≤ . n n Proof For a real valued function F (here F(x) = fy (x − α)), we have: E Bn F(Bn − 1) n n = ∑ x=0 n = p∑ x=1 n x x p (1 − p)n−x F(x − 1) x n n − 1 x−1 p (1 − p)(n−1)−(x−1) F(x − 1) x−1 = pE[F(Bn−1 )] , where we used n x x n = E p fy (Bn − α) − n−1 x−1 . Since Bn ∼ Bn−1 + B1 , we have: Bn fy (Bn − α − 1) n = p(E[ fy (Bn−1 + B1 − α)] − E[ fy (Bn−1 − α)]) max(Bn−1 + B1 − α, y) max(Bn−1 − α, y) max(Bn−1 − α + B1 , y + B1 ) ≤ pE ln max(Bn−1 − α, y) B1 = pE ln(1 + ) max(Bn−1 − α, y) B1 ≤ pE . max(Bn−1 − α, y) = pE ln Since B1 and Bn−1 are independent, we get 1245 D RUKH AND M ANSOUR pE B1 1 = pE[B1 ]E max(Bn−1 − α, y) max(Bn−1 − α, y) 1 = p2 E max(Bn−1 − α, y) n−1 = p2 ∑ n−1 x 1 p (1 − p)n−1−x x max(x − α, y) = p2 ∑ x+1 1 n−1 x p (1 − p)n−1−x x + 1 max(x − α, y) x x=0 n−1 x=0 ≤ ≤ n−1 p x+1 n max px+1 (1 − p)n−(x+1) ∑ n x max(x − α, y) x=0 x + 1 3p 3p (1 − (1 − p)n ) < . (13) n n Equation (13) follows by the following observation: x + 1 ≤ 3(x − α) for x ≥ 2, and x + 1 ≤ 2y for x ≤ 1. Finally, pE ln max(Bn−1 −α+B1 ,y) ≥ 0, which implies the lower bound of the lemma. max(Bn−1 −α,y) The following lemma bounds n2 as a function of n1 . Lemma 34 Let δ > 0. We have ∀δ S: n2 = O 3 5 Theorem 32 Proof Let yw = 1 − m ln 1 + n1 ln m . δ δ δ pw m − 2. By Lemma 9, with λ = 5, we have ∀ 2 S: 3 ln 4m 3pw ln 4m δ δ pw − cw ≤ , m m m √ 5 ln 4m δ ∀w ∈ V : pw > , cw > yw + 2 ≥ (5 − 15) ln 4m > ln 4m . δ δ m ∀w ∈ V : pw > Let VH = w ∈ V : pw > 5 ln 4m δ m |L − Lleave−one | ≤ (14) (15) and VL = V \VH . We have ∑ w∈VH pw Lw − cw L m w + ∑ w∈VL pw Lw − cw L m w . (16) We start by bounding the first term of Equation (16). By Equation (15), we have ∀w ∈ VH , cw > m−β w −α yw + 2 > ln 4m . Equation (10) implies that qw = cm−β , therefore Lw = ln cm−β = ln max(cw −α,yw ) , and δ w −α m−1−β Lw = ln cm−1−β = ln max(cw −1−α,yw ) . Let w −1−α H Errw = cw m−β m−β ln − pw ln . m max(cw − 1 − α, yw ) max(cw − α, yw ) We have 1246 C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS ∑ w∈VH cw L − pw Lw m w m−1−β cw ∑ m − β w∈VH m ∑ H Errw + ln ∑ = H Errw + O w∈VH ≤ w∈VH 1 . m (17) H We bound ∑w∈VH Errw using McDiarmid’s inequality. As in Lemma 33, let fw (x) = ln(max(x, yw )). We have H E Errw = ln(m − β)E cw cw − pw + E pw fw (cw − α) − fw (cw − 1 − α) . m m The first expectation equals 0, the second can be bounded using Lemma 33: w∈VH H E Errw ≤ ∑ w∈VH E pw fw (cw − α) − ≤ ∑ ∑ cw fw (cw − 1 − α) m 3pw 1 =O . m m w∈VH (18) H In order to use McDiarmid’s inequality, we bound the change of ∑w∈VH Errw as a function of a single change in the sample. 
Suppose that a word u is replaced by a word v. This results in decrease H for cu , and increase for cv . Recalling that yw = Ω(mpw ), the change of Erru , as well as the change H , is bounded by O ln m , as follows: of Errv m m−β The change of pu ln max(cu −α,yu ) would be 0 if cu − α ≤ yu . Otherwise, pu ln m−β m−β − pu ln max(cu − 1 − α, yu ) max(cu − α, yu ) ≤ pu [ln(cu − α) − ln(cu − 1 − α)] = pu ln 1 + 1 cu − 1 − α =O pu cu . m−β pu 1 Since cu ≥ yu = Ω(mpu ), the change is bounded by O( cu ) = O( m ). The change of cu ln max(cu −1−α,yu ) m would be O( ln m ) if cu − 1 − α ≤ yu . Otherwise, m m−β cu m−β cu − 1 ln − ln m max(cu − 2 − α, yu ) m max(cu − 1 − α, yu ) m−β 1 m−β m−β cu − 1 ln + ln − ln ≤ m max(cu − 2 − α, yu ) max(cu − 1 − α, yu ) m max(cu − 1 − α, yu ) ln m cu − 1 1 ln m ln 1 + + =O . ≤ m cu − 2 − α m m H The change of Errv is bounded in a similar way. 1247 D RUKH AND M ANSOUR δ By Equations (17) and (18), and Lemma 2, we have ∀ 16 S: ∑ w∈VH ≤ cw L − pw Lw m w ∑ w∈VH ≤ O ∑ H Errw − E ln m m H Errw ∑ + E H Errw +O w∈VH w∈VH  1 1 1 m ln + + δ m m = O 1 m  (ln m)2 ln 1 δ . m (19) δ Next, we bound the second term of Equation (16). By Lemma 10, we have ∀ 4 S: ∀w ∈ V s.t. pw ≤ 3 ln 4m δ , cw ≤ 6 ln 4m . δ m (20) b Let b = 5 ln 4m . By Equations (14) and (20), for any w such that pw ≤ m , we have δ   cw ≤ max pw +  m 3pw ln m 4m δ √ (5 + 3 ∗ 5) ln 4m 2b δ ≤ < .  m m  4m  δ 6 ln , m k L Therefore ∀w ∈ VL , we have cw < 2b. Let nL = |VL ∩Sk |, GL = m−k+1 nL , and Mk = ∑w∈VL ∩Sk pw . k k k−1 We have ∑ w∈VL cw L − pw Lw m w 2b = 2b−1 knL L k L (k) − ∑ Mk L(k) ∑ m k=1 k=0 ≤ ∑ m − kk+ 1 L (k) − ∑ MkL L(k) 2b k=1 2b = k=0 2b−1 ∑ GL L (k) − ∑ MkL L(k) k−1 k=1 2b−1 = 2b−1 knL +O k=0 2b−1 k=0 k=0 1 1 − m−k+1 m bLmax m +O k=0 2b−1 ≤ k=1 2b−1 ∑ GL L (k + 1) − ∑ MkL L(k) k k=0 2b + ∑ knL L (k) k bLmax m ∑ GL |L (k + 1) − L(k)| + ∑ |GL − MkL |L(k) + O k k bLmax m . The first sum of Equation (21) is bounded using Equations (11) and (12), and Lemma 34: 1248 (21) C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS 2b−1 ∑ GL |L (k + 1) − L(k)| k k=0 2b−1 = ∑ GL |L (k + 1) − L(k)| + G0|L (1) − L(0)| + G1|L (2) − L(1)|. k (22) k=2 The first term of Equation (22) is bounded by Equation (11): 2b−1 ∑ k=2 2b−1 ∑ GL · O k GL |L (k + 1) − L(k)| ≤ k k=2 1 m =O 1 . m δ The other two terms are bounded using Lemma 34. For n1 > 0, we have ∀ 16 S, n2 = O b By Equation (12), we have G0 |L (1) − L(0)| + G1 |L (2) − L(1)| ≤ n1 1 ·O m n1  2n2 1 + ·O m−1 n1  ln 1 δ . m = O b (23) m ln 1 + n1 δ (24) For n1 = 0, Lemma 34 results in n2 = O b m ln 1 , and Equation (24) transforms into δ G1 |L (2) − L(1)| ≤  2n2 Lmax = O bLmax m−1 Equations (22), (23), (24), and (25) sum up to  2b−1 ∑ GL |L (k + 1) − L(k)| k k=0 = O bLmax  ln 1 δ . m  ln 1 δ . m (25) (26) The second sum of Equation (21) is bounded using Lemma 28 separately for every k < 2b with δ L accuracy 16b . Since the proof of Lemma 28 also holds for GL and Mk (instead of Gk and Mk ), we k δ L have ∀ 8 S, for every k < 2b, |GL − Mk | = O b k ln b δ m . Therefore, together with Equations (21) and (26), we have ∑ w∈VL cw L − pw Lw m w  ≤ O bLmax  = O Lmax   2b−1 ln 1 δ + ∑ L(k)O b m k=0  b4 ln b δ . m 1249  ln b bLmax δ +O m m (27) . D RUKH AND M ANSOUR The proof follows by combining Equations (16), (19), and (27). 5. Log-Loss A Priori Section 4 bounds the error of the leave-one-out estimation of the log-loss. It shows that the log-loss can be effectively estimated, for a general family of learning algorithms. 
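The bound just proved concerns estimating the log-loss from the sample; the rest of this section studies the loss itself. As a concrete reference point for what follows (in particular the uniform-distribution lower bound of Theorem 35 below), here is a small simulation of the KL learning error of plain maximum-likelihood estimation; the vocabulary size, sample size, and count floor are arbitrary choices of mine.

```python
import math
import random

def ml_kl_error_uniform(N=50, m=5000, trials=200, seed=0):
    """Average KL(P || Q_ML) when P is uniform over N words and Q is the
    maximum-likelihood estimate from an i.i.d. sample of size m (N << m, so
    unseen words are vanishingly rare; counts are floored at 1 just in case).
    Theorem 35 predicts a learning error of Omega(N / m) in this setting."""
    rng = random.Random(seed)
    p = 1.0 / N
    total = 0.0
    for _ in range(trials):
        counts = [0] * N
        for _ in range(m):
            counts[rng.randrange(N)] += 1
        total += sum(p * math.log(p / (max(c, 1) / m)) for c in counts)
    return total / trials

print(ml_kl_error_uniform())  # roughly of order N / (2 * m) = 0.005 here
```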
Another question to be considered is the log-loss distribution itself, without the empirical estimation. That is, how large (or low) is it expected to be, and which parameters of the distribution affect it. We denote the learning error (equivalent to the log-loss) as the KL-divergence between the true and the estimated distribution. We refer to a general family of learning algorithms, and show lower and upper bounds for the learning error. The upper bound (Theorem 39) can be divided to three parts. The first part is the missing mass. The other two build a trade-off between a threshold (lower thresholds leads to a lower bound), and the number of words with probability exceeding this threshold (fewer words lead to a lower bound). It seems that this number of words is a necessary lower bound, as we show at Theorem 35. 1 Theorem 35 Let the distribution be uniform: ∀w ∈ V : pw = N , with N m. Also, suppose that the learning algorithm just uses maximum-likelihood approximation, meaning qw = cw . Then, a typical m learning error would be Ω( N ). m The proof of Theorem 35 bases on the Pinsker inequality (Lemma 36). It first shows a lower bound for L1 norm between the true and the expected distributions, and then transforms it to the form of the learning error. Lemma 36 (Pinsker Inequality) Given any two distributions P and Q, we have 1 KL(P||Q) ≥ (L1 (P, Q))2 . 2 Theorem 35 Proof We first show that L1 (P, Q) concentrates near Ω N m . Then, we use Pinsker inequality to show lower bound5 of KL(P||Q). First we find a lower bound for E[|pw − qw |]. Since cw is a binomial random variable, σ2 [cw ] = m mpw (1 − pw ) = Ω N , and with some constant probability, |cw − mpw | > σ[cw ]. Therefore, we have E[|qw − pw |] = ≥ E 1 E[|cw − mpw |] m 1 1 σ[cw ]P(|cw − mpw | > σ[cw ]) = Ω m m ∑ |pw − qw | w∈V = Ω N√ 1 mN =Ω 1 =Ω √ mN m N N m . 5. This proof does not optimize the constants. Asymptotic analysis of logarithmic transform of binomial variables by Flajolet (1999) can be used to achieve explicit values for KL(P||Q). 1250 C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS 2 A single change in the sample changes L1 (P, Q) by at most m . Using McDiarmid inequality 1 (Lemma 2) on L1 (P, Q) as a function of sample words, we have ∀ 2 S: L1 (P, Q) ≥ E[L1 (P, Q)] − |L1 (P, Q) − E[L1 (P, Q)]| √ N m N −O =Ω . = Ω m m m Using Pinsker inequality (Lemma 36), we have 1 2 ∀ S, pw 1 ∑ pw ln qw ≥ 2 w∈V 2 ∑ |pw − qw | =Ω w∈V N , m which completes the proof. Definition 37 Let α ∈ (0, 1) and τ ≥ 1. We define an (absolute discounting) algorithm Aα,τ , which α “removes” m probability mass from words appearing at most τ times, and uniformly spreads it among the unseen words. We denote by n1...τ = ∑τ ni the number of words with count between 1 i=1 and τ. The learned probability Q is defined by :  αn1...τ cw = 0  mn0 cw −α qw = 1 ≤ cw ≤ τ m  cw τ < cw . m The α parameter can be set to some constant, or to make the missing mass match the Good1...τ Turing missing mass estimator, that is αnm = n1 . m Definition 38 Given a distribution P, and x ∈ [0, 1], let Fx = ∑w∈V :pw ≤x pw , and Nx = |{w ∈ V : pw > x}|. Clearly, for any distribution P, Fx is a monotone function of x, varying from 0 to 1, and Nx is a monotone function of x, varying from N to 0. Note that Nx is bounded by 1 . x The next theorem shows an upper bound for the learning error. 
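For concreteness, here is a minimal sketch of the absolute discounting estimator A_{α,τ} of Definition 37, whose learning error is the subject of Theorem 39, stated next. The vocabulary and parameter values are illustrative; choosing α = n_1 / n_{1..τ}, so that the removed mass matches the Good-Turing missing mass estimate, is the second option mentioned in the definition.

```python
from collections import Counter

def absolute_discounting(sample, vocab, alpha=0.5, tau=1):
    """Definition 37's A_{alpha,tau} (a sketch): remove alpha/m of probability
    mass from each word appearing between 1 and tau times and spread it
    uniformly over the n0 unseen words (assumes n0 > 0 and 0 < alpha < 1)."""
    m = len(sample)
    counts = Counter(sample)
    n = Counter(counts.values())                        # n[j] = |{w : c_w = j}|
    n_1_tau = sum(n.get(j, 0) for j in range(1, tau + 1))
    n0 = len(vocab) - len(counts)
    q = {}
    for w in vocab:
        c = counts.get(w, 0)
        if c == 0:
            q[w] = alpha * n_1_tau / (m * n0)
        elif c <= tau:
            q[w] = (c - alpha) / m
        else:
            q[w] = c / m
    return q

names = ["Alice", "Bob", "Charlie", "Dan", "Eve", "Frank",
         "George", "George", "Henry", "Henry"]
vocab = set(names) | {"Michael", "Nora"}                # illustrative vocabulary
q = absolute_discounting(names, vocab, alpha=0.5, tau=1)
print(sum(q.values()))                                  # ≈ 1.0, a proper distribution
```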
√ Theorem 39 For any δ > 0 and λ > 3, such that τ < (λ − 3λ) ln 8m , the learning error of Aα,τ is δ bounded ∀δ S by pw 0 ≤ ∑ pw ln qw w∈V ≤ M0 ln + n0 ln 4m δ αn1...τ α F 8m + 1 − α λ lnm δ 1251  λ ln 8m δ  + 1−α  3 ln 8 δ + M0  m 3 ln 8 3λ ln 8m δ δ + √ N λ ln 8m . √ δ m 2( λ − 3)2 m m D RUKH AND M ANSOUR The proof of Theorem 39 bases directly on Lemmas 40, 41, and 43. We can rewrite this bound roughly as Nλ λ ˜ ≤ O M0 + √ + m m m pw qw ∑ pw ln w∈V . This bound implies the characteristics of the distribution influencing the log-loss. It shows that a “good” distribution can involve many low-probability words, given that the missing mass is low. However, the learning error would increase if the dictionary included many mid-range3 probability words. For example, if a typical word’s probability were m− 4 , the bound would become 1 ˜ O M0 + m− 4 . Lemma 40 For any δ > 0, the learning error for non-appearing words can be bounded with high probability by ∑ ∀δ S, pw ln w∈S / pw qw ≤ M0 ln n0 ln m δ αn1...τ . Proof By Lemma 13, we have ∀δ S, the real probability of any non-appearing word does not exceed ln m δ m . Therefore, ∑ pw ln w∈S / pw qw m n0 α n1...τ ∑ pw ln pw ∑ pw ln = ln m m n0 δ m α n1...τ w∈S / ≤ w∈S / = M0 ln n0 ln m δ αn1...τ , which completes the proof. Lemma 41 Let δ > 0, λ > 0. Let VL = w ∈ V : pw ≤ λ ln 2m δ m for VL can be bounded with high probability by ∀δ S, ∑ pw ln w∈VL pw qw Proof We use ln(1 + x) ≤ x. ∑  λ ln 2m δ  ≤ 1−α pw ln w∈VL For any appearing word w, qw ≥ 1−α m . pw qw ≤ ∑ w∈VL Therefore, 1252 , and VL = VL ∩ S. The learning error  3 ln 2 α δ + M0  + F 2m . m 1 − α λ lnm δ pw pw − qw . qw C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS pw − qw qw pw w∈VL ≤ m ∑ pw (pw − qw ) 1 − α w∈V = ∑ m 1−α L ∑ w∈VL ≤ m 1−α ≤ m λ ln 1−α m ≤ λ ln 1−α pw pw − ∑ pw pw − w∈VL 2m δ 2m δ ∑ w∈VL L cw m + pw − cw m α m ∑ pw m 1 − α w∈V L pw − w∈VL ∑ cw cw + ∑ pw − qw m m w∈V cw m + α ∑ pw 1 − α w∈V L + α F 2m . 1 − α λ lnm δ (28) We apply Lemma 12 on vL , the union of words in VL . Let pvL = ∑w∈VL pw and cvL = ∑w∈VL cw . We have ∀δ S: ∑ w∈VL pw − cw m w∈VL = pw − cw cw − ∑ pw − m m w∈V \S cw m ∑ L ≤ ∑ w∈VL pw − ≤ pvL − cvL + M0 m ≤ + ∑ pw w∈VL \S 3 ln 2 δ + M0 . m (29) The proof follows combining Equations (28) and (29). 2 x Lemma 42 Let 0 < ∆ < 1. For any x ∈ [−∆, ∆], we have ln(1 + x) ≥ x − 2(1−∆)2 . √ Lemma 43 Let δ > 0, λ > 3, such that τ < (λ − 3λ) ln 4m . Let the high-probability words set be δ VH = w ∈ V : pw > probability by λ ln 4m δ m ∀δ S, , and VH = VH ∩ S. The learning error for VH can be bounded with high ∑ w∈VH pw ln pw qw ≤ 3 ln 4 3λ ln 4m δ δ + √ N λ ln 4m . √ δ m 2( λ − 3)2 m m 1253 D RUKH AND M ANSOUR Proof ∑ w∈VH pw pw ln qw ∑ pw ln ∑ = pw = pw ln + cw m w∈VH mpw cw w∈VH ∑ pw ln w∈VH ∑ + cw m qw cw . cw − α pw ln w∈VH ,cw ≤τ (30) δ Using Lemma 9 with λ, we have ∀ 2 S: 3pw ln 4m cw δ ≤ , (31) m m √ 4m ∀w ∈ VH , cw ≥ (λ − 3λ) ln . δ √ This means that for a reasonable choice of τ (meaning τ < (λ − 3λ) ln 4m ), the second term of δ Equation (30) is 0, and VH = VH . Also, pw − ∀w ∈ VH , cw m 3pw ln 4m δ ≤ m − pw 1 ≤ pw pw Therefore, we can use Lemma 42 with ∆ = ∑ w∈VH pw ln mpw cw = − ≤ − = ∑ m 3 ln 2m δ = 2m m λ ln δ 3 λ: pw ln 1 + cw m w∈VH ∑ w∈VH ∑ w∈VH 3 . λ  cw m pw  − pw pw − pw − pw cw + pw − m cw m 1 2 1− 3 λ λ 2 √ √ λ− 3 2 2 ∑ w∈VH − pw pw cw m − pw pw 2    2 . (32) We apply Lemma 12 on the vH , the union of all words in VH . Let pvH = ∑w∈VH pw and cvH = ∑w∈VH cw . 
The bound on the first term of Equation (32) is: δ ∀ 2 S, ∑ w∈VH pw − cw m = pvH − cvH ≤ m 3 ln 4 δ . m Assuming that Equation (31) holds, the second term of Equation (32) can also be bounded: 1254 (33) C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS ∑ cw m w∈VH − pw pw 2 ≤ ∑ w∈VH 3 ln 4m 1 3pw ln 4m δ δ = N λ ln 4m . δ pw m m m (34) The proof follows by combining Equations (30), (32), (33) and (34). Acknowledgments This work was supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778, by a grant from the Israel Science Foundation and an IBM faculty award. This publication only reflects the authors’ views. We are grateful to David McAllester for his important contributions in the early stages of this research. Appendix A. Technical Proofs A.1 Concentration Inequalities Lemma 6 Proof We use Stirling approximation Γ(x + 1) = Tx = exp P(X = k) = ≤ = = = 1 1 +O 2 12x x √ 2πx x x e Tx , where . n k p (1 − p)n−k k Γ(n + 1) µ µ n − µ n−µ Γ(µ + 1)Γ(n − µ + 1) n n √ n µµ (n − µ)n−µ Tn 2πn n √ 2πµ 2π(n − µ) µµ (n − µ)n−µ nµ nn−µ Tµ Tn−µ √ 1 2πµ n Tn n − µ Tµ Tn−µ Tn 1 . 2πµ(1 − p) Tµ Tn−µ Clearly, for integral values of µ, the equality is achieved at k = µ. Lemma 8 Proof Let m = ∑w∈V cw . Using Lemma 7 for m with b = c = E[m ] = m, the prob1 ability P(m = m) achieves its minimum when ∀w ∈ V, pw = N . Under this assumption, we have 1 m ∼ Bin(mN, N ). Using Lemma 6, we have 1255 D RUKH AND M ANSOUR P m =m = 1 1 1 2πmN N 1 − N TmN 1 ≥ √ . Tm TmN−m 3 m Therefore, for any distribution {pw : w ∈ V }, we have 1 P(m = m) ≥ √ . 3 m Obviously, E[F ] = ∑w E[ fw (cw )] = E[F]. Also, the distribution of {cw } given that m = m is identical to the distribution of {cw }, therefore the distribution of F given that m = m is identical to the distribution of F. We have P(|F − E[F ]| > ε) = ∑ P(m i = i)P(|F − E[F ]| > ε|m = i) ≥ P(m = m)P(|F − E[F ]| > ε|m = m) = P(m = m)P(|F − E[F]| > ε) 1 √ P(|F − E[F]| > ε), ≥ 3 m which completes the proof. Lemma 44 For any δ > 0, and a word w ∈ V , such that pw ≥  P  pw − cw > m 3 ln 2 δ m  3pw ln 2 δ ≤ δ. m Proof The proof follows by applying Lemma 3, substituting ε = mpw we have ε ≤ mpw :  cw P  pw − ≥ m , we have 3mpw ln 2 . Note that for 3 ln 2 ≤ δ δ  3pw ln 2 δ = P (|mpw − cw | ≥ ε) m ≤ 2 exp − ε2 2E[cw ] + ε ≤ 2 exp − 3mpw ln 2 δ 3mpw which completes the proof. 1256 = δ, C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS 3 ln 2m Lemma 9 Proof There are at most m words with probability pw ≥ m δ . The first claim follows δ using Lemma 44 together with union bound over all these words (with accuracy m for each word). Using the first claim, we derive the second. We show a lower bound for 3pw ln 2m δ > pw − pw m cw ≥ pw − m 3 = λ 1− 3 λ cw m, using ln 2m δ m 1 < λ pw : pw . The final inequality follows from simple algebra. Lemma 10 Proof Let b = 3 ln( m ). Note that δ ∈ [0, 1] and m > 1 yield b > 2. First, suppose that δ b there are up to m words with pw ≤ m . For each such word, we apply Lemma 3 on cw , with ε = b. We have: P cw > 6 ln m b2 ≤ P (cw > mpw + ε) ≤ exp − δ 2mpw + b ≤ δ . m Since we assume that there are up to m such words, the total mistake probability is δ. Now we assume the general case, that is, without any assumption on the number of words. Our goal is to reduce the problem to the former conditions, that is, to create a set of size m of words with b probability smaller than m . We first create m empty sets v1 , . . . , vm . 
Let the probability of each set vi , pvi , be the sum of the probabilities of all the words it includes. Let the actual count of vi , cvi , be the sum of the sample counts of all the words w it includes. We divide all the words w between these sets in a bin-packing-approximation manner. We sort the words w in decreasing probability order. Then, we do the following loop: insert the next word w to the set vi with the currently smallest pvi . b We claim that pvi ≤ m for each vi at the end of the loop. If this inequality does not hold, then b some word w made this “overflow” first. Obviously, pw must be smaller than 2m , otherwise it would b be one of the first 2m < m words ordered, and would enter an empty set. If pw < 2m and it made b b an “overflow”, then the probability of each set at the moment w was entered must exceed 2m , since w must have entered the lightest set available. This means that the total probability of all words b entered by that moment was greater than m 2m > 1. Applying the case of m words to the sets v1 , . . . , vm , we have ∀δ S: for every vi , cvi ≤ 2b. Also, if the count of each set vi does not exceed 2b, so does the count of each word w ∈ vi . That is, P ∃w : pw ≤ b b , cw > 2b ≤ P ∃vi : pvi ≤ , cvi > 2b ≤ δ, m m which completes the proof. 1257 D RUKH AND M ANSOUR δ Lemma 11 Proof By Lemma 9 with some λ > 3 (which will be set later), we have ∀ 2 S: 3 ln 4m δ , m λ ln 4m δ ∀w : pw > , m ∀w : pw ≥ By Equation (35), for any word w such that √ λ + 3λ ln 4m . By Lemma 10, we have δ |mpw − cw | ≤ cw > 3 ln 4m δ m 3mpw ln 3 λ 1− ≤ pw ≤ 4m , δ mpw . λ ln 4m δ m (35) (36) , we have cw ≤ mpw + 3mpw ln 4m ≤ δ 4m . δ √ It means that for any w : mpw ≤ λ ln 4m , we have cw ≤ λ + 3λ ln 4m . This means that for δ δ √ 4m 4m any w such that cw > λ + 3λ ln δ , we have mpw > λ ln δ . By Equation (36), this means δ ∀ 2 S, ∀w s.t. pw ≤ mpw ≤ 1− 3 ln 4m δ m , cw ≤ 6 ln 1 √ 3 cw , and by Equation (35): λ |mpw − cw | ≤ 4m 3mpw ln ≤ δ 3cw ln 4m δ 1− 3 λ = √ 3cw λ ln 4m √ √δ . λ− 3 Substituting λ = 12 results in ∀δ S : ∀w s.t. cw > 18 ln 4m , δ |mpw − cw | ≤ 6cw ln 4m , δ which completes the proof. Lemma 12 Proof If pw ≥ 3 ln 2 δ m ∀δ S, , we can apply Lemma 44. We have cw − pw ≤ m 3pw ln 2 δ ≤ m 3 ln 2 δ . m Otherwise, we can apply Lemma 10. We have: ∀δ S, cw cw − pw ≤ max pw , m m which completes the proof. 1258 ≤ 6 ln m δ ≤ m 3 ln 2 δ , m C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS Lemma 13 Proof Let b = ln m . We note that there are at most δ P ∃w : cw = 0, pw ≥ b m ≤ ∑ m b b words with probability pw ≥ m . P(cw = 0) b w:pw ≥ m ∑ = m b (1 − pw ) ≤ 1− b m b m m < me−b = δ, w:pw ≥ m which completes the proof. A.2 K-Hitting Mass Estimation Lemma 21 Proof We have ∑w∈Vk,α pw ≤ 1. Using Lemma 19, we bound P(cw = k) and P(cw = k + 1): E[Mk,α ] = 1 pw P(cw = k) = O √ k w∈Vk,α ∑ k+1 P(cw = k + 1) − pw P(cw = k) w∈Vk,α m − k √ k+1 k = ∑ pw . P(cw = k + 1) = O m−k m w∈Vk,α |E[Gk,α ] − E[Mk,α ]| = ∑ Equation (37) follows by Lemma 20. By Lemma 18, we have |Vk,α | = O E[Gk,α ] = km 1 k+1 ∑ P(cw = k + 1) = O m k √k m − k w∈Vk,α m k (37) : 1 =O √ , k which completes the proof. Theorem 29 Proof The proof is done by examining four cases of k. For k ≤ 18 ln 8m , we can use δ Lemma 28. We have ˜ ∀δ S, |Mk − Mk | = |Gk − Mk | = O ln 1 δ m k + ln m δ ˜ =O 1 √ m . 2 For 18 ln 8m < k ≤ m 5 , we can use Theorem 25. We have δ ˜ ∀δ S, |Mk − Mk | = |Gk − Mk | = O 1259 √ k ln m δ m + k ln m δ m 2 ˜ = O m− 5 . D RUKH AND M ANSOUR 2 For m 5 < k < m , we can use Theorem 26. 
We have 2 δ ˜ ˆ ∀ S, |Mk − Mk | = |Mk − Mk | = O For k ≥ m , let α = 2 √ 3 k(ln m ) 2 δ m + √ ln m δ k 2 ˜ = O m− 5 . δ ˆ ˆ 6 ln 8m . By Lemma 17, we have ∀ 2 S, Mk = Mk,α ∧ Mk = Mk,α . By Lemma δ 18, |Vk,α | = O m = O(1). Let c be the bound on |Vk,α |. Using Lemma 12 for each w ∈ Vk,α with k δ accuracy 2c , we have  cw − pw = O  m δ ∀ 2 S, ∀w ∈ Vk,α , Therefore, we have ∀δ S: ˜ ˆ |Mk − Mk | = |Mk,α − Mk,α | ≤ ∑ w∈Vk,α which completes the proof. 1 δ  ln . m  ln 1 1 δ ˜ =O √ , m m  k − pw Xw,k = O  m Theorem 30 Proof First, we show that for any two words u and v, Cov(Xu,k , Xv,k ) = Θ k that {cv |cu = k} ∼ Bin m − k, m−k . By Lemma 6, we have: P(cu = k) = P(cv = k) = 1 2πk 1 − P(cv = k|cu = k) = k m Tm , Tk Tm−k 1 k 2πk 1 − m−k Tm−k . Tk Tm−2k Cov(Xu,k , Xv,k ) = E[Xu,k Xv,k ] − E[Xu,k ]E[Xv,k ] m−k m 1260 1 k 1− m . Note (38) Using Tx = Θ(1) for x ≥ k, we have = P(cu = k)[P(cv = k|cu = k) − P(cv = k)]  1 Tm  Tm−k 1 − = Tk Tm−k Tk Tm−2k k k 1 − m−k 2πk 1 − m   1  Tm−k Tm  1 1 = Θ − k 1 − k Tm−2k 1 − k Tm−k k m2  Tm  Tk Tm−k C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS   1 1  Tm−k  k Tm−2k = Θ 1− 1 + k m−k 1 − 1−  k m   Tm  Tm−k . − Tm−2k Tm−k k 1− m (39) We can bound the first term of Equation (39): 1 1− k m−k  1 − 1−  =  k m k 1 − m−k = Θ 1− Since Tx = exp 1 12x +O Tm Tm−k − Tm−2k Tm−k 1 x2  k 1 − m−k    k 1− m k 1− m − k k −1+ m m−k 1 = 1 + 12x + O 1 x2 =Θ k 1− m + k 1− m + k2 m2 .  k 1 − m−k   k 1 − m−k (40) for x ≥ m − 2k (note that k m), we have 2 Tm−k − Tm Tm−2k Tm−2k Tm−k 1 1 1 1 1 − − +O = Tm−2k Tm−k 6(m − k) 12m 12(m − 2k) m2 k2 1 = −Θ +O . 3 m m2 = Combining Equations (39), (40), and (41), we have Cov(Xu,k , Xv,k ) = Θ 1 k Θ Now we show that σ2 [Xw,k ] = Θ 1 √ k k2 m2 k2 m3 −Θ +O 1 m2 =Θ k m2 . By Equation (38), we have 1 σ2 [Xw,k ] = P(cw = k)(1 − P(cw = k)) = Θ √ k 1 1−Θ √ k 1 =Θ √ . k Now we find a bound for σ2 [Mk ]. σ2 [Mk ] = σ2 = ∑ w = m k = Θ . ∑ pw Xw,k w 2 2 pw σ [Xw,k ] + ∑ pu pvCov(Xu,k , Xv,k ) u=v k 2 1 Θ √ m k √ k , m + 1261 m m −1 k k k m 2 Θ k m2 (41) D RUKH AND M ANSOUR which completes the proof. A.3 Leave-One-Out Estimation of Log-Loss δ Lemma 34 Proof Using Lemma 9, we have ∀ 2 : n2 = |U ∩ S2 | and n1 = |U ∩ S1 |, where U = {w ∈ V : mpw ≤ c ln m }, for some c > 0. Let n2 = |U ∩ S2 | and n1 = |U ∩ S1 |. Let b = ln m . δ δ First, we show that E[n2 ] = O(bE[n1 ]). E[n2 ] = m 2 p (1 − pw )m−2 2 w ∑ w∈U = m − 1 pw 2 1 − pw ∑ mpw (1 − pw )m−1 w∈U = ∑ mpw (1 − pw )m−1 O(b) = O(bE[n1 ]). w∈U Next, we bound the deviation of n1 and n2 . A single change in the sample changes n1 , as well as n2 , by at most 1. Therefore, using Lemma 2 for n1 and n2 , we have n1 ≥ E[n1 ] − O δ ∀4 S : m ln 1 δ , n2 ≤ E[n2 ] + O δ ∀4 S : m ln 1 δ . Therefore, n2 ≤ E[n2 ] + O m ln 1 δ = O bE[n1 ] + m ln 1 δ = O b n1 + m ln 1 δ , which completes the proof. A.4 Log-Loss A Priori Theorem 39 Proof The KL-divergence is of course non-negative. By Lemma 40, we have δ ∀ 4 S, ∑ pw ln w∈S / pw qw δ By Lemma 41 with λ, we have ∀ 4 S: 1262 ≤ M0 ln n0 ln 4m δ αn1...τ . (42) C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS ∑ pw ln λ ln 8m w∈S:pw ≤ m δ pw qw  λ ln  1−α 8m δ ≤  3 ln α + M0  + F 8m . m 1 − α λ lnm δ 8 δ (43) δ By Lemma 43 with λ, we have ∀ 2 S: ∑ pw ln λ ln 8m w∈S:pw > m δ pw qw ≤ 3 ln 8 3λ ln 8m δ δ N λ ln 8m . + √ √ δ m 2( λ − 3)2 m m (44) The proof follows by combining Equations (42), (43), and (44). Lemma 42 Proof Let f (x) = x2 2(1−∆)2 − x + ln(1 + x). Then, f (x) = f (x) = x 1 , −1+ (1 − ∆)2 1+x 1 1 − (1 + x)2 2 (1 − ∆) . 
Clearly, f (0) = f (0) = 0. Also, f (x) ≥ 0 for any x ∈ [−∆, ∆]. Therefore, f (x) is non-negative in the range above, and the lemma follows. References D. Angluin and L. G. Valiant. Fast probabilistic algorithms for Hamiltonian circuits and matchings. Journal of Computer and System Sciences, 18:155–193, 1979. S. F. Chen. Building Probabilistic Models for Natural Language. PhD thesis, Harvard University, 1996. S. F. Chen and J. Goodman. An empirical study of smoothing techniques for language modeling. Technical Report TR-10-98, Harvard University, 1998. K. W. Church and W. A. Gale. A comparison of the enhanced Good-Turing and deleted estimation methods for estimating probabilities of English bigrams. Computer Speech and Language, 5: 19–54, 1991. J. R. Curran and M. Osborne. A very very large corpus doesn’t always yield reliable estimates. In Proceedings of the Sixth Conference on Natural Language Learning, pages 126–131, 2002. D. P. Dubhashi and D. Ranjan. Balls and bins: A study in negative dependence. Random Structures and Algorithms, 13(2):99–124, 1998. 1263 D RUKH AND M ANSOUR P. Flajolet. Singularity analysis and asymptotics of Bernoulli sums. Theoretical Computer Science, 215:371–381, 1999. I. J. Good. The population frequencies of species and the estimation of population parameters. Biometrika, 40(16):237–264, 1953. I. J. Good. Turing’s anticipation of empirical Bayes in connection with the cryptanalysis of the naval Enigma. Journal of Statistical Computation and Simulation, 66(2):101–112, 2000. W. Hoeffding. On the distribution of the number of successes in independent trials. Annals of Mathematical Statistics, 27:713–721, 1956. W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:13–30, 1963. S. B. Holden. PAC-like upper bounds for the sample complexity of leave-one-out cross-validation. In Proceesings of the Ninth Annual ACM Workshop on Computational Learning Theory, pages 41–50, 1996. S. M. Katz. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech and Signal Processing, 35(3):400– 401, 1987. M. Kearns and D. Ron. Algorithmic stability and sanity-check bounds for leave-one-out crossvalidation. Neural Computation, 11(6):1427–1453, 1999. S. Kutin. Algorithmic Stability and Ensemble-Based Learning. PhD thesis, University of Chicago, 2002. D. McAllester and L. Ortiz. Concentration inequalities for the missing mass and for histogram rule error. Journal of Machine Learning Research, Special Issue on Learning Theory, 4(Oct): 895–911, 2003. D. McAllester and R. E. Schapire. On the convergence rate of Good-Turing estimators. In Proceedings of the Thirteenth Annual Conference on Computational Learning Theory, pages 1–6, 2000. D. McAllester and R. E. Schapire. Learning theory and language modeling. In Seventeenth International Joint Conference on Artificial Intelligence, 2001. C. McDiarmid. On the method of bounded differences. In Surveys in Combinatorics, pages 148– 188. London Math. Soc. Lectures Notes 141, Cambridge University Press, 1989. A. Orlitsky, N. P. Santhanam, and J. Zhang. Always Good Turing: Asymptotically optimal probability estimation. Science, 302(Oct):427–431, 2003. 1264

