Concentration Bounds for Unigram Language Models

Authors: Evgeny Drukh, Yishay Mansour
Abstract: We show several high-probability concentration bounds for learning unigram language models. One interesting quantity is the probability of all words appearing exactly k times in a sample of size m. A standard estimator for this quantity is the Good-Turing estimator. The existing analysis of its error shows a high-probability bound of approximately $O\left(\frac{k}{\sqrt{m}}\right)$. We improve its dependency on k to $O\left(\frac{\sqrt[4]{k}}{\sqrt{m}} + \frac{k}{m}\right)$. We also analyze the empirical frequencies estimator, showing that with high probability its error is bounded by approximately $O\left(\frac{1}{k} + \frac{\sqrt{k}}{m}\right)$. We derive a combined estimator, which has an error of approximately $O\left(m^{-\frac{2}{5}}\right)$, for any k.

A standard measure for the quality of a learning algorithm is its expected per-word log-loss. The leave-one-out method can be used for estimating the log-loss of the unigram model. We show that its error has a high-probability bound of approximately $O\left(\frac{1}{\sqrt{m}}\right)$, for any underlying distribution. We also bound the log-loss a priori, as a function of various parameters of the distribution.

Keywords: Good-Turing estimators, logarithmic loss, leave-one-out estimation, Chernoff bounds

1. Introduction and Overview

Natural language processing (NLP) has developed rapidly over the last decades. It has a wide range of applications, including speech recognition, optical character recognition, text categorization and many more. The theoretical analysis has also advanced significantly, though many fundamental questions remain unanswered. One clear challenge, both practical and theoretical, concerns deriving stochastic models for natural languages.

Consider a simple language model, where the distribution of each word in the text is assumed to be independent. Even for such a simplistic model, fundamental questions relating sample size to the learning accuracy are already challenging. This is mainly due to the fact that the sample size is almost always insufficient, regardless of how large it is.

To demonstrate this phenomenon, consider the following example. We would like to estimate the distribution of first names in the university. For that, we are given the names list of a graduate seminar: Alice, Bob, Charlie, Dan, Eve, Frank, two Georges, and two Henries. How can we use this sample to estimate the distribution of students' first names? An empirical frequency estimator would assign Alice the probability of 0.1, since there is one Alice in the list of 10 names, while George, appearing twice, would get an estimate of 0.2. Unfortunately, unseen names, such as Michael, will get an estimate of 0.

Clearly, in this simple example the empirical frequencies are unlikely to estimate well the desired distribution. In general, the empirical frequencies estimate well the probabilities of popular names, but are rather inaccurate for rare names. Is there a sample size which assures us that all the names (or most of them) will appear enough times to allow accurate probability estimation? The distribution of first names can be conjectured to follow Zipf's law. In such distributions, there will be a significant fraction of rare items, as well as a considerable number of non-appearing items, in any sample of reasonable size. The same holds for the language unigram models, which try to estimate the distribution of single words.
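To make the names example concrete, here is a minimal Python sketch (ours, not part of the paper) of the empirical frequency estimator on the seminar name list; the unseen name "Michael" illustrates the zero-probability problem discussed above.

```python
from collections import Counter

# The graduate-seminar sample from the example above (10 names).
sample = ["Alice", "Bob", "Charlie", "Dan", "Eve", "Frank",
          "George", "George", "Henry", "Henry"]

m = len(sample)
counts = Counter(sample)

def empirical_frequency(word):
    """Maximum-likelihood estimate: the word's count divided by the sample size."""
    return counts[word] / m

print(empirical_frequency("Alice"))    # 0.1
print(empirical_frequency("George"))   # 0.2
print(empirical_frequency("Michael"))  # 0.0 -- unseen names get probability zero
```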
As has been observed empirically on many occasions (Chen, 1996; Curran and Osborne, 2002), there are always many rare words and a considerable number of unseen words, regardless of the sample size. Given this observation, a fundamental issue is to estimate the distribution the best way possible.

1.1 Good-Turing Estimators

An important quantity, given a sample, is the probability mass of unseen words (also called "the missing mass"). Several methods exist for smoothing the probability and assigning probability mass to unseen items. The almost standard method for estimating the missing probability mass is the Good-Turing estimator. It estimates the missing mass as the number of items appearing exactly once in the sample, divided by the sample size. In the names example above, the Good-Turing missing mass estimator equals 0.6 (six of the ten names appear exactly once), meaning that the list of the class names does not reflect the true distribution, to put it mildly.

The Good-Turing estimator can be extended for higher orders, that is, estimating the probability of all names appearing exactly k times. Such estimators can also be used for estimating the probability of individual words. The Good-Turing estimators date back to World War II, and were first published in 1953 (Good, 1953, 2000). They have been extensively used in language modeling applications since then (Katz, 1987; Church and Gale, 1991; Chen, 1996; Chen and Goodman, 1998). However, their theoretical convergence rate in various models has been studied only in recent years (McAllester and Schapire, 2000, 2001; Kutin, 2002; McAllester and Ortiz, 2003; Orlitsky et al., 2003). For estimation of the probability of all words appearing exactly k times in a sample of size m, McAllester and Schapire (2000) derive a high-probability bound on the Good-Turing estimator error of approximately $O\left(\frac{k}{\sqrt{m}}\right)$.

One of our main results improves the dependency on k of this bound to approximately $O\left(\frac{\sqrt[4]{k}}{\sqrt{m}} + \frac{k}{m}\right)$. We also show that the empirical frequencies estimator has an error of approximately $O\left(\frac{1}{k} + \frac{\sqrt{k}}{m}\right)$ for large values of k. Based on the two estimators, we derive a combined estimator with an error of approximately $O\left(m^{-\frac{2}{5}}\right)$, for any k. We also derive a weak lower bound of $\Omega\left(\frac{\sqrt{k}}{m}\right)$ for the error of any estimator based on an independent sample. Our results give theoretical justification for using the Good-Turing estimator for small values of k, and the empirical frequencies estimator for large values of k. Though in most applications the Good-Turing estimator is used for very small values of k, for example k ≤ 5, as by Katz (1987) or Chen (1996), we show that it is fairly accurate in a much wider range.

1.2 Logarithmic Loss

The Good-Turing estimators are used to approximate the probability mass of all the words with a certain frequency. For many applications, estimating this probability mass is not the main optimization criterion. Instead, a certain distance measure between the true and the estimated distributions needs to be minimized.

The most popular distance measure used in NLP applications is the Kullback-Leibler (KL) divergence. For a true distribution P = {p_x}, and an estimated distribution Q = {q_x}, both over some set X, this measure is defined as $\sum_x p_x \ln \frac{p_x}{q_x}$. An equivalent measure, up to the entropy of P, is the logarithmic loss (log-loss), which equals $\sum_x p_x \ln \frac{1}{q_x}$. Many NLP applications use the value of log-loss to evaluate the quality of the estimated distribution.
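As a small illustration of the two measures just defined, the following sketch (ours, with a made-up pair of distributions P and Q) computes the KL divergence and the log-loss, and checks that they differ exactly by the entropy of P.

```python
import math

# A hypothetical true distribution P and estimate Q over a small vocabulary.
P = {"a": 0.5, "b": 0.3, "c": 0.2}
Q = {"a": 0.4, "b": 0.4, "c": 0.2}

def kl_divergence(p, q):
    """KL(P || Q) = sum_x p_x ln(p_x / q_x)."""
    return sum(p[x] * math.log(p[x] / q[x]) for x in p)

def log_loss(p, q):
    """Expected per-word log-loss: sum_x p_x ln(1 / q_x)."""
    return sum(p[x] * math.log(1.0 / q[x]) for x in p)

def entropy(p):
    """Entropy of P: sum_x p_x ln(1 / p_x)."""
    return sum(px * math.log(1.0 / px) for px in p.values())

# The two measures are equivalent up to the entropy of P:
assert abs(kl_divergence(P, Q) - (log_loss(P, Q) - entropy(P))) < 1e-9
```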
However, the log-loss cannot be directly calculated, since it depends on the underlying distribution, which is unknown. Therefore, estimating the log-loss using the sample is important, although the sample cannot be independently used for both estimating the distribution and testing it.

The hold-out estimation splits the sample into two parts: training and testing. The training part is used for learning the distribution, whereas the testing sample is used for evaluating the average per-word log-loss. The main disadvantage of this method is the fact that it uses only part of the available information for learning, whereas in practice one would like to use the entire sample.

A widely used general estimation method is called leave-one-out. Basically, it averages over all the possible estimations in which a single item is chosen for testing and the rest are used for training. This procedure has the advantage of using the entire sample, and in addition it is rather simple and usually can be easily implemented. The existing theoretical analysis of the leave-one-out method (Holden, 1996; Kearns and Ron, 1999) shows general high-probability concentration bounds for the generalization error. However, these techniques are not applicable in our setting.

We show that the leave-one-out estimation error for the log-loss is approximately $O\left(\frac{1}{\sqrt{m}}\right)$, for any underlying distribution and a general family of learning algorithms. This gives a theoretical justification for effective use of leave-one-out estimation for the log-loss. We also analyze the concentration of the log-loss itself, not based on an empirical measure. We address the characteristics of the underlying distribution affecting the log-loss, and find such a characteristic, defining a tight bound for the log-loss value.

1.3 Model and Semantics

We denote the set of all words as V, and N = |V|. Let P be a distribution over V, where $p_w$ is the probability of a word w ∈ V. Given a sample S of size m, drawn i.i.d. using P, we denote the number of appearances of a word w in S as $c_w^S$, or simply $c_w$, when the sample S is clear from the context.¹ We define $S_k = \{w \in V : c_w^S = k\}$ and $n_k = |S_k|$.

For a claim Φ regarding a sample S, we write $\forall^{\delta} S\ \Phi[S]$ for $P(\Phi[S]) \ge 1 - \delta$. For some error bound function f(·), which holds with probability 1 − δ, we write $\tilde{O}(f(\cdot))$ for $O\left(f(\cdot)\,\ln^{c}\frac{m}{\delta}\right)$, where c > 0 is some constant.

1.4 Paper Organization

Section 2 shows several standard concentration inequalities, together with their technical applications regarding the maximum-likelihood approximation. Section 3 shows the error bounds for the k-hitting mass estimation. Section 4 bounds the error for the leave-one-out estimation of the logarithmic loss. Section 5 shows the bounds for the a priori logarithmic loss. Appendix A includes the technical proofs.

1. Unless mentioned otherwise, all further sample-dependent definitions depend on the sample S.

2. Concentration Inequalities

In this section we state several standard Chernoff-style concentration inequalities. We also show some of their corollaries regarding the maximum-likelihood approximation of $p_w$ by $\hat{p}_w = \frac{c_w}{m}$.

Lemma 1 (Hoeffding, 1963) Let $Y = Y_1, \ldots, Y_n$ be a set of n independent random variables, such that $Y_i \in [b_i, b_i + d_i]$. Then, for any ε > 0,
$$P\left(\left|\sum_i Y_i - E\left[\sum_i Y_i\right]\right| > \varepsilon\right) \le 2 \exp\left(-\frac{2\varepsilon^2}{\sum_i d_i^2}\right).$$

The next lemma is a variant of an extension of Hoeffding's inequality, by McDiarmid (1989).

Lemma 2 Let $Y = Y_1, \ldots, Y_n$ be a set of n independent random variables, and let f(Y) be such that any change of the value of a single $Y_i$ changes f(Y) by at most $d_i$, that is, $\sup_{Y'_i,\ \forall j \ne i: Y'_j = Y_j} |f(Y) - f(Y')| \le d_i$. Let $d = \max_i d_i$. Then,
$$\forall^{\delta} Y: \quad |f(Y) - E[f(Y)]| \le d\sqrt{\frac{n}{2}\ln\frac{2}{\delta}}.$$

Lemma 3 (Angluin and Valiant, 1979) Let $Y = Y_1, \ldots, Y_n$ be a set of n independent random variables, where $Y_i \in [0, B]$. Let $\mu = E\left[\sum_i Y_i\right]$. Then, for any ε > 0,
$$P\left(\sum_i Y_i < \mu - \varepsilon\right) \le \exp\left(-\frac{\varepsilon^2}{2\mu B}\right), \qquad P\left(\sum_i Y_i > \mu + \varepsilon\right) \le \exp\left(-\frac{\varepsilon^2}{(2\mu + \varepsilon)B}\right).$$

Definition 4 (Dubhashi and Ranjan, 1998) A set of random variables $Y_1, \ldots, Y_n$ is called "negatively associated" if it satisfies, for any two disjoint subsets I and J of {1, ..., n}, and any two non-decreasing, or any two non-increasing, functions f from $\mathbb{R}^{|I|}$ to $\mathbb{R}$ and g from $\mathbb{R}^{|J|}$ to $\mathbb{R}$:
$$E[f(Y_i : i \in I)\, g(Y_j : j \in J)] \le E[f(Y_i : i \in I)]\; E[g(Y_j : j \in J)].$$

The next lemma is based on the negative association analysis. It follows directly from Theorem 14 and Proposition 7 of Dubhashi and Ranjan (1998).

Lemma 5 For any set of N non-decreasing, or N non-increasing, functions $\{f_w : w \in V\}$, any Chernoff-style bound on $\sum_{w \in V} f_w(c_w)$, pretending that the $c_w$ are independent, is valid. In particular, Lemmas 1 and 2 apply for $\{Y_1, \ldots, Y_n\} = \{f_w(c_w) : w \in V\}$.

The next lemma shows an explicit upper bound on the binomial distribution probability.²

Lemma 6 Let $X \sim \mathrm{Bin}(n, p)$ be a sum of n i.i.d. Bernoulli random variables with p ∈ (0, 1). Let $\mu = E[X] = np$. For x ∈ (0, n], there exists some function $T_x = \exp\left(\frac{1}{12x} + O\left(\frac{1}{x^2}\right)\right)$, such that for all k ∈ {0, ..., n} we have
$$P(X = k) \le \frac{1}{\sqrt{2\pi\mu(1-p)}}\,\frac{T_n}{T_{\mu} T_{n-\mu}}.$$
For integral values of µ, the equality is achieved at k = µ. (Note that for x ≥ 1, we have $T_x = \Theta(1)$.)

The next lemma deals with the number of successes in independent trials.

Lemma 7 (Hoeffding, 1956) Let $Y_1, \ldots, Y_n \in \{0, 1\}$ be a sequence of independent trials, with $p_i = E[Y_i]$. Let $X = \sum_i Y_i$ be the number of successes, and $p = \frac{1}{n}\sum_i p_i$ the average trial success probability. For any integers b and c such that $0 \le b \le np \le c \le n$, we have
$$\sum_{k=b}^{c} \binom{n}{k} p^k (1-p)^{n-k} \le P(b \le X \le c) \le 1.$$

Using the above lemma, the next lemma shows a general concentration bound for a sum of arbitrary real-valued functions of the components of a multinomial distribution. We show that, with a small penalty, any Chernoff-style bound pretending the components are independent is valid.³ We recall that $c_w^S$, or equivalently $c_w$, is the number of appearances of the word w in a sample S of size m.

Lemma 8 Let $\{c'_w \sim \mathrm{Bin}(m, p_w) : w \in V\}$ be independent binomial random variables. Let $\{f_w(x) : w \in V\}$ be a set of real-valued functions. Let $F = \sum_w f_w(c_w)$ and $F' = \sum_w f_w(c'_w)$. For any ε > 0,
$$P(|F - E[F]| > \varepsilon) \le 3\sqrt{m}\; P\left(|F' - E[F']| > \varepsilon\right).$$

The following lemmas provide concentration bounds for the maximum-likelihood estimation of $p_w$ by $\hat{p}_w = \frac{c_w}{m}$. The first lemma shows that words with "high" probability have a "high" count in the sample.

Lemma 9 Let δ > 0, and λ ≥ 3. We have $\forall^{\delta} S$:
$$\forall w \in V \text{ s.t. } mp_w \ge 3\ln\frac{2m}{\delta}: \quad |mp_w - c_w| \le \sqrt{3 m p_w \ln\frac{2m}{\delta}};$$
$$\forall w \in V \text{ s.t. } mp_w > \lambda\ln\frac{2m}{\delta}: \quad c_w > \left(1 - \sqrt{\frac{3}{\lambda}}\right) mp_w.$$

2. Its proof is based on the Stirling approximation directly, though local limit theorems could be used. This form of bound is needed for the proof of Theorem 30.

3. The negative association analysis (Lemma 5) shows that a sum of monotone functions of multinomial distribution components must obey Chernoff-style bounds pretending that the components are independent.
In some sense, our result extends this notion, since it does not require the functions to be monotone. 1235 D RUKH AND M ANSOUR The second lemma shows that words with “low” probability have a “low” count in the sample. Lemma 10 Let δ ∈ (0, 1), and m > 1. Then, ∀δ S: ∀w ∈ V such that mpw ≤ 3 ln m , we have cw ≤ δ 6 ln m . δ The following lemma derives the bound as a function of the count in the sample (and not as a function of the unknown probability). Lemma 11 Let δ > 0. Then, ∀δ S: cw > 18 ln 4m , δ ∀w ∈ V, s.t. |mpw − cw | ≤ 6cw ln 4m . δ The following is a general concentration bound. Lemma 12 For any δ > 0, and any word w ∈ V , we have ∀δ S, cw − pw < m 3 ln 2 δ . m The following lemma bounds the probability of words that do not appear in the sample. Lemma 13 Let δ > 0. Then, ∀δ S: ∀w ∈ S, / mpw < ln m . δ 3. K-Hitting Mass Estimation In this section our goal is to estimate the probability of the set of words appearing exactly k times in the sample, which we call “the k-hitting mass”. We analyze the Good-Turing estimator, the empirical frequencies estimator, and a combined estimator. ˆ Definition 14 We define the k-hitting mass Mk , its empirical frequencies estimator Mk , and its 4 Good-Turing estimator Gk as Mk = ∑ w∈Sk pw ˆ Mk = k nk m Gk = k+1 nk+1 . m−k The outline of this section is as follows. Definition 16 slightly redefines the k-hitting mass and its estimators. Lemma 17 shows that this redefinition has a negligible influence. Then, we analyze the estimation errors using the concentration inequalities from Section 2. Lemmas 20 and 21 bound the expectation of the Good-Turing estimator error, following McAllester and Schapire (2000). Lemma 23 bounds the deviation of the error, using the negative association analysis. A tighter bound, based on Lemma 8, is achieved at Theorem 25. Theorem 26 analyzes the error of the empirical frequencies estimator. Theorem 29 refers to the combined estimator. Finally, Theorem 30 shows a weak lower bound for the k-hitting mass estimation. 4. The Good-Turing estimator is usually defined as ( k+1 )nk+1 . The two definitions are almost identical for small values m k of k, as their quotient equals 1 − m . Following McAllester and Schapire (2000), our definition makes the calculations slightly simpler. 1236 C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS Definition 15 For any w ∈ V and i ∈ {0, · · · , m}, we define Xw,i as a random variable equal 1 if cw = i, and 0 otherwise. The following definition concentrates on words whose frequencies are close to their probabilities. Definition 16 Let α > 0 and k > 3α2 . We define Ik,α = pw ∈ Ik,α }. We define: Mk,α = ∑ pw = w∈Sk ∩Vk,α ∑ √ √ k−α k k+1+α k+1 m , m , and Vk,α = {w ∈ V : pw Xw,k , w∈Vk,α Gk,α = k+1 k+1 |Sk+1 ∩Vk,α | = ∑ Xw,k+1 , m−k m − k w∈Vk,α ˆ Mk,α = k k |Sk ∩Vk,α | = ∑ Xw,k . m m w∈Vk,α By Lemma 11, for large values of k the redefinition coincides with the original definition with high probability: Lemma 17 For δ > 0, let α = ˆ ˆ and Mk = Mk,α . 6 ln 4m . For k > 18 ln 4m , we have ∀δ S: Mk = Mk,α , Gk = Gk,α , δ δ Proof By Lemma 11, we have ∀δ S, ∀w : cw > 18 ln 4m , δ |mpw − cw | ≤ 6cw ln √ 4m = α cw . δ This means that any word w with cw = k has √ √ √ k−α k k+α k k+1+α k+1 ≤ pw ≤ < . m m m √ ˆ Therefore w ∈ Vk,α , completing the proof for Mk and Mk . Since α < k, any word w with cw = k + 1 has √ √ √ k−α k k+1−α k+1 k+1+α k+1 < ≤ pw ≤ , m m m which yields w ∈ Vk,α , completing the proof for Gk . Since the minimal probability of a word in Vk,α is Ω Lemma 18 Let α > 0 and k > 3α2 . 
Then, |Vk,α | = O 1237 m k k m . , we derive: D RUKH AND M ANSOUR Proof We have α < √ √k . 3 Any word w ∈ Vk,α has pw ≥ |Vk,α | < √ k−α k m > k m 1 1 − √3 . Therefore, m m 1 , =O 1 k 1− √ k 3 which completes the proof. Using Lemma 6, we derive: Lemma 19 Let α > 0 and 3α2 < k ≤ m . Let w ∈ Vk,α . Then, E[Xw,k ] = P(cw = k) = O 2 1 √ k . Proof Since cw ∼ Bin(m, pw ) is a binomial random variable, we use Lemma 6: E[Xw,k ] = P(cw = k) ≤ Tm 1 . 2πmpw (1 − pw ) Tmpw Tm(1−pw ) For w ∈ Vk,α , we have mpw = Ω(k), which implies 3α2 < k ≤ m 2, Tm Tmpw Tm(1−pw ) we have 1 2πmpw (1 − pw ) ≤ = O(1). Since pw ∈ Ik,α and 1 √ 2π k − α k √ k+1+α k+1 m 1− 1 < 1 2πk 1 − √3 1 1 − k+1 1 + √3 m 1 < 1 2πk 1 − √3 1− 1 2 1 +m 1 1 + √3 1 = O √ , k which completes the proof. 3.1 Good-Turing Estimator The following lemma, directly based on the definition of the binomial distribution, was shown in Theorem 1 of McAllester and Schapire (2000). Lemma 20 For any k < m, and w ∈ V , we have pw P(cw = k) = k+1 P(cw = k + 1)(1 − pw ). m−k The following lemma bounds the expectations of the redefined k-hitting mass, its Good-Turing estimator, and their difference. 1238 C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS Lemma 21 Let α > 0 and 3α2 < k < |E[Gk,α ] − E[Mk,α ]| = O √ k m m 2. We have E[Mk,α ] = O 1 √ k , E[Gk,α ] = O 1 √ k , and . Lemma 22 Let δ > 0, k ∈ {1, . . . , m}. Let U ⊆ V , such that |U| = O m . Let {bw : w ∈ U}, such k k that ∀w ∈ U, bw ≥ 0 and maxw∈U bw = O m . Let Xk = ∑w∈U bw Xw,k . We have ∀δ S: |Xk − E[Xk ]| = O k ln 1 δ . m Proof We define Yw,k = ∑i≤k Xw,i be random variable indicating cw ≤ k and Zw,k = ∑i < k. Let Yk = ∑w∈U bwYw,k and Zk = ∑w∈U bw Zw,k . We have ∑ bw Xw,k = ∑ bw [Yw,k − Zw,k ] = Yk − Zk . Xk = w∈U w∈U Both Yk and Zk , can be bounded using the Hoeffding inequality. Since {bwYw,k } and {bw Zw,k } are monotone with respect to {cw }, Lemma 5 applies for them. This means that the concentration of their sum is at least as tight as if they were independent. Recalling that |U| = O m and k k maxw∈U bw = O m , and using Lemma 2 for Yk and Zk , we have δ ∀ 2 S, |Yk − E[Yk ]| = O δ ∀ 2 S, |Zk − E[Zk ]| = O k m m k ln 1 , δ k m m k ln 1 . δ Therefore, |Xk − E[Xk ]| = |Yk − Zk − E[Yk − Zk ]| which completes the proof. ≤ |Yk − E[Yk ]| + |Zk − E[Zk ]| = O k ln 1 δ , m Using the negative association notion, we can show a preliminary bound for Good-Turing estimation error: Lemma 23 For δ > 0 and 18 ln 8m < k < m , we have ∀δ S: 2 δ k ln 1 δ |Gk − Mk | = O . m 1239 D RUKH AND M ANSOUR Proof Let α = 6 ln 8m . By Lemma 17, we have δ δ ∀ 2 S, Gk = Gk,α ∧ Mk = Mk,α . (1) By Lemma 21, |E[Gk − Mk ]| = |E[Gk,α − Mk,α ]| = O √ k m . (2) k+1 By Definition 16, Mk,α = ∑w∈Vk,α pw Xw,k and Gk,α = ∑w∈Vk,α m−k Xw,k+1 . By Lemma 18, we have |Vk,α | = O m . Therefore, using Lemma 22 with k for Mk,α , and with k + 1 for Gk,α , we have k δ ∀ 4 S, |Mk,α − E[Mk,α ]| = O δ ∀ 4 S, |Gk,α − E[Gk,α ]| = O k ln 1 δ m , (3) k ln 1 δ m . (4) Combining Equations (1), (2), (3), and (4), we have ∀δ S: |Gk − Mk | = |Gk,α − Mk,α | ≤ |Gk,α − E[Gk,α ]| + |Mk,α − E[Mk,α ]| + |E[Gk,α ] − E[Mk,α ]| √ k ln 1 k ln 1 k δ δ = O +O = O , m m m which completes the proof. Lemma 24 Let δ > 0, k > 0. Let U ⊆ V . Let {bw : w ∈ U} be a set of weights, such that bw ∈ [0, B]. Let Xk = ∑w∈U bw Xw,k , and µ = E[Xk ]. We have δ ∀ S, |Xk − µ| ≤ max √ √ 6 m 6 m 4Bµ ln , 2B ln δ δ . 
Proof By Lemma 8, combined with Lemma 3, we have ε2 B(2µ + ε) √ √ ε2 ε ≤ max 6 m exp − , 6 m exp − 4Bµ 2B √ P(|Xk − µ| > ε) ≤ 6 m exp − 1240 , (5) C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS where Equation (5) follows by considering ε ≤ 2µ and ε > 2µ separately. The lemma follows sub- stituting ε = max 4Bµ ln √ 6 m δ , 2B ln √ 6 m δ . We now derive the concentration bound on the error of the Good-Turing estimator. Theorem 25 For δ > 0 and 18 ln 8m < k < m , we have ∀δ S: 2 δ √ |Gk − Mk | = O Proof Let α = k ln m m δ + k ln m m δ . δ 6 ln 8m . Using Lemma 17, we have ∀ 2 S: Gk = Gk,α , and Mk = Mk,α . Recall that δ k+1 Mk,α = ∑w∈Vk,α pw Xw,k and Gk,α = ∑w∈Vk,α m−k Xw,k+1 . Both Mk,α and Gk,α are linear combinations k of Xw,k and Xw,k+1 , respectively, where the coefficients’ magnitude is O m , and the expectation, by Lemma 21, is O 1 √ k . By Lemma 24, we have δ √ k ln m δ m + k ln m δ m , (6) δ 4 √ k ln m δ m + k ln m δ m . (7) ∀ 4 S, |Mk,α − E[Mk,α ]| = O ∀ S, |Gk,α − E[Gk,α ]| = O Combining Equations (6), (7), and Lemma 21, we have ∀δ S: |Gk − Mk | = |Gk,α − Mk,α | ≤ |Gk,α − E[Gk,α ]| + |Mk,α − E[Mk,α ]| + |E[Gk,α ] − E[Mk,α ]| √ √ √ k ln m k ln m k ln m k ln m k δ δ δ δ + + + = O = O , m m m m m which completes the proof. 3.2 Empirical Frequencies Estimator ˆ In this section we bound the error of the empirical frequencies estimator Mk . Theorem 26 For δ > 0 and 18 ln 8m < k < m , we have 2 δ ∀δ S, √ k ln m δ ˆ |Mk − Mk | = O m 1241 3 2 + ln m δ . k D RUKH AND M ANSOUR Proof Let α = δ − ˆ ˆ 6 ln 8m . By Lemma 17, we have ∀ 2 S: Mk = Mk,α , and Mk = Mk,α . Let Vk,α = δ + k k {w ∈ Vk,α : pw < m }, and Vk,α = {w ∈ Vk,α : pw > m }. Let X− = k − pw Xw,k , m ∑ − w∈Vk,α X+ = ∑ + w∈Vk,α pw − k Xw,k , m and let X? specify either X− or X+ . By the definition, for w ∈ Vk,α we have By Lemma 18, |Vk,α | = O m k |E[X? ]| ≤ k m − pw = O . By Lemma 19, for w ∈ Vk,α we have E[Xw,k ] = O ∑ w∈Vk,α k − pw E[Xw,k ] = O m √ mα k 1 √ k m k =O 1 √ k expectation is O α k . . Therefore, α . k Both X− and X+ are linear combinations of Xw,k , where the coefficients are O √ α k m (8) √ α k m and the . Therefore, by Lemma 24, we have δ 4 ∀ S: |X? − E[X? ]| = O √ α3 k α4 √ + m m k . (9) ˆ By the definition of X− and X+ , Mk,α − Mk,α = X+ − X− . Combining Equations (8) and (9), we δ S: have ∀ ˆ ˆ |Mk − Mk | = |Mk,α − Mk,α | = |X+ − X− | ≤ |X+ − E[X+ ]| + |E[X+ ]| + |X− − E[X− ]| + |E[X− ]| √ 3 √ 3 k 4 k ln m 2 α α α δ √ + = O = O + + m k m m k since √ ab = O(a + b), and we use a = √ α3 k m and b = α . k ln m δ , k 3.3 Combined Estimator In this section we combine the Good-Turing estimator with the empirical frequencies to derive a combined estimator, which is uniformly accurate for all values of k. ˜ Definition 27 We define Mk , a combined estimator for Mk , by ˜ Mk = Gk ˆ Mk 1242 2 k ≤ m5 2 k > m5 . C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS Lemma 28 (McAllester and Schapire, 2000) Let k ∈ {0, . . . , m}. For any δ > 0, we have ln 1 m δ . k + ln m δ ∀δ S : |Gk − Mk | = O 2 ˜ ˜ The following theorem shows that Mk has an error bounded by O m− 5 , for any k. For small k, 2 2 we use Lemma 28. Theorem 25 is used for 18 ln 8m < k ≤ m 5 . Theorem 26 is used for m 5 < k < m . 2 δ ˜ The complete proof also handles k ≥ m . The theorem refers to Mk as a probability estimator, and 2 does not show that it is a probability distribution by itself. Theorem 29 Let δ > 0. For any k ∈ {0, . . . , m}, we have ˜ ˜ ∀δ S, |Mk − Mk | = O m− 5 . 2 The following theorem shows a weak lower bound for approximating Mk . 
It applies to estimating Mk based on a different independent sample. This is a very “weak” notation, since Gk , as well ˆ as Mk , are based on the same sample as Mk . Theorem 30 Suppose that the vocabulary consists of k m ), where 1 k √ m. The variance of Mk is Θ k m m k words distributed uniformly (that is pw = . 4. Leave-One-Out Estimation of Log-Loss Many NLP applications use log-loss as the learning performance criteria. Since the log-loss depends on the underlying probability P, its value cannot be explicitly calculated, and must be approximated. The main result of this section, Theorem 32, is an upper bound on the leave-one-out estimation of the log-loss, assuming a general family of learning algorithms. Given a sample S = {s1 , . . . , sm }, the goal of a learning algorithm is to approximate the true probability P by some probability Q. We denote the probability assigned by the learning algorithm to a word w by qw . Definition 31 We assume that any two words with equal sample frequency are assigned equal probabilities in Q, and therefore denote qw by q(cw ). Let the log-loss of a distribution Q be L = 1 1 ∑ pw ln qw = ∑ Mk ln q(k) . w∈V k≥0 Let the leave-one-out estimation, qw , be the probability assigned to w, when one of its instances is removed. We assume that any two words with equal sample frequency are assigned equal leaveone-out probability estimation, and therefore denote qw by q (cw ). We define the leave-one-out estimation of the log-loss as averaging the loss of each sample word, when it is extracted from the sample and pretended to be the test sample: 1243 D RUKH AND M ANSOUR ∑ Lleave−one = w∈V cw knk 1 1 =∑ ln ln . m qw k>0 m q (k) 1 1 Let Lw = L(cw ) = ln q(cw ) , and Lw = L (cw ) = ln q (cw ) . Let the maximal loss be Lmax = max max L(k), L (k + 1) . k In this section we discuss a family of learning algorithms, that receive the sample as an input. Assuming an accuracy parameter δ, we require the following properties to hold: 1. Starting from a certain number of appearances, the estimation is close to the sample frequency. Specifically, for some α, β ∈ [0, 1], ∀k ≥ ln 4m , δ q(k) = k−α . m−β (10) 2. The algorithm is stable when a single word is extracted from the sample: ∀m, 1 , m 1 L (k + 1) − L(k) = O S . n1 2 ≤ k ≤ 10 ln 4m , δ L (k + 1) − L(k) = O ∀m, ∀S s.t. nS > 0, k ∈ {0, 1}, 1 (11) (12) An example of such an algorithm is the following leave-one-out algorithm (we assume that the vocabulary is large enough so that n0 + n1 > 0): qw = N−n0 −1 (n0 +n1 )(m−1) cw −1 m−1 cw ≤ 1 cw ≥ 2. Equation (10) is satisfied by α = β = 1. Equation (11) is satisfied for k ≥ 2 by L(k) − L (k + 1) = 1 = O m . Equation (12) is satisfied for k ≤ 1: ln m−1 m−2 N − n0 − 1 m − 2 N − n0 − 2 m − 1 n0 + n1 + 1 m − 2 |L (2) − L(1)| = ln n0 + n1 m − 1 |L (1) − L(0)| = ln 1 1 1 =O + , N − n0 m n1 1 1 1 =O + . =O n0 + n1 m n1 =O The following is the main theorem of this section. It bounds the deviation between the between the true loss and the leave one out estimate. This bound shows that for a general family of learning algorithms, leave-one-out technique can be effectively used to estimate the logarithmic loss, given the sample only. The estimation error bound decreases roughly in proportion to the square root of the sample size, regardless of the underlying distribution. 1244 C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS Theorem 32 For a learning algorithm satisfying Equations (10), (11), and (12), and δ > 0, we have: δ ∀ S, |L − Lleave−one | = O Lmax (ln m )4 ln δ m ln m δ δ . 
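To make Definition 31 concrete, here is a minimal sketch (ours) of the leave-one-out log-loss estimate: each sample occurrence is held out in turn, the learner is retrained on the remaining m − 1 words, and the losses are averaged. The add-one smoothing rule used here is only an assumed stand-in learner for illustration; it is not the algorithm family analyzed in the paper.

```python
import math
from collections import Counter

def add_one_estimate(counts, m, vocab_size, word):
    """Stand-in learner (add-one smoothing); an assumption, not the paper's algorithm."""
    return (counts[word] + 1) / (m + vocab_size)

def leave_one_out_log_loss(sample, vocab_size):
    """Average loss over all ways of holding out a single sample occurrence."""
    m = len(sample)
    counts = Counter(sample)
    total = 0.0
    for w in sample:                       # hold out one occurrence of w
        counts[w] -= 1                     # retrain on the remaining m - 1 words
        q_loo = add_one_estimate(counts, m - 1, vocab_size, w)
        total += math.log(1.0 / q_loo)     # loss on the held-out word
        counts[w] += 1                     # restore the sample
    return total / m

sample = ["the", "cat", "sat", "on", "the", "mat", "the", "dog"]
print(leave_one_out_log_loss(sample, vocab_size=10))
```

Because equal-frequency words receive equal leave-one-out probabilities, averaging over occurrences as above coincides with the weighted sum over counts k in the definition of L_leave-one.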
The proof of Theorem 32 bounds the estimation error separately for the high-probability and low-probability words. We use Lemma 20 (McAllester and Schapire, 2000) to bound the estimation error for low-probability words. The expected estimation error for the high-probability words is bounded elementarily using the definition of the binomial distribution (Lemma 33). Finally, we use McDiarmid’s inequality (Lemma 2) to bound its deviation. The next lemma shows that the expectation of the leave-one-out method is a good approximation for the per-word expectation of the logarithmic loss. Lemma 33 Let 0 ≤ α ≤ 1, and y ≥ 1. Let Bn ∼ Bin(n, p) be a binomial random variable. Let fy (x) = ln(max(x, y)). Then, 0 ≤ E p fy (Bn − α) − 3p Bn fy (Bn − α − 1) ≤ . n n Proof For a real valued function F (here F(x) = fy (x − α)), we have: E Bn F(Bn − 1) n n = ∑ x=0 n = p∑ x=1 n x x p (1 − p)n−x F(x − 1) x n n − 1 x−1 p (1 − p)(n−1)−(x−1) F(x − 1) x−1 = pE[F(Bn−1 )] , where we used n x x n = E p fy (Bn − α) − n−1 x−1 . Since Bn ∼ Bn−1 + B1 , we have: Bn fy (Bn − α − 1) n = p(E[ fy (Bn−1 + B1 − α)] − E[ fy (Bn−1 − α)]) max(Bn−1 + B1 − α, y) max(Bn−1 − α, y) max(Bn−1 − α + B1 , y + B1 ) ≤ pE ln max(Bn−1 − α, y) B1 = pE ln(1 + ) max(Bn−1 − α, y) B1 ≤ pE . max(Bn−1 − α, y) = pE ln Since B1 and Bn−1 are independent, we get 1245 D RUKH AND M ANSOUR pE B1 1 = pE[B1 ]E max(Bn−1 − α, y) max(Bn−1 − α, y) 1 = p2 E max(Bn−1 − α, y) n−1 = p2 ∑ n−1 x 1 p (1 − p)n−1−x x max(x − α, y) = p2 ∑ x+1 1 n−1 x p (1 − p)n−1−x x + 1 max(x − α, y) x x=0 n−1 x=0 ≤ ≤ n−1 p x+1 n max px+1 (1 − p)n−(x+1) ∑ n x max(x − α, y) x=0 x + 1 3p 3p (1 − (1 − p)n ) < . (13) n n Equation (13) follows by the following observation: x + 1 ≤ 3(x − α) for x ≥ 2, and x + 1 ≤ 2y for x ≤ 1. Finally, pE ln max(Bn−1 −α+B1 ,y) ≥ 0, which implies the lower bound of the lemma. max(Bn−1 −α,y) The following lemma bounds n2 as a function of n1 . Lemma 34 Let δ > 0. We have ∀δ S: n2 = O 3 5 Theorem 32 Proof Let yw = 1 − m ln 1 + n1 ln m . δ δ δ pw m − 2. By Lemma 9, with λ = 5, we have ∀ 2 S: 3 ln 4m 3pw ln 4m δ δ pw − cw ≤ , m m m √ 5 ln 4m δ ∀w ∈ V : pw > , cw > yw + 2 ≥ (5 − 15) ln 4m > ln 4m . δ δ m ∀w ∈ V : pw > Let VH = w ∈ V : pw > 5 ln 4m δ m |L − Lleave−one | ≤ (14) (15) and VL = V \VH . We have ∑ w∈VH pw Lw − cw L m w + ∑ w∈VL pw Lw − cw L m w . (16) We start by bounding the first term of Equation (16). By Equation (15), we have ∀w ∈ VH , cw > m−β w −α yw + 2 > ln 4m . Equation (10) implies that qw = cm−β , therefore Lw = ln cm−β = ln max(cw −α,yw ) , and δ w −α m−1−β Lw = ln cm−1−β = ln max(cw −1−α,yw ) . Let w −1−α H Errw = cw m−β m−β ln − pw ln . m max(cw − 1 − α, yw ) max(cw − α, yw ) We have 1246 C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS ∑ w∈VH cw L − pw Lw m w m−1−β cw ∑ m − β w∈VH m ∑ H Errw + ln ∑ = H Errw + O w∈VH ≤ w∈VH 1 . m (17) H We bound ∑w∈VH Errw using McDiarmid’s inequality. As in Lemma 33, let fw (x) = ln(max(x, yw )). We have H E Errw = ln(m − β)E cw cw − pw + E pw fw (cw − α) − fw (cw − 1 − α) . m m The first expectation equals 0, the second can be bounded using Lemma 33: w∈VH H E Errw ≤ ∑ w∈VH E pw fw (cw − α) − ≤ ∑ ∑ cw fw (cw − 1 − α) m 3pw 1 =O . m m w∈VH (18) H In order to use McDiarmid’s inequality, we bound the change of ∑w∈VH Errw as a function of a single change in the sample. Suppose that a word u is replaced by a word v. This results in decrease H for cu , and increase for cv . 
Recalling that yw = Ω(mpw ), the change of Erru , as well as the change H , is bounded by O ln m , as follows: of Errv m m−β The change of pu ln max(cu −α,yu ) would be 0 if cu − α ≤ yu . Otherwise, pu ln m−β m−β − pu ln max(cu − 1 − α, yu ) max(cu − α, yu ) ≤ pu [ln(cu − α) − ln(cu − 1 − α)] = pu ln 1 + 1 cu − 1 − α =O pu cu . m−β pu 1 Since cu ≥ yu = Ω(mpu ), the change is bounded by O( cu ) = O( m ). The change of cu ln max(cu −1−α,yu ) m would be O( ln m ) if cu − 1 − α ≤ yu . Otherwise, m m−β cu m−β cu − 1 ln − ln m max(cu − 2 − α, yu ) m max(cu − 1 − α, yu ) m−β 1 m−β m−β cu − 1 ln + ln − ln ≤ m max(cu − 2 − α, yu ) max(cu − 1 − α, yu ) m max(cu − 1 − α, yu ) ln m cu − 1 1 ln m ln 1 + + =O . ≤ m cu − 2 − α m m H The change of Errv is bounded in a similar way. 1247 D RUKH AND M ANSOUR δ By Equations (17) and (18), and Lemma 2, we have ∀ 16 S: ∑ w∈VH ≤ cw L − pw Lw m w ∑ w∈VH ≤ O ∑ H Errw − E ln m m H Errw ∑ + E H Errw +O w∈VH w∈VH 1 1 1 m ln + + δ m m = O 1 m (ln m)2 ln 1 δ . m (19) δ Next, we bound the second term of Equation (16). By Lemma 10, we have ∀ 4 S: ∀w ∈ V s.t. pw ≤ 3 ln 4m δ , cw ≤ 6 ln 4m . δ m (20) b Let b = 5 ln 4m . By Equations (14) and (20), for any w such that pw ≤ m , we have δ cw ≤ max pw + m 3pw ln m 4m δ √ (5 + 3 ∗ 5) ln 4m 2b δ ≤ < . m m 4m δ 6 ln , m k L Therefore ∀w ∈ VL , we have cw < 2b. Let nL = |VL ∩Sk |, GL = m−k+1 nL , and Mk = ∑w∈VL ∩Sk pw . k k k−1 We have ∑ w∈VL cw L − pw Lw m w 2b = 2b−1 knL L k L (k) − ∑ Mk L(k) ∑ m k=1 k=0 ≤ ∑ m − kk+ 1 L (k) − ∑ MkL L(k) 2b k=1 2b = k=0 2b−1 ∑ GL L (k) − ∑ MkL L(k) k−1 k=1 2b−1 = 2b−1 knL +O k=0 2b−1 k=0 k=0 1 1 − m−k+1 m bLmax m +O k=0 2b−1 ≤ k=1 2b−1 ∑ GL L (k + 1) − ∑ MkL L(k) k k=0 2b + ∑ knL L (k) k bLmax m ∑ GL |L (k + 1) − L(k)| + ∑ |GL − MkL |L(k) + O k k bLmax m . The first sum of Equation (21) is bounded using Equations (11) and (12), and Lemma 34: 1248 (21) C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS 2b−1 ∑ GL |L (k + 1) − L(k)| k k=0 2b−1 = ∑ GL |L (k + 1) − L(k)| + G0|L (1) − L(0)| + G1|L (2) − L(1)|. k (22) k=2 The first term of Equation (22) is bounded by Equation (11): 2b−1 ∑ k=2 2b−1 ∑ GL · O k GL |L (k + 1) − L(k)| ≤ k k=2 1 m =O 1 . m δ The other two terms are bounded using Lemma 34. For n1 > 0, we have ∀ 16 S, n2 = O b By Equation (12), we have G0 |L (1) − L(0)| + G1 |L (2) − L(1)| ≤ n1 1 ·O m n1 2n2 1 + ·O m−1 n1 ln 1 δ . m = O b (23) m ln 1 + n1 δ (24) For n1 = 0, Lemma 34 results in n2 = O b m ln 1 , and Equation (24) transforms into δ G1 |L (2) − L(1)| ≤ 2n2 Lmax = O bLmax m−1 Equations (22), (23), (24), and (25) sum up to 2b−1 ∑ GL |L (k + 1) − L(k)| k k=0 = O bLmax ln 1 δ . m ln 1 δ . m (25) (26) The second sum of Equation (21) is bounded using Lemma 28 separately for every k < 2b with δ L accuracy 16b . Since the proof of Lemma 28 also holds for GL and Mk (instead of Gk and Mk ), we k δ L have ∀ 8 S, for every k < 2b, |GL − Mk | = O b k ln b δ m . Therefore, together with Equations (21) and (26), we have ∑ w∈VL cw L − pw Lw m w ≤ O bLmax = O Lmax 2b−1 ln 1 δ + ∑ L(k)O b m k=0 b4 ln b δ . m 1249 ln b bLmax δ +O m m (27) . D RUKH AND M ANSOUR The proof follows by combining Equations (16), (19), and (27). 5. Log-Loss A Priori Section 4 bounds the error of the leave-one-out estimation of the log-loss. It shows that the log-loss can be effectively estimated, for a general family of learning algorithms. Another question to be considered is the log-loss distribution itself, without the empirical estimation. 
That is, how large (or low) is it expected to be, and which parameters of the distribution affect it. We denote the learning error (equivalent to the log-loss) as the KL-divergence between the true and the estimated distribution. We refer to a general family of learning algorithms, and show lower and upper bounds for the learning error. The upper bound (Theorem 39) can be divided to three parts. The first part is the missing mass. The other two build a trade-off between a threshold (lower thresholds leads to a lower bound), and the number of words with probability exceeding this threshold (fewer words lead to a lower bound). It seems that this number of words is a necessary lower bound, as we show at Theorem 35. 1 Theorem 35 Let the distribution be uniform: ∀w ∈ V : pw = N , with N m. Also, suppose that the learning algorithm just uses maximum-likelihood approximation, meaning qw = cw . Then, a typical m learning error would be Ω( N ). m The proof of Theorem 35 bases on the Pinsker inequality (Lemma 36). It first shows a lower bound for L1 norm between the true and the expected distributions, and then transforms it to the form of the learning error. Lemma 36 (Pinsker Inequality) Given any two distributions P and Q, we have 1 KL(P||Q) ≥ (L1 (P, Q))2 . 2 Theorem 35 Proof We first show that L1 (P, Q) concentrates near Ω N m . Then, we use Pinsker inequality to show lower bound5 of KL(P||Q). First we find a lower bound for E[|pw − qw |]. Since cw is a binomial random variable, σ2 [cw ] = m mpw (1 − pw ) = Ω N , and with some constant probability, |cw − mpw | > σ[cw ]. Therefore, we have E[|qw − pw |] = ≥ E 1 E[|cw − mpw |] m 1 1 σ[cw ]P(|cw − mpw | > σ[cw ]) = Ω m m ∑ |pw − qw | w∈V = Ω N√ 1 mN =Ω 1 =Ω √ mN m N N m . 5. This proof does not optimize the constants. Asymptotic analysis of logarithmic transform of binomial variables by Flajolet (1999) can be used to achieve explicit values for KL(P||Q). 1250 C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS 2 A single change in the sample changes L1 (P, Q) by at most m . Using McDiarmid inequality 1 (Lemma 2) on L1 (P, Q) as a function of sample words, we have ∀ 2 S: L1 (P, Q) ≥ E[L1 (P, Q)] − |L1 (P, Q) − E[L1 (P, Q)]| √ N m N −O =Ω . = Ω m m m Using Pinsker inequality (Lemma 36), we have 1 2 ∀ S, pw 1 ∑ pw ln qw ≥ 2 w∈V 2 ∑ |pw − qw | =Ω w∈V N , m which completes the proof. Definition 37 Let α ∈ (0, 1) and τ ≥ 1. We define an (absolute discounting) algorithm Aα,τ , which α “removes” m probability mass from words appearing at most τ times, and uniformly spreads it among the unseen words. We denote by n1...τ = ∑τ ni the number of words with count between 1 i=1 and τ. The learned probability Q is defined by : αn1...τ cw = 0 mn0 cw −α qw = 1 ≤ cw ≤ τ m cw τ < cw . m The α parameter can be set to some constant, or to make the missing mass match the Good1...τ Turing missing mass estimator, that is αnm = n1 . m Definition 38 Given a distribution P, and x ∈ [0, 1], let Fx = ∑w∈V :pw ≤x pw , and Nx = |{w ∈ V : pw > x}|. Clearly, for any distribution P, Fx is a monotone function of x, varying from 0 to 1, and Nx is a monotone function of x, varying from N to 0. Note that Nx is bounded by 1 . x The next theorem shows an upper bound for the learning error. √ Theorem 39 For any δ > 0 and λ > 3, such that τ < (λ − 3λ) ln 8m , the learning error of Aα,τ is δ bounded ∀δ S by pw 0 ≤ ∑ pw ln qw w∈V ≤ M0 ln + n0 ln 4m δ αn1...τ α F 8m + 1 − α λ lnm δ 1251 λ ln 8m δ + 1−α 3 ln 8 δ + M0 m 3 ln 8 3λ ln 8m δ δ + √ N λ ln 8m . 
√ δ m 2( λ − 3)2 m m D RUKH AND M ANSOUR The proof of Theorem 39 bases directly on Lemmas 40, 41, and 43. We can rewrite this bound roughly as Nλ λ ˜ ≤ O M0 + √ + m m m pw qw ∑ pw ln w∈V . This bound implies the characteristics of the distribution influencing the log-loss. It shows that a “good” distribution can involve many low-probability words, given that the missing mass is low. However, the learning error would increase if the dictionary included many mid-range3 probability words. For example, if a typical word’s probability were m− 4 , the bound would become 1 ˜ O M0 + m− 4 . Lemma 40 For any δ > 0, the learning error for non-appearing words can be bounded with high probability by ∑ ∀δ S, pw ln w∈S / pw qw ≤ M0 ln n0 ln m δ αn1...τ . Proof By Lemma 13, we have ∀δ S, the real probability of any non-appearing word does not exceed ln m δ m . Therefore, ∑ pw ln w∈S / pw qw m n0 α n1...τ ∑ pw ln pw ∑ pw ln = ln m m n0 δ m α n1...τ w∈S / ≤ w∈S / = M0 ln n0 ln m δ αn1...τ , which completes the proof. Lemma 41 Let δ > 0, λ > 0. Let VL = w ∈ V : pw ≤ λ ln 2m δ m for VL can be bounded with high probability by ∀δ S, ∑ pw ln w∈VL pw qw Proof We use ln(1 + x) ≤ x. ∑ λ ln 2m δ ≤ 1−α pw ln w∈VL For any appearing word w, qw ≥ 1−α m . pw qw ≤ ∑ w∈VL Therefore, 1252 , and VL = VL ∩ S. The learning error 3 ln 2 α δ + M0 + F 2m . m 1 − α λ lnm δ pw pw − qw . qw C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS pw − qw qw pw w∈VL ≤ m ∑ pw (pw − qw ) 1 − α w∈V = ∑ m 1−α L ∑ w∈VL ≤ m 1−α ≤ m λ ln 1−α m ≤ λ ln 1−α pw pw − ∑ pw pw − w∈VL 2m δ 2m δ ∑ w∈VL L cw m + pw − cw m α m ∑ pw m 1 − α w∈V L pw − w∈VL ∑ cw cw + ∑ pw − qw m m w∈V cw m + α ∑ pw 1 − α w∈V L + α F 2m . 1 − α λ lnm δ (28) We apply Lemma 12 on vL , the union of words in VL . Let pvL = ∑w∈VL pw and cvL = ∑w∈VL cw . We have ∀δ S: ∑ w∈VL pw − cw m w∈VL = pw − cw cw − ∑ pw − m m w∈V \S cw m ∑ L ≤ ∑ w∈VL pw − ≤ pvL − cvL + M0 m ≤ + ∑ pw w∈VL \S 3 ln 2 δ + M0 . m (29) The proof follows combining Equations (28) and (29). 2 x Lemma 42 Let 0 < ∆ < 1. For any x ∈ [−∆, ∆], we have ln(1 + x) ≥ x − 2(1−∆)2 . √ Lemma 43 Let δ > 0, λ > 3, such that τ < (λ − 3λ) ln 4m . Let the high-probability words set be δ VH = w ∈ V : pw > probability by λ ln 4m δ m ∀δ S, , and VH = VH ∩ S. The learning error for VH can be bounded with high ∑ w∈VH pw ln pw qw ≤ 3 ln 4 3λ ln 4m δ δ + √ N λ ln 4m . √ δ m 2( λ − 3)2 m m 1253 D RUKH AND M ANSOUR Proof ∑ w∈VH pw pw ln qw ∑ pw ln ∑ = pw = pw ln + cw m w∈VH mpw cw w∈VH ∑ pw ln w∈VH ∑ + cw m qw cw . cw − α pw ln w∈VH ,cw ≤τ (30) δ Using Lemma 9 with λ, we have ∀ 2 S: 3pw ln 4m cw δ ≤ , (31) m m √ 4m ∀w ∈ VH , cw ≥ (λ − 3λ) ln . δ √ This means that for a reasonable choice of τ (meaning τ < (λ − 3λ) ln 4m ), the second term of δ Equation (30) is 0, and VH = VH . Also, pw − ∀w ∈ VH , cw m 3pw ln 4m δ ≤ m − pw 1 ≤ pw pw Therefore, we can use Lemma 42 with ∆ = ∑ w∈VH pw ln mpw cw = − ≤ − = ∑ m 3 ln 2m δ = 2m m λ ln δ 3 λ: pw ln 1 + cw m w∈VH ∑ w∈VH ∑ w∈VH 3 . λ cw m pw − pw pw − pw − pw cw + pw − m cw m 1 2 1− 3 λ λ 2 √ √ λ− 3 2 2 ∑ w∈VH − pw pw cw m − pw pw 2 2 . (32) We apply Lemma 12 on the vH , the union of all words in VH . Let pvH = ∑w∈VH pw and cvH = ∑w∈VH cw . The bound on the first term of Equation (32) is: δ ∀ 2 S, ∑ w∈VH pw − cw m = pvH − cvH ≤ m 3 ln 4 δ . m Assuming that Equation (31) holds, the second term of Equation (32) can also be bounded: 1254 (33) C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS ∑ cw m w∈VH − pw pw 2 ≤ ∑ w∈VH 3 ln 4m 1 3pw ln 4m δ δ = N λ ln 4m . 
δ pw m m m (34) The proof follows by combining Equations (30), (32), (33) and (34). Acknowledgments This work was supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778, by a grant from the Israel Science Foundation and an IBM faculty award. This publication only reflects the authors’ views. We are grateful to David McAllester for his important contributions in the early stages of this research. Appendix A. Technical Proofs A.1 Concentration Inequalities Lemma 6 Proof We use Stirling approximation Γ(x + 1) = Tx = exp P(X = k) = ≤ = = = 1 1 +O 2 12x x √ 2πx x x e Tx , where . n k p (1 − p)n−k k Γ(n + 1) µ µ n − µ n−µ Γ(µ + 1)Γ(n − µ + 1) n n √ n µµ (n − µ)n−µ Tn 2πn n √ 2πµ 2π(n − µ) µµ (n − µ)n−µ nµ nn−µ Tµ Tn−µ √ 1 2πµ n Tn n − µ Tµ Tn−µ Tn 1 . 2πµ(1 − p) Tµ Tn−µ Clearly, for integral values of µ, the equality is achieved at k = µ. Lemma 8 Proof Let m = ∑w∈V cw . Using Lemma 7 for m with b = c = E[m ] = m, the prob1 ability P(m = m) achieves its minimum when ∀w ∈ V, pw = N . Under this assumption, we have 1 m ∼ Bin(mN, N ). Using Lemma 6, we have 1255 D RUKH AND M ANSOUR P m =m = 1 1 1 2πmN N 1 − N TmN 1 ≥ √ . Tm TmN−m 3 m Therefore, for any distribution {pw : w ∈ V }, we have 1 P(m = m) ≥ √ . 3 m Obviously, E[F ] = ∑w E[ fw (cw )] = E[F]. Also, the distribution of {cw } given that m = m is identical to the distribution of {cw }, therefore the distribution of F given that m = m is identical to the distribution of F. We have P(|F − E[F ]| > ε) = ∑ P(m i = i)P(|F − E[F ]| > ε|m = i) ≥ P(m = m)P(|F − E[F ]| > ε|m = m) = P(m = m)P(|F − E[F]| > ε) 1 √ P(|F − E[F]| > ε), ≥ 3 m which completes the proof. Lemma 44 For any δ > 0, and a word w ∈ V , such that pw ≥ P pw − cw > m 3 ln 2 δ m 3pw ln 2 δ ≤ δ. m Proof The proof follows by applying Lemma 3, substituting ε = mpw we have ε ≤ mpw : cw P pw − ≥ m , we have 3mpw ln 2 . Note that for 3 ln 2 ≤ δ δ 3pw ln 2 δ = P (|mpw − cw | ≥ ε) m ≤ 2 exp − ε2 2E[cw ] + ε ≤ 2 exp − 3mpw ln 2 δ 3mpw which completes the proof. 1256 = δ, C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS 3 ln 2m Lemma 9 Proof There are at most m words with probability pw ≥ m δ . The first claim follows δ using Lemma 44 together with union bound over all these words (with accuracy m for each word). Using the first claim, we derive the second. We show a lower bound for 3pw ln 2m δ > pw − pw m cw ≥ pw − m 3 = λ 1− 3 λ cw m, using ln 2m δ m 1 < λ pw : pw . The final inequality follows from simple algebra. Lemma 10 Proof Let b = 3 ln( m ). Note that δ ∈ [0, 1] and m > 1 yield b > 2. First, suppose that δ b there are up to m words with pw ≤ m . For each such word, we apply Lemma 3 on cw , with ε = b. We have: P cw > 6 ln m b2 ≤ P (cw > mpw + ε) ≤ exp − δ 2mpw + b ≤ δ . m Since we assume that there are up to m such words, the total mistake probability is δ. Now we assume the general case, that is, without any assumption on the number of words. Our goal is to reduce the problem to the former conditions, that is, to create a set of size m of words with b probability smaller than m . We first create m empty sets v1 , . . . , vm . Let the probability of each set vi , pvi , be the sum of the probabilities of all the words it includes. Let the actual count of vi , cvi , be the sum of the sample counts of all the words w it includes. We divide all the words w between these sets in a bin-packing-approximation manner. We sort the words w in decreasing probability order. 
Then, we do the following loop: insert the next word w to the set vi with the currently smallest pvi . b We claim that pvi ≤ m for each vi at the end of the loop. If this inequality does not hold, then b some word w made this “overflow” first. Obviously, pw must be smaller than 2m , otherwise it would b be one of the first 2m < m words ordered, and would enter an empty set. If pw < 2m and it made b b an “overflow”, then the probability of each set at the moment w was entered must exceed 2m , since w must have entered the lightest set available. This means that the total probability of all words b entered by that moment was greater than m 2m > 1. Applying the case of m words to the sets v1 , . . . , vm , we have ∀δ S: for every vi , cvi ≤ 2b. Also, if the count of each set vi does not exceed 2b, so does the count of each word w ∈ vi . That is, P ∃w : pw ≤ b b , cw > 2b ≤ P ∃vi : pvi ≤ , cvi > 2b ≤ δ, m m which completes the proof. 1257 D RUKH AND M ANSOUR δ Lemma 11 Proof By Lemma 9 with some λ > 3 (which will be set later), we have ∀ 2 S: 3 ln 4m δ , m λ ln 4m δ ∀w : pw > , m ∀w : pw ≥ By Equation (35), for any word w such that √ λ + 3λ ln 4m . By Lemma 10, we have δ |mpw − cw | ≤ cw > 3 ln 4m δ m 3mpw ln 3 λ 1− ≤ pw ≤ 4m , δ mpw . λ ln 4m δ m (35) (36) , we have cw ≤ mpw + 3mpw ln 4m ≤ δ 4m . δ √ It means that for any w : mpw ≤ λ ln 4m , we have cw ≤ λ + 3λ ln 4m . This means that for δ δ √ 4m 4m any w such that cw > λ + 3λ ln δ , we have mpw > λ ln δ . By Equation (36), this means δ ∀ 2 S, ∀w s.t. pw ≤ mpw ≤ 1− 3 ln 4m δ m , cw ≤ 6 ln 1 √ 3 cw , and by Equation (35): λ |mpw − cw | ≤ 4m 3mpw ln ≤ δ 3cw ln 4m δ 1− 3 λ = √ 3cw λ ln 4m √ √δ . λ− 3 Substituting λ = 12 results in ∀δ S : ∀w s.t. cw > 18 ln 4m , δ |mpw − cw | ≤ 6cw ln 4m , δ which completes the proof. Lemma 12 Proof If pw ≥ 3 ln 2 δ m ∀δ S, , we can apply Lemma 44. We have cw − pw ≤ m 3pw ln 2 δ ≤ m 3 ln 2 δ . m Otherwise, we can apply Lemma 10. We have: ∀δ S, cw cw − pw ≤ max pw , m m which completes the proof. 1258 ≤ 6 ln m δ ≤ m 3 ln 2 δ , m C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS Lemma 13 Proof Let b = ln m . We note that there are at most δ P ∃w : cw = 0, pw ≥ b m ≤ ∑ m b b words with probability pw ≥ m . P(cw = 0) b w:pw ≥ m ∑ = m b (1 − pw ) ≤ 1− b m b m m < me−b = δ, w:pw ≥ m which completes the proof. A.2 K-Hitting Mass Estimation Lemma 21 Proof We have ∑w∈Vk,α pw ≤ 1. Using Lemma 19, we bound P(cw = k) and P(cw = k + 1): E[Mk,α ] = 1 pw P(cw = k) = O √ k w∈Vk,α ∑ k+1 P(cw = k + 1) − pw P(cw = k) w∈Vk,α m − k √ k+1 k = ∑ pw . P(cw = k + 1) = O m−k m w∈Vk,α |E[Gk,α ] − E[Mk,α ]| = ∑ Equation (37) follows by Lemma 20. By Lemma 18, we have |Vk,α | = O E[Gk,α ] = km 1 k+1 ∑ P(cw = k + 1) = O m k √k m − k w∈Vk,α m k (37) : 1 =O √ , k which completes the proof. Theorem 29 Proof The proof is done by examining four cases of k. For k ≤ 18 ln 8m , we can use δ Lemma 28. We have ˜ ∀δ S, |Mk − Mk | = |Gk − Mk | = O ln 1 δ m k + ln m δ ˜ =O 1 √ m . 2 For 18 ln 8m < k ≤ m 5 , we can use Theorem 25. We have δ ˜ ∀δ S, |Mk − Mk | = |Gk − Mk | = O 1259 √ k ln m δ m + k ln m δ m 2 ˜ = O m− 5 . D RUKH AND M ANSOUR 2 For m 5 < k < m , we can use Theorem 26. We have 2 δ ˜ ˆ ∀ S, |Mk − Mk | = |Mk − Mk | = O For k ≥ m , let α = 2 √ 3 k(ln m ) 2 δ m + √ ln m δ k 2 ˜ = O m− 5 . δ ˆ ˆ 6 ln 8m . By Lemma 17, we have ∀ 2 S, Mk = Mk,α ∧ Mk = Mk,α . By Lemma δ 18, |Vk,α | = O m = O(1). Let c be the bound on |Vk,α |. 
Using Lemma 12 for each w ∈ Vk,α with k δ accuracy 2c , we have cw − pw = O m δ ∀ 2 S, ∀w ∈ Vk,α , Therefore, we have ∀δ S: ˜ ˆ |Mk − Mk | = |Mk,α − Mk,α | ≤ ∑ w∈Vk,α which completes the proof. 1 δ ln . m ln 1 1 δ ˜ =O √ , m m k − pw Xw,k = O m Theorem 30 Proof First, we show that for any two words u and v, Cov(Xu,k , Xv,k ) = Θ k that {cv |cu = k} ∼ Bin m − k, m−k . By Lemma 6, we have: P(cu = k) = P(cv = k) = 1 2πk 1 − P(cv = k|cu = k) = k m Tm , Tk Tm−k 1 k 2πk 1 − m−k Tm−k . Tk Tm−2k Cov(Xu,k , Xv,k ) = E[Xu,k Xv,k ] − E[Xu,k ]E[Xv,k ] m−k m 1260 1 k 1− m . Note (38) Using Tx = Θ(1) for x ≥ k, we have = P(cu = k)[P(cv = k|cu = k) − P(cv = k)] 1 Tm Tm−k 1 − = Tk Tm−k Tk Tm−2k k k 1 − m−k 2πk 1 − m 1 Tm−k Tm 1 1 = Θ − k 1 − k Tm−2k 1 − k Tm−k k m2 Tm Tk Tm−k C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS 1 1 Tm−k k Tm−2k = Θ 1− 1 + k m−k 1 − 1− k m Tm Tm−k . − Tm−2k Tm−k k 1− m (39) We can bound the first term of Equation (39): 1 1− k m−k 1 − 1− = k m k 1 − m−k = Θ 1− Since Tx = exp 1 12x +O Tm Tm−k − Tm−2k Tm−k 1 x2 k 1 − m−k k 1− m k 1− m − k k −1+ m m−k 1 = 1 + 12x + O 1 x2 =Θ k 1− m + k 1− m + k2 m2 . k 1 − m−k k 1 − m−k (40) for x ≥ m − 2k (note that k m), we have 2 Tm−k − Tm Tm−2k Tm−2k Tm−k 1 1 1 1 1 − − +O = Tm−2k Tm−k 6(m − k) 12m 12(m − 2k) m2 k2 1 = −Θ +O . 3 m m2 = Combining Equations (39), (40), and (41), we have Cov(Xu,k , Xv,k ) = Θ 1 k Θ Now we show that σ2 [Xw,k ] = Θ 1 √ k k2 m2 k2 m3 −Θ +O 1 m2 =Θ k m2 . By Equation (38), we have 1 σ2 [Xw,k ] = P(cw = k)(1 − P(cw = k)) = Θ √ k 1 1−Θ √ k 1 =Θ √ . k Now we find a bound for σ2 [Mk ]. σ2 [Mk ] = σ2 = ∑ w = m k = Θ . ∑ pw Xw,k w 2 2 pw σ [Xw,k ] + ∑ pu pvCov(Xu,k , Xv,k ) u=v k 2 1 Θ √ m k √ k , m + 1261 m m −1 k k k m 2 Θ k m2 (41) D RUKH AND M ANSOUR which completes the proof. A.3 Leave-One-Out Estimation of Log-Loss δ Lemma 34 Proof Using Lemma 9, we have ∀ 2 : n2 = |U ∩ S2 | and n1 = |U ∩ S1 |, where U = {w ∈ V : mpw ≤ c ln m }, for some c > 0. Let n2 = |U ∩ S2 | and n1 = |U ∩ S1 |. Let b = ln m . δ δ First, we show that E[n2 ] = O(bE[n1 ]). E[n2 ] = m 2 p (1 − pw )m−2 2 w ∑ w∈U = m − 1 pw 2 1 − pw ∑ mpw (1 − pw )m−1 w∈U = ∑ mpw (1 − pw )m−1 O(b) = O(bE[n1 ]). w∈U Next, we bound the deviation of n1 and n2 . A single change in the sample changes n1 , as well as n2 , by at most 1. Therefore, using Lemma 2 for n1 and n2 , we have n1 ≥ E[n1 ] − O δ ∀4 S : m ln 1 δ , n2 ≤ E[n2 ] + O δ ∀4 S : m ln 1 δ . Therefore, n2 ≤ E[n2 ] + O m ln 1 δ = O bE[n1 ] + m ln 1 δ = O b n1 + m ln 1 δ , which completes the proof. A.4 Log-Loss A Priori Theorem 39 Proof The KL-divergence is of course non-negative. By Lemma 40, we have δ ∀ 4 S, ∑ pw ln w∈S / pw qw δ By Lemma 41 with λ, we have ∀ 4 S: 1262 ≤ M0 ln n0 ln 4m δ αn1...τ . (42) C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS ∑ pw ln λ ln 8m w∈S:pw ≤ m δ pw qw λ ln 1−α 8m δ ≤ 3 ln α + M0 + F 8m . m 1 − α λ lnm δ 8 δ (43) δ By Lemma 43 with λ, we have ∀ 2 S: ∑ pw ln λ ln 8m w∈S:pw > m δ pw qw ≤ 3 ln 8 3λ ln 8m δ δ N λ ln 8m . + √ √ δ m 2( λ − 3)2 m m (44) The proof follows by combining Equations (42), (43), and (44). Lemma 42 Proof Let f (x) = x2 2(1−∆)2 − x + ln(1 + x). Then, f (x) = f (x) = x 1 , −1+ (1 − ∆)2 1+x 1 1 − (1 + x)2 2 (1 − ∆) . Clearly, f (0) = f (0) = 0. Also, f (x) ≥ 0 for any x ∈ [−∆, ∆]. Therefore, f (x) is non-negative in the range above, and the lemma follows. References D. Angluin and L. G. Valiant. Fast probabilistic algorithms for Hamiltonian circuits and matchings. Journal of Computer and System Sciences, 18:155–193, 1979. S. F. Chen. 
Building Probabilistic Models for Natural Language. PhD thesis, Harvard University, 1996.

S. F. Chen and J. Goodman. An empirical study of smoothing techniques for language modeling. Technical Report TR-10-98, Harvard University, 1998.

K. W. Church and W. A. Gale. A comparison of the enhanced Good-Turing and deleted estimation methods for estimating probabilities of English bigrams. Computer Speech and Language, 5:19–54, 1991.

J. R. Curran and M. Osborne. A very very large corpus doesn't always yield reliable estimates. In Proceedings of the Sixth Conference on Natural Language Learning, pages 126–131, 2002.

D. P. Dubhashi and D. Ranjan. Balls and bins: A study in negative dependence. Random Structures and Algorithms, 13(2):99–124, 1998.

P. Flajolet. Singularity analysis and asymptotics of Bernoulli sums. Theoretical Computer Science, 215:371–381, 1999.

I. J. Good. The population frequencies of species and the estimation of population parameters. Biometrika, 40(16):237–264, 1953.

I. J. Good. Turing's anticipation of empirical Bayes in connection with the cryptanalysis of the naval Enigma. Journal of Statistical Computation and Simulation, 66(2):101–112, 2000.

W. Hoeffding. On the distribution of the number of successes in independent trials. Annals of Mathematical Statistics, 27:713–721, 1956.

W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:13–30, 1963.

S. B. Holden. PAC-like upper bounds for the sample complexity of leave-one-out cross-validation. In Proceedings of the Ninth Annual ACM Workshop on Computational Learning Theory, pages 41–50, 1996.

S. M. Katz. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech and Signal Processing, 35(3):400–401, 1987.

M. Kearns and D. Ron. Algorithmic stability and sanity-check bounds for leave-one-out cross-validation. Neural Computation, 11(6):1427–1453, 1999.

S. Kutin. Algorithmic Stability and Ensemble-Based Learning. PhD thesis, University of Chicago, 2002.

D. McAllester and L. Ortiz. Concentration inequalities for the missing mass and for histogram rule error. Journal of Machine Learning Research, Special Issue on Learning Theory, 4(Oct):895–911, 2003.

D. McAllester and R. E. Schapire. On the convergence rate of Good-Turing estimators. In Proceedings of the Thirteenth Annual Conference on Computational Learning Theory, pages 1–6, 2000.

D. McAllester and R. E. Schapire. Learning theory and language modeling. In Seventeenth International Joint Conference on Artificial Intelligence, 2001.

C. McDiarmid. On the method of bounded differences. In Surveys in Combinatorics, pages 148–188. London Math. Soc. Lecture Notes 141, Cambridge University Press, 1989.

A. Orlitsky, N. P. Santhanam, and J. Zhang. Always Good Turing: Asymptotically optimal probability estimation. Science, 302(Oct):427–431, 2003.
Reference: text
sentIndex sentText sentNum sentScore
However, the log-loss cannot be directly calculated, since it depends on the underlying distribution, which is unknown. Therefore, estimating log-loss using the sample is important, although the sample cannot be independently used for both estimating the distribution and testing it. The hold-out estimation splits the sample into two parts: training and testing. The training part is used for learning the distribution, whereas the testing sample is used for evaluating the average per-word log-loss. The main disadvantage of this method is the fact that it uses only part of the available information for learning, whereas in practice one would like to use all the sample. A widely used general estimation method is called leave-one-out. Basically, it performs averaging all the possible estimations, where a single item is chosen for testing, and the rest are used for training. This procedure has an advantage of using the entire sample, and in addition it is rather simple and usually can be easily implemented. The existing theoretical analysis of the leave-oneout method (Holden, 1996; Kearns and Ron, 1999) shows general high probability concentration bounds for the generalization error. However, these techniques are not applicable in our setting. 1 We show that the leave-one-out estimation error for the log-loss is approximately O √m , for any underlying distribution and a general family of learning algorithms. It gives a theoretical justification for effective use of leave-one-out estimation for the log-loss. We also analyze the concentration of the log-loss itself, not based of an empirical measure. We address the characteristics of the underlying distribution affecting the log-loss. We find such a characteristic, defining a tight bound for the log-loss value. 1.3 Model and Semantics We denote the set of all words as V , and N = |V |. Let P be a distribution over V , where pw is the probability of a word w ∈ V . Given a sample S of size m, drawn i.i.d. using P, we denote the number of appearances of a word w in S as cS , or simply cw , when a sample S is clear from the context.1 We w define Sk = {w ∈ V : cS = k}, and nk = |Sk |. w For a claim Φ regarding a sample S, we write ∀δ S Φ[S] for P(Φ[S]) ≥ 1 − δ. For some error c ˜ bound function f (·), which holds with probability 1 − δ, we write O( f (·)) for O f (·) ln m , δ where c > 0 is some constant. 1.4 Paper Organization Section 2 shows several standard concentration inequalities, together with their technical applications regarding the maximum-likelihood approximation. Section 3 shows the error bounds for the 1. Unless mentioned otherwise, all further sample-dependent definitions depend on the sample S. 1233 D RUKH AND M ANSOUR k-hitting mass estimation. Section 4 bounds the error for the leave-one-out estimation of the logarithmic loss. Section 5 shows the bounds for the a priori logarithmic loss. Appendix A includes the technical proofs. 2. Concentration Inequalities In this section we state several standard Chernoff-style concentration inequalities. We also show some of their corollaries regarding the maximum-likelihood approximation of pw by pw = cw . ˆ m Lemma 1 (Hoeffding, 1963) Let Y = Y1 , . . . ,Yn be a set of n independent random variables, such that Yi ∈ [bi , bi + di ]. Then, for any ε > 0, P ∑ Yi − E ∑ Yi i >ε i ≤ 2 exp − 2ε2 ∑i di2 . The next lemma is a variant of an extension of Hoeffding’s inequality, by McDiarmid (1989). Lemma 2 Let Y = Y1 , . . . 
,Yn be a set of n independent random variables, and f (Y ) such that any change of Yi value changes f (Y ) by at most di , that is sup ∀ j=i,Y j =Y j (| f (Y ) − f (Y )|) ≤ di . Let d = maxi di . Then, ∀δY : | f (Y ) − E[ f (Y )]| ≤ d n ln 2 δ . 2 Lemma 3 (Angluin and Valiant, 1979) Let Y = Y1 , . . . ,Yn be a set of n independent random variables, where Yi ∈ [0, B]. Let µ = E [∑i Yi ]. Then, for any ε > 0, P ∑ Yi < µ − ε ≤ exp − ε2 , 2µB ∑ Yi > µ + ε ≤ exp − ε2 . (2µ + ε)B i P i Definition 4 (Dubhashi and Ranjan, 1998) A set of random variables Y1 , . . . ,Yn is called “negatively associated”, if it satisfies for any two disjoint subsets I and J of {1, . . . , n}, and any two non-decreasing, or any two non-increasing, functions f from R|I| to R and g from R|J| to R: E[ f (Yi : i ∈ I)g(Y j : j ∈ J)] ≤ E[ f (Yi : i ∈ I)]E[g(Y j : j ∈ J)]. The next lemma is based on the negative association analysis. It follows directly from Theorem 14 and Proposition 7 of Dubhashi and Ranjan (1998). 1234 C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS Lemma 5 For any set of N non-decreasing, or N non-increasing functions { fw : w ∈ V }, any Chernoff-style bound on ∑w∈V fw (cw ), pretending that cw are independent, is valid. In particular, Lemmas 1 and 2 apply for {Y1 , ...,Yn } = { fw (cw ) : w ∈ V }. The next lemma shows an explicit upper bound on the binomial distribution probability.2 Lemma 6 Let X ∼ Bin(n, p) be a sum of n i.i.d. Bernoulli random variables with p ∈ (0, 1). Let 1 µ = E[X] = np. For x ∈ (0, n], there exists some function Tx = exp 12x + O x12 , such that ∀k ∈ Tn 1 {0, . . . , n}, we have P(X = k) ≤ √ Tµ Tn−µ . For integral values of µ, the equality is achieved 2πµ(1−p) at k = µ. (Note that for x ≥ 1, we have Tx = Θ(1).) The next lemma deals with the number of successes in independent trials. Lemma 7 (Hoeffding, 1956) Let Y1 , . . . ,Yn ∈ {0, 1} be a sequence of independent trials, with pi = 1 E[Yi ]. Let X = ∑i Yi be the number of successes, and p = n ∑i pi be the average trial success probability. For any integers b and c such that 0 ≤ b ≤ np ≤ c ≤ n, we have c ∑ k=b n k p (1 − p)n−k ≤ P (b ≤ X ≤ c) ≤ 1. k Using the above lemma, the next lemma shows a general concentration bound for a sum of arbitrary real-valued functions of a multinomial distribution components. We show that with a small penalty, any Chernoff-style bound pretending the components being independent is valid.3 We recall that cS , or equivalently cw , is the number of appearances of the word w in a sample S of w size m. Lemma 8 Let {cw ∼ Bin(m, pw ) : w ∈ V } be independent binomial random variables. Let { fw (x) : w ∈ V } be a set of real valued functions. Let F = ∑w fw (cw ) and F = ∑w fw (cw ). For any ε > 0, √ P (|F − E [F]| > ε) ≤ 3 m P F − E F >ε . The following lemmas provide concentration bounds for maximum-likelihood estimation of pw by pw = cw . The first lemma shows that words with “high” probability have a “high” count in the ˆ m sample. Lemma 9 Let δ > 0, and λ ≥ 3. We have ∀δ S: ∀w ∈ V, s.t. mpw ≥ 3 ln 2m , δ |mpw − cw | ≤ ∀w ∈ V, s.t. mpw > λ ln 2m , δ cw > 1− 3mpw ln 3 λ 2m ; δ mpw . 2. Its proof is based on Stirling approximation directly, though local limit theorems could be used. This form of bound is needed for the proof of Theorem 30. 3. The negative association analysis (Lemma 5) shows that a sum of monotone functions of multinomial distribution components must obey Chernoff-style bounds pretending that the components are independent. 
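For intuition, the deviation guarantee of Lemma 9 is easy to probe numerically. The following Python sketch is an illustration only, not part of the paper's analysis: the Zipf-like distribution, the sample size, and all function names are our own assumptions, and the radius sqrt(3 m p_w ln(2m/delta)) for words with m p_w >= 3 ln(2m/delta) follows our reading of the lemma's first claim.

```python
import numpy as np

def zipf_distribution(n_words, s=1.2):
    # An assumed heavy-tailed toy distribution over the vocabulary.
    weights = 1.0 / np.arange(1, n_words + 1) ** s
    return weights / weights.sum()

def high_prob_deviation_check(p, m, delta, n_trials=200, seed=0):
    """Fraction of trials in which every word with m*p_w >= 3*ln(2m/delta)
    satisfies |m*p_w - c_w| <= sqrt(3*m*p_w*ln(2m/delta))."""
    rng = np.random.default_rng(seed)
    log_term = np.log(2 * m / delta)
    high = m * p >= 3 * log_term               # the "high-probability" words of Lemma 9
    radius = np.sqrt(3 * m * p * log_term)
    ok = 0
    for _ in range(n_trials):
        counts = rng.multinomial(m, p)          # c_w for one sample of size m
        if np.all(np.abs(m * p[high] - counts[high]) <= radius[high]):
            ok += 1
    return ok / n_trials

p = zipf_distribution(5_000)
print(high_prob_deviation_check(p, m=20_000, delta=0.05))  # typically close to 1.0
```

In such a run the simultaneous deviation bound for the high-probability words is rarely violated, which is the qualitative behavior the lemma asserts with probability at least 1 - delta.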
In some sense, our result extends this notion, since it does not require the functions to be monotone. 1235 D RUKH AND M ANSOUR The second lemma shows that words with “low” probability have a “low” count in the sample. Lemma 10 Let δ ∈ (0, 1), and m > 1. Then, ∀δ S: ∀w ∈ V such that mpw ≤ 3 ln m , we have cw ≤ δ 6 ln m . δ The following lemma derives the bound as a function of the count in the sample (and not as a function of the unknown probability). Lemma 11 Let δ > 0. Then, ∀δ S: cw > 18 ln 4m , δ ∀w ∈ V, s.t. |mpw − cw | ≤ 6cw ln 4m . δ The following is a general concentration bound. Lemma 12 For any δ > 0, and any word w ∈ V , we have ∀δ S, cw − pw < m 3 ln 2 δ . m The following lemma bounds the probability of words that do not appear in the sample. Lemma 13 Let δ > 0. Then, ∀δ S: ∀w ∈ S, / mpw < ln m . δ 3. K-Hitting Mass Estimation In this section our goal is to estimate the probability of the set of words appearing exactly k times in the sample, which we call “the k-hitting mass”. We analyze the Good-Turing estimator, the empirical frequencies estimator, and a combined estimator. ˆ Definition 14 We define the k-hitting mass Mk , its empirical frequencies estimator Mk , and its 4 Good-Turing estimator Gk as Mk = ∑ w∈Sk pw ˆ Mk = k nk m Gk = k+1 nk+1 . m−k The outline of this section is as follows. Definition 16 slightly redefines the k-hitting mass and its estimators. Lemma 17 shows that this redefinition has a negligible influence. Then, we analyze the estimation errors using the concentration inequalities from Section 2. Lemmas 20 and 21 bound the expectation of the Good-Turing estimator error, following McAllester and Schapire (2000). Lemma 23 bounds the deviation of the error, using the negative association analysis. A tighter bound, based on Lemma 8, is achieved at Theorem 25. Theorem 26 analyzes the error of the empirical frequencies estimator. Theorem 29 refers to the combined estimator. Finally, Theorem 30 shows a weak lower bound for the k-hitting mass estimation. 4. The Good-Turing estimator is usually defined as ( k+1 )nk+1 . The two definitions are almost identical for small values m k of k, as their quotient equals 1 − m . Following McAllester and Schapire (2000), our definition makes the calculations slightly simpler. 1236 C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS Definition 15 For any w ∈ V and i ∈ {0, · · · , m}, we define Xw,i as a random variable equal 1 if cw = i, and 0 otherwise. The following definition concentrates on words whose frequencies are close to their probabilities. Definition 16 Let α > 0 and k > 3α2 . We define Ik,α = pw ∈ Ik,α }. We define: Mk,α = ∑ pw = w∈Sk ∩Vk,α ∑ √ √ k−α k k+1+α k+1 m , m , and Vk,α = {w ∈ V : pw Xw,k , w∈Vk,α Gk,α = k+1 k+1 |Sk+1 ∩Vk,α | = ∑ Xw,k+1 , m−k m − k w∈Vk,α ˆ Mk,α = k k |Sk ∩Vk,α | = ∑ Xw,k . m m w∈Vk,α By Lemma 11, for large values of k the redefinition coincides with the original definition with high probability: Lemma 17 For δ > 0, let α = ˆ ˆ and Mk = Mk,α . 6 ln 4m . For k > 18 ln 4m , we have ∀δ S: Mk = Mk,α , Gk = Gk,α , δ δ Proof By Lemma 11, we have ∀δ S, ∀w : cw > 18 ln 4m , δ |mpw − cw | ≤ 6cw ln √ 4m = α cw . δ This means that any word w with cw = k has √ √ √ k−α k k+α k k+1+α k+1 ≤ pw ≤ < . m m m √ ˆ Therefore w ∈ Vk,α , completing the proof for Mk and Mk . Since α < k, any word w with cw = k + 1 has √ √ √ k−α k k+1−α k+1 k+1+α k+1 < ≤ pw ≤ , m m m which yields w ∈ Vk,α , completing the proof for Gk . Since the minimal probability of a word in Vk,α is Ω Lemma 18 Let α > 0 and k > 3α2 . 
Then, |Vk,α | = O 1237 m k k m . , we derive: D RUKH AND M ANSOUR Proof We have α < √ √k . 3 Any word w ∈ Vk,α has pw ≥ |Vk,α | < √ k−α k m > k m 1 1 − √3 . Therefore, m m 1 , =O 1 k 1− √ k 3 which completes the proof. Using Lemma 6, we derive: Lemma 19 Let α > 0 and 3α2 < k ≤ m . Let w ∈ Vk,α . Then, E[Xw,k ] = P(cw = k) = O 2 1 √ k . Proof Since cw ∼ Bin(m, pw ) is a binomial random variable, we use Lemma 6: E[Xw,k ] = P(cw = k) ≤ Tm 1 . 2πmpw (1 − pw ) Tmpw Tm(1−pw ) For w ∈ Vk,α , we have mpw = Ω(k), which implies 3α2 < k ≤ m 2, Tm Tmpw Tm(1−pw ) we have 1 2πmpw (1 − pw ) ≤ = O(1). Since pw ∈ Ik,α and 1 √ 2π k − α k √ k+1+α k+1 m 1− 1 < 1 2πk 1 − √3 1 1 − k+1 1 + √3 m 1 < 1 2πk 1 − √3 1− 1 2 1 +m 1 1 + √3 1 = O √ , k which completes the proof. 3.1 Good-Turing Estimator The following lemma, directly based on the definition of the binomial distribution, was shown in Theorem 1 of McAllester and Schapire (2000). Lemma 20 For any k < m, and w ∈ V , we have pw P(cw = k) = k+1 P(cw = k + 1)(1 − pw ). m−k The following lemma bounds the expectations of the redefined k-hitting mass, its Good-Turing estimator, and their difference. 1238 C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS Lemma 21 Let α > 0 and 3α2 < k < |E[Gk,α ] − E[Mk,α ]| = O √ k m m 2. We have E[Mk,α ] = O 1 √ k , E[Gk,α ] = O 1 √ k , and . Lemma 22 Let δ > 0, k ∈ {1, . . . , m}. Let U ⊆ V , such that |U| = O m . Let {bw : w ∈ U}, such k k that ∀w ∈ U, bw ≥ 0 and maxw∈U bw = O m . Let Xk = ∑w∈U bw Xw,k . We have ∀δ S: |Xk − E[Xk ]| = O k ln 1 δ . m Proof We define Yw,k = ∑i≤k Xw,i be random variable indicating cw ≤ k and Zw,k = ∑i < k. Let Yk = ∑w∈U bwYw,k and Zk = ∑w∈U bw Zw,k . We have ∑ bw Xw,k = ∑ bw [Yw,k − Zw,k ] = Yk − Zk . Xk = w∈U w∈U Both Yk and Zk , can be bounded using the Hoeffding inequality. Since {bwYw,k } and {bw Zw,k } are monotone with respect to {cw }, Lemma 5 applies for them. This means that the concentration of their sum is at least as tight as if they were independent. Recalling that |U| = O m and k k maxw∈U bw = O m , and using Lemma 2 for Yk and Zk , we have δ ∀ 2 S, |Yk − E[Yk ]| = O δ ∀ 2 S, |Zk − E[Zk ]| = O k m m k ln 1 , δ k m m k ln 1 . δ Therefore, |Xk − E[Xk ]| = |Yk − Zk − E[Yk − Zk ]| which completes the proof. ≤ |Yk − E[Yk ]| + |Zk − E[Zk ]| = O k ln 1 δ , m Using the negative association notion, we can show a preliminary bound for Good-Turing estimation error: Lemma 23 For δ > 0 and 18 ln 8m < k < m , we have ∀δ S: 2 δ k ln 1 δ |Gk − Mk | = O . m 1239 D RUKH AND M ANSOUR Proof Let α = 6 ln 8m . By Lemma 17, we have δ δ ∀ 2 S, Gk = Gk,α ∧ Mk = Mk,α . (1) By Lemma 21, |E[Gk − Mk ]| = |E[Gk,α − Mk,α ]| = O √ k m . (2) k+1 By Definition 16, Mk,α = ∑w∈Vk,α pw Xw,k and Gk,α = ∑w∈Vk,α m−k Xw,k+1 . By Lemma 18, we have |Vk,α | = O m . Therefore, using Lemma 22 with k for Mk,α , and with k + 1 for Gk,α , we have k δ ∀ 4 S, |Mk,α − E[Mk,α ]| = O δ ∀ 4 S, |Gk,α − E[Gk,α ]| = O k ln 1 δ m , (3) k ln 1 δ m . (4) Combining Equations (1), (2), (3), and (4), we have ∀δ S: |Gk − Mk | = |Gk,α − Mk,α | ≤ |Gk,α − E[Gk,α ]| + |Mk,α − E[Mk,α ]| + |E[Gk,α ] − E[Mk,α ]| √ k ln 1 k ln 1 k δ δ = O +O = O , m m m which completes the proof. Lemma 24 Let δ > 0, k > 0. Let U ⊆ V . Let {bw : w ∈ U} be a set of weights, such that bw ∈ [0, B]. Let Xk = ∑w∈U bw Xw,k , and µ = E[Xk ]. We have δ ∀ S, |Xk − µ| ≤ max √ √ 6 m 6 m 4Bµ ln , 2B ln δ δ . 
Proof By Lemma 8, combined with Lemma 3, we have ε2 B(2µ + ε) √ √ ε2 ε ≤ max 6 m exp − , 6 m exp − 4Bµ 2B √ P(|Xk − µ| > ε) ≤ 6 m exp − 1240 , (5) C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS where Equation (5) follows by considering ε ≤ 2µ and ε > 2µ separately. The lemma follows sub- stituting ε = max 4Bµ ln √ 6 m δ , 2B ln √ 6 m δ . We now derive the concentration bound on the error of the Good-Turing estimator. Theorem 25 For δ > 0 and 18 ln 8m < k < m , we have ∀δ S: 2 δ √ |Gk − Mk | = O Proof Let α = k ln m m δ + k ln m m δ . δ 6 ln 8m . Using Lemma 17, we have ∀ 2 S: Gk = Gk,α , and Mk = Mk,α . Recall that δ k+1 Mk,α = ∑w∈Vk,α pw Xw,k and Gk,α = ∑w∈Vk,α m−k Xw,k+1 . Both Mk,α and Gk,α are linear combinations k of Xw,k and Xw,k+1 , respectively, where the coefficients’ magnitude is O m , and the expectation, by Lemma 21, is O 1 √ k . By Lemma 24, we have δ √ k ln m δ m + k ln m δ m , (6) δ 4 √ k ln m δ m + k ln m δ m . (7) ∀ 4 S, |Mk,α − E[Mk,α ]| = O ∀ S, |Gk,α − E[Gk,α ]| = O Combining Equations (6), (7), and Lemma 21, we have ∀δ S: |Gk − Mk | = |Gk,α − Mk,α | ≤ |Gk,α − E[Gk,α ]| + |Mk,α − E[Mk,α ]| + |E[Gk,α ] − E[Mk,α ]| √ √ √ k ln m k ln m k ln m k ln m k δ δ δ δ + + + = O = O , m m m m m which completes the proof. 3.2 Empirical Frequencies Estimator ˆ In this section we bound the error of the empirical frequencies estimator Mk . Theorem 26 For δ > 0 and 18 ln 8m < k < m , we have 2 δ ∀δ S, √ k ln m δ ˆ |Mk − Mk | = O m 1241 3 2 + ln m δ . k D RUKH AND M ANSOUR Proof Let α = δ − ˆ ˆ 6 ln 8m . By Lemma 17, we have ∀ 2 S: Mk = Mk,α , and Mk = Mk,α . Let Vk,α = δ + k k {w ∈ Vk,α : pw < m }, and Vk,α = {w ∈ Vk,α : pw > m }. Let X− = k − pw Xw,k , m ∑ − w∈Vk,α X+ = ∑ + w∈Vk,α pw − k Xw,k , m and let X? specify either X− or X+ . By the definition, for w ∈ Vk,α we have By Lemma 18, |Vk,α | = O m k |E[X? ]| ≤ k m − pw = O . By Lemma 19, for w ∈ Vk,α we have E[Xw,k ] = O ∑ w∈Vk,α k − pw E[Xw,k ] = O m √ mα k 1 √ k m k =O 1 √ k expectation is O α k . . Therefore, α . k Both X− and X+ are linear combinations of Xw,k , where the coefficients are O √ α k m (8) √ α k m and the . Therefore, by Lemma 24, we have δ 4 ∀ S: |X? − E[X? ]| = O √ α3 k α4 √ + m m k . (9) ˆ By the definition of X− and X+ , Mk,α − Mk,α = X+ − X− . Combining Equations (8) and (9), we δ S: have ∀ ˆ ˆ |Mk − Mk | = |Mk,α − Mk,α | = |X+ − X− | ≤ |X+ − E[X+ ]| + |E[X+ ]| + |X− − E[X− ]| + |E[X− ]| √ 3 √ 3 k 4 k ln m 2 α α α δ √ + = O = O + + m k m m k since √ ab = O(a + b), and we use a = √ α3 k m and b = α . k ln m δ , k 3.3 Combined Estimator In this section we combine the Good-Turing estimator with the empirical frequencies to derive a combined estimator, which is uniformly accurate for all values of k. ˜ Definition 27 We define Mk , a combined estimator for Mk , by ˜ Mk = Gk ˆ Mk 1242 2 k ≤ m5 2 k > m5 . C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS Lemma 28 (McAllester and Schapire, 2000) Let k ∈ {0, . . . , m}. For any δ > 0, we have ln 1 m δ . k + ln m δ ∀δ S : |Gk − Mk | = O 2 ˜ ˜ The following theorem shows that Mk has an error bounded by O m− 5 , for any k. For small k, 2 2 we use Lemma 28. Theorem 25 is used for 18 ln 8m < k ≤ m 5 . Theorem 26 is used for m 5 < k < m . 2 δ ˜ The complete proof also handles k ≥ m . The theorem refers to Mk as a probability estimator, and 2 does not show that it is a probability distribution by itself. Theorem 29 Let δ > 0. For any k ∈ {0, . . . , m}, we have ˜ ˜ ∀δ S, |Mk − Mk | = O m− 5 . 2 The following theorem shows a weak lower bound for approximating Mk . 
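For intuition, the estimators of Definitions 14 and 27 can be computed directly from word counts. The sketch below is a minimal illustration under an assumed uniform toy distribution (the distribution, sample size, and names are ours, not from the paper): it compares the true k-hitting mass M_k with the Good-Turing, empirical-frequencies, and combined estimators, switching at k = m^(2/5) as in Definition 27.

```python
from collections import Counter
import numpy as np

def hitting_mass_and_estimators(sample, p, k):
    """Return (M_k, G_k, M_hat_k, M_tilde_k) for one sample; assumes k >= 1."""
    m = len(sample)
    counts = Counter(sample)
    n_k = sum(1 for c in counts.values() if c == k)
    n_k_plus_1 = sum(1 for c in counts.values() if c == k + 1)
    M_k = sum(p[w] for w, c in counts.items() if c == k)     # true k-hitting mass
    G_k = (k + 1) * n_k_plus_1 / (m - k)    # Good-Turing estimator (paper's variant)
    M_hat_k = k * n_k / m                   # empirical frequencies estimator
    M_tilde_k = G_k if k <= m ** 0.4 else M_hat_k   # combined estimator, switch at m^(2/5)
    return M_k, G_k, M_hat_k, M_tilde_k

# Toy experiment: uniform distribution over 2000 words, sample of size 10000.
rng = np.random.default_rng(0)
N = 2000
p = np.full(N, 1.0 / N)
sample = rng.choice(N, size=10_000, p=p).tolist()
for k in (1, 3, 10, 50):
    print(k, hitting_mass_and_estimators(sample, p, k))
```

With these toy parameters, small k falls on the Good-Turing side of the switch and large k on the empirical-frequencies side, mirroring the regimes in which Theorems 25 and 26 give the better bounds.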
It applies to estimating Mk based on a different independent sample. This is a very “weak” notation, since Gk , as well ˆ as Mk , are based on the same sample as Mk . Theorem 30 Suppose that the vocabulary consists of k m ), where 1 k √ m. The variance of Mk is Θ k m m k words distributed uniformly (that is pw = . 4. Leave-One-Out Estimation of Log-Loss Many NLP applications use log-loss as the learning performance criteria. Since the log-loss depends on the underlying probability P, its value cannot be explicitly calculated, and must be approximated. The main result of this section, Theorem 32, is an upper bound on the leave-one-out estimation of the log-loss, assuming a general family of learning algorithms. Given a sample S = {s1 , . . . , sm }, the goal of a learning algorithm is to approximate the true probability P by some probability Q. We denote the probability assigned by the learning algorithm to a word w by qw . Definition 31 We assume that any two words with equal sample frequency are assigned equal probabilities in Q, and therefore denote qw by q(cw ). Let the log-loss of a distribution Q be L = 1 1 ∑ pw ln qw = ∑ Mk ln q(k) . w∈V k≥0 Let the leave-one-out estimation, qw , be the probability assigned to w, when one of its instances is removed. We assume that any two words with equal sample frequency are assigned equal leaveone-out probability estimation, and therefore denote qw by q (cw ). We define the leave-one-out estimation of the log-loss as averaging the loss of each sample word, when it is extracted from the sample and pretended to be the test sample: 1243 D RUKH AND M ANSOUR ∑ Lleave−one = w∈V cw knk 1 1 =∑ ln ln . m qw k>0 m q (k) 1 1 Let Lw = L(cw ) = ln q(cw ) , and Lw = L (cw ) = ln q (cw ) . Let the maximal loss be Lmax = max max L(k), L (k + 1) . k In this section we discuss a family of learning algorithms, that receive the sample as an input. Assuming an accuracy parameter δ, we require the following properties to hold: 1. Starting from a certain number of appearances, the estimation is close to the sample frequency. Specifically, for some α, β ∈ [0, 1], ∀k ≥ ln 4m , δ q(k) = k−α . m−β (10) 2. The algorithm is stable when a single word is extracted from the sample: ∀m, 1 , m 1 L (k + 1) − L(k) = O S . n1 2 ≤ k ≤ 10 ln 4m , δ L (k + 1) − L(k) = O ∀m, ∀S s.t. nS > 0, k ∈ {0, 1}, 1 (11) (12) An example of such an algorithm is the following leave-one-out algorithm (we assume that the vocabulary is large enough so that n0 + n1 > 0): qw = N−n0 −1 (n0 +n1 )(m−1) cw −1 m−1 cw ≤ 1 cw ≥ 2. Equation (10) is satisfied by α = β = 1. Equation (11) is satisfied for k ≥ 2 by L(k) − L (k + 1) = 1 = O m . Equation (12) is satisfied for k ≤ 1: ln m−1 m−2 N − n0 − 1 m − 2 N − n0 − 2 m − 1 n0 + n1 + 1 m − 2 |L (2) − L(1)| = ln n0 + n1 m − 1 |L (1) − L(0)| = ln 1 1 1 =O + , N − n0 m n1 1 1 1 =O + . =O n0 + n1 m n1 =O The following is the main theorem of this section. It bounds the deviation between the between the true loss and the leave one out estimate. This bound shows that for a general family of learning algorithms, leave-one-out technique can be effectively used to estimate the logarithmic loss, given the sample only. The estimation error bound decreases roughly in proportion to the square root of the sample size, regardless of the underlying distribution. 1244 C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS Theorem 32 For a learning algorithm satisfying Equations (10), (11), and (12), and δ > 0, we have: δ ∀ S, |L − Lleave−one | = O Lmax (ln m )4 ln δ m ln m δ δ . 
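The leave-one-out estimate of Definition 31 can also be computed literally: each token is removed in turn, the example algorithm above is refit on the remaining m - 1 tokens, and the per-token losses are averaged. The sketch below is an illustration only, under an assumed uniform toy distribution over a vocabulary larger than the sample covers (so that n0 + n1 > 0, as the text requires); the distribution, sizes, and function names are ours.

```python
import math
from collections import Counter
import numpy as np

def q_of(count, counts, vocab_size, train_size):
    """Probability the example algorithm assigns to a word with this count,
    when trained on a sample of size train_size (assumes n0 + n1 > 0)."""
    n0 = vocab_size - sum(1 for c in counts.values() if c > 0)
    n1 = sum(1 for c in counts.values() if c == 1)
    if count <= 1:
        return (vocab_size - n0 - 1) / ((n0 + n1) * (train_size - 1))
    return (count - 1) / (train_size - 1)

def true_log_loss(p, counts, vocab_size, m):
    # L = sum_w p_w ln(1/q_w), with q fit on the full sample of size m.
    return sum(p[w] * math.log(1.0 / q_of(counts.get(w, 0), counts, vocab_size, m))
               for w in range(vocab_size))

def leave_one_out_log_loss(sample, vocab_size):
    m = len(sample)
    counts = Counter(sample)
    total = 0.0
    for w in sample:               # remove each token in turn and refit on m - 1 tokens
        counts[w] -= 1
        total += math.log(1.0 / q_of(counts[w], counts, vocab_size, m - 1))
        counts[w] += 1
    return total / m

# Toy comparison: the two quantities come out close, illustrating Theorem 32.
rng = np.random.default_rng(1)
N = 2000
p = np.full(N, 1.0 / N)
sample = rng.choice(N, size=3_000, p=p).tolist()
counts = Counter(sample)
print(true_log_loss(p, counts, N, len(sample)), leave_one_out_log_loss(sample, N))
```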
The proof of Theorem 32 bounds the estimation error separately for the high-probability and low-probability words. We use Lemma 20 (McAllester and Schapire, 2000) to bound the estimation error for low-probability words. The expected estimation error for the high-probability words is bounded elementarily using the definition of the binomial distribution (Lemma 33). Finally, we use McDiarmid’s inequality (Lemma 2) to bound its deviation. The next lemma shows that the expectation of the leave-one-out method is a good approximation for the per-word expectation of the logarithmic loss. Lemma 33 Let 0 ≤ α ≤ 1, and y ≥ 1. Let Bn ∼ Bin(n, p) be a binomial random variable. Let fy (x) = ln(max(x, y)). Then, 0 ≤ E p fy (Bn − α) − 3p Bn fy (Bn − α − 1) ≤ . n n Proof For a real valued function F (here F(x) = fy (x − α)), we have: E Bn F(Bn − 1) n n = ∑ x=0 n = p∑ x=1 n x x p (1 − p)n−x F(x − 1) x n n − 1 x−1 p (1 − p)(n−1)−(x−1) F(x − 1) x−1 = pE[F(Bn−1 )] , where we used n x x n = E p fy (Bn − α) − n−1 x−1 . Since Bn ∼ Bn−1 + B1 , we have: Bn fy (Bn − α − 1) n = p(E[ fy (Bn−1 + B1 − α)] − E[ fy (Bn−1 − α)]) max(Bn−1 + B1 − α, y) max(Bn−1 − α, y) max(Bn−1 − α + B1 , y + B1 ) ≤ pE ln max(Bn−1 − α, y) B1 = pE ln(1 + ) max(Bn−1 − α, y) B1 ≤ pE . max(Bn−1 − α, y) = pE ln Since B1 and Bn−1 are independent, we get 1245 D RUKH AND M ANSOUR pE B1 1 = pE[B1 ]E max(Bn−1 − α, y) max(Bn−1 − α, y) 1 = p2 E max(Bn−1 − α, y) n−1 = p2 ∑ n−1 x 1 p (1 − p)n−1−x x max(x − α, y) = p2 ∑ x+1 1 n−1 x p (1 − p)n−1−x x + 1 max(x − α, y) x x=0 n−1 x=0 ≤ ≤ n−1 p x+1 n max px+1 (1 − p)n−(x+1) ∑ n x max(x − α, y) x=0 x + 1 3p 3p (1 − (1 − p)n ) < . (13) n n Equation (13) follows by the following observation: x + 1 ≤ 3(x − α) for x ≥ 2, and x + 1 ≤ 2y for x ≤ 1. Finally, pE ln max(Bn−1 −α+B1 ,y) ≥ 0, which implies the lower bound of the lemma. max(Bn−1 −α,y) The following lemma bounds n2 as a function of n1 . Lemma 34 Let δ > 0. We have ∀δ S: n2 = O 3 5 Theorem 32 Proof Let yw = 1 − m ln 1 + n1 ln m . δ δ δ pw m − 2. By Lemma 9, with λ = 5, we have ∀ 2 S: 3 ln 4m 3pw ln 4m δ δ pw − cw ≤ , m m m √ 5 ln 4m δ ∀w ∈ V : pw > , cw > yw + 2 ≥ (5 − 15) ln 4m > ln 4m . δ δ m ∀w ∈ V : pw > Let VH = w ∈ V : pw > 5 ln 4m δ m |L − Lleave−one | ≤ (14) (15) and VL = V \VH . We have ∑ w∈VH pw Lw − cw L m w + ∑ w∈VL pw Lw − cw L m w . (16) We start by bounding the first term of Equation (16). By Equation (15), we have ∀w ∈ VH , cw > m−β w −α yw + 2 > ln 4m . Equation (10) implies that qw = cm−β , therefore Lw = ln cm−β = ln max(cw −α,yw ) , and δ w −α m−1−β Lw = ln cm−1−β = ln max(cw −1−α,yw ) . Let w −1−α H Errw = cw m−β m−β ln − pw ln . m max(cw − 1 − α, yw ) max(cw − α, yw ) We have 1246 C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS ∑ w∈VH cw L − pw Lw m w m−1−β cw ∑ m − β w∈VH m ∑ H Errw + ln ∑ = H Errw + O w∈VH ≤ w∈VH 1 . m (17) H We bound ∑w∈VH Errw using McDiarmid’s inequality. As in Lemma 33, let fw (x) = ln(max(x, yw )). We have H E Errw = ln(m − β)E cw cw − pw + E pw fw (cw − α) − fw (cw − 1 − α) . m m The first expectation equals 0, the second can be bounded using Lemma 33: w∈VH H E Errw ≤ ∑ w∈VH E pw fw (cw − α) − ≤ ∑ ∑ cw fw (cw − 1 − α) m 3pw 1 =O . m m w∈VH (18) H In order to use McDiarmid’s inequality, we bound the change of ∑w∈VH Errw as a function of a single change in the sample. Suppose that a word u is replaced by a word v. This results in decrease H for cu , and increase for cv . 
Recalling that yw = Ω(mpw ), the change of Erru , as well as the change H , is bounded by O ln m , as follows: of Errv m m−β The change of pu ln max(cu −α,yu ) would be 0 if cu − α ≤ yu . Otherwise, pu ln m−β m−β − pu ln max(cu − 1 − α, yu ) max(cu − α, yu ) ≤ pu [ln(cu − α) − ln(cu − 1 − α)] = pu ln 1 + 1 cu − 1 − α =O pu cu . m−β pu 1 Since cu ≥ yu = Ω(mpu ), the change is bounded by O( cu ) = O( m ). The change of cu ln max(cu −1−α,yu ) m would be O( ln m ) if cu − 1 − α ≤ yu . Otherwise, m m−β cu m−β cu − 1 ln − ln m max(cu − 2 − α, yu ) m max(cu − 1 − α, yu ) m−β 1 m−β m−β cu − 1 ln + ln − ln ≤ m max(cu − 2 − α, yu ) max(cu − 1 − α, yu ) m max(cu − 1 − α, yu ) ln m cu − 1 1 ln m ln 1 + + =O . ≤ m cu − 2 − α m m H The change of Errv is bounded in a similar way. 1247 D RUKH AND M ANSOUR δ By Equations (17) and (18), and Lemma 2, we have ∀ 16 S: ∑ w∈VH ≤ cw L − pw Lw m w ∑ w∈VH ≤ O ∑ H Errw − E ln m m H Errw ∑ + E H Errw +O w∈VH w∈VH 1 1 1 m ln + + δ m m = O 1 m (ln m)2 ln 1 δ . m (19) δ Next, we bound the second term of Equation (16). By Lemma 10, we have ∀ 4 S: ∀w ∈ V s.t. pw ≤ 3 ln 4m δ , cw ≤ 6 ln 4m . δ m (20) b Let b = 5 ln 4m . By Equations (14) and (20), for any w such that pw ≤ m , we have δ cw ≤ max pw + m 3pw ln m 4m δ √ (5 + 3 ∗ 5) ln 4m 2b δ ≤ < . m m 4m δ 6 ln , m k L Therefore ∀w ∈ VL , we have cw < 2b. Let nL = |VL ∩Sk |, GL = m−k+1 nL , and Mk = ∑w∈VL ∩Sk pw . k k k−1 We have ∑ w∈VL cw L − pw Lw m w 2b = 2b−1 knL L k L (k) − ∑ Mk L(k) ∑ m k=1 k=0 ≤ ∑ m − kk+ 1 L (k) − ∑ MkL L(k) 2b k=1 2b = k=0 2b−1 ∑ GL L (k) − ∑ MkL L(k) k−1 k=1 2b−1 = 2b−1 knL +O k=0 2b−1 k=0 k=0 1 1 − m−k+1 m bLmax m +O k=0 2b−1 ≤ k=1 2b−1 ∑ GL L (k + 1) − ∑ MkL L(k) k k=0 2b + ∑ knL L (k) k bLmax m ∑ GL |L (k + 1) − L(k)| + ∑ |GL − MkL |L(k) + O k k bLmax m . The first sum of Equation (21) is bounded using Equations (11) and (12), and Lemma 34: 1248 (21) C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS 2b−1 ∑ GL |L (k + 1) − L(k)| k k=0 2b−1 = ∑ GL |L (k + 1) − L(k)| + G0|L (1) − L(0)| + G1|L (2) − L(1)|. k (22) k=2 The first term of Equation (22) is bounded by Equation (11): 2b−1 ∑ k=2 2b−1 ∑ GL · O k GL |L (k + 1) − L(k)| ≤ k k=2 1 m =O 1 . m δ The other two terms are bounded using Lemma 34. For n1 > 0, we have ∀ 16 S, n2 = O b By Equation (12), we have G0 |L (1) − L(0)| + G1 |L (2) − L(1)| ≤ n1 1 ·O m n1 2n2 1 + ·O m−1 n1 ln 1 δ . m = O b (23) m ln 1 + n1 δ (24) For n1 = 0, Lemma 34 results in n2 = O b m ln 1 , and Equation (24) transforms into δ G1 |L (2) − L(1)| ≤ 2n2 Lmax = O bLmax m−1 Equations (22), (23), (24), and (25) sum up to 2b−1 ∑ GL |L (k + 1) − L(k)| k k=0 = O bLmax ln 1 δ . m ln 1 δ . m (25) (26) The second sum of Equation (21) is bounded using Lemma 28 separately for every k < 2b with δ L accuracy 16b . Since the proof of Lemma 28 also holds for GL and Mk (instead of Gk and Mk ), we k δ L have ∀ 8 S, for every k < 2b, |GL − Mk | = O b k ln b δ m . Therefore, together with Equations (21) and (26), we have ∑ w∈VL cw L − pw Lw m w ≤ O bLmax = O Lmax 2b−1 ln 1 δ + ∑ L(k)O b m k=0 b4 ln b δ . m 1249 ln b bLmax δ +O m m (27) . D RUKH AND M ANSOUR The proof follows by combining Equations (16), (19), and (27). 5. Log-Loss A Priori Section 4 bounds the error of the leave-one-out estimation of the log-loss. It shows that the log-loss can be effectively estimated, for a general family of learning algorithms. Another question to be considered is the log-loss distribution itself, without the empirical estimation. 
That is, how large (or low) is it expected to be, and which parameters of the distribution affect it. We denote the learning error (equivalent to the log-loss) as the KL-divergence between the true and the estimated distribution. We refer to a general family of learning algorithms, and show lower and upper bounds for the learning error. The upper bound (Theorem 39) can be divided to three parts. The first part is the missing mass. The other two build a trade-off between a threshold (lower thresholds leads to a lower bound), and the number of words with probability exceeding this threshold (fewer words lead to a lower bound). It seems that this number of words is a necessary lower bound, as we show at Theorem 35. 1 Theorem 35 Let the distribution be uniform: ∀w ∈ V : pw = N , with N m. Also, suppose that the learning algorithm just uses maximum-likelihood approximation, meaning qw = cw . Then, a typical m learning error would be Ω( N ). m The proof of Theorem 35 bases on the Pinsker inequality (Lemma 36). It first shows a lower bound for L1 norm between the true and the expected distributions, and then transforms it to the form of the learning error. Lemma 36 (Pinsker Inequality) Given any two distributions P and Q, we have 1 KL(P||Q) ≥ (L1 (P, Q))2 . 2 Theorem 35 Proof We first show that L1 (P, Q) concentrates near Ω N m . Then, we use Pinsker inequality to show lower bound5 of KL(P||Q). First we find a lower bound for E[|pw − qw |]. Since cw is a binomial random variable, σ2 [cw ] = m mpw (1 − pw ) = Ω N , and with some constant probability, |cw − mpw | > σ[cw ]. Therefore, we have E[|qw − pw |] = ≥ E 1 E[|cw − mpw |] m 1 1 σ[cw ]P(|cw − mpw | > σ[cw ]) = Ω m m ∑ |pw − qw | w∈V = Ω N√ 1 mN =Ω 1 =Ω √ mN m N N m . 5. This proof does not optimize the constants. Asymptotic analysis of logarithmic transform of binomial variables by Flajolet (1999) can be used to achieve explicit values for KL(P||Q). 1250 C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS 2 A single change in the sample changes L1 (P, Q) by at most m . Using McDiarmid inequality 1 (Lemma 2) on L1 (P, Q) as a function of sample words, we have ∀ 2 S: L1 (P, Q) ≥ E[L1 (P, Q)] − |L1 (P, Q) − E[L1 (P, Q)]| √ N m N −O =Ω . = Ω m m m Using Pinsker inequality (Lemma 36), we have 1 2 ∀ S, pw 1 ∑ pw ln qw ≥ 2 w∈V 2 ∑ |pw − qw | =Ω w∈V N , m which completes the proof. Definition 37 Let α ∈ (0, 1) and τ ≥ 1. We define an (absolute discounting) algorithm Aα,τ , which α “removes” m probability mass from words appearing at most τ times, and uniformly spreads it among the unseen words. We denote by n1...τ = ∑τ ni the number of words with count between 1 i=1 and τ. The learned probability Q is defined by : αn1...τ cw = 0 mn0 cw −α qw = 1 ≤ cw ≤ τ m cw τ < cw . m The α parameter can be set to some constant, or to make the missing mass match the Good1...τ Turing missing mass estimator, that is αnm = n1 . m Definition 38 Given a distribution P, and x ∈ [0, 1], let Fx = ∑w∈V :pw ≤x pw , and Nx = |{w ∈ V : pw > x}|. Clearly, for any distribution P, Fx is a monotone function of x, varying from 0 to 1, and Nx is a monotone function of x, varying from N to 0. Note that Nx is bounded by 1 . x The next theorem shows an upper bound for the learning error. √ Theorem 39 For any δ > 0 and λ > 3, such that τ < (λ − 3λ) ln 8m , the learning error of Aα,τ is δ bounded ∀δ S by pw 0 ≤ ∑ pw ln qw w∈V ≤ M0 ln + n0 ln 4m δ αn1...τ α F 8m + 1 − α λ lnm δ 1251 λ ln 8m δ + 1−α 3 ln 8 δ + M0 m 3 ln 8 3λ ln 8m δ δ + √ N λ ln 8m . 
√ δ m 2( λ − 3)2 m m D RUKH AND M ANSOUR The proof of Theorem 39 bases directly on Lemmas 40, 41, and 43. We can rewrite this bound roughly as Nλ λ ˜ ≤ O M0 + √ + m m m pw qw ∑ pw ln w∈V . This bound implies the characteristics of the distribution influencing the log-loss. It shows that a “good” distribution can involve many low-probability words, given that the missing mass is low. However, the learning error would increase if the dictionary included many mid-range3 probability words. For example, if a typical word’s probability were m− 4 , the bound would become 1 ˜ O M0 + m− 4 . Lemma 40 For any δ > 0, the learning error for non-appearing words can be bounded with high probability by ∑ ∀δ S, pw ln w∈S / pw qw ≤ M0 ln n0 ln m δ αn1...τ . Proof By Lemma 13, we have ∀δ S, the real probability of any non-appearing word does not exceed ln m δ m . Therefore, ∑ pw ln w∈S / pw qw m n0 α n1...τ ∑ pw ln pw ∑ pw ln = ln m m n0 δ m α n1...τ w∈S / ≤ w∈S / = M0 ln n0 ln m δ αn1...τ , which completes the proof. Lemma 41 Let δ > 0, λ > 0. Let VL = w ∈ V : pw ≤ λ ln 2m δ m for VL can be bounded with high probability by ∀δ S, ∑ pw ln w∈VL pw qw Proof We use ln(1 + x) ≤ x. ∑ λ ln 2m δ ≤ 1−α pw ln w∈VL For any appearing word w, qw ≥ 1−α m . pw qw ≤ ∑ w∈VL Therefore, 1252 , and VL = VL ∩ S. The learning error 3 ln 2 α δ + M0 + F 2m . m 1 − α λ lnm δ pw pw − qw . qw C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS pw − qw qw pw w∈VL ≤ m ∑ pw (pw − qw ) 1 − α w∈V = ∑ m 1−α L ∑ w∈VL ≤ m 1−α ≤ m λ ln 1−α m ≤ λ ln 1−α pw pw − ∑ pw pw − w∈VL 2m δ 2m δ ∑ w∈VL L cw m + pw − cw m α m ∑ pw m 1 − α w∈V L pw − w∈VL ∑ cw cw + ∑ pw − qw m m w∈V cw m + α ∑ pw 1 − α w∈V L + α F 2m . 1 − α λ lnm δ (28) We apply Lemma 12 on vL , the union of words in VL . Let pvL = ∑w∈VL pw and cvL = ∑w∈VL cw . We have ∀δ S: ∑ w∈VL pw − cw m w∈VL = pw − cw cw − ∑ pw − m m w∈V \S cw m ∑ L ≤ ∑ w∈VL pw − ≤ pvL − cvL + M0 m ≤ + ∑ pw w∈VL \S 3 ln 2 δ + M0 . m (29) The proof follows combining Equations (28) and (29). 2 x Lemma 42 Let 0 < ∆ < 1. For any x ∈ [−∆, ∆], we have ln(1 + x) ≥ x − 2(1−∆)2 . √ Lemma 43 Let δ > 0, λ > 3, such that τ < (λ − 3λ) ln 4m . Let the high-probability words set be δ VH = w ∈ V : pw > probability by λ ln 4m δ m ∀δ S, , and VH = VH ∩ S. The learning error for VH can be bounded with high ∑ w∈VH pw ln pw qw ≤ 3 ln 4 3λ ln 4m δ δ + √ N λ ln 4m . √ δ m 2( λ − 3)2 m m 1253 D RUKH AND M ANSOUR Proof ∑ w∈VH pw pw ln qw ∑ pw ln ∑ = pw = pw ln + cw m w∈VH mpw cw w∈VH ∑ pw ln w∈VH ∑ + cw m qw cw . cw − α pw ln w∈VH ,cw ≤τ (30) δ Using Lemma 9 with λ, we have ∀ 2 S: 3pw ln 4m cw δ ≤ , (31) m m √ 4m ∀w ∈ VH , cw ≥ (λ − 3λ) ln . δ √ This means that for a reasonable choice of τ (meaning τ < (λ − 3λ) ln 4m ), the second term of δ Equation (30) is 0, and VH = VH . Also, pw − ∀w ∈ VH , cw m 3pw ln 4m δ ≤ m − pw 1 ≤ pw pw Therefore, we can use Lemma 42 with ∆ = ∑ w∈VH pw ln mpw cw = − ≤ − = ∑ m 3 ln 2m δ = 2m m λ ln δ 3 λ: pw ln 1 + cw m w∈VH ∑ w∈VH ∑ w∈VH 3 . λ cw m pw − pw pw − pw − pw cw + pw − m cw m 1 2 1− 3 λ λ 2 √ √ λ− 3 2 2 ∑ w∈VH − pw pw cw m − pw pw 2 2 . (32) We apply Lemma 12 on the vH , the union of all words in VH . Let pvH = ∑w∈VH pw and cvH = ∑w∈VH cw . The bound on the first term of Equation (32) is: δ ∀ 2 S, ∑ w∈VH pw − cw m = pvH − cvH ≤ m 3 ln 4 δ . m Assuming that Equation (31) holds, the second term of Equation (32) can also be bounded: 1254 (33) C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS ∑ cw m w∈VH − pw pw 2 ≤ ∑ w∈VH 3 ln 4m 1 3pw ln 4m δ δ = N λ ln 4m . 
Acknowledgments

This work was supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778, by a grant from the Israel Science Foundation, and by an IBM faculty award. This publication only reflects the authors' views. We are grateful to David McAllester for his important contributions in the early stages of this research.

Appendix A. Technical Proofs

A.1 Concentration Inequalities

Lemma 6 Proof We use the Stirling approximation Γ(x + 1) = √(2πx) (x/e)^x T_x, where T_x = exp(1/(12x) + O(1/x^2)). We have

P(X = k) = C(n, k) p^k (1 − p)^{n−k}
  ≤ ( Γ(n + 1) / (Γ(µ + 1) Γ(n − µ + 1)) ) (µ/n)^µ ((n − µ)/n)^{n−µ}
  = ( √(2πn) / (√(2πµ) √(2π(n − µ))) ) ( n^n / (µ^µ (n − µ)^{n−µ}) ) ( µ^µ (n − µ)^{n−µ} / n^n ) ( T_n / (T_µ T_{n−µ}) )
  = √( n / (2πµ(n − µ)) ) T_n / (T_µ T_{n−µ})
  = ( 1 / √(2πµ(1 − p)) ) T_n / (T_µ T_{n−µ}).

Clearly, for integral values of µ, the equality is achieved at k = µ.

Lemma 8 Proof Let m' = Σ_{w∈V} c'_w. Using Lemma 7 for m', with b = c = E[m'] = m, the probability P(m' = m) achieves its minimum when ∀w ∈ V, p_w = 1/N. Under this assumption, we have m' ∼ Bin(mN, 1/N). Using Lemma 6, we have

P(m' = m) = ( 1 / √(2πmN (1/N)(1 − 1/N)) ) T_{mN} / (T_m T_{mN−m}) ≥ 1/(3√m).

Therefore, for any distribution {p_w : w ∈ V}, we have P(m' = m) ≥ 1/(3√m).

Obviously, E[F'] = Σ_w E[f_w(c'_w)] = E[F]. Also, the distribution of {c'_w} given that m' = m is identical to the distribution of {c_w}; therefore the distribution of F' given that m' = m is identical to the distribution of F. We have

P(|F' − E[F']| > ε) = Σ_i P(m' = i) P(|F' − E[F']| > ε | m' = i)
  ≥ P(m' = m) P(|F' − E[F']| > ε | m' = m)
  = P(m' = m) P(|F − E[F]| > ε)
  ≥ ( 1/(3√m) ) P(|F − E[F]| > ε),

which completes the proof.

Lemma 44 For any δ > 0, and a word w ∈ V such that p_w ≥ 3 ln(2/δ)/m, we have

P( |p_w − c_w/m| > √(3 p_w ln(2/δ)/m) ) ≤ δ.

Proof The proof follows by applying Lemma 3, substituting ε = √(3 m p_w ln(2/δ)). Note that for p_w ≥ 3 ln(2/δ)/m we have ε ≤ m p_w:

P( |p_w − c_w/m| ≥ √(3 p_w ln(2/δ)/m) ) = P( |m p_w − c_w| ≥ ε ) ≤ 2 exp( − ε^2 / (2E[c_w] + ε) ) ≤ 2 exp( − 3 m p_w ln(2/δ) / (3 m p_w) ) = δ,

which completes the proof.

Lemma 9 Proof There are at most m / (3 ln(2m/δ)) words with probability p_w ≥ 3 ln(2m/δ)/m. The first claim follows using Lemma 44, together with a union bound over all these words (with accuracy δ/m for each word). Using the first claim, we derive the second. We show a lower bound for c_w/m, using ln(2m/δ)/m < p_w/λ:

c_w/m ≥ p_w − √(3 p_w ln(2m/δ)/m) > p_w − √(3 p_w (p_w/λ)) = (1 − √(3/λ)) p_w.

The final inequality follows from simple algebra.

Lemma 10 Proof Let b = 3 ln(m/δ). Note that δ ∈ [0, 1] and m > 1 yield b > 2. First, suppose that there are up to m words with p_w ≤ b/m. For each such word, we apply Lemma 3 on c_w, with ε = b. We have:

P( c_w > 6 ln(m/δ) ) ≤ P( c_w > m p_w + ε ) ≤ exp( − b^2 / (2m p_w + b) ) ≤ δ/m.

Since we assume that there are up to m such words, the total mistake probability is δ. Now we assume the general case, that is, without any assumption on the number of words. Our goal is to reduce the problem to the former conditions, that is, to create m groups of words, each with probability smaller than b/m. We first create m empty sets v_1, . . . , v_m. Let the probability of each set v_i, p_{v_i}, be the sum of the probabilities of all the words it includes. Let the actual count of v_i, c_{v_i}, be the sum of the sample counts of all the words w it includes. We divide all the words w between these sets in a bin-packing-approximation manner. We sort the words w in decreasing probability order.
Then, we do the following loop: insert the next word w to the set vi with the currently smallest pvi . b We claim that pvi ≤ m for each vi at the end of the loop. If this inequality does not hold, then b some word w made this “overflow” first. Obviously, pw must be smaller than 2m , otherwise it would b be one of the first 2m < m words ordered, and would enter an empty set. If pw < 2m and it made b b an “overflow”, then the probability of each set at the moment w was entered must exceed 2m , since w must have entered the lightest set available. This means that the total probability of all words b entered by that moment was greater than m 2m > 1. Applying the case of m words to the sets v1 , . . . , vm , we have ∀δ S: for every vi , cvi ≤ 2b. Also, if the count of each set vi does not exceed 2b, so does the count of each word w ∈ vi . That is, P ∃w : pw ≤ b b , cw > 2b ≤ P ∃vi : pvi ≤ , cvi > 2b ≤ δ, m m which completes the proof. 1257 D RUKH AND M ANSOUR δ Lemma 11 Proof By Lemma 9 with some λ > 3 (which will be set later), we have ∀ 2 S: 3 ln 4m δ , m λ ln 4m δ ∀w : pw > , m ∀w : pw ≥ By Equation (35), for any word w such that √ λ + 3λ ln 4m . By Lemma 10, we have δ |mpw − cw | ≤ cw > 3 ln 4m δ m 3mpw ln 3 λ 1− ≤ pw ≤ 4m , δ mpw . λ ln 4m δ m (35) (36) , we have cw ≤ mpw + 3mpw ln 4m ≤ δ 4m . δ √ It means that for any w : mpw ≤ λ ln 4m , we have cw ≤ λ + 3λ ln 4m . This means that for δ δ √ 4m 4m any w such that cw > λ + 3λ ln δ , we have mpw > λ ln δ . By Equation (36), this means δ ∀ 2 S, ∀w s.t. pw ≤ mpw ≤ 1− 3 ln 4m δ m , cw ≤ 6 ln 1 √ 3 cw , and by Equation (35): λ |mpw − cw | ≤ 4m 3mpw ln ≤ δ 3cw ln 4m δ 1− 3 λ = √ 3cw λ ln 4m √ √δ . λ− 3 Substituting λ = 12 results in ∀δ S : ∀w s.t. cw > 18 ln 4m , δ |mpw − cw | ≤ 6cw ln 4m , δ which completes the proof. Lemma 12 Proof If pw ≥ 3 ln 2 δ m ∀δ S, , we can apply Lemma 44. We have cw − pw ≤ m 3pw ln 2 δ ≤ m 3 ln 2 δ . m Otherwise, we can apply Lemma 10. We have: ∀δ S, cw cw − pw ≤ max pw , m m which completes the proof. 1258 ≤ 6 ln m δ ≤ m 3 ln 2 δ , m C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS Lemma 13 Proof Let b = ln m . We note that there are at most δ P ∃w : cw = 0, pw ≥ b m ≤ ∑ m b b words with probability pw ≥ m . P(cw = 0) b w:pw ≥ m ∑ = m b (1 − pw ) ≤ 1− b m b m m < me−b = δ, w:pw ≥ m which completes the proof. A.2 K-Hitting Mass Estimation Lemma 21 Proof We have ∑w∈Vk,α pw ≤ 1. Using Lemma 19, we bound P(cw = k) and P(cw = k + 1): E[Mk,α ] = 1 pw P(cw = k) = O √ k w∈Vk,α ∑ k+1 P(cw = k + 1) − pw P(cw = k) w∈Vk,α m − k √ k+1 k = ∑ pw . P(cw = k + 1) = O m−k m w∈Vk,α |E[Gk,α ] − E[Mk,α ]| = ∑ Equation (37) follows by Lemma 20. By Lemma 18, we have |Vk,α | = O E[Gk,α ] = km 1 k+1 ∑ P(cw = k + 1) = O m k √k m − k w∈Vk,α m k (37) : 1 =O √ , k which completes the proof. Theorem 29 Proof The proof is done by examining four cases of k. For k ≤ 18 ln 8m , we can use δ Lemma 28. We have ˜ ∀δ S, |Mk − Mk | = |Gk − Mk | = O ln 1 δ m k + ln m δ ˜ =O 1 √ m . 2 For 18 ln 8m < k ≤ m 5 , we can use Theorem 25. We have δ ˜ ∀δ S, |Mk − Mk | = |Gk − Mk | = O 1259 √ k ln m δ m + k ln m δ m 2 ˜ = O m− 5 . D RUKH AND M ANSOUR 2 For m 5 < k < m , we can use Theorem 26. We have 2 δ ˜ ˆ ∀ S, |Mk − Mk | = |Mk − Mk | = O For k ≥ m , let α = 2 √ 3 k(ln m ) 2 δ m + √ ln m δ k 2 ˜ = O m− 5 . δ ˆ ˆ 6 ln 8m . By Lemma 17, we have ∀ 2 S, Mk = Mk,α ∧ Mk = Mk,α . By Lemma δ 18, |Vk,α | = O m = O(1). Let c be the bound on |Vk,α |. 
Using Lemma 12 for each w ∈ Vk,α with k δ accuracy 2c , we have cw − pw = O m δ ∀ 2 S, ∀w ∈ Vk,α , Therefore, we have ∀δ S: ˜ ˆ |Mk − Mk | = |Mk,α − Mk,α | ≤ ∑ w∈Vk,α which completes the proof. 1 δ ln . m ln 1 1 δ ˜ =O √ , m m k − pw Xw,k = O m Theorem 30 Proof First, we show that for any two words u and v, Cov(Xu,k , Xv,k ) = Θ k that {cv |cu = k} ∼ Bin m − k, m−k . By Lemma 6, we have: P(cu = k) = P(cv = k) = 1 2πk 1 − P(cv = k|cu = k) = k m Tm , Tk Tm−k 1 k 2πk 1 − m−k Tm−k . Tk Tm−2k Cov(Xu,k , Xv,k ) = E[Xu,k Xv,k ] − E[Xu,k ]E[Xv,k ] m−k m 1260 1 k 1− m . Note (38) Using Tx = Θ(1) for x ≥ k, we have = P(cu = k)[P(cv = k|cu = k) − P(cv = k)] 1 Tm Tm−k 1 − = Tk Tm−k Tk Tm−2k k k 1 − m−k 2πk 1 − m 1 Tm−k Tm 1 1 = Θ − k 1 − k Tm−2k 1 − k Tm−k k m2 Tm Tk Tm−k C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS 1 1 Tm−k k Tm−2k = Θ 1− 1 + k m−k 1 − 1− k m Tm Tm−k . − Tm−2k Tm−k k 1− m (39) We can bound the first term of Equation (39): 1 1− k m−k 1 − 1− = k m k 1 − m−k = Θ 1− Since Tx = exp 1 12x +O Tm Tm−k − Tm−2k Tm−k 1 x2 k 1 − m−k k 1− m k 1− m − k k −1+ m m−k 1 = 1 + 12x + O 1 x2 =Θ k 1− m + k 1− m + k2 m2 . k 1 − m−k k 1 − m−k (40) for x ≥ m − 2k (note that k m), we have 2 Tm−k − Tm Tm−2k Tm−2k Tm−k 1 1 1 1 1 − − +O = Tm−2k Tm−k 6(m − k) 12m 12(m − 2k) m2 k2 1 = −Θ +O . 3 m m2 = Combining Equations (39), (40), and (41), we have Cov(Xu,k , Xv,k ) = Θ 1 k Θ Now we show that σ2 [Xw,k ] = Θ 1 √ k k2 m2 k2 m3 −Θ +O 1 m2 =Θ k m2 . By Equation (38), we have 1 σ2 [Xw,k ] = P(cw = k)(1 − P(cw = k)) = Θ √ k 1 1−Θ √ k 1 =Θ √ . k Now we find a bound for σ2 [Mk ]. σ2 [Mk ] = σ2 = ∑ w = m k = Θ . ∑ pw Xw,k w 2 2 pw σ [Xw,k ] + ∑ pu pvCov(Xu,k , Xv,k ) u=v k 2 1 Θ √ m k √ k , m + 1261 m m −1 k k k m 2 Θ k m2 (41) D RUKH AND M ANSOUR which completes the proof. A.3 Leave-One-Out Estimation of Log-Loss δ Lemma 34 Proof Using Lemma 9, we have ∀ 2 : n2 = |U ∩ S2 | and n1 = |U ∩ S1 |, where U = {w ∈ V : mpw ≤ c ln m }, for some c > 0. Let n2 = |U ∩ S2 | and n1 = |U ∩ S1 |. Let b = ln m . δ δ First, we show that E[n2 ] = O(bE[n1 ]). E[n2 ] = m 2 p (1 − pw )m−2 2 w ∑ w∈U = m − 1 pw 2 1 − pw ∑ mpw (1 − pw )m−1 w∈U = ∑ mpw (1 − pw )m−1 O(b) = O(bE[n1 ]). w∈U Next, we bound the deviation of n1 and n2 . A single change in the sample changes n1 , as well as n2 , by at most 1. Therefore, using Lemma 2 for n1 and n2 , we have n1 ≥ E[n1 ] − O δ ∀4 S : m ln 1 δ , n2 ≤ E[n2 ] + O δ ∀4 S : m ln 1 δ . Therefore, n2 ≤ E[n2 ] + O m ln 1 δ = O bE[n1 ] + m ln 1 δ = O b n1 + m ln 1 δ , which completes the proof. A.4 Log-Loss A Priori Theorem 39 Proof The KL-divergence is of course non-negative. By Lemma 40, we have δ ∀ 4 S, ∑ pw ln w∈S / pw qw δ By Lemma 41 with λ, we have ∀ 4 S: 1262 ≤ M0 ln n0 ln 4m δ αn1...τ . (42) C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS ∑ pw ln λ ln 8m w∈S:pw ≤ m δ pw qw λ ln 1−α 8m δ ≤ 3 ln α + M0 + F 8m . m 1 − α λ lnm δ 8 δ (43) δ By Lemma 43 with λ, we have ∀ 2 S: ∑ pw ln λ ln 8m w∈S:pw > m δ pw qw ≤ 3 ln 8 3λ ln 8m δ δ N λ ln 8m . + √ √ δ m 2( λ − 3)2 m m (44) The proof follows by combining Equations (42), (43), and (44). Lemma 42 Proof Let f (x) = x2 2(1−∆)2 − x + ln(1 + x). Then, f (x) = f (x) = x 1 , −1+ (1 − ∆)2 1+x 1 1 − (1 + x)2 2 (1 − ∆) . Clearly, f (0) = f (0) = 0. Also, f (x) ≥ 0 for any x ∈ [−∆, ∆]. Therefore, f (x) is non-negative in the range above, and the lemma follows. References D. Angluin and L. G. Valiant. Fast probabilistic algorithms for Hamiltonian circuits and matchings. Journal of Computer and System Sciences, 18:155–193, 1979. S. F. Chen. 
Building Probabilistic Models for Natural Language. PhD thesis, Harvard University, 1996.
S. F. Chen and J. Goodman. An empirical study of smoothing techniques for language modeling. Technical Report TR-10-98, Harvard University, 1998.
K. W. Church and W. A. Gale. A comparison of the enhanced Good-Turing and deleted estimation methods for estimating probabilities of English bigrams. Computer Speech and Language, 5:19–54, 1991.
J. R. Curran and M. Osborne. A very very large corpus doesn't always yield reliable estimates. In Proceedings of the Sixth Conference on Natural Language Learning, pages 126–131, 2002.
D. P. Dubhashi and D. Ranjan. Balls and bins: A study in negative dependence. Random Structures and Algorithms, 13(2):99–124, 1998.
P. Flajolet. Singularity analysis and asymptotics of Bernoulli sums. Theoretical Computer Science, 215:371–381, 1999.
I. J. Good. The population frequencies of species and the estimation of population parameters. Biometrika, 40(16):237–264, 1953.
I. J. Good. Turing's anticipation of empirical Bayes in connection with the cryptanalysis of the naval Enigma. Journal of Statistical Computation and Simulation, 66(2):101–112, 2000.
W. Hoeffding. On the distribution of the number of successes in independent trials. Annals of Mathematical Statistics, 27:713–721, 1956.
W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:13–30, 1963.
S. B. Holden. PAC-like upper bounds for the sample complexity of leave-one-out cross-validation. In Proceedings of the Ninth Annual ACM Workshop on Computational Learning Theory, pages 41–50, 1996.
S. M. Katz. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech and Signal Processing, 35(3):400–401, 1987.
M. Kearns and D. Ron. Algorithmic stability and sanity-check bounds for leave-one-out cross-validation. Neural Computation, 11(6):1427–1453, 1999.
S. Kutin. Algorithmic Stability and Ensemble-Based Learning. PhD thesis, University of Chicago, 2002.
D. McAllester and L. Ortiz. Concentration inequalities for the missing mass and for histogram rule error. Journal of Machine Learning Research, Special Issue on Learning Theory, 4(Oct):895–911, 2003.
D. McAllester and R. E. Schapire. On the convergence rate of Good-Turing estimators. In Proceedings of the Thirteenth Annual Conference on Computational Learning Theory, pages 1–6, 2000.
D. McAllester and R. E. Schapire. Learning theory and language modeling. In Seventeenth International Joint Conference on Artificial Intelligence, 2001.
C. McDiarmid. On the method of bounded differences. In Surveys in Combinatorics, pages 148–188. London Math. Soc. Lecture Notes 141, Cambridge University Press, 1989.
A. Orlitsky, N. P. Santhanam, and J. Zhang. Always Good Turing: Asymptotically optimal probability estimation. Science, 302(Oct):427–431, 2003.
Then, we do the following loop: insert the next word w to the set vi with the currently smallest pvi . b We claim that pvi ≤ m for each vi at the end of the loop. If this inequality does not hold, then b some word w made this “overflow” first. Obviously, pw must be smaller than 2m , otherwise it would b be one of the first 2m < m words ordered, and would enter an empty set. If pw < 2m and it made b b an “overflow”, then the probability of each set at the moment w was entered must exceed 2m , since w must have entered the lightest set available. This means that the total probability of all words b entered by that moment was greater than m 2m > 1. Applying the case of m words to the sets v1 , . . . , vm , we have ∀δ S: for every vi , cvi ≤ 2b. Also, if the count of each set vi does not exceed 2b, so does the count of each word w ∈ vi . That is, P ∃w : pw ≤ b b , cw > 2b ≤ P ∃vi : pvi ≤ , cvi > 2b ≤ δ, m m which completes the proof. 1257 D RUKH AND M ANSOUR δ Lemma 11 Proof By Lemma 9 with some λ > 3 (which will be set later), we have ∀ 2 S: 3 ln 4m δ , m λ ln 4m δ ∀w : pw > , m ∀w : pw ≥ By Equation (35), for any word w such that √ λ + 3λ ln 4m . By Lemma 10, we have δ |mpw − cw | ≤ cw > 3 ln 4m δ m 3mpw ln 3 λ 1− ≤ pw ≤ 4m , δ mpw . λ ln 4m δ m (35) (36) , we have cw ≤ mpw + 3mpw ln 4m ≤ δ 4m . δ √ It means that for any w : mpw ≤ λ ln 4m , we have cw ≤ λ + 3λ ln 4m . This means that for δ δ √ 4m 4m any w such that cw > λ + 3λ ln δ , we have mpw > λ ln δ . By Equation (36), this means δ ∀ 2 S, ∀w s.t. pw ≤ mpw ≤ 1− 3 ln 4m δ m , cw ≤ 6 ln 1 √ 3 cw , and by Equation (35): λ |mpw − cw | ≤ 4m 3mpw ln ≤ δ 3cw ln 4m δ 1− 3 λ = √ 3cw λ ln 4m √ √δ . λ− 3 Substituting λ = 12 results in ∀δ S : ∀w s.t. cw > 18 ln 4m , δ |mpw − cw | ≤ 6cw ln 4m , δ which completes the proof. Lemma 12 Proof If pw ≥ 3 ln 2 δ m ∀δ S, , we can apply Lemma 44. We have cw − pw ≤ m 3pw ln 2 δ ≤ m 3 ln 2 δ . m Otherwise, we can apply Lemma 10. We have: ∀δ S, cw cw − pw ≤ max pw , m m which completes the proof. 1258 ≤ 6 ln m δ ≤ m 3 ln 2 δ , m C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS Lemma 13 Proof Let b = ln m . We note that there are at most δ P ∃w : cw = 0, pw ≥ b m ≤ ∑ m b b words with probability pw ≥ m . P(cw = 0) b w:pw ≥ m ∑ = m b (1 − pw ) ≤ 1− b m b m m < me−b = δ, w:pw ≥ m which completes the proof. A.2 K-Hitting Mass Estimation Lemma 21 Proof We have ∑w∈Vk,α pw ≤ 1. Using Lemma 19, we bound P(cw = k) and P(cw = k + 1): E[Mk,α ] = 1 pw P(cw = k) = O √ k w∈Vk,α ∑ k+1 P(cw = k + 1) − pw P(cw = k) w∈Vk,α m − k √ k+1 k = ∑ pw . P(cw = k + 1) = O m−k m w∈Vk,α |E[Gk,α ] − E[Mk,α ]| = ∑ Equation (37) follows by Lemma 20. By Lemma 18, we have |Vk,α | = O E[Gk,α ] = km 1 k+1 ∑ P(cw = k + 1) = O m k √k m − k w∈Vk,α m k (37) : 1 =O √ , k which completes the proof. Theorem 29 Proof The proof is done by examining four cases of k. For k ≤ 18 ln 8m , we can use δ Lemma 28. We have ˜ ∀δ S, |Mk − Mk | = |Gk − Mk | = O ln 1 δ m k + ln m δ ˜ =O 1 √ m . 2 For 18 ln 8m < k ≤ m 5 , we can use Theorem 25. We have δ ˜ ∀δ S, |Mk − Mk | = |Gk − Mk | = O 1259 √ k ln m δ m + k ln m δ m 2 ˜ = O m− 5 . D RUKH AND M ANSOUR 2 For m 5 < k < m , we can use Theorem 26. We have 2 δ ˜ ˆ ∀ S, |Mk − Mk | = |Mk − Mk | = O For k ≥ m , let α = 2 √ 3 k(ln m ) 2 δ m + √ ln m δ k 2 ˜ = O m− 5 . δ ˆ ˆ 6 ln 8m . By Lemma 17, we have ∀ 2 S, Mk = Mk,α ∧ Mk = Mk,α . By Lemma δ 18, |Vk,α | = O m = O(1). Let c be the bound on |Vk,α |. 
Using Lemma 12 for each w ∈ Vk,α with k δ accuracy 2c , we have cw − pw = O m δ ∀ 2 S, ∀w ∈ Vk,α , Therefore, we have ∀δ S: ˜ ˆ |Mk − Mk | = |Mk,α − Mk,α | ≤ ∑ w∈Vk,α which completes the proof. 1 δ ln . m ln 1 1 δ ˜ =O √ , m m k − pw Xw,k = O m Theorem 30 Proof First, we show that for any two words u and v, Cov(Xu,k , Xv,k ) = Θ k that {cv |cu = k} ∼ Bin m − k, m−k . By Lemma 6, we have: P(cu = k) = P(cv = k) = 1 2πk 1 − P(cv = k|cu = k) = k m Tm , Tk Tm−k 1 k 2πk 1 − m−k Tm−k . Tk Tm−2k Cov(Xu,k , Xv,k ) = E[Xu,k Xv,k ] − E[Xu,k ]E[Xv,k ] m−k m 1260 1 k 1− m . Note (38) Using Tx = Θ(1) for x ≥ k, we have = P(cu = k)[P(cv = k|cu = k) − P(cv = k)] 1 Tm Tm−k 1 − = Tk Tm−k Tk Tm−2k k k 1 − m−k 2πk 1 − m 1 Tm−k Tm 1 1 = Θ − k 1 − k Tm−2k 1 − k Tm−k k m2 Tm Tk Tm−k C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS 1 1 Tm−k k Tm−2k = Θ 1− 1 + k m−k 1 − 1− k m Tm Tm−k . − Tm−2k Tm−k k 1− m (39) We can bound the first term of Equation (39): 1 1− k m−k 1 − 1− = k m k 1 − m−k = Θ 1− Since Tx = exp 1 12x +O Tm Tm−k − Tm−2k Tm−k 1 x2 k 1 − m−k k 1− m k 1− m − k k −1+ m m−k 1 = 1 + 12x + O 1 x2 =Θ k 1− m + k 1− m + k2 m2 . k 1 − m−k k 1 − m−k (40) for x ≥ m − 2k (note that k m), we have 2 Tm−k − Tm Tm−2k Tm−2k Tm−k 1 1 1 1 1 − − +O = Tm−2k Tm−k 6(m − k) 12m 12(m − 2k) m2 k2 1 = −Θ +O . 3 m m2 = Combining Equations (39), (40), and (41), we have Cov(Xu,k , Xv,k ) = Θ 1 k Θ Now we show that σ2 [Xw,k ] = Θ 1 √ k k2 m2 k2 m3 −Θ +O 1 m2 =Θ k m2 . By Equation (38), we have 1 σ2 [Xw,k ] = P(cw = k)(1 − P(cw = k)) = Θ √ k 1 1−Θ √ k 1 =Θ √ . k Now we find a bound for σ2 [Mk ]. σ2 [Mk ] = σ2 = ∑ w = m k = Θ . ∑ pw Xw,k w 2 2 pw σ [Xw,k ] + ∑ pu pvCov(Xu,k , Xv,k ) u=v k 2 1 Θ √ m k √ k , m + 1261 m m −1 k k k m 2 Θ k m2 (41) D RUKH AND M ANSOUR which completes the proof. A.3 Leave-One-Out Estimation of Log-Loss δ Lemma 34 Proof Using Lemma 9, we have ∀ 2 : n2 = |U ∩ S2 | and n1 = |U ∩ S1 |, where U = {w ∈ V : mpw ≤ c ln m }, for some c > 0. Let n2 = |U ∩ S2 | and n1 = |U ∩ S1 |. Let b = ln m . δ δ First, we show that E[n2 ] = O(bE[n1 ]). E[n2 ] = m 2 p (1 − pw )m−2 2 w ∑ w∈U = m − 1 pw 2 1 − pw ∑ mpw (1 − pw )m−1 w∈U = ∑ mpw (1 − pw )m−1 O(b) = O(bE[n1 ]). w∈U Next, we bound the deviation of n1 and n2 . A single change in the sample changes n1 , as well as n2 , by at most 1. Therefore, using Lemma 2 for n1 and n2 , we have n1 ≥ E[n1 ] − O δ ∀4 S : m ln 1 δ , n2 ≤ E[n2 ] + O δ ∀4 S : m ln 1 δ . Therefore, n2 ≤ E[n2 ] + O m ln 1 δ = O bE[n1 ] + m ln 1 δ = O b n1 + m ln 1 δ , which completes the proof. A.4 Log-Loss A Priori Theorem 39 Proof The KL-divergence is of course non-negative. By Lemma 40, we have δ ∀ 4 S, ∑ pw ln w∈S / pw qw δ By Lemma 41 with λ, we have ∀ 4 S: 1262 ≤ M0 ln n0 ln 4m δ αn1...τ . (42) C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS ∑ pw ln λ ln 8m w∈S:pw ≤ m δ pw qw λ ln 1−α 8m δ ≤ 3 ln α + M0 + F 8m . m 1 − α λ lnm δ 8 δ (43) δ By Lemma 43 with λ, we have ∀ 2 S: ∑ pw ln λ ln 8m w∈S:pw > m δ pw qw ≤ 3 ln 8 3λ ln 8m δ δ N λ ln 8m . + √ √ δ m 2( λ − 3)2 m m (44) The proof follows by combining Equations (42), (43), and (44). Lemma 42 Proof Let f (x) = x2 2(1−∆)2 − x + ln(1 + x). Then, f (x) = f (x) = x 1 , −1+ (1 − ∆)2 1+x 1 1 − (1 + x)2 2 (1 − ∆) . Clearly, f (0) = f (0) = 0. Also, f (x) ≥ 0 for any x ∈ [−∆, ∆]. Therefore, f (x) is non-negative in the range above, and the lemma follows. References D. Angluin and L. G. Valiant. Fast probabilistic algorithms for Hamiltonian circuits and matchings. Journal of Computer and System Sciences, 18:155–193, 1979. S. F. Chen. 
Building Probabilistic Models for Natural Language. PhD thesis, Harvard University, 1996.
S. F. Chen and J. Goodman. An empirical study of smoothing techniques for language modeling. Technical Report TR-10-98, Harvard University, 1998.
K. W. Church and W. A. Gale. A comparison of the enhanced Good-Turing and deleted estimation methods for estimating probabilities of English bigrams. Computer Speech and Language, 5:19–54, 1991.
J. R. Curran and M. Osborne. A very very large corpus doesn’t always yield reliable estimates. In Proceedings of the Sixth Conference on Natural Language Learning, pages 126–131, 2002.
D. P. Dubhashi and D. Ranjan. Balls and bins: A study in negative dependence. Random Structures and Algorithms, 13(2):99–124, 1998.
P. Flajolet. Singularity analysis and asymptotics of Bernoulli sums. Theoretical Computer Science, 215:371–381, 1999.
I. J. Good. The population frequencies of species and the estimation of population parameters. Biometrika, 40(16):237–264, 1953.
I. J. Good. Turing’s anticipation of empirical Bayes in connection with the cryptanalysis of the naval Enigma. Journal of Statistical Computation and Simulation, 66(2):101–112, 2000.
W. Hoeffding. On the distribution of the number of successes in independent trials. Annals of Mathematical Statistics, 27:713–721, 1956.
W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:13–30, 1963.
S. B. Holden. PAC-like upper bounds for the sample complexity of leave-one-out cross-validation. In Proceedings of the Ninth Annual ACM Workshop on Computational Learning Theory, pages 41–50, 1996.
S. M. Katz. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech and Signal Processing, 35(3):400–401, 1987.
M. Kearns and D. Ron. Algorithmic stability and sanity-check bounds for leave-one-out cross-validation. Neural Computation, 11(6):1427–1453, 1999.
S. Kutin. Algorithmic Stability and Ensemble-Based Learning. PhD thesis, University of Chicago, 2002.
D. McAllester and L. Ortiz. Concentration inequalities for the missing mass and for histogram rule error. Journal of Machine Learning Research, Special Issue on Learning Theory, 4(Oct):895–911, 2003.
D. McAllester and R. E. Schapire. On the convergence rate of Good-Turing estimators. In Proceedings of the Thirteenth Annual Conference on Computational Learning Theory, pages 1–6, 2000.
D. McAllester and R. E. Schapire. Learning theory and language modeling. In Seventeenth International Joint Conference on Artificial Intelligence, 2001.
C. McDiarmid. On the method of bounded differences. In Surveys in Combinatorics, pages 148–188. London Math. Soc. Lectures Notes 141, Cambridge University Press, 1989.
A. Orlitsky, N. P. Santhanam, and J. Zhang. Always Good Turing: Asymptotically optimal probability estimation. Science, 302(Oct):427–431, 2003.
same-paper 1 0.78050309 22 jmlr-2005-Concentration Bounds for Unigram Language Models
Author: Evgeny Drukh, Yishay Mansour
Abstract: We show several high-probability concentration bounds for learning unigram language models. One interesting quantity is the probability of all words appearing exactly k times in a sample of size m. A standard estimator for this quantity is the Good-Turing estimator. The existing analysis on k its error shows a high-probability bound of approximately O √m . We improve its dependency on k to O √ 4 √k m k + m . We also analyze the empirical frequencies estimator, showing that with high probability its error is bounded by approximately O 2 −5 which has an error of approximately O m 1 k + √ k m . We derive a combined estimator, , for any k. A standard measure for the quality of a learning algorithm is its expected per-word log-loss. The leave-one-out method can be used for estimating the log-loss of the unigram model. We show 1 that its error has a high-probability bound of approximately O √m , for any underlying distribution. We also bound the log-loss a priori, as a function of various parameters of the distribution. Keywords: Good-Turing estimators, logarithmic loss, leave-one-out estimation, Chernoff bounds 1. Introduction and Overview Natural language processing (NLP) has developed rapidly over the last decades. It has a wide range of applications, including speech recognition, optical character recognition, text categorization and many more. The theoretical analysis has also advanced significantly, though many fundamental questions remain unanswered. One clear challenge, both practical and theoretical, concerns deriving stochastic models for natural languages. Consider a simple language model, where the distribution of each word in the text is assumed to be independent. Even for such a simplistic model, fundamental questions relating sample size to the learning accuracy are already challenging. This is mainly due to the fact that the sample size is almost always insufficient, regardless of how large it is. To demonstrate this phenomena, consider the following example. We would like to estimate the distribution of first names in the university. For that, we are given the names list of a graduate seminar: Alice, Bob, Charlie, Dan, Eve, Frank, two Georges, and two Henries. How can we use this sample to estimate the distribution of students’ first names? An empirical frequency estimator would c 2005 Evgeny Drukh and Yishay Mansour. D RUKH AND M ANSOUR assign Alice the probability of 0.1, since there is one Alice in the list of 10 names, while George, appearing twice, would get estimation of 0.2. Unfortunately, unseen names, such as Michael, will get an estimation of 0. Clearly, in this simple example the empirical frequencies are unlikely to estimate well the desired distribution. In general, the empirical frequencies estimate well the probabilities of popular names, but are rather inaccurate for rare names. Is there a sample size, which assures us that all the names (or most of them) will appear enough times to allow accurate probabilities estimation? The distribution of first names can be conjectured to follow the Zipf’s law. In such distributions, there will be a significant fraction of rare items, as well as a considerable number of non-appearing items, in any sample of reasonable size. The same holds for the language unigram models, which try to estimate the distribution of single words. 
As it has been observed empirically on many occasions (Chen, 1996; Curran and Osborne, 2002), there are always many rare words and a considerable number of unseen words, regardless of the sample size. Given this observation, a fundamental issue is to estimate the distribution the best way possible. 1.1 Good-Turing Estimators An important quantity, given a sample, is the probability mass of unseen words (also called “the missing mass”). Several methods exist for smoothing the probability and assigning probability mass to unseen items. The almost standard method for estimating the missing probability mass is the Good-Turing estimator. It estimates the missing mass as the total number of unique items, divided by the sample size. In the names example above, the Good-Turing missing mass estimator is equal 0.6, meaning that the list of the class names does not reflect the true distribution, to put it mildly. The Good-Turing estimator can be extended for higher orders, that is, estimating the probability of all names appearing exactly k times. Such estimators can also be used for estimating the probability of individual words. The Good-Turing estimators dates back to World War II, and were published first in 1953 (Good, 1953, 2000). It has been extensively used in language modeling applications since then (Katz, 1987; Church and Gale, 1991; Chen, 1996; Chen and Goodman, 1998). However, their theoretical convergence rate in various models has been studied only in the recent years (McAllester and Schapire, 2000, 2001; Kutin, 2002; McAllester and Ortiz, 2003; Orlitsky et al., 2003). For estimation of the probability of all words appearing exactly k times in a sample of size m, McAllester and Schapire k (2000) derive a high probability bound on Good-Turing estimator error of approximately O √m . One of our main results improves the dependency on k of this bound to approximately O √ 4 √k + k m m √ k 1 k+ m , We also show that the empirical frequencies estimator has an error of approximately O for large values of k. Based on the two estimators, we derive a combined estimator with an error of √ 4 2 k approximately O m− 5 , for any k. We also derive a weak lower bound of Ω √m for an error of any estimator based on an independent sample. Our results give theoretical justification for using the Good-Turing estimator for small values of k, and the empirical frequencies estimator for large values of k. Though in most applications the Good-Turing estimator is used for very small values of k, for example k ≤ 5, as by Katz (1987) or Chen (1996), we show that it is fairly accurate in a much wider range. 1232 . C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS 1.2 Logarithmic Loss The Good-Turing estimators are used to approximate the probability mass of all the words with a certain frequency. For many applications, estimating this probability mass is not the main optimization criteria. Instead, a certain distance measure between the true and the estimated distributions needs to be minimized. The most popular distance measure used in NLP applications is the Kullback-Leibler (KL) divergence. For a true distribution P = {px }, and an estimated distribution Q = {qx }, both over some px set X, this measure is defined as ∑x px ln qx . An equivalent measure, up to the entropy of P, is the 1 logarithmic loss (log-loss), which equals ∑x px ln qx . Many NLP applications use the value of log-loss to evaluate the quality of the estimated distribution. 
However, the log-loss cannot be directly calculated, since it depends on the underlying distribution, which is unknown. Therefore, estimating log-loss using the sample is important, although the sample cannot be independently used for both estimating the distribution and testing it. The hold-out estimation splits the sample into two parts: training and testing. The training part is used for learning the distribution, whereas the testing sample is used for evaluating the average per-word log-loss. The main disadvantage of this method is the fact that it uses only part of the available information for learning, whereas in practice one would like to use all the sample. A widely used general estimation method is called leave-one-out. Basically, it performs averaging all the possible estimations, where a single item is chosen for testing, and the rest are used for training. This procedure has an advantage of using the entire sample, and in addition it is rather simple and usually can be easily implemented. The existing theoretical analysis of the leave-oneout method (Holden, 1996; Kearns and Ron, 1999) shows general high probability concentration bounds for the generalization error. However, these techniques are not applicable in our setting. 1 We show that the leave-one-out estimation error for the log-loss is approximately O √m , for any underlying distribution and a general family of learning algorithms. It gives a theoretical justification for effective use of leave-one-out estimation for the log-loss. We also analyze the concentration of the log-loss itself, not based of an empirical measure. We address the characteristics of the underlying distribution affecting the log-loss. We find such a characteristic, defining a tight bound for the log-loss value. 1.3 Model and Semantics We denote the set of all words as V , and N = |V |. Let P be a distribution over V , where pw is the probability of a word w ∈ V . Given a sample S of size m, drawn i.i.d. using P, we denote the number of appearances of a word w in S as cS , or simply cw , when a sample S is clear from the context.1 We w define Sk = {w ∈ V : cS = k}, and nk = |Sk |. w For a claim Φ regarding a sample S, we write ∀δ S Φ[S] for P(Φ[S]) ≥ 1 − δ. For some error c ˜ bound function f (·), which holds with probability 1 − δ, we write O( f (·)) for O f (·) ln m , δ where c > 0 is some constant. 1.4 Paper Organization Section 2 shows several standard concentration inequalities, together with their technical applications regarding the maximum-likelihood approximation. Section 3 shows the error bounds for the 1. Unless mentioned otherwise, all further sample-dependent definitions depend on the sample S. 1233 D RUKH AND M ANSOUR k-hitting mass estimation. Section 4 bounds the error for the leave-one-out estimation of the logarithmic loss. Section 5 shows the bounds for the a priori logarithmic loss. Appendix A includes the technical proofs. 2. Concentration Inequalities In this section we state several standard Chernoff-style concentration inequalities. We also show some of their corollaries regarding the maximum-likelihood approximation of pw by pw = cw . ˆ m Lemma 1 (Hoeffding, 1963) Let Y = Y1 , . . . ,Yn be a set of n independent random variables, such that Yi ∈ [bi , bi + di ]. Then, for any ε > 0, P ∑ Yi − E ∑ Yi i >ε i ≤ 2 exp − 2ε2 ∑i di2 . The next lemma is a variant of an extension of Hoeffding’s inequality, by McDiarmid (1989). Lemma 2 Let Y = Y1 , . . . 
,Yn be a set of n independent random variables, and f (Y ) such that any change of Yi value changes f (Y ) by at most di , that is sup ∀ j=i,Y j =Y j (| f (Y ) − f (Y )|) ≤ di . Let d = maxi di . Then, ∀δY : | f (Y ) − E[ f (Y )]| ≤ d n ln 2 δ . 2 Lemma 3 (Angluin and Valiant, 1979) Let Y = Y1 , . . . ,Yn be a set of n independent random variables, where Yi ∈ [0, B]. Let µ = E [∑i Yi ]. Then, for any ε > 0, P ∑ Yi < µ − ε ≤ exp − ε2 , 2µB ∑ Yi > µ + ε ≤ exp − ε2 . (2µ + ε)B i P i Definition 4 (Dubhashi and Ranjan, 1998) A set of random variables Y1 , . . . ,Yn is called “negatively associated”, if it satisfies for any two disjoint subsets I and J of {1, . . . , n}, and any two non-decreasing, or any two non-increasing, functions f from R|I| to R and g from R|J| to R: E[ f (Yi : i ∈ I)g(Y j : j ∈ J)] ≤ E[ f (Yi : i ∈ I)]E[g(Y j : j ∈ J)]. The next lemma is based on the negative association analysis. It follows directly from Theorem 14 and Proposition 7 of Dubhashi and Ranjan (1998). 1234 C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS Lemma 5 For any set of N non-decreasing, or N non-increasing functions { fw : w ∈ V }, any Chernoff-style bound on ∑w∈V fw (cw ), pretending that cw are independent, is valid. In particular, Lemmas 1 and 2 apply for {Y1 , ...,Yn } = { fw (cw ) : w ∈ V }. The next lemma shows an explicit upper bound on the binomial distribution probability.2 Lemma 6 Let X ∼ Bin(n, p) be a sum of n i.i.d. Bernoulli random variables with p ∈ (0, 1). Let 1 µ = E[X] = np. For x ∈ (0, n], there exists some function Tx = exp 12x + O x12 , such that ∀k ∈ Tn 1 {0, . . . , n}, we have P(X = k) ≤ √ Tµ Tn−µ . For integral values of µ, the equality is achieved 2πµ(1−p) at k = µ. (Note that for x ≥ 1, we have Tx = Θ(1).) The next lemma deals with the number of successes in independent trials. Lemma 7 (Hoeffding, 1956) Let Y1 , . . . ,Yn ∈ {0, 1} be a sequence of independent trials, with pi = 1 E[Yi ]. Let X = ∑i Yi be the number of successes, and p = n ∑i pi be the average trial success probability. For any integers b and c such that 0 ≤ b ≤ np ≤ c ≤ n, we have c ∑ k=b n k p (1 − p)n−k ≤ P (b ≤ X ≤ c) ≤ 1. k Using the above lemma, the next lemma shows a general concentration bound for a sum of arbitrary real-valued functions of a multinomial distribution components. We show that with a small penalty, any Chernoff-style bound pretending the components being independent is valid.3 We recall that cS , or equivalently cw , is the number of appearances of the word w in a sample S of w size m. Lemma 8 Let {cw ∼ Bin(m, pw ) : w ∈ V } be independent binomial random variables. Let { fw (x) : w ∈ V } be a set of real valued functions. Let F = ∑w fw (cw ) and F = ∑w fw (cw ). For any ε > 0, √ P (|F − E [F]| > ε) ≤ 3 m P F − E F >ε . The following lemmas provide concentration bounds for maximum-likelihood estimation of pw by pw = cw . The first lemma shows that words with “high” probability have a “high” count in the ˆ m sample. Lemma 9 Let δ > 0, and λ ≥ 3. We have ∀δ S: ∀w ∈ V, s.t. mpw ≥ 3 ln 2m , δ |mpw − cw | ≤ ∀w ∈ V, s.t. mpw > λ ln 2m , δ cw > 1− 3mpw ln 3 λ 2m ; δ mpw . 2. Its proof is based on Stirling approximation directly, though local limit theorems could be used. This form of bound is needed for the proof of Theorem 30. 3. The negative association analysis (Lemma 5) shows that a sum of monotone functions of multinomial distribution components must obey Chernoff-style bounds pretending that the components are independent. 
In some sense, our result extends this notion, since it does not require the functions to be monotone. 1235 D RUKH AND M ANSOUR The second lemma shows that words with “low” probability have a “low” count in the sample. Lemma 10 Let δ ∈ (0, 1), and m > 1. Then, ∀δ S: ∀w ∈ V such that mpw ≤ 3 ln m , we have cw ≤ δ 6 ln m . δ The following lemma derives the bound as a function of the count in the sample (and not as a function of the unknown probability). Lemma 11 Let δ > 0. Then, ∀δ S: cw > 18 ln 4m , δ ∀w ∈ V, s.t. |mpw − cw | ≤ 6cw ln 4m . δ The following is a general concentration bound. Lemma 12 For any δ > 0, and any word w ∈ V , we have ∀δ S, cw − pw < m 3 ln 2 δ . m The following lemma bounds the probability of words that do not appear in the sample. Lemma 13 Let δ > 0. Then, ∀δ S: ∀w ∈ S, / mpw < ln m . δ 3. K-Hitting Mass Estimation In this section our goal is to estimate the probability of the set of words appearing exactly k times in the sample, which we call “the k-hitting mass”. We analyze the Good-Turing estimator, the empirical frequencies estimator, and a combined estimator. ˆ Definition 14 We define the k-hitting mass Mk , its empirical frequencies estimator Mk , and its 4 Good-Turing estimator Gk as Mk = ∑ w∈Sk pw ˆ Mk = k nk m Gk = k+1 nk+1 . m−k The outline of this section is as follows. Definition 16 slightly redefines the k-hitting mass and its estimators. Lemma 17 shows that this redefinition has a negligible influence. Then, we analyze the estimation errors using the concentration inequalities from Section 2. Lemmas 20 and 21 bound the expectation of the Good-Turing estimator error, following McAllester and Schapire (2000). Lemma 23 bounds the deviation of the error, using the negative association analysis. A tighter bound, based on Lemma 8, is achieved at Theorem 25. Theorem 26 analyzes the error of the empirical frequencies estimator. Theorem 29 refers to the combined estimator. Finally, Theorem 30 shows a weak lower bound for the k-hitting mass estimation. 4. The Good-Turing estimator is usually defined as ( k+1 )nk+1 . The two definitions are almost identical for small values m k of k, as their quotient equals 1 − m . Following McAllester and Schapire (2000), our definition makes the calculations slightly simpler. 1236 C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS Definition 15 For any w ∈ V and i ∈ {0, · · · , m}, we define Xw,i as a random variable equal 1 if cw = i, and 0 otherwise. The following definition concentrates on words whose frequencies are close to their probabilities. Definition 16 Let α > 0 and k > 3α2 . We define Ik,α = pw ∈ Ik,α }. We define: Mk,α = ∑ pw = w∈Sk ∩Vk,α ∑ √ √ k−α k k+1+α k+1 m , m , and Vk,α = {w ∈ V : pw Xw,k , w∈Vk,α Gk,α = k+1 k+1 |Sk+1 ∩Vk,α | = ∑ Xw,k+1 , m−k m − k w∈Vk,α ˆ Mk,α = k k |Sk ∩Vk,α | = ∑ Xw,k . m m w∈Vk,α By Lemma 11, for large values of k the redefinition coincides with the original definition with high probability: Lemma 17 For δ > 0, let α = ˆ ˆ and Mk = Mk,α . 6 ln 4m . For k > 18 ln 4m , we have ∀δ S: Mk = Mk,α , Gk = Gk,α , δ δ Proof By Lemma 11, we have ∀δ S, ∀w : cw > 18 ln 4m , δ |mpw − cw | ≤ 6cw ln √ 4m = α cw . δ This means that any word w with cw = k has √ √ √ k−α k k+α k k+1+α k+1 ≤ pw ≤ < . m m m √ ˆ Therefore w ∈ Vk,α , completing the proof for Mk and Mk . Since α < k, any word w with cw = k + 1 has √ √ √ k−α k k+1−α k+1 k+1+α k+1 < ≤ pw ≤ , m m m which yields w ∈ Vk,α , completing the proof for Gk . Since the minimal probability of a word in Vk,α is Ω Lemma 18 Let α > 0 and k > 3α2 . 
Then, |Vk,α | = O 1237 m k k m . , we derive: D RUKH AND M ANSOUR Proof We have α < √ √k . 3 Any word w ∈ Vk,α has pw ≥ |Vk,α | < √ k−α k m > k m 1 1 − √3 . Therefore, m m 1 , =O 1 k 1− √ k 3 which completes the proof. Using Lemma 6, we derive: Lemma 19 Let α > 0 and 3α2 < k ≤ m . Let w ∈ Vk,α . Then, E[Xw,k ] = P(cw = k) = O 2 1 √ k . Proof Since cw ∼ Bin(m, pw ) is a binomial random variable, we use Lemma 6: E[Xw,k ] = P(cw = k) ≤ Tm 1 . 2πmpw (1 − pw ) Tmpw Tm(1−pw ) For w ∈ Vk,α , we have mpw = Ω(k), which implies 3α2 < k ≤ m 2, Tm Tmpw Tm(1−pw ) we have 1 2πmpw (1 − pw ) ≤ = O(1). Since pw ∈ Ik,α and 1 √ 2π k − α k √ k+1+α k+1 m 1− 1 < 1 2πk 1 − √3 1 1 − k+1 1 + √3 m 1 < 1 2πk 1 − √3 1− 1 2 1 +m 1 1 + √3 1 = O √ , k which completes the proof. 3.1 Good-Turing Estimator The following lemma, directly based on the definition of the binomial distribution, was shown in Theorem 1 of McAllester and Schapire (2000). Lemma 20 For any k < m, and w ∈ V , we have pw P(cw = k) = k+1 P(cw = k + 1)(1 − pw ). m−k The following lemma bounds the expectations of the redefined k-hitting mass, its Good-Turing estimator, and their difference. 1238 C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS Lemma 21 Let α > 0 and 3α2 < k < |E[Gk,α ] − E[Mk,α ]| = O √ k m m 2. We have E[Mk,α ] = O 1 √ k , E[Gk,α ] = O 1 √ k , and . Lemma 22 Let δ > 0, k ∈ {1, . . . , m}. Let U ⊆ V , such that |U| = O m . Let {bw : w ∈ U}, such k k that ∀w ∈ U, bw ≥ 0 and maxw∈U bw = O m . Let Xk = ∑w∈U bw Xw,k . We have ∀δ S: |Xk − E[Xk ]| = O k ln 1 δ . m Proof We define Yw,k = ∑i≤k Xw,i be random variable indicating cw ≤ k and Zw,k = ∑i < k. Let Yk = ∑w∈U bwYw,k and Zk = ∑w∈U bw Zw,k . We have ∑ bw Xw,k = ∑ bw [Yw,k − Zw,k ] = Yk − Zk . Xk = w∈U w∈U Both Yk and Zk , can be bounded using the Hoeffding inequality. Since {bwYw,k } and {bw Zw,k } are monotone with respect to {cw }, Lemma 5 applies for them. This means that the concentration of their sum is at least as tight as if they were independent. Recalling that |U| = O m and k k maxw∈U bw = O m , and using Lemma 2 for Yk and Zk , we have δ ∀ 2 S, |Yk − E[Yk ]| = O δ ∀ 2 S, |Zk − E[Zk ]| = O k m m k ln 1 , δ k m m k ln 1 . δ Therefore, |Xk − E[Xk ]| = |Yk − Zk − E[Yk − Zk ]| which completes the proof. ≤ |Yk − E[Yk ]| + |Zk − E[Zk ]| = O k ln 1 δ , m Using the negative association notion, we can show a preliminary bound for Good-Turing estimation error: Lemma 23 For δ > 0 and 18 ln 8m < k < m , we have ∀δ S: 2 δ k ln 1 δ |Gk − Mk | = O . m 1239 D RUKH AND M ANSOUR Proof Let α = 6 ln 8m . By Lemma 17, we have δ δ ∀ 2 S, Gk = Gk,α ∧ Mk = Mk,α . (1) By Lemma 21, |E[Gk − Mk ]| = |E[Gk,α − Mk,α ]| = O √ k m . (2) k+1 By Definition 16, Mk,α = ∑w∈Vk,α pw Xw,k and Gk,α = ∑w∈Vk,α m−k Xw,k+1 . By Lemma 18, we have |Vk,α | = O m . Therefore, using Lemma 22 with k for Mk,α , and with k + 1 for Gk,α , we have k δ ∀ 4 S, |Mk,α − E[Mk,α ]| = O δ ∀ 4 S, |Gk,α − E[Gk,α ]| = O k ln 1 δ m , (3) k ln 1 δ m . (4) Combining Equations (1), (2), (3), and (4), we have ∀δ S: |Gk − Mk | = |Gk,α − Mk,α | ≤ |Gk,α − E[Gk,α ]| + |Mk,α − E[Mk,α ]| + |E[Gk,α ] − E[Mk,α ]| √ k ln 1 k ln 1 k δ δ = O +O = O , m m m which completes the proof. Lemma 24 Let δ > 0, k > 0. Let U ⊆ V . Let {bw : w ∈ U} be a set of weights, such that bw ∈ [0, B]. Let Xk = ∑w∈U bw Xw,k , and µ = E[Xk ]. We have δ ∀ S, |Xk − µ| ≤ max √ √ 6 m 6 m 4Bµ ln , 2B ln δ δ . 
Proof By Lemma 8, combined with Lemma 3, we have ε2 B(2µ + ε) √ √ ε2 ε ≤ max 6 m exp − , 6 m exp − 4Bµ 2B √ P(|Xk − µ| > ε) ≤ 6 m exp − 1240 , (5) C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS where Equation (5) follows by considering ε ≤ 2µ and ε > 2µ separately. The lemma follows sub- stituting ε = max 4Bµ ln √ 6 m δ , 2B ln √ 6 m δ . We now derive the concentration bound on the error of the Good-Turing estimator. Theorem 25 For δ > 0 and 18 ln 8m < k < m , we have ∀δ S: 2 δ √ |Gk − Mk | = O Proof Let α = k ln m m δ + k ln m m δ . δ 6 ln 8m . Using Lemma 17, we have ∀ 2 S: Gk = Gk,α , and Mk = Mk,α . Recall that δ k+1 Mk,α = ∑w∈Vk,α pw Xw,k and Gk,α = ∑w∈Vk,α m−k Xw,k+1 . Both Mk,α and Gk,α are linear combinations k of Xw,k and Xw,k+1 , respectively, where the coefficients’ magnitude is O m , and the expectation, by Lemma 21, is O 1 √ k . By Lemma 24, we have δ √ k ln m δ m + k ln m δ m , (6) δ 4 √ k ln m δ m + k ln m δ m . (7) ∀ 4 S, |Mk,α − E[Mk,α ]| = O ∀ S, |Gk,α − E[Gk,α ]| = O Combining Equations (6), (7), and Lemma 21, we have ∀δ S: |Gk − Mk | = |Gk,α − Mk,α | ≤ |Gk,α − E[Gk,α ]| + |Mk,α − E[Mk,α ]| + |E[Gk,α ] − E[Mk,α ]| √ √ √ k ln m k ln m k ln m k ln m k δ δ δ δ + + + = O = O , m m m m m which completes the proof. 3.2 Empirical Frequencies Estimator ˆ In this section we bound the error of the empirical frequencies estimator Mk . Theorem 26 For δ > 0 and 18 ln 8m < k < m , we have 2 δ ∀δ S, √ k ln m δ ˆ |Mk − Mk | = O m 1241 3 2 + ln m δ . k D RUKH AND M ANSOUR Proof Let α = δ − ˆ ˆ 6 ln 8m . By Lemma 17, we have ∀ 2 S: Mk = Mk,α , and Mk = Mk,α . Let Vk,α = δ + k k {w ∈ Vk,α : pw < m }, and Vk,α = {w ∈ Vk,α : pw > m }. Let X− = k − pw Xw,k , m ∑ − w∈Vk,α X+ = ∑ + w∈Vk,α pw − k Xw,k , m and let X? specify either X− or X+ . By the definition, for w ∈ Vk,α we have By Lemma 18, |Vk,α | = O m k |E[X? ]| ≤ k m − pw = O . By Lemma 19, for w ∈ Vk,α we have E[Xw,k ] = O ∑ w∈Vk,α k − pw E[Xw,k ] = O m √ mα k 1 √ k m k =O 1 √ k expectation is O α k . . Therefore, α . k Both X− and X+ are linear combinations of Xw,k , where the coefficients are O √ α k m (8) √ α k m and the . Therefore, by Lemma 24, we have δ 4 ∀ S: |X? − E[X? ]| = O √ α3 k α4 √ + m m k . (9) ˆ By the definition of X− and X+ , Mk,α − Mk,α = X+ − X− . Combining Equations (8) and (9), we δ S: have ∀ ˆ ˆ |Mk − Mk | = |Mk,α − Mk,α | = |X+ − X− | ≤ |X+ − E[X+ ]| + |E[X+ ]| + |X− − E[X− ]| + |E[X− ]| √ 3 √ 3 k 4 k ln m 2 α α α δ √ + = O = O + + m k m m k since √ ab = O(a + b), and we use a = √ α3 k m and b = α . k ln m δ , k 3.3 Combined Estimator In this section we combine the Good-Turing estimator with the empirical frequencies to derive a combined estimator, which is uniformly accurate for all values of k. ˜ Definition 27 We define Mk , a combined estimator for Mk , by ˜ Mk = Gk ˆ Mk 1242 2 k ≤ m5 2 k > m5 . C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS Lemma 28 (McAllester and Schapire, 2000) Let k ∈ {0, . . . , m}. For any δ > 0, we have ln 1 m δ . k + ln m δ ∀δ S : |Gk − Mk | = O 2 ˜ ˜ The following theorem shows that Mk has an error bounded by O m− 5 , for any k. For small k, 2 2 we use Lemma 28. Theorem 25 is used for 18 ln 8m < k ≤ m 5 . Theorem 26 is used for m 5 < k < m . 2 δ ˜ The complete proof also handles k ≥ m . The theorem refers to Mk as a probability estimator, and 2 does not show that it is a probability distribution by itself. Theorem 29 Let δ > 0. For any k ∈ {0, . . . , m}, we have ˜ ˜ ∀δ S, |Mk − Mk | = O m− 5 . 2 The following theorem shows a weak lower bound for approximating Mk . 
It applies to estimating Mk based on a different independent sample. This is a very “weak” notation, since Gk , as well ˆ as Mk , are based on the same sample as Mk . Theorem 30 Suppose that the vocabulary consists of k m ), where 1 k √ m. The variance of Mk is Θ k m m k words distributed uniformly (that is pw = . 4. Leave-One-Out Estimation of Log-Loss Many NLP applications use log-loss as the learning performance criteria. Since the log-loss depends on the underlying probability P, its value cannot be explicitly calculated, and must be approximated. The main result of this section, Theorem 32, is an upper bound on the leave-one-out estimation of the log-loss, assuming a general family of learning algorithms. Given a sample S = {s1 , . . . , sm }, the goal of a learning algorithm is to approximate the true probability P by some probability Q. We denote the probability assigned by the learning algorithm to a word w by qw . Definition 31 We assume that any two words with equal sample frequency are assigned equal probabilities in Q, and therefore denote qw by q(cw ). Let the log-loss of a distribution Q be L = 1 1 ∑ pw ln qw = ∑ Mk ln q(k) . w∈V k≥0 Let the leave-one-out estimation, qw , be the probability assigned to w, when one of its instances is removed. We assume that any two words with equal sample frequency are assigned equal leaveone-out probability estimation, and therefore denote qw by q (cw ). We define the leave-one-out estimation of the log-loss as averaging the loss of each sample word, when it is extracted from the sample and pretended to be the test sample: 1243 D RUKH AND M ANSOUR ∑ Lleave−one = w∈V cw knk 1 1 =∑ ln ln . m qw k>0 m q (k) 1 1 Let Lw = L(cw ) = ln q(cw ) , and Lw = L (cw ) = ln q (cw ) . Let the maximal loss be Lmax = max max L(k), L (k + 1) . k In this section we discuss a family of learning algorithms, that receive the sample as an input. Assuming an accuracy parameter δ, we require the following properties to hold: 1. Starting from a certain number of appearances, the estimation is close to the sample frequency. Specifically, for some α, β ∈ [0, 1], ∀k ≥ ln 4m , δ q(k) = k−α . m−β (10) 2. The algorithm is stable when a single word is extracted from the sample: ∀m, 1 , m 1 L (k + 1) − L(k) = O S . n1 2 ≤ k ≤ 10 ln 4m , δ L (k + 1) − L(k) = O ∀m, ∀S s.t. nS > 0, k ∈ {0, 1}, 1 (11) (12) An example of such an algorithm is the following leave-one-out algorithm (we assume that the vocabulary is large enough so that n0 + n1 > 0): qw = N−n0 −1 (n0 +n1 )(m−1) cw −1 m−1 cw ≤ 1 cw ≥ 2. Equation (10) is satisfied by α = β = 1. Equation (11) is satisfied for k ≥ 2 by L(k) − L (k + 1) = 1 = O m . Equation (12) is satisfied for k ≤ 1: ln m−1 m−2 N − n0 − 1 m − 2 N − n0 − 2 m − 1 n0 + n1 + 1 m − 2 |L (2) − L(1)| = ln n0 + n1 m − 1 |L (1) − L(0)| = ln 1 1 1 =O + , N − n0 m n1 1 1 1 =O + . =O n0 + n1 m n1 =O The following is the main theorem of this section. It bounds the deviation between the between the true loss and the leave one out estimate. This bound shows that for a general family of learning algorithms, leave-one-out technique can be effectively used to estimate the logarithmic loss, given the sample only. The estimation error bound decreases roughly in proportion to the square root of the sample size, regardless of the underlying distribution. 1244 C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS Theorem 32 For a learning algorithm satisfying Equations (10), (11), and (12), and δ > 0, we have: δ ∀ S, |L − Lleave−one | = O Lmax (ln m )4 ln δ m ln m δ δ . 
The proof of Theorem 32 bounds the estimation error separately for the high-probability and low-probability words. We use Lemma 20 (McAllester and Schapire, 2000) to bound the estimation error for low-probability words. The expected estimation error for the high-probability words is bounded elementarily using the definition of the binomial distribution (Lemma 33). Finally, we use McDiarmid’s inequality (Lemma 2) to bound its deviation. The next lemma shows that the expectation of the leave-one-out method is a good approximation for the per-word expectation of the logarithmic loss. Lemma 33 Let 0 ≤ α ≤ 1, and y ≥ 1. Let Bn ∼ Bin(n, p) be a binomial random variable. Let fy (x) = ln(max(x, y)). Then, 0 ≤ E p fy (Bn − α) − 3p Bn fy (Bn − α − 1) ≤ . n n Proof For a real valued function F (here F(x) = fy (x − α)), we have: E Bn F(Bn − 1) n n = ∑ x=0 n = p∑ x=1 n x x p (1 − p)n−x F(x − 1) x n n − 1 x−1 p (1 − p)(n−1)−(x−1) F(x − 1) x−1 = pE[F(Bn−1 )] , where we used n x x n = E p fy (Bn − α) − n−1 x−1 . Since Bn ∼ Bn−1 + B1 , we have: Bn fy (Bn − α − 1) n = p(E[ fy (Bn−1 + B1 − α)] − E[ fy (Bn−1 − α)]) max(Bn−1 + B1 − α, y) max(Bn−1 − α, y) max(Bn−1 − α + B1 , y + B1 ) ≤ pE ln max(Bn−1 − α, y) B1 = pE ln(1 + ) max(Bn−1 − α, y) B1 ≤ pE . max(Bn−1 − α, y) = pE ln Since B1 and Bn−1 are independent, we get 1245 D RUKH AND M ANSOUR pE B1 1 = pE[B1 ]E max(Bn−1 − α, y) max(Bn−1 − α, y) 1 = p2 E max(Bn−1 − α, y) n−1 = p2 ∑ n−1 x 1 p (1 − p)n−1−x x max(x − α, y) = p2 ∑ x+1 1 n−1 x p (1 − p)n−1−x x + 1 max(x − α, y) x x=0 n−1 x=0 ≤ ≤ n−1 p x+1 n max px+1 (1 − p)n−(x+1) ∑ n x max(x − α, y) x=0 x + 1 3p 3p (1 − (1 − p)n ) < . (13) n n Equation (13) follows by the following observation: x + 1 ≤ 3(x − α) for x ≥ 2, and x + 1 ≤ 2y for x ≤ 1. Finally, pE ln max(Bn−1 −α+B1 ,y) ≥ 0, which implies the lower bound of the lemma. max(Bn−1 −α,y) The following lemma bounds n2 as a function of n1 . Lemma 34 Let δ > 0. We have ∀δ S: n2 = O 3 5 Theorem 32 Proof Let yw = 1 − m ln 1 + n1 ln m . δ δ δ pw m − 2. By Lemma 9, with λ = 5, we have ∀ 2 S: 3 ln 4m 3pw ln 4m δ δ pw − cw ≤ , m m m √ 5 ln 4m δ ∀w ∈ V : pw > , cw > yw + 2 ≥ (5 − 15) ln 4m > ln 4m . δ δ m ∀w ∈ V : pw > Let VH = w ∈ V : pw > 5 ln 4m δ m |L − Lleave−one | ≤ (14) (15) and VL = V \VH . We have ∑ w∈VH pw Lw − cw L m w + ∑ w∈VL pw Lw − cw L m w . (16) We start by bounding the first term of Equation (16). By Equation (15), we have ∀w ∈ VH , cw > m−β w −α yw + 2 > ln 4m . Equation (10) implies that qw = cm−β , therefore Lw = ln cm−β = ln max(cw −α,yw ) , and δ w −α m−1−β Lw = ln cm−1−β = ln max(cw −1−α,yw ) . Let w −1−α H Errw = cw m−β m−β ln − pw ln . m max(cw − 1 − α, yw ) max(cw − α, yw ) We have 1246 C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS ∑ w∈VH cw L − pw Lw m w m−1−β cw ∑ m − β w∈VH m ∑ H Errw + ln ∑ = H Errw + O w∈VH ≤ w∈VH 1 . m (17) H We bound ∑w∈VH Errw using McDiarmid’s inequality. As in Lemma 33, let fw (x) = ln(max(x, yw )). We have H E Errw = ln(m − β)E cw cw − pw + E pw fw (cw − α) − fw (cw − 1 − α) . m m The first expectation equals 0, the second can be bounded using Lemma 33: w∈VH H E Errw ≤ ∑ w∈VH E pw fw (cw − α) − ≤ ∑ ∑ cw fw (cw − 1 − α) m 3pw 1 =O . m m w∈VH (18) H In order to use McDiarmid’s inequality, we bound the change of ∑w∈VH Errw as a function of a single change in the sample. Suppose that a word u is replaced by a word v. This results in decrease H for cu , and increase for cv . 
Recalling that yw = Ω(mpw ), the change of Erru , as well as the change H , is bounded by O ln m , as follows: of Errv m m−β The change of pu ln max(cu −α,yu ) would be 0 if cu − α ≤ yu . Otherwise, pu ln m−β m−β − pu ln max(cu − 1 − α, yu ) max(cu − α, yu ) ≤ pu [ln(cu − α) − ln(cu − 1 − α)] = pu ln 1 + 1 cu − 1 − α =O pu cu . m−β pu 1 Since cu ≥ yu = Ω(mpu ), the change is bounded by O( cu ) = O( m ). The change of cu ln max(cu −1−α,yu ) m would be O( ln m ) if cu − 1 − α ≤ yu . Otherwise, m m−β cu m−β cu − 1 ln − ln m max(cu − 2 − α, yu ) m max(cu − 1 − α, yu ) m−β 1 m−β m−β cu − 1 ln + ln − ln ≤ m max(cu − 2 − α, yu ) max(cu − 1 − α, yu ) m max(cu − 1 − α, yu ) ln m cu − 1 1 ln m ln 1 + + =O . ≤ m cu − 2 − α m m H The change of Errv is bounded in a similar way. 1247 D RUKH AND M ANSOUR δ By Equations (17) and (18), and Lemma 2, we have ∀ 16 S: ∑ w∈VH ≤ cw L − pw Lw m w ∑ w∈VH ≤ O ∑ H Errw − E ln m m H Errw ∑ + E H Errw +O w∈VH w∈VH 1 1 1 m ln + + δ m m = O 1 m (ln m)2 ln 1 δ . m (19) δ Next, we bound the second term of Equation (16). By Lemma 10, we have ∀ 4 S: ∀w ∈ V s.t. pw ≤ 3 ln 4m δ , cw ≤ 6 ln 4m . δ m (20) b Let b = 5 ln 4m . By Equations (14) and (20), for any w such that pw ≤ m , we have δ cw ≤ max pw + m 3pw ln m 4m δ √ (5 + 3 ∗ 5) ln 4m 2b δ ≤ < . m m 4m δ 6 ln , m k L Therefore ∀w ∈ VL , we have cw < 2b. Let nL = |VL ∩Sk |, GL = m−k+1 nL , and Mk = ∑w∈VL ∩Sk pw . k k k−1 We have ∑ w∈VL cw L − pw Lw m w 2b = 2b−1 knL L k L (k) − ∑ Mk L(k) ∑ m k=1 k=0 ≤ ∑ m − kk+ 1 L (k) − ∑ MkL L(k) 2b k=1 2b = k=0 2b−1 ∑ GL L (k) − ∑ MkL L(k) k−1 k=1 2b−1 = 2b−1 knL +O k=0 2b−1 k=0 k=0 1 1 − m−k+1 m bLmax m +O k=0 2b−1 ≤ k=1 2b−1 ∑ GL L (k + 1) − ∑ MkL L(k) k k=0 2b + ∑ knL L (k) k bLmax m ∑ GL |L (k + 1) − L(k)| + ∑ |GL − MkL |L(k) + O k k bLmax m . The first sum of Equation (21) is bounded using Equations (11) and (12), and Lemma 34: 1248 (21) C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS 2b−1 ∑ GL |L (k + 1) − L(k)| k k=0 2b−1 = ∑ GL |L (k + 1) − L(k)| + G0|L (1) − L(0)| + G1|L (2) − L(1)|. k (22) k=2 The first term of Equation (22) is bounded by Equation (11): 2b−1 ∑ k=2 2b−1 ∑ GL · O k GL |L (k + 1) − L(k)| ≤ k k=2 1 m =O 1 . m δ The other two terms are bounded using Lemma 34. For n1 > 0, we have ∀ 16 S, n2 = O b By Equation (12), we have G0 |L (1) − L(0)| + G1 |L (2) − L(1)| ≤ n1 1 ·O m n1 2n2 1 + ·O m−1 n1 ln 1 δ . m = O b (23) m ln 1 + n1 δ (24) For n1 = 0, Lemma 34 results in n2 = O b m ln 1 , and Equation (24) transforms into δ G1 |L (2) − L(1)| ≤ 2n2 Lmax = O bLmax m−1 Equations (22), (23), (24), and (25) sum up to 2b−1 ∑ GL |L (k + 1) − L(k)| k k=0 = O bLmax ln 1 δ . m ln 1 δ . m (25) (26) The second sum of Equation (21) is bounded using Lemma 28 separately for every k < 2b with δ L accuracy 16b . Since the proof of Lemma 28 also holds for GL and Mk (instead of Gk and Mk ), we k δ L have ∀ 8 S, for every k < 2b, |GL − Mk | = O b k ln b δ m . Therefore, together with Equations (21) and (26), we have ∑ w∈VL cw L − pw Lw m w ≤ O bLmax = O Lmax 2b−1 ln 1 δ + ∑ L(k)O b m k=0 b4 ln b δ . m 1249 ln b bLmax δ +O m m (27) . D RUKH AND M ANSOUR The proof follows by combining Equations (16), (19), and (27). 5. Log-Loss A Priori Section 4 bounds the error of the leave-one-out estimation of the log-loss. It shows that the log-loss can be effectively estimated, for a general family of learning algorithms. Another question to be considered is the log-loss distribution itself, without the empirical estimation. 
5. Log-Loss A Priori

Section 4 bounds the error of the leave-one-out estimation of the log-loss. It shows that the log-loss can be effectively estimated for a general family of learning algorithms. Another question to consider is the log-loss itself, without the empirical estimation: how large (or small) should we expect it to be, and which parameters of the distribution affect it? We identify the learning error (equivalent to the log-loss) with the KL-divergence between the true and the estimated distribution. We refer to a general family of learning algorithms, and show lower and upper bounds for the learning error.

The upper bound (Theorem 39) can be divided into three parts. The first part is the missing mass. The other two form a trade-off between a threshold (a lower threshold leads to a lower bound) and the number of words with probability exceeding this threshold (fewer words lead to a lower bound). This number of words appears to be a necessary lower bound, as we show in Theorem 35.

Theorem 35 Let the distribution be uniform, $\forall w\in V: p_w = \frac1N$, with $N \ll m$. Also, suppose that the learning algorithm uses the maximum-likelihood approximation, meaning $q_w = \frac{c_w}{m}$. Then a typical learning error is $\Omega\big(\frac Nm\big)$.

The proof of Theorem 35 is based on the Pinsker inequality (Lemma 36). It first shows a lower bound for the $L_1$ norm between the true and the estimated distributions, and then transforms it into a lower bound on the learning error.

Lemma 36 (Pinsker Inequality) Given any two distributions $P$ and $Q$, we have
$$KL(P\|Q) \ge \frac12\big(L_1(P,Q)\big)^2.$$

Theorem 35 Proof We first show that $L_1(P,Q)$ concentrates near $\Omega\big(\sqrt{N/m}\big)$; then we use the Pinsker inequality to derive a lower bound on $KL(P\|Q)$. (This proof does not optimize the constants; the asymptotic analysis of the logarithmic transform of binomial variables by Flajolet (1999) can be used to obtain explicit values for $KL(P\|Q)$.)

First we find a lower bound for $E[|p_w-q_w|]$. Since $c_w$ is a binomial random variable, $\sigma^2[c_w] = mp_w(1-p_w) = \Omega\big(\frac mN\big)$, and with some constant probability $|c_w-mp_w| > \sigma[c_w]$. Therefore,
$$E[|q_w-p_w|] = \frac1m E[|c_w-mp_w|] \ge \frac1m\,\sigma[c_w]\,P\big(|c_w-mp_w|>\sigma[c_w]\big) = \Omega\left(\frac1{\sqrt{mN}}\right),$$
$$E\Big[\sum_{w\in V}|p_w-q_w|\Big] = \Omega\left(N\cdot\frac1{\sqrt{mN}}\right) = \Omega\left(\sqrt{\frac Nm}\right).$$
A single change in the sample changes $L_1(P,Q)$ by at most $\frac2m$. Using McDiarmid's inequality (Lemma 2) on $L_1(P,Q)$ as a function of the sample words, we have $\forall_{1/2}\,S$:
$$L_1(P,Q) \ge E[L_1(P,Q)] - \big|L_1(P,Q)-E[L_1(P,Q)]\big| = \Omega\left(\sqrt{\frac Nm}\right) - O\left(\sqrt{\frac1m}\right) = \Omega\left(\sqrt{\frac Nm}\right).$$
Using the Pinsker inequality (Lemma 36), we have $\forall_{1/2}\,S$:
$$\sum_{w\in V}p_w\ln\frac{p_w}{q_w} \ge \frac12\Big(\sum_{w\in V}|p_w-q_w|\Big)^2 = \Omega\left(\frac Nm\right),$$
which completes the proof.

Definition 37 Let $\alpha\in(0,1)$ and $\tau\ge1$. We define an (absolute discounting) algorithm $A_{\alpha,\tau}$, which "removes" $\frac\alpha m$ of probability mass from every word appearing at most $\tau$ times, and spreads the removed mass uniformly among the unseen words. We denote by $n_{1\ldots\tau} = \sum_{i=1}^\tau n_i$ the number of words with count between 1 and $\tau$. The learned probability $Q$ is defined by
$$q_w = \begin{cases}\dfrac{\alpha\,n_{1\ldots\tau}}{m\,n_0} & c_w = 0,\\[6pt] \dfrac{c_w-\alpha}{m} & 1\le c_w\le\tau,\\[6pt] \dfrac{c_w}{m} & \tau < c_w.\end{cases}$$
The parameter $\alpha$ can be set to some constant, or chosen so that the reserved mass matches the Good-Turing missing mass estimator, that is $\frac{\alpha n_{1\ldots\tau}}{m} = \frac{n_1}{m}$.
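The following sketch implements the absolute discounting rule of Definition 37 just described. It is a minimal, hypothetical implementation (assuming numpy and a Zipf-like toy distribution); the helper name and the choices $\alpha = 0.5$, $\tau = 2$ are ours, not the paper's.

```python
import numpy as np
from collections import Counter

def absolute_discounting(sample, vocab_size, alpha=0.5, tau=2):
    """Sketch of A_{alpha,tau} (Definition 37): remove alpha/m of probability mass
    from every word appearing at most tau times, and spread the removed mass
    uniformly over the n_0 unseen words."""
    m = len(sample)
    counts = Counter(sample)
    n0 = vocab_size - len(counts)                          # unseen word types
    n_1_tau = sum(1 for c in counts.values() if 1 <= c <= tau)

    q = np.empty(vocab_size)
    q_unseen = alpha * n_1_tau / (m * n0) if n0 > 0 else 0.0
    for w in range(vocab_size):
        c = counts.get(w, 0)
        if c == 0:
            q[w] = q_unseen
        elif c <= tau:
            q[w] = (c - alpha) / m
        else:
            q[w] = c / m
    return q

# Toy usage on a Zipf-like distribution.  Alternatively, alpha could be set to
# n_1 / n_{1..tau}, so that the reserved mass equals the Good-Turing estimate n_1/m.
rng = np.random.default_rng(0)
N, m = 5000, 2000
p = 1.0 / np.arange(1, N + 1); p /= p.sum()
sample = rng.choice(N, size=m, p=p).tolist()

q = absolute_discounting(sample, N)
print("total estimated mass:", q.sum())          # sums to 1 (up to rounding) when n_0 > 0
print("learning error KL(P||Q):", float(np.sum(p * np.log(p / q))))
```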
Definition 38 Given a distribution $P$ and $x\in[0,1]$, let $F_x = \sum_{w\in V:\,p_w\le x}p_w$ and $N_x = |\{w\in V: p_w > x\}|$.

Clearly, for any distribution $P$, $F_x$ is a monotone function of $x$, varying from 0 to 1, and $N_x$ is a monotone function of $x$, varying from $N$ to 0. Note that $N_x$ is bounded by $\frac1x$. The next theorem shows an upper bound for the learning error.

Theorem 39 For any $\delta>0$ and $\lambda>3$ such that $\tau < (\lambda-\sqrt{3\lambda})\ln\frac{8m}\delta$, the learning error of $A_{\alpha,\tau}$ is bounded $\forall_\delta S$ by
$$0 \le \sum_{w\in V}p_w\ln\frac{p_w}{q_w} \le M_0\ln\frac{n_0\ln\frac{4m}\delta}{\alpha n_{1\ldots\tau}} + \frac{\lambda\ln\frac{8m}\delta}{1-\alpha}\left(\sqrt{\frac{3\ln\frac8\delta}{m}} + M_0\right) + \frac{\alpha}{1-\alpha}F_{\frac{\lambda\ln\frac{8m}\delta}{m}} + \sqrt{\frac{3\ln\frac8\delta}{m}} + \frac{3\lambda\ln\frac{8m}\delta}{2(\sqrt\lambda-\sqrt3)^2m}\,N_{\frac{\lambda\ln\frac{8m}\delta}{m}}.$$

The proof of Theorem 39 is based directly on Lemmas 40, 41, and 43. We can rewrite the bound roughly as
$$\sum_{w\in V}p_w\ln\frac{p_w}{q_w} \le \tilde O\left(M_0 + \frac{\lambda}{\sqrt m} + \frac{\lambda}{m}N_{\frac\lambda m}\right).$$
This form makes explicit the characteristics of the distribution that influence the log-loss. A "good" distribution may contain many low-probability words, provided that the missing mass is low. The learning error grows, however, if the dictionary includes many mid-range-probability words. For example, if a typical word's probability were $m^{-3/4}$, the bound would become $\tilde O\big(M_0 + m^{-1/4}\big)$.

Lemma 40 For any $\delta>0$, the learning error for the non-appearing words can be bounded with high probability by
$$\forall_\delta S,\qquad \sum_{w\notin S}p_w\ln\frac{p_w}{q_w} \le M_0\ln\frac{n_0\ln\frac m\delta}{\alpha n_{1\ldots\tau}}.$$
Proof By Lemma 13, we have $\forall_\delta S$ that the true probability of any non-appearing word does not exceed $\frac{\ln\frac m\delta}{m}$. Therefore,
$$\sum_{w\notin S}p_w\ln\frac{p_w}{q_w} = \sum_{w\notin S}p_w\ln\left(p_w\,\frac{m\,n_0}{\alpha n_{1\ldots\tau}}\right) \le \sum_{w\notin S}p_w\ln\left(\frac{\ln\frac m\delta}{m}\cdot\frac{m\,n_0}{\alpha n_{1\ldots\tau}}\right) = M_0\ln\frac{n_0\ln\frac m\delta}{\alpha n_{1\ldots\tau}},$$
which completes the proof.

Lemma 41 Let $\delta>0$ and $\lambda>0$. Let $V_L = \{w\in V: p_w \le \frac{\lambda\ln\frac{2m}\delta}{m}\}$ and $\hat V_L = V_L\cap S$. The learning error for $\hat V_L$ can be bounded with high probability by
$$\forall_\delta S,\qquad \sum_{w\in\hat V_L}p_w\ln\frac{p_w}{q_w} \le \frac{\lambda\ln\frac{2m}\delta}{1-\alpha}\left(\sqrt{\frac{3\ln\frac2\delta}{m}} + M_0\right) + \frac{\alpha}{1-\alpha}F_{\frac{\lambda\ln\frac{2m}\delta}{m}}.$$
Proof We use $\ln(1+x)\le x$. For any appearing word $w$, $q_w \ge \frac{1-\alpha}{m}$. Therefore,
$$\sum_{w\in\hat V_L}p_w\ln\frac{p_w}{q_w} \le \sum_{w\in\hat V_L}p_w\,\frac{p_w-q_w}{q_w} \le \frac{m}{1-\alpha}\sum_{w\in\hat V_L}p_w(p_w-q_w) = \frac{m}{1-\alpha}\left[\sum_{w\in\hat V_L}p_w\Big(p_w-\frac{c_w}m\Big) + \sum_{w\in\hat V_L}p_w\Big(\frac{c_w}m-q_w\Big)\right] \le \frac{m}{1-\alpha}\sum_{w\in\hat V_L}p_w\Big(p_w-\frac{c_w}m\Big) + \frac{m}{1-\alpha}\cdot\frac\alpha m\sum_{w\in\hat V_L}p_w \le \frac{\lambda\ln\frac{2m}\delta}{1-\alpha}\sum_{w\in\hat V_L}\Big(p_w-\frac{c_w}m\Big) + \frac{\alpha}{1-\alpha}F_{\frac{\lambda\ln\frac{2m}\delta}{m}}. \tag{28}$$
We apply Lemma 12 to $v_L$, the union of the words in $V_L$. Let $p_{v_L} = \sum_{w\in V_L}p_w$ and $c_{v_L} = \sum_{w\in V_L}c_w$. We have $\forall_\delta S$:
$$\sum_{w\in\hat V_L}\Big(p_w-\frac{c_w}m\Big) = \sum_{w\in V_L}\Big(p_w-\frac{c_w}m\Big) - \sum_{w\in V_L\setminus S}\Big(p_w-\frac{c_w}m\Big) \le \Big|p_{v_L}-\frac{c_{v_L}}m\Big| + \sum_{w\in V_L\setminus S}p_w \le \sqrt{\frac{3\ln\frac2\delta}{m}} + M_0. \tag{29}$$
The proof follows by combining Equations (28) and (29).

Lemma 42 Let $0<\Delta<1$. For any $x\in[-\Delta,\Delta]$, we have $\ln(1+x) \ge x - \frac{x^2}{2(1-\Delta)^2}$.

Lemma 43 Let $\delta>0$ and $\lambda>3$ be such that $\tau < (\lambda-\sqrt{3\lambda})\ln\frac{4m}\delta$. Let the set of high-probability words be $V_H = \{w\in V: p_w > \frac{\lambda\ln\frac{4m}\delta}{m}\}$, and $\hat V_H = V_H\cap S$. The learning error for $\hat V_H$ can be bounded with high probability by
$$\forall_\delta S,\qquad \sum_{w\in\hat V_H}p_w\ln\frac{p_w}{q_w} \le \sqrt{\frac{3\ln\frac4\delta}{m}} + \frac{3\lambda\ln\frac{4m}\delta}{2(\sqrt\lambda-\sqrt3)^2m}\,N_{\frac{\lambda\ln\frac{4m}\delta}{m}}.$$
Proof
$$\sum_{w\in\hat V_H}p_w\ln\frac{p_w}{q_w} = \sum_{w\in\hat V_H}p_w\ln\frac{mp_w}{c_w} + \sum_{w\in\hat V_H}p_w\ln\frac{c_w}{mq_w} = \sum_{w\in\hat V_H}p_w\ln\frac{mp_w}{c_w} + \sum_{w\in\hat V_H,\,c_w\le\tau}p_w\ln\frac{c_w}{c_w-\alpha}. \tag{30}$$
Using Lemma 9 with $\lambda$, we have $\forall_{\delta/2}\,S$:
$$\forall w\in V_H,\qquad \Big|p_w-\frac{c_w}m\Big| \le \sqrt{\frac{3p_w\ln\frac{4m}\delta}{m}},\qquad c_w \ge (\lambda-\sqrt{3\lambda})\ln\frac{4m}\delta. \tag{31}$$
This means that for a reasonable choice of $\tau$ (namely $\tau < (\lambda-\sqrt{3\lambda})\ln\frac{4m}\delta$), the second term of Equation (30) is 0, and $\hat V_H = V_H$. Also,
$$\forall w\in V_H,\qquad \frac{\big|\frac{c_w}m-p_w\big|}{p_w} \le \frac1{p_w}\sqrt{\frac{3p_w\ln\frac{4m}\delta}{m}} = \sqrt{\frac{3\ln\frac{4m}\delta}{mp_w}} \le \sqrt{\frac3\lambda}.$$
Therefore, we can use Lemma 42 with $\Delta = \sqrt{3/\lambda}$:
$$\sum_{w\in V_H}p_w\ln\frac{mp_w}{c_w} = -\sum_{w\in V_H}p_w\ln\left(1+\frac{\frac{c_w}m-p_w}{p_w}\right) \le -\sum_{w\in V_H}p_w\left[\frac{\frac{c_w}m-p_w}{p_w} - \frac1{2\big(1-\sqrt{3/\lambda}\big)^2}\left(\frac{\frac{c_w}m-p_w}{p_w}\right)^2\right] = -\sum_{w\in V_H}\Big(\frac{c_w}m-p_w\Big) + \frac{\lambda}{2(\sqrt\lambda-\sqrt3)^2}\sum_{w\in V_H}\frac{\big(\frac{c_w}m-p_w\big)^2}{p_w}. \tag{32}$$
We apply Lemma 12 to $v_H$, the union of all the words in $V_H$. Let $p_{v_H} = \sum_{w\in V_H}p_w$ and $c_{v_H} = \sum_{w\in V_H}c_w$. The bound on the first term of Equation (32) is
$$\forall_{\delta/2}\,S,\qquad -\sum_{w\in V_H}\Big(\frac{c_w}m-p_w\Big) = p_{v_H}-\frac{c_{v_H}}m \le \sqrt{\frac{3\ln\frac4\delta}{m}}. \tag{33}$$
Assuming that Equation (31) holds, the second term of Equation (32) can also be bounded:
$$\sum_{w\in V_H}\frac{\big(\frac{c_w}m-p_w\big)^2}{p_w} \le \sum_{w\in V_H}\frac1{p_w}\cdot\frac{3p_w\ln\frac{4m}\delta}{m} = \frac{3\ln\frac{4m}\delta}{m}\,N_{\frac{\lambda\ln\frac{4m}\delta}{m}}. \tag{34}$$
The proof follows by combining Equations (30), (32), (33), and (34).
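To get a feeling for the trade-off in Theorem 39, the hypothetical sketch below evaluates, on a Zipf-like distribution, the distribution-dependent quantities that drive the rough form of the bound: the missing mass $M_0$, the term $\frac{\lambda}{\sqrt m}$, and the threshold term involving $N_x$ and $F_x$ of Definition 38. The threshold uses $\lambda\ln m/m$ and folds the $\ln\frac{8m}\delta$ confidence factors into the constant, so the numbers are indicative only.

```python
import numpy as np

rng = np.random.default_rng(2)

N, m, lam = 50_000, 10_000, 12.0
p = 1.0 / np.arange(1, N + 1); p /= p.sum()        # Zipf-like distribution

sample = rng.choice(N, size=m, p=p)
seen = np.zeros(N, dtype=bool)
seen[np.unique(sample)] = True

M0 = p[~seen].sum()                 # missing mass of this sample
x = lam * np.log(m) / m             # threshold ~ lambda * ln(m) / m
N_x = int((p > x).sum())            # words above the threshold (Definition 38)
F_x = p[p <= x].sum()               # mass below the threshold (Definition 38)

print(f"M_0              = {M0:.4f}")
print(f"N_x              = {N_x}   -> threshold term ~ N_x * x = {N_x * x:.4f}")
print(f"F_x              = {F_x:.4f}")
print(f"lambda / sqrt(m) = {lam / np.sqrt(m):.4f}")
# Lowering lambda shrinks the per-word threshold term but increases N_x;
# the bound of Theorem 39 balances the two.
```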
Acknowledgments

This work was supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778, by a grant from the Israel Science Foundation, and by an IBM faculty award. This publication only reflects the authors' views. We are grateful to David McAllester for his important contributions in the early stages of this research.

Appendix A. Technical Proofs

A.1 Concentration Inequalities

Lemma 6 Proof We use the Stirling approximation $\Gamma(x+1) = \sqrt{2\pi x}\,\big(\frac xe\big)^xT_x$, where $T_x = \exp\big(\frac1{12x}+O\big(\frac1{x^2}\big)\big)$:
$$P(X=k) = \binom nkp^k(1-p)^{n-k} \le \frac{\Gamma(n+1)}{\Gamma(\mu+1)\Gamma(n-\mu+1)}\left(\frac\mu n\right)^\mu\left(\frac{n-\mu}n\right)^{n-\mu} = \frac{\sqrt{2\pi n}}{\sqrt{2\pi\mu}\,\sqrt{2\pi(n-\mu)}}\cdot\frac{n^n}{\mu^\mu(n-\mu)^{n-\mu}}\cdot\frac{\mu^\mu(n-\mu)^{n-\mu}}{n^n}\cdot\frac{T_n}{T_\mu T_{n-\mu}} = \frac1{\sqrt{2\pi\mu(1-p)}}\,\frac{T_n}{T_\mu T_{n-\mu}}.$$
Clearly, for integral values of $\mu$, equality is achieved at $k=\mu$.

Lemma 8 Proof Let $m' = \sum_{w\in V}c'_w$. Using Lemma 7 for $m'$ with $b = c = E[m'] = m$, the probability $P(m'=m)$ achieves its minimum when $\forall w\in V$, $p_w = \frac1N$. Under this assumption we have $m'\sim\mathrm{Bin}\big(mN,\frac1N\big)$, and using Lemma 6,
$$P(m'=m) = \frac{1}{\sqrt{2\pi mN\cdot\frac1N\big(1-\frac1N\big)}}\,\frac{T_{mN}}{T_mT_{mN-m}} \ge \frac1{3\sqrt m}.$$
Therefore, for any distribution $\{p_w: w\in V\}$, we have $P(m'=m)\ge\frac1{3\sqrt m}$. Obviously, $E[F'] = \sum_wE[f_w(c'_w)] = E[F]$. Also, the distribution of $\{c'_w\}$ given that $m'=m$ is identical to the distribution of $\{c_w\}$; therefore the distribution of $F'$ given that $m'=m$ is identical to the distribution of $F$. We have
$$P\big(|F'-E[F']|>\varepsilon\big) = \sum_iP(m'=i)\,P\big(|F'-E[F']|>\varepsilon\,\big|\,m'=i\big) \ge P(m'=m)\,P\big(|F'-E[F']|>\varepsilon\,\big|\,m'=m\big) = P(m'=m)\,P\big(|F-E[F]|>\varepsilon\big) \ge \frac1{3\sqrt m}\,P\big(|F-E[F]|>\varepsilon\big),$$
which completes the proof.

Lemma 44 For any $\delta>0$ and a word $w\in V$ such that $p_w\ge\frac{3\ln\frac2\delta}m$, we have
$$P\left(\Big|p_w-\frac{c_w}m\Big| > \sqrt{\frac{3p_w\ln\frac2\delta}m}\right) \le \delta.$$
Proof The proof follows by applying Lemma 3, substituting $\varepsilon = \sqrt{3mp_w\ln\frac2\delta}$. Note that for $p_w\ge\frac{3\ln\frac2\delta}m$ we have $\varepsilon\le mp_w$:
$$P\left(\Big|p_w-\frac{c_w}m\Big| \ge \sqrt{\frac{3p_w\ln\frac2\delta}m}\right) = P\big(|mp_w-c_w|\ge\varepsilon\big) \le 2\exp\left(-\frac{\varepsilon^2}{2E[c_w]+\varepsilon}\right) \le 2\exp\left(-\frac{3mp_w\ln\frac2\delta}{3mp_w}\right) = \delta,$$
which completes the proof.

Lemma 9 Proof There are at most $\frac m{3\ln\frac{2m}\delta}$ words with probability $p_w\ge\frac{3\ln\frac{2m}\delta}m$. The first claim follows by using Lemma 44 together with the union bound over all these words (with accuracy $\frac\delta m$ for each word). Using the first claim, we derive the second. We show a lower bound for $\frac{c_w}m$, using $\frac{\ln\frac{2m}\delta}m < \frac{p_w}\lambda$:
$$\frac{c_w}m \ge p_w - \sqrt{\frac{3p_w\ln\frac{2m}\delta}m} > p_w - \sqrt{\frac3\lambda}\,p_w = \left(1-\sqrt{\frac3\lambda}\right)p_w.$$
The final inequality follows from simple algebra.
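The binomial tail bound used throughout this appendix (Lemma 44, and through it Lemma 9) is easy to check by simulation. The following hypothetical sketch, with arbitrarily chosen $m$, $\delta$, and a $p_w$ satisfying the lemma's condition, estimates the violation probability and compares it with $\delta$.

```python
import numpy as np

rng = np.random.default_rng(3)

m, delta = 20_000, 0.05
p_w = 0.002                                      # satisfies p_w >= 3 ln(2/delta) / m
assert p_w >= 3 * np.log(2 / delta) / m

eps = np.sqrt(3 * p_w * np.log(2 / delta) / m)   # allowed deviation of c_w / m
c = rng.binomial(m, p_w, size=500_000)           # samples of c_w ~ Bin(m, p_w)
violations = np.mean(np.abs(p_w - c / m) > eps)

print(f"allowed deviation eps          = {eps:.2e}")
print(f"empirical violation frequency  = {violations:.5f}")
print(f"bound delta                    = {delta}")
```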
Lemma 10 Proof Let $b = 3\ln\frac m\delta$. Note that $\delta\in[0,1]$ and $m>1$ yield $b>2$. First, suppose that there are at most $m$ words with $p_w\le\frac bm$. For each such word, we apply Lemma 3 to $c_w$ with $\varepsilon=b$:
$$P\left(c_w > 6\ln\frac m\delta\right) \le P\big(c_w > mp_w+\varepsilon\big) \le \exp\left(-\frac{b^2}{2mp_w+b}\right) \le \frac\delta m.$$
Since we assume that there are at most $m$ such words, the total mistake probability is $\delta$.

Now we consider the general case, without any assumption on the number of words. Our goal is to reduce the problem to the former conditions, that is, to create a set of $m$ "words", each with probability smaller than $\frac bm$. We first create $m$ empty sets $v_1,\ldots,v_m$. Let the probability of each set $v_i$, $p_{v_i}$, be the sum of the probabilities of all the words it includes, and let the count of $v_i$, $c_{v_i}$, be the sum of the sample counts of all the words it includes. We divide the words among these sets in a bin-packing-approximation manner: we sort the words in decreasing order of probability, and then repeatedly insert the next word into the set $v_i$ with the currently smallest $p_{v_i}$.

We claim that $p_{v_i}\le\frac bm$ for every $v_i$ at the end of this loop. If the inequality does not hold, then some word $w$ caused the first "overflow". Obviously, $p_w$ must be smaller than $\frac b{2m}$, since otherwise it would be one of the first $\frac{2m}b < m$ words in the order and would enter an empty set. If $p_w < \frac b{2m}$ and it caused an overflow, then the probability of each set at the moment $w$ was inserted must exceed $\frac b{2m}$, since $w$ entered the lightest available set. This means that the total probability of all the words inserted by that moment exceeds $m\cdot\frac b{2m} > 1$, a contradiction.

Applying the case of $m$ words to the sets $v_1,\ldots,v_m$, we have $\forall_\delta S$: for every $v_i$, $c_{v_i}\le2b$. Also, if the count of each set $v_i$ does not exceed $2b$, neither does the count of each word $w\in v_i$. That is,
$$P\left(\exists w: p_w\le\frac bm,\ c_w>2b\right) \le P\left(\exists v_i: p_{v_i}\le\frac bm,\ c_{v_i}>2b\right) \le \delta,$$
which completes the proof.

Lemma 11 Proof By Lemma 9 with some $\lambda>3$ (to be set later), we have $\forall_{\delta/2}\,S$:
$$\forall w:\ p_w\ge\frac{3\ln\frac{4m}\delta}m,\qquad |mp_w-c_w| \le \sqrt{3mp_w\ln\frac{4m}\delta}, \tag{35}$$
$$\forall w:\ p_w>\frac{\lambda\ln\frac{4m}\delta}m,\qquad mp_w \le \frac{c_w}{1-\sqrt{3/\lambda}}. \tag{36}$$
By Lemma 10, we have $\forall_{\delta/2}\,S$: for every $w$ with $p_w\le\frac{3\ln\frac{4m}\delta}m$, $c_w\le6\ln\frac{4m}\delta$. By Equation (35), for any word $w$ with $\frac{3\ln\frac{4m}\delta}m\le p_w\le\frac{\lambda\ln\frac{4m}\delta}m$ we have $c_w\le mp_w+\sqrt{3mp_w\ln\frac{4m}\delta}\le(\lambda+\sqrt{3\lambda})\ln\frac{4m}\delta$. It follows that any $w$ with $mp_w\le\lambda\ln\frac{4m}\delta$ satisfies $c_w\le(\lambda+\sqrt{3\lambda})\ln\frac{4m}\delta$, and therefore any $w$ with $c_w>(\lambda+\sqrt{3\lambda})\ln\frac{4m}\delta$ satisfies $mp_w>\lambda\ln\frac{4m}\delta$. By Equation (36), this gives $mp_w\le\frac{c_w}{1-\sqrt{3/\lambda}}$, and by Equation (35):
$$|mp_w-c_w| \le \sqrt{3mp_w\ln\frac{4m}\delta} \le \sqrt{\frac{3c_w\ln\frac{4m}\delta}{1-\sqrt{3/\lambda}}} = \sqrt{\frac{3\sqrt\lambda\,c_w\ln\frac{4m}\delta}{\sqrt\lambda-\sqrt3}}.$$
Substituting $\lambda=12$ results in
$$\forall_\delta S:\ \forall w \text{ s.t. } c_w>18\ln\frac{4m}\delta,\qquad |mp_w-c_w| \le \sqrt{6c_w\ln\frac{4m}\delta},$$
which completes the proof.

Lemma 12 Proof If $p_w\ge\frac{3\ln\frac2\delta}m$, we can apply Lemma 44:
$$\forall_\delta S,\qquad \Big|\frac{c_w}m-p_w\Big| \le \sqrt{\frac{3p_w\ln\frac2\delta}m} \le \sqrt{\frac{3\ln\frac2\delta}m}.$$
Otherwise, we can apply Lemma 10:
$$\forall_\delta S,\qquad \Big|\frac{c_w}m-p_w\Big| \le \max\left(p_w,\frac{c_w}m\right) \le \frac{6\ln\frac m\delta}m \le \sqrt{\frac{3\ln\frac2\delta}m},$$
which completes the proof.

Lemma 13 Proof Let $b = \ln\frac m\delta$. We note that there are at most $\frac mb$ words with probability $p_w\ge\frac bm$. Then
$$P\left(\exists w: c_w=0,\ p_w\ge\frac bm\right) \le \sum_{w:\,p_w\ge\frac bm}P(c_w=0) = \sum_{w:\,p_w\ge\frac bm}(1-p_w)^m \le \frac mb\left(1-\frac bm\right)^m < me^{-b} = \delta,$$
which completes the proof.

A.2 K-Hitting Mass Estimation

Lemma 21 Proof We have $\sum_{w\in V_{k,\alpha}}p_w\le1$. Using Lemma 19, we bound $P(c_w=k)$ and $P(c_w=k+1)$:
$$E[M_{k,\alpha}] = \sum_{w\in V_{k,\alpha}}p_wP(c_w=k) = O\left(\frac1{\sqrt k}\right),$$
$$\big|E[G_{k,\alpha}]-E[M_{k,\alpha}]\big| = \left|\sum_{w\in V_{k,\alpha}}\left(\frac{k+1}{m-k}P(c_w=k+1) - p_wP(c_w=k)\right)\right| = O\left(\frac{\sqrt k}m\right). \tag{37}$$
Equation (37) follows by Lemma 20. By Lemma 18, we have $|V_{k,\alpha}| = O\big(\frac mk\big)$:
$$E[G_{k,\alpha}] = \frac{k+1}{m-k}\sum_{w\in V_{k,\alpha}}P(c_w=k+1) = O\left(\frac mk\cdot\frac km\cdot\frac1{\sqrt k}\right) = O\left(\frac1{\sqrt k}\right),$$
which completes the proof.

Theorem 29 Proof The proof examines four ranges of $k$. For $k\le18\ln\frac{8m}\delta$, we use Lemma 28:
$$\forall_\delta S,\qquad |\tilde M_k-M_k| = |G_k-M_k| = O\left(\sqrt{\frac{\ln\frac1\delta}m}\left(\sqrt k+\ln\frac m\delta\right)\right) = \tilde O\left(\frac1{\sqrt m}\right).$$
For $18\ln\frac{8m}\delta < k\le m^{2/5}$, we use Theorem 25:
$$\forall_\delta S,\qquad |\tilde M_k-M_k| = |G_k-M_k| = O\left(\sqrt{\frac{\sqrt k\,\ln\frac m\delta}m} + \frac{k\ln\frac m\delta}m\right) = \tilde O\big(m^{-2/5}\big).$$
For $m^{2/5} < k < \frac m2$, we use Theorem 26:
$$\forall_\delta S,\qquad |\tilde M_k-M_k| = |\hat M_k-M_k| = O\left(\frac{\sqrt k\,\big(\ln\frac m\delta\big)^{3/2}}m + \frac{\sqrt{\ln\frac m\delta}}k\right) = \tilde O\big(m^{-2/5}\big).$$
For $k\ge\frac m2$, let $\alpha = \sqrt{6\ln\frac{8m}\delta}$. By Lemma 17, we have $\forall_{\delta/2}\,S$: $M_k = M_{k,\alpha}$ and $\hat M_k = \hat M_{k,\alpha}$. By Lemma 18, $|V_{k,\alpha}| = O\big(\frac mk\big) = O(1)$; let $c$ be the bound on $|V_{k,\alpha}|$. Using Lemma 12 for each $w\in V_{k,\alpha}$ with accuracy $\frac\delta{2c}$, we have
$$\forall_{\delta/2}\,S,\ \forall w\in V_{k,\alpha},\qquad \Big|\frac{c_w}m-p_w\Big| = O\left(\sqrt{\frac{\ln\frac1\delta}m}\right).$$
Therefore, we have $\forall_\delta S$:
$$|\tilde M_k-M_k| = |\hat M_{k,\alpha}-M_{k,\alpha}| \le \sum_{w\in V_{k,\alpha}}\Big|\frac km-p_w\Big|X_{w,k} = O\left(\sqrt{\frac{\ln\frac1\delta}m}\right) = \tilde O\left(\frac1{\sqrt m}\right),$$
which completes the proof.
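The case analysis in the proof of Theorem 29 amounts to a simple rule: estimate the $k$-hitting mass $M_k$ with the Good-Turing estimator for small $k$ and with the empirical-frequencies estimator for large $k$. The sketch below is a hypothetical illustration of that rule on synthetic data; the exact switching point and the normalization of $G_k$ ($m$ versus $m-k$ in the denominator) follow the reconstruction above and affect only lower-order terms.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(4)

def k_hitting_estimates(counts, m, k):
    """Good-Turing (G_k), empirical (M^_k) and combined estimates of the k-hitting
    mass M_k = sum of p_w over words appearing exactly k times."""
    n_k = sum(1 for c in counts.values() if c == k)
    n_k1 = sum(1 for c in counts.values() if c == k + 1)
    G_k = (k + 1) * n_k1 / (m - k)       # Good-Turing estimator
    M_hat_k = k * n_k / m                # empirical-frequencies estimator
    combined = G_k if k <= m ** 0.4 else M_hat_k   # switch around k ~ m^{2/5}
    return G_k, M_hat_k, combined

# Toy data from a Zipf-like distribution, where the true M_k is computable.
N, m = 30_000, 20_000
p = 1.0 / np.arange(1, N + 1); p /= p.sum()
sample = rng.choice(N, size=m, p=p)
counts = Counter(sample.tolist())

for k in (1, 3, 10, 100):
    M_k = sum(p[w] for w, c in counts.items() if c == k)    # true k-hitting mass
    G_k, M_hat_k, comb = k_hitting_estimates(counts, m, k)
    print(f"k={k:3d}  M_k={M_k:.4f}  G_k={G_k:.4f}  "
          f"empirical={M_hat_k:.4f}  combined={comb:.4f}")
```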
Theorem 30 Proof First, we show that for any two words $u$ and $v$, $\mathrm{Cov}(X_{u,k},X_{v,k}) = \Theta\big(\frac k{m^2}\big)$. Note that $\{c_v\,|\,c_u=k\} \sim \mathrm{Bin}\big(m-k,\ \frac{k/m}{1-k/m}\big)$. By Lemma 6, we have
$$P(c_u=k) = P(c_v=k) = \frac1{\sqrt{2\pi k\big(1-\frac km\big)}}\,\frac{T_m}{T_kT_{m-k}},\qquad P(c_v=k\,|\,c_u=k) = \frac1{\sqrt{2\pi k\big(1-\frac k{m-k}\big)}}\,\frac{T_{m-k}}{T_kT_{m-2k}}. \tag{38}$$
Using $T_x = \Theta(1)$ for $x\ge k$, we have
$$\mathrm{Cov}(X_{u,k},X_{v,k}) = E[X_{u,k}X_{v,k}]-E[X_{u,k}]E[X_{v,k}] = P(c_u=k)\big[P(c_v=k\,|\,c_u=k)-P(c_v=k)\big] = \Theta\left(\frac1k\right)\left[\left(\frac1{\sqrt{1-\frac k{m-k}}}-\frac1{\sqrt{1-\frac km}}\right)\frac{T_{m-k}}{T_{m-2k}} + \frac1{\sqrt{1-\frac km}}\left(\frac{T_{m-k}}{T_{m-2k}}-\frac{T_m}{T_{m-k}}\right)\right]. \tag{39}$$
We can bound the first term of Equation (39):
$$\frac1{\sqrt{1-\frac k{m-k}}}-\frac1{\sqrt{1-\frac km}} = \frac{\sqrt{1-\frac km}-\sqrt{1-\frac k{m-k}}}{\sqrt{\big(1-\frac k{m-k}\big)\big(1-\frac km\big)}} = \Theta\left(\frac k{m-k}-\frac km\right) = \Theta\left(\frac{k^2}{m^2}\right). \tag{40}$$
Since $T_x = \exp\big(\frac1{12x}+O\big(\frac1{x^2}\big)\big) = 1+\frac1{12x}+O\big(\frac1{x^2}\big)$ for $x\ge m-2k$ (note that $k\ll m$), we have
$$\frac{T_{m-k}}{T_{m-2k}}-\frac{T_m}{T_{m-k}} = \frac{T_{m-k}^2-T_mT_{m-2k}}{T_{m-2k}T_{m-k}} = \frac1{T_{m-2k}T_{m-k}}\left[\frac1{6(m-k)}-\frac1{12m}-\frac1{12(m-2k)}+O\left(\frac1{m^2}\right)\right] = -\Theta\left(\frac{k^2}{m^3}\right)+O\left(\frac1{m^2}\right). \tag{41}$$
Combining Equations (39), (40), and (41), we have
$$\mathrm{Cov}(X_{u,k},X_{v,k}) = \Theta\left(\frac1k\right)\left[\Theta\left(\frac{k^2}{m^2}\right)-\Theta\left(\frac{k^2}{m^3}\right)+O\left(\frac1{m^2}\right)\right] = \Theta\left(\frac k{m^2}\right).$$
Next we show that $\sigma^2[X_{w,k}] = \Theta\big(\frac1{\sqrt k}\big)$. By Equation (38),
$$\sigma^2[X_{w,k}] = P(c_w=k)\big(1-P(c_w=k)\big) = \Theta\left(\frac1{\sqrt k}\right)\left(1-\Theta\left(\frac1{\sqrt k}\right)\right) = \Theta\left(\frac1{\sqrt k}\right).$$
Finally, we bound $\sigma^2[M_k]$:
$$\sigma^2[M_k] = \sigma^2\left[\sum_wp_wX_{w,k}\right] = \sum_wp_w^2\sigma^2[X_{w,k}] + \sum_{u\ne v}p_up_v\,\mathrm{Cov}(X_{u,k},X_{v,k}) = \frac mk\left(\frac km\right)^2\Theta\left(\frac1{\sqrt k}\right) + \frac mk\left(\frac mk-1\right)\left(\frac km\right)^2\Theta\left(\frac k{m^2}\right) = \Theta\left(\frac{\sqrt k}m\right),$$
which completes the proof.

A.3 Leave-One-Out Estimation of the Log-Loss

Lemma 34 Proof Let $U = \{w\in V: mp_w\le c\ln\frac m\delta\}$ for some $c>0$, and let $n'_1 = |U\cap S_1|$ and $n'_2 = |U\cap S_2|$. Using Lemma 9, we have $\forall_{\delta/2}\,S$: $n_1 = n'_1$ and $n_2 = n'_2$. Let $b = \ln\frac m\delta$. First, we show that $E[n'_2] = O(bE[n'_1])$:
$$E[n'_2] = \sum_{w\in U}\binom m2p_w^2(1-p_w)^{m-2} = \sum_{w\in U}\frac{m-1}2\,\frac{p_w}{1-p_w}\,mp_w(1-p_w)^{m-1} = \sum_{w\in U}mp_w(1-p_w)^{m-1}\,O(b) = O\big(bE[n'_1]\big).$$
Next, we bound the deviations of $n'_1$ and $n'_2$. A single change in the sample changes $n'_1$, as well as $n'_2$, by at most 1. Therefore, using Lemma 2 for $n'_1$ and $n'_2$, we have
$$\forall_{\delta/4}\,S:\ n'_1 \ge E[n'_1]-O\left(\sqrt{m\ln\frac1\delta}\right),\qquad \forall_{\delta/4}\,S:\ n'_2 \le E[n'_2]+O\left(\sqrt{m\ln\frac1\delta}\right).$$
Therefore,
$$n'_2 \le E[n'_2]+O\left(\sqrt{m\ln\frac1\delta}\right) = O\left(bE[n'_1]+\sqrt{m\ln\frac1\delta}\right) = O\left(b\left(n'_1+\sqrt{m\ln\frac1\delta}\right)\right),$$
which completes the proof.

A.4 Log-Loss A Priori

Theorem 39 Proof The KL-divergence is of course non-negative. By Lemma 40, we have
$$\forall_{\delta/4}\,S,\qquad \sum_{w\notin S}p_w\ln\frac{p_w}{q_w} \le M_0\ln\frac{n_0\ln\frac{4m}\delta}{\alpha n_{1\ldots\tau}}. \tag{42}$$
By Lemma 41 with $\lambda$, we have $\forall_{\delta/4}\,S$:
$$\sum_{w\in S:\,p_w\le\frac{\lambda\ln\frac{8m}\delta}m}p_w\ln\frac{p_w}{q_w} \le \frac{\lambda\ln\frac{8m}\delta}{1-\alpha}\left(\sqrt{\frac{3\ln\frac8\delta}m}+M_0\right) + \frac\alpha{1-\alpha}F_{\frac{\lambda\ln\frac{8m}\delta}m}. \tag{43}$$
By Lemma 43 with $\lambda$, we have $\forall_{\delta/2}\,S$:
$$\sum_{w\in S:\,p_w>\frac{\lambda\ln\frac{8m}\delta}m}p_w\ln\frac{p_w}{q_w} \le \sqrt{\frac{3\ln\frac8\delta}m} + \frac{3\lambda\ln\frac{8m}\delta}{2(\sqrt\lambda-\sqrt3)^2m}\,N_{\frac{\lambda\ln\frac{8m}\delta}m}. \tag{44}$$
The proof follows by combining Equations (42), (43), and (44).

Lemma 42 Proof Let $f(x) = \frac{x^2}{2(1-\Delta)^2}-x+\ln(1+x)$. Then
$$f'(x) = \frac x{(1-\Delta)^2}-1+\frac1{1+x},\qquad f''(x) = \frac1{(1-\Delta)^2}-\frac1{(1+x)^2}.$$
Clearly, $f(0) = f'(0) = 0$. Also, $f''(x)\ge0$ for any $x\in[-\Delta,\Delta]$. Therefore, $f$ is non-negative on this range, and the lemma follows.

References

D. Angluin and L. G. Valiant. Fast probabilistic algorithms for Hamiltonian circuits and matchings. Journal of Computer and System Sciences, 18:155–193, 1979.
S. F. Chen. Building Probabilistic Models for Natural Language. PhD thesis, Harvard University, 1996.

S. F. Chen and J. Goodman. An empirical study of smoothing techniques for language modeling. Technical Report TR-10-98, Harvard University, 1998.

K. W. Church and W. A. Gale. A comparison of the enhanced Good-Turing and deleted estimation methods for estimating probabilities of English bigrams. Computer Speech and Language, 5:19–54, 1991.

J. R. Curran and M. Osborne. A very very large corpus doesn't always yield reliable estimates. In Proceedings of the Sixth Conference on Natural Language Learning, pages 126–131, 2002.

D. P. Dubhashi and D. Ranjan. Balls and bins: A study in negative dependence. Random Structures and Algorithms, 13(2):99–124, 1998.

P. Flajolet. Singularity analysis and asymptotics of Bernoulli sums. Theoretical Computer Science, 215:371–381, 1999.

I. J. Good. The population frequencies of species and the estimation of population parameters. Biometrika, 40(16):237–264, 1953.

I. J. Good. Turing's anticipation of empirical Bayes in connection with the cryptanalysis of the naval Enigma. Journal of Statistical Computation and Simulation, 66(2):101–112, 2000.

W. Hoeffding. On the distribution of the number of successes in independent trials. Annals of Mathematical Statistics, 27:713–721, 1956.

W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:13–30, 1963.

S. B. Holden. PAC-like upper bounds for the sample complexity of leave-one-out cross-validation. In Proceedings of the Ninth Annual ACM Workshop on Computational Learning Theory, pages 41–50, 1996.

S. M. Katz. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech and Signal Processing, 35(3):400–401, 1987.

M. Kearns and D. Ron. Algorithmic stability and sanity-check bounds for leave-one-out cross-validation. Neural Computation, 11(6):1427–1453, 1999.

S. Kutin. Algorithmic Stability and Ensemble-Based Learning. PhD thesis, University of Chicago, 2002.

D. McAllester and L. Ortiz. Concentration inequalities for the missing mass and for histogram rule error. Journal of Machine Learning Research, Special Issue on Learning Theory, 4(Oct):895–911, 2003.

D. McAllester and R. E. Schapire. On the convergence rate of Good-Turing estimators. In Proceedings of the Thirteenth Annual Conference on Computational Learning Theory, pages 1–6, 2000.

D. McAllester and R. E. Schapire. Learning theory and language modeling. In Seventeenth International Joint Conference on Artificial Intelligence, 2001.

C. McDiarmid. On the method of bounded differences. In Surveys in Combinatorics, pages 148–188. London Math. Soc. Lecture Notes 141, Cambridge University Press, 1989.

A. Orlitsky, N. P. Santhanam, and J. Zhang. Always Good Turing: Asymptotically optimal probability estimation. Science, 302(Oct):427–431, 2003.