jmlr jmlr2005 jmlr2005-32 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Manfred Opper, Ole Winther
Abstract: We propose a novel framework for approximations to intractable probabilistic models which is based on a free energy formulation. The approximation can be understood as replacing an average over the original intractable distribution with a tractable one. It requires two tractable probability distributions which are made consistent on a set of moments and encode different features of the original intractable distribution. In this way we are able to use Gaussian approximations for models with discrete or bounded variables which allow us to include non-trivial correlations. These are neglected in many other methods. We test the framework on toy benchmark problems for binary variables on fully connected graphs and 2D grids and compare with other methods, such as loopy belief propagation. Good performance is already achieved by using single nodes as tractable substructures. Significant improvements are obtained when a spanning tree is used instead.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We propose a novel framework for approximations to intractable probabilistic models which is based on a free energy formulation. [sent-7, score-0.617]
2 The approximation can be understood as replacing an average over the original intractable distribution with a tractable one. [sent-8, score-0.202]
3 It requires two tractable probability distributions which are made consistent on a set of moments and encode different features of the original intractable distribution. [sent-9, score-0.28]
4 Significant improvements are obtained when a spanning tree is used instead. [sent-14, score-0.147]
5 In order to treat systems with a large number of variables, it is therefore necessary to use approximate polynomial complexity inference methods. [sent-18, score-0.143]
6 Probably the most prominent and widely developed approximation technique is the so-called variational (or variational Bayes) approximation (see, e. [sent-19, score-0.274]
7 In this approach, the true intractable probability distribution is approximated by another one which is optimally chosen within a given, tractable family, minimizing the Kullback-Leibler (KL) divergence as the measure of dissimilarity between distributions. [sent-24, score-0.171]
8 We will use the name variational bound for this specific method because the approximation results in an upper bound to the free energy (an entropic quantity related to the KL divergence). [sent-25, score-0.622]
9 The alternative approximation methods discussed in this paper can also be derived from the variation of an approximate free energy which is not necessarily a bound. [sent-26, score-0.555]
10 However, their application in the variational bound approximation is limited to distributions of continuous variables which have the entire real space as their natural domain. [sent-30, score-0.137]
11 Recently, a lot of effort has been devoted to the development of approximation techniques which give an improved performance compared to the variational bound approximation. [sent-34, score-0.137]
12 EP is based on a dynamical picture where factors—their product forming a global tractable approximate distribution—are iteratively optimized. [sent-36, score-0.119]
13 However, while tree-type approximations often work well for sparsely connected graphs they become inadequate for inference problems in a dense graph regardless of the type of variables. [sent-45, score-0.145]
14 In this paper we present an alternative view of local-consistency approximations of the EP–type which we call expectation consistent (EC) approximations. [sent-46, score-0.109]
15 Our method is a generalization of the adaptive TAP approach (ADATAP) (Opper and Winther, 2001a,b) developed for inference on densely connected graphical models. [sent-48, score-0.131]
16 , 2002), we believe that it is more natural and more powerful to base the derivation on a framework of optimizing a free energy approximation. [sent-54, score-0.485]
17 Our paper is organized as follows: Section 2 motivates approximate inference and explains the notation. [sent-56, score-0.143]
18 The expectation consistent (EC) approximation to the free energy is derived in Section 3. [sent-57, score-0.584]
19 Examples for EC free energies are given in Section 4. [sent-58, score-0.3]
20 In a typical scenario, $f(x)$ is expressed as a product of two functions $f(x) = f_q(x) f_r(x)$ (2) with $f_{q,r}(x) \geq 0$, where $f_q$ is "simple" enough to allow for tractable computations. [sent-71, score-0.372]
21 The goal is to approximate the "complicated" part $f_r(x)$ by replacing it with a "simpler" function, say of some exponential form $\exp\left(\lambda^T g(x)\right) \equiv \exp\left(\sum_{j=1}^{K} \lambda_j g_j(x)\right)$. [sent-72, score-0.383]
22 In such a case, one may alternatively retain $f_r$ but replace $f_q$ by an approximation of the form eq. [sent-80, score-0.447]
23 One would then end up with two types of approximations $q(x) = \frac{1}{Z_q(\lambda_q)} f_q(x) \exp\left(\lambda_q^T g(x)\right)$ (4) and $r(x) = \frac{1}{Z_r(\lambda_r)} f_r(x) \exp\left(\lambda_r^T g(x)\right)$ (5) for the same density, where $Z_q(\lambda_q) = \int dx\, f_q(x) \exp\left(\lambda_q^T g(x)\right)$. [sent-82, score-0.742]
24 We will not assume that either choice, q or r, is a reasonably good approximation for the global joint density p(x), as we do in the variational bound approximation. [sent-83, score-0.173]
25 Take, as an example, the density $p(x) = f(x)/Z = f_q(x) f_r(x)/Z$—with respect to the Lebesgue measure in $\mathbb{R}^N$—with $f_q(x) = \prod_i \psi_i(x_i)$ (6) and $f_r(x) = \exp\left(\frac{1}{2}\sum_{i \neq j} x_i J_{ij} x_j + \sum_i \theta_i x_i\right)$ (7). [sent-86, score-0.761]
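To make the running example concrete, here is a minimal sketch (not taken from the paper; the instance size, coupling scale, and all names are illustrative) of the density in eqs. (6)-(7) for the binary case, where ψ_i restricts x_i to ±1 and N is kept small enough that the otherwise intractable sums over 2^N states can be carried out exactly for reference.

```python
import numpy as np
from itertools import product

# Ising-type instance of eqs. (6)-(7): p(x) proportional to prod_i psi_i(x_i) * exp(0.5*x^T J x + theta^T x),
# with psi_i restricting x_i to +-1. Sizes and scales below are arbitrary choices.
rng = np.random.default_rng(0)
N = 8
J = rng.normal(scale=0.3 / np.sqrt(N), size=(N, N))
J = (J + J.T) / 2
np.fill_diagonal(J, 0.0)                     # no self-couplings
theta = rng.normal(scale=0.1, size=N)

def exact_marginals(J, theta):
    """Exact means <x_i> by brute-force enumeration (cost 2^N, feasible only for tiny N)."""
    states = np.array(list(product([-1.0, 1.0], repeat=len(theta))))
    logw = 0.5 * np.einsum('si,ij,sj->s', states, J, states) + states @ theta
    w = np.exp(logw - logw.max())
    w /= w.sum()
    return w @ states

m_exact = exact_marginals(J, theta)
print("exact means:", np.round(m_exact, 3))
```

Later sketches in this section reuse or re-create small instances like this one, so that approximate marginals can be compared against m_exact.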
26 for exact computation, half of that for the single-loop tree and one-tenth for the factorized single loop. [sent-100, score-0.178]
27 A greedy single-loop variant of the sequential double-loop algorithm, where the outer loop update is performed after every inner loop update, converged more often without being much slower than the EP-style algorithm. [sent-107, score-0.196]
28 Conclusion and Outlook We have introduced a novel method for approximate inference which tries to overcome limitations of previous approximations in dealing with the correlations of random variables. [sent-113, score-0.184]
29 We expect that our method becomes most powerful when certain tractable substructures of variables with strong dependencies can be identified in a model. [sent-117, score-0.121]
30 Our approach would then ... Figure 1: Maximal absolute deviation (MAD) for one- (left) and two-variable (right) marginals, shown as a function of β. [sent-118, score-0.142]
31 Consider inference on the square grid as a problem where one can introduce tractable substructures without getting a very large increase in complexity. [sent-125, score-0.225]
32 This is useful in cases where an initial three-term approximation $-\ln Z_{EC} = -\ln Z_q - \ln Z_r + \ln Z_s$ still contains non-tractable component free energies. [sent-261, score-0.887]
33 For the practical applicability of approximate inference techniques, improvements in the numerical implementation of the free energy minimization are crucial. [sent-264, score-0.628]
34 In the simulations in this paper we used both single and double loop algorithms. [sent-265, score-0.139]
35 In the guaranteed convergent double loop approaches the free energy minimization is formulated in terms of a sequence of convex optimization problems. [sent-271, score-0.655]
36 One may calculate corrections to the EC free energy and marginals by a perturbative analysis using cumulant expansions of the approximating distributions. [sent-276, score-0.556]
37 Another possible improvement could come from the physics of disordered systems, where methods have been devised to analyze nonergodic free energy landscapes (Mézard et al. [sent-280, score-0.515]
38 This will allow us to make improved estimates of the free energy and marginals, for example for binary variables with large coupling strengths. [sent-282, score-0.556]
39 Dual Formulation In this appendix we present an alternative route to the EC free energy approximation using a two-stage variational formulation. [sent-294, score-0.622]
40 The result is the so-called Gibbs free energy, which is the Lagrangian dual of the Helmholtz free energy eq. [sent-295, score-0.97]
41 We introduce the Gibbs free energy (GFE) approach, (see, e. [sent-299, score-0.485]
42 We define the Gibbs free energy $G(\mu)$ as $G(\mu) = \min_q \left\{ \mathrm{KL}(q, p) \mid \langle g(x) \rangle_q = \mu \right\} - \ln Z$. (39) [sent-304, score-0.648]
43 The term $\ln Z$ has been subtracted to make the resulting expression independent of the intractable partition function Z. [sent-305, score-0.254]
44 It satisfies $\min_\mu G(\mu) = -\ln Z$ (40) and $\langle g(x) \rangle = \operatorname{argmin}_\mu G(\mu)$. (41) [sent-307, score-0.163]
45 A variational bound approximation is recovered by restricting the minimization in eq. [sent-308, score-0.137]
46 $\mu = \partial \ln Z(\lambda)/\partial \lambda$. (43) In the following, it should be clear from the context when λ is a free variable or is to be determined from eq. [sent-319, score-0.204]
47 Inserting eq. (42) into the definition of the Gibbs free energy eq. [sent-322, score-0.485]
48 (39), we get the simpler expression $G(\mu) = -\ln Z(\lambda(\mu)) + \lambda^T(\mu)\,\mu = \max_\lambda \left[ -\ln Z(\lambda) + \lambda^T \mu \right]$, (44) showing that $G(\mu)$ is the Lagrangian dual of $\ln Z(\lambda)$. [sent-323, score-0.489]
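The Legendre-dual form of the Gibbs free energy is easy to probe numerically. The toy script below is an illustration only (the single-variable model and all values are made up): it takes one binary variable with f(x) = exp(θx) and moment function g(x) = x, evaluates G(µ) by brute-force maximization over λ as in eq. (44), and checks eqs. (40) and (41).

```python
import numpy as np

theta = 0.7                                   # hypothetical field for the single spin
lam = np.linspace(-20.0, 20.0, 40001)         # grid for the inner maximization over lambda

def lnZ(l):
    # Z(lambda) = sum_{x=+-1} exp(theta*x) * exp(lambda*x) = 2 cosh(theta + lambda)
    return np.log(2.0 * np.cosh(theta + l))

def G(mu):
    # Gibbs free energy via the dual, eq. (44): G(mu) = max_lambda [ -ln Z(lambda) + lambda*mu ]
    return np.max(-lnZ(lam) + lam * mu)

mu = np.linspace(-0.995, 0.995, 797)
G_vals = np.array([G(m) for m in mu])
i = G_vals.argmin()
print("min_mu G(mu) ~", G_vals[i], "   -ln Z =", -lnZ(0.0))                 # eq. (40)
print("argmin_mu    ~", mu[i], "   <x> = tanh(theta) =", np.tanh(theta))    # eq. (41)
```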
49 Using the notation $p(x|t) = f(x,t)/Z_t$ (which should not be confused with a conditional probability), we calculate the derivative of $G(\mu, t)$ for fixed $\mu$ using (43) and (44) as $\frac{dG(\mu, t)}{dt} = -\frac{\partial \ln Z(\lambda, t)}{\partial t} + \left(\mu - \frac{\partial \ln Z(\lambda, t)}{\partial \lambda}\right)^T \frac{d\lambda}{dt} = -\frac{\partial \ln Z(\lambda, t)}{\partial t}$, (45) where $Z(\lambda, t) = \int dx\, f(x, t) \exp\left(\lambda^T g(x)\right)$. [sent-327, score-0.479]
50 B.2 An Interpolation Representation of Free Energies [sent-328, score-0.251]
51 If the density p factors into a tractable $f_q$ and an intractable part $f_r$, according to eq. [sent-329, score-0.623]
52 (2), we can construct a representation of the Gibbs free energy which also separates into two corresponding parts. [sent-330, score-0.485]
53 This is done by treating fr (x) as a perturbation which is smoothly turned on using a parameter 0 ≤ t ≤ 1. [sent-331, score-0.27]
54 We define fr (x, t) to be a smooth interpolation between the trivial fr (x, t = 0) = 1 and the “full” intractable fr (x, t = 1) = fr (x). [sent-332, score-1.222]
55 The most common choice is to set fr (x, t) = [fr (x)]t , but a more complicated construction can be necessary, when fr contains δ-distributions, see appendix E. [sent-333, score-0.54]
56 For later convenience, we have given a subscript to G and ln Z to indicate which approximating distribution is being used. [sent-336, score-0.163]
57 We can now use the following simple identity for the free energy $G(\mu, t)$, $G(\mu, 1) - G(\mu, 0) = \int_0^1 dt\, \frac{dG(\mu, t)}{dt}$, (49) to relate the Gibbs free energy of the intractable model $G(\mu) = G(\mu, t = 1)$ and the tractable model $G(\mu, t = 0)$. [sent-337, score-1.317]
58 Using eq. (20), we get $\frac{dG(\mu, t)}{dt} = -\frac{\partial \ln Z(\lambda, t)}{\partial t} = -\left\langle \frac{d \ln f_r(x, t)}{dt} \right\rangle_{q(x|t)}$. (50) [sent-339, score-0.772]
59 While this representation can be used to re-derive a variational bound approximation (see Appendix F), we will next re-derive a dual representation of the EC free energy by making an approximation similar in spirit to the one used in Section 3. [sent-340, score-0.653]
60 It is defined by $r(x|t) = \frac{1}{Z_r(\lambda, t)} f_r(x, t) \exp\left(\lambda^T g(x)\right)$, (51) where, as before, the parameters λ are chosen in such a way as to guarantee consistency for the expectations of g, i. [sent-344, score-0.307]
61 $\langle g(x) \rangle_{r(x|t)} = \mu$ and $Z_r(\lambda, t) = \int dx\, f_r(x, t) \exp\left(\lambda^T g(x)\right)$. (52) [sent-346, score-0.335]
62 Obviously, r(x|t) defines another Gibbs free energy which in its dual representation eq. [sent-347, score-0.485]
63 (44) is given by $G_r(\mu, t) = \max_\lambda \left[ -\ln Z_r(\lambda, t) + \lambda^T \mu \right]$. (53) [sent-348, score-0.163]
64 In eq. (49), we make the approximation $\int_0^1 dt\, \left\langle \frac{d \ln f_r(x, t)}{dt} \right\rangle_{q(x|t)} \approx \int_0^1 dt\, \left\langle \frac{d \ln f_r(x, t)}{dt} \right\rangle_{r(x|t)}$. (54) [sent-350, score-1.249]
65 The fact that eqs. (47) and (51) contain the same exponential factor $f_r(x, t) \exp\left(\lambda^T g(x)\right)$ allows us to carry out the integral over the interaction strength t on the right hand side of eq. [sent-352, score-0.307]
66 (54) in closed form without specifying the interpolating term fr (x, t) explicitly. [sent-353, score-0.27]
67 We now use eqs. (49) and (50) again, but this time for the free energy eq. [sent-355, score-0.485]
68 (53), to get $\int_0^1 dt\, \left\langle \frac{d \ln f_r(x, t)}{dt} \right\rangle_{r(x|t)} = G_r(\mu, 1) - G_r(\mu, 0)$. [sent-356, score-0.609]
69 (19) Using the duality expression for the free energies eq. [sent-361, score-0.3]
70 (44), the free energy approximation can be written as $G_{EC}(\mu) = G_q(\mu) + G_r(\mu) - G_s(\mu) = \max_{\lambda_q, \lambda_r} \min_{\lambda_s} \left[ -\ln Z_q(\lambda_q) - \ln Z_r(\lambda_r) + \ln Z_s(\lambda_s) + \mu^T(\lambda_q + \lambda_r - \lambda_s) \right]$, (57) where we have defined $G_q(\mu) = G_q(\mu, 0)$, $G_r(\mu) = G_r(\mu, 1)$ and $G_s(\mu) = G_r(\mu, 0)$. [sent-362, score-1.005]
71 To obtain the corresponding approximation for the Helmholtz free energy − ln Z, we should minimize this expression with respect to µ. [sent-363, score-0.679]
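For the binary example of eqs. (6)-(7), the three component free energies entering eq. (57) can be written in closed form once the single-node moments are chosen as g_i(x) = (x_i, −x_i²/2), so that λ = (γ, Λ) with diagonal Λ: q carries the ±1 constraints, r carries the Gaussian coupling term, and s is a factorized Gaussian. The sketch below is a reconstruction under these assumptions, not code from the paper; it evaluates the three-term combination −ln Z_EC = −ln Z_q − ln Z_r + ln Z_s at a point where λ_s = λ_q + λ_r, which is the stationarity condition in µ of eq. (57).

```python
import numpy as np

def ln_Zq(gamma, Lam):
    # Z_q = prod_i sum_{x=+-1} exp(gamma_i*x - Lam_i*x^2/2); x^2 = 1 for spins.
    return np.sum(np.log(2.0 * np.cosh(gamma)) - Lam / 2.0)

def ln_Zr(gamma, Lam, J, theta):
    # Z_r = integral dx exp(0.5*x^T J x + (theta+gamma)^T x - 0.5*x^T diag(Lam) x):
    # a Gaussian integral, defined only if A = diag(Lam) - J is positive definite.
    A = np.diag(Lam) - J
    b = theta + gamma
    sign, logdet = np.linalg.slogdet(A)
    assert sign > 0, "diag(Lam) - J must be positive definite"
    return 0.5 * (len(b) * np.log(2.0 * np.pi) - logdet + b @ np.linalg.solve(A, b))

def ln_Zs(gamma, Lam):
    # Factorized Gaussian separator: Z_s = prod_i integral dx exp(gamma_i*x - Lam_i*x^2/2), Lam_i > 0.
    return np.sum(0.5 * np.log(2.0 * np.pi / Lam) + gamma**2 / (2.0 * Lam))

def minus_ln_Z_EC(g_q, L_q, g_r, L_r, J, theta):
    # Three-term approximation of the Helmholtz free energy, with lambda_s = lambda_q + lambda_r.
    return -ln_Zq(g_q, L_q) - ln_Zr(g_r, L_r, J, theta) + ln_Zs(g_q + g_r, L_q + L_r)
```

Such stationary points can be obtained, for example, with the single-loop recipe sketched further below.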
72 However, this natural split allows us to develop a double loop algorithm similar to Yuille (2002); Heskes et al. [sent-372, score-0.139]
73 (2003), which is guaranteed to converge to at least one of the stationary points, provided that the EC free energy is bounded from below. [sent-373, score-0.485]
74 Hence, this double loop procedure is equivalent to the one defined in Section 5, demonstrating that the sequence F (λs (t)) yields nondecreasing upper bounds to the minimal EC Gibbs free energy. [sent-378, score-0.343]
75 Tree-Connected Graphs For the EC tree approximation we will need to make inference on tree-connected graphs. [sent-380, score-0.241]
76 Assuming that Λ defines a tree one can express the free energy in terms of single- and two-node free energies (Yedidia et al. [sent-383, score-0.891]
77 , 2001): $-\ln Z(\lambda) = -\sum_{(ij)\in G} \ln Z_{ij}(\lambda^{(ij)}) - \sum_i (1 - n_i) \ln Z_i(\lambda^{(i)})$, (60) where $\lambda^{(ij)} = \left(\gamma_i^{(ij)}, \gamma_j^{(ij)}, \Lambda_{ii}^{(ij)}, \Lambda_{ij}^{(ij)}, \Lambda_{jj}^{(ij)}\right)$ are the parameters associated with the moments $g^{(ij)} = \left(x_i, x_j, -\frac{x_i^2}{2}, -x_i x_j, -\frac{x_j^2}{2}\right)$ and $n_i$ is the number of links to node i. [sent-384, score-0.858]
78 The two-node partition function $Z_{ij}$ is given by $Z_{ij}(\lambda^{(ij)}) = \int dx_i\, dx_j\, \psi_i(x_i)\psi_j(x_j)\, e^{\gamma_i x_i + \gamma_j x_j - \Lambda_{ij} x_i x_j - \Lambda_{ii} x_i^2/2 - \Lambda_{jj} x_j^2/2}$. [sent-385, score-0.378]
79 By solving the max condition we can write the Lagrange parameters in terms of the mean values $m_i = \langle x_i \rangle$ and covariances $\chi_{ij} = \langle x_i x_j \rangle - m_i m_j$. [sent-389, score-0.353]
80 This will be useful when we derive algorithms for optimizing the free energy in section 5 where we need to solve for λ in terms of µ . [sent-390, score-0.485]
81 Finally, we will also need to make inference about the mean values and covariances on the tree for the binary variables. [sent-392, score-0.21]
82 The mean values and correlations are given by $m_i = \tanh\left(\theta_i + \sum_{j,(ij)\in G} r_{(ij)\to i}\right)$ and $\langle x_i x_j \rangle = \frac{e^{-\Lambda_{ij}} \cosh(\theta_{i\setminus j} + \theta_{j\setminus i}) - e^{\Lambda_{ij}} \cosh(\theta_{i\setminus j} - \theta_{j\setminus i})}{e^{-\Lambda_{ij}} \cosh(\theta_{i\setminus j} + \theta_{j\setminus i}) + e^{\Lambda_{ij}} \cosh(\theta_{i\setminus j} - \theta_{j\setminus i})}$. [sent-397, score-0.293]
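The pair formulas above can be sanity-checked numerically for a single pair of binary variables; note that the denominator of the correlation formula is reconstructed here from the two-node partition function, and the cavity fields θ_{i\j}, θ_{j\i} are simply set by hand below rather than computed by message passing (all values are arbitrary).

```python
import numpy as np
from itertools import product

# One pair of +-1 variables with weight exp(a*xi + b*xj - Lam*xi*xj); a and b stand in for
# the cavity fields theta_{i\j} and theta_{j\i}. Compare the closed form with enumeration.
a, b, Lam = 0.4, -0.9, 0.35                      # arbitrary illustrative values

num = np.exp(-Lam) * np.cosh(a + b) - np.exp(Lam) * np.cosh(a - b)
den = np.exp(-Lam) * np.cosh(a + b) + np.exp(Lam) * np.cosh(a - b)
closed_form = num / den                           # <xi xj> from the formula above

states = list(product([-1, 1], repeat=2))
w = np.array([np.exp(a * xi + b * xj - Lam * xi * xj) for xi, xj in states])
brute_force = sum(wk * xi * xj for wk, (xi, xj) in zip(w, states)) / w.sum()

print(closed_form, brute_force)                   # the two values agree
```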
83 Single Loop Algorithmic Recipes In this appendix we give the algorithmic recipes for one sequential algorithm for the factorized EC and a parallel algorithm for tree EC. [sent-399, score-0.178]
84 Send message from r to qi • Calculate separator si : γs,i := mr,i /χr,ii and Λs,i := 1/χr,ii . [sent-402, score-0.136]
85 • Update moments of qi : mq,i := tanh(γq,i ) and χq,ii := 1 − m²q,i . [sent-404, score-0.118]
86 Send message from qi to r • Calculate separator si : γs,i := mq,i /χq,ii and Λs,i := 1/χq,ii . [sent-406, score-0.136]
87 χr := χr − mr Convergence is reached when and if mr = mq and χr,ii = χq,ii , i = 1, . [sent-410, score-0.205]
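One way to turn the sequential recipe above into running code for the binary example of eqs. (6)-(7) is sketched below. It is an interpretation rather than a transcription: the update of r's moments is only partially visible in the extracted text (the fragment "χr := χr − ..." suggests an incremental low-rank update), so the sketch simply recomputes the Gaussian moments χr = (diag(Λr) − J)^(-1) and mr = χr(γr + θ), in analogy with the tree recipe below; the damping and all default values are added for illustration.

```python
import numpy as np

def factorized_ec(J, theta, n_sweeps=500, damping=0.5, tol=1e-9):
    """Sequential single-loop factorized EC for +-1 spins (a sketch, not the paper's code).
    q(x) ~ prod_i psi_i(x_i) exp(gq_i x_i - Lq_i x_i^2/2),
    r(x) ~ exp(0.5 x^T J x + (theta+gr)^T x - 0.5 x^T diag(Lr) x)."""
    N = len(theta)
    g_q, L_q = np.zeros(N), np.ones(N)        # natural parameters (gamma, Lambda) of q
    g_r, L_r = np.zeros(N), np.ones(N)        # natural parameters (gamma, Lambda) of r
    m_q = np.zeros(N)
    for _ in range(n_sweeps):
        change = 0.0
        for i in range(N):
            # moments of r (naive full recompute; requires diag(Lr) - J to stay pos. definite)
            chi_r = np.linalg.inv(np.diag(L_r) - J)
            m_r = chi_r @ (g_r + theta)
            # message from r to q_i: separator matching r's marginal moments at site i
            g_s, L_s = m_r[i] / chi_r[i, i], 1.0 / chi_r[i, i]
            g_q[i], L_q[i] = g_s - g_r[i], L_s - L_r[i]
            # moments of q_i: <x> = tanh(gq_i), chi = 1 - <x>^2 for a +-1 variable
            m_q[i] = np.tanh(g_q[i])
            chi_q = 1.0 - m_q[i] ** 2
            # message from q_i to r: separator matching q_i's moments
            g_s, L_s = m_q[i] / chi_q, 1.0 / chi_q
            g_new, L_new = g_s - g_q[i], L_s - L_q[i]
            change = max(change, abs(g_new - g_r[i]), abs(L_new - L_r[i]))
            g_r[i] += damping * (g_new - g_r[i])
            L_r[i] += damping * (L_new - L_r[i])
        if change < tol:
            break
    return m_q                                 # approximate one-node means <x_i>
```

With the J and θ from the earlier sketch, factorized_ec(J, theta) can be compared directly against exact_marginals(J, theta).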
88 The only difference is that it is parallel and uses inference on a tree graph, see appendix C for details on the tree inference: • Initialize as above. [sent-416, score-0.316]
89 • Update moments of q: [mq , χq ] := inference_binary_tree(γq , Λq ), which will only return non-zero elements of the covariance on the tree. [sent-420, score-0.183]
90 • Update moments of r: χr := (Λr − J)−1 and mr := χr (γ r + θ). [sent-424, score-0.161]
91 Convergence is reached when mq = mr and χq = tree(χr ). [sent-425, score-0.123]
92 (9) can be treated by defining the bimodal density $f_r(x, t) = \prod_{i=1}^{N} \frac{1}{\sqrt{1-t}} \exp\left(-\frac{t}{1-t}\left(x_i^4 - 2 x_i^2\right)\right)$, which interpolates between a constant function for t = 0 and becomes proportional to the Dirac measures eq. [sent-431, score-0.343]
93 Re-deriving the Variational Bound Approximation The choice $\ln f_r(x, t) = t \ln f_r(x)$ for the interpolation can be used for a perturbation expansion of the free energy $G(\mu, t)$ in powers of t, where at the end one sets t = 1. [sent-435, score-0.556]
94 In this case, one obtains an approximation to the Gibbs free energy given by $G(\mu) \approx G(\mu, 0) - \int_0^1 dt\, \left\langle \frac{d \ln f_r(x, t)}{dt} \right\rangle_{q(x|0)} = G(\mu, 0) - \left\langle \ln f_r(x) \right\rangle_{q(x|0)}$. [sent-438, score-1.558]
95 $G_{\mathrm{var}}(\mu) = \min_{q \in \mathcal{F}} \left\{ \mathrm{KL}(q, p) \mid \langle g(x) \rangle_q = \mu \right\} - \ln Z$. (64) [sent-446, score-0.622]
96 Since we are minimizing in a restricted class of distributions, we obtain the upper bound $G(\mu) \leq G_{\mathrm{var}}(\mu)$ on the Gibbs free energy. [sent-447, score-0.204]
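Since G(µ) ≤ G_var(µ), minimizing G_var over µ gives an upper bound on −ln Z, i.e. a lower bound on ln Z. For the binary example of eqs. (6)-(7) with a fully factorized tractable family F, this reduces to the familiar mean-field bound, ln Z ≥ ½ mᵀJm + θᵀm + Σ_i H_b((1+m_i)/2) for any factorized means m. The sketch below (illustrative only; the instance is again made up) iterates the resulting self-consistency equations and compares the bound with the exact ln Z from enumeration.

```python
import numpy as np
from itertools import product

def mean_field_bound(J, theta, n_iter=500, damping=0.5):
    """Variational-bound (mean-field) approximation for +-1 spins:
    ln Z >= 0.5*m^T J m + theta^T m + sum_i H_b((1+m_i)/2) for any factorized means m."""
    m = np.zeros(len(theta))
    for _ in range(n_iter):
        m = (1.0 - damping) * m + damping * np.tanh(theta + J @ m)   # stationarity of the bound
    p = np.clip((1.0 + m) / 2.0, 1e-12, 1.0 - 1e-12)
    entropy = -np.sum(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))
    return 0.5 * m @ J @ m + theta @ m + entropy

def exact_ln_Z(J, theta):
    states = np.array(list(product([-1.0, 1.0], repeat=len(theta))))
    logw = 0.5 * np.einsum('si,ij,sj->s', states, J, states) + states @ theta
    c = logw.max()
    return c + np.log(np.exp(logw - c).sum())

rng = np.random.default_rng(1)
N = 8
J = rng.normal(scale=0.3 / np.sqrt(N), size=(N, N)); J = (J + J.T) / 2.0
np.fill_diagonal(J, 0.0)
theta = rng.normal(scale=0.1, size=N)
print(mean_field_bound(J, theta), "<=", exact_ln_Z(J, theta))
```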
97 TAP Gibbs free energy, belief propagation and sparsity. [sent-488, score-0.281]
98 Expectation propagation for approximate inference in dynamic Bayesian networks. [sent-512, score-0.186]
99 Semidefinite methods for approximate inference on graphs with cycles. [sent-663, score-0.143]
100 CCCP algorithms to minimize the Bethe and Kikuchi free energies: convergent alternatives to belief propagation. [sent-698, score-0.269]
wordName wordTfidf (topN-words)
[('ij', 0.371), ('ec', 0.321), ('energy', 0.281), ('fr', 0.27), ('opper', 0.241), ('winther', 0.21), ('free', 0.204), ('ln', 0.163), ('fq', 0.146), ('gibbs', 0.135), ('xj', 0.126), ('zq', 0.123), ('tree', 0.106), ('variational', 0.106), ('inference', 0.104), ('gr', 0.102), ('tanh', 0.098), ('gec', 0.098), ('zr', 0.097), ('energies', 0.096), ('gq', 0.096), ('minka', 0.093), ('intractable', 0.091), ('mj', 0.089), ('dt', 0.088), ('mr', 0.082), ('loop', 0.082), ('tractable', 0.08), ('moments', 0.079), ('wainwright', 0.073), ('factorized', 0.072), ('marginals', 0.071), ('mi', 0.069), ('ising', 0.065), ('malzahn', 0.065), ('repulsive', 0.065), ('zij', 0.065), ('bethe', 0.061), ('kl', 0.058), ('double', 0.057), ('heskes', 0.055), ('kikuchi', 0.055), ('ld', 0.055), ('message', 0.053), ('erent', 0.051), ('interpolation', 0.051), ('ep', 0.05), ('gvar', 0.049), ('tap', 0.049), ('wind', 0.049), ('yuille', 0.049), ('sp', 0.049), ('gs', 0.045), ('send', 0.044), ('zs', 0.044), ('cosh', 0.044), ('separator', 0.044), ('propagation', 0.043), ('substructures', 0.041), ('mad', 0.041), ('spanning', 0.041), ('mq', 0.041), ('csat', 0.041), ('approximations', 0.041), ('approximate', 0.039), ('qi', 0.039), ('expectation', 0.038), ('links', 0.038), ('exp', 0.037), ('dg', 0.036), ('yedidia', 0.036), ('boyd', 0.036), ('density', 0.036), ('belief', 0.034), ('jj', 0.033), ('adatap', 0.033), ('albers', 0.033), ('cornford', 0.033), ('dcoup', 0.033), ('fabricius', 0.033), ('gij', 0.033), ('ole', 0.033), ('plefka', 0.033), ('roepstor', 0.033), ('suzuki', 0.033), ('wim', 0.033), ('zard', 0.033), ('update', 0.032), ('approximation', 0.031), ('di', 0.031), ('convergent', 0.031), ('eld', 0.03), ('physics', 0.03), ('lagrange', 0.03), ('consistent', 0.03), ('editors', 0.03), ('integrals', 0.029), ('dx', 0.028), ('jordan', 0.028), ('graphical', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 32 jmlr-2005-Expectation Consistent Approximate Inference
Author: Manfred Opper, Ole Winther
Abstract: We propose a novel framework for approximations to intractable probabilistic models which is based on a free energy formulation. The approximation can be understood as replacing an average over the original intractable distribution with a tractable one. It requires two tractable probability distributions which are made consistent on a set of moments and encode different features of the original intractable distribution. In this way we are able to use Gaussian approximations for models with discrete or bounded variables which allow us to include non-trivial correlations. These are neglected in many other methods. We test the framework on toy benchmark problems for binary variables on fully connected graphs and 2D grids and compare with other methods, such as loopy belief propagation. Good performance is already achieved by using single nodes as tractable substructures. Significant improvements are obtained when a spanning tree is used instead.
2 0.15587911 71 jmlr-2005-Variational Message Passing
Author: John Winn, Christopher M. Bishop
Abstract: Bayesian inference is now widely established as one of the principal foundations for machine learning. In practice, exact inference is rarely possible, and so a variety of approximation techniques have been developed, one of the most widely used being a deterministic framework called variational inference. In this paper we introduce Variational Message Passing (VMP), a general purpose algorithm for applying variational inference to Bayesian Networks. Like belief propagation, VMP proceeds by sending messages between nodes in the network and updating posterior beliefs using local operations at each node. Each such update increases a lower bound on the log evidence (unless already at a local maximum). In contrast to belief propagation, VMP can be applied to a very general class of conjugate-exponential models because it uses a factorised variational approximation. Furthermore, by introducing additional variational parameters, VMP can be applied to models containing non-conjugate distributions. The VMP framework also allows the lower bound to be evaluated, and this can be used both for model comparison and for detection of convergence. Variational message passing has been implemented in the form of a general purpose inference engine called VIBES (‘Variational Inference for BayEsian networkS’) which allows models to be specified graphically and then solved variationally without recourse to coding. Keywords: Bayesian networks, variational inference, message passing
3 0.11266566 14 jmlr-2005-Assessing Approximate Inference for Binary Gaussian Process Classification
Author: Malte Kuss, Carl Edward Rasmussen
Abstract: Gaussian process priors can be used to define flexible, probabilistic classification models. Unfortunately exact Bayesian inference is analytically intractable and various approximation techniques have been proposed. In this work we review and compare Laplace’s method and Expectation Propagation for approximate Bayesian inference in the binary Gaussian process classification model. We present a comprehensive comparison of the approximations, their predictive performance and marginal likelihood estimates to results obtained by MCMC sampling. We explain theoretically and corroborate empirically the advantages of Expectation Propagation compared to Laplace’s method. Keywords: Gaussian process priors, probabilistic classification, Laplace’s approximation, expectation propagation, marginal likelihood, evidence, MCMC
4 0.08531934 58 jmlr-2005-Multiclass Classification with Multi-Prototype Support Vector Machines
Author: Fabio Aiolli, Alessandro Sperduti
Abstract: Winner-take-all multiclass classifiers are built on the top of a set of prototypes each representing one of the available classes. A pattern is then classified with the label associated to the most ‘similar’ prototype. Recent proposal of SVM extensions to multiclass can be considered instances of the same strategy with one prototype per class. The multi-prototype SVM proposed in this paper extends multiclass SVM to multiple prototypes per class. It allows to combine several vectors in a principled way to obtain large margin decision functions. For this problem, we give a compact constrained quadratic formulation and we propose a greedy optimization algorithm able to find locally optimal solutions for the non convex objective function. This algorithm proceeds by reducing the overall problem into a series of simpler convex problems. For the solution of these reduced problems an efficient optimization algorithm is proposed. A number of pattern selection strategies are then discussed to speed-up the optimization process. In addition, given the combinatorial nature of the overall problem, stochastic search strategies are suggested to escape from local minima which are not globally optimal. Finally, we report experiments on a number of datasets. The performance obtained using few simple linear prototypes is comparable to that obtained by state-of-the-art kernel-based methods but with a significant reduction (of one or two orders) in response time. Keywords: multiclass classification, multi-prototype support vector machines, kernel machines, stochastic search optimization, large margin classifiers
5 0.081682861 17 jmlr-2005-Change Point Problems in Linear Dynamical Systems
Author: Onno Zoeter, Tom Heskes
Abstract: We study the problem of learning two regimes (we have a normal and a prefault regime in mind) based on a train set of non-Markovian observation sequences. Key to the model is that we assume that once the system switches from the normal to the prefault regime it cannot restore and will eventually result in a fault. We refer to the particular setting as semi-supervised since we assume the only information given to the learner is whether a particular sequence ended with a stop (implying that the sequence was generated by the normal regime) or with a fault (implying that there was a switch from the normal to the fault regime). In the latter case the particular time point at which a switch occurred is not known. The underlying model used is a switching linear dynamical system (SLDS). The constraints in the regime transition probabilities result in an exact inference procedure that scales quadratically with the length of a sequence. Maximum aposteriori (MAP) parameter estimates can be found using an expectation maximization (EM) algorithm with this inference algorithm in the E-step. For long sequences this will not be practically feasible and an approximate inference and an approximate EM procedure is called for. We describe a Ä?Ĺš‚exible class of approximations corresponding to different choices of clusters in a Kikuchi free energy with weak consistency constraints. Keywords: change point problems, switching linear dynamical systems, strong junction trees, approximate inference, expectation propagation, Kikuchi free energies
6 0.078876197 15 jmlr-2005-Asymptotic Model Selection for Naive Bayesian Networks
8 0.076869808 36 jmlr-2005-Gaussian Processes for Ordinal Regression
9 0.065204032 69 jmlr-2005-Tutorial on Practical Prediction Theory for Classification
10 0.062395401 34 jmlr-2005-Feature Selection for Unsupervised and Supervised Inference: The Emergence of Sparsity in a Weight-Based Approach
11 0.058318671 52 jmlr-2005-Loopy Belief Propagation: Convergence and Effects of Message Errors
12 0.053539064 51 jmlr-2005-Local Propagation in Conditional Gaussian Bayesian Networks
13 0.052436016 16 jmlr-2005-Asymptotics in Empirical Risk Minimization
14 0.047109686 62 jmlr-2005-Probabilistic Non-linear Principal Component Analysis with Gaussian Process Latent Variable Models
15 0.046313088 31 jmlr-2005-Estimation of Non-Normalized Statistical Models by Score Matching
16 0.045173258 22 jmlr-2005-Concentration Bounds for Unigram Language Models
17 0.042198684 19 jmlr-2005-Clustering on the Unit Hypersphere using von Mises-Fisher Distributions
18 0.040965974 2 jmlr-2005-A Bayesian Model for Supervised Clustering with the Dirichlet Process Prior
19 0.039671559 40 jmlr-2005-Inner Product Spaces for Bayesian Networks
20 0.038328528 7 jmlr-2005-A Unifying View of Sparse Approximate Gaussian Process Regression
topicId topicWeight
[(0, 0.272), (1, -0.234), (2, -0.131), (3, -0.021), (4, -0.151), (5, 0.086), (6, 0.025), (7, 0.049), (8, -0.143), (9, -0.012), (10, 0.003), (11, -0.021), (12, -0.026), (13, -0.026), (14, 0.045), (15, -0.287), (16, -0.138), (17, 0.125), (18, -0.021), (19, 0.093), (20, -0.021), (21, 0.057), (22, 0.023), (23, 0.096), (24, -0.256), (25, 0.002), (26, 0.119), (27, -0.06), (28, -0.088), (29, 0.002), (30, -0.139), (31, 0.093), (32, 0.05), (33, -0.209), (34, 0.053), (35, 0.017), (36, -0.01), (37, 0.073), (38, 0.226), (39, 0.019), (40, -0.018), (41, -0.011), (42, -0.065), (43, -0.118), (44, 0.17), (45, 0.036), (46, 0.013), (47, 0.232), (48, 0.103), (49, -0.051)]
simIndex simValue paperId paperTitle
same-paper 1 0.97336823 32 jmlr-2005-Expectation Consistent Approximate Inference
2 0.52264464 71 jmlr-2005-Variational Message Passing
3 0.39767793 58 jmlr-2005-Multiclass Classification with Multi-Prototype Support Vector Machines
4 0.3047007 14 jmlr-2005-Assessing Approximate Inference for Binary Gaussian Process Classification
5 0.27619496 17 jmlr-2005-Change Point Problems in Linear Dynamical Systems
8 0.23193496 69 jmlr-2005-Tutorial on Practical Prediction Theory for Classification
9 0.22471255 36 jmlr-2005-Gaussian Processes for Ordinal Regression
10 0.22006217 16 jmlr-2005-Asymptotics in Empirical Risk Minimization
11 0.20226027 52 jmlr-2005-Loopy Belief Propagation: Convergence and Effects of Message Errors
12 0.18143652 22 jmlr-2005-Concentration Bounds for Unigram Language Models
13 0.16811544 15 jmlr-2005-Asymptotic Model Selection for Naive Bayesian Networks
14 0.15693694 31 jmlr-2005-Estimation of Non-Normalized Statistical Models by Score Matching
15 0.15576838 62 jmlr-2005-Probabilistic Non-linear Principal Component Analysis with Gaussian Process Latent Variable Models
16 0.14496885 57 jmlr-2005-Multiclass Boosting for Weak Classifiers
17 0.14392439 70 jmlr-2005-Universal Algorithms for Learning Theory Part I : Piecewise Constant Functions
18 0.13773467 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods
19 0.13699357 51 jmlr-2005-Local Propagation in Conditional Gaussian Bayesian Networks
20 0.13552448 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)
topicId topicWeight
[(13, 0.042), (17, 0.018), (19, 0.029), (36, 0.049), (37, 0.041), (42, 0.01), (43, 0.03), (47, 0.012), (52, 0.055), (59, 0.023), (70, 0.05), (80, 0.425), (88, 0.085), (90, 0.03), (94, 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 0.83899009 32 jmlr-2005-Expectation Consistent Approximate Inference
2 0.78745669 22 jmlr-2005-Concentration Bounds for Unigram Language Models
Author: Evgeny Drukh, Yishay Mansour
Abstract: We show several high-probability concentration bounds for learning unigram language models. One interesting quantity is the probability of all words appearing exactly k times in a sample of size m. A standard estimator for this quantity is the Good-Turing estimator. The existing analysis on k its error shows a high-probability bound of approximately O √m . We improve its dependency on k to O √ 4 √k m k + m . We also analyze the empirical frequencies estimator, showing that with high probability its error is bounded by approximately O 2 −5 which has an error of approximately O m 1 k + √ k m . We derive a combined estimator, , for any k. A standard measure for the quality of a learning algorithm is its expected per-word log-loss. The leave-one-out method can be used for estimating the log-loss of the unigram model. We show 1 that its error has a high-probability bound of approximately O √m , for any underlying distribution. We also bound the log-loss a priori, as a function of various parameters of the distribution. Keywords: Good-Turing estimators, logarithmic loss, leave-one-out estimation, Chernoff bounds 1. Introduction and Overview Natural language processing (NLP) has developed rapidly over the last decades. It has a wide range of applications, including speech recognition, optical character recognition, text categorization and many more. The theoretical analysis has also advanced significantly, though many fundamental questions remain unanswered. One clear challenge, both practical and theoretical, concerns deriving stochastic models for natural languages. Consider a simple language model, where the distribution of each word in the text is assumed to be independent. Even for such a simplistic model, fundamental questions relating sample size to the learning accuracy are already challenging. This is mainly due to the fact that the sample size is almost always insufficient, regardless of how large it is. To demonstrate this phenomena, consider the following example. We would like to estimate the distribution of first names in the university. For that, we are given the names list of a graduate seminar: Alice, Bob, Charlie, Dan, Eve, Frank, two Georges, and two Henries. How can we use this sample to estimate the distribution of students’ first names? An empirical frequency estimator would c 2005 Evgeny Drukh and Yishay Mansour. D RUKH AND M ANSOUR assign Alice the probability of 0.1, since there is one Alice in the list of 10 names, while George, appearing twice, would get estimation of 0.2. Unfortunately, unseen names, such as Michael, will get an estimation of 0. Clearly, in this simple example the empirical frequencies are unlikely to estimate well the desired distribution. In general, the empirical frequencies estimate well the probabilities of popular names, but are rather inaccurate for rare names. Is there a sample size, which assures us that all the names (or most of them) will appear enough times to allow accurate probabilities estimation? The distribution of first names can be conjectured to follow the Zipf’s law. In such distributions, there will be a significant fraction of rare items, as well as a considerable number of non-appearing items, in any sample of reasonable size. The same holds for the language unigram models, which try to estimate the distribution of single words. 
As it has been observed empirically on many occasions (Chen, 1996; Curran and Osborne, 2002), there are always many rare words and a considerable number of unseen words, regardless of the sample size. Given this observation, a fundamental issue is to estimate the distribution the best way possible. 1.1 Good-Turing Estimators An important quantity, given a sample, is the probability mass of unseen words (also called “the missing mass”). Several methods exist for smoothing the probability and assigning probability mass to unseen items. The almost standard method for estimating the missing probability mass is the Good-Turing estimator. It estimates the missing mass as the total number of unique items, divided by the sample size. In the names example above, the Good-Turing missing mass estimator is equal 0.6, meaning that the list of the class names does not reflect the true distribution, to put it mildly. The Good-Turing estimator can be extended for higher orders, that is, estimating the probability of all names appearing exactly k times. Such estimators can also be used for estimating the probability of individual words. The Good-Turing estimators dates back to World War II, and were published first in 1953 (Good, 1953, 2000). It has been extensively used in language modeling applications since then (Katz, 1987; Church and Gale, 1991; Chen, 1996; Chen and Goodman, 1998). However, their theoretical convergence rate in various models has been studied only in the recent years (McAllester and Schapire, 2000, 2001; Kutin, 2002; McAllester and Ortiz, 2003; Orlitsky et al., 2003). For estimation of the probability of all words appearing exactly k times in a sample of size m, McAllester and Schapire k (2000) derive a high probability bound on Good-Turing estimator error of approximately O √m . One of our main results improves the dependency on k of this bound to approximately O √ 4 √k + k m m √ k 1 k+ m , We also show that the empirical frequencies estimator has an error of approximately O for large values of k. Based on the two estimators, we derive a combined estimator with an error of √ 4 2 k approximately O m− 5 , for any k. We also derive a weak lower bound of Ω √m for an error of any estimator based on an independent sample. Our results give theoretical justification for using the Good-Turing estimator for small values of k, and the empirical frequencies estimator for large values of k. Though in most applications the Good-Turing estimator is used for very small values of k, for example k ≤ 5, as by Katz (1987) or Chen (1996), we show that it is fairly accurate in a much wider range. 1232 . C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS 1.2 Logarithmic Loss The Good-Turing estimators are used to approximate the probability mass of all the words with a certain frequency. For many applications, estimating this probability mass is not the main optimization criteria. Instead, a certain distance measure between the true and the estimated distributions needs to be minimized. The most popular distance measure used in NLP applications is the Kullback-Leibler (KL) divergence. For a true distribution P = {px }, and an estimated distribution Q = {qx }, both over some px set X, this measure is defined as ∑x px ln qx . An equivalent measure, up to the entropy of P, is the 1 logarithmic loss (log-loss), which equals ∑x px ln qx . Many NLP applications use the value of log-loss to evaluate the quality of the estimated distribution. 
However, the log-loss cannot be directly calculated, since it depends on the underlying distribution, which is unknown. Therefore, estimating log-loss using the sample is important, although the sample cannot be independently used for both estimating the distribution and testing it. The hold-out estimation splits the sample into two parts: training and testing. The training part is used for learning the distribution, whereas the testing sample is used for evaluating the average per-word log-loss. The main disadvantage of this method is the fact that it uses only part of the available information for learning, whereas in practice one would like to use all the sample. A widely used general estimation method is called leave-one-out. Basically, it performs averaging all the possible estimations, where a single item is chosen for testing, and the rest are used for training. This procedure has an advantage of using the entire sample, and in addition it is rather simple and usually can be easily implemented. The existing theoretical analysis of the leave-oneout method (Holden, 1996; Kearns and Ron, 1999) shows general high probability concentration bounds for the generalization error. However, these techniques are not applicable in our setting. 1 We show that the leave-one-out estimation error for the log-loss is approximately O √m , for any underlying distribution and a general family of learning algorithms. It gives a theoretical justification for effective use of leave-one-out estimation for the log-loss. We also analyze the concentration of the log-loss itself, not based of an empirical measure. We address the characteristics of the underlying distribution affecting the log-loss. We find such a characteristic, defining a tight bound for the log-loss value. 1.3 Model and Semantics We denote the set of all words as V , and N = |V |. Let P be a distribution over V , where pw is the probability of a word w ∈ V . Given a sample S of size m, drawn i.i.d. using P, we denote the number of appearances of a word w in S as cS , or simply cw , when a sample S is clear from the context.1 We w define Sk = {w ∈ V : cS = k}, and nk = |Sk |. w For a claim Φ regarding a sample S, we write ∀δ S Φ[S] for P(Φ[S]) ≥ 1 − δ. For some error c ˜ bound function f (·), which holds with probability 1 − δ, we write O( f (·)) for O f (·) ln m , δ where c > 0 is some constant. 1.4 Paper Organization Section 2 shows several standard concentration inequalities, together with their technical applications regarding the maximum-likelihood approximation. Section 3 shows the error bounds for the 1. Unless mentioned otherwise, all further sample-dependent definitions depend on the sample S. 1233 D RUKH AND M ANSOUR k-hitting mass estimation. Section 4 bounds the error for the leave-one-out estimation of the logarithmic loss. Section 5 shows the bounds for the a priori logarithmic loss. Appendix A includes the technical proofs. 2. Concentration Inequalities In this section we state several standard Chernoff-style concentration inequalities. We also show some of their corollaries regarding the maximum-likelihood approximation of pw by pw = cw . ˆ m Lemma 1 (Hoeffding, 1963) Let Y = Y1 , . . . ,Yn be a set of n independent random variables, such that Yi ∈ [bi , bi + di ]. Then, for any ε > 0, P ∑ Yi − E ∑ Yi i >ε i ≤ 2 exp − 2ε2 ∑i di2 . The next lemma is a variant of an extension of Hoeffding’s inequality, by McDiarmid (1989). Lemma 2 Let Y = Y1 , . . . 
,Yn be a set of n independent random variables, and f (Y ) such that any change of Yi value changes f (Y ) by at most di , that is sup ∀ j=i,Y j =Y j (| f (Y ) − f (Y )|) ≤ di . Let d = maxi di . Then, ∀δY : | f (Y ) − E[ f (Y )]| ≤ d n ln 2 δ . 2 Lemma 3 (Angluin and Valiant, 1979) Let Y = Y1 , . . . ,Yn be a set of n independent random variables, where Yi ∈ [0, B]. Let µ = E [∑i Yi ]. Then, for any ε > 0, P ∑ Yi < µ − ε ≤ exp − ε2 , 2µB ∑ Yi > µ + ε ≤ exp − ε2 . (2µ + ε)B i P i Definition 4 (Dubhashi and Ranjan, 1998) A set of random variables Y1 , . . . ,Yn is called “negatively associated”, if it satisfies for any two disjoint subsets I and J of {1, . . . , n}, and any two non-decreasing, or any two non-increasing, functions f from R|I| to R and g from R|J| to R: E[ f (Yi : i ∈ I)g(Y j : j ∈ J)] ≤ E[ f (Yi : i ∈ I)]E[g(Y j : j ∈ J)]. The next lemma is based on the negative association analysis. It follows directly from Theorem 14 and Proposition 7 of Dubhashi and Ranjan (1998). 1234 C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS Lemma 5 For any set of N non-decreasing, or N non-increasing functions { fw : w ∈ V }, any Chernoff-style bound on ∑w∈V fw (cw ), pretending that cw are independent, is valid. In particular, Lemmas 1 and 2 apply for {Y1 , ...,Yn } = { fw (cw ) : w ∈ V }. The next lemma shows an explicit upper bound on the binomial distribution probability.2 Lemma 6 Let X ∼ Bin(n, p) be a sum of n i.i.d. Bernoulli random variables with p ∈ (0, 1). Let 1 µ = E[X] = np. For x ∈ (0, n], there exists some function Tx = exp 12x + O x12 , such that ∀k ∈ Tn 1 {0, . . . , n}, we have P(X = k) ≤ √ Tµ Tn−µ . For integral values of µ, the equality is achieved 2πµ(1−p) at k = µ. (Note that for x ≥ 1, we have Tx = Θ(1).) The next lemma deals with the number of successes in independent trials. Lemma 7 (Hoeffding, 1956) Let Y1 , . . . ,Yn ∈ {0, 1} be a sequence of independent trials, with pi = 1 E[Yi ]. Let X = ∑i Yi be the number of successes, and p = n ∑i pi be the average trial success probability. For any integers b and c such that 0 ≤ b ≤ np ≤ c ≤ n, we have c ∑ k=b n k p (1 − p)n−k ≤ P (b ≤ X ≤ c) ≤ 1. k Using the above lemma, the next lemma shows a general concentration bound for a sum of arbitrary real-valued functions of a multinomial distribution components. We show that with a small penalty, any Chernoff-style bound pretending the components being independent is valid.3 We recall that cS , or equivalently cw , is the number of appearances of the word w in a sample S of w size m. Lemma 8 Let {cw ∼ Bin(m, pw ) : w ∈ V } be independent binomial random variables. Let { fw (x) : w ∈ V } be a set of real valued functions. Let F = ∑w fw (cw ) and F = ∑w fw (cw ). For any ε > 0, √ P (|F − E [F]| > ε) ≤ 3 m P F − E F >ε . The following lemmas provide concentration bounds for maximum-likelihood estimation of pw by pw = cw . The first lemma shows that words with “high” probability have a “high” count in the ˆ m sample. Lemma 9 Let δ > 0, and λ ≥ 3. We have ∀δ S: ∀w ∈ V, s.t. mpw ≥ 3 ln 2m , δ |mpw − cw | ≤ ∀w ∈ V, s.t. mpw > λ ln 2m , δ cw > 1− 3mpw ln 3 λ 2m ; δ mpw . 2. Its proof is based on Stirling approximation directly, though local limit theorems could be used. This form of bound is needed for the proof of Theorem 30. 3. The negative association analysis (Lemma 5) shows that a sum of monotone functions of multinomial distribution components must obey Chernoff-style bounds pretending that the components are independent. 
In some sense, our result extends this notion, since it does not require the functions to be monotone. 1235 D RUKH AND M ANSOUR The second lemma shows that words with “low” probability have a “low” count in the sample. Lemma 10 Let δ ∈ (0, 1), and m > 1. Then, ∀δ S: ∀w ∈ V such that mpw ≤ 3 ln m , we have cw ≤ δ 6 ln m . δ The following lemma derives the bound as a function of the count in the sample (and not as a function of the unknown probability). Lemma 11 Let δ > 0. Then, ∀δ S: cw > 18 ln 4m , δ ∀w ∈ V, s.t. |mpw − cw | ≤ 6cw ln 4m . δ The following is a general concentration bound. Lemma 12 For any δ > 0, and any word w ∈ V , we have ∀δ S, cw − pw < m 3 ln 2 δ . m The following lemma bounds the probability of words that do not appear in the sample. Lemma 13 Let δ > 0. Then, ∀δ S: ∀w ∈ S, / mpw < ln m . δ 3. K-Hitting Mass Estimation In this section our goal is to estimate the probability of the set of words appearing exactly k times in the sample, which we call “the k-hitting mass”. We analyze the Good-Turing estimator, the empirical frequencies estimator, and a combined estimator. ˆ Definition 14 We define the k-hitting mass Mk , its empirical frequencies estimator Mk , and its 4 Good-Turing estimator Gk as Mk = ∑ w∈Sk pw ˆ Mk = k nk m Gk = k+1 nk+1 . m−k The outline of this section is as follows. Definition 16 slightly redefines the k-hitting mass and its estimators. Lemma 17 shows that this redefinition has a negligible influence. Then, we analyze the estimation errors using the concentration inequalities from Section 2. Lemmas 20 and 21 bound the expectation of the Good-Turing estimator error, following McAllester and Schapire (2000). Lemma 23 bounds the deviation of the error, using the negative association analysis. A tighter bound, based on Lemma 8, is achieved at Theorem 25. Theorem 26 analyzes the error of the empirical frequencies estimator. Theorem 29 refers to the combined estimator. Finally, Theorem 30 shows a weak lower bound for the k-hitting mass estimation. 4. The Good-Turing estimator is usually defined as ( k+1 )nk+1 . The two definitions are almost identical for small values m k of k, as their quotient equals 1 − m . Following McAllester and Schapire (2000), our definition makes the calculations slightly simpler. 1236 C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS Definition 15 For any w ∈ V and i ∈ {0, · · · , m}, we define Xw,i as a random variable equal 1 if cw = i, and 0 otherwise. The following definition concentrates on words whose frequencies are close to their probabilities. Definition 16 Let α > 0 and k > 3α2 . We define Ik,α = pw ∈ Ik,α }. We define: Mk,α = ∑ pw = w∈Sk ∩Vk,α ∑ √ √ k−α k k+1+α k+1 m , m , and Vk,α = {w ∈ V : pw Xw,k , w∈Vk,α Gk,α = k+1 k+1 |Sk+1 ∩Vk,α | = ∑ Xw,k+1 , m−k m − k w∈Vk,α ˆ Mk,α = k k |Sk ∩Vk,α | = ∑ Xw,k . m m w∈Vk,α By Lemma 11, for large values of k the redefinition coincides with the original definition with high probability: Lemma 17 For δ > 0, let α = ˆ ˆ and Mk = Mk,α . 6 ln 4m . For k > 18 ln 4m , we have ∀δ S: Mk = Mk,α , Gk = Gk,α , δ δ Proof By Lemma 11, we have ∀δ S, ∀w : cw > 18 ln 4m , δ |mpw − cw | ≤ 6cw ln √ 4m = α cw . δ This means that any word w with cw = k has √ √ √ k−α k k+α k k+1+α k+1 ≤ pw ≤ < . m m m √ ˆ Therefore w ∈ Vk,α , completing the proof for Mk and Mk . Since α < k, any word w with cw = k + 1 has √ √ √ k−α k k+1−α k+1 k+1+α k+1 < ≤ pw ≤ , m m m which yields w ∈ Vk,α , completing the proof for Gk . Since the minimal probability of a word in Vk,α is Ω Lemma 18 Let α > 0 and k > 3α2 . 
Then, |Vk,α | = O 1237 m k k m . , we derive: D RUKH AND M ANSOUR Proof We have α < √ √k . 3 Any word w ∈ Vk,α has pw ≥ |Vk,α | < √ k−α k m > k m 1 1 − √3 . Therefore, m m 1 , =O 1 k 1− √ k 3 which completes the proof. Using Lemma 6, we derive: Lemma 19 Let α > 0 and 3α2 < k ≤ m . Let w ∈ Vk,α . Then, E[Xw,k ] = P(cw = k) = O 2 1 √ k . Proof Since cw ∼ Bin(m, pw ) is a binomial random variable, we use Lemma 6: E[Xw,k ] = P(cw = k) ≤ Tm 1 . 2πmpw (1 − pw ) Tmpw Tm(1−pw ) For w ∈ Vk,α , we have mpw = Ω(k), which implies 3α2 < k ≤ m 2, Tm Tmpw Tm(1−pw ) we have 1 2πmpw (1 − pw ) ≤ = O(1). Since pw ∈ Ik,α and 1 √ 2π k − α k √ k+1+α k+1 m 1− 1 < 1 2πk 1 − √3 1 1 − k+1 1 + √3 m 1 < 1 2πk 1 − √3 1− 1 2 1 +m 1 1 + √3 1 = O √ , k which completes the proof. 3.1 Good-Turing Estimator The following lemma, directly based on the definition of the binomial distribution, was shown in Theorem 1 of McAllester and Schapire (2000). Lemma 20 For any k < m, and w ∈ V , we have pw P(cw = k) = k+1 P(cw = k + 1)(1 − pw ). m−k The following lemma bounds the expectations of the redefined k-hitting mass, its Good-Turing estimator, and their difference. 1238 C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS Lemma 21 Let α > 0 and 3α2 < k < |E[Gk,α ] − E[Mk,α ]| = O √ k m m 2. We have E[Mk,α ] = O 1 √ k , E[Gk,α ] = O 1 √ k , and . Lemma 22 Let δ > 0, k ∈ {1, . . . , m}. Let U ⊆ V , such that |U| = O m . Let {bw : w ∈ U}, such k k that ∀w ∈ U, bw ≥ 0 and maxw∈U bw = O m . Let Xk = ∑w∈U bw Xw,k . We have ∀δ S: |Xk − E[Xk ]| = O k ln 1 δ . m Proof We define Yw,k = ∑i≤k Xw,i be random variable indicating cw ≤ k and Zw,k = ∑i < k. Let Yk = ∑w∈U bwYw,k and Zk = ∑w∈U bw Zw,k . We have ∑ bw Xw,k = ∑ bw [Yw,k − Zw,k ] = Yk − Zk . Xk = w∈U w∈U Both Yk and Zk , can be bounded using the Hoeffding inequality. Since {bwYw,k } and {bw Zw,k } are monotone with respect to {cw }, Lemma 5 applies for them. This means that the concentration of their sum is at least as tight as if they were independent. Recalling that |U| = O m and k k maxw∈U bw = O m , and using Lemma 2 for Yk and Zk , we have δ ∀ 2 S, |Yk − E[Yk ]| = O δ ∀ 2 S, |Zk − E[Zk ]| = O k m m k ln 1 , δ k m m k ln 1 . δ Therefore, |Xk − E[Xk ]| = |Yk − Zk − E[Yk − Zk ]| which completes the proof. ≤ |Yk − E[Yk ]| + |Zk − E[Zk ]| = O k ln 1 δ , m Using the negative association notion, we can show a preliminary bound for Good-Turing estimation error: Lemma 23 For δ > 0 and 18 ln 8m < k < m , we have ∀δ S: 2 δ k ln 1 δ |Gk − Mk | = O . m 1239 D RUKH AND M ANSOUR Proof Let α = 6 ln 8m . By Lemma 17, we have δ δ ∀ 2 S, Gk = Gk,α ∧ Mk = Mk,α . (1) By Lemma 21, |E[Gk − Mk ]| = |E[Gk,α − Mk,α ]| = O √ k m . (2) k+1 By Definition 16, Mk,α = ∑w∈Vk,α pw Xw,k and Gk,α = ∑w∈Vk,α m−k Xw,k+1 . By Lemma 18, we have |Vk,α | = O m . Therefore, using Lemma 22 with k for Mk,α , and with k + 1 for Gk,α , we have k δ ∀ 4 S, |Mk,α − E[Mk,α ]| = O δ ∀ 4 S, |Gk,α − E[Gk,α ]| = O k ln 1 δ m , (3) k ln 1 δ m . (4) Combining Equations (1), (2), (3), and (4), we have ∀δ S: |Gk − Mk | = |Gk,α − Mk,α | ≤ |Gk,α − E[Gk,α ]| + |Mk,α − E[Mk,α ]| + |E[Gk,α ] − E[Mk,α ]| √ k ln 1 k ln 1 k δ δ = O +O = O , m m m which completes the proof. Lemma 24 Let δ > 0, k > 0. Let U ⊆ V . Let {bw : w ∈ U} be a set of weights, such that bw ∈ [0, B]. Let Xk = ∑w∈U bw Xw,k , and µ = E[Xk ]. We have δ ∀ S, |Xk − µ| ≤ max √ √ 6 m 6 m 4Bµ ln , 2B ln δ δ . 
Proof By Lemma 8, combined with Lemma 3, we have ε2 B(2µ + ε) √ √ ε2 ε ≤ max 6 m exp − , 6 m exp − 4Bµ 2B √ P(|Xk − µ| > ε) ≤ 6 m exp − 1240 , (5) C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS where Equation (5) follows by considering ε ≤ 2µ and ε > 2µ separately. The lemma follows sub- stituting ε = max 4Bµ ln √ 6 m δ , 2B ln √ 6 m δ . We now derive the concentration bound on the error of the Good-Turing estimator. Theorem 25 For δ > 0 and 18 ln 8m < k < m , we have ∀δ S: 2 δ √ |Gk − Mk | = O Proof Let α = k ln m m δ + k ln m m δ . δ 6 ln 8m . Using Lemma 17, we have ∀ 2 S: Gk = Gk,α , and Mk = Mk,α . Recall that δ k+1 Mk,α = ∑w∈Vk,α pw Xw,k and Gk,α = ∑w∈Vk,α m−k Xw,k+1 . Both Mk,α and Gk,α are linear combinations k of Xw,k and Xw,k+1 , respectively, where the coefficients’ magnitude is O m , and the expectation, by Lemma 21, is O 1 √ k . By Lemma 24, we have δ √ k ln m δ m + k ln m δ m , (6) δ 4 √ k ln m δ m + k ln m δ m . (7) ∀ 4 S, |Mk,α − E[Mk,α ]| = O ∀ S, |Gk,α − E[Gk,α ]| = O Combining Equations (6), (7), and Lemma 21, we have ∀δ S: |Gk − Mk | = |Gk,α − Mk,α | ≤ |Gk,α − E[Gk,α ]| + |Mk,α − E[Mk,α ]| + |E[Gk,α ] − E[Mk,α ]| √ √ √ k ln m k ln m k ln m k ln m k δ δ δ δ + + + = O = O , m m m m m which completes the proof. 3.2 Empirical Frequencies Estimator ˆ In this section we bound the error of the empirical frequencies estimator Mk . Theorem 26 For δ > 0 and 18 ln 8m < k < m , we have 2 δ ∀δ S, √ k ln m δ ˆ |Mk − Mk | = O m 1241 3 2 + ln m δ . k D RUKH AND M ANSOUR Proof Let α = δ − ˆ ˆ 6 ln 8m . By Lemma 17, we have ∀ 2 S: Mk = Mk,α , and Mk = Mk,α . Let Vk,α = δ + k k {w ∈ Vk,α : pw < m }, and Vk,α = {w ∈ Vk,α : pw > m }. Let X− = k − pw Xw,k , m ∑ − w∈Vk,α X+ = ∑ + w∈Vk,α pw − k Xw,k , m and let X? specify either X− or X+ . By the definition, for w ∈ Vk,α we have By Lemma 18, |Vk,α | = O m k |E[X? ]| ≤ k m − pw = O . By Lemma 19, for w ∈ Vk,α we have E[Xw,k ] = O ∑ w∈Vk,α k − pw E[Xw,k ] = O m √ mα k 1 √ k m k =O 1 √ k expectation is O α k . . Therefore, α . k Both X− and X+ are linear combinations of Xw,k , where the coefficients are O √ α k m (8) √ α k m and the . Therefore, by Lemma 24, we have δ 4 ∀ S: |X? − E[X? ]| = O √ α3 k α4 √ + m m k . (9) ˆ By the definition of X− and X+ , Mk,α − Mk,α = X+ − X− . Combining Equations (8) and (9), we δ S: have ∀ ˆ ˆ |Mk − Mk | = |Mk,α − Mk,α | = |X+ − X− | ≤ |X+ − E[X+ ]| + |E[X+ ]| + |X− − E[X− ]| + |E[X− ]| √ 3 √ 3 k 4 k ln m 2 α α α δ √ + = O = O + + m k m m k since √ ab = O(a + b), and we use a = √ α3 k m and b = α . k ln m δ , k 3.3 Combined Estimator In this section we combine the Good-Turing estimator with the empirical frequencies to derive a combined estimator, which is uniformly accurate for all values of k. ˜ Definition 27 We define Mk , a combined estimator for Mk , by ˜ Mk = Gk ˆ Mk 1242 2 k ≤ m5 2 k > m5 . C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS Lemma 28 (McAllester and Schapire, 2000) Let k ∈ {0, . . . , m}. For any δ > 0, we have ln 1 m δ . k + ln m δ ∀δ S : |Gk − Mk | = O 2 ˜ ˜ The following theorem shows that Mk has an error bounded by O m− 5 , for any k. For small k, 2 2 we use Lemma 28. Theorem 25 is used for 18 ln 8m < k ≤ m 5 . Theorem 26 is used for m 5 < k < m . 2 δ ˜ The complete proof also handles k ≥ m . The theorem refers to Mk as a probability estimator, and 2 does not show that it is a probability distribution by itself. Theorem 29 Let δ > 0. For any k ∈ {0, . . . , m}, we have ˜ ˜ ∀δ S, |Mk − Mk | = O m− 5 . 2 The following theorem shows a weak lower bound for approximating Mk . 
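Before turning to that lower bound, the combined estimator of Definition 27 is simple enough to state as code. The sketch below is illustrative only: it assumes the sample is a list of tokens, uses m^{2/5} as the switching threshold exactly as in Definition 27, and the helper name is not from the paper.

```python
from collections import Counter

def combined_estimator(sample, k):
    """Sketch of the combined estimator of Definition 27: the Good-Turing
    estimate for k <= m**(2/5) and the empirical frequencies estimate for
    larger k."""
    m = len(sample)
    counts = Counter(sample)
    n = Counter(counts.values())             # n[j] = number of words appearing exactly j times
    if k <= m ** 0.4:
        return (k + 1) * n[k + 1] / (m - k)  # Good-Turing branch
    return k * n[k] / m                      # empirical frequencies branch
```

Theorem 29 states that this switch yields an error of order m^{-2/5} (up to logarithmic factors) uniformly in k; the weak lower bound announced above is Theorem 30.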
It applies to estimating Mk based on a different independent sample. This is a very “weak” notation, since Gk , as well ˆ as Mk , are based on the same sample as Mk . Theorem 30 Suppose that the vocabulary consists of k m ), where 1 k √ m. The variance of Mk is Θ k m m k words distributed uniformly (that is pw = . 4. Leave-One-Out Estimation of Log-Loss Many NLP applications use log-loss as the learning performance criteria. Since the log-loss depends on the underlying probability P, its value cannot be explicitly calculated, and must be approximated. The main result of this section, Theorem 32, is an upper bound on the leave-one-out estimation of the log-loss, assuming a general family of learning algorithms. Given a sample S = {s1 , . . . , sm }, the goal of a learning algorithm is to approximate the true probability P by some probability Q. We denote the probability assigned by the learning algorithm to a word w by qw . Definition 31 We assume that any two words with equal sample frequency are assigned equal probabilities in Q, and therefore denote qw by q(cw ). Let the log-loss of a distribution Q be L = 1 1 ∑ pw ln qw = ∑ Mk ln q(k) . w∈V k≥0 Let the leave-one-out estimation, qw , be the probability assigned to w, when one of its instances is removed. We assume that any two words with equal sample frequency are assigned equal leaveone-out probability estimation, and therefore denote qw by q (cw ). We define the leave-one-out estimation of the log-loss as averaging the loss of each sample word, when it is extracted from the sample and pretended to be the test sample: 1243 D RUKH AND M ANSOUR ∑ Lleave−one = w∈V cw knk 1 1 =∑ ln ln . m qw k>0 m q (k) 1 1 Let Lw = L(cw ) = ln q(cw ) , and Lw = L (cw ) = ln q (cw ) . Let the maximal loss be Lmax = max max L(k), L (k + 1) . k In this section we discuss a family of learning algorithms, that receive the sample as an input. Assuming an accuracy parameter δ, we require the following properties to hold: 1. Starting from a certain number of appearances, the estimation is close to the sample frequency. Specifically, for some α, β ∈ [0, 1], ∀k ≥ ln 4m , δ q(k) = k−α . m−β (10) 2. The algorithm is stable when a single word is extracted from the sample: ∀m, 1 , m 1 L (k + 1) − L(k) = O S . n1 2 ≤ k ≤ 10 ln 4m , δ L (k + 1) − L(k) = O ∀m, ∀S s.t. nS > 0, k ∈ {0, 1}, 1 (11) (12) An example of such an algorithm is the following leave-one-out algorithm (we assume that the vocabulary is large enough so that n0 + n1 > 0): qw = N−n0 −1 (n0 +n1 )(m−1) cw −1 m−1 cw ≤ 1 cw ≥ 2. Equation (10) is satisfied by α = β = 1. Equation (11) is satisfied for k ≥ 2 by L(k) − L (k + 1) = 1 = O m . Equation (12) is satisfied for k ≤ 1: ln m−1 m−2 N − n0 − 1 m − 2 N − n0 − 2 m − 1 n0 + n1 + 1 m − 2 |L (2) − L(1)| = ln n0 + n1 m − 1 |L (1) − L(0)| = ln 1 1 1 =O + , N − n0 m n1 1 1 1 =O + . =O n0 + n1 m n1 =O The following is the main theorem of this section. It bounds the deviation between the between the true loss and the leave one out estimate. This bound shows that for a general family of learning algorithms, leave-one-out technique can be effectively used to estimate the logarithmic loss, given the sample only. The estimation error bound decreases roughly in proportion to the square root of the sample size, regardless of the underlying distribution. 1244 C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS Theorem 32 For a learning algorithm satisfying Equations (10), (11), and (12), and δ > 0, we have: δ ∀ S, |L − Lleave−one | = O Lmax (ln m )4 ln δ m ln m δ δ . 
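To make the statement of Theorem 32 concrete, the two quantities it compares can be written down directly. The sketch below assumes, as in Definition 31, that the learned probabilities depend only on the sample count, so they can be passed as functions of k; the names and the exact interfaces are illustrative, not from the paper.

```python
import math
from collections import Counter

def true_logloss(p, q):
    """L = sum_w p_w ln(1/q_w), computable only when the true distribution p
    is known (e.g. in a simulation); q must be positive on the support of p."""
    return sum(pw * math.log(1.0 / q[w]) for w, pw in p.items())

def leave_one_out_logloss(sample, q_loo):
    """L_leave-one = sum_{k>0} (k * n_k / m) * ln(1/q'(k)), where q'(k) is
    the probability the algorithm assigns to a word seen k times after one
    of its occurrences is removed (Definition 31)."""
    m = len(sample)
    counts = Counter(sample)
    n = Counter(counts.values())
    return sum(k * nk / m * math.log(1.0 / q_loo(k)) for k, nk in n.items())
```

For a learning algorithm satisfying Equations (10)–(12), Theorem 32 says the two numbers agree up to an error that shrinks roughly like the inverse square root of the sample size, whatever the underlying distribution.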
The proof of Theorem 32 bounds the estimation error separately for the high-probability and low-probability words. We use Lemma 20 (McAllester and Schapire, 2000) to bound the estimation error for low-probability words. The expected estimation error for the high-probability words is bounded elementarily using the definition of the binomial distribution (Lemma 33). Finally, we use McDiarmid’s inequality (Lemma 2) to bound its deviation. The next lemma shows that the expectation of the leave-one-out method is a good approximation for the per-word expectation of the logarithmic loss. Lemma 33 Let 0 ≤ α ≤ 1, and y ≥ 1. Let Bn ∼ Bin(n, p) be a binomial random variable. Let fy (x) = ln(max(x, y)). Then, 0 ≤ E p fy (Bn − α) − 3p Bn fy (Bn − α − 1) ≤ . n n Proof For a real valued function F (here F(x) = fy (x − α)), we have: E Bn F(Bn − 1) n n = ∑ x=0 n = p∑ x=1 n x x p (1 − p)n−x F(x − 1) x n n − 1 x−1 p (1 − p)(n−1)−(x−1) F(x − 1) x−1 = pE[F(Bn−1 )] , where we used n x x n = E p fy (Bn − α) − n−1 x−1 . Since Bn ∼ Bn−1 + B1 , we have: Bn fy (Bn − α − 1) n = p(E[ fy (Bn−1 + B1 − α)] − E[ fy (Bn−1 − α)]) max(Bn−1 + B1 − α, y) max(Bn−1 − α, y) max(Bn−1 − α + B1 , y + B1 ) ≤ pE ln max(Bn−1 − α, y) B1 = pE ln(1 + ) max(Bn−1 − α, y) B1 ≤ pE . max(Bn−1 − α, y) = pE ln Since B1 and Bn−1 are independent, we get 1245 D RUKH AND M ANSOUR pE B1 1 = pE[B1 ]E max(Bn−1 − α, y) max(Bn−1 − α, y) 1 = p2 E max(Bn−1 − α, y) n−1 = p2 ∑ n−1 x 1 p (1 − p)n−1−x x max(x − α, y) = p2 ∑ x+1 1 n−1 x p (1 − p)n−1−x x + 1 max(x − α, y) x x=0 n−1 x=0 ≤ ≤ n−1 p x+1 n max px+1 (1 − p)n−(x+1) ∑ n x max(x − α, y) x=0 x + 1 3p 3p (1 − (1 − p)n ) < . (13) n n Equation (13) follows by the following observation: x + 1 ≤ 3(x − α) for x ≥ 2, and x + 1 ≤ 2y for x ≤ 1. Finally, pE ln max(Bn−1 −α+B1 ,y) ≥ 0, which implies the lower bound of the lemma. max(Bn−1 −α,y) The following lemma bounds n2 as a function of n1 . Lemma 34 Let δ > 0. We have ∀δ S: n2 = O 3 5 Theorem 32 Proof Let yw = 1 − m ln 1 + n1 ln m . δ δ δ pw m − 2. By Lemma 9, with λ = 5, we have ∀ 2 S: 3 ln 4m 3pw ln 4m δ δ pw − cw ≤ , m m m √ 5 ln 4m δ ∀w ∈ V : pw > , cw > yw + 2 ≥ (5 − 15) ln 4m > ln 4m . δ δ m ∀w ∈ V : pw > Let VH = w ∈ V : pw > 5 ln 4m δ m |L − Lleave−one | ≤ (14) (15) and VL = V \VH . We have ∑ w∈VH pw Lw − cw L m w + ∑ w∈VL pw Lw − cw L m w . (16) We start by bounding the first term of Equation (16). By Equation (15), we have ∀w ∈ VH , cw > m−β w −α yw + 2 > ln 4m . Equation (10) implies that qw = cm−β , therefore Lw = ln cm−β = ln max(cw −α,yw ) , and δ w −α m−1−β Lw = ln cm−1−β = ln max(cw −1−α,yw ) . Let w −1−α H Errw = cw m−β m−β ln − pw ln . m max(cw − 1 − α, yw ) max(cw − α, yw ) We have 1246 C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS ∑ w∈VH cw L − pw Lw m w m−1−β cw ∑ m − β w∈VH m ∑ H Errw + ln ∑ = H Errw + O w∈VH ≤ w∈VH 1 . m (17) H We bound ∑w∈VH Errw using McDiarmid’s inequality. As in Lemma 33, let fw (x) = ln(max(x, yw )). We have H E Errw = ln(m − β)E cw cw − pw + E pw fw (cw − α) − fw (cw − 1 − α) . m m The first expectation equals 0, the second can be bounded using Lemma 33: w∈VH H E Errw ≤ ∑ w∈VH E pw fw (cw − α) − ≤ ∑ ∑ cw fw (cw − 1 − α) m 3pw 1 =O . m m w∈VH (18) H In order to use McDiarmid’s inequality, we bound the change of ∑w∈VH Errw as a function of a single change in the sample. Suppose that a word u is replaced by a word v. This results in decrease H for cu , and increase for cv . 
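As a brief aside before the replacement argument below is completed, Lemma 33 (which was just used in Equation (18)) is easy to sanity-check numerically. The parameter values here are arbitrary choices for illustration only.

```python
import numpy as np
from scipy.stats import binom

# Numeric sanity check of Lemma 33 for one arbitrary parameter setting.
n, p, alpha, y = 200, 0.05, 0.5, 1.0
ks = np.arange(n + 1)
pmf = binom.pmf(ks, n, p)

def f(x):                                     # f_y(x) = ln(max(x, y))
    return np.log(np.maximum(x, y))

diff = np.sum(pmf * (p * f(ks - alpha) - (ks / n) * f(ks - alpha - 1)))
assert 0.0 <= diff <= 3 * p / n               # the two bounds of Lemma 33
```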
Recalling that yw = Ω(mpw ), the change of Erru , as well as the change H , is bounded by O ln m , as follows: of Errv m m−β The change of pu ln max(cu −α,yu ) would be 0 if cu − α ≤ yu . Otherwise, pu ln m−β m−β − pu ln max(cu − 1 − α, yu ) max(cu − α, yu ) ≤ pu [ln(cu − α) − ln(cu − 1 − α)] = pu ln 1 + 1 cu − 1 − α =O pu cu . m−β pu 1 Since cu ≥ yu = Ω(mpu ), the change is bounded by O( cu ) = O( m ). The change of cu ln max(cu −1−α,yu ) m would be O( ln m ) if cu − 1 − α ≤ yu . Otherwise, m m−β cu m−β cu − 1 ln − ln m max(cu − 2 − α, yu ) m max(cu − 1 − α, yu ) m−β 1 m−β m−β cu − 1 ln + ln − ln ≤ m max(cu − 2 − α, yu ) max(cu − 1 − α, yu ) m max(cu − 1 − α, yu ) ln m cu − 1 1 ln m ln 1 + + =O . ≤ m cu − 2 − α m m H The change of Errv is bounded in a similar way. 1247 D RUKH AND M ANSOUR δ By Equations (17) and (18), and Lemma 2, we have ∀ 16 S: ∑ w∈VH ≤ cw L − pw Lw m w ∑ w∈VH ≤ O ∑ H Errw − E ln m m H Errw ∑ + E H Errw +O w∈VH w∈VH 1 1 1 m ln + + δ m m = O 1 m (ln m)2 ln 1 δ . m (19) δ Next, we bound the second term of Equation (16). By Lemma 10, we have ∀ 4 S: ∀w ∈ V s.t. pw ≤ 3 ln 4m δ , cw ≤ 6 ln 4m . δ m (20) b Let b = 5 ln 4m . By Equations (14) and (20), for any w such that pw ≤ m , we have δ cw ≤ max pw + m 3pw ln m 4m δ √ (5 + 3 ∗ 5) ln 4m 2b δ ≤ < . m m 4m δ 6 ln , m k L Therefore ∀w ∈ VL , we have cw < 2b. Let nL = |VL ∩Sk |, GL = m−k+1 nL , and Mk = ∑w∈VL ∩Sk pw . k k k−1 We have ∑ w∈VL cw L − pw Lw m w 2b = 2b−1 knL L k L (k) − ∑ Mk L(k) ∑ m k=1 k=0 ≤ ∑ m − kk+ 1 L (k) − ∑ MkL L(k) 2b k=1 2b = k=0 2b−1 ∑ GL L (k) − ∑ MkL L(k) k−1 k=1 2b−1 = 2b−1 knL +O k=0 2b−1 k=0 k=0 1 1 − m−k+1 m bLmax m +O k=0 2b−1 ≤ k=1 2b−1 ∑ GL L (k + 1) − ∑ MkL L(k) k k=0 2b + ∑ knL L (k) k bLmax m ∑ GL |L (k + 1) − L(k)| + ∑ |GL − MkL |L(k) + O k k bLmax m . The first sum of Equation (21) is bounded using Equations (11) and (12), and Lemma 34: 1248 (21) C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS 2b−1 ∑ GL |L (k + 1) − L(k)| k k=0 2b−1 = ∑ GL |L (k + 1) − L(k)| + G0|L (1) − L(0)| + G1|L (2) − L(1)|. k (22) k=2 The first term of Equation (22) is bounded by Equation (11): 2b−1 ∑ k=2 2b−1 ∑ GL · O k GL |L (k + 1) − L(k)| ≤ k k=2 1 m =O 1 . m δ The other two terms are bounded using Lemma 34. For n1 > 0, we have ∀ 16 S, n2 = O b By Equation (12), we have G0 |L (1) − L(0)| + G1 |L (2) − L(1)| ≤ n1 1 ·O m n1 2n2 1 + ·O m−1 n1 ln 1 δ . m = O b (23) m ln 1 + n1 δ (24) For n1 = 0, Lemma 34 results in n2 = O b m ln 1 , and Equation (24) transforms into δ G1 |L (2) − L(1)| ≤ 2n2 Lmax = O bLmax m−1 Equations (22), (23), (24), and (25) sum up to 2b−1 ∑ GL |L (k + 1) − L(k)| k k=0 = O bLmax ln 1 δ . m ln 1 δ . m (25) (26) The second sum of Equation (21) is bounded using Lemma 28 separately for every k < 2b with δ L accuracy 16b . Since the proof of Lemma 28 also holds for GL and Mk (instead of Gk and Mk ), we k δ L have ∀ 8 S, for every k < 2b, |GL − Mk | = O b k ln b δ m . Therefore, together with Equations (21) and (26), we have ∑ w∈VL cw L − pw Lw m w ≤ O bLmax = O Lmax 2b−1 ln 1 δ + ∑ L(k)O b m k=0 b4 ln b δ . m 1249 ln b bLmax δ +O m m (27) . D RUKH AND M ANSOUR The proof follows by combining Equations (16), (19), and (27). 5. Log-Loss A Priori Section 4 bounds the error of the leave-one-out estimation of the log-loss. It shows that the log-loss can be effectively estimated, for a general family of learning algorithms. Another question to be considered is the log-loss distribution itself, without the empirical estimation. 
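Theorem 35 below answers a version of this question for the uniform case: with N ≪ m equiprobable words and plain maximum-likelihood estimation q_w = c_w/m, the learning error is typically Ω(N/m). A quick Monte Carlo sketch (toy sizes chosen only for illustration; unseen words are skipped in the sum, which is harmless here because every word is seen with overwhelming probability) shows the scale:

```python
import numpy as np

rng = np.random.default_rng(0)
N, m = 100, 10_000                 # toy values with N << m
p = np.full(N, 1.0 / N)            # uniform distribution over N words

kls = []
for _ in range(200):
    counts = rng.multinomial(m, p)
    q = counts / m                 # maximum-likelihood estimate
    seen = counts > 0              # with m >> N ln N, every word is seen almost surely
    kls.append(np.sum(p[seen] * np.log(p[seen] / q[seen])))

print(np.mean(kls), "compared with N/m =", N / m)   # same order of magnitude
```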
That is, how large (or low) is it expected to be, and which parameters of the distribution affect it. We denote the learning error (equivalent to the log-loss) as the KL-divergence between the true and the estimated distribution. We refer to a general family of learning algorithms, and show lower and upper bounds for the learning error. The upper bound (Theorem 39) can be divided to three parts. The first part is the missing mass. The other two build a trade-off between a threshold (lower thresholds leads to a lower bound), and the number of words with probability exceeding this threshold (fewer words lead to a lower bound). It seems that this number of words is a necessary lower bound, as we show at Theorem 35. 1 Theorem 35 Let the distribution be uniform: ∀w ∈ V : pw = N , with N m. Also, suppose that the learning algorithm just uses maximum-likelihood approximation, meaning qw = cw . Then, a typical m learning error would be Ω( N ). m The proof of Theorem 35 bases on the Pinsker inequality (Lemma 36). It first shows a lower bound for L1 norm between the true and the expected distributions, and then transforms it to the form of the learning error. Lemma 36 (Pinsker Inequality) Given any two distributions P and Q, we have 1 KL(P||Q) ≥ (L1 (P, Q))2 . 2 Theorem 35 Proof We first show that L1 (P, Q) concentrates near Ω N m . Then, we use Pinsker inequality to show lower bound5 of KL(P||Q). First we find a lower bound for E[|pw − qw |]. Since cw is a binomial random variable, σ2 [cw ] = m mpw (1 − pw ) = Ω N , and with some constant probability, |cw − mpw | > σ[cw ]. Therefore, we have E[|qw − pw |] = ≥ E 1 E[|cw − mpw |] m 1 1 σ[cw ]P(|cw − mpw | > σ[cw ]) = Ω m m ∑ |pw − qw | w∈V = Ω N√ 1 mN =Ω 1 =Ω √ mN m N N m . 5. This proof does not optimize the constants. Asymptotic analysis of logarithmic transform of binomial variables by Flajolet (1999) can be used to achieve explicit values for KL(P||Q). 1250 C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS 2 A single change in the sample changes L1 (P, Q) by at most m . Using McDiarmid inequality 1 (Lemma 2) on L1 (P, Q) as a function of sample words, we have ∀ 2 S: L1 (P, Q) ≥ E[L1 (P, Q)] − |L1 (P, Q) − E[L1 (P, Q)]| √ N m N −O =Ω . = Ω m m m Using Pinsker inequality (Lemma 36), we have 1 2 ∀ S, pw 1 ∑ pw ln qw ≥ 2 w∈V 2 ∑ |pw − qw | =Ω w∈V N , m which completes the proof. Definition 37 Let α ∈ (0, 1) and τ ≥ 1. We define an (absolute discounting) algorithm Aα,τ , which α “removes” m probability mass from words appearing at most τ times, and uniformly spreads it among the unseen words. We denote by n1...τ = ∑τ ni the number of words with count between 1 i=1 and τ. The learned probability Q is defined by : αn1...τ cw = 0 mn0 cw −α qw = 1 ≤ cw ≤ τ m cw τ < cw . m The α parameter can be set to some constant, or to make the missing mass match the Good1...τ Turing missing mass estimator, that is αnm = n1 . m Definition 38 Given a distribution P, and x ∈ [0, 1], let Fx = ∑w∈V :pw ≤x pw , and Nx = |{w ∈ V : pw > x}|. Clearly, for any distribution P, Fx is a monotone function of x, varying from 0 to 1, and Nx is a monotone function of x, varying from N to 0. Note that Nx is bounded by 1 . x The next theorem shows an upper bound for the learning error. √ Theorem 39 For any δ > 0 and λ > 3, such that τ < (λ − 3λ) ln 8m , the learning error of Aα,τ is δ bounded ∀δ S by pw 0 ≤ ∑ pw ln qw w∈V ≤ M0 ln + n0 ln 4m δ αn1...τ α F 8m + 1 − α λ lnm δ 1251 λ ln 8m δ + 1−α 3 ln 8 δ + M0 m 3 ln 8 3λ ln 8m δ δ + √ N λ ln 8m . 
√ δ m 2( λ − 3)2 m m D RUKH AND M ANSOUR The proof of Theorem 39 bases directly on Lemmas 40, 41, and 43. We can rewrite this bound roughly as Nλ λ ˜ ≤ O M0 + √ + m m m pw qw ∑ pw ln w∈V . This bound implies the characteristics of the distribution influencing the log-loss. It shows that a “good” distribution can involve many low-probability words, given that the missing mass is low. However, the learning error would increase if the dictionary included many mid-range3 probability words. For example, if a typical word’s probability were m− 4 , the bound would become 1 ˜ O M0 + m− 4 . Lemma 40 For any δ > 0, the learning error for non-appearing words can be bounded with high probability by ∑ ∀δ S, pw ln w∈S / pw qw ≤ M0 ln n0 ln m δ αn1...τ . Proof By Lemma 13, we have ∀δ S, the real probability of any non-appearing word does not exceed ln m δ m . Therefore, ∑ pw ln w∈S / pw qw m n0 α n1...τ ∑ pw ln pw ∑ pw ln = ln m m n0 δ m α n1...τ w∈S / ≤ w∈S / = M0 ln n0 ln m δ αn1...τ , which completes the proof. Lemma 41 Let δ > 0, λ > 0. Let VL = w ∈ V : pw ≤ λ ln 2m δ m for VL can be bounded with high probability by ∀δ S, ∑ pw ln w∈VL pw qw Proof We use ln(1 + x) ≤ x. ∑ λ ln 2m δ ≤ 1−α pw ln w∈VL For any appearing word w, qw ≥ 1−α m . pw qw ≤ ∑ w∈VL Therefore, 1252 , and VL = VL ∩ S. The learning error 3 ln 2 α δ + M0 + F 2m . m 1 − α λ lnm δ pw pw − qw . qw C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS pw − qw qw pw w∈VL ≤ m ∑ pw (pw − qw ) 1 − α w∈V = ∑ m 1−α L ∑ w∈VL ≤ m 1−α ≤ m λ ln 1−α m ≤ λ ln 1−α pw pw − ∑ pw pw − w∈VL 2m δ 2m δ ∑ w∈VL L cw m + pw − cw m α m ∑ pw m 1 − α w∈V L pw − w∈VL ∑ cw cw + ∑ pw − qw m m w∈V cw m + α ∑ pw 1 − α w∈V L + α F 2m . 1 − α λ lnm δ (28) We apply Lemma 12 on vL , the union of words in VL . Let pvL = ∑w∈VL pw and cvL = ∑w∈VL cw . We have ∀δ S: ∑ w∈VL pw − cw m w∈VL = pw − cw cw − ∑ pw − m m w∈V \S cw m ∑ L ≤ ∑ w∈VL pw − ≤ pvL − cvL + M0 m ≤ + ∑ pw w∈VL \S 3 ln 2 δ + M0 . m (29) The proof follows combining Equations (28) and (29). 2 x Lemma 42 Let 0 < ∆ < 1. For any x ∈ [−∆, ∆], we have ln(1 + x) ≥ x − 2(1−∆)2 . √ Lemma 43 Let δ > 0, λ > 3, such that τ < (λ − 3λ) ln 4m . Let the high-probability words set be δ VH = w ∈ V : pw > probability by λ ln 4m δ m ∀δ S, , and VH = VH ∩ S. The learning error for VH can be bounded with high ∑ w∈VH pw ln pw qw ≤ 3 ln 4 3λ ln 4m δ δ + √ N λ ln 4m . √ δ m 2( λ − 3)2 m m 1253 D RUKH AND M ANSOUR Proof ∑ w∈VH pw pw ln qw ∑ pw ln ∑ = pw = pw ln + cw m w∈VH mpw cw w∈VH ∑ pw ln w∈VH ∑ + cw m qw cw . cw − α pw ln w∈VH ,cw ≤τ (30) δ Using Lemma 9 with λ, we have ∀ 2 S: 3pw ln 4m cw δ ≤ , (31) m m √ 4m ∀w ∈ VH , cw ≥ (λ − 3λ) ln . δ √ This means that for a reasonable choice of τ (meaning τ < (λ − 3λ) ln 4m ), the second term of δ Equation (30) is 0, and VH = VH . Also, pw − ∀w ∈ VH , cw m 3pw ln 4m δ ≤ m − pw 1 ≤ pw pw Therefore, we can use Lemma 42 with ∆ = ∑ w∈VH pw ln mpw cw = − ≤ − = ∑ m 3 ln 2m δ = 2m m λ ln δ 3 λ: pw ln 1 + cw m w∈VH ∑ w∈VH ∑ w∈VH 3 . λ cw m pw − pw pw − pw − pw cw + pw − m cw m 1 2 1− 3 λ λ 2 √ √ λ− 3 2 2 ∑ w∈VH − pw pw cw m − pw pw 2 2 . (32) We apply Lemma 12 on the vH , the union of all words in VH . Let pvH = ∑w∈VH pw and cvH = ∑w∈VH cw . The bound on the first term of Equation (32) is: δ ∀ 2 S, ∑ w∈VH pw − cw m = pvH − cvH ≤ m 3 ln 4 δ . m Assuming that Equation (31) holds, the second term of Equation (32) can also be bounded: 1254 (33) C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS ∑ cw m w∈VH − pw pw 2 ≤ ∑ w∈VH 3 ln 4m 1 3pw ln 4m δ δ = N λ ln 4m . 
δ pw m m m (34) The proof follows by combining Equations (30), (32), (33) and (34). Acknowledgments This work was supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778, by a grant from the Israel Science Foundation and an IBM faculty award. This publication only reflects the authors’ views. We are grateful to David McAllester for his important contributions in the early stages of this research. Appendix A. Technical Proofs A.1 Concentration Inequalities Lemma 6 Proof We use Stirling approximation Γ(x + 1) = Tx = exp P(X = k) = ≤ = = = 1 1 +O 2 12x x √ 2πx x x e Tx , where . n k p (1 − p)n−k k Γ(n + 1) µ µ n − µ n−µ Γ(µ + 1)Γ(n − µ + 1) n n √ n µµ (n − µ)n−µ Tn 2πn n √ 2πµ 2π(n − µ) µµ (n − µ)n−µ nµ nn−µ Tµ Tn−µ √ 1 2πµ n Tn n − µ Tµ Tn−µ Tn 1 . 2πµ(1 − p) Tµ Tn−µ Clearly, for integral values of µ, the equality is achieved at k = µ. Lemma 8 Proof Let m = ∑w∈V cw . Using Lemma 7 for m with b = c = E[m ] = m, the prob1 ability P(m = m) achieves its minimum when ∀w ∈ V, pw = N . Under this assumption, we have 1 m ∼ Bin(mN, N ). Using Lemma 6, we have 1255 D RUKH AND M ANSOUR P m =m = 1 1 1 2πmN N 1 − N TmN 1 ≥ √ . Tm TmN−m 3 m Therefore, for any distribution {pw : w ∈ V }, we have 1 P(m = m) ≥ √ . 3 m Obviously, E[F ] = ∑w E[ fw (cw )] = E[F]. Also, the distribution of {cw } given that m = m is identical to the distribution of {cw }, therefore the distribution of F given that m = m is identical to the distribution of F. We have P(|F − E[F ]| > ε) = ∑ P(m i = i)P(|F − E[F ]| > ε|m = i) ≥ P(m = m)P(|F − E[F ]| > ε|m = m) = P(m = m)P(|F − E[F]| > ε) 1 √ P(|F − E[F]| > ε), ≥ 3 m which completes the proof. Lemma 44 For any δ > 0, and a word w ∈ V , such that pw ≥ P pw − cw > m 3 ln 2 δ m 3pw ln 2 δ ≤ δ. m Proof The proof follows by applying Lemma 3, substituting ε = mpw we have ε ≤ mpw : cw P pw − ≥ m , we have 3mpw ln 2 . Note that for 3 ln 2 ≤ δ δ 3pw ln 2 δ = P (|mpw − cw | ≥ ε) m ≤ 2 exp − ε2 2E[cw ] + ε ≤ 2 exp − 3mpw ln 2 δ 3mpw which completes the proof. 1256 = δ, C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS 3 ln 2m Lemma 9 Proof There are at most m words with probability pw ≥ m δ . The first claim follows δ using Lemma 44 together with union bound over all these words (with accuracy m for each word). Using the first claim, we derive the second. We show a lower bound for 3pw ln 2m δ > pw − pw m cw ≥ pw − m 3 = λ 1− 3 λ cw m, using ln 2m δ m 1 < λ pw : pw . The final inequality follows from simple algebra. Lemma 10 Proof Let b = 3 ln( m ). Note that δ ∈ [0, 1] and m > 1 yield b > 2. First, suppose that δ b there are up to m words with pw ≤ m . For each such word, we apply Lemma 3 on cw , with ε = b. We have: P cw > 6 ln m b2 ≤ P (cw > mpw + ε) ≤ exp − δ 2mpw + b ≤ δ . m Since we assume that there are up to m such words, the total mistake probability is δ. Now we assume the general case, that is, without any assumption on the number of words. Our goal is to reduce the problem to the former conditions, that is, to create a set of size m of words with b probability smaller than m . We first create m empty sets v1 , . . . , vm . Let the probability of each set vi , pvi , be the sum of the probabilities of all the words it includes. Let the actual count of vi , cvi , be the sum of the sample counts of all the words w it includes. We divide all the words w between these sets in a bin-packing-approximation manner. We sort the words w in decreasing probability order. 
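The grouping step just described, together with the loop described next, is a simple greedy packing; a sketch (with illustrative names, not code from the paper) is:

```python
def greedy_pack(probs, m):
    """Sketch of the bin-packing-style grouping in the proof of Lemma 10:
    words are taken in decreasing order of probability and each is added to
    the currently lightest of m sets.  (A heap would make this O(n log m);
    the plain scan below is just for illustration.)"""
    groups = [[] for _ in range(m)]
    mass = [0.0] * m
    for w in sorted(probs, key=probs.get, reverse=True):
        i = min(range(m), key=mass.__getitem__)   # index of the lightest set so far
        groups[i].append(w)
        mass[i] += probs[w]
    return groups, mass
```

With the words assigned this way, the argument below verifies that no set ends up with probability mass above b/m.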
Then, we do the following loop: insert the next word w to the set vi with the currently smallest pvi . b We claim that pvi ≤ m for each vi at the end of the loop. If this inequality does not hold, then b some word w made this “overflow” first. Obviously, pw must be smaller than 2m , otherwise it would b be one of the first 2m < m words ordered, and would enter an empty set. If pw < 2m and it made b b an “overflow”, then the probability of each set at the moment w was entered must exceed 2m , since w must have entered the lightest set available. This means that the total probability of all words b entered by that moment was greater than m 2m > 1. Applying the case of m words to the sets v1 , . . . , vm , we have ∀δ S: for every vi , cvi ≤ 2b. Also, if the count of each set vi does not exceed 2b, so does the count of each word w ∈ vi . That is, P ∃w : pw ≤ b b , cw > 2b ≤ P ∃vi : pvi ≤ , cvi > 2b ≤ δ, m m which completes the proof. 1257 D RUKH AND M ANSOUR δ Lemma 11 Proof By Lemma 9 with some λ > 3 (which will be set later), we have ∀ 2 S: 3 ln 4m δ , m λ ln 4m δ ∀w : pw > , m ∀w : pw ≥ By Equation (35), for any word w such that √ λ + 3λ ln 4m . By Lemma 10, we have δ |mpw − cw | ≤ cw > 3 ln 4m δ m 3mpw ln 3 λ 1− ≤ pw ≤ 4m , δ mpw . λ ln 4m δ m (35) (36) , we have cw ≤ mpw + 3mpw ln 4m ≤ δ 4m . δ √ It means that for any w : mpw ≤ λ ln 4m , we have cw ≤ λ + 3λ ln 4m . This means that for δ δ √ 4m 4m any w such that cw > λ + 3λ ln δ , we have mpw > λ ln δ . By Equation (36), this means δ ∀ 2 S, ∀w s.t. pw ≤ mpw ≤ 1− 3 ln 4m δ m , cw ≤ 6 ln 1 √ 3 cw , and by Equation (35): λ |mpw − cw | ≤ 4m 3mpw ln ≤ δ 3cw ln 4m δ 1− 3 λ = √ 3cw λ ln 4m √ √δ . λ− 3 Substituting λ = 12 results in ∀δ S : ∀w s.t. cw > 18 ln 4m , δ |mpw − cw | ≤ 6cw ln 4m , δ which completes the proof. Lemma 12 Proof If pw ≥ 3 ln 2 δ m ∀δ S, , we can apply Lemma 44. We have cw − pw ≤ m 3pw ln 2 δ ≤ m 3 ln 2 δ . m Otherwise, we can apply Lemma 10. We have: ∀δ S, cw cw − pw ≤ max pw , m m which completes the proof. 1258 ≤ 6 ln m δ ≤ m 3 ln 2 δ , m C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS Lemma 13 Proof Let b = ln m . We note that there are at most δ P ∃w : cw = 0, pw ≥ b m ≤ ∑ m b b words with probability pw ≥ m . P(cw = 0) b w:pw ≥ m ∑ = m b (1 − pw ) ≤ 1− b m b m m < me−b = δ, w:pw ≥ m which completes the proof. A.2 K-Hitting Mass Estimation Lemma 21 Proof We have ∑w∈Vk,α pw ≤ 1. Using Lemma 19, we bound P(cw = k) and P(cw = k + 1): E[Mk,α ] = 1 pw P(cw = k) = O √ k w∈Vk,α ∑ k+1 P(cw = k + 1) − pw P(cw = k) w∈Vk,α m − k √ k+1 k = ∑ pw . P(cw = k + 1) = O m−k m w∈Vk,α |E[Gk,α ] − E[Mk,α ]| = ∑ Equation (37) follows by Lemma 20. By Lemma 18, we have |Vk,α | = O E[Gk,α ] = km 1 k+1 ∑ P(cw = k + 1) = O m k √k m − k w∈Vk,α m k (37) : 1 =O √ , k which completes the proof. Theorem 29 Proof The proof is done by examining four cases of k. For k ≤ 18 ln 8m , we can use δ Lemma 28. We have ˜ ∀δ S, |Mk − Mk | = |Gk − Mk | = O ln 1 δ m k + ln m δ ˜ =O 1 √ m . 2 For 18 ln 8m < k ≤ m 5 , we can use Theorem 25. We have δ ˜ ∀δ S, |Mk − Mk | = |Gk − Mk | = O 1259 √ k ln m δ m + k ln m δ m 2 ˜ = O m− 5 . D RUKH AND M ANSOUR 2 For m 5 < k < m , we can use Theorem 26. We have 2 δ ˜ ˆ ∀ S, |Mk − Mk | = |Mk − Mk | = O For k ≥ m , let α = 2 √ 3 k(ln m ) 2 δ m + √ ln m δ k 2 ˜ = O m− 5 . δ ˆ ˆ 6 ln 8m . By Lemma 17, we have ∀ 2 S, Mk = Mk,α ∧ Mk = Mk,α . By Lemma δ 18, |Vk,α | = O m = O(1). Let c be the bound on |Vk,α |. 
Using Lemma 12 for each w ∈ Vk,α with k δ accuracy 2c , we have cw − pw = O m δ ∀ 2 S, ∀w ∈ Vk,α , Therefore, we have ∀δ S: ˜ ˆ |Mk − Mk | = |Mk,α − Mk,α | ≤ ∑ w∈Vk,α which completes the proof. 1 δ ln . m ln 1 1 δ ˜ =O √ , m m k − pw Xw,k = O m Theorem 30 Proof First, we show that for any two words u and v, Cov(Xu,k , Xv,k ) = Θ k that {cv |cu = k} ∼ Bin m − k, m−k . By Lemma 6, we have: P(cu = k) = P(cv = k) = 1 2πk 1 − P(cv = k|cu = k) = k m Tm , Tk Tm−k 1 k 2πk 1 − m−k Tm−k . Tk Tm−2k Cov(Xu,k , Xv,k ) = E[Xu,k Xv,k ] − E[Xu,k ]E[Xv,k ] m−k m 1260 1 k 1− m . Note (38) Using Tx = Θ(1) for x ≥ k, we have = P(cu = k)[P(cv = k|cu = k) − P(cv = k)] 1 Tm Tm−k 1 − = Tk Tm−k Tk Tm−2k k k 1 − m−k 2πk 1 − m 1 Tm−k Tm 1 1 = Θ − k 1 − k Tm−2k 1 − k Tm−k k m2 Tm Tk Tm−k C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS 1 1 Tm−k k Tm−2k = Θ 1− 1 + k m−k 1 − 1− k m Tm Tm−k . − Tm−2k Tm−k k 1− m (39) We can bound the first term of Equation (39): 1 1− k m−k 1 − 1− = k m k 1 − m−k = Θ 1− Since Tx = exp 1 12x +O Tm Tm−k − Tm−2k Tm−k 1 x2 k 1 − m−k k 1− m k 1− m − k k −1+ m m−k 1 = 1 + 12x + O 1 x2 =Θ k 1− m + k 1− m + k2 m2 . k 1 − m−k k 1 − m−k (40) for x ≥ m − 2k (note that k m), we have 2 Tm−k − Tm Tm−2k Tm−2k Tm−k 1 1 1 1 1 − − +O = Tm−2k Tm−k 6(m − k) 12m 12(m − 2k) m2 k2 1 = −Θ +O . 3 m m2 = Combining Equations (39), (40), and (41), we have Cov(Xu,k , Xv,k ) = Θ 1 k Θ Now we show that σ2 [Xw,k ] = Θ 1 √ k k2 m2 k2 m3 −Θ +O 1 m2 =Θ k m2 . By Equation (38), we have 1 σ2 [Xw,k ] = P(cw = k)(1 − P(cw = k)) = Θ √ k 1 1−Θ √ k 1 =Θ √ . k Now we find a bound for σ2 [Mk ]. σ2 [Mk ] = σ2 = ∑ w = m k = Θ . ∑ pw Xw,k w 2 2 pw σ [Xw,k ] + ∑ pu pvCov(Xu,k , Xv,k ) u=v k 2 1 Θ √ m k √ k , m + 1261 m m −1 k k k m 2 Θ k m2 (41) D RUKH AND M ANSOUR which completes the proof. A.3 Leave-One-Out Estimation of Log-Loss δ Lemma 34 Proof Using Lemma 9, we have ∀ 2 : n2 = |U ∩ S2 | and n1 = |U ∩ S1 |, where U = {w ∈ V : mpw ≤ c ln m }, for some c > 0. Let n2 = |U ∩ S2 | and n1 = |U ∩ S1 |. Let b = ln m . δ δ First, we show that E[n2 ] = O(bE[n1 ]). E[n2 ] = m 2 p (1 − pw )m−2 2 w ∑ w∈U = m − 1 pw 2 1 − pw ∑ mpw (1 − pw )m−1 w∈U = ∑ mpw (1 − pw )m−1 O(b) = O(bE[n1 ]). w∈U Next, we bound the deviation of n1 and n2 . A single change in the sample changes n1 , as well as n2 , by at most 1. Therefore, using Lemma 2 for n1 and n2 , we have n1 ≥ E[n1 ] − O δ ∀4 S : m ln 1 δ , n2 ≤ E[n2 ] + O δ ∀4 S : m ln 1 δ . Therefore, n2 ≤ E[n2 ] + O m ln 1 δ = O bE[n1 ] + m ln 1 δ = O b n1 + m ln 1 δ , which completes the proof. A.4 Log-Loss A Priori Theorem 39 Proof The KL-divergence is of course non-negative. By Lemma 40, we have δ ∀ 4 S, ∑ pw ln w∈S / pw qw δ By Lemma 41 with λ, we have ∀ 4 S: 1262 ≤ M0 ln n0 ln 4m δ αn1...τ . (42) C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS ∑ pw ln λ ln 8m w∈S:pw ≤ m δ pw qw λ ln 1−α 8m δ ≤ 3 ln α + M0 + F 8m . m 1 − α λ lnm δ 8 δ (43) δ By Lemma 43 with λ, we have ∀ 2 S: ∑ pw ln λ ln 8m w∈S:pw > m δ pw qw ≤ 3 ln 8 3λ ln 8m δ δ N λ ln 8m . + √ √ δ m 2( λ − 3)2 m m (44) The proof follows by combining Equations (42), (43), and (44). Lemma 42 Proof Let f (x) = x2 2(1−∆)2 − x + ln(1 + x). Then, f (x) = f (x) = x 1 , −1+ (1 − ∆)2 1+x 1 1 − (1 + x)2 2 (1 − ∆) . Clearly, f (0) = f (0) = 0. Also, f (x) ≥ 0 for any x ∈ [−∆, ∆]. Therefore, f (x) is non-negative in the range above, and the lemma follows. References D. Angluin and L. G. Valiant. Fast probabilistic algorithms for Hamiltonian circuits and matchings. Journal of Computer and System Sciences, 18:155–193, 1979. S. F. Chen. 
Building Probabilistic Models for Natural Language. PhD thesis, Harvard University, 1996.
S. F. Chen and J. Goodman. An empirical study of smoothing techniques for language modeling. Technical Report TR-10-98, Harvard University, 1998.
K. W. Church and W. A. Gale. A comparison of the enhanced Good-Turing and deleted estimation methods for estimating probabilities of English bigrams. Computer Speech and Language, 5:19–54, 1991.
J. R. Curran and M. Osborne. A very very large corpus doesn't always yield reliable estimates. In Proceedings of the Sixth Conference on Natural Language Learning, pages 126–131, 2002.
D. P. Dubhashi and D. Ranjan. Balls and bins: A study in negative dependence. Random Structures and Algorithms, 13(2):99–124, 1998.
P. Flajolet. Singularity analysis and asymptotics of Bernoulli sums. Theoretical Computer Science, 215:371–381, 1999.
I. J. Good. The population frequencies of species and the estimation of population parameters. Biometrika, 40(16):237–264, 1953.
I. J. Good. Turing's anticipation of empirical Bayes in connection with the cryptanalysis of the naval Enigma. Journal of Statistical Computation and Simulation, 66(2):101–112, 2000.
W. Hoeffding. On the distribution of the number of successes in independent trials. Annals of Mathematical Statistics, 27:713–721, 1956.
W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:13–30, 1963.
S. B. Holden. PAC-like upper bounds for the sample complexity of leave-one-out cross-validation. In Proceedings of the Ninth Annual ACM Workshop on Computational Learning Theory, pages 41–50, 1996.
S. M. Katz. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech and Signal Processing, 35(3):400–401, 1987.
M. Kearns and D. Ron. Algorithmic stability and sanity-check bounds for leave-one-out cross-validation. Neural Computation, 11(6):1427–1453, 1999.
S. Kutin. Algorithmic Stability and Ensemble-Based Learning. PhD thesis, University of Chicago, 2002.
D. McAllester and L. Ortiz. Concentration inequalities for the missing mass and for histogram rule error. Journal of Machine Learning Research, Special Issue on Learning Theory, 4(Oct):895–911, 2003.
D. McAllester and R. E. Schapire. On the convergence rate of Good-Turing estimators. In Proceedings of the Thirteenth Annual Conference on Computational Learning Theory, pages 1–6, 2000.
D. McAllester and R. E. Schapire. Learning theory and language modeling. In Seventeenth International Joint Conference on Artificial Intelligence, 2001.
C. McDiarmid. On the method of bounded differences. In Surveys in Combinatorics, pages 148–188. London Math. Soc. Lecture Notes 141, Cambridge University Press, 1989.
A. Orlitsky, N. P. Santhanam, and J. Zhang. Always Good Turing: Asymptotically optimal probability estimation. Science, 302(Oct):427–431, 2003.
3 0.7773847 25 jmlr-2005-Denoising Source Separation
Author: Jaakko Särelä, Harri Valpola
Abstract: A new algorithmic framework called denoising source separation (DSS) is introduced. The main benefit of this framework is that it allows for the easy development of new source separation algorithms which can be optimised for specific problems. In this framework, source separation algorithms are constructed around denoising procedures. The resulting algorithms can range from almost blind to highly specialised source separation algorithms. Both simple linear and more complex nonlinear or adaptive denoising schemes are considered. Some existing independent component analysis algorithms are reinterpreted within the DSS framework and new, robust blind source separation algorithms are suggested. The framework is derived as a one-unit equivalent to an EM algorithm for source separation. However, in the DSS framework it is easy to utilise various kinds of denoising procedures which need not be based on generative models. In the experimental section, various DSS schemes are applied extensively to artificial data, to real magnetoencephalograms and to simulated CDMA mobile network signals. Finally, various extensions to the proposed DSS algorithms are considered. These include nonlinear observation mappings, hierarchical models and over-complete, nonorthogonal feature spaces. With these extensions, DSS appears to have relevance to many existing models of neural information processing. Keywords: blind source separation, BSS, prior information, denoising, denoising source separation, DSS, independent component analysis, ICA, magnetoencephalograms, MEG, CDMA
4 0.34477422 65 jmlr-2005-Separating a Real-Life Nonlinear Image Mixture
Author: Luís B. Almeida
Abstract: When acquiring an image of a paper document, the image printed on the back page sometimes shows through. The mixture of the front- and back-page images thus obtained is markedly nonlinear, and thus constitutes a good real-life test case for nonlinear blind source separation. This paper addresses a difficult version of this problem, corresponding to the use of “onion skin” paper, which results in a relatively strong nonlinearity of the mixture, which becomes close to singular in the lighter regions of the images. The separation is achieved through the MISEP technique, which is an extension of the well known INFOMAX method. The separation results are assessed with objective quality measures. They show an improvement over the results obtained with linear separation, but have room for further improvement. Keywords: ICA, blind source separation, nonlinear mixtures, nonlinear separation, image mixture, image separation
5 0.31340593 17 jmlr-2005-Change Point Problems in Linear Dynamical Systems
Author: Onno Zoeter, Tom Heskes
Abstract: We study the problem of learning two regimes (we have a normal and a prefault regime in mind) based on a training set of non-Markovian observation sequences. Key to the model is that we assume that once the system switches from the normal to the prefault regime it cannot restore and will eventually result in a fault. We refer to the particular setting as semi-supervised since we assume the only information given to the learner is whether a particular sequence ended with a stop (implying that the sequence was generated by the normal regime) or with a fault (implying that there was a switch from the normal to the fault regime). In the latter case the particular time point at which a switch occurred is not known. The underlying model used is a switching linear dynamical system (SLDS). The constraints in the regime transition probabilities result in an exact inference procedure that scales quadratically with the length of a sequence. Maximum a posteriori (MAP) parameter estimates can be found using an expectation maximization (EM) algorithm with this inference algorithm in the E-step. For long sequences this will not be practically feasible and an approximate inference and an approximate EM procedure are called for. We describe a flexible class of approximations corresponding to different choices of clusters in a Kikuchi free energy with weak consistency constraints. Keywords: change point problems, switching linear dynamical systems, strong junction trees, approximate inference, expectation propagation, Kikuchi free energies
6 0.30356511 63 jmlr-2005-Quasi-Geodesic Neural Learning Algorithms Over the Orthogonal Group: A Tutorial
7 0.3030816 20 jmlr-2005-Clustering with Bregman Divergences
8 0.29715124 41 jmlr-2005-Kernel Methods for Measuring Independence
9 0.296345 30 jmlr-2005-Estimating Functions for Blind Separation When Sources Have Variance Dependencies
10 0.29163963 71 jmlr-2005-Variational Message Passing
11 0.29004657 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)
12 0.28759295 64 jmlr-2005-Semigroup Kernels on Measures
13 0.28400174 36 jmlr-2005-Gaussian Processes for Ordinal Regression
14 0.2802369 39 jmlr-2005-Information Bottleneck for Gaussian Variables
15 0.27960286 31 jmlr-2005-Estimation of Non-Normalized Statistical Models by Score Matching
16 0.27741417 48 jmlr-2005-Learning the Kernel Function via Regularization
17 0.27723387 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning
18 0.27703047 3 jmlr-2005-A Classification Framework for Anomaly Detection
19 0.27595463 26 jmlr-2005-Diffusion Kernels on Statistical Manifolds
20 0.27544194 44 jmlr-2005-Learning Module Networks