jmlr jmlr2005 jmlr2005-25 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jaakko Särelä, Harri Valpola
Abstract: A new algorithmic framework called denoising source separation (DSS) is introduced. The main benefit of this framework is that it allows for the easy development of new source separation algorithms which can be optimised for specific problems. In this framework, source separation algorithms are constructed around denoising procedures. The resulting algorithms can range from almost blind to highly specialised source separation algorithms. Both simple linear and more complex nonlinear or adaptive denoising schemes are considered. Some existing independent component analysis algorithms are reinterpreted within the DSS framework and new, robust blind source separation algorithms are suggested. The framework is derived as a one-unit equivalent to an EM algorithm for source separation. However, in the DSS framework it is easy to utilise various kinds of denoising procedures which need not be based on generative models. In the experimental section, various DSS schemes are applied extensively to artificial data, to real magnetoencephalograms and to simulated CDMA mobile network signals. Finally, various extensions to the proposed DSS algorithms are considered. These include nonlinear observation mappings, hierarchical models and over-complete, nonorthogonal feature spaces. With these extensions, DSS appears to have relevance to many existing models of neural information processing. Keywords: blind source separation, BSS, prior information, denoising, denoising source separation, DSS, independent component analysis, ICA, magnetoencephalograms, MEG, CDMA
Reference: text
sentIndex sentText sentNum sentScore
1 Box 9203 FI-02015 HUT, Espoo, Finland Editor: Michael Jordan Abstract A new algorithmic framework called denoising source separation (DSS) is introduced. [sent-9, score-0.785]
2 In this framework, source separation algorithms are constructed around denoising procedures. [sent-11, score-0.785]
3 Both simple linear and more complex nonlinear or adaptive denoising schemes are considered. [sent-13, score-0.527]
4 However, in the DSS framework it is easy to utilise various kinds of denoising procedures which need not be based on generative models. [sent-16, score-0.494]
5 Keywords: blind source separation, BSS, prior information, denoising, denoising source separation, DSS, independent component analysis, ICA, magnetoencephalograms, MEG, CDMA 1. [sent-21, score-0.931]
6 The analysis suggested that the nonlinearity used in FastICA can be interpreted as denoising and taking Bayesian noise filtering as the nonlinearity resulted in fast Bayesian ICA. [sent-49, score-0.554]
7 In this paper, we generalise the denoising interpretation of Valpola and Pajunen (2000) and introduce a source separation framework called denoising source separation (DSS). [sent-54, score-1.57]
8 We show that it is actually possible to construct the source separation algorithms around the denoising methods themselves. [sent-55, score-0.785]
9 Fast and accurate denoising will result in a fast and accurate separation algorithm. [sent-56, score-0.606]
10 In some cases a denoising scheme has been used to post-process the results after separation (see Vigneron et al. [sent-58, score-0.606]
11 , 2003), but in the DSS framework this denoising can be used for the source separation itself. [sent-59, score-0.785]
12 We interpret the nonlinearity as denoising and call this one-unit algorithm DSS. [sent-64, score-0.51]
13 We then introduce some practical denoising functions in Sec. [sent-76, score-0.494]
14 These denoising functions are extensively applied to artificial mixtures (Sec. [sent-78, score-0.514]
15 The E- and M-steps have natural interpretations as denoising of the sources and re-estimation of the mixing vector, respectively, and the derived algorithm provides the starting point for the DSS framework. [sent-142, score-0.673]
16 The second step (9) corresponds to the expectation of s over q(S) and can be seen as denoising based on the model of the sources. [sent-205, score-0.494]
17 (9) can be interpreted as denoising and that have the form (8)–(11) DSS algorithms. [sent-215, score-0.494]
18 Hence, the denoising step (9) ν ss ss ν becomes −1 = sD , (12) s+ = f(s) = s I + σ2 Σ−1 ν ss which corresponds to linear denoising. [sent-227, score-0.791]
19 The denoising step in the DSS algorithm s+ = f(s) is thus equivalent to multiplying the current source estimate s with a constant matrix D. [sent-228, score-0.701]
20 By writing ΛD = ΛD ΛD = Λ∗ Λ∗T and adding VT V = I in the middle, we may split the denoising matrix into two parts: D = D∗ D∗T , 238 D ENOISING S OURCE S EPARATION where D∗ = VΛ∗ VT . [sent-236, score-0.494]
21 The analysis extends to nonlinear denoising under the assumptions that the mixing model holds and there is an infinite amount of data. [sent-271, score-0.578]
22 The above analysis for linear denoising functions makes no assumptions about the data-generating process. [sent-295, score-0.494]
23 As such it does not extend to nonlinear denoising functions because there can be more or less fixed points than the dimensionality of the data, and the fixed points w∗ are not, in general, i orthogonal. [sent-296, score-0.527]
24 T ci (s) (22) The idea of the DSS framework is that the user can tailor the denoising function to the task at hand. [sent-310, score-0.51]
25 The denoising can but need not be based on the E-step (2) derived from a generative model. [sent-311, score-0.494]
26 It is possible to view the nonlinear denoising as linear denoising which is constantly adapted to the source estimate. [sent-314, score-1.2]
27 Then it is enough for the denoising function to make the eigenvalues corresponding to this subspace large compared to the rest but the eigenvalues do not need to differ within the subspace. [sent-321, score-0.66]
28 In practice the separation quality is therefore much better if the local eigenvalues differ significantly around the fixed points and this is often easiest to achieve with nonlinear denoising which utilises a lot of prior information. [sent-325, score-0.711]
29 Note that in this deflation scheme, it is possible to use different kinds of denoising procedures when the sources differ in characteristics. [sent-336, score-0.622]
30 It should be noted, however, that such symmetric orthogonalisation cannot separate sources with linear denoising where the eigenvalues of the sources are globally constant. [sent-338, score-0.837]
31 242 (23) D ENOISING S OURCE S EPARATION If the denoising function is multiplied by a scalar, the convergence of the algorithm does not change in any way because the scaling will be overruled by the normalisation step (11). [sent-347, score-0.529]
32 For linear denoising J(s) = D and hence β does not depend on s. [sent-370, score-0.494]
33 Since the denoising operation presumably preserves some of the signal and noise, it is reasonable to assume that all eigenvalues are originally positive. [sent-379, score-0.649]
34 When the denoising is performed for the source estimates f(s) = sD, the equivalent objective function is ˆ g(s) = sDsT = s fT (s) . [sent-407, score-0.673]
35 This means that functionally equivalent DSS algorithms can be implemented with slightly different denoising functions f(s) and while they would converge exactly to the same results, the approximation (30) might yield completely different values. [sent-412, score-0.494]
36 Obviously, the approxiˆ mation is good if the algorithm turns out to use a denoising similar to f(s). [sent-419, score-0.494]
37 (30) is not possible if different denoising functions are used for different sources because the approximation does not provide a universal scaling. [sent-436, score-0.622]
38 It obeys fully the signal model implicitly defined by the corresponding denoising function f. [sent-457, score-0.577]
39 By using the spectral shift it is therefore 246 D ENOISING S OURCE S EPARATION possible to extract both super- and sub-Gaussian distributions with a denoising scheme which is designed for one type of distribution only. [sent-485, score-0.6]
40 Consider for instance f(s) = tanh s which can be used as a denoising function for sub-Gaussian signals while, as will be further discussed in Sec. [sent-486, score-0.635]
41 3, s − tanh s = −(tanh s − s) is a suitable denoising for super-Gaussian signals. [sent-489, score-0.522]
42 In DSS the type of the overfitted results depends on the denoising criterion. [sent-501, score-0.494]
43 The idea is that the algorithms differ in the denoising function f(s) while the other parts of the algorithm remain mostly the same. [sent-519, score-0.494]
44 Denoising is useful as such and therefore there is a wide literature of sophisticated denoising methods to choose from (see Anderson and Moore, 1979). [sent-520, score-0.494]
45 In the DSS framework, the available denoising methods can be directly applied to source separation, producing better results than purely blind techniques. [sent-524, score-0.735]
46 There are also very general noise reduction techniques such as wavelet denoising (Donoho et al. [sent-525, score-0.522]
47 In this section, we discuss denoising functions ranging from simple but powerful linear ones to sophisticated nonlinear ones with the goal of inspiring others to try out their own denoising methods. [sent-527, score-1.021]
48 Many of the denoising functions discussed in this section are applied in experiments in Sec. [sent-529, score-0.494]
49 The package contains the denoising functions and speedups discussed in this paper and in another paper (Valpola and S¨ rel¨ , 2004). [sent-532, score-0.494]
50 Before proceeding to examples of denoising functions, we note that DSS would not be very useful if very exact denoising would be needed. [sent-534, score-0.988]
51 Fortunately, this is usually not the case and it is enough for the denoising function f(s) to remove more noise than signal (see Hyv¨ rinen et al. [sent-535, score-0.651]
52 Even if the denoising discards parts of the signal or creates nonexistent signals, re-estimation steps restore them. [sent-539, score-0.577]
53 If there is no detailed knowledge about the characteristics of the signals to start with, it is useful to bootstrap the denoising functions. [sent-540, score-0.607]
54 This can be achieved by starting with relatively general signal characteristics and then tuning the denoising functions based on analyses of the structure in the noisy signals extracted in the first phase. [sent-541, score-0.713]
55 In fact, some of the nonlinear DSS algorithms can be regarded as linear DSS algorithms where a linear denoising function is adapted to the sources, leading to nonlinear denoising. [sent-542, score-0.56]
56 1 Detailed Linear Denoising Functions In this section, we consider several detailed, simple but powerful, linear denoising schemes. [sent-544, score-0.494]
57 We introduce the denoisings using the denoising matrix D when feasible. [sent-545, score-0.518]
58 The eigenvalue decomposition (14) shows that any denoising in linear DSS can be implemented as an orthonormal rotation followed by a point-wise scaling of the samples and rotation back to the original space. [sent-547, score-0.548]
59 The eigenvalue decomposition of the denoising matrix D often offers good intuitive insight into the denoising function as well as practical means for its implementation. [sent-548, score-1.042]
60 In such experiments, the denoising can be simply implemented by D = diag(m) , (39) where D refers to the linear denoising matrix in Eq. [sent-555, score-0.988]
61 2 D ENOISING BASED ON F REQUENCY C ONTENT If, on the other hand, signals are characterised by having certain frequency components, one can transform the source estimate to a frequency space, mask the spectrum, e. [sent-564, score-0.53]
62 In the case of linear time-invariant (LTI) filtering, the filtering matrix has a Toeplitz structure and the denoising characteristics are manifested only in the diagonal matrix ΛD , while the transforming matrix V represents a constant rotation. [sent-572, score-0.494]
63 4 D ENOISING OF Q UASIPERIODIC S IGNALS As a final example of denoising based on detailed source characteristics, consider Fig. [sent-603, score-0.673]
64 The denoising proceeds as follows: 5 (a) 0 −5 (b) 5 (c) 0 −5 5 (d) 0 −5 10 (e) 0 −10 0 500 1000 1500 Figure 2: a) Current source estimate s of a quasiperiodic signal b) Peak estimates c) Average signal save (two periods are shown for clarity). [sent-610, score-0.895]
65 This averaging is a form of linear denoising since it can be implemented as matrix multiplication. [sent-629, score-0.494]
66 It would not be easy to see from the denoising matrix D itself that D = DDT . [sent-631, score-0.494]
67 2 Denoising Based on Estimated Signal Variance In the previous section, several denoising schemes were introduced. [sent-638, score-0.494]
68 In all of them, the details of the denoising were assumed to be known. [sent-639, score-0.494]
69 It is as well possible to estimate the denoising specifications from the data. [sent-640, score-0.522]
70 There one iteration of DSS using kurtosis-based denoising is shown. [sent-660, score-0.494]
71 The denoising interpretation suggests that the failure to extract the broad activity is due to a poor estimate of SNR. [sent-673, score-0.559]
72 We suggest the following improvements over the kurtosis-based denoising function (41): 1. [sent-693, score-0.494]
73 DSS using the MAP-based denoising has clearly removed a considerable amount of background noise as well as the lonely spike. [sent-722, score-0.522]
74 The exact details of these improvements are not crucial, but we wanted to show that the denoising interpretation of Eq. [sent-724, score-0.494]
75 This is because there is nothing in the denoising which could discard other sources. [sent-732, score-0.494]
76 Once the source estimate is dominated by one of the original sources and the contributions of the other sources fall closer to the noise level, the values of the mask are smaller for the other original sources possibly still present in the estimated source. [sent-742, score-0.761]
77 However, some aspects of the above denoising, especially smoothing of the total-variance estimate s2 (t), have not been suggested previously although they arise quite naturally from the denoising interpretation. [sent-749, score-0.522]
78 The advantages of the denoising we propose are that σ2 n 254 D ENOISING S OURCE S EPARATION 1 0. [sent-761, score-0.494]
79 1 0 −10 −8 −6 −4 −2 0 2 4 6 8 10 Figure 5: The tanh-based denoising mask 1 − tanh(s)/s is shown together with the variance-based denoising mask proposed here. [sent-770, score-1.272]
80 3 Other Denoising Functions There are cases where the system specification itself suggests some denoising schemes. [sent-777, score-0.494]
81 One simple case of online denoising is presented by moving-average filters. [sent-787, score-0.494]
82 Batch DSS algorithms with temporally symmetric denoising would converge to some particular rotation, but non-symmetric on-line denoising by f(s(t)) = s(t − 1) would keep oscillating between sine and cosine components. [sent-793, score-0.988]
83 It may be desirable to use the information in all sources S for denoising any particular source si . [sent-798, score-0.825]
84 This leads to the following denoising function: s+ = fi (S). [sent-799, score-0.494]
85 This can be achieved by incorporating a neighbourhood denoising rule in DSS, resulting in a topographic ordering of the sources. [sent-802, score-0.494]
86 It is also possible to combine various denoising functions when the sources are characterised by more than one type of structure. [sent-805, score-0.622]
87 As an example, consider the mask-based denoisings where denoising is implemented by multiplying the source point-wise by a mask. [sent-818, score-0.697]
88 (30) should always yield approximately the same value if the mask and source estimate are unrelated, but the value would be greater for cases where the magnitude of the source is correlated with the value of the mask. [sent-838, score-0.528]
89 1 L INEAR D ENOISING In this section, we show how the simple linear denoising schemes described in Sec. [sent-868, score-0.494]
90 The testing author decided to denoise the second source by a simple denoising function f(s) = sign(s). [sent-902, score-0.687]
91 3 S EPARATION R ESULTS In this section, we compare the separation results of the linear denoising (Sec. [sent-910, score-0.606]
92 Better SNR and less talk would probably be achieved by tuning the denoising to the characteristics of each different signal group. [sent-1003, score-0.577]
93 The nonlinearity of the denoising is gradually increased in the first iterations. [sent-1097, score-0.51]
94 This should then result in a kind of averaging denoising when a proper delay is used analogous to the multi-resolution spectrogram DSS described in Sec. [sent-1116, score-0.541]
95 In this paper we developed an algorithmic framework called denoising source separation (DSS). [sent-1176, score-0.785]
96 We showed that denoising can be used for source separation and that the results are often better than with blind approaches. [sent-1177, score-0.847]
97 Furthermore, many blind source separation techniques can be interpreted as DSS algorithms using very general denoising principles. [sent-1179, score-0.847]
98 There is a wide literature on signal denoising to choose from and in some cases denoising would be used for post-processing in any case. [sent-1182, score-1.071]
99 But more importantly, the modular coding style makes it easy to tune the denoising functions to better suit the separation problems at hand and even to build in completely new denoising functions to achieve better performance. [sent-1186, score-1.1]
100 We showed how denoising can be adapted to the observed characteristics of signals extracted with denoising based on vague knowledge. [sent-1188, score-1.124]
wordName wordTfidf (topN-words)
[('dss', 0.669), ('denoising', 0.494), ('source', 0.179), ('mask', 0.142), ('enoising', 0.132), ('sources', 0.128), ('signals', 0.113), ('separation', 0.112), ('valpola', 0.108), ('eparation', 0.108), ('ss', 0.099), ('arel', 0.094), ('ource', 0.089), ('signal', 0.083), ('meg', 0.079), ('denoised', 0.075), ('eigenvalues', 0.072), ('cdma', 0.067), ('ica', 0.064), ('blind', 0.062), ('shift', 0.061), ('snr', 0.061), ('rel', 0.055), ('eigenvalue', 0.054), ('hyv', 0.054), ('qrs', 0.052), ('negentropy', 0.051), ('mixing', 0.051), ('spectrogram', 0.047), ('tot', 0.047), ('rinen', 0.046), ('spectral', 0.045), ('cardiac', 0.042), ('masking', 0.042), ('fastica', 0.042), ('rake', 0.038), ('activity', 0.037), ('normalisation', 0.035), ('frequency', 0.034), ('nonlinear', 0.033), ('variance', 0.033), ('snrs', 0.033), ('sphered', 0.033), ('sphering', 0.033), ('ltered', 0.032), ('jade', 0.032), ('ltering', 0.03), ('tanh', 0.028), ('masks', 0.028), ('sopt', 0.028), ('wnew', 0.028), ('periods', 0.028), ('ation', 0.028), ('noise', 0.028), ('hz', 0.028), ('vig', 0.028), ('estimate', 0.028), ('ft', 0.025), ('peak', 0.025), ('si', 0.024), ('rio', 0.024), ('autocovariance', 0.024), ('denoisings', 0.024), ('sfa', 0.024), ('spiky', 0.024), ('streams', 0.024), ('extracted', 0.023), ('power', 0.022), ('biomedical', 0.022), ('cardoso', 0.022), ('subspace', 0.022), ('instantaneous', 0.021), ('mixtures', 0.02), ('bss', 0.019), ('inen', 0.019), ('ldi', 0.019), ('magnetoencephalograms', 0.019), ('normalise', 0.019), ('procedural', 0.019), ('zzt', 0.019), ('wt', 0.019), ('db', 0.018), ('component', 0.017), ('bit', 0.016), ('matlab', 0.016), ('normalised', 0.016), ('nonlinearity', 0.016), ('ci', 0.016), ('vt', 0.015), ('gaussian', 0.015), ('sent', 0.015), ('kurtosis', 0.015), ('separate', 0.015), ('cortex', 0.014), ('dct', 0.014), ('denoise', 0.014), ('jaakko', 0.014), ('leakage', 0.014), ('mog', 0.014), ('molgedey', 0.014)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000014 25 jmlr-2005-Denoising Source Separation
Author: Jaakko Särelä, Harri Valpola
Abstract: A new algorithmic framework called denoising source separation (DSS) is introduced. The main benefit of this framework is that it allows for the easy development of new source separation algorithms which can be optimised for specific problems. In this framework, source separation algorithms are constructed around denoising procedures. The resulting algorithms can range from almost blind to highly specialised source separation algorithms. Both simple linear and more complex nonlinear or adaptive denoising schemes are considered. Some existing independent component analysis algorithms are reinterpreted within the DSS framework and new, robust blind source separation algorithms are suggested. The framework is derived as a one-unit equivalent to an EM algorithm for source separation. However, in the DSS framework it is easy to utilise various kinds of denoising procedures which need not be based on generative models. In the experimental section, various DSS schemes are applied extensively to artificial data, to real magnetoencephalograms and to simulated CDMA mobile network signals. Finally, various extensions to the proposed DSS algorithms are considered. These include nonlinear observation mappings, hierarchical models and over-complete, nonorthogonal feature spaces. With these extensions, DSS appears to have relevance to many existing models of neural information processing. Keywords: blind source separation, BSS, prior information, denoising, denoising source separation, DSS, independent component analysis, ICA, magnetoencephalograms, MEG, CDMA
2 0.17053783 30 jmlr-2005-Estimating Functions for Blind Separation When Sources Have Variance Dependencies
Author: Motoaki Kawanabe, Klaus-Robert Müller
Abstract: A blind separation problem where the sources are not independent, but have variance dependencies is discussed. For this scenario Hyv¨ rinen and Hurri (2004) proposed an algorithm which requires a no assumption on distributions of sources and no parametric model of dependencies between components. In this paper, we extend the semiparametric approach of Amari and Cardoso (1997) to variance dependencies and study estimating functions for blind separation of such dependent sources. In particular, we show that many ICA algorithms are applicable to the variance-dependent model as well under mild conditions, although they should in principle not. Our results indicate that separation can be done based only on normalized sources which are adjusted to have stationary variances and is not affected by the dependent activity levels. We also study the asymptotic distribution of the quasi maximum likelihood method and the stability of the natural gradient learning in detail. Simulation results of artificial and realistic examples match well with our theoretical findings. Keywords: blind source separation, variance dependencies, independent component analysis, semiparametric statistical models, estimating functions
3 0.15800884 65 jmlr-2005-Separating a Real-Life Nonlinear Image Mixture
Author: Luís B. Almeida
Abstract: When acquiring an image of a paper document, the image printed on the back page sometimes shows through. The mixture of the front- and back-page images thus obtained is markedly nonlinear, and thus constitutes a good real-life test case for nonlinear blind source separation. This paper addresses a difficult version of this problem, corresponding to the use of “onion skin” paper, which results in a relatively strong nonlinearity of the mixture, which becomes close to singular in the lighter regions of the images. The separation is achieved through the MISEP technique, which is an extension of the well known INFOMAX method. The separation results are assessed with objective quality measures. They show an improvement over the results obtained with linear separation, but have room for further improvement. Keywords: ICA, blind source separation, nonlinear mixtures, nonlinear separation, image mixture, image separation
4 0.05740419 63 jmlr-2005-Quasi-Geodesic Neural Learning Algorithms Over the Orthogonal Group: A Tutorial
Author: Simone Fiori
Abstract: The aim of this contribution is to present a tutorial on learning algorithms for a single neural layer whose connection matrix belongs to the orthogonal group. The algorithms exploit geodesics appropriately connected as piece-wise approximate integrals of the exact differential learning equation. The considered learning equations essentially arise from the Riemannian-gradient-based optimization theory with deterministic and diffusion-type gradient. The paper aims specifically at reviewing the relevant mathematics (and at presenting it in as much transparent way as possible in order to make it accessible to readers that do not possess a background in differential geometry), at bringing together modern optimization methods on manifolds and at comparing the different algorithms on a common machine learning problem. As a numerical case-study, we consider an application to non-negative independent component analysis, although it should be recognized that Riemannian gradient methods give rise to general-purpose algorithms, by no means limited to ICA-related applications. Keywords: differential geometry, diffusion-type gradient, Lie groups, non-negative independent component analysis, Riemannian gradient
5 0.050363213 41 jmlr-2005-Kernel Methods for Measuring Independence
Author: Arthur Gretton, Ralf Herbrich, Alexander Smola, Olivier Bousquet, Bernhard Schölkopf
Abstract: We introduce two new functionals, the constrained covariance and the kernel mutual information, to measure the degree of independence of random variables. These quantities are both based on the covariance between functions of the random variables in reproducing kernel Hilbert spaces (RKHSs). We prove that when the RKHSs are universal, both functionals are zero if and only if the random variables are pairwise independent. We also show that the kernel mutual information is an upper bound near independence on the Parzen window estimate of the mutual information. Analogous results apply for two correlation-based dependence functionals introduced earlier: we show the kernel canonical correlation and the kernel generalised variance to be independence measures for universal kernels, and prove the latter to be an upper bound on the mutual information near independence. The performance of the kernel dependence functionals in measuring independence is verified in the context of independent component analysis. Keywords: independence, covariance operator, mutual information, kernel, Parzen window estimate, independent component analysis c 2005 Arthur Gretton, Ralf Herbrich, Alexander Smola, Olivier Bousquet and Bernhard Schölkopf . G RETTON , H ERBRICH , S MOLA , B OUSQUET AND S CHÖLKOPF
6 0.044474185 31 jmlr-2005-Estimation of Non-Normalized Statistical Models by Score Matching
7 0.036670431 62 jmlr-2005-Probabilistic Non-linear Principal Component Analysis with Gaussian Process Latent Variable Models
8 0.036122777 34 jmlr-2005-Feature Selection for Unsupervised and Supervised Inference: The Emergence of Sparsity in a Weight-Based Approach
9 0.03546603 39 jmlr-2005-Information Bottleneck for Gaussian Variables
10 0.029098233 55 jmlr-2005-Matrix Exponentiated Gradient Updates for On-line Learning and Bregman Projection
11 0.026835173 70 jmlr-2005-Universal Algorithms for Learning Theory Part I : Piecewise Constant Functions
12 0.0211586 36 jmlr-2005-Gaussian Processes for Ordinal Regression
13 0.020252572 14 jmlr-2005-Assessing Approximate Inference for Binary Gaussian Process Classification
14 0.020079458 32 jmlr-2005-Expectation Consistent Approximate Inference
15 0.018075336 4 jmlr-2005-A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data
16 0.017484026 2 jmlr-2005-A Bayesian Model for Supervised Clustering with the Dirichlet Process Prior
17 0.015711457 64 jmlr-2005-Semigroup Kernels on Measures
18 0.015620717 43 jmlr-2005-Learning Hidden Variable Networks: The Information Bottleneck Approach
19 0.014930169 19 jmlr-2005-Clustering on the Unit Hypersphere using von Mises-Fisher Distributions
20 0.014473503 71 jmlr-2005-Variational Message Passing
topicId topicWeight
[(0, 0.145), (1, 0.299), (2, -0.32), (3, -0.021), (4, -0.088), (5, -0.107), (6, 0.082), (7, 0.125), (8, 0.055), (9, -0.017), (10, 0.177), (11, 0.199), (12, -0.124), (13, -0.084), (14, 0.012), (15, -0.133), (16, -0.029), (17, 0.028), (18, -0.102), (19, -0.032), (20, 0.039), (21, 0.051), (22, 0.012), (23, 0.027), (24, -0.017), (25, 0.023), (26, -0.067), (27, 0.075), (28, -0.04), (29, 0.083), (30, 0.038), (31, 0.017), (32, 0.003), (33, 0.049), (34, 0.079), (35, -0.05), (36, -0.041), (37, 0.021), (38, -0.13), (39, 0.156), (40, 0.007), (41, -0.092), (42, -0.084), (43, -0.042), (44, 0.127), (45, 0.111), (46, -0.055), (47, -0.14), (48, -0.026), (49, -0.117)]
simIndex simValue paperId paperTitle
same-paper 1 0.97018325 25 jmlr-2005-Denoising Source Separation
Author: Jaakko Särelä, Harri Valpola
Abstract: A new algorithmic framework called denoising source separation (DSS) is introduced. The main benefit of this framework is that it allows for the easy development of new source separation algorithms which can be optimised for specific problems. In this framework, source separation algorithms are constructed around denoising procedures. The resulting algorithms can range from almost blind to highly specialised source separation algorithms. Both simple linear and more complex nonlinear or adaptive denoising schemes are considered. Some existing independent component analysis algorithms are reinterpreted within the DSS framework and new, robust blind source separation algorithms are suggested. The framework is derived as a one-unit equivalent to an EM algorithm for source separation. However, in the DSS framework it is easy to utilise various kinds of denoising procedures which need not be based on generative models. In the experimental section, various DSS schemes are applied extensively to artificial data, to real magnetoencephalograms and to simulated CDMA mobile network signals. Finally, various extensions to the proposed DSS algorithms are considered. These include nonlinear observation mappings, hierarchical models and over-complete, nonorthogonal feature spaces. With these extensions, DSS appears to have relevance to many existing models of neural information processing. Keywords: blind source separation, BSS, prior information, denoising, denoising source separation, DSS, independent component analysis, ICA, magnetoencephalograms, MEG, CDMA
2 0.73010916 30 jmlr-2005-Estimating Functions for Blind Separation When Sources Have Variance Dependencies
Author: Motoaki Kawanabe, Klaus-Robert Müller
Abstract: A blind separation problem where the sources are not independent, but have variance dependencies is discussed. For this scenario Hyv¨ rinen and Hurri (2004) proposed an algorithm which requires a no assumption on distributions of sources and no parametric model of dependencies between components. In this paper, we extend the semiparametric approach of Amari and Cardoso (1997) to variance dependencies and study estimating functions for blind separation of such dependent sources. In particular, we show that many ICA algorithms are applicable to the variance-dependent model as well under mild conditions, although they should in principle not. Our results indicate that separation can be done based only on normalized sources which are adjusted to have stationary variances and is not affected by the dependent activity levels. We also study the asymptotic distribution of the quasi maximum likelihood method and the stability of the natural gradient learning in detail. Simulation results of artificial and realistic examples match well with our theoretical findings. Keywords: blind source separation, variance dependencies, independent component analysis, semiparametric statistical models, estimating functions
3 0.64923763 65 jmlr-2005-Separating a Real-Life Nonlinear Image Mixture
Author: Luís B. Almeida
Abstract: When acquiring an image of a paper document, the image printed on the back page sometimes shows through. The mixture of the front- and back-page images thus obtained is markedly nonlinear, and thus constitutes a good real-life test case for nonlinear blind source separation. This paper addresses a difficult version of this problem, corresponding to the use of “onion skin” paper, which results in a relatively strong nonlinearity of the mixture, which becomes close to singular in the lighter regions of the images. The separation is achieved through the MISEP technique, which is an extension of the well known INFOMAX method. The separation results are assessed with objective quality measures. They show an improvement over the results obtained with linear separation, but have room for further improvement. Keywords: ICA, blind source separation, nonlinear mixtures, nonlinear separation, image mixture, image separation
4 0.22330996 63 jmlr-2005-Quasi-Geodesic Neural Learning Algorithms Over the Orthogonal Group: A Tutorial
Author: Simone Fiori
Abstract: The aim of this contribution is to present a tutorial on learning algorithms for a single neural layer whose connection matrix belongs to the orthogonal group. The algorithms exploit geodesics appropriately connected as piece-wise approximate integrals of the exact differential learning equation. The considered learning equations essentially arise from the Riemannian-gradient-based optimization theory with deterministic and diffusion-type gradient. The paper aims specifically at reviewing the relevant mathematics (and at presenting it in as much transparent way as possible in order to make it accessible to readers that do not possess a background in differential geometry), at bringing together modern optimization methods on manifolds and at comparing the different algorithms on a common machine learning problem. As a numerical case-study, we consider an application to non-negative independent component analysis, although it should be recognized that Riemannian gradient methods give rise to general-purpose algorithms, by no means limited to ICA-related applications. Keywords: differential geometry, diffusion-type gradient, Lie groups, non-negative independent component analysis, Riemannian gradient
5 0.18025275 70 jmlr-2005-Universal Algorithms for Learning Theory Part I : Piecewise Constant Functions
Author: Peter Binev, Albert Cohen, Wolfgang Dahmen, Ronald DeVore, Vladimir Temlyakov
Abstract: This paper is concerned with the construction and analysis of a universal estimator for the regression problem in supervised learning. Universal means that the estimator does not depend on any a priori assumptions about the regression function to be estimated. The universal estimator studied in this paper consists of a least-square fitting procedure using piecewise constant functions on a partition which depends adaptively on the data. The partition is generated by a splitting procedure which differs from those used in CART algorithms. It is proven that this estimator performs at the optimal convergence rate for a wide class of priors on the regression function. Namely, as will be made precise in the text, if the regression function is in any one of a certain class of approximation spaces (or smoothness spaces of order not exceeding one – a limitation resulting because the estimator uses piecewise constants) measured relative to the marginal measure, then the estimator converges to the regression function (in the least squares sense) with an optimal rate of convergence in terms of the number of samples. The estimator is also numerically feasible and can be implemented on-line. Keywords: distribution-free learning theory, nonparametric regression, universal algorithms, adaptive approximation, on-line algorithms c 2005 Peter Binev, Albert Cohen, Wolfgang Dahmen, Ronald DeVore and Vladimir Temlyakov. B INEV, C OHEN , DAHMEN , D E VORE AND T EMLYAKOV
6 0.16816862 62 jmlr-2005-Probabilistic Non-linear Principal Component Analysis with Gaussian Process Latent Variable Models
7 0.14276339 39 jmlr-2005-Information Bottleneck for Gaussian Variables
9 0.13644293 55 jmlr-2005-Matrix Exponentiated Gradient Updates for On-line Learning and Bregman Projection
10 0.11558922 3 jmlr-2005-A Classification Framework for Anomaly Detection
11 0.11555056 43 jmlr-2005-Learning Hidden Variable Networks: The Information Bottleneck Approach
12 0.11440792 53 jmlr-2005-Machine Learning Methods for Predicting Failures in Hard Drives: A Multiple-Instance Application
13 0.11014704 17 jmlr-2005-Change Point Problems in Linear Dynamical Systems
14 0.096486777 73 jmlr-2005-Working Set Selection Using Second Order Information for Training Support Vector Machines
15 0.094122455 45 jmlr-2005-Learning Multiple Tasks with Kernel Methods
16 0.091919526 12 jmlr-2005-An MDP-Based Recommender System
17 0.08763925 36 jmlr-2005-Gaussian Processes for Ordinal Regression
18 0.087250501 71 jmlr-2005-Variational Message Passing
19 0.087241605 22 jmlr-2005-Concentration Bounds for Unigram Language Models
20 0.084655076 35 jmlr-2005-Frames, Reproducing Kernels, Regularization and Learning
topicId topicWeight
[(1, 0.063), (13, 0.02), (17, 0.019), (19, 0.021), (36, 0.042), (37, 0.033), (42, 0.013), (43, 0.032), (47, 0.015), (52, 0.071), (59, 0.015), (70, 0.037), (80, 0.44), (88, 0.056), (90, 0.018), (94, 0.018)]
simIndex simValue paperId paperTitle
1 0.7958222 32 jmlr-2005-Expectation Consistent Approximate Inference
Author: Manfred Opper, Ole Winther
Abstract: We propose a novel framework for approximations to intractable probabilistic models which is based on a free energy formulation. The approximation can be understood as replacing an average over the original intractable distribution with a tractable one. It requires two tractable probability distributions which are made consistent on a set of moments and encode different features of the original intractable distribution. In this way we are able to use Gaussian approximations for models with discrete or bounded variables which allow us to include non-trivial correlations. These are neglected in many other methods. We test the framework on toy benchmark problems for binary variables on fully connected graphs and 2D grids and compare with other methods, such as loopy belief propagation. Good performance is already achieved by using single nodes as tractable substructures. Significant improvements are obtained when a spanning tree is used instead.
same-paper 2 0.76783121 25 jmlr-2005-Denoising Source Separation
Author: Jaakko Särelä, Harri Valpola
Abstract: A new algorithmic framework called denoising source separation (DSS) is introduced. The main benefit of this framework is that it allows for the easy development of new source separation algorithms which can be optimised for specific problems. In this framework, source separation algorithms are constructed around denoising procedures. The resulting algorithms can range from almost blind to highly specialised source separation algorithms. Both simple linear and more complex nonlinear or adaptive denoising schemes are considered. Some existing independent component analysis algorithms are reinterpreted within the DSS framework and new, robust blind source separation algorithms are suggested. The framework is derived as a one-unit equivalent to an EM algorithm for source separation. However, in the DSS framework it is easy to utilise various kinds of denoising procedures which need not be based on generative models. In the experimental section, various DSS schemes are applied extensively to artificial data, to real magnetoencephalograms and to simulated CDMA mobile network signals. Finally, various extensions to the proposed DSS algorithms are considered. These include nonlinear observation mappings, hierarchical models and over-complete, nonorthogonal feature spaces. With these extensions, DSS appears to have relevance to many existing models of neural information processing. Keywords: blind source separation, BSS, prior information, denoising, denoising source separation, DSS, independent component analysis, ICA, magnetoencephalograms, MEG, CDMA
3 0.7526316 22 jmlr-2005-Concentration Bounds for Unigram Language Models
Author: Evgeny Drukh, Yishay Mansour
Abstract: We show several high-probability concentration bounds for learning unigram language models. One interesting quantity is the probability of all words appearing exactly k times in a sample of size m. A standard estimator for this quantity is the Good-Turing estimator. The existing analysis on k its error shows a high-probability bound of approximately O √m . We improve its dependency on k to O √ 4 √k m k + m . We also analyze the empirical frequencies estimator, showing that with high probability its error is bounded by approximately O 2 −5 which has an error of approximately O m 1 k + √ k m . We derive a combined estimator, , for any k. A standard measure for the quality of a learning algorithm is its expected per-word log-loss. The leave-one-out method can be used for estimating the log-loss of the unigram model. We show 1 that its error has a high-probability bound of approximately O √m , for any underlying distribution. We also bound the log-loss a priori, as a function of various parameters of the distribution. Keywords: Good-Turing estimators, logarithmic loss, leave-one-out estimation, Chernoff bounds 1. Introduction and Overview Natural language processing (NLP) has developed rapidly over the last decades. It has a wide range of applications, including speech recognition, optical character recognition, text categorization and many more. The theoretical analysis has also advanced significantly, though many fundamental questions remain unanswered. One clear challenge, both practical and theoretical, concerns deriving stochastic models for natural languages. Consider a simple language model, where the distribution of each word in the text is assumed to be independent. Even for such a simplistic model, fundamental questions relating sample size to the learning accuracy are already challenging. This is mainly due to the fact that the sample size is almost always insufficient, regardless of how large it is. To demonstrate this phenomena, consider the following example. We would like to estimate the distribution of first names in the university. For that, we are given the names list of a graduate seminar: Alice, Bob, Charlie, Dan, Eve, Frank, two Georges, and two Henries. How can we use this sample to estimate the distribution of students’ first names? An empirical frequency estimator would c 2005 Evgeny Drukh and Yishay Mansour. D RUKH AND M ANSOUR assign Alice the probability of 0.1, since there is one Alice in the list of 10 names, while George, appearing twice, would get estimation of 0.2. Unfortunately, unseen names, such as Michael, will get an estimation of 0. Clearly, in this simple example the empirical frequencies are unlikely to estimate well the desired distribution. In general, the empirical frequencies estimate well the probabilities of popular names, but are rather inaccurate for rare names. Is there a sample size, which assures us that all the names (or most of them) will appear enough times to allow accurate probabilities estimation? The distribution of first names can be conjectured to follow the Zipf’s law. In such distributions, there will be a significant fraction of rare items, as well as a considerable number of non-appearing items, in any sample of reasonable size. The same holds for the language unigram models, which try to estimate the distribution of single words. As it has been observed empirically on many occasions (Chen, 1996; Curran and Osborne, 2002), there are always many rare words and a considerable number of unseen words, regardless of the sample size. Given this observation, a fundamental issue is to estimate the distribution the best way possible. 1.1 Good-Turing Estimators An important quantity, given a sample, is the probability mass of unseen words (also called “the missing mass”). Several methods exist for smoothing the probability and assigning probability mass to unseen items. The almost standard method for estimating the missing probability mass is the Good-Turing estimator. It estimates the missing mass as the total number of unique items, divided by the sample size. In the names example above, the Good-Turing missing mass estimator is equal 0.6, meaning that the list of the class names does not reflect the true distribution, to put it mildly. The Good-Turing estimator can be extended for higher orders, that is, estimating the probability of all names appearing exactly k times. Such estimators can also be used for estimating the probability of individual words. The Good-Turing estimators dates back to World War II, and were published first in 1953 (Good, 1953, 2000). It has been extensively used in language modeling applications since then (Katz, 1987; Church and Gale, 1991; Chen, 1996; Chen and Goodman, 1998). However, their theoretical convergence rate in various models has been studied only in the recent years (McAllester and Schapire, 2000, 2001; Kutin, 2002; McAllester and Ortiz, 2003; Orlitsky et al., 2003). For estimation of the probability of all words appearing exactly k times in a sample of size m, McAllester and Schapire k (2000) derive a high probability bound on Good-Turing estimator error of approximately O √m . One of our main results improves the dependency on k of this bound to approximately O √ 4 √k + k m m √ k 1 k+ m , We also show that the empirical frequencies estimator has an error of approximately O for large values of k. Based on the two estimators, we derive a combined estimator with an error of √ 4 2 k approximately O m− 5 , for any k. We also derive a weak lower bound of Ω √m for an error of any estimator based on an independent sample. Our results give theoretical justification for using the Good-Turing estimator for small values of k, and the empirical frequencies estimator for large values of k. Though in most applications the Good-Turing estimator is used for very small values of k, for example k ≤ 5, as by Katz (1987) or Chen (1996), we show that it is fairly accurate in a much wider range. 1232 . C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS 1.2 Logarithmic Loss The Good-Turing estimators are used to approximate the probability mass of all the words with a certain frequency. For many applications, estimating this probability mass is not the main optimization criteria. Instead, a certain distance measure between the true and the estimated distributions needs to be minimized. The most popular distance measure used in NLP applications is the Kullback-Leibler (KL) divergence. For a true distribution P = {px }, and an estimated distribution Q = {qx }, both over some px set X, this measure is defined as ∑x px ln qx . An equivalent measure, up to the entropy of P, is the 1 logarithmic loss (log-loss), which equals ∑x px ln qx . Many NLP applications use the value of log-loss to evaluate the quality of the estimated distribution. However, the log-loss cannot be directly calculated, since it depends on the underlying distribution, which is unknown. Therefore, estimating log-loss using the sample is important, although the sample cannot be independently used for both estimating the distribution and testing it. The hold-out estimation splits the sample into two parts: training and testing. The training part is used for learning the distribution, whereas the testing sample is used for evaluating the average per-word log-loss. The main disadvantage of this method is the fact that it uses only part of the available information for learning, whereas in practice one would like to use all the sample. A widely used general estimation method is called leave-one-out. Basically, it performs averaging all the possible estimations, where a single item is chosen for testing, and the rest are used for training. This procedure has an advantage of using the entire sample, and in addition it is rather simple and usually can be easily implemented. The existing theoretical analysis of the leave-oneout method (Holden, 1996; Kearns and Ron, 1999) shows general high probability concentration bounds for the generalization error. However, these techniques are not applicable in our setting. 1 We show that the leave-one-out estimation error for the log-loss is approximately O √m , for any underlying distribution and a general family of learning algorithms. It gives a theoretical justification for effective use of leave-one-out estimation for the log-loss. We also analyze the concentration of the log-loss itself, not based of an empirical measure. We address the characteristics of the underlying distribution affecting the log-loss. We find such a characteristic, defining a tight bound for the log-loss value. 1.3 Model and Semantics We denote the set of all words as V , and N = |V |. Let P be a distribution over V , where pw is the probability of a word w ∈ V . Given a sample S of size m, drawn i.i.d. using P, we denote the number of appearances of a word w in S as cS , or simply cw , when a sample S is clear from the context.1 We w define Sk = {w ∈ V : cS = k}, and nk = |Sk |. w For a claim Φ regarding a sample S, we write ∀δ S Φ[S] for P(Φ[S]) ≥ 1 − δ. For some error c ˜ bound function f (·), which holds with probability 1 − δ, we write O( f (·)) for O f (·) ln m , δ where c > 0 is some constant. 1.4 Paper Organization Section 2 shows several standard concentration inequalities, together with their technical applications regarding the maximum-likelihood approximation. Section 3 shows the error bounds for the 1. Unless mentioned otherwise, all further sample-dependent definitions depend on the sample S. 1233 D RUKH AND M ANSOUR k-hitting mass estimation. Section 4 bounds the error for the leave-one-out estimation of the logarithmic loss. Section 5 shows the bounds for the a priori logarithmic loss. Appendix A includes the technical proofs. 2. Concentration Inequalities In this section we state several standard Chernoff-style concentration inequalities. We also show some of their corollaries regarding the maximum-likelihood approximation of pw by pw = cw . ˆ m Lemma 1 (Hoeffding, 1963) Let Y = Y1 , . . . ,Yn be a set of n independent random variables, such that Yi ∈ [bi , bi + di ]. Then, for any ε > 0, P ∑ Yi − E ∑ Yi i >ε i ≤ 2 exp − 2ε2 ∑i di2 . The next lemma is a variant of an extension of Hoeffding’s inequality, by McDiarmid (1989). Lemma 2 Let Y = Y1 , . . . ,Yn be a set of n independent random variables, and f (Y ) such that any change of Yi value changes f (Y ) by at most di , that is sup ∀ j=i,Y j =Y j (| f (Y ) − f (Y )|) ≤ di . Let d = maxi di . Then, ∀δY : | f (Y ) − E[ f (Y )]| ≤ d n ln 2 δ . 2 Lemma 3 (Angluin and Valiant, 1979) Let Y = Y1 , . . . ,Yn be a set of n independent random variables, where Yi ∈ [0, B]. Let µ = E [∑i Yi ]. Then, for any ε > 0, P ∑ Yi < µ − ε ≤ exp − ε2 , 2µB ∑ Yi > µ + ε ≤ exp − ε2 . (2µ + ε)B i P i Definition 4 (Dubhashi and Ranjan, 1998) A set of random variables Y1 , . . . ,Yn is called “negatively associated”, if it satisfies for any two disjoint subsets I and J of {1, . . . , n}, and any two non-decreasing, or any two non-increasing, functions f from R|I| to R and g from R|J| to R: E[ f (Yi : i ∈ I)g(Y j : j ∈ J)] ≤ E[ f (Yi : i ∈ I)]E[g(Y j : j ∈ J)]. The next lemma is based on the negative association analysis. It follows directly from Theorem 14 and Proposition 7 of Dubhashi and Ranjan (1998). 1234 C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS Lemma 5 For any set of N non-decreasing, or N non-increasing functions { fw : w ∈ V }, any Chernoff-style bound on ∑w∈V fw (cw ), pretending that cw are independent, is valid. In particular, Lemmas 1 and 2 apply for {Y1 , ...,Yn } = { fw (cw ) : w ∈ V }. The next lemma shows an explicit upper bound on the binomial distribution probability.2 Lemma 6 Let X ∼ Bin(n, p) be a sum of n i.i.d. Bernoulli random variables with p ∈ (0, 1). Let 1 µ = E[X] = np. For x ∈ (0, n], there exists some function Tx = exp 12x + O x12 , such that ∀k ∈ Tn 1 {0, . . . , n}, we have P(X = k) ≤ √ Tµ Tn−µ . For integral values of µ, the equality is achieved 2πµ(1−p) at k = µ. (Note that for x ≥ 1, we have Tx = Θ(1).) The next lemma deals with the number of successes in independent trials. Lemma 7 (Hoeffding, 1956) Let Y1 , . . . ,Yn ∈ {0, 1} be a sequence of independent trials, with pi = 1 E[Yi ]. Let X = ∑i Yi be the number of successes, and p = n ∑i pi be the average trial success probability. For any integers b and c such that 0 ≤ b ≤ np ≤ c ≤ n, we have c ∑ k=b n k p (1 − p)n−k ≤ P (b ≤ X ≤ c) ≤ 1. k Using the above lemma, the next lemma shows a general concentration bound for a sum of arbitrary real-valued functions of a multinomial distribution components. We show that with a small penalty, any Chernoff-style bound pretending the components being independent is valid.3 We recall that cS , or equivalently cw , is the number of appearances of the word w in a sample S of w size m. Lemma 8 Let {cw ∼ Bin(m, pw ) : w ∈ V } be independent binomial random variables. Let { fw (x) : w ∈ V } be a set of real valued functions. Let F = ∑w fw (cw ) and F = ∑w fw (cw ). For any ε > 0, √ P (|F − E [F]| > ε) ≤ 3 m P F − E F >ε . The following lemmas provide concentration bounds for maximum-likelihood estimation of pw by pw = cw . The first lemma shows that words with “high” probability have a “high” count in the ˆ m sample. Lemma 9 Let δ > 0, and λ ≥ 3. We have ∀δ S: ∀w ∈ V, s.t. mpw ≥ 3 ln 2m , δ |mpw − cw | ≤ ∀w ∈ V, s.t. mpw > λ ln 2m , δ cw > 1− 3mpw ln 3 λ 2m ; δ mpw . 2. Its proof is based on Stirling approximation directly, though local limit theorems could be used. This form of bound is needed for the proof of Theorem 30. 3. The negative association analysis (Lemma 5) shows that a sum of monotone functions of multinomial distribution components must obey Chernoff-style bounds pretending that the components are independent. In some sense, our result extends this notion, since it does not require the functions to be monotone. 1235 D RUKH AND M ANSOUR The second lemma shows that words with “low” probability have a “low” count in the sample. Lemma 10 Let δ ∈ (0, 1), and m > 1. Then, ∀δ S: ∀w ∈ V such that mpw ≤ 3 ln m , we have cw ≤ δ 6 ln m . δ The following lemma derives the bound as a function of the count in the sample (and not as a function of the unknown probability). Lemma 11 Let δ > 0. Then, ∀δ S: cw > 18 ln 4m , δ ∀w ∈ V, s.t. |mpw − cw | ≤ 6cw ln 4m . δ The following is a general concentration bound. Lemma 12 For any δ > 0, and any word w ∈ V , we have ∀δ S, cw − pw < m 3 ln 2 δ . m The following lemma bounds the probability of words that do not appear in the sample. Lemma 13 Let δ > 0. Then, ∀δ S: ∀w ∈ S, / mpw < ln m . δ 3. K-Hitting Mass Estimation In this section our goal is to estimate the probability of the set of words appearing exactly k times in the sample, which we call “the k-hitting mass”. We analyze the Good-Turing estimator, the empirical frequencies estimator, and a combined estimator. ˆ Definition 14 We define the k-hitting mass Mk , its empirical frequencies estimator Mk , and its 4 Good-Turing estimator Gk as Mk = ∑ w∈Sk pw ˆ Mk = k nk m Gk = k+1 nk+1 . m−k The outline of this section is as follows. Definition 16 slightly redefines the k-hitting mass and its estimators. Lemma 17 shows that this redefinition has a negligible influence. Then, we analyze the estimation errors using the concentration inequalities from Section 2. Lemmas 20 and 21 bound the expectation of the Good-Turing estimator error, following McAllester and Schapire (2000). Lemma 23 bounds the deviation of the error, using the negative association analysis. A tighter bound, based on Lemma 8, is achieved at Theorem 25. Theorem 26 analyzes the error of the empirical frequencies estimator. Theorem 29 refers to the combined estimator. Finally, Theorem 30 shows a weak lower bound for the k-hitting mass estimation. 4. The Good-Turing estimator is usually defined as ( k+1 )nk+1 . The two definitions are almost identical for small values m k of k, as their quotient equals 1 − m . Following McAllester and Schapire (2000), our definition makes the calculations slightly simpler. 1236 C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS Definition 15 For any w ∈ V and i ∈ {0, · · · , m}, we define Xw,i as a random variable equal 1 if cw = i, and 0 otherwise. The following definition concentrates on words whose frequencies are close to their probabilities. Definition 16 Let α > 0 and k > 3α2 . We define Ik,α = pw ∈ Ik,α }. We define: Mk,α = ∑ pw = w∈Sk ∩Vk,α ∑ √ √ k−α k k+1+α k+1 m , m , and Vk,α = {w ∈ V : pw Xw,k , w∈Vk,α Gk,α = k+1 k+1 |Sk+1 ∩Vk,α | = ∑ Xw,k+1 , m−k m − k w∈Vk,α ˆ Mk,α = k k |Sk ∩Vk,α | = ∑ Xw,k . m m w∈Vk,α By Lemma 11, for large values of k the redefinition coincides with the original definition with high probability: Lemma 17 For δ > 0, let α = ˆ ˆ and Mk = Mk,α . 6 ln 4m . For k > 18 ln 4m , we have ∀δ S: Mk = Mk,α , Gk = Gk,α , δ δ Proof By Lemma 11, we have ∀δ S, ∀w : cw > 18 ln 4m , δ |mpw − cw | ≤ 6cw ln √ 4m = α cw . δ This means that any word w with cw = k has √ √ √ k−α k k+α k k+1+α k+1 ≤ pw ≤ < . m m m √ ˆ Therefore w ∈ Vk,α , completing the proof for Mk and Mk . Since α < k, any word w with cw = k + 1 has √ √ √ k−α k k+1−α k+1 k+1+α k+1 < ≤ pw ≤ , m m m which yields w ∈ Vk,α , completing the proof for Gk . Since the minimal probability of a word in Vk,α is Ω Lemma 18 Let α > 0 and k > 3α2 . Then, |Vk,α | = O 1237 m k k m . , we derive: D RUKH AND M ANSOUR Proof We have α < √ √k . 3 Any word w ∈ Vk,α has pw ≥ |Vk,α | < √ k−α k m > k m 1 1 − √3 . Therefore, m m 1 , =O 1 k 1− √ k 3 which completes the proof. Using Lemma 6, we derive: Lemma 19 Let α > 0 and 3α2 < k ≤ m . Let w ∈ Vk,α . Then, E[Xw,k ] = P(cw = k) = O 2 1 √ k . Proof Since cw ∼ Bin(m, pw ) is a binomial random variable, we use Lemma 6: E[Xw,k ] = P(cw = k) ≤ Tm 1 . 2πmpw (1 − pw ) Tmpw Tm(1−pw ) For w ∈ Vk,α , we have mpw = Ω(k), which implies 3α2 < k ≤ m 2, Tm Tmpw Tm(1−pw ) we have 1 2πmpw (1 − pw ) ≤ = O(1). Since pw ∈ Ik,α and 1 √ 2π k − α k √ k+1+α k+1 m 1− 1 < 1 2πk 1 − √3 1 1 − k+1 1 + √3 m 1 < 1 2πk 1 − √3 1− 1 2 1 +m 1 1 + √3 1 = O √ , k which completes the proof. 3.1 Good-Turing Estimator The following lemma, directly based on the definition of the binomial distribution, was shown in Theorem 1 of McAllester and Schapire (2000). Lemma 20 For any k < m, and w ∈ V , we have pw P(cw = k) = k+1 P(cw = k + 1)(1 − pw ). m−k The following lemma bounds the expectations of the redefined k-hitting mass, its Good-Turing estimator, and their difference. 1238 C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS Lemma 21 Let α > 0 and 3α2 < k < |E[Gk,α ] − E[Mk,α ]| = O √ k m m 2. We have E[Mk,α ] = O 1 √ k , E[Gk,α ] = O 1 √ k , and . Lemma 22 Let δ > 0, k ∈ {1, . . . , m}. Let U ⊆ V , such that |U| = O m . Let {bw : w ∈ U}, such k k that ∀w ∈ U, bw ≥ 0 and maxw∈U bw = O m . Let Xk = ∑w∈U bw Xw,k . We have ∀δ S: |Xk − E[Xk ]| = O k ln 1 δ . m Proof We define Yw,k = ∑i≤k Xw,i be random variable indicating cw ≤ k and Zw,k = ∑i < k. Let Yk = ∑w∈U bwYw,k and Zk = ∑w∈U bw Zw,k . We have ∑ bw Xw,k = ∑ bw [Yw,k − Zw,k ] = Yk − Zk . Xk = w∈U w∈U Both Yk and Zk , can be bounded using the Hoeffding inequality. Since {bwYw,k } and {bw Zw,k } are monotone with respect to {cw }, Lemma 5 applies for them. This means that the concentration of their sum is at least as tight as if they were independent. Recalling that |U| = O m and k k maxw∈U bw = O m , and using Lemma 2 for Yk and Zk , we have δ ∀ 2 S, |Yk − E[Yk ]| = O δ ∀ 2 S, |Zk − E[Zk ]| = O k m m k ln 1 , δ k m m k ln 1 . δ Therefore, |Xk − E[Xk ]| = |Yk − Zk − E[Yk − Zk ]| which completes the proof. ≤ |Yk − E[Yk ]| + |Zk − E[Zk ]| = O k ln 1 δ , m Using the negative association notion, we can show a preliminary bound for Good-Turing estimation error: Lemma 23 For δ > 0 and 18 ln 8m < k < m , we have ∀δ S: 2 δ k ln 1 δ |Gk − Mk | = O . m 1239 D RUKH AND M ANSOUR Proof Let α = 6 ln 8m . By Lemma 17, we have δ δ ∀ 2 S, Gk = Gk,α ∧ Mk = Mk,α . (1) By Lemma 21, |E[Gk − Mk ]| = |E[Gk,α − Mk,α ]| = O √ k m . (2) k+1 By Definition 16, Mk,α = ∑w∈Vk,α pw Xw,k and Gk,α = ∑w∈Vk,α m−k Xw,k+1 . By Lemma 18, we have |Vk,α | = O m . Therefore, using Lemma 22 with k for Mk,α , and with k + 1 for Gk,α , we have k δ ∀ 4 S, |Mk,α − E[Mk,α ]| = O δ ∀ 4 S, |Gk,α − E[Gk,α ]| = O k ln 1 δ m , (3) k ln 1 δ m . (4) Combining Equations (1), (2), (3), and (4), we have ∀δ S: |Gk − Mk | = |Gk,α − Mk,α | ≤ |Gk,α − E[Gk,α ]| + |Mk,α − E[Mk,α ]| + |E[Gk,α ] − E[Mk,α ]| √ k ln 1 k ln 1 k δ δ = O +O = O , m m m which completes the proof. Lemma 24 Let δ > 0, k > 0. Let U ⊆ V . Let {bw : w ∈ U} be a set of weights, such that bw ∈ [0, B]. Let Xk = ∑w∈U bw Xw,k , and µ = E[Xk ]. We have δ ∀ S, |Xk − µ| ≤ max √ √ 6 m 6 m 4Bµ ln , 2B ln δ δ . Proof By Lemma 8, combined with Lemma 3, we have ε2 B(2µ + ε) √ √ ε2 ε ≤ max 6 m exp − , 6 m exp − 4Bµ 2B √ P(|Xk − µ| > ε) ≤ 6 m exp − 1240 , (5) C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS where Equation (5) follows by considering ε ≤ 2µ and ε > 2µ separately. The lemma follows sub- stituting ε = max 4Bµ ln √ 6 m δ , 2B ln √ 6 m δ . We now derive the concentration bound on the error of the Good-Turing estimator. Theorem 25 For δ > 0 and 18 ln 8m < k < m , we have ∀δ S: 2 δ √ |Gk − Mk | = O Proof Let α = k ln m m δ + k ln m m δ . δ 6 ln 8m . Using Lemma 17, we have ∀ 2 S: Gk = Gk,α , and Mk = Mk,α . Recall that δ k+1 Mk,α = ∑w∈Vk,α pw Xw,k and Gk,α = ∑w∈Vk,α m−k Xw,k+1 . Both Mk,α and Gk,α are linear combinations k of Xw,k and Xw,k+1 , respectively, where the coefficients’ magnitude is O m , and the expectation, by Lemma 21, is O 1 √ k . By Lemma 24, we have δ √ k ln m δ m + k ln m δ m , (6) δ 4 √ k ln m δ m + k ln m δ m . (7) ∀ 4 S, |Mk,α − E[Mk,α ]| = O ∀ S, |Gk,α − E[Gk,α ]| = O Combining Equations (6), (7), and Lemma 21, we have ∀δ S: |Gk − Mk | = |Gk,α − Mk,α | ≤ |Gk,α − E[Gk,α ]| + |Mk,α − E[Mk,α ]| + |E[Gk,α ] − E[Mk,α ]| √ √ √ k ln m k ln m k ln m k ln m k δ δ δ δ + + + = O = O , m m m m m which completes the proof. 3.2 Empirical Frequencies Estimator ˆ In this section we bound the error of the empirical frequencies estimator Mk . Theorem 26 For δ > 0 and 18 ln 8m < k < m , we have 2 δ ∀δ S, √ k ln m δ ˆ |Mk − Mk | = O m 1241 3 2 + ln m δ . k D RUKH AND M ANSOUR Proof Let α = δ − ˆ ˆ 6 ln 8m . By Lemma 17, we have ∀ 2 S: Mk = Mk,α , and Mk = Mk,α . Let Vk,α = δ + k k {w ∈ Vk,α : pw < m }, and Vk,α = {w ∈ Vk,α : pw > m }. Let X− = k − pw Xw,k , m ∑ − w∈Vk,α X+ = ∑ + w∈Vk,α pw − k Xw,k , m and let X? specify either X− or X+ . By the definition, for w ∈ Vk,α we have By Lemma 18, |Vk,α | = O m k |E[X? ]| ≤ k m − pw = O . By Lemma 19, for w ∈ Vk,α we have E[Xw,k ] = O ∑ w∈Vk,α k − pw E[Xw,k ] = O m √ mα k 1 √ k m k =O 1 √ k expectation is O α k . . Therefore, α . k Both X− and X+ are linear combinations of Xw,k , where the coefficients are O √ α k m (8) √ α k m and the . Therefore, by Lemma 24, we have δ 4 ∀ S: |X? − E[X? ]| = O √ α3 k α4 √ + m m k . (9) ˆ By the definition of X− and X+ , Mk,α − Mk,α = X+ − X− . Combining Equations (8) and (9), we δ S: have ∀ ˆ ˆ |Mk − Mk | = |Mk,α − Mk,α | = |X+ − X− | ≤ |X+ − E[X+ ]| + |E[X+ ]| + |X− − E[X− ]| + |E[X− ]| √ 3 √ 3 k 4 k ln m 2 α α α δ √ + = O = O + + m k m m k since √ ab = O(a + b), and we use a = √ α3 k m and b = α . k ln m δ , k 3.3 Combined Estimator In this section we combine the Good-Turing estimator with the empirical frequencies to derive a combined estimator, which is uniformly accurate for all values of k. ˜ Definition 27 We define Mk , a combined estimator for Mk , by ˜ Mk = Gk ˆ Mk 1242 2 k ≤ m5 2 k > m5 . C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS Lemma 28 (McAllester and Schapire, 2000) Let k ∈ {0, . . . , m}. For any δ > 0, we have ln 1 m δ . k + ln m δ ∀δ S : |Gk − Mk | = O 2 ˜ ˜ The following theorem shows that Mk has an error bounded by O m− 5 , for any k. For small k, 2 2 we use Lemma 28. Theorem 25 is used for 18 ln 8m < k ≤ m 5 . Theorem 26 is used for m 5 < k < m . 2 δ ˜ The complete proof also handles k ≥ m . The theorem refers to Mk as a probability estimator, and 2 does not show that it is a probability distribution by itself. Theorem 29 Let δ > 0. For any k ∈ {0, . . . , m}, we have ˜ ˜ ∀δ S, |Mk − Mk | = O m− 5 . 2 The following theorem shows a weak lower bound for approximating Mk . It applies to estimating Mk based on a different independent sample. This is a very “weak” notation, since Gk , as well ˆ as Mk , are based on the same sample as Mk . Theorem 30 Suppose that the vocabulary consists of k m ), where 1 k √ m. The variance of Mk is Θ k m m k words distributed uniformly (that is pw = . 4. Leave-One-Out Estimation of Log-Loss Many NLP applications use log-loss as the learning performance criteria. Since the log-loss depends on the underlying probability P, its value cannot be explicitly calculated, and must be approximated. The main result of this section, Theorem 32, is an upper bound on the leave-one-out estimation of the log-loss, assuming a general family of learning algorithms. Given a sample S = {s1 , . . . , sm }, the goal of a learning algorithm is to approximate the true probability P by some probability Q. We denote the probability assigned by the learning algorithm to a word w by qw . Definition 31 We assume that any two words with equal sample frequency are assigned equal probabilities in Q, and therefore denote qw by q(cw ). Let the log-loss of a distribution Q be L = 1 1 ∑ pw ln qw = ∑ Mk ln q(k) . w∈V k≥0 Let the leave-one-out estimation, qw , be the probability assigned to w, when one of its instances is removed. We assume that any two words with equal sample frequency are assigned equal leaveone-out probability estimation, and therefore denote qw by q (cw ). We define the leave-one-out estimation of the log-loss as averaging the loss of each sample word, when it is extracted from the sample and pretended to be the test sample: 1243 D RUKH AND M ANSOUR ∑ Lleave−one = w∈V cw knk 1 1 =∑ ln ln . m qw k>0 m q (k) 1 1 Let Lw = L(cw ) = ln q(cw ) , and Lw = L (cw ) = ln q (cw ) . Let the maximal loss be Lmax = max max L(k), L (k + 1) . k In this section we discuss a family of learning algorithms, that receive the sample as an input. Assuming an accuracy parameter δ, we require the following properties to hold: 1. Starting from a certain number of appearances, the estimation is close to the sample frequency. Specifically, for some α, β ∈ [0, 1], ∀k ≥ ln 4m , δ q(k) = k−α . m−β (10) 2. The algorithm is stable when a single word is extracted from the sample: ∀m, 1 , m 1 L (k + 1) − L(k) = O S . n1 2 ≤ k ≤ 10 ln 4m , δ L (k + 1) − L(k) = O ∀m, ∀S s.t. nS > 0, k ∈ {0, 1}, 1 (11) (12) An example of such an algorithm is the following leave-one-out algorithm (we assume that the vocabulary is large enough so that n0 + n1 > 0): qw = N−n0 −1 (n0 +n1 )(m−1) cw −1 m−1 cw ≤ 1 cw ≥ 2. Equation (10) is satisfied by α = β = 1. Equation (11) is satisfied for k ≥ 2 by L(k) − L (k + 1) = 1 = O m . Equation (12) is satisfied for k ≤ 1: ln m−1 m−2 N − n0 − 1 m − 2 N − n0 − 2 m − 1 n0 + n1 + 1 m − 2 |L (2) − L(1)| = ln n0 + n1 m − 1 |L (1) − L(0)| = ln 1 1 1 =O + , N − n0 m n1 1 1 1 =O + . =O n0 + n1 m n1 =O The following is the main theorem of this section. It bounds the deviation between the between the true loss and the leave one out estimate. This bound shows that for a general family of learning algorithms, leave-one-out technique can be effectively used to estimate the logarithmic loss, given the sample only. The estimation error bound decreases roughly in proportion to the square root of the sample size, regardless of the underlying distribution. 1244 C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS Theorem 32 For a learning algorithm satisfying Equations (10), (11), and (12), and δ > 0, we have: δ ∀ S, |L − Lleave−one | = O Lmax (ln m )4 ln δ m ln m δ δ . The proof of Theorem 32 bounds the estimation error separately for the high-probability and low-probability words. We use Lemma 20 (McAllester and Schapire, 2000) to bound the estimation error for low-probability words. The expected estimation error for the high-probability words is bounded elementarily using the definition of the binomial distribution (Lemma 33). Finally, we use McDiarmid’s inequality (Lemma 2) to bound its deviation. The next lemma shows that the expectation of the leave-one-out method is a good approximation for the per-word expectation of the logarithmic loss. Lemma 33 Let 0 ≤ α ≤ 1, and y ≥ 1. Let Bn ∼ Bin(n, p) be a binomial random variable. Let fy (x) = ln(max(x, y)). Then, 0 ≤ E p fy (Bn − α) − 3p Bn fy (Bn − α − 1) ≤ . n n Proof For a real valued function F (here F(x) = fy (x − α)), we have: E Bn F(Bn − 1) n n = ∑ x=0 n = p∑ x=1 n x x p (1 − p)n−x F(x − 1) x n n − 1 x−1 p (1 − p)(n−1)−(x−1) F(x − 1) x−1 = pE[F(Bn−1 )] , where we used n x x n = E p fy (Bn − α) − n−1 x−1 . Since Bn ∼ Bn−1 + B1 , we have: Bn fy (Bn − α − 1) n = p(E[ fy (Bn−1 + B1 − α)] − E[ fy (Bn−1 − α)]) max(Bn−1 + B1 − α, y) max(Bn−1 − α, y) max(Bn−1 − α + B1 , y + B1 ) ≤ pE ln max(Bn−1 − α, y) B1 = pE ln(1 + ) max(Bn−1 − α, y) B1 ≤ pE . max(Bn−1 − α, y) = pE ln Since B1 and Bn−1 are independent, we get 1245 D RUKH AND M ANSOUR pE B1 1 = pE[B1 ]E max(Bn−1 − α, y) max(Bn−1 − α, y) 1 = p2 E max(Bn−1 − α, y) n−1 = p2 ∑ n−1 x 1 p (1 − p)n−1−x x max(x − α, y) = p2 ∑ x+1 1 n−1 x p (1 − p)n−1−x x + 1 max(x − α, y) x x=0 n−1 x=0 ≤ ≤ n−1 p x+1 n max px+1 (1 − p)n−(x+1) ∑ n x max(x − α, y) x=0 x + 1 3p 3p (1 − (1 − p)n ) < . (13) n n Equation (13) follows by the following observation: x + 1 ≤ 3(x − α) for x ≥ 2, and x + 1 ≤ 2y for x ≤ 1. Finally, pE ln max(Bn−1 −α+B1 ,y) ≥ 0, which implies the lower bound of the lemma. max(Bn−1 −α,y) The following lemma bounds n2 as a function of n1 . Lemma 34 Let δ > 0. We have ∀δ S: n2 = O 3 5 Theorem 32 Proof Let yw = 1 − m ln 1 + n1 ln m . δ δ δ pw m − 2. By Lemma 9, with λ = 5, we have ∀ 2 S: 3 ln 4m 3pw ln 4m δ δ pw − cw ≤ , m m m √ 5 ln 4m δ ∀w ∈ V : pw > , cw > yw + 2 ≥ (5 − 15) ln 4m > ln 4m . δ δ m ∀w ∈ V : pw > Let VH = w ∈ V : pw > 5 ln 4m δ m |L − Lleave−one | ≤ (14) (15) and VL = V \VH . We have ∑ w∈VH pw Lw − cw L m w + ∑ w∈VL pw Lw − cw L m w . (16) We start by bounding the first term of Equation (16). By Equation (15), we have ∀w ∈ VH , cw > m−β w −α yw + 2 > ln 4m . Equation (10) implies that qw = cm−β , therefore Lw = ln cm−β = ln max(cw −α,yw ) , and δ w −α m−1−β Lw = ln cm−1−β = ln max(cw −1−α,yw ) . Let w −1−α H Errw = cw m−β m−β ln − pw ln . m max(cw − 1 − α, yw ) max(cw − α, yw ) We have 1246 C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS ∑ w∈VH cw L − pw Lw m w m−1−β cw ∑ m − β w∈VH m ∑ H Errw + ln ∑ = H Errw + O w∈VH ≤ w∈VH 1 . m (17) H We bound ∑w∈VH Errw using McDiarmid’s inequality. As in Lemma 33, let fw (x) = ln(max(x, yw )). We have H E Errw = ln(m − β)E cw cw − pw + E pw fw (cw − α) − fw (cw − 1 − α) . m m The first expectation equals 0, the second can be bounded using Lemma 33: w∈VH H E Errw ≤ ∑ w∈VH E pw fw (cw − α) − ≤ ∑ ∑ cw fw (cw − 1 − α) m 3pw 1 =O . m m w∈VH (18) H In order to use McDiarmid’s inequality, we bound the change of ∑w∈VH Errw as a function of a single change in the sample. Suppose that a word u is replaced by a word v. This results in decrease H for cu , and increase for cv . Recalling that yw = Ω(mpw ), the change of Erru , as well as the change H , is bounded by O ln m , as follows: of Errv m m−β The change of pu ln max(cu −α,yu ) would be 0 if cu − α ≤ yu . Otherwise, pu ln m−β m−β − pu ln max(cu − 1 − α, yu ) max(cu − α, yu ) ≤ pu [ln(cu − α) − ln(cu − 1 − α)] = pu ln 1 + 1 cu − 1 − α =O pu cu . m−β pu 1 Since cu ≥ yu = Ω(mpu ), the change is bounded by O( cu ) = O( m ). The change of cu ln max(cu −1−α,yu ) m would be O( ln m ) if cu − 1 − α ≤ yu . Otherwise, m m−β cu m−β cu − 1 ln − ln m max(cu − 2 − α, yu ) m max(cu − 1 − α, yu ) m−β 1 m−β m−β cu − 1 ln + ln − ln ≤ m max(cu − 2 − α, yu ) max(cu − 1 − α, yu ) m max(cu − 1 − α, yu ) ln m cu − 1 1 ln m ln 1 + + =O . ≤ m cu − 2 − α m m H The change of Errv is bounded in a similar way. 1247 D RUKH AND M ANSOUR δ By Equations (17) and (18), and Lemma 2, we have ∀ 16 S: ∑ w∈VH ≤ cw L − pw Lw m w ∑ w∈VH ≤ O ∑ H Errw − E ln m m H Errw ∑ + E H Errw +O w∈VH w∈VH 1 1 1 m ln + + δ m m = O 1 m (ln m)2 ln 1 δ . m (19) δ Next, we bound the second term of Equation (16). By Lemma 10, we have ∀ 4 S: ∀w ∈ V s.t. pw ≤ 3 ln 4m δ , cw ≤ 6 ln 4m . δ m (20) b Let b = 5 ln 4m . By Equations (14) and (20), for any w such that pw ≤ m , we have δ cw ≤ max pw + m 3pw ln m 4m δ √ (5 + 3 ∗ 5) ln 4m 2b δ ≤ < . m m 4m δ 6 ln , m k L Therefore ∀w ∈ VL , we have cw < 2b. Let nL = |VL ∩Sk |, GL = m−k+1 nL , and Mk = ∑w∈VL ∩Sk pw . k k k−1 We have ∑ w∈VL cw L − pw Lw m w 2b = 2b−1 knL L k L (k) − ∑ Mk L(k) ∑ m k=1 k=0 ≤ ∑ m − kk+ 1 L (k) − ∑ MkL L(k) 2b k=1 2b = k=0 2b−1 ∑ GL L (k) − ∑ MkL L(k) k−1 k=1 2b−1 = 2b−1 knL +O k=0 2b−1 k=0 k=0 1 1 − m−k+1 m bLmax m +O k=0 2b−1 ≤ k=1 2b−1 ∑ GL L (k + 1) − ∑ MkL L(k) k k=0 2b + ∑ knL L (k) k bLmax m ∑ GL |L (k + 1) − L(k)| + ∑ |GL − MkL |L(k) + O k k bLmax m . The first sum of Equation (21) is bounded using Equations (11) and (12), and Lemma 34: 1248 (21) C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS 2b−1 ∑ GL |L (k + 1) − L(k)| k k=0 2b−1 = ∑ GL |L (k + 1) − L(k)| + G0|L (1) − L(0)| + G1|L (2) − L(1)|. k (22) k=2 The first term of Equation (22) is bounded by Equation (11): 2b−1 ∑ k=2 2b−1 ∑ GL · O k GL |L (k + 1) − L(k)| ≤ k k=2 1 m =O 1 . m δ The other two terms are bounded using Lemma 34. For n1 > 0, we have ∀ 16 S, n2 = O b By Equation (12), we have G0 |L (1) − L(0)| + G1 |L (2) − L(1)| ≤ n1 1 ·O m n1 2n2 1 + ·O m−1 n1 ln 1 δ . m = O b (23) m ln 1 + n1 δ (24) For n1 = 0, Lemma 34 results in n2 = O b m ln 1 , and Equation (24) transforms into δ G1 |L (2) − L(1)| ≤ 2n2 Lmax = O bLmax m−1 Equations (22), (23), (24), and (25) sum up to 2b−1 ∑ GL |L (k + 1) − L(k)| k k=0 = O bLmax ln 1 δ . m ln 1 δ . m (25) (26) The second sum of Equation (21) is bounded using Lemma 28 separately for every k < 2b with δ L accuracy 16b . Since the proof of Lemma 28 also holds for GL and Mk (instead of Gk and Mk ), we k δ L have ∀ 8 S, for every k < 2b, |GL − Mk | = O b k ln b δ m . Therefore, together with Equations (21) and (26), we have ∑ w∈VL cw L − pw Lw m w ≤ O bLmax = O Lmax 2b−1 ln 1 δ + ∑ L(k)O b m k=0 b4 ln b δ . m 1249 ln b bLmax δ +O m m (27) . D RUKH AND M ANSOUR The proof follows by combining Equations (16), (19), and (27). 5. Log-Loss A Priori Section 4 bounds the error of the leave-one-out estimation of the log-loss. It shows that the log-loss can be effectively estimated, for a general family of learning algorithms. Another question to be considered is the log-loss distribution itself, without the empirical estimation. That is, how large (or low) is it expected to be, and which parameters of the distribution affect it. We denote the learning error (equivalent to the log-loss) as the KL-divergence between the true and the estimated distribution. We refer to a general family of learning algorithms, and show lower and upper bounds for the learning error. The upper bound (Theorem 39) can be divided to three parts. The first part is the missing mass. The other two build a trade-off between a threshold (lower thresholds leads to a lower bound), and the number of words with probability exceeding this threshold (fewer words lead to a lower bound). It seems that this number of words is a necessary lower bound, as we show at Theorem 35. 1 Theorem 35 Let the distribution be uniform: ∀w ∈ V : pw = N , with N m. Also, suppose that the learning algorithm just uses maximum-likelihood approximation, meaning qw = cw . Then, a typical m learning error would be Ω( N ). m The proof of Theorem 35 bases on the Pinsker inequality (Lemma 36). It first shows a lower bound for L1 norm between the true and the expected distributions, and then transforms it to the form of the learning error. Lemma 36 (Pinsker Inequality) Given any two distributions P and Q, we have 1 KL(P||Q) ≥ (L1 (P, Q))2 . 2 Theorem 35 Proof We first show that L1 (P, Q) concentrates near Ω N m . Then, we use Pinsker inequality to show lower bound5 of KL(P||Q). First we find a lower bound for E[|pw − qw |]. Since cw is a binomial random variable, σ2 [cw ] = m mpw (1 − pw ) = Ω N , and with some constant probability, |cw − mpw | > σ[cw ]. Therefore, we have E[|qw − pw |] = ≥ E 1 E[|cw − mpw |] m 1 1 σ[cw ]P(|cw − mpw | > σ[cw ]) = Ω m m ∑ |pw − qw | w∈V = Ω N√ 1 mN =Ω 1 =Ω √ mN m N N m . 5. This proof does not optimize the constants. Asymptotic analysis of logarithmic transform of binomial variables by Flajolet (1999) can be used to achieve explicit values for KL(P||Q). 1250 C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS 2 A single change in the sample changes L1 (P, Q) by at most m . Using McDiarmid inequality 1 (Lemma 2) on L1 (P, Q) as a function of sample words, we have ∀ 2 S: L1 (P, Q) ≥ E[L1 (P, Q)] − |L1 (P, Q) − E[L1 (P, Q)]| √ N m N −O =Ω . = Ω m m m Using Pinsker inequality (Lemma 36), we have 1 2 ∀ S, pw 1 ∑ pw ln qw ≥ 2 w∈V 2 ∑ |pw − qw | =Ω w∈V N , m which completes the proof. Definition 37 Let α ∈ (0, 1) and τ ≥ 1. We define an (absolute discounting) algorithm Aα,τ , which α “removes” m probability mass from words appearing at most τ times, and uniformly spreads it among the unseen words. We denote by n1...τ = ∑τ ni the number of words with count between 1 i=1 and τ. The learned probability Q is defined by : αn1...τ cw = 0 mn0 cw −α qw = 1 ≤ cw ≤ τ m cw τ < cw . m The α parameter can be set to some constant, or to make the missing mass match the Good1...τ Turing missing mass estimator, that is αnm = n1 . m Definition 38 Given a distribution P, and x ∈ [0, 1], let Fx = ∑w∈V :pw ≤x pw , and Nx = |{w ∈ V : pw > x}|. Clearly, for any distribution P, Fx is a monotone function of x, varying from 0 to 1, and Nx is a monotone function of x, varying from N to 0. Note that Nx is bounded by 1 . x The next theorem shows an upper bound for the learning error. √ Theorem 39 For any δ > 0 and λ > 3, such that τ < (λ − 3λ) ln 8m , the learning error of Aα,τ is δ bounded ∀δ S by pw 0 ≤ ∑ pw ln qw w∈V ≤ M0 ln + n0 ln 4m δ αn1...τ α F 8m + 1 − α λ lnm δ 1251 λ ln 8m δ + 1−α 3 ln 8 δ + M0 m 3 ln 8 3λ ln 8m δ δ + √ N λ ln 8m . √ δ m 2( λ − 3)2 m m D RUKH AND M ANSOUR The proof of Theorem 39 bases directly on Lemmas 40, 41, and 43. We can rewrite this bound roughly as Nλ λ ˜ ≤ O M0 + √ + m m m pw qw ∑ pw ln w∈V . This bound implies the characteristics of the distribution influencing the log-loss. It shows that a “good” distribution can involve many low-probability words, given that the missing mass is low. However, the learning error would increase if the dictionary included many mid-range3 probability words. For example, if a typical word’s probability were m− 4 , the bound would become 1 ˜ O M0 + m− 4 . Lemma 40 For any δ > 0, the learning error for non-appearing words can be bounded with high probability by ∑ ∀δ S, pw ln w∈S / pw qw ≤ M0 ln n0 ln m δ αn1...τ . Proof By Lemma 13, we have ∀δ S, the real probability of any non-appearing word does not exceed ln m δ m . Therefore, ∑ pw ln w∈S / pw qw m n0 α n1...τ ∑ pw ln pw ∑ pw ln = ln m m n0 δ m α n1...τ w∈S / ≤ w∈S / = M0 ln n0 ln m δ αn1...τ , which completes the proof. Lemma 41 Let δ > 0, λ > 0. Let VL = w ∈ V : pw ≤ λ ln 2m δ m for VL can be bounded with high probability by ∀δ S, ∑ pw ln w∈VL pw qw Proof We use ln(1 + x) ≤ x. ∑ λ ln 2m δ ≤ 1−α pw ln w∈VL For any appearing word w, qw ≥ 1−α m . pw qw ≤ ∑ w∈VL Therefore, 1252 , and VL = VL ∩ S. The learning error 3 ln 2 α δ + M0 + F 2m . m 1 − α λ lnm δ pw pw − qw . qw C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS pw − qw qw pw w∈VL ≤ m ∑ pw (pw − qw ) 1 − α w∈V = ∑ m 1−α L ∑ w∈VL ≤ m 1−α ≤ m λ ln 1−α m ≤ λ ln 1−α pw pw − ∑ pw pw − w∈VL 2m δ 2m δ ∑ w∈VL L cw m + pw − cw m α m ∑ pw m 1 − α w∈V L pw − w∈VL ∑ cw cw + ∑ pw − qw m m w∈V cw m + α ∑ pw 1 − α w∈V L + α F 2m . 1 − α λ lnm δ (28) We apply Lemma 12 on vL , the union of words in VL . Let pvL = ∑w∈VL pw and cvL = ∑w∈VL cw . We have ∀δ S: ∑ w∈VL pw − cw m w∈VL = pw − cw cw − ∑ pw − m m w∈V \S cw m ∑ L ≤ ∑ w∈VL pw − ≤ pvL − cvL + M0 m ≤ + ∑ pw w∈VL \S 3 ln 2 δ + M0 . m (29) The proof follows combining Equations (28) and (29). 2 x Lemma 42 Let 0 < ∆ < 1. For any x ∈ [−∆, ∆], we have ln(1 + x) ≥ x − 2(1−∆)2 . √ Lemma 43 Let δ > 0, λ > 3, such that τ < (λ − 3λ) ln 4m . Let the high-probability words set be δ VH = w ∈ V : pw > probability by λ ln 4m δ m ∀δ S, , and VH = VH ∩ S. The learning error for VH can be bounded with high ∑ w∈VH pw ln pw qw ≤ 3 ln 4 3λ ln 4m δ δ + √ N λ ln 4m . √ δ m 2( λ − 3)2 m m 1253 D RUKH AND M ANSOUR Proof ∑ w∈VH pw pw ln qw ∑ pw ln ∑ = pw = pw ln + cw m w∈VH mpw cw w∈VH ∑ pw ln w∈VH ∑ + cw m qw cw . cw − α pw ln w∈VH ,cw ≤τ (30) δ Using Lemma 9 with λ, we have ∀ 2 S: 3pw ln 4m cw δ ≤ , (31) m m √ 4m ∀w ∈ VH , cw ≥ (λ − 3λ) ln . δ √ This means that for a reasonable choice of τ (meaning τ < (λ − 3λ) ln 4m ), the second term of δ Equation (30) is 0, and VH = VH . Also, pw − ∀w ∈ VH , cw m 3pw ln 4m δ ≤ m − pw 1 ≤ pw pw Therefore, we can use Lemma 42 with ∆ = ∑ w∈VH pw ln mpw cw = − ≤ − = ∑ m 3 ln 2m δ = 2m m λ ln δ 3 λ: pw ln 1 + cw m w∈VH ∑ w∈VH ∑ w∈VH 3 . λ cw m pw − pw pw − pw − pw cw + pw − m cw m 1 2 1− 3 λ λ 2 √ √ λ− 3 2 2 ∑ w∈VH − pw pw cw m − pw pw 2 2 . (32) We apply Lemma 12 on the vH , the union of all words in VH . Let pvH = ∑w∈VH pw and cvH = ∑w∈VH cw . The bound on the first term of Equation (32) is: δ ∀ 2 S, ∑ w∈VH pw − cw m = pvH − cvH ≤ m 3 ln 4 δ . m Assuming that Equation (31) holds, the second term of Equation (32) can also be bounded: 1254 (33) C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS ∑ cw m w∈VH − pw pw 2 ≤ ∑ w∈VH 3 ln 4m 1 3pw ln 4m δ δ = N λ ln 4m . δ pw m m m (34) The proof follows by combining Equations (30), (32), (33) and (34). Acknowledgments This work was supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778, by a grant from the Israel Science Foundation and an IBM faculty award. This publication only reflects the authors’ views. We are grateful to David McAllester for his important contributions in the early stages of this research. Appendix A. Technical Proofs A.1 Concentration Inequalities Lemma 6 Proof We use Stirling approximation Γ(x + 1) = Tx = exp P(X = k) = ≤ = = = 1 1 +O 2 12x x √ 2πx x x e Tx , where . n k p (1 − p)n−k k Γ(n + 1) µ µ n − µ n−µ Γ(µ + 1)Γ(n − µ + 1) n n √ n µµ (n − µ)n−µ Tn 2πn n √ 2πµ 2π(n − µ) µµ (n − µ)n−µ nµ nn−µ Tµ Tn−µ √ 1 2πµ n Tn n − µ Tµ Tn−µ Tn 1 . 2πµ(1 − p) Tµ Tn−µ Clearly, for integral values of µ, the equality is achieved at k = µ. Lemma 8 Proof Let m = ∑w∈V cw . Using Lemma 7 for m with b = c = E[m ] = m, the prob1 ability P(m = m) achieves its minimum when ∀w ∈ V, pw = N . Under this assumption, we have 1 m ∼ Bin(mN, N ). Using Lemma 6, we have 1255 D RUKH AND M ANSOUR P m =m = 1 1 1 2πmN N 1 − N TmN 1 ≥ √ . Tm TmN−m 3 m Therefore, for any distribution {pw : w ∈ V }, we have 1 P(m = m) ≥ √ . 3 m Obviously, E[F ] = ∑w E[ fw (cw )] = E[F]. Also, the distribution of {cw } given that m = m is identical to the distribution of {cw }, therefore the distribution of F given that m = m is identical to the distribution of F. We have P(|F − E[F ]| > ε) = ∑ P(m i = i)P(|F − E[F ]| > ε|m = i) ≥ P(m = m)P(|F − E[F ]| > ε|m = m) = P(m = m)P(|F − E[F]| > ε) 1 √ P(|F − E[F]| > ε), ≥ 3 m which completes the proof. Lemma 44 For any δ > 0, and a word w ∈ V , such that pw ≥ P pw − cw > m 3 ln 2 δ m 3pw ln 2 δ ≤ δ. m Proof The proof follows by applying Lemma 3, substituting ε = mpw we have ε ≤ mpw : cw P pw − ≥ m , we have 3mpw ln 2 . Note that for 3 ln 2 ≤ δ δ 3pw ln 2 δ = P (|mpw − cw | ≥ ε) m ≤ 2 exp − ε2 2E[cw ] + ε ≤ 2 exp − 3mpw ln 2 δ 3mpw which completes the proof. 1256 = δ, C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS 3 ln 2m Lemma 9 Proof There are at most m words with probability pw ≥ m δ . The first claim follows δ using Lemma 44 together with union bound over all these words (with accuracy m for each word). Using the first claim, we derive the second. We show a lower bound for 3pw ln 2m δ > pw − pw m cw ≥ pw − m 3 = λ 1− 3 λ cw m, using ln 2m δ m 1 < λ pw : pw . The final inequality follows from simple algebra. Lemma 10 Proof Let b = 3 ln( m ). Note that δ ∈ [0, 1] and m > 1 yield b > 2. First, suppose that δ b there are up to m words with pw ≤ m . For each such word, we apply Lemma 3 on cw , with ε = b. We have: P cw > 6 ln m b2 ≤ P (cw > mpw + ε) ≤ exp − δ 2mpw + b ≤ δ . m Since we assume that there are up to m such words, the total mistake probability is δ. Now we assume the general case, that is, without any assumption on the number of words. Our goal is to reduce the problem to the former conditions, that is, to create a set of size m of words with b probability smaller than m . We first create m empty sets v1 , . . . , vm . Let the probability of each set vi , pvi , be the sum of the probabilities of all the words it includes. Let the actual count of vi , cvi , be the sum of the sample counts of all the words w it includes. We divide all the words w between these sets in a bin-packing-approximation manner. We sort the words w in decreasing probability order. Then, we do the following loop: insert the next word w to the set vi with the currently smallest pvi . b We claim that pvi ≤ m for each vi at the end of the loop. If this inequality does not hold, then b some word w made this “overflow” first. Obviously, pw must be smaller than 2m , otherwise it would b be one of the first 2m < m words ordered, and would enter an empty set. If pw < 2m and it made b b an “overflow”, then the probability of each set at the moment w was entered must exceed 2m , since w must have entered the lightest set available. This means that the total probability of all words b entered by that moment was greater than m 2m > 1. Applying the case of m words to the sets v1 , . . . , vm , we have ∀δ S: for every vi , cvi ≤ 2b. Also, if the count of each set vi does not exceed 2b, so does the count of each word w ∈ vi . That is, P ∃w : pw ≤ b b , cw > 2b ≤ P ∃vi : pvi ≤ , cvi > 2b ≤ δ, m m which completes the proof. 1257 D RUKH AND M ANSOUR δ Lemma 11 Proof By Lemma 9 with some λ > 3 (which will be set later), we have ∀ 2 S: 3 ln 4m δ , m λ ln 4m δ ∀w : pw > , m ∀w : pw ≥ By Equation (35), for any word w such that √ λ + 3λ ln 4m . By Lemma 10, we have δ |mpw − cw | ≤ cw > 3 ln 4m δ m 3mpw ln 3 λ 1− ≤ pw ≤ 4m , δ mpw . λ ln 4m δ m (35) (36) , we have cw ≤ mpw + 3mpw ln 4m ≤ δ 4m . δ √ It means that for any w : mpw ≤ λ ln 4m , we have cw ≤ λ + 3λ ln 4m . This means that for δ δ √ 4m 4m any w such that cw > λ + 3λ ln δ , we have mpw > λ ln δ . By Equation (36), this means δ ∀ 2 S, ∀w s.t. pw ≤ mpw ≤ 1− 3 ln 4m δ m , cw ≤ 6 ln 1 √ 3 cw , and by Equation (35): λ |mpw − cw | ≤ 4m 3mpw ln ≤ δ 3cw ln 4m δ 1− 3 λ = √ 3cw λ ln 4m √ √δ . λ− 3 Substituting λ = 12 results in ∀δ S : ∀w s.t. cw > 18 ln 4m , δ |mpw − cw | ≤ 6cw ln 4m , δ which completes the proof. Lemma 12 Proof If pw ≥ 3 ln 2 δ m ∀δ S, , we can apply Lemma 44. We have cw − pw ≤ m 3pw ln 2 δ ≤ m 3 ln 2 δ . m Otherwise, we can apply Lemma 10. We have: ∀δ S, cw cw − pw ≤ max pw , m m which completes the proof. 1258 ≤ 6 ln m δ ≤ m 3 ln 2 δ , m C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS Lemma 13 Proof Let b = ln m . We note that there are at most δ P ∃w : cw = 0, pw ≥ b m ≤ ∑ m b b words with probability pw ≥ m . P(cw = 0) b w:pw ≥ m ∑ = m b (1 − pw ) ≤ 1− b m b m m < me−b = δ, w:pw ≥ m which completes the proof. A.2 K-Hitting Mass Estimation Lemma 21 Proof We have ∑w∈Vk,α pw ≤ 1. Using Lemma 19, we bound P(cw = k) and P(cw = k + 1): E[Mk,α ] = 1 pw P(cw = k) = O √ k w∈Vk,α ∑ k+1 P(cw = k + 1) − pw P(cw = k) w∈Vk,α m − k √ k+1 k = ∑ pw . P(cw = k + 1) = O m−k m w∈Vk,α |E[Gk,α ] − E[Mk,α ]| = ∑ Equation (37) follows by Lemma 20. By Lemma 18, we have |Vk,α | = O E[Gk,α ] = km 1 k+1 ∑ P(cw = k + 1) = O m k √k m − k w∈Vk,α m k (37) : 1 =O √ , k which completes the proof. Theorem 29 Proof The proof is done by examining four cases of k. For k ≤ 18 ln 8m , we can use δ Lemma 28. We have ˜ ∀δ S, |Mk − Mk | = |Gk − Mk | = O ln 1 δ m k + ln m δ ˜ =O 1 √ m . 2 For 18 ln 8m < k ≤ m 5 , we can use Theorem 25. We have δ ˜ ∀δ S, |Mk − Mk | = |Gk − Mk | = O 1259 √ k ln m δ m + k ln m δ m 2 ˜ = O m− 5 . D RUKH AND M ANSOUR 2 For m 5 < k < m , we can use Theorem 26. We have 2 δ ˜ ˆ ∀ S, |Mk − Mk | = |Mk − Mk | = O For k ≥ m , let α = 2 √ 3 k(ln m ) 2 δ m + √ ln m δ k 2 ˜ = O m− 5 . δ ˆ ˆ 6 ln 8m . By Lemma 17, we have ∀ 2 S, Mk = Mk,α ∧ Mk = Mk,α . By Lemma δ 18, |Vk,α | = O m = O(1). Let c be the bound on |Vk,α |. Using Lemma 12 for each w ∈ Vk,α with k δ accuracy 2c , we have cw − pw = O m δ ∀ 2 S, ∀w ∈ Vk,α , Therefore, we have ∀δ S: ˜ ˆ |Mk − Mk | = |Mk,α − Mk,α | ≤ ∑ w∈Vk,α which completes the proof. 1 δ ln . m ln 1 1 δ ˜ =O √ , m m k − pw Xw,k = O m Theorem 30 Proof First, we show that for any two words u and v, Cov(Xu,k , Xv,k ) = Θ k that {cv |cu = k} ∼ Bin m − k, m−k . By Lemma 6, we have: P(cu = k) = P(cv = k) = 1 2πk 1 − P(cv = k|cu = k) = k m Tm , Tk Tm−k 1 k 2πk 1 − m−k Tm−k . Tk Tm−2k Cov(Xu,k , Xv,k ) = E[Xu,k Xv,k ] − E[Xu,k ]E[Xv,k ] m−k m 1260 1 k 1− m . Note (38) Using Tx = Θ(1) for x ≥ k, we have = P(cu = k)[P(cv = k|cu = k) − P(cv = k)] 1 Tm Tm−k 1 − = Tk Tm−k Tk Tm−2k k k 1 − m−k 2πk 1 − m 1 Tm−k Tm 1 1 = Θ − k 1 − k Tm−2k 1 − k Tm−k k m2 Tm Tk Tm−k C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS 1 1 Tm−k k Tm−2k = Θ 1− 1 + k m−k 1 − 1− k m Tm Tm−k . − Tm−2k Tm−k k 1− m (39) We can bound the first term of Equation (39): 1 1− k m−k 1 − 1− = k m k 1 − m−k = Θ 1− Since Tx = exp 1 12x +O Tm Tm−k − Tm−2k Tm−k 1 x2 k 1 − m−k k 1− m k 1− m − k k −1+ m m−k 1 = 1 + 12x + O 1 x2 =Θ k 1− m + k 1− m + k2 m2 . k 1 − m−k k 1 − m−k (40) for x ≥ m − 2k (note that k m), we have 2 Tm−k − Tm Tm−2k Tm−2k Tm−k 1 1 1 1 1 − − +O = Tm−2k Tm−k 6(m − k) 12m 12(m − 2k) m2 k2 1 = −Θ +O . 3 m m2 = Combining Equations (39), (40), and (41), we have Cov(Xu,k , Xv,k ) = Θ 1 k Θ Now we show that σ2 [Xw,k ] = Θ 1 √ k k2 m2 k2 m3 −Θ +O 1 m2 =Θ k m2 . By Equation (38), we have 1 σ2 [Xw,k ] = P(cw = k)(1 − P(cw = k)) = Θ √ k 1 1−Θ √ k 1 =Θ √ . k Now we find a bound for σ2 [Mk ]. σ2 [Mk ] = σ2 = ∑ w = m k = Θ . ∑ pw Xw,k w 2 2 pw σ [Xw,k ] + ∑ pu pvCov(Xu,k , Xv,k ) u=v k 2 1 Θ √ m k √ k , m + 1261 m m −1 k k k m 2 Θ k m2 (41) D RUKH AND M ANSOUR which completes the proof. A.3 Leave-One-Out Estimation of Log-Loss δ Lemma 34 Proof Using Lemma 9, we have ∀ 2 : n2 = |U ∩ S2 | and n1 = |U ∩ S1 |, where U = {w ∈ V : mpw ≤ c ln m }, for some c > 0. Let n2 = |U ∩ S2 | and n1 = |U ∩ S1 |. Let b = ln m . δ δ First, we show that E[n2 ] = O(bE[n1 ]). E[n2 ] = m 2 p (1 − pw )m−2 2 w ∑ w∈U = m − 1 pw 2 1 − pw ∑ mpw (1 − pw )m−1 w∈U = ∑ mpw (1 − pw )m−1 O(b) = O(bE[n1 ]). w∈U Next, we bound the deviation of n1 and n2 . A single change in the sample changes n1 , as well as n2 , by at most 1. Therefore, using Lemma 2 for n1 and n2 , we have n1 ≥ E[n1 ] − O δ ∀4 S : m ln 1 δ , n2 ≤ E[n2 ] + O δ ∀4 S : m ln 1 δ . Therefore, n2 ≤ E[n2 ] + O m ln 1 δ = O bE[n1 ] + m ln 1 δ = O b n1 + m ln 1 δ , which completes the proof. A.4 Log-Loss A Priori Theorem 39 Proof The KL-divergence is of course non-negative. By Lemma 40, we have δ ∀ 4 S, ∑ pw ln w∈S / pw qw δ By Lemma 41 with λ, we have ∀ 4 S: 1262 ≤ M0 ln n0 ln 4m δ αn1...τ . (42) C ONCENTRATION B OUNDS FOR U NIGRAM L ANGUAGE M ODELS ∑ pw ln λ ln 8m w∈S:pw ≤ m δ pw qw λ ln 1−α 8m δ ≤ 3 ln α + M0 + F 8m . m 1 − α λ lnm δ 8 δ (43) δ By Lemma 43 with λ, we have ∀ 2 S: ∑ pw ln λ ln 8m w∈S:pw > m δ pw qw ≤ 3 ln 8 3λ ln 8m δ δ N λ ln 8m . + √ √ δ m 2( λ − 3)2 m m (44) The proof follows by combining Equations (42), (43), and (44). Lemma 42 Proof Let f (x) = x2 2(1−∆)2 − x + ln(1 + x). Then, f (x) = f (x) = x 1 , −1+ (1 − ∆)2 1+x 1 1 − (1 + x)2 2 (1 − ∆) . Clearly, f (0) = f (0) = 0. Also, f (x) ≥ 0 for any x ∈ [−∆, ∆]. Therefore, f (x) is non-negative in the range above, and the lemma follows. References D. Angluin and L. G. Valiant. Fast probabilistic algorithms for Hamiltonian circuits and matchings. Journal of Computer and System Sciences, 18:155–193, 1979. S. F. Chen. Building Probabilistic Models for Natural Language. PhD thesis, Harvard University, 1996. S. F. Chen and J. Goodman. An empirical study of smoothing techniques for language modeling. Technical Report TR-10-98, Harvard University, 1998. K. W. Church and W. A. Gale. A comparison of the enhanced Good-Turing and deleted estimation methods for estimating probabilities of English bigrams. Computer Speech and Language, 5: 19–54, 1991. J. R. Curran and M. Osborne. A very very large corpus doesn’t always yield reliable estimates. In Proceedings of the Sixth Conference on Natural Language Learning, pages 126–131, 2002. D. P. Dubhashi and D. Ranjan. Balls and bins: A study in negative dependence. Random Structures and Algorithms, 13(2):99–124, 1998. 1263 D RUKH AND M ANSOUR P. Flajolet. Singularity analysis and asymptotics of Bernoulli sums. Theoretical Computer Science, 215:371–381, 1999. I. J. Good. The population frequencies of species and the estimation of population parameters. Biometrika, 40(16):237–264, 1953. I. J. Good. Turing’s anticipation of empirical Bayes in connection with the cryptanalysis of the naval Enigma. Journal of Statistical Computation and Simulation, 66(2):101–112, 2000. W. Hoeffding. On the distribution of the number of successes in independent trials. Annals of Mathematical Statistics, 27:713–721, 1956. W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:13–30, 1963. S. B. Holden. PAC-like upper bounds for the sample complexity of leave-one-out cross-validation. In Proceesings of the Ninth Annual ACM Workshop on Computational Learning Theory, pages 41–50, 1996. S. M. Katz. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech and Signal Processing, 35(3):400– 401, 1987. M. Kearns and D. Ron. Algorithmic stability and sanity-check bounds for leave-one-out crossvalidation. Neural Computation, 11(6):1427–1453, 1999. S. Kutin. Algorithmic Stability and Ensemble-Based Learning. PhD thesis, University of Chicago, 2002. D. McAllester and L. Ortiz. Concentration inequalities for the missing mass and for histogram rule error. Journal of Machine Learning Research, Special Issue on Learning Theory, 4(Oct): 895–911, 2003. D. McAllester and R. E. Schapire. On the convergence rate of Good-Turing estimators. In Proceedings of the Thirteenth Annual Conference on Computational Learning Theory, pages 1–6, 2000. D. McAllester and R. E. Schapire. Learning theory and language modeling. In Seventeenth International Joint Conference on Artificial Intelligence, 2001. C. McDiarmid. On the method of bounded differences. In Surveys in Combinatorics, pages 148– 188. London Math. Soc. Lectures Notes 141, Cambridge University Press, 1989. A. Orlitsky, N. P. Santhanam, and J. Zhang. Always Good Turing: Asymptotically optimal probability estimation. Science, 302(Oct):427–431, 2003. 1264
4 0.34740752 30 jmlr-2005-Estimating Functions for Blind Separation When Sources Have Variance Dependencies
Author: Motoaki Kawanabe, Klaus-Robert Müller
Abstract: A blind separation problem where the sources are not independent, but have variance dependencies is discussed. For this scenario Hyv¨ rinen and Hurri (2004) proposed an algorithm which requires a no assumption on distributions of sources and no parametric model of dependencies between components. In this paper, we extend the semiparametric approach of Amari and Cardoso (1997) to variance dependencies and study estimating functions for blind separation of such dependent sources. In particular, we show that many ICA algorithms are applicable to the variance-dependent model as well under mild conditions, although they should in principle not. Our results indicate that separation can be done based only on normalized sources which are adjusted to have stationary variances and is not affected by the dependent activity levels. We also study the asymptotic distribution of the quasi maximum likelihood method and the stability of the natural gradient learning in detail. Simulation results of artificial and realistic examples match well with our theoretical findings. Keywords: blind source separation, variance dependencies, independent component analysis, semiparametric statistical models, estimating functions
5 0.32753959 65 jmlr-2005-Separating a Real-Life Nonlinear Image Mixture
Author: Luís B. Almeida
Abstract: When acquiring an image of a paper document, the image printed on the back page sometimes shows through. The mixture of the front- and back-page images thus obtained is markedly nonlinear, and thus constitutes a good real-life test case for nonlinear blind source separation. This paper addresses a difficult version of this problem, corresponding to the use of “onion skin” paper, which results in a relatively strong nonlinearity of the mixture, which becomes close to singular in the lighter regions of the images. The separation is achieved through the MISEP technique, which is an extension of the well known INFOMAX method. The separation results are assessed with objective quality measures. They show an improvement over the results obtained with linear separation, but have room for further improvement. Keywords: ICA, blind source separation, nonlinear mixtures, nonlinear separation, image mixture, image separation
6 0.26236966 63 jmlr-2005-Quasi-Geodesic Neural Learning Algorithms Over the Orthogonal Group: A Tutorial
7 0.2504231 41 jmlr-2005-Kernel Methods for Measuring Independence
8 0.2488105 20 jmlr-2005-Clustering with Bregman Divergences
9 0.24699587 17 jmlr-2005-Change Point Problems in Linear Dynamical Systems
10 0.24485005 33 jmlr-2005-Fast Kernel Classifiers with Online and Active Learning
11 0.24393821 36 jmlr-2005-Gaussian Processes for Ordinal Regression
12 0.24354871 49 jmlr-2005-Learning the Kernel with Hyperkernels (Kernel Machines Section)
13 0.24073859 71 jmlr-2005-Variational Message Passing
14 0.24065587 31 jmlr-2005-Estimation of Non-Normalized Statistical Models by Score Matching
15 0.23971176 46 jmlr-2005-Learning a Mahalanobis Metric from Equivalence Constraints
16 0.23894519 64 jmlr-2005-Semigroup Kernels on Measures
17 0.23438953 39 jmlr-2005-Information Bottleneck for Gaussian Variables
18 0.23424156 19 jmlr-2005-Clustering on the Unit Hypersphere using von Mises-Fisher Distributions
19 0.23350255 3 jmlr-2005-A Classification Framework for Anomaly Detection
20 0.2319788 44 jmlr-2005-Learning Module Networks