acl acl2013 acl2013-325 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Brian Roark ; Cyril Allauzen ; Michael Riley
Abstract: We present an algorithm for re-estimating parameters of backoff n-gram language models so as to preserve given marginal distributions, along the lines of wellknown Kneser-Ney (1995) smoothing. Unlike Kneser-Ney, our approach is designed to be applied to any given smoothed backoff model, including models that have already been heavily pruned. As a result, the algorithm avoids issues observed when pruning Kneser-Ney models (Siivola et al., 2007; Chelba et al., 2010), while retaining the benefits of such marginal distribution constraints. We present experimental results for heavily pruned backoff ngram models, and demonstrate perplexity and word error rate reductions when used with various baseline smoothing methods. An open-source version of the algorithm has been released as part of the OpenGrm ngram library.1
Reference: text
sentIndex sentText sentNum sentScore
1 Smoothed marginal distribution constraints for language modeling Brian Roark†◦, Cyril Allauzen◦ and Michael Riley◦ †Oregon Health & Science University, Portland, Oregon ◦Google, Inc. [sent-1, score-0.248]
2 Abstract We present an algorithm for re-estimating parameters of backoff n-gram language models so as to preserve given marginal distributions, along the lines of wellknown Kneser-Ney (1995) smoothing. [sent-4, score-0.457]
3 Unlike Kneser-Ney, our approach is designed to be applied to any given smoothed backoff model, including models that have already been heavily pruned. [sent-5, score-0.642]
4 We present experimental results for heavily pruned backoff ngram models, and demonstrate perplexity and word error rate reductions when used with various baseline smoothing methods. [sent-9, score-1.032]
5 Such models are trained on large text corpora, by counting the frequency of n-gram collocations, then normalizing and smoothing (regularizing) the resulting multinomial distributions. [sent-12, score-0.275]
6 Standard techniques store the observed n-grams and derive probabilities of unobserved n-grams via their longest observed suffix and “backoff” costs associated with the prefix histories of the unobserved suffixes. [sent-13, score-0.543]
7 Another common approach, which we pursue in this paper, is model pruning, whereby some number of the n-grams are removed from explicit storage in the model, so that their probability must be assigned via backoff smoothing. [sent-22, score-0.382]
8 While these greedy pruning methods are highly effective for models estimated with most common smoothing approaches, they have been shown to be far less effective with Kneser-Ney trained language models (Siivola et al. [sent-27, score-0.413]
9 , 2010), leading to severe degradation in model quality relative to other standard smoothing methods. [sent-29, score-0.307]
10 Table 1, reformatted from Chelba et al. (2010), demonstrates the perplexity degradation of Kneser-Ney smoothed models in contrast to other common smoothing methods. [sent-40, score-0.72]
11 4-gram language models, pruned using Stolcke (1998) relative entropy pruning to approximately 1. [sent-42, score-0.416]
12 Thus, while Kneser-Ney may be the preferred smoothing method for large, unpruned models, where it can achieve real improvements over other smoothing methods, it has severely diminished utility when relatively sparse, pruned models are required. [sent-45, score-0.673]
13 In their experiments (see Table 1 caption for specifics on training/test setup), they trained 4-gram Broadcast News language models using a variety of both backoff and interpolated smoothing methods and measured perplexity before and after Stolcke (1998) relative entropy based pruning. [sent-48, score-0.708]
14 With this size training data, the perplexity of all of the smoothing methods other than Kneser-Ney degrades from around 120 with the full model to around 200 with the heavily pruned model. [sent-49, score-0.635]
15 Kneser-Ney smoothed models have lower perplexity with the full model than the other methods by about 5 points, but degrade with pruning to far higher perplexity, between 270-285. [sent-50, score-0.726]
16 The cause of this degradation is Kneser-Ney’s unique method for estimating smoothed language models, which will be presented in more detail in Section 3. [sent-51, score-0.331]
17 Briefly, the smoothing method reestimates lower-order n-gram parameters in order to avoid over-estimating the likelihood of n-grams that already have ample probability mass allocated as part of higher-order n-grams. [sent-52, score-0.325]
18 This is done via a marginal distribution constraint which requires the expected frequency of the lower-order n-grams to match their observed frequency in the training data, much as is commonly done for maximum entropy model training. [sent-53, score-0.403]
19 Lower-order n-gram parameters resulting from Kneser-Ney are not relative frequency estimates, as with other smoothing methods; rather they are parameters estimated specifically for use within the larger smoothed model. [sent-55, score-0.558]
20 First, the pruning methods commonly use lower order n-gram probabilities to derive an estimate of state marginals, and, since these parameters are no longer smoothed relative frequency estimates, they do not serve that purpose well. [sent-57, score-0.76]
21 For this reason, the widely-used SRILM toolkit recently provided switches to modify their pruning algorithm to use another model for state marginal estimates (Stolcke et al. [sent-58, score-0.529]
22 Second, and perhaps more importantly, the marginal constraints that were applied prior to smoothing will not in general be consistent with the much smaller pruned model. [sent-60, score-0.563]
23 In this paper, we present an algorithm that imposes marginal distribution constraints of the sort used in Kneser-Ney modeling on arbitrary smoothed backoff n-gram language models. [sent-63, score-0.815]
24 Our approach makes use of the same sort of derivation as the original Kneser-Ney modeling, but, among other differences, relies on smoothed estimates of the empirical relative frequency rather than the unsmoothed observed frequency. [sent-64, score-0.543]
25 The algorithm can be applied after the smoothed model has been pruned, hence avoiding the pitfalls associated with Kneser-Ney modeling. [sent-65, score-0.348]
26 Furthermore, while Kneser-Ney is conventionally defined as a variant of absolute discounting, our method can be applied to models smoothed with any backoff smoothing, including mixtures of models, widely used for domain adaptation. [sent-66, score-0.611]
27 We next establish formal preliminaries and our smoothed marginal distribution constraints method. [sent-67, score-0.52]
28 2 Preliminaries N-gram language models are typically presented mathematically in terms of words w, the strings (histories) h that precede them, and the suffixes of the histories (backoffs) h0 that are used in the smoothing recursion. [sent-68, score-0.395]
29 We use variables u, v to denote single symbols from the vocabulary; h, g ∈ V∗ to denote history sequences preceding the specific word; and h0, g0 ∈ V∗ the respective backoff histories of h and g as typically defined (see below). [sent-74, score-0.452]
30 For a string w = w1 . . . w|w|, we can calculate the smoothed conditional probability of each word wi in the sequence given the k words that preceded it, depending on the order of the Markov model. [sent-78, score-0.4]
31 Then the smoothed model is defined recursively as follows: P(wi | hik) = β(hikwi) if hikwi ∈ G, and α(hik, hik−1) P(wi | hik−1) otherwise (1). [sent-83, score-0.303]
32 Note that for h = hik, the typically defined backoff history is h0 = hik−1, i.e., the longest proper suffix of hik. [sent-88, score-0.38]
33 Here β(hikwi) is the parameter associated with the n-gram hikwi, and α(hik, hik−1) is the backoff cost associated with going from state hik to state hik−1. We assume that, if hw ∈ G, then all prefixes and suffixes of hw are also in G. [sent-100, score-0.958]
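To make the recursion in equation (1) concrete, here is a minimal sketch of a backoff lookup over explicitly stored parameters; the dictionary representation, tuple histories and toy numbers are illustrative assumptions, not the paper's (OpenFst-based) implementation.

    # Minimal illustration of the backoff recursion in equation (1).
    # beta[(h, w)] holds explicit n-gram parameters for n-grams hw in G;
    # alpha[(h, h0)] holds one-step backoff costs from state h to its backoff h0.

    def backoff(h):
        """Typically defined backoff history: drop the oldest word."""
        return h[1:]

    def prob(w, h, beta, alpha):
        """P(w | h) under the backoff recursion of equation (1)."""
        if (h, w) in beta:                  # hw is explicitly in G
            return beta[(h, w)]
        if not h:                           # unigrams must be explicit
            raise KeyError(f"unigram {w!r} missing from model")
        h0 = backoff(h)
        return alpha[(h, h0)] * prob(w, h0, beta, alpha)

    # Toy model over a two-word vocabulary (histories are tuples, oldest word first).
    beta = {((), "a"): 0.6, ((), "b"): 0.4, (("a",), "a"): 0.7}
    alpha = {(("a",), ()): 0.75, (("b",), ()): 1.0}
    print(prob("a", ("a",), beta, alpha))   # explicit bigram: 0.7
    print(prob("b", ("a",), beta, alpha))   # backed off: 0.75 * 0.4 = 0.3

Note that the toy values already satisfy normalization at each state (0.7 + 0.3 = 1), which is exactly what the backoff weights α are responsible for.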
34 States represent histories h, and the words w, whose probabilities are conditioned on h, label the arcs, leading to the history state for the subsequent word. [sent-104, score-0.507]
35 For example, ‘xyz’ indicates that the state represents a history consisting of the previous three words. [sent-107, score-0.265]
36 Failure arcs, labeled with a φ in Figure 1, encode an “otherwise” semantics and have as destination the origin state’s backoff history. [sent-109, score-0.263]
37 Note that, in general, the recursive definition of backoff may require the traversal of several backoff arcs. (State labels are presented for convenience, to specify the history implicitly encoded by the state.) [sent-111, score-0.38]
38 For example, the highest order states in Figure 1 need to traverse a couple of φ arcs to reach state ‘z’. [sent-114, score-0.375]
39 We can define the backoff cost between a state hik and any of its suffix states as follows. [sent-115, score-0.831]
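The body of equation (2) is referenced but not shown in this extract; given the recursion in equation (1), the cumulative backoff cost between hik and one of its suffix states is presumably the product of the one-step costs along the suffix chain. Under that assumption (a reconstruction, not a quotation from the paper):

    \alpha(h_i^k, h_i^j) \;=\; \prod_{l=j+1}^{k} \alpha\big(h_i^l, h_i^{l-1}\big), \qquad 0 \le j < k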
40 3 Marginal distribution constraints Marginal distribution constraints attempt to match the expected frequency of an n-gram with its observed frequency. [sent-124, score-0.308]
41 Using standard n-gram notation (where g0 is the backoff history for g), this constraint is stated in Kneser and Ney (1995) as P̂(w | h0) = Σ_{g: g0 = h0} P(g, w | h0) (3), where P̂ is the empirical relative frequency estimate. [sent-127, score-0.505]
42 First, we will make use of regularized estimates of relative frequency P rather than the raw relative frequency P̂. [sent-132, score-0.289]
43 Second, rather than just looking at observed histories h that back off to h0, we will look at all histories (observed or not) of the length of the longest history in the model. [sent-133, score-0.447]
44 For notational simplicity, suppose we have an n+1-gram model, hence the longest history in the model is of length n. [sent-134, score-0.288]
45 Assume the length of the particular backoff history |h0 | = k. [sent-135, score-0.38]
46 Then we can restate the marginal distribution constraint in equation 3 as P(w | h0) = Σ_{h ∈ V^{n−k}h0} P(h, w | h0) (4). Next we solve for the β(h0w) parameters used in equation 1. [sent-138, score-0.395]
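As a sanity check on equation (4), the sketch below enumerates all length-n histories ending in h0 and accumulates the model's expected conditional frequency of w, factoring P(h, w | h0) = P(h | h0) P(w | h); the callables state_prob and cond_prob stand in for the steady state probabilities and the smoothed model, and are assumptions for illustration.

    from itertools import product

    def expected_marginal(w, h0, vocab, n, state_prob, cond_prob):
        # Right-hand side of equation (4): sum of P(h, w | h0) over all
        # length-n histories h whose suffix is the backoff history h0.
        k = len(h0)
        histories = [tuple(pre) + tuple(h0) for pre in product(vocab, repeat=n - k)]
        mass_h0 = sum(state_prob(h) for h in histories)   # normalizer for P(h | h0)
        return sum(state_prob(h) / mass_h0 * cond_prob(w, h) for h in histories)

The brute-force enumeration of V^{n−k} is only for exposition; as discussed in Section 4.2 below, the algorithm only needs to visit states that are explicitly in the model, since the mass of unseen histories is carried by their longest explicit suffix.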
47 Keep in mind, P is the target expected frequency from a given smoothed model. [sent-145, score-0.341]
48 This means that we cannot simply ‘repair’ pruned Kneser-Ney models, but must use other smoothing methods where the smoothed values are based on relative frequency estimation. [sent-147, score-0.795]
49 There are, in addition, two other important differences in our approach from that in Kneser and Ney (1995), which would remain as differences even if our target expected frequency were the unsmoothed relative frequency P̂ instead of the smoothed estimate P. [sent-148, score-0.5]
50 First, the sum in the numerator is over histories of length n, the highest order in the n-gram model, whereas in the Kneser-Ney approach the sum is over histories that immediately back off to h0, i.e., histories of length |h0| + 1. [sent-149, score-0.486]
51 We briefly note that, like Kneser-Ney, if the baseline smoothing method is consistent, then the amount of smoothing in the limit will go to zero and our resulting model will also be consistent. [sent-159, score-0.397]
52 The quantities in equation 8 are given values (from the input smoothed model and previous stages in the algorithm, respectively), implying an algorithm that estimates the highest orders of the model first. [sent-161, score-0.373]
53 In addition, steady state history probabilities P(h) must be calculated. [sent-162, score-0.493]
54 4 Model constraint algorithm Our algorithm takes a smoothed backoff n-gram language model in an automaton format (see Figure 1) and returns a smoothed backoff n-gram language model with the same topology. [sent-164, score-1.249]
55 For the histories h0 that are backed off to, we calculate the weight provided by equation 8 and assign it (after normalization) to the appropriate n-gram arc in the automaton. [sent-167, score-0.267]
56 First, we must provide a probability for every state in the model. [sent-169, score-0.265]
57 4.1 Steady state probability calculation The steady state probability P(h) is taken to be the probability of observing h after a long word sequence, i.e., [sent-173, score-0.493]
58 P(h) = lim_{m→∞} Σ_{w1 . . . wm ∈ V^m} P̂(w1 . . . wmh) (9), where P̂ is the corpus probability derived as follows: the smoothed n-gram probability model P(w | h) is naturally extended to a sentence s = w0 . . . [sent-181, score-0.401]
59 Assuming the n-gram language model automaton G has a single final state (footnote 2: P̂ models words in a corpus rather than a single sentence, since Equation 9 tends to zero as m → ∞ otherwise), [sent-192, score-0.353]
60 an arc is added from the final state back to the initial state and a final weight of 1 − λ is assigned, in order to have the automaton model this corpus distribution. [sent-196, score-0.433]
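A minimal sketch of the steady state computation this sets up: once the failure arcs have been handled as described next (Figure 2) and the corpus loop has been added, the state probabilities are the stationary distribution of the resulting transition matrix, obtainable by power iteration. The dense matrix and toy numbers are illustrative assumptions; the actual algorithm works on the (sparse) automaton.

    import numpy as np

    def steady_state(T, tol=1e-12, max_iter=10000):
        # Power iteration for the stationary distribution of a row-stochastic
        # transition matrix T over the model's history states.
        n = T.shape[0]
        p = np.full(n, 1.0 / n)
        for _ in range(max_iter):
            p_next = p @ T
            if np.abs(p_next - p).sum() < tol:
                break
            p = p_next
        return p_next

    # Toy 3-state chain with a loop back to the initial state.
    T = np.array([[0.1, 0.9, 0.0],
                  [0.0, 0.2, 0.8],
                  [0.7, 0.0, 0.3]])
    print(steady_state(T))   # steady state P(h) for each state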
61 The backoff arcs have a special interpretation in the automaton: they are traversed only if a word fails to match at the higher order. [sent-201, score-0.4]
62 These failure arcs must be properly handled before applying standard stationary distribution calculations. [sent-202, score-0.34]
63 A simple approach would be, for each word w0 and state h such that hw0 ∉ G but h0w0 ∈ G, to add a w0 arc from state h to h0w0 with weight α(h, h0)β(h0w0) and then remove all failure arcs (see Figure 2a). [sent-203, score-0.62]
64 This however results in an automaton with |V| arcs leaving every state, which is unwieldy with larger vocabularies and n-gram orders. [sent-204, score-0.281]
65 Instead, for each word w and state h such that hw ∈ G, add a w arc from state h to h0w with weight −α(h, h0)β(h0w) and then replace all failure labels φ with ε (see Figure 2b). [sent-205, score-0.351]
66 In this case, the added negatively-weighted arcs compensate for the excess probability mass allowed by the epsilon arcs.3 [sent-207, score-0.271]
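A sketch of the φ-replacement just described, on a toy arc-table representation (an assumption for illustration; the paper operates on OpenFst automata): every word w with hw ∈ G gets a compensating negative arc that cancels the mass the ε arc would otherwise leak through the backoff path.

    def replace_failure_arcs(arcs, backoff_arcs):
        # arcs: {state h: {word w: (beta_weight, destination)}} for n-grams hw in G.
        # backoff_arcs: {state h: (alpha_weight, backoff_state h0)}.
        # Returns the compensating negative arcs per state, plus the phi arcs
        # re-labelled as epsilon arcs (same weights).
        negative_arcs = {}
        for h, (alpha_w, h0) in backoff_arcs.items():
            for w, (beta_h0w, dest) in arcs.get(h0, {}).items():
                if w in arcs.get(h, {}):          # hw is explicitly in G
                    negative_arcs.setdefault(h, []).append((w, -alpha_w * beta_h0w, dest))
        epsilon_arcs = dict(backoff_arcs)         # label phi -> epsilon, weights unchanged
        return negative_arcs, epsilon_arcs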
67 4.2 Accumulation of higher order values We are summing over all possible histories of length n in equation 8, and the steady state probability calculation outlined in the previous section includes the probability mass for histories h ∉ G, [sent-210, score-0.984]
68 this mass being allocated to the state representing their longest suffix that is explicitly in G. [sent-212, score-0.382]
69 That is the state that would be active when these histories are encountered. [sent-213, score-0.337]
70 Hence, once we have calculated the steady state probabilities for each state in the smoothed model, we only need to sum over states explicitly in the model. [sent-214, score-0.84]
71 Figure 2: Schemata showing failure arc handling: (a) φ removal: add w0 arc (red), delete φ arc; (b) φ replacement: add w arc (red), replace φ by ε. [sent-216, score-0.461]
72 Since we assume that, for every n-gram hw ∈ G, every prefix and suffix is also in G, we know that if h0w ∉ G then there is no history h such that h0w is a suffix of hw and hw ∈ G. [sent-219, score-0.509]
73 Hence, for every state h directly backing off to h0 (order |V|), and for every n-gram arc leaving state h0 (order |V|), some value must be accumulated. [sent-225, score-0.373]
74 This can be seen particularly clearly at the unigram state, which has an arc for every unigram (the size of the vocabulary): for every bigram state (also order of the vocabulary), in the naive algorithm we must look for every possible arc. [sent-226, score-0.447]
75 Importantly, the denominator is calculated by first assuming that all higher order states back off to the current n-gram, then subtracting out the mass associated with those that are already observed at the higher order. [sent-229, score-0.296]
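Equation (8) itself is not reproduced in this extract, so the sketch below is only a paraphrase of the accumulation described in the surrounding sentences (names and the flooring fallback are assumptions): the numerator starts from the target expected frequency of h0w and subtracts the mass already produced where h0w is explicit at a higher order, while the denominator starts by assuming every higher-order state backs off and subtracts those same explicit cases.

    def accumulate_num_den(h0, w, higher_states, target_mass,
                           state_prob, cond_prob, cum_backoff, explicit):
        # higher_states: states whose longest explicit suffix backs off through h0;
        # target_mass: smoothed target expected frequency of the n-gram h0 w;
        # cum_backoff(h, h0): cumulative backoff cost alpha(h, h0);
        # explicit(h, w): True iff hw is explicitly in G.
        num, den = target_mass, 0.0
        for h in higher_states:
            p_h = state_prob(h)
            den += cum_backoff(h, h0) * p_h        # assume h backs off for w ...
            if explicit(h, w):
                num -= cond_prob(w, h) * p_h       # mass already produced at higher order
                den -= cum_backoff(h, h0) * p_h    # ... unless hw is matched explicitly
        # A negative numerator is the invalid case discussed just below; flooring at
        # zero is one possible fallback, not necessarily the paper's remedy.
        return max(num, 0.0), den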
76 Because smoothing is not necessarily constrained across n-gram orders, it is possible that higher-order n-grams could be smoothed less than lower order n-grams, so that the numerator of equation 8 can be less than zero, which is not valid. [sent-232, score-0.607]
77 A value less than zero means that the higher order n-grams will already produce the n-gram more frequently than its smoothed expected frequency. [sent-233, score-0.316]
78 The α(h, h0) backoff weights in the denominator ensure normalization at the higher order states, and change as the n-gram parameters at the current state are modified. [sent-244, score-0.526]
79 Further, the steady state probabilities will change as the model changes. [sent-245, score-0.368]
80 Hence, at each state, we must iterate the calculation of the denominator term: first adjust n-gram weights and normalize; then recalculate backoff weights at higher order states and iterate. [sent-246, score-0.543]
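A rough skeleton of that per-state fixed-point loop, with placeholder callables standing in for the steps named in the text (an illustrative sketch, not the released implementation):

    def constrain_state(h0, words, reestimate, renormalize, update_backoffs,
                        max_rounds=20, tol=1e-8):
        # Re-estimate the n-gram weights at state h0, renormalize them, recompute
        # the backoff weights at the higher-order states that back off to h0,
        # and repeat until the update becomes negligible.
        for _ in range(max_rounds):
            new_weights = {w: reestimate(h0, w) for w in words}
            delta = renormalize(h0, new_weights)   # returns magnitude of the change
            update_backoffs(h0)
            if delta < tol:
                break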
81 After the entire model has been re-estimated, the steady state probability calculation presented in Section 4.1 can be repeated and the whole process iterated. [sent-248, score-0.436]
82 This fixes part of the problem (perplexity does not degrade as much as the Kneser-Ney pruned model in Table 1) but, as argued earlier in this paper, this is not the sole reason for the degradation and the perplexity remains extremely inflated. [sent-272, score-0.687]
83 Note that unigrams in the models are never pruned, hence all models assign probabilities over an identical vocabulary and perplexity is comparable across models. [sent-275, score-0.4]
84 We then apply the marginal distribution constraints, and convert the result back to ARPA format for perplexity evaluation with the SRILM toolkit. [sent-279, score-0.443]
85 For pruned models, the SRILM toolkit does not impose such a requirement, hence explicit arcs are added to the model. [sent-282, score-0.38]
86 Table 3: Perplexity reductions achieved with marginal distribution constraints (MDC) on the heavily pruned models from Chelba et al. [sent-288, score-0.67]
87 The resulting model is equivalent, but with a small number of additional arcs in the explicit representation (around 1% for the most heavily pruned models). [sent-292, score-0.428]
88 Table 3 presents perplexity results for models that result from applying our marginal distribution constraints to the four heavily pruned models from Table 2. [sent-293, score-0.781]
89 In all four cases, we get perplexity reductions of around 10 points. [sent-294, score-0.3]
90 (2010), we produced a mixture model by interpolating (with equal weight) smoothed ngram probabilities from the full (unpruned) absolute discounting, Witten-Bell and Katz models, which share the same set of n-grams. [sent-297, score-0.435]
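The equal-weight mixture mentioned here is a plain linear interpolation of the three smoothed estimates at each shared n-gram; a trivial sketch (the model interface is an assumption, not a toolkit API):

    def mixture_prob(w, h, models, weights=None):
        # Equal-weight linear interpolation of smoothed n-gram estimates.
        weights = weights or [1.0 / len(models)] * len(models)
        return sum(lam * m.prob(w, h) for lam, m in zip(weights, models))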
91 Figure 3 demonstrates the importance of iterating the steady state history calculation. [sent-299, score-0.401]
92 All of the methods achieve perplexity reductions with subsequent iterations. [sent-300, score-0.3]
93 Table 4: Word error rate reductions achieved with marginal distribution constraints (MDC) on the heavily pruned models from Chelba et al. [sent-324, score-0.36]
94 The perplexity reductions that were achieved for these models do translate to real word error rate reductions at both stages of between 0. [sent-327, score-0.462]
95 For pruned Kneser-Ney models, fixing the state marginals with the -prune-history-lm switch reduces the WER versus the original pruned model, but no reductions were achieved vs. [sent-332, score-0.7]
96 Table 5 presents perplexity and WER results for less heavily pruned models, where the pruning thresholds were set to yield approximately 1. [sent-334, score-0.605]
97 Performance of Witten-Bell models with the marginal distribution constraints degrades badly for the larger models, indicating that this method of regularization, unmodified by aggressive pruning, does not provide a well-suited distribution for this sort of optimization. [sent-340, score-0.402]
98 We speculate that this is due to underregularization, having noted some floating point precision issues when allowing the backoff recalculation to run indefinitely. [sent-341, score-0.263]
99 6 Summary and Future Directions The presented method reestimates lower order n-gram model parameters for a given smoothed backoff model, achieving perplexity and WER reductions for many smoothed models. [sent-342, score-1.172]
100 We anticipate a pre-constraint on the baseline smoothing method, that would recognize this problem and adjust the smoothing to ensure that a solution does exist. [sent-346, score-0.322]
wordName wordTfidf (topN-words)
[('smoothed', 0.272), ('backoff', 0.263), ('hik', 0.256), ('vxn', 0.212), ('chelba', 0.205), ('pruned', 0.198), ('histories', 0.189), ('perplexity', 0.183), ('hwg', 0.174), ('pruning', 0.162), ('smoothing', 0.161), ('marginal', 0.149), ('state', 0.148), ('arcs', 0.137), ('arc', 0.137), ('steady', 0.136), ('vn', 0.133), ('reductions', 0.117), ('history', 0.117), ('hwgw', 0.116), ('denominator', 0.115), ('suffix', 0.11), ('equation', 0.101), ('opengrm', 0.096), ('kneser', 0.085), ('automaton', 0.085), ('hikwi', 0.077), ('mdc', 0.077), ('numerator', 0.073), ('calculation', 0.072), ('stationary', 0.07), ('frequency', 0.069), ('kneserney', 0.068), ('discounting', 0.068), ('hw', 0.066), ('longest', 0.065), ('srilm', 0.064), ('unpruned', 0.063), ('riley', 0.063), ('wer', 0.063), ('heavily', 0.062), ('ney', 0.062), ('degradation', 0.059), ('talbot', 0.059), ('siivola', 0.058), ('allauzen', 0.056), ('relative', 0.056), ('constraints', 0.055), ('states', 0.054), ('probabilities', 0.053), ('mass', 0.051), ('stolcke', 0.051), ('failure', 0.05), ('wi', 0.05), ('probability', 0.049), ('ngram', 0.048), ('ngrams', 0.047), ('arpa', 0.047), ('hence', 0.045), ('models', 0.045), ('prefix', 0.044), ('distribution', 0.044), ('roark', 0.044), ('zero', 0.044), ('observed', 0.041), ('broadcast', 0.039), ('must', 0.039), ('marginals', 0.039), ('estimates', 0.039), ('ciprian', 0.039), ('iotyf', 0.039), ('openfst', 0.039), ('seymore', 0.039), ('wfst', 0.039), ('cyril', 0.037), ('bigram', 0.036), ('couple', 0.036), ('back', 0.035), ('acoustic', 0.035), ('unsmoothed', 0.034), ('epsilon', 0.034), ('reformatted', 0.034), ('reestimates', 0.034), ('irreducible', 0.034), ('degrade', 0.033), ('format', 0.032), ('sort', 0.032), ('convenience', 0.032), ('katz', 0.032), ('model', 0.031), ('absolute', 0.031), ('leaving', 0.03), ('allocated', 0.03), ('notational', 0.03), ('bloom', 0.03), ('speech', 0.03), ('calculate', 0.029), ('vocabulary', 0.029), ('every', 0.029), ('explicitly', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000017 325 acl-2013-Smoothed marginal distribution constraints for language modeling
Author: Brian Roark ; Cyril Allauzen ; Michael Riley
Abstract: We present an algorithm for re-estimating parameters of backoff n-gram language models so as to preserve given marginal distributions, along the lines of wellknown Kneser-Ney (1995) smoothing. Unlike Kneser-Ney, our approach is designed to be applied to any given smoothed backoff model, including models that have already been heavily pruned. As a result, the algorithm avoids issues observed when pruning Kneser-Ney models (Siivola et al., 2007; Chelba et al., 2010), while retaining the benefits of such marginal distribution constraints. We present experimental results for heavily pruned backoff ngram models, and demonstrate perplexity and word error rate reductions when used with various baseline smoothing methods. An open-source version of the algorithm has been released as part of the OpenGrm ngram library.1
2 0.17943686 308 acl-2013-Scalable Modified Kneser-Ney Language Model Estimation
Author: Kenneth Heafield ; Ivan Pouzyrevsky ; Jonathan H. Clark ; Philipp Koehn
Abstract: We present an efficient algorithm to estimate large modified Kneser-Ney models including interpolation. Streaming and sorting enables the algorithm to scale to much larger models by using a fixed amount of RAM and variable amount of disk. Using one machine with 140 GB RAM for 2.8 days, we built an unpruned model on 126 billion tokens. Machine translation experiments with this model show improvement of 0.8 BLEU point over constrained systems for the 2013 Workshop on Machine Translation task in three language pairs. Our algorithm is also faster for small models: we estimated a model on 302 million tokens using 7.7% of the RAM and 14.0% of the wall time taken by SRILM. The code is open source as part of KenLM.
3 0.14409813 226 acl-2013-Learning to Prune: Context-Sensitive Pruning for Syntactic MT
Author: Wenduan Xu ; Yue Zhang ; Philip Williams ; Philipp Koehn
Abstract: We present a context-sensitive chart pruning method for CKY-style MT decoding. Source phrases that are unlikely to have aligned target constituents are identified using sequence labellers learned from the parallel corpus, and speed-up is obtained by pruning corresponding chart cells. The proposed method is easy to implement, orthogonal to cube pruning and additive to its pruning power. On a full-scale Englishto-German experiment with a string-totree model, we obtain a speed-up of more than 60% over a strong baseline, with no loss in BLEU.
Author: Heike Adel ; Ngoc Thang Vu ; Tanja Schultz
Abstract: In this paper, we investigate the application of recurrent neural network language models (RNNLM) and factored language models (FLM) to the task of language modeling for Code-Switching speech. We present a way to integrate partof-speech tags (POS) and language information (LID) into these models which leads to significant improvements in terms of perplexity. Furthermore, a comparison between RNNLMs and FLMs and a detailed analysis of perplexities on the different backoff levels are performed. Finally, we show that recurrent neural networks and factored language models can . be combined using linear interpolation to achieve the best performance. The final combined language model provides 37.8% relative improvement in terms of perplexity on the SEAME development set and a relative improvement of 32.7% on the evaluation set compared to the traditional n-gram language model. Index Terms: multilingual speech processing, code switching, language modeling, recurrent neural networks, factored language models
Author: Tze Yuang Chong ; Rafael E. Banchs ; Eng Siong Chng ; Haizhou Li
Abstract: In this paper, we explore the use of distance and co-occurrence information of word-pairs for language modeling. We attempt to extract this information from history-contexts of up to ten words in size, and found it complements well the n-gram model, which inherently suffers from data scarcity in learning long history-contexts. Evaluated on the WSJ corpus, bigram and trigram model perplexity were reduced up to 23.5% and 14.0%, respectively. Compared to the distant bigram, we show that word-pairs can be more effectively modeled in terms of both distance and occurrence. 1
6 0.10601839 113 acl-2013-Derivational Smoothing for Syntactic Distributional Semantics
7 0.087602116 194 acl-2013-Improving Text Simplification Language Modeling Using Unsimplified Text Data
8 0.082084298 111 acl-2013-Density Maximization in Context-Sense Metric Space for All-words WSD
9 0.068017051 374 acl-2013-Using Context Vectors in Improving a Machine Translation System with Bridge Language
10 0.064981043 309 acl-2013-Scaling Semi-supervised Naive Bayes with Feature Marginals
11 0.064463399 362 acl-2013-Turning on the Turbo: Fast Third-Order Non-Projective Turbo Parsers
12 0.063495941 260 acl-2013-Nonconvex Global Optimization for Latent-Variable Models
13 0.061365895 124 acl-2013-Discriminative state tracking for spoken dialog systems
14 0.058317076 10 acl-2013-A Markov Model of Machine Translation using Non-parametric Bayesian Inference
15 0.058264047 26 acl-2013-A Transition-Based Dependency Parser Using a Dynamic Parsing Strategy
16 0.058230076 11 acl-2013-A Multi-Domain Translation Model Framework for Statistical Machine Translation
17 0.058174741 257 acl-2013-Natural Language Models for Predicting Programming Comments
18 0.057238329 224 acl-2013-Learning to Extract International Relations from Political Context
19 0.057204254 133 acl-2013-Efficient Implementation of Beam-Search Incremental Parsers
20 0.055942655 250 acl-2013-Models of Translation Competitions
topicId topicWeight
[(0, 0.159), (1, -0.035), (2, 0.016), (3, -0.004), (4, -0.021), (5, -0.04), (6, 0.059), (7, 0.001), (8, -0.07), (9, -0.01), (10, -0.037), (11, -0.018), (12, -0.031), (13, -0.097), (14, -0.029), (15, -0.053), (16, -0.059), (17, -0.013), (18, 0.036), (19, -0.073), (20, -0.011), (21, -0.012), (22, 0.075), (23, 0.038), (24, 0.015), (25, -0.084), (26, 0.044), (27, 0.018), (28, 0.033), (29, -0.06), (30, 0.007), (31, -0.009), (32, -0.11), (33, 0.091), (34, 0.038), (35, -0.119), (36, 0.145), (37, -0.057), (38, -0.118), (39, -0.026), (40, -0.117), (41, 0.043), (42, -0.208), (43, 0.021), (44, -0.028), (45, 0.069), (46, -0.089), (47, -0.011), (48, 0.076), (49, -0.05)]
simIndex simValue paperId paperTitle
same-paper 1 0.93717313 325 acl-2013-Smoothed marginal distribution constraints for language modeling
Author: Brian Roark ; Cyril Allauzen ; Michael Riley
Abstract: We present an algorithm for re-estimating parameters of backoff n-gram language models so as to preserve given marginal distributions, along the lines of wellknown Kneser-Ney (1995) smoothing. Unlike Kneser-Ney, our approach is designed to be applied to any given smoothed backoff model, including models that have already been heavily pruned. As a result, the algorithm avoids issues observed when pruning Kneser-Ney models (Siivola et al., 2007; Chelba et al., 2010), while retaining the benefits of such marginal distribution constraints. We present experimental results for heavily pruned backoff ngram models, and demonstrate perplexity and word error rate reductions when used with various baseline smoothing methods. An open-source version of the algorithm has been released as part of the OpenGrm ngram library.1
2 0.84372693 308 acl-2013-Scalable Modified Kneser-Ney Language Model Estimation
Author: Kenneth Heafield ; Ivan Pouzyrevsky ; Jonathan H. Clark ; Philipp Koehn
Abstract: We present an efficient algorithm to estimate large modified Kneser-Ney models including interpolation. Streaming and sorting enables the algorithm to scale to much larger models by using a fixed amount of RAM and variable amount of disk. Using one machine with 140 GB RAM for 2.8 days, we built an unpruned model on 126 billion tokens. Machine translation experiments with this model show improvement of 0.8 BLEU point over constrained systems for the 2013 Workshop on Machine Translation task in three language pairs. Our algorithm is also faster for small models: we estimated a model on 302 million tokens using 7.7% of the RAM and 14.0% of the wall time taken by SRILM. The code is open source as part of KenLM.
Author: Tze Yuang Chong ; Rafael E. Banchs ; Eng Siong Chng ; Haizhou Li
Abstract: In this paper, we explore the use of distance and co-occurrence information of word-pairs for language modeling. We attempt to extract this information from history-contexts of up to ten words in size, and found it complements well the n-gram model, which inherently suffers from data scarcity in learning long history-contexts. Evaluated on the WSJ corpus, bigram and trigram model perplexity were reduced up to 23.5% and 14.0%, respectively. Compared to the distant bigram, we show that word-pairs can be more effectively modeled in terms of both distance and occurrence. 1
4 0.64139819 194 acl-2013-Improving Text Simplification Language Modeling Using Unsimplified Text Data
Author: David Kauchak
Abstract: In this paper we examine language modeling for text simplification. Unlike some text-to-text translation tasks, text simplification is a monolingual translation task allowing for text in both the input and output domain to be used for training the language model. We explore the relationship between normal English and simplified English and compare language models trained on varying amounts of text from each. We evaluate the models intrinsically with perplexity and extrinsically on the lexical simplification task from SemEval 2012. We find that a combined model using both simplified and normal English data achieves a 23% improvement in perplexity and a 24% improvement on the lexical simplification task over a model trained only on simple data. Post-hoc analysis shows that the additional unsimplified data provides better coverage for unseen and rare n-grams.
Author: Heike Adel ; Ngoc Thang Vu ; Tanja Schultz
Abstract: In this paper, we investigate the application of recurrent neural network language models (RNNLM) and factored language models (FLM) to the task of language modeling for Code-Switching speech. We present a way to integrate partof-speech tags (POS) and language information (LID) into these models which leads to significant improvements in terms of perplexity. Furthermore, a comparison between RNNLMs and FLMs and a detailed analysis of perplexities on the different backoff levels are performed. Finally, we show that recurrent neural networks and factored language models can . be combined using linear interpolation to achieve the best performance. The final combined language model provides 37.8% relative improvement in terms of perplexity on the SEAME development set and a relative improvement of 32.7% on the evaluation set compared to the traditional n-gram language model. Index Terms: multilingual speech processing, code switching, language modeling, recurrent neural networks, factored language models
6 0.55755073 50 acl-2013-An improved MDL-based compression algorithm for unsupervised word segmentation
7 0.54867589 390 acl-2013-Word surprisal predicts N400 amplitude during reading
8 0.52269095 113 acl-2013-Derivational Smoothing for Syntactic Distributional Semantics
9 0.47770309 3 acl-2013-A Comparison of Techniques to Automatically Identify Complex Words.
10 0.44724867 35 acl-2013-Adaptation Data Selection using Neural Language Models: Experiments in Machine Translation
11 0.44544077 149 acl-2013-Exploring Word Order Universals: a Probabilistic Graphical Model Approach
12 0.4397296 348 acl-2013-The effect of non-tightness on Bayesian estimation of PCFGs
13 0.43753862 322 acl-2013-Simple, readable sub-sentences
14 0.43673179 102 acl-2013-DErivBase: Inducing and Evaluating a Derivational Morphology Resource for German
15 0.42774802 250 acl-2013-Models of Translation Competitions
16 0.42442858 315 acl-2013-Semi-Supervised Semantic Tagging of Conversational Understanding using Markov Topic Regression
17 0.42429608 216 acl-2013-Large tagset labeling using Feed Forward Neural Networks. Case study on Romanian Language
18 0.40807909 39 acl-2013-Addressing Ambiguity in Unsupervised Part-of-Speech Induction with Substitute Vectors
19 0.40631166 226 acl-2013-Learning to Prune: Context-Sensitive Pruning for Syntactic MT
20 0.40215704 257 acl-2013-Natural Language Models for Predicting Programming Comments
topicId topicWeight
[(0, 0.07), (6, 0.037), (11, 0.07), (24, 0.04), (26, 0.075), (28, 0.012), (35, 0.076), (40, 0.01), (42, 0.049), (48, 0.064), (57, 0.242), (70, 0.056), (88, 0.019), (90, 0.03), (95, 0.066)]
simIndex simValue paperId paperTitle
same-paper 1 0.80863988 325 acl-2013-Smoothed marginal distribution constraints for language modeling
Author: Brian Roark ; Cyril Allauzen ; Michael Riley
Abstract: We present an algorithm for re-estimating parameters of backoff n-gram language models so as to preserve given marginal distributions, along the lines of wellknown Kneser-Ney (1995) smoothing. Unlike Kneser-Ney, our approach is designed to be applied to any given smoothed backoff model, including models that have already been heavily pruned. As a result, the algorithm avoids issues observed when pruning Kneser-Ney models (Siivola et al., 2007; Chelba et al., 2010), while retaining the benefits of such marginal distribution constraints. We present experimental results for heavily pruned backoff ngram models, and demonstrate perplexity and word error rate reductions when used with various baseline smoothing methods. An open-source version of the algorithm has been released as part of the OpenGrm ngram library.1
2 0.79953688 230 acl-2013-Lightly Supervised Learning of Procedural Dialog Systems
Author: Svitlana Volkova ; Pallavi Choudhury ; Chris Quirk ; Bill Dolan ; Luke Zettlemoyer
Abstract: Procedural dialog systems can help users achieve a wide range of goals. However, such systems are challenging to build, currently requiring manual engineering of substantial domain-specific task knowledge and dialog management strategies. In this paper, we demonstrate that it is possible to learn procedural dialog systems given only light supervision, of the type that can be provided by non-experts. We consider domains where the required task knowledge exists in textual form (e.g., instructional web pages) and where system builders have access to statements of user intent (e.g., search query logs or dialog interactions). To learn from such textual resources, we describe a novel approach that first automatically extracts task knowledge from instructions, then learns a dialog manager over this task knowledge to provide assistance. Evaluation in a Microsoft Office domain shows that the individual components are highly accurate and can be integrated into a dialog system that provides effective help to users.
3 0.77675164 23 acl-2013-A System for Summarizing Scientific Topics Starting from Keywords
Author: Rahul Jha ; Amjad Abu-Jbara ; Dragomir Radev
Abstract: In this paper, we investigate the problem of automatic generation of scientific surveys starting from keywords provided by a user. We present a system that can take a topic query as input and generate a survey of the topic by first selecting a set of relevant documents, and then selecting relevant sentences from those documents. We discuss the issues of robust evaluation of such systems and describe an evaluation corpus we generated by manually extracting factoids, or information units, from 47 gold standard documents (surveys and tutorials) on seven topics in Natural Language Processing. We have manually annotated 2,625 sentences with these factoids (around 375 sentences per topic) to build an evaluation corpus for this task. We present evaluation results for the performance of our system using this annotated data.
4 0.72655028 272 acl-2013-Paraphrase-Driven Learning for Open Question Answering
Author: Anthony Fader ; Luke Zettlemoyer ; Oren Etzioni
Abstract: We study question answering as a machine learning problem, and induce a function that maps open-domain questions to queries over a database of web extractions. Given a large, community-authored, question-paraphrase corpus, we demonstrate that it is possible to learn a semantic lexicon and linear ranking function without manually annotating questions. Our approach automatically generalizes a seed lexicon and includes a scalable, parallelized perceptron parameter estimation scheme. Experiments show that our approach more than quadruples the recall of the seed lexicon, with only an 8% loss in precision.
5 0.65676844 358 acl-2013-Transition-based Dependency Parsing with Selectional Branching
Author: Jinho D. Choi ; Andrew McCallum
Abstract: We present a novel approach, called selectional branching, which uses confidence estimates to decide when to employ a beam, providing the accuracy of beam search at speeds close to a greedy transition-based dependency parsing approach. Selectional branching is guaranteed to perform a fewer number of transitions than beam search yet performs as accurately. We also present a new transition-based dependency parsing algorithm that gives a complexity of O(n) for projective parsing and an expected linear time speed for non-projective parsing. With the standard setup, our parser shows an unlabeled attachment score of 92.96% and a parsing speed of 9 milliseconds per sentence, which is faster and more accurate than the current state-of-the-art transitionbased parser that uses beam search.
6 0.61324203 276 acl-2013-Part-of-Speech Induction in Dependency Trees for Statistical Machine Translation
7 0.61316043 7 acl-2013-A Lattice-based Framework for Joint Chinese Word Segmentation, POS Tagging and Parsing
8 0.61295581 275 acl-2013-Parsing with Compositional Vector Grammars
9 0.60895497 369 acl-2013-Unsupervised Consonant-Vowel Prediction over Hundreds of Languages
10 0.6072194 318 acl-2013-Sentiment Relevance
11 0.60528368 164 acl-2013-FudanNLP: A Toolkit for Chinese Natural Language Processing
12 0.60475326 173 acl-2013-Graph-based Semi-Supervised Model for Joint Chinese Word Segmentation and Part-of-Speech Tagging
13 0.60465884 343 acl-2013-The Effect of Higher-Order Dependency Features in Discriminative Phrase-Structure Parsing
14 0.60422993 70 acl-2013-Bilingually-Guided Monolingual Dependency Grammar Induction
15 0.60259134 82 acl-2013-Co-regularizing character-based and word-based models for semi-supervised Chinese word segmentation
16 0.60252309 83 acl-2013-Collective Annotation of Linguistic Resources: Basic Principles and a Formal Model
17 0.6015895 222 acl-2013-Learning Semantic Textual Similarity with Structural Representations
18 0.60121703 17 acl-2013-A Random Walk Approach to Selectional Preferences Based on Preference Ranking and Propagation
19 0.60041553 212 acl-2013-Language-Independent Discriminative Parsing of Temporal Expressions
20 0.60010606 132 acl-2013-Easy-First POS Tagging and Dependency Parsing with Beam Search