acl acl2011 acl2011-142 knowledge-graph by maker-knowledge-mining

142 acl-2011-Generalized Interpolation in Decision Tree LM


Source: pdf

Author: Denis Filimonov ; Mary Harper

Abstract: In the face of sparsity, statistical models are often interpolated with lower order (backoff) models, particularly in Language Modeling. In this paper, we argue that there is a relation between the higher order and the backoff model that must be satisfied in order for the interpolation to be effective. We show that in n-gram models, the relation is trivially held, but in models that allow arbitrary clustering of context (such as decision tree models), this relation is generally not satisfied. Based on this insight, we also propose a generalization of linear interpolation which significantly improves the performance of a decision tree language model.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: In the face of sparsity, statistical models are often interpolated with lower order (backoff) models, particularly in Language Modeling. [sent-3, score-0.228]

2 In this paper, we argue that there is a relation between the higher order and the backoff model that must be satisfied in order for the interpolation to be effective. [sent-4, score-0.788]

3 We show that in n-gram models, the relation is trivially held, but in models that allow arbitrary clustering of context (such as decision tree models), this relation is generally not satisfied. [sent-5, score-0.547]

4 Based on this insight, we also propose a generalization of linear interpolation which significantly improves the performance of a decision tree language model. [sent-6, score-0.606]

5 1 Introduction A prominent use case for Language Models (LMs) in NLP applications such as Automatic Speech Recognition (ASR) and Machine Translation (MT) is selection of the most fluent word sequence among multiple hypotheses. [sent-7, score-0.034]

6 w_1^m ≡ w_1 ... w_m, assuming that higher probability corresponds to more fluent hypotheses. [sent-11, score-0.034]

7 LMs are often represented in the following generative form: p(w_1^m) = ∏_{i=1}^m p(w_i | w_1^{i-1}). In the following discussion, we will refer to the function p(w_i | w_1^{i-1}) as a language model. [sent-12, score-0.059]

8 Mary Harper, Department of Computer Science, University of Maryland, College Park; mharper@umd.edu. [sent-13, score-0.081]

9 The context space is still far too large; therefore, the models are recursively smoothed using lower order distributions. [sent-15, score-0.218]

10 For instance, in a widely used n-gram LM, the probabilities are estimated as follows: p̃(w_i | w_{i-n+1}^{i-1}) = ρ(w_i | w_{i-n+1}^{i-1}) + γ(w_{i-n+1}^{i-1}) · p̃(w_i | w_{i-n+2}^{i-1}) (1) where ρ is a discounted probability¹. [sent-16, score-0.103]
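
A minimal sketch of how the recursion in Eq. 1 can be computed, assuming the discounted probabilities ρ and backoff weights γ have already been estimated and stored in lookup tables; the table names and toy numbers below are hypothetical, not from the paper:

```python
# Sketch of the recursive backoff interpolation in Eq. 1.
# rho[(context, word)] holds discounted probabilities, gamma[context] holds
# backoff weights; contexts are tuples ordered oldest word first.

def backoff_prob(word, context, rho, gamma, unigram):
    """p~(w | context) = rho(w | context) + gamma(context) * p~(w | shorter context)."""
    if len(context) == 0:
        return unigram.get(word, 0.0)
    discounted = rho.get((context, word), 0.0)
    weight = gamma.get(context, 1.0)  # unseen context: back off with full weight
    # context[1:] drops the earliest word, i.e. w_{i-n+1}^{i-1} -> w_{i-n+2}^{i-1}
    return discounted + weight * backoff_prob(word, context[1:], rho, gamma, unigram)

# Toy trigram example with context ("on", "may").
rho = {(("on", "may"), "first"): 0.2, (("may",), "first"): 0.1}
gamma = {("on", "may"): 0.5, ("may",): 0.4}
unigram = {"first": 0.01}
print(backoff_prob("first", ("on", "may"), rho, gamma, unigram))
```

Dropping the earliest context word at each step is exactly the fixed backoff path that the paper later contrasts with arbitrary context clustering.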

11 In addition to n-gram models, there are many other ways to estimate probability distributions p(w_i | w_{i-n+1}^{i-1}); in this work, we are particularly interested in models involving decision trees (DTs). [sent-17, score-0.218]

12 As in n-gram models, DT models also often utilize interpolation with lower order models; however, there are issues concerning the interpolation which arise from the fact that decision trees permit arbitrary clustering of context, and these issues are the main subject of this paper. [sent-18, score-1.085]

13 1We refer the reader to (Chen and Goodman, 1999) for a survey of the discounting methods for n-gram models. [sent-19, score-0.131]

14 2 Decision Trees: The vast context space in a language model mandates the use of context clustering in some form. [sent-22, score-0.181]

15 In n-gram models, the clustering can be represented as a k-ary decision tree of depth n − 1, where k is the size of the vocabulary. [sent-23, score-0.358]

16 Note that this is a very constrained form of a decision tree, and is probably suboptimal. [sent-24, score-0.161]
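
To make the k-ary-tree view in items 15–16 concrete, here is a small sketch in which looking up an n-gram context is the same as descending a tree on the last n − 1 words; the nested dictionary and cluster names are invented for illustration:

```python
# Toy "k-ary tree of depth n-1" for a trigram model: each level branches on one
# context word, most recent word first, and each leaf is a context cluster.
tree = {"may": {"on": "cluster_on_may", "in": "cluster_in_may"},
        "june": {"on": "cluster_on_june"}}

def ngram_cluster(context, tree, depth=2):
    """Walk the tree on the last `depth` words of the context."""
    node = tree
    for word in reversed(context[-depth:]):
        node = node[word]
    return node

print(ngram_cluster(("we", "met", "on", "may"), tree))  # -> cluster_on_may
```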

17 Indeed, it is likely that some of the clusters predict very similar distributions of words, and the model would benefit from merging them. [sent-25, score-0.065]

18 Therefore, it is reasonable to believe that arbitrary (i.e., [sent-26, score-0.048]

19 unconstrained) context clustering such as a decision tree should be able to outperform the n-gram model. [sent-28, score-0.034]

20 A decision tree provides us with a clustering function Φ(w_{i-n+1}^{i-1}) → {Φ_1, . . .

21 Another advantage of using decision trees is the ease of adding parameters such as syntactic tags: p(w_1^m) = Σ_{t_1^m} p(w_1^m t_1^m) = Σ_{t_1^m} ∏_{i=1}^m p(w_i t_i | w_1^{i-1} t_1^{i-1}) ≈ [sent-33, score-0.216]

22 Σ_{t_1^m} ∏_{i=1}^m p(w_i t_i | Φ(w_{i-n+1}^{i-1} t_{i-n+1}^{i-1})) (3) In this case, the decision tree would cluster the context space w_{i-n+1}^{i-1} t_{i-n+1}^{i-1} based on information theoretic metrics, without utilizing heuristics for which order the context attributes are to be backed off (cf. [sent-42, score-0.468]

23 In subsequent discussion, we will write equations for word models (Eq. [sent-45, score-0.044]

24 2), but they are equally applicable to joint models (Eq. [sent-46, score-0.044]

25 3 Backoff Property: Let us rewrite the interpolation Eq. [sent-48, score-0.322]

26 1 in a more generic way: p̃(w_i | w_1^{i-1}) = ρ_n(w_i | Φ_n(w_1^{i-1})) + γ(Φ_n(w_1^{i-1})) · p̃(w_i | BO_{n-1}(w_1^{i-1})) (4) where ρ_n is a discounted distribution, Φ_n is a clustering function of order n, and γ(Φ_n(w_1^{i-1})) is the backoff weight chosen to normalize the distribution. [sent-49, score-0.538]

27 BO_{n-1} is the backoff clustering function of order n − 1, representing a reduction of context size. [sent-50, score-0.515]

28 Φ_n(w_1^{i-1}) is the set of word sequences where the last n − 1 words are w_{i-n+1}^{i-1}; similarly, BO_{n-1}(w_1^{i-1}) is the set of sequences ending with w_{i-n+2}^{i-1}. [sent-52, score-0.146]

29 In the case of a decision tree model, the same backoff function is typically used, but the clustering function can be arbitrary. [sent-53, score-0.352]

30 4 is that the backoff context BO_{n-1}(w_1^{i-1}) allows for more robust (but less informed) probability estimation than the context cluster Φ_n(w_1^{i-1}). [sent-55, score-0.54]

31 More precisely: ∀ w_1^{i-1}, W : W ∈ Φ_n(w_1^{i-1}) ⇒ W ∈ BO_{n-1}(w_1^{i-1}) (5) that is, every word sequence W that belongs to a context cluster Φ_n(w_1^{i-1}) belongs to the same backoff cluster BO_{n-1}(w_1^{i-1}) (hence has the same backoff distribution). [sent-56, score-0.908]
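
Property 5 can be stated operationally: within one context cluster of the higher order model, all histories must share a single backoff cluster. The sketch below checks this for arbitrary clustering functions; the functions and toy histories are hypothetical, and the second call mirrors the ({on}, {may, june}) example discussed later in this summary:

```python
# Check of Property 5: every history in a cluster Phi_n(h) must map to one and
# the same backoff cluster BO_{n-1}(h). phi_n and bo_n_minus_1 are hypothetical
# callables taking a history as a tuple of words.

def satisfies_property5(histories, phi_n, bo_n_minus_1):
    """Return True iff all histories sharing a Phi_n cluster share a BO cluster."""
    backoff_of_cluster = {}
    for h in histories:
        cluster = phi_n(h)
        backoff = bo_n_minus_1(h)
        if backoff_of_cluster.setdefault(cluster, backoff) != backoff:
            return False  # two histories in one cluster back off differently
    return True

histories = [("we", "met", "on", "may"), ("they", "left", "on", "june")]

# n-gram style clustering (cluster by last two words, back off to last word)
# satisfies the property trivially:
print(satisfies_property5(histories,
                          phi_n=lambda h: h[-2:],
                          bo_n_minus_1=lambda h: h[-1:]))
# An arbitrary cluster that merges "may" and "june" contexts violates it:
print(satisfies_property5(histories,
                          phi_n=lambda h: ("on", frozenset({"may", "june"})),
                          bo_n_minus_1=lambda h: h[-1:]))
```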

32 For n-gram models, Property 5 trivially holds since BO_{n-1}(w_1^{i-1}) and Φ_n(w_1^{i-1}) are defined as sets of sequences ending with w_{i-n+2}^{i-1} and w_{i-n+1}^{i-1}, with the former clearly being a superset of the latter. [sent-57, score-0.114]

33 Let us consider what happens when we have two context sequences W and W′ that belong to the same cluster Φ_n(W) = Φ_n(W′) but different backoff clusters BO_{n-1}(W) ≠ BO_{n-1}(W′). [sent-61, score-0.527]

34 For example: suppose we have Φ(w_{i-2} w_{i-1}) = ({on}, {may, june}) and two corresponding backoff clusters: BO′ = ({may}) and BO′′ = ({june}). [sent-62, score-0.315]

35 Therefore we have much less faith in p̃(w_i | BO′) than in p̃(w_i | BO′′) and would like a much smaller weight γ assigned to BO′, but it is not possible in the backoff scheme in Eq. [sent-65, score-0.342]

36 4, thus we will have to settle on a compromise value of γ, resulting in suboptimal performance. [sent-66, score-0.079]

37 We would expect this effect to be more pronounced in higher order models, because violations of Property 5 are less frequent in lower order models. [sent-67, score-0.121]

38 Indeed, in a 2-gram model, the property is never violated since its backoff, unigram, contains the entire context in one cluster. [sent-68, score-0.169]

39 The 3-gram example above, Φ(w_{i-2} w_{i-1}) = ({on}, {may, june}), although illustrative, is not likely to occur because may in the w_{i-1} position will likely be split from june very early on, since it is very informative about the following word. [sent-69, score-0.034]

40 Thus, arbitrary clustering (an advantage of DTs) leads to violation of Property 5, which, we argue, may lead to a degradation of performance if backoff interpolation Eq. [sent-71, score-0.526]

41 In the next section, we generalize the interpolation scheme which, as we show in Section 6, allows us to find a better solution in the face of the violation of Property 5. [sent-73, score-0.424]

42 p_n(w_i | φ_n) is the probability distribution at the cluster φ_n in the tree of order n. [sent-75, score-0.236]

43 This interpolation method is particularly useful as, unlike count-based discounting methods (e. [sent-76, score-0.377]

44 , Kneser-Ney), it can be applied to already smooth distributions p_n.² [sent-78, score-0.034]

45 λ̂_n(φ_n), λ̂_{n-1}(φ_{n-1}) ²In decision trees, the distribution at a cluster (leaf) is often recursively interpolated with its parent node, e. [sent-83, score-0.371]

46 , where (1 − λ) is the weight assigned to the higher order model. [sent-89, score-0.054]
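
The footnote's recursive leaf-to-parent smoothing can be sketched as follows; the node structure, the fixed weight, and the convention of placing λ on the node's own distribution are assumptions for illustration only (the paper's exact weighting convention is not recoverable from this summary):

```python
# Sketch of recursively interpolating a leaf distribution with its ancestors.
from types import SimpleNamespace

root = SimpleNamespace(dist={"first": 0.01}, parent=None)   # hypothetical tree
leaf = SimpleNamespace(dist={"first": 0.25}, parent=root)

def smoothed(node, word, lam=0.7):
    """p~(w | node) = lam * p(w | node) + (1 - lam) * p~(w | parent)."""
    p_here = node.dist.get(word, 0.0)
    if node.parent is None:
        return p_here
    return lam * p_here + (1.0 - lam) * smoothed(node.parent, word, lam)

print(smoothed(leaf, "first"))   # 0.7*0.25 + 0.3*0.01 = 0.178
```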

47 Ideally we should be able to assign a different set of interpolation weights for every eligible combination of clusters φn, φn−1 , . [sent-90, score-0.382]

48 However, not only is the number of such combinations extremely large, but many of them will not be observed in the train- ing data, making parameter estimation cumbersome. [sent-94, score-0.04]

49 Therefore, we propose the following parameterization for the interpolation of decision tree models: p̃_n(w_i | w_{i-n+1}^{i-1}) = ( Σ_{m=1}^n λ_m(φ_m) · p_m(w_i | φ_m) ) / ( Σ_{m=1}^n λ_m(φ_m) ) (8) Note that this parameterization has the same number of parameters as in Eq. [sent-95, score-0.821]
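
A small sketch of Eq. 8 as a computation, assuming each order-m tree has already mapped the history to a cluster φ_m and provides a distribution and a non-negative weight for it; all table names and numbers are invented:

```python
# Generalized interpolation of Eq. 8: a normalized weighted mixture of the
# distributions at the clusters of every order, with one weight per cluster.

def generalized_interpolation(word, clusters, lambdas, cluster_probs):
    """p~_n(w | h) = sum_m lambda_m(phi_m) p_m(w | phi_m) / sum_m lambda_m(phi_m).

    clusters[m] is the cluster phi_m the history falls into in the order-m tree."""
    numerator = 0.0
    denominator = 0.0
    for m, phi_m in clusters.items():
        weight = lambdas[m][phi_m]
        numerator += weight * cluster_probs[m][phi_m].get(word, 0.0)
        denominator += weight
    return numerator / denominator

# Toy example with a bigram tree (m=2) and a unigram "tree" (m=1).
clusters = {1: "root", 2: "cluster_on_may"}
lambdas = {1: {"root": 0.3}, 2: {"cluster_on_may": 1.2}}   # need not sum to 1
cluster_probs = {1: {"root": {"first": 0.01}},
                 2: {"cluster_on_may": {"first": 0.25}}}
print(generalized_interpolation("first", clusters, lambdas, cluster_probs))
```

Because the weights need not sum to one, the denominator renormalizes the mixture, which is where the extra degrees of freedom mentioned in item 50 come from.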

50 7 (one per cluster in every tree), but the number of degrees of freedom is larger because the parameters are not constrained to sum to 1, hence the denominator. [sent-96, score-0.151]

51 8, there is no explicit distinction between higher order and backoff models. [sent-98, score-0.356]

52 Indeed, it acknowledges that lower order models are not backoff models when Property 5 is not satisfied. [sent-99, score-0.483]

53 Therefore, the new parameterization can be thought of as a generalization of linear interpolation. [sent-103, score-0.195]

54 Let φ_m ≡ Φ_m(w_1^{i-1}) denote the cluster of model order m to which the sequence w_1^{i-1} belongs. [sent-110, score-0.13]

55 The lowest order distribution p_1 is not interpolated with anything, hence: p̃_1(w_i | φ_1) = λ_1 p_1(w_i | φ_1), with Λ_1 ≡ λ_1(φ_1). Now the induction step. [sent-111, score-0.141]

56 From Property 5, it follows that φ_m ⊂ φ_{m-1}; thus, for all sequences w_1^n ∈ φ_m: [sent-112, score-0.044]

57 Percentage numbers in parentheses denote the reduction of perplexity relative to the lower order model of the same type. [sent-136, score-0.128]

58 “Word-tree” and “syntactic” refer to DT models estimated using words only (Eq. [sent-137, score-0.109]

59 Λ_m p̃_m(w_i | φ_m), where λ̂_m ≡ λ_m / Λ_m. Note that the last transformation is because φ_m ⊂ φ_{m-1}; had it not been the case, p̃_m would depend on the combination of φ_m and φ_{m-1} and require multiple parameters to be represented on its entire domain w_1^n ∈ φ_m. [sent-144, score-0.032]

60 8) Thus, we have constructed p̃_n(w_i | φ_n) using the same recursive representation as in Eq. [sent-147, score-0.03]

61 6, which proves that the standard linear interpolation is a special case of the new interpolation scheme, which occurs when the backoff Property 5 holds. [sent-148, score-0.992]
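
The induction argument sketched in items 54–61 can be checked numerically: with cumulative normalizers Λ_m = λ_1 + ... + λ_m and λ̂_m = λ_m / Λ_m, the recursive interpolation of Eq. 6 reproduces Eq. 8. The toy weights and probabilities below are arbitrary; this is a sanity check of the algebra, not the paper's experiment:

```python
# Numerical check that Eq. 8 collapses to recursive linear interpolation (Eq. 6)
# when lambda_hat_m = lambda_m / Lambda_m with Lambda_m = lambda_1 + ... + lambda_m.

lambdas = [0.5, 1.0, 2.0]   # lambda_1..lambda_3, one per order (toy values)
probs = [0.02, 0.10, 0.30]  # p_1..p_3(w | phi_m) for one fixed word (toy values)

# Eq. 8: normalized weighted sum over all orders.
eq8 = sum(l * p for l, p in zip(lambdas, probs)) / sum(lambdas)

# Eq. 6: recursive interpolation, lowest order is not interpolated with anything.
interp = probs[0]
cumulative = lambdas[0]
for l, p in zip(lambdas[1:], probs[1:]):
    cumulative += l
    lam_hat = l / cumulative
    interp = lam_hat * p + (1.0 - lam_hat) * interp

print(eq8, interp)          # the two values agree
```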

62 The text was converted into speech-like form, namely numbers and abbreviations were verbalized, text was downcased, punctuation was removed, and contractions and possessives were joined with the previous word (i. [sent-150, score-0.087]

63 For syntactic modeling, we used tags comprised of POS tags of the word and its head, as in (Filimonov and Harper, 2009). [sent-153, score-0.06]

64 Parsing of the text for tag extraction occurred after verbalization of numbers and abbreviations but before any further processing; we used an appropriately trained latent variable PCFG parser (Huang and Harper, 2009). [sent-154, score-0.058]

65 For reference, we include n-gram models with Jelinek-Mercer and modified interpolated KN discounting. [sent-155, score-0.117]

66 All models use the same vocabulary of approximately 50k words. [sent-156, score-0.044]

67 We implemented four decision tree models³: two using the interpolation method of (Eq. [sent-157, score-0.537]

68 6) and two based on the generalized interpolation (Eq. [sent-158, score-0.356]

69 Parameters λ were estimated using L-BFGS to minimize the entropy on a heldout set. [sent-160, score-0.032]
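
Item 69 describes the weight estimation only at a high level; the sketch below shows one plausible setup, simplified to one weight per order rather than one per cluster, using SciPy's L-BFGS-B to minimize held-out entropy. The held-out numbers are invented and this is not the paper's code:

```python
# Estimate non-negative interpolation weights by minimizing held-out entropy
# (negative mean log-probability) with L-BFGS, optimizing in log space.
import numpy as np
from scipy.optimize import minimize

# Each row: p_m(w_i | phi_m) for orders m = 1..3 at one held-out position (toy numbers).
heldout = np.array([[0.01, 0.05, 0.20],
                    [0.02, 0.10, 0.02],
                    [0.01, 0.01, 0.40]])

def heldout_entropy(log_lambdas):
    lam = np.exp(log_lambdas)             # keep weights positive
    mixed = heldout @ lam / lam.sum()     # Eq. 8 for every held-out event at once
    return -np.mean(np.log(mixed))

result = minimize(heldout_entropy, x0=np.zeros(3), method="L-BFGS-B")
print(np.exp(result.x))                   # estimated lambda_1..lambda_3
```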

70 In order to eliminate the influence of all factors other than the interpolation, we used the same decision trees. [sent-161, score-0.177]

71 The perplexity results on WSJ section 23 are presented in Table 1. [sent-162, score-0.048]

72 As we have predicted, the effect of the new interpolation becomes apparent at the 4-gram order, when Property 5 is most frequently violated. [sent-163, score-0.322]

73 Note that we observe similar patterns for both word-tree and syntactic models, with syntactic models outperforming their word-tree counterparts. [sent-164, score-0.044]

74 We believe that (Xu and Jelinek, 2004) also suffers from violation of Property 5; however, since they use a heuristic method⁴ to set backoff weights, it is difficult to ascertain the extent. [sent-165, score-0.386]

75 7 Conclusion: The main contribution of this paper is the insight that in the standard recursive backoff there is an implied relation between the backoff and the higher order models, which is essential for adequate performance. [sent-166, score-0.762]

76 When this relation is not satisfied, other interpolation methods should be employed; hence, we propose a generalization of linear interpolation that significantly outperforms the standard form in such a scenario. [sent-167, score-0.782]

77 3We refer the reader to (Filimonov and Harper, 2009) for details on the tree construction algorithm. [sent-168, score-0.155]

78 4The higher order model was discounted according to KN discounting, while the lower order model could be either a lower order DT (forest) model, or a standard n-gram model, with the former performing slightly better. [sent-169, score-0.272]

79 Interpolated estimation of Markov source parameters from sparse data. [sent-199, score-0.072]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('wii', 0.46), ('wi', 0.43), ('interpolation', 0.322), ('backoff', 0.315), ('bon', 0.261), ('filimonov', 0.173), ('harper', 0.142), ('decision', 0.136), ('parameterization', 0.126), ('property', 0.121), ('cluster', 0.089), ('clustering', 0.085), ('tree', 0.079), ('pm', 0.075), ('interpolated', 0.073), ('discounted', 0.071), ('violation', 0.071), ('dt', 0.068), ('mpm', 0.065), ('witi', 0.065), ('xtm', 0.065), ('pn', 0.065), ('mx', 0.063), ('mm', 0.061), ('jelinek', 0.058), ('lms', 0.058), ('dts', 0.058), ('discounting', 0.055), ('bahl', 0.053), ('context', 0.048), ('arbitrary', 0.048), ('perplexity', 0.048), ('trees', 0.048), ('recursively', 0.046), ('xn', 0.046), ('models', 0.044), ('sequences', 0.044), ('kn', 0.044), ('reader', 0.043), ('denis', 0.041), ('trivially', 0.041), ('order', 0.041), ('estimation', 0.04), ('wsj', 0.039), ('lower', 0.039), ('indeed', 0.039), ('frederick', 0.038), ('mary', 0.037), ('satisfied', 0.036), ('generalization', 0.036), ('fluent', 0.034), ('generalized', 0.034), ('distributions', 0.034), ('june', 0.034), ('abbreviations', 0.033), ('refer', 0.033), ('linear', 0.033), ('relation', 0.033), ('parameters', 0.032), ('estimated', 0.032), ('face', 0.031), ('clusters', 0.031), ('tags', 0.03), ('recursive', 0.03), ('hence', 0.03), ('pcfg', 0.029), ('ending', 0.029), ('th', 0.029), ('eligible', 0.029), ('wave', 0.029), ('kneserney', 0.029), ('settle', 0.029), ('sai', 0.029), ('dofs', 0.029), ('ient', 0.029), ('jelinekmercer', 0.029), ('lalit', 0.029), ('oafr', 0.029), ('insight', 0.028), ('distribution', 0.027), ('ll', 0.027), ('unt', 0.027), ('joined', 0.027), ('faith', 0.027), ('possessives', 0.027), ('backed', 0.027), ('heeman', 0.027), ('necessitating', 0.027), ('oefn', 0.027), ('souza', 0.027), ('xu', 0.026), ('peter', 0.026), ('function', 0.026), ('belongs', 0.026), ('compromise', 0.025), ('nat', 0.025), ('ios', 0.025), ('verbalization', 0.025), ('umd', 0.025), ('suboptimal', 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999982 142 acl-2011-Generalized Interpolation in Decision Tree LM

Author: Denis Filimonov ; Mary Harper

Abstract: In the face of sparsity, statistical models are often interpolated with lower order (backoff) models, particularly in Language Modeling. In this paper, we argue that there is a relation between the higher order and the backoff model that must be satisfied in order for the interpolation to be effective. We show that in n-gram models, the relation is trivially held, but in models that allow arbitrary clustering of context (such as decision tree models), this relation is generally not satisfied. Based on this insight, we also propose a generalization of linear interpolation which significantly improves the performance of a decision tree language model.

2 0.15270455 175 acl-2011-Integrating history-length interpolation and classes in language modeling

Author: Hinrich Schutze

Abstract: Building on earlier work that integrates different factors in language modeling, we view (i) backing off to a shorter history and (ii) class-based generalization as two complementary mechanisms of using a larger equivalence class for prediction when the default equivalence class is too small for reliable estimation. This view entails that the classes in a language model should be learned from rare events only and should be preferably applied to rare events. We construct such a model and show that both training on rare events and preferable application to rare events improve perplexity when compared to a simple direct interpolation of class-based with standard language models.

3 0.14681113 210 acl-2011-Lexicographic Semirings for Exact Automata Encoding of Sequence Models

Author: Brian Roark ; Richard Sproat ; Izhak Shafran

Abstract: In this paper we introduce a novel use of the lexicographic semiring and motivate its use for speech and language processing tasks. We prove that the semiring allows for exact encoding of backoff models with epsilon transitions. This allows for off-line optimization of exact models represented as large weighted finite-state transducers in contrast to implicit (on-line) failure transition representations. We present preliminary empirical results demonstrating that, even in simple intersection scenarios amenable to the use of failure transitions, the use of the more powerful lexicographic semiring is competitive in terms of time of intersection.

4 0.10295612 161 acl-2011-Identifying Word Translations from Comparable Corpora Using Latent Topic Models

Author: Ivan Vulic ; Wim De Smet ; Marie-Francine Moens

Abstract: A topic model outputs a set of multinomial distributions over words for each topic. In this paper, we investigate the value of bilingual topic models, i.e., a bilingual Latent Dirichlet Allocation model for finding translations of terms in comparable corpora without using any linguistic resources. Experiments on a document-aligned English-Italian Wikipedia corpus confirm that the developed methods which only use knowledge from word-topic distributions outperform methods based on similarity measures in the original word-document space. The best results, obtained by combining knowledge from wordtopic distributions with similarity measures in the original space, are also reported.

5 0.10043386 24 acl-2011-A Scalable Probabilistic Classifier for Language Modeling

Author: Joel Lang

Abstract: We present a novel probabilistic classifier, which scales well to problems that involve a large number of classes and require training on large datasets. A prominent example of such a problem is language modeling. Our classifier is based on the assumption that each feature is associated with a predictive strength, which quantifies how well the feature can predict the class by itself. The predictions of individual features can then be combined according to their predictive strength, resulting in a model, whose parameters can be reliably and efficiently estimated. We show that a generative language model based on our classifier consistently matches modified Kneser-Ney smoothing and can outperform it if sufficiently rich features are incorporated.

6 0.093214348 163 acl-2011-Improved Modeling of Out-Of-Vocabulary Words Using Morphological Classes

7 0.087339006 14 acl-2011-A Hierarchical Model of Web Summaries

8 0.085067384 29 acl-2011-A Word-Class Approach to Labeling PSCFG Rules for Machine Translation

9 0.081546105 116 acl-2011-Enhancing Language Models in Statistical Machine Translation with Backward N-grams and Mutual Information Triggers

10 0.066926695 39 acl-2011-An Ensemble Model that Combines Syntactic and Semantic Clustering for Discriminative Dependency Parsing

11 0.066712297 82 acl-2011-Content Models with Attitude

12 0.063984618 38 acl-2011-An Empirical Investigation of Discounting in Cross-Domain Language Models

13 0.060860179 233 acl-2011-On-line Language Model Biasing for Statistical Machine Translation

14 0.060438711 171 acl-2011-Incremental Syntactic Language Models for Phrase-based Translation

15 0.059185479 277 acl-2011-Semi-supervised Relation Extraction with Large-scale Word Clustering

16 0.056066461 324 acl-2011-Unsupervised Semantic Role Induction via Split-Merge Clustering

17 0.054317277 204 acl-2011-Learning Word Vectors for Sentiment Analysis

18 0.053721007 15 acl-2011-A Hierarchical Pitman-Yor Process HMM for Unsupervised Part of Speech Induction

19 0.0530379 10 acl-2011-A Discriminative Model for Joint Morphological Disambiguation and Dependency Parsing

20 0.04730751 17 acl-2011-A Large Scale Distributed Syntactic, Semantic and Lexical Language Model for Machine Translation


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.133), (1, -0.014), (2, -0.02), (3, -0.03), (4, -0.003), (5, -0.031), (6, -0.039), (7, 0.036), (8, -0.012), (9, 0.068), (10, 0.01), (11, -0.021), (12, 0.031), (13, 0.091), (14, 0.05), (15, -0.022), (16, -0.151), (17, 0.04), (18, 0.015), (19, 0.01), (20, 0.091), (21, -0.106), (22, 0.084), (23, -0.147), (24, -0.0), (25, -0.088), (26, 0.076), (27, -0.01), (28, -0.044), (29, -0.026), (30, -0.001), (31, -0.094), (32, -0.062), (33, -0.031), (34, 0.065), (35, -0.066), (36, 0.009), (37, -0.033), (38, 0.06), (39, -0.088), (40, -0.009), (41, -0.112), (42, 0.112), (43, 0.015), (44, -0.023), (45, 0.023), (46, -0.02), (47, -0.003), (48, 0.041), (49, 0.125)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96314806 142 acl-2011-Generalized Interpolation in Decision Tree LM

Author: Denis Filimonov ; Mary Harper

Abstract: In the face of sparsity, statistical models are often interpolated with lower order (backoff) models, particularly in Language Modeling. In this paper, we argue that there is a relation between the higher order and the backoff model that must be satisfied in order for the interpolation to be effective. We show that in n-gram models, the relation is trivially held, but in models that allow arbitrary clustering of context (such as decision tree models), this relation is generally not satisfied. Based on this insight, we also propose a generalization of linear interpolation which significantly improves the performance of a decision tree language model.

2 0.81499642 175 acl-2011-Integrating history-length interpolation and classes in language modeling

Author: Hinrich Schutze

Abstract: Building on earlier work that integrates different factors in language modeling, we view (i) backing off to a shorter history and (ii) class-based generalization as two complementary mechanisms of using a larger equivalence class for prediction when the default equivalence class is too small for reliable estimation. This view entails that the classes in a language model should be learned from rare events only and should be preferably applied to rare events. We construct such a model and show that both training on rare events and preferable application to rare events improve perplexity when compared to a simple direct interpolation of class-based with standard language models.

3 0.69080067 38 acl-2011-An Empirical Investigation of Discounting in Cross-Domain Language Models

Author: Greg Durrett ; Dan Klein

Abstract: We investigate the empirical behavior of ngram discounts within and across domains. When a language model is trained and evaluated on two corpora from exactly the same domain, discounts are roughly constant, matching the assumptions of modified Kneser-Ney LMs. However, when training and test corpora diverge, the empirical discount grows essentially as a linear function of the n-gram count. We adapt a Kneser-Ney language model to incorporate such growing discounts, resulting in perplexity improvements over modified Kneser-Ney and Jelinek-Mercer baselines.

4 0.67497098 210 acl-2011-Lexicographic Semirings for Exact Automata Encoding of Sequence Models

Author: Brian Roark ; Richard Sproat ; Izhak Shafran

Abstract: In this paper we introduce a novel use of the lexicographic semiring and motivate its use for speech and language processing tasks. We prove that the semiring allows for exact encoding of backoff models with epsilon transitions. This allows for off-line optimization of exact models represented as large weighted finite-state transducers in contrast to implicit (on-line) failure transition representations. We present preliminary empirical results demonstrating that, even in simple intersection scenarios amenable to the use of failure transitions, the use of the more powerful lexicographic semiring is competitive in terms of time of intersection. 1 Introduction and Motivation Representing smoothed n-gram language models as weighted finite-state transducers (WFST) is most naturally done with a failure transition, which reflects the semantics of the “otherwise” formulation of smoothing (Allauzen et al., 2003). For example, the typical backoff formulation of the probability of a word w given a history h is as follows P(w | h) = ?αPh(Pw( |w h )| h0) oift hc(ehrwwis)e > 0 (1) where P is an empirical estimate of the probability that reserves small finite probability for unseen n-grams; αh is a backoff weight that ensures normalization; and h0 is a backoff history typically achieved by excising the earliest word in the history h. The principle benefit of encoding the WFST in this way is that it only requires explicitly storing n-gram transitions for observed n-grams, i.e., count greater than zero, as opposed to all possible n-grams of the given order which would be infeasible in for example large vocabulary speech recognition. This is a massive space savings, and such an approach is also used for non-probabilistic stochastic language 1 models, such as those trained with the perceptron algorithm (Roark et al., 2007), as the means to access all and exactly those features that should fire for a particular sequence in a deterministic automaton. Similar issues hold for other finite-state se- quence processing problems, e.g., tagging, bracketing or segmenting. Failure transitions, however, are an implicit method for representing a much larger explicit automaton in the case of n-gram models, all possible n-grams for that order. During composition with the model, the failure transition must be interpreted on the fly, keeping track of those symbols that have already been found leaving the original state, and only allowing failure transition traversal for symbols that have not been found (the semantics of “otherwise”). This compact implicit representation cannot generally be preserved when composing with other models, e.g., when combining a language model with a pronunciation lexicon as in widelyused FST approaches to speech recognition (Mohri et al., 2002). Moving from implicit to explicit representation when performing such a composition leads to an explosion in the size of the resulting transducer, frequently making the approach intractable. In practice, an off-line approximation to the model is made, typically by treating the failure transitions as epsilon transitions (Mohri et al., 2002; Allauzen et al., 2003), allowing large transducers to be composed and optimized off-line. These complex approximate transducers are then used during first-pass – decoding, and the resulting pruned search graphs (e.g., word lattices) can be rescored with exact language models encoded with failure transitions. 
Similar problems arise when building, say, POStaggers as WFST: not every pos-tag sequence will have been observed during training, hence failure transitions will achieve great savings in the size of models. Yet discriminative models may include complex features that combine both input stream (word) and output stream (tag) sequences in a single feature, yielding complicated transducer topologies for which effective use of failure transitions may not Proceedings Pofo trhtlea 4nd9,th O Arnegnouna,l J Muneeet 1in9g-2 o4f, t 2h0e1 A1s.s ?oc ci2a0t1io1n A fosrso Ccioamtiopnut faotrio Cnoaml Lpiuntgauti osntiacls: Lsihnogrutpisatipcesrs, pages 1–5, be possible. An exact encoding using other mechanisms is required in such cases to allow for off-line representation and optimization. In this paper, we introduce a novel use of a semiring the lexicographic semiring (Golan, 1999) which permits an exact encoding of these sorts of models with the same compact topology as with failure transitions, but using epsilon transitions. Unlike the standard epsilon approximation, this semiring allows for an exact representation, while also allowing (unlike failure transition approaches) for off-line – – composition with other transducers, with all the optimizations that such representations provide. In the next section, we introduce the semiring, followed by a proof that its use yields exact representations. We then conclude with a brief evaluation of the cost of intersection relative to failure transitions in comparable situations. 2 The Lexicographic Semiring Weighted automata are automata in which the transitions carry weight elements of a semiring (Kuich and Salomaa, 1986). A semiring is a ring that may lack negation, with two associative operations ⊕ and ⊗lac akn nde tghaetiiro nre,ws piethcti twveo i dasesnotictiya ievleem oepnertas t 0io annsd ⊕ ⊕ 1. a nAd ⊗com anmdo tnh esierm reirsipnegc tiivn es pideeenchti ayn edl elmanegnutasg 0e processing, and one that we will be using in this paper, is the tropical semiring (R ∪ {∞}, min, +, ∞, 0), i.e., tmhein t riosp tihcea l⊕ s omfi trhineg gs(e mRi∪rin{g∞ (w},imthi nid,e+nt,i∞ty ,∞0)), ia.end., m+ ins tihse t ⊗e o⊕f othfe t hseem seirminigri n(wg i(wth tidhe indteitnyt t0y). ∞Th)i asn ids a+pp isro thpreia ⊗te o ofof rth h pee srfeomrmiri nngg (Vwitietrhb iid seenatricthy u0s).in Tgh negative log probabilities – we add negative logs along a path and take the min between paths. A hW1 , W2 . . . Wni-lexicographic weight is a tupleA o hfW weights wherei- eeaxichco gorfa pthhiec w weeiigghhtt cisla ass teusW1, W2 . . . Wn, must observe the path property (Mohri, 2002). The path property of a semiring K is defined in terms of the natural order on K such that: a <2 ws e3 w&o4;)r The term “lexicographic” is an apt term for this semiring since the comparison for ⊕ is like the lexicseomgriaripnhgic s icnocmep thareis coonm opfa srtisrionngs f,o rco ⊕m ipsa lrikineg t thhee l feixrist- elements, then the second, and so forth. 3 Language model encoding 3.1 Standard encoding For language model encoding, we will differentiate between two classes of transitions: backoff arcs (labeled with a φ for failure, or with ? using our new semiring); and n-gram arcs (everything else, labeled with the word whose probability is assigned). Each state in the automaton represents an n-gram history string h and each n-gram arc is weighted with the (negative log) conditional probability of the word w labeling the arc given the history h. 
For a given history h and n-gram arc labeled with a word w, the destination of the arc is the state associated with the longest suffix of the string hw that is a history in the model. This will depend on the Markov order of the n-gram model. For example, consider the trigram model schematic shown in Figure 1, in which only history sequences of length 2 are kept in the model. Thus, from history hi = wi−2wi−1, the word wi transitions to hi+1 = wi−1wi, w2hii−ch1 is the longest suffix of hiwi in the modie−l1. As detailed in the “otherwise” semantics of equation 1, backoff arcs transition from state h to a state h0, typically the suffix of h of length |h| − 1, with we,i tgyhpti c(a−lllyog th αeh s)u. Wixe o cfa hll othf ele ndgestthin |hat|io −n 1s,ta wtei ah bwaecikgohtff s−taltoe.g αThis recursive backoff topology terminates at the unigram state, i.e., h = ?, no history. Backoff states of order k may be traversed either via φ-arcs from the higher order n-gram of order k + 1or via an n-gram arc from a lower order n-gram of order k −1. This means that no n-gram arc can enter tohred ezre rko−eth1. .o Trhdiesr mstaeaten s(fi tnhaalt bnaoc nk-ogfrfa),m ma andrc f cualln-o enrdteerr states history strings of length n − 1 for a model sotfa toersde —r n h may ihnagvse o n-gram a nrc −s e 1nt feorri nag m forodeml other full-order states as well as from backoff states of history size n − 2. — s—to 3.2 Encoding with lexicographic semiring For an LM machine M on the tropical semiring with failure transitions, which is deterministic and has the wih-2 =i1wφ/-logPwα(hi-1|whiφ)/-logwPhiα(+-1w i=|-1wiφ)/-logPαw(hi+)1 Figure 1: Deterministic finite-state representation of n-gram models with negative log probabilities (tropical semiring). The symbol φ labels backoff transitions. Modified from Roark and Sproat (2007), Figure 6.1. path property, we can simulate φ-arcs in a standard LM topology by a topologically equivalent machine M0 on the lexicographic hT, Ti semiring, where φ has boenen th hreep l eaxciceod gwraitphh eicps hilTo,nT, ais sfeomlloirwinsg. ,F worh every n-gram arc with label w and weight c, source state si and destination state sj, construct an n-gram arc with label w, weight h0, ci, source state si0, and deswtiniathtio lanb estla wte, s0j. gThhte h e0x,citi c, soosut rocfe e satcahte s state is constructed as follows. If the state is non-final, h∞, ∞i . sOttruhectrewdis aes fifo litl ofiwnsa.l Iwf tihthe e sxtiatt ec iosst n co nit- fwinilall ,b he∞ ∞h0,,∞ ∞cii . hLeertw n sbee tfh iet length oithf th exei longest history string iin. the model. For every φ-arc with (backoff) weight c, source state si, and destination state sj representing a history of length k, construct an ?-arc with source state si0, destination state s0j, and weight hΦ⊗(n−k) , ci, where Φ > 0 and Φ⊗(n−k) takes Φ to the (n − k)th power with the ⊗ operation. In the tropical semiring, ⊗ ris w +, so Φe⊗ ⊗(n o−pke) = (n − k)Φ. tFroorp iecxaalm sepmlei,r i nng a, t⊗rigi sra +m, msoo Φdel, if we= =ar (en b −ac kki)nΦg. off from a bigram state h (history length = 1) to a unigram state, n − k = 2 − 0 = 2, so we set the buanicgkroafmf w steaigteh,t nto −h2 kΦ, = =− l2og − α 0h) = =for 2 ,s soome w Φe s>et 0 th. cInk ofrfd were gtoh tco tom hb2iΦn,e −thleo gmαodel with another automaton or transducer, we would need to also convert those models to the hT, Ti semiring. 
For these automata, we simply use a default transformation such that every transition with weight c is assigned weight ⟨0, c⟩. For example, given a word lattice L, we convert the lattice to L′ in the lexicographic semiring using this default transformation, and then perform the intersection L′ ∩ M′. By removing epsilon transitions and determinizing the result, the low cost path for any given string will be retained in the result, which will correspond to the path achieved with φ-arcs. Finally we project the second dimension of the ⟨T, T⟩ weights to produce a lattice in the tropical semiring, which is equivalent to the result of L ∩ M, i.e.,

C2(det(eps-rem(L′ ∩ M′))) = L ∩ M

where C2 denotes projecting the second dimension of the ⟨T, T⟩ weights, det(·) denotes determinization, and eps-rem(·) denotes ε-removal.

4 Proof

We wish to prove that for any machine N, ShortestPath(M′ ∩ N′) passes through the equivalent states in M′ to those passed through in M for ShortestPath(M ∩ N). Therefore determinization of the resulting intersection after ε-removal yields the same topology as intersection with the equivalent φ machine. Intuitively, since the first dimension of the ⟨T, T⟩ weights is 0 for n-gram arcs and > 0 for backoff arcs, the shortest path will traverse the fewest possible backoff arcs; further, since higher-order backoff arcs cost less in the first dimension of the ⟨T, T⟩ weights in M′, the shortest path will include n-gram arcs at their earliest possible point.

We prove this by induction on the state-sequence of the path p/p′ up to a given state si/s′i in the respective machines M/M′.

Base case: If p/p′ is of length 0, and therefore the states si/s′i are the initial states of the respective machines, the proposition clearly holds.

Inductive step: Now suppose that p/p′ visits s0...si/s′0...s′i and we have therefore reached si/s′i in the respective machines. Suppose the cumulated weights of p/p′ are W and ⟨Ψ, W⟩, respectively. We wish to show that whichever state sj is next visited on p (i.e., the path becomes s0...si sj) the equivalent state s′j is visited on p′ (i.e., the path becomes s′0...s′i s′j). Let w be the next symbol to be matched leaving states si and s′i. There are four cases to consider: (1) there is an n-gram arc leaving states si and s′i labeled with w, but no backoff arc leaving the state; (2) there is no n-gram arc labeled with w leaving the states, but there is a backoff arc; (3) there is no n-gram arc labeled with w and no backoff arc leaving the states; and (4) there is both an n-gram arc labeled with w and a backoff arc leaving the states. In cases (1) and (2), there is only one possible transition to take in either M or M′, and based on the algorithm for construction of M′ given in Section 3.2, these transitions will point to sj and s′j respectively. Case (3) leads to failure of intersection with either machine. This leaves case (4) to consider. In M, since there is a transition leaving state si labeled with w, the backoff arc, which is a failure transition, cannot be traversed, hence the destination of the n-gram arc sj will be the next state in p. However, in M′, both the n-gram transition labeled with w and the backoff transition, now labeled with ε, can be traversed.
What we will now prove is that the shortest path through M′ cannot include taking the backoff arc in this case. In order to emit w by taking the backoff arc out of state s′i, one or more backoff (ε) transitions must be taken, followed by an n-gram arc labeled with w. Let k be the order of the history represented by state s′i, hence the cost of the first backoff arc is ⟨(n − k)Φ, −log(α_{s′i})⟩ in our semiring. If we traverse m backoff arcs prior to emitting the w, the first dimension of our accumulated cost will be m(n − k + (m−1)/2)Φ, based on our algorithm for construction of M′ given in Section 3.2. Let s′l be the destination state after traversing m backoff arcs followed by an n-gram arc labeled with w. Note that, by definition, m ≤ k, and k − m + 1 is the order of state s′l. Based on the construction algorithm, the state s′l is also reachable by first emitting w from state s′i to reach state s′j followed by some number of backoff transitions. The order of state s′j is either k (if k is the highest order in the model) or k + 1 (by extending the history of state s′i by one word). If it is of order k, then it will require m − 1 backoff arcs to reach state s′l, one fewer than the path to state s′l that begins with a backoff arc, for a total cost of (m − 1)(n − k + (m−1)/2)Φ, which is less than m(n − k + (m−1)/2)Φ. If state s′j is of order k + 1, there will be m backoff arcs to reach state s′l, but with a total cost of m(n − (k + 1) + (m−1)/2)Φ = m(n − k + (m−3)/2)Φ, which is also less than m(n − k + (m−1)/2)Φ. Hence the state s′l can always be reached from s′i with a lower cost through state s′j than by first taking the backoff arc from s′i. Therefore the shortest path on M′ must follow s′0...s′i s′j. This completes the proof.

5 Experimental Comparison of ε, φ and ⟨T, T⟩ encoded language models

For our experiments we used lattices derived from a very large vocabulary continuous speech recognition system, which was built for the 2007 GALE Arabic speech recognition task, and used in the work reported in Lehr and Shafran (2011). The lexicographic semiring was evaluated on the development set (2.6 hours of broadcast news and conversations; 18K words). The 888 word lattices for the development set were generated using a competitive baseline system with acoustic models trained on about 1000 hrs of Arabic broadcast data and a 4-gram language model. The language model consisting of 122M n-grams was estimated by interpolation of 14 components. The vocabulary is relatively large at 737K and the associated dictionary has only single pronunciations. The language model was converted to the automaton topology described earlier, and represented in three ways: first as an approximation of a failure machine using epsilons instead of failure arcs; second as a correct failure machine; and third using the lexicographic construction derived in this paper. The three versions of the LM were evaluated by intersecting them with the 888 lattices of the development set. The overall error rate for the systems was 24.8%, comparable to the state of the art on this task.¹ For the shortest paths, the failure and lexicographic machines always produced identical lattices (as determined by FST equivalence); in contrast, 81% of the shortest paths from the epsilon approximation are different, at least in terms of weights, from the shortest paths using the failure LM.
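Before turning to the size and timing comparisons, here is a quick numerical check of the cost comparison used in the proof above (written for this summary; the model order and Φ below are arbitrary):

```python
# Sanity check of the first-dimension costs compared in the proof of Section 4.

def chain_cost(n, k, m, phi=1.0):
    """(n-k)Phi + (n-k+1)Phi + ... + (n-k+m-1)Phi: m consecutive backoff arcs,
    each landing in a shorter history and so costing one Phi more."""
    return sum((n - k + j) * phi for j in range(m))

def backoff_first(n, k, m, phi=1.0):           # m(n - k + (m-1)/2) Phi
    return m * (n - k + (m - 1) / 2) * phi

def via_sj_order_k(n, k, m, phi=1.0):          # (m-1)(n - k + (m-1)/2) Phi
    return (m - 1) * (n - k + (m - 1) / 2) * phi

def via_sj_order_k_plus_1(n, k, m, phi=1.0):   # m(n - k + (m-3)/2) Phi
    return m * (n - k + (m - 3) / 2) * phi

if __name__ == "__main__":
    n, phi = 4, 0.5
    for k in range(1, n):                # histories below the longest one
        for m in range(1, k + 1):        # the proof notes m <= k
            assert abs(chain_cost(n, k, m, phi) - backoff_first(n, k, m, phi)) < 1e-9
            assert via_sj_order_k(n, k, m, phi) < backoff_first(n, k, m, phi)
            assert via_sj_order_k_plus_1(n, k, m, phi) < backoff_first(n, k, m, phi)
    print("both alternative routes to s'_l are cheaper than backing off first")
```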
For full lattices, 42 (4.7%) of the lexicographic outputs differ from the failure LM outputs, due to small floating point rounding issues; 863 (97%) of the epsilon approximation outputs differ. In terms of size, the failure LM, with 5.7 million arcs, requires 97 Mb. The equivalent ⟨T, T⟩-lexicographic LM requires 120 Mb, due to the doubling of the size of the weights.² To measure speed, we performed the intersections 1000 times for each of our 888 lattices on a 2993 MHz Intel Xeon CPU, and took the mean times for each of our methods. The 888 lattices were processed with a mean of 1.62 seconds in total (1.8 msec per lattice) using the failure LM; using the ⟨T, T⟩-lexicographic LM required 1.8 seconds (2.0 msec per lattice), and is thus about 11% slower. Epsilon approximation, where the failure arcs are approximated with epsilon arcs, took 1.17 seconds (1.3 msec per lattice). The slightly slower speeds for the exact methods using the failure LM and ⟨T, T⟩ can be related to the overhead of computing the failure function at runtime, and determinization, respectively.

¹ The error rate is a couple of points higher than in Lehr and Shafran (2011) since we discarded non-lexical words, which are absent in the maximum likelihood estimated language model and are typically augmented to the unigram backoff state with an arbitrary cost, fine-tuned to optimize performance for a given task.
² If size became an issue, the first dimension of the ⟨T, T⟩ weights can be represented by a single byte.

6 Conclusion

In this paper we have introduced a novel application of the lexicographic semiring, proved that it can be used to provide an exact encoding of language model topologies with failure arcs, and provided experimental results that demonstrate its efficiency. Since the ⟨T, T⟩-lexicographic semiring is both left- and right-distributive, optimizations such as minimization are possible. The particular ⟨T, T⟩-lexicographic semiring we have used here is but one of many possible lexicographic encodings. We are currently exploring the use of a lexicographic semiring that involves different semirings in the various dimensions, for the integration of part-of-speech taggers into language models. An implementation of the lexicographic semiring by the second author is already available as part of the OpenFst package (Allauzen et al., 2007). The methods described here are part of the NGram language-model-training toolkit, soon to be released at opengrm.org.

Acknowledgments

This research was supported in part by NSF Grant #IIS-0811745 and DARPA grant #HR0011-09-10041. Any opinions, findings, conclusions or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views of the NSF or DARPA. We thank Maider Lehr for help in preparing the test data. We also thank the ACL reviewers for valuable comments.

References

Cyril Allauzen, Mehryar Mohri, and Brian Roark. 2003. Generalized algorithms for constructing statistical language models. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 40–47.
Cyril Allauzen, Michael Riley, Johan Schalkwyk, Wojciech Skut, and Mehryar Mohri. 2007. OpenFst: A general and efficient weighted finite-state transducer library.
In Proceedings of the Twelfth International Conference on Implementation and Application of Automata (CIAA 2007), Lecture Notes in Computer Science, volume 4793, pages 11–23, Prague, Czech Republic. Springer.
Jonathan Golan. 1999. Semirings and their Applications. Kluwer Academic Publishers, Dordrecht.
Werner Kuich and Arto Salomaa. 1986. Semirings, Automata, Languages. Number 5 in EATCS Monographs on Theoretical Computer Science. Springer-Verlag, Berlin, Germany.
Maider Lehr and Izhak Shafran. 2011. Learning a discriminative weighted finite-state transducer for speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, July.
Mehryar Mohri, Fernando C. N. Pereira, and Michael Riley. 2002. Weighted finite-state transducers in speech recognition. Computer Speech and Language, 16(1):69–88.
Mehryar Mohri. 2002. Semiring framework and algorithms for shortest-distance problems. Journal of Automata, Languages and Combinatorics, 7(3):321–350.
Brian Roark and Richard Sproat. 2007. Computational Approaches to Morphology and Syntax. Oxford University Press, Oxford.
Brian Roark, Murat Saraclar, and Michael Collins. 2007. Discriminative n-gram language modeling. Computer Speech and Language, 21(2):373–392.

5 0.64556742 24 acl-2011-A Scalable Probabilistic Classifier for Language Modeling

Author: Joel Lang

Abstract: We present a novel probabilistic classifier, which scales well to problems that involve a large number of classes and require training on large datasets. A prominent example of such a problem is language modeling. Our classifier is based on the assumption that each feature is associated with a predictive strength, which quantifies how well the feature can predict the class by itself. The predictions of individual features can then be combined according to their predictive strength, resulting in a model, whose parameters can be reliably and efficiently estimated. We show that a generative language model based on our classifier consistently matches modified Kneser-Ney smoothing and can outperform it if sufficiently rich features are incorporated.

6 0.6166082 163 acl-2011-Improved Modeling of Out-Of-Vocabulary Words Using Morphological Classes

7 0.54380661 17 acl-2011-A Large Scale Distributed Syntactic, Semantic and Lexical Language Model for Machine Translation

8 0.51513565 15 acl-2011-A Hierarchical Pitman-Yor Process HMM for Unsupervised Part of Speech Induction

9 0.48589063 176 acl-2011-Integrating surprisal and uncertain-input models in online sentence comprehension: formal techniques and empirical results

10 0.46731526 116 acl-2011-Enhancing Language Models in Statistical Machine Translation with Backward N-grams and Mutual Information Triggers

11 0.45565498 301 acl-2011-The impact of language models and loss functions on repair disfluency detection

12 0.43774104 82 acl-2011-Content Models with Attitude

13 0.4302237 135 acl-2011-Faster and Smaller N-Gram Language Models

14 0.42544547 14 acl-2011-A Hierarchical Model of Web Summaries

15 0.41910034 36 acl-2011-An Efficient Indexer for Large N-Gram Corpora

16 0.39487538 319 acl-2011-Unsupervised Decomposition of a Document into Authorial Components

17 0.38753155 78 acl-2011-Confidence-Weighted Learning of Factored Discriminative Language Models

18 0.38617644 321 acl-2011-Unsupervised Discovery of Rhyme Schemes

19 0.38548243 233 acl-2011-On-line Language Model Biasing for Statistical Machine Translation

20 0.37554142 171 acl-2011-Incremental Syntactic Language Models for Phrase-based Translation


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.015), (17, 0.033), (26, 0.011), (37, 0.08), (39, 0.043), (41, 0.033), (55, 0.043), (59, 0.024), (72, 0.455), (91, 0.031), (96, 0.146)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.9186151 302 acl-2011-They Can Help: Using Crowdsourcing to Improve the Evaluation of Grammatical Error Detection Systems

Author: Nitin Madnani ; Martin Chodorow ; Joel Tetreault ; Alla Rozovskaya

Abstract: Despite the rising interest in developing grammatical error detection systems for non-native speakers of English, progress in the field has been hampered by a lack of informative metrics and an inability to directly compare the performance of systems developed by different researchers. In this paper we address these problems by presenting two evaluation methodologies, both based on a novel use of crowdsourcing.

1 Motivation and Contributions

One of the fastest growing areas in need of NLP tools is the field of grammatical error detection for learners of English as a Second Language (ESL). According to Guo and Beckett (2007), "over a billion people speak English as their second or foreign language." This high demand has resulted in many NLP research papers on the topic, a Synthesis Series book (Leacock et al., 2010) and a recurring workshop (Tetreault et al., 2010a), all in the last five years. In this year's ACL conference, there are four long papers devoted to this topic.

Despite the growing interest, two major factors encumber the growth of this subfield. First, the lack of consistent and appropriate score reporting is an issue. Most work reports results in the form of precision and recall as measured against the judgment of a single human rater. This is problematic because most usage errors (such as those in article and preposition usage) are a matter of degree rather than simple rule violations such as number agreement. As a consequence, it is common for two native speakers to have different judgments of usage. Therefore, an appropriate evaluation should take this into account by not only enlisting multiple human judges but also aggregating these judgments in a graded manner. Second, systems are hardly ever compared to each other. In fact, to our knowledge, no two systems developed by different groups have been compared directly within the field primarily because there is no common corpus or shared task—both commonly found in other NLP areas such as machine translation.¹ For example, Tetreault and Chodorow (2008), Gamon et al. (2008) and Felice and Pulman (2008) developed preposition error detection systems, but evaluated on three different corpora using different evaluation measures.

¹ There has been a recent proposal for a related shared task (Dale and Kilgarriff, 2010) that shows promise.

The goal of this paper is to address the above issues by using crowdsourcing, which has been proven effective for collecting multiple, reliable judgments in other NLP tasks: machine translation (Callison-Burch, 2009; Zaidan and Callison-Burch, 2010), speech recognition (Evanini et al., 2010; Novotney and Callison-Burch, 2010), automated paraphrase generation (Madnani, 2010), anaphora resolution (Chamberlain et al., 2009), word sense disambiguation (Akkaya et al., 2010), lexicon construction for less commonly taught languages (Irvine and Klementiev, 2010), fact mining (Wang and Callison-Burch, 2010) and named entity recognition (Finin et al., 2010) among several others. In particular, we make a significant contribution to the field by showing how to leverage crowdsourcing to both address the lack of appropriate evaluation metrics and to make system comparison easier.
Our solution is general enough for, in the simplest case, intrinsically evaluating a single system on a single dataset and, more realistically, comparing two different systems (from same or different groups).

2 A Case Study: Extraneous Prepositions

We consider the problem of detecting an extraneous preposition error, i.e., incorrectly using a preposition where none is licensed. In the sentence "They came to outside", the preposition to is an extraneous error whereas in the sentence "They arrived to the town" the preposition to is a confusion error (cf. arrived in the town). Most work on automated correction of preposition errors, with the exception of Gamon (2010), addresses preposition confusion errors, e.g., (Felice and Pulman, 2008; Tetreault and Chodorow, 2008; Rozovskaya and Roth, 2010b). One reason is that in addition to the standard context-based features used to detect confusion errors, identifying extraneous prepositions also requires actual knowledge of when a preposition can and cannot be used. Despite this lack of attention, extraneous prepositions account for a significant proportion—as much as 18% in essays by advanced English learners (Rozovskaya and Roth, 2010a)—of all preposition usage errors.

2.1 Data and Systems

For the experiments in this paper, we chose a proprietary corpus of about 500,000 essays written by ESL students for the Test of English as a Foreign Language (TOEFL®). Despite being common ESL errors, preposition errors are still infrequent overall, with over 90% of prepositions being used correctly (Leacock et al., 2010; Rozovskaya and Roth, 2010a). Given this fact about error sparsity, we needed an efficient method to extract a good number of error instances (for statistical reliability) from the large essay corpus. We found all trigrams in our essays containing prepositions as the middle word (e.g., marry with her) and then looked up the counts of each trigram and the corresponding bigram with the preposition removed (marry her) in the Google Web1T 5-gram Corpus. If the trigram was unattested or had a count much lower than expected based on the bigram count, then we manually inspected the trigram to see whether it was actually an error. If it was, we extracted a sentence from the large essay corpus containing this erroneous trigram. Once we had extracted 500 sentences containing extraneous preposition error instances, we added 500 sentences containing correct instances of preposition usage. This yielded a corpus of 1000 sentences with a 50% error rate. These sentences, with the target preposition highlighted, were presented to 3 expert annotators who are native English speakers. They were asked to annotate the preposition usage instance as one of the following: extraneous (Error), not extraneous (OK) or too hard to decide (Unknown); the last category was needed for cases where the context was too messy to make a decision about the highlighted preposition. On average, the three experts had an agreement of 0.87 and a kappa of 0.75. For subsequent analysis, we only use the classes Error and OK since Unknown was used extremely rarely and never by all 3 experts for the same sentence. We used two different error detection systems to illustrate our evaluation methodology:²

• LM: A 4-gram language model trained on the Google Web1T 5-gram Corpus with SRILM (Stolcke, 2002).
• PERC: An averaged Perceptron (Freund and Schapire, 1999) classifier—as implemented in the Learning by Java toolkit (Rizzolo and Roth, 2007)—trained on 7 million examples and using the same features employed by Tetreault and Chodorow (2008).

² Any conclusions drawn in this paper pertain only to these specific instantiations of the two systems.

3 Crowdsourcing

Recently, we showed that Amazon Mechanical Turk (AMT) is a cheap and effective alternative to expert raters for annotating preposition errors (Tetreault et al., 2010b). In other current work, we have extended this pilot study to show that CrowdFlower, a crowdsourcing service that allows for stronger quality control on untrained human raters (henceforth, Turkers), is more reliable than AMT on three different error detection tasks (article errors, confused prepositions & extraneous prepositions). To impose such quality control, one has to provide "gold" instances, i.e., examples with known correct judgments that are then used to root out any Turkers with low performance on these instances. For all three tasks, we obtained 20 Turkers' judgments via CrowdFlower for each instance and found that, on average, only 3 Turkers were required to match the experts. More specifically, for the extraneous preposition error task, we used 75 sentences as gold and obtained judgments for the remaining 923 non-gold sentences.³ We found that if we used 3 Turker judgments in a majority vote, the agreement with any one of the three expert raters is, on average, 0.87 with a kappa of 0.76. This is on par with the inter-expert agreement and kappa found earlier (0.87 and 0.75 respectively). The extraneous preposition annotation cost only $325 (923 judgments × 20 Turkers) and was completed in a single day. The only restriction on the Turkers was that they be physically located in the USA. For the analysis in subsequent sections, we use these 923 sentences and the respective 20 judgments obtained via CrowdFlower. The 3 expert judgments are not used any further in this analysis.

³ We found 2 duplicate sentences and removed them.

4 Revamping System Evaluation

In this section, we provide details on how crowdsourcing can help revamp the evaluation of error detection systems: (a) by providing more informative measures for the intrinsic evaluation of a single system (§4.1), and (b) by easily enabling system comparison (§4.2).

4.1 Crowd-informed Evaluation Measures

When evaluating the performance of grammatical error detection systems against human judgments, the judgments for each instance are generally reduced to the single most frequent category: Error or OK. This reduction is not an accurate reflection of a complex phenomenon. It discards valuable information about the acceptability of usage because it treats all "bad" uses as equal (and all good ones as equal), when they are not. Arguably, it would be fairer to use a continuous scale, such as the proportion of raters who judge an instance as correct or incorrect. For example, if 90% of raters agree on a rating of Error for an instance of preposition usage, then that is stronger evidence that the usage is an error than if 56% of Turkers classified it as Error and 44% classified it as OK (the sentence "In addition classmates play with some game and enjoy" is an example). The regular measures of precision and recall would be fairer if they reflected this reality.
Besides fairness, another reason to use a continuous scale is that of stability, particularly with a small number of instances in the evaluation set (quite common in the field). By relying on majority judgments, precision and recall measures tend to be unstable (see below). We modify the measures of precision and recall to incorporate distributions of correctness, obtained via crowdsourcing, in order to make them fairer and more stable indicators of system performance. Given an error detection system that classifies a sentence containing a specific preposition as Error (class 1) if the preposition is extraneous and OK (class 0) otherwise, we propose the following weighted versions of hits (Hw), misses (Mw) and false positives (FPw):

Hw = Σ_{i=1}^{N} c_i^sys · p_i^crowd            (1)
Mw = Σ_{i=1}^{N} (1 − c_i^sys) · p_i^crowd      (2)
FPw = Σ_{i=1}^{N} c_i^sys · (1 − p_i^crowd)     (3)

In the above equations, N is the total number of instances, c_i^sys is the class (1 or 0) output by the system for instance i, and p_i^crowd indicates the proportion of the crowd that classified instance i as Error. Note that if we were to revert to the majority crowd judgment as the sole judgment for each instance, instead of proportions, p_i^crowd would always be either 1 or 0 and the above formulae would simply compute the normal hits, misses and false positives. Given these definitions, weighted precision can be defined as Precision_w = Hw/(Hw + FPw) and weighted recall as Recall_w = Hw/(Hw + Mw).

Figure 1: Histogram of Turker agreements for all 923 instances on whether a preposition is extraneous.

Table 1: Comparing commonly used (unweighted) and proposed (weighted) precision/recall measures for LM.

To illustrate the utility of these weighted measures, we evaluated the LM and PERC systems on the dataset containing 923 preposition instances, against all 20 Turker judgments. Figure 1 shows a histogram of the Turker agreement for the majority rating over the set. Table 1 shows both the unweighted (discrete majority judgment) and weighted (continuous Turker proportion) versions of precision and recall for this system. The numbers clearly show that in the unweighted case, the performance of the system is overestimated simply because the system is getting as much credit for each contentious case (low agreement) as for each clear one (high agreement). In the weighted measure we propose, the contentious cases are weighted lower and therefore their contribution to the overall performance is reduced. This is a fairer representation since the system should not be expected to perform as well on the less reliable instances as it does on the clear-cut instances. Essentially, if humans cannot consistently decide whether a case is an error then a system's output cannot be considered entirely right or entirely wrong.⁴

Figure 2: Unweighted precision/recall by agreement bins for LM & PERC (agreement bin sizes: n=93, n=114, n=716).

As an added advantage, the weighted measures are more stable. Consider a contentious instance in a small dataset where 7 out of 15 Turkers (a minority) classified it as Error. However, it might easily have happened that 8 Turkers (a majority) classified it as Error instead of 7. In that case, the change in unweighted precision would have been much larger than is warranted by such a small change in the data. However, weighted precision is guaranteed to be more stable. Note that the instability decreases as the size of the dataset increases but still remains a problem.
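Equations (1)–(3) and the weighted precision/recall defined above translate directly into code; the sketch below is written for this summary, and the toy numbers are illustrative, not the paper's data:

```python
# Weighted hits/misses/false positives from equations (1)-(3) above.
# sys_labels[i] is c_i^sys in {0, 1}; crowd_error_prop[i] is p_i^crowd,
# the proportion of Turkers who labeled instance i as Error.

def weighted_pr(sys_labels, crowd_error_prop):
    assert len(sys_labels) == len(crowd_error_prop)
    hw = sum(c * p for c, p in zip(sys_labels, crowd_error_prop))          # H_w
    mw = sum((1 - c) * p for c, p in zip(sys_labels, crowd_error_prop))    # M_w
    fpw = sum(c * (1 - p) for c, p in zip(sys_labels, crowd_error_prop))   # FP_w
    precision_w = hw / (hw + fpw) if hw + fpw > 0 else 0.0
    recall_w = hw / (hw + mw) if hw + mw > 0 else 0.0
    return precision_w, recall_w

if __name__ == "__main__":
    # Illustrative only: three clear errors, one contentious case, one clear non-error.
    crowd = [0.95, 0.90, 1.00, 0.55, 0.05]
    system = [1, 1, 0, 1, 0]
    p_w, r_w = weighted_pr(system, crowd)
    # With majority votes instead of proportions the same code reduces to ordinary P/R.
    p_u, r_u = weighted_pr(system, [round(p) for p in crowd])
    print(f"weighted P={p_w:.3f} R={r_w:.3f}  unweighted P={p_u:.3f} R={r_u:.3f}")
```

Replacing the proportions with their majority-vote rounding, as in the last two lines, recovers the ordinary unweighted measures, matching the remark above.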
4.2 Enabling System Comparison

In this section, we show how to easily compare different systems both on the same data (in the ideal case of a shared dataset being available) and, more realistically, on different datasets. Figure 2 shows (unweighted) precision and recall of LM and PERC (computed against the majority Turker judgment) for three agreement bins, where each bin is defined as containing only the instances with Turker agreement in a specific range. We chose the bins shown since they are sufficiently large and represent a reasonable stratification of the agreement space. Note that we are not weighting the precision and recall in this case since we have already used the agreement proportions to create the bins. This curve enables us to compare the two systems easily on different levels of item contentiousness and, therefore, conveys much more information than what is usually reported (a single number for unweighted precision/recall over the whole corpus). For example, from this graph, PERC is seen to have similar performance as LM for the 75-90% agreement bin. In addition, even though LM precision is perfect (1.0) for the most contentious instances (the 50-75% bin), this turns out to be an artifact of the LM classifier's decision process. When it must decide between what it views as two equally likely possibilities, it defaults to OK. Therefore, even though LM has higher unweighted precision (0.957) than PERC (0.813), it is only really better on the most clear-cut cases (the 90-100% bin). If one were to report unweighted precision and recall without using any bins—as is the norm—this important qualification would have been harder to discover. While this example uses the same dataset for evaluating two systems, the procedure is general enough to allow two systems to be compared on two different datasets by simply examining the two plots. However, two potential issues arise in that case. The first is that the bin sizes will likely vary across the two plots. However, this should not be a significant problem as long as the bins are sufficiently large. A second, more serious, issue is that the error rates (the proportion of instances that are actually erroneous) in each bin may be different across the two plots. To handle this, we recommend that a kappa-agreement plot be used instead of the precision-agreement plot shown here.

⁴ The difference between unweighted and weighted measures can vary depending on the distribution of agreement.

5 Conclusions

Our goal is to propose best practices to address the two primary problems in evaluating grammatical error detection systems and we do so by leveraging crowdsourcing. For system development, we recommend that rather than compressing multiple judgments down to the majority, it is better to use agreement proportions to weight precision and recall to yield fairer and more stable indicators of performance. For system comparison, we argue that the best solution is to use a shared dataset and present the precision-agreement plot using a set of agreed-upon bins (possibly in conjunction with the weighted precision and recall measures) for a more informative comparison. However, we recognize that shared datasets are harder to create in this field (as most of the data is proprietary). Therefore, we also provide a way to compare multiple systems across different datasets by using kappa-agreement plots.
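The binned, unweighted comparison referred to above (Section 4.2) can be sketched as follows; this is illustrative code written for this summary, not the authors' implementation, and the bin boundaries are simply the 50–75/75–90/90–100% ranges mentioned in the text:

```python
# Schematic precision/recall by agreement bin (cf. Figure 2). Majority judgments
# and agreement levels are derived from the per-instance proportion of "Error" votes.

def majority_and_agreement(p_error):
    """Majority label (1 = Error) and the agreement level of that majority."""
    label = 1 if p_error >= 0.5 else 0
    return label, max(p_error, 1 - p_error)

def binned_pr(sys_labels, crowd_error_prop,
              bins=((0.50, 0.75), (0.75, 0.90), (0.90, 1.01))):
    results = {}
    for lo, hi in bins:
        tp = fp = fn = 0
        for c_sys, p in zip(sys_labels, crowd_error_prop):
            gold, agreement = majority_and_agreement(p)
            if not (lo <= agreement < hi):
                continue
            if c_sys == 1 and gold == 1:
                tp += 1
            elif c_sys == 1 and gold == 0:
                fp += 1
            elif c_sys == 0 and gold == 1:
                fn += 1
        precision = tp / (tp + fp) if tp + fp else None
        recall = tp / (tp + fn) if tp + fn else None
        results[(lo, hi)] = (precision, recall)
    return results

if __name__ == "__main__":
    crowd = [0.55, 0.60, 0.80, 0.85, 0.95, 1.00, 0.10, 0.30]   # toy proportions
    system = [1, 0, 1, 1, 1, 0, 0, 1]                          # toy system output
    for (lo, hi), (p, r) in binned_pr(system, crowd).items():
        print(f"agreement {lo:.0%}-{min(hi, 1.0):.0%}: precision={p}, recall={r}")
```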
As for agreement bins, we posit that the agreement values used to define them depend on the task and, therefore, should be determined by the community. Note that both of these practices can also be implemented by using 20 experts instead of 20 Turkers. However, we show that crowdsourcing yields judgments that are as good but without the cost. To facilitate the adoption of these practices, we make all our evaluation code and data available to the community.⁵

⁵ http://bit.ly/crowdgrammar

Acknowledgments

We would first like to thank our expert annotators Sarah Ohls and Waverely VanWinkle for their hours of hard work. We would also like to acknowledge Lei Chen, Keelan Evanini, Jennifer Foster, Derrick Higgins and the three anonymous reviewers for their helpful comments and feedback.

References

Cem Akkaya, Alexander Conrad, Janyce Wiebe, and Rada Mihalcea. 2010. Amazon Mechanical Turk for Subjectivity Word Sense Disambiguation. In Proceedings of the NAACL Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pages 195–203.
Chris Callison-Burch. 2009. Fast, Cheap, and Creative: Evaluating Translation Quality Using Amazon's Mechanical Turk. In Proceedings of EMNLP, pages 286–295.
Jon Chamberlain, Massimo Poesio, and Udo Kruschwitz. 2009. A Demonstration of Human Computation Using the Phrase Detectives Annotation Game. In ACM SIGKDD Workshop on Human Computation, pages 23–24.
Robert Dale and Adam Kilgarriff. 2010. Helping Our Own: Text Massaging for Computational Linguistics as a New Shared Task. In Proceedings of INLG.
Keelan Evanini, Derrick Higgins, and Klaus Zechner. 2010. Using Amazon Mechanical Turk for Transcription of Non-Native Speech. In Proceedings of the NAACL Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pages 53–56.
Rachele De Felice and Stephen Pulman. 2008. A Classifier-Based Approach to Preposition and Determiner Error Correction in L2 English. In Proceedings of COLING, pages 169–176.
Tim Finin, William Murnane, Anand Karandikar, Nicholas Keller, Justin Martineau, and Mark Dredze. 2010. Annotating Named Entities in Twitter Data with Crowdsourcing. In Proceedings of the NAACL Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pages 80–88.
Yoav Freund and Robert E. Schapire. 1999. Large Margin Classification Using the Perceptron Algorithm. Machine Learning, 37(3):277–296.
Michael Gamon, Jianfeng Gao, Chris Brockett, Alexander Klementiev, William Dolan, Dmitriy Belenko, and Lucy Vanderwende. 2008. Using Contextual Speller Techniques and Language Modeling for ESL Error Correction. In Proceedings of IJCNLP.
Michael Gamon. 2010. Using Mostly Native Data to Correct Errors in Learners' Writing. In Proceedings of NAACL, pages 163–171.
Y. Guo and Gulbahar Beckett. 2007. The Hegemony of English as a Global Language: Reclaiming Local Knowledge and Culture in China. Convergence: International Journal of Adult Education, 1.
Ann Irvine and Alexandre Klementiev. 2010. Using Mechanical Turk to Annotate Lexicons for Less Commonly Used Languages. In Proceedings of the NAACL Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pages 108–113.
Claudia Leacock, Martin Chodorow, Michael Gamon, and Joel Tetreault. 2010. Automated Grammatical Error Detection for Language Learners. Synthesis Lectures on Human Language Technologies. Morgan Claypool.
Nitin Madnani. 2010. The Circle of Meaning: From Translation to Paraphrasing and Back. Ph.D. thesis, Department of Computer Science, University of Maryland College Park.
Scott Novotney and Chris Callison-Burch. 2010. Cheap, Fast and Good Enough: Automatic Speech Recognition with Non-Expert Transcription. In Proceedings of NAACL, pages 207–215.
Nicholas Rizzolo and Dan Roth. 2007. Modeling Discriminative Global Inference. In Proceedings of the First IEEE International Conference on Semantic Computing (ICSC), pages 597–604, Irvine, California, September.
Alla Rozovskaya and D. Roth. 2010a. Annotating ESL errors: Challenges and rewards. In Proceedings of the NAACL Workshop on Innovative Use of NLP for Building Educational Applications.
Alla Rozovskaya and D. Roth. 2010b. Generating Confusion Sets for Context-Sensitive Error Correction. In Proceedings of EMNLP.
Andreas Stolcke. 2002. SRILM: An Extensible Language Modeling Toolkit. In Proceedings of the International Conference on Spoken Language Processing, pages 257–286.
Joel Tetreault and Martin Chodorow. 2008. The Ups and Downs of Preposition Error Detection in ESL Writing. In Proceedings of COLING, pages 865–872.
Joel Tetreault, Jill Burstein, and Claudia Leacock, editors. 2010a. Proceedings of the NAACL Workshop on Innovative Use of NLP for Building Educational Applications.
Joel Tetreault, Elena Filatova, and Martin Chodorow. 2010b. Rethinking Grammatical Error Annotation and Evaluation with the Amazon Mechanical Turk. In Proceedings of the NAACL Workshop on Innovative Use of NLP for Building Educational Applications, pages 45–48.
Rui Wang and Chris Callison-Burch. 2010. Cheap Facts and Counter-Facts. In Proceedings of the NAACL Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pages 163–167.
Omar F. Zaidan and Chris Callison-Burch. 2010. Predicting Human-Targeted Translation Edit Rate via Untrained Human Annotators. In Proceedings of NAACL, pages 369–372.

2 0.87510163 91 acl-2011-Data-oriented Monologue-to-Dialogue Generation

Author: Paul Piwek ; Svetlana Stoyanchev

Abstract: This short paper introduces an implemented and evaluated monolingual Text-to-Text generation system. The system takes monologue and transforms it to two-participant dialogue. After briefly motivating the task of monologue-to-dialogue generation, we describe the system and present an evaluation in terms of fluency and accuracy.

same-paper 3 0.87503183 142 acl-2011-Generalized Interpolation in Decision Tree LM

Author: Denis Filimonov ; Mary Harper

Abstract: In the face of sparsity, statistical models are often interpolated with lower order (backoff) models, particularly in Language Modeling. In this paper, we argue that there is a relation between the higher order and the backoff model that must be satisfied in order for the interpolation to be effective. We show that in n-gram models, the relation is trivially held, but in models that allow arbitrary clustering of context (such as decision tree models), this relation is generally not satisfied. Based on this insight, we also propose a generalization of linear interpolation which significantly improves the performance of a decision tree language model.

4 0.87161803 130 acl-2011-Extracting Comparative Entities and Predicates from Texts Using Comparative Type Classification

Author: Seon Yang ; Youngjoong Ko

Abstract: The automatic extraction of comparative information is an important text mining problem and an area of increasing interest. In this paper, we study how to build a Korean comparison mining system. Our work is composed of two consecutive tasks: 1) classifying comparative sentences into different types and 2) mining comparative entities and predicates. We perform various experiments to find relevant features and learning techniques. As a result, we achieve outstanding performance enough for practical use.

5 0.82780898 152 acl-2011-How Much Can We Gain from Supervised Word Alignment?

Author: Jinxi Xu ; Jinying Chen

Abstract: Word alignment is a central problem in statistical machine translation (SMT). In recent years, supervised alignment algorithms, which improve alignment accuracy by mimicking human alignment, have attracted a great deal of attention. The objective of this work is to explore the performance limit of supervised alignment under the current SMT paradigm. Our experiments used a manually aligned Chinese-English corpus with 280K words recently released by the Linguistic Data Consortium (LDC). We treated the human alignment as the oracle of supervised alignment. The result is surprising: the gain of human alignment over a state of the art unsupervised method (GIZA++) is less than 1 point in BLEU. Furthermore, we showed the benefit of improved alignment becomes smaller with more training data, implying the above limit also holds for large training conditions.

6 0.8116594 261 acl-2011-Recognizing Named Entities in Tweets

7 0.79742551 252 acl-2011-Prototyping virtual instructors from human-human corpora

8 0.72647244 88 acl-2011-Creating a manually error-tagged and shallow-parsed learner corpus

9 0.65323979 32 acl-2011-Algorithm Selection and Model Adaptation for ESL Correction Tasks

10 0.59114146 108 acl-2011-EdIt: A Broad-Coverage Grammar Checker Using Pattern Grammar

11 0.58527786 64 acl-2011-C-Feel-It: A Sentiment Analyzer for Micro-blogs

12 0.58061862 160 acl-2011-Identifying Sarcasm in Twitter: A Closer Look

13 0.57798934 147 acl-2011-Grammatical Error Correction with Alternating Structure Optimization

14 0.57287335 48 acl-2011-Automatic Detection and Correction of Errors in Dependency Treebanks

15 0.55999458 141 acl-2011-Gappy Phrasal Alignment By Agreement

16 0.55743438 86 acl-2011-Coreference for Learning to Extract Relations: Yes Virginia, Coreference Matters

17 0.55463398 339 acl-2011-Word Alignment Combination over Multiple Word Segmentation

18 0.55013305 8 acl-2011-A Corpus of Scope-disambiguated English Text

19 0.54981148 246 acl-2011-Piggyback: Using Search Engines for Robust Cross-Domain Named Entity Recognition

20 0.54886782 46 acl-2011-Automated Whole Sentence Grammar Correction Using a Noisy Channel Model