acl acl2011 acl2011-210 knowledge-graph by maker-knowledge-mining

210 acl-2011-Lexicographic Semirings for Exact Automata Encoding of Sequence Models


Source: pdf

Author: Brian Roark ; Richard Sproat ; Izhak Shafran

Abstract: In this paper we introduce a novel use of the lexicographic semiring and motivate its use for speech and language processing tasks. We prove that the semiring allows for exact encoding of backoff models with epsilon transitions. This allows for off-line optimization of exact models represented as large weighted finite-state transducers, in contrast to implicit (on-line) failure transition representations. We present preliminary empirical results demonstrating that, even in simple intersection scenarios amenable to the use of failure transitions, the use of the more powerful lexicographic semiring is competitive in terms of time of intersection.

1 Introduction and Motivation

Representing smoothed n-gram language models as weighted finite-state transducers (WFST) is most naturally done with a failure transition, which reflects the semantics of the "otherwise" formulation of smoothing (Allauzen et al., 2003). For example, the typical backoff formulation of the probability of a word w given a history h is as follows:

P(w | h) = P̄(w | h)        if c(hw) > 0
           αh P(w | h')     otherwise          (1)

where P̄ is an empirical estimate of the probability that reserves small finite probability for unseen n-grams; αh is a backoff weight that ensures normalization; and h' is a backoff history, typically obtained by excising the earliest word in the history h. The principal benefit of encoding the WFST in this way is that it only requires explicitly storing n-gram transitions for observed n-grams, i.e., those with count greater than zero, as opposed to all possible n-grams of the given order, which would be infeasible in, for example, large vocabulary speech recognition. This is a massive space savings, and such an approach is also used for non-probabilistic stochastic language models, such as those trained with the perceptron algorithm (Roark et al., 2007), as the means to access all and exactly those features that should fire for a particular sequence in a deterministic automaton. Similar issues hold for other finite-state sequence processing problems, e.g., tagging, bracketing or segmenting.

Failure transitions, however, are an implicit method for representing a much larger explicit automaton: in the case of n-gram models, all possible n-grams for that order. During composition with the model, the failure transition must be interpreted on the fly, keeping track of those symbols that have already been found leaving the original state, and only allowing failure transition traversal for symbols that have not been found (the semantics of "otherwise"). This compact implicit representation cannot generally be preserved when composing with other models, e.g., when combining a language model with a pronunciation lexicon as in widely used FST approaches to speech recognition (Mohri et al., 2002). Moving from implicit to explicit representation when performing such a composition leads to an explosion in the size of the resulting transducer, frequently making the approach intractable. In practice, an off-line approximation to the model is made, typically by treating the failure transitions as epsilon transitions (Mohri et al., 2002; Allauzen et al., 2003), allowing large transducers to be composed and optimized off-line. These complex approximate transducers are then used during first-pass decoding, and the resulting pruned search graphs (e.g., word lattices) can be rescored with exact language models encoded with failure transitions.
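As a concrete illustration of the "otherwise" recursion in equation (1), the following Python sketch computes a backoff probability from stored n-gram estimates and backoff weights. The dictionary layout, the floor value, and the function name are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of the backoff recursion in equation (1).
# `probs` maps (history, word) -> estimated P(w | h) for observed n-grams;
# `alphas` maps history -> backoff weight alpha_h. Both dictionaries and the
# floor value are hypothetical, for illustration only.

def backoff_prob(word, history, probs, alphas, floor=1e-7):
    """Return P(word | history) under the 'otherwise' formulation."""
    if (history, word) in probs:             # c(hw) > 0: use the stored estimate
        return probs[(history, word)]
    if not history:                          # no shorter history to back off to
        return floor                         # smoothed mass reserved for unseen words
    h_prime = history[1:]                    # h': excise the earliest word of h
    return alphas.get(history, 1.0) * backoff_prob(word, h_prime, probs, alphas, floor)

# Toy bigram model: "b" is observed after "a" but not after "c".
probs = {((), "b"): 0.2, (("a",), "b"): 0.5}
alphas = {("c",): 0.4}
print(backoff_prob("b", ("a",), probs, alphas))   # 0.5   (observed bigram)
print(backoff_prob("b", ("c",), probs, alphas))   # 0.08  (0.4 * 0.2, backed off)
```

A WFST encoding stores arcs only for the keys of `probs`, plus one backoff arc per history state, which is where the space savings discussed above come from.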
Similar problems arise when building, say, POS-taggers as WFSTs: not every POS-tag sequence will have been observed during training, hence failure transitions will achieve great savings in the size of models. Yet discriminative models may include complex features that combine both input stream (word) and output stream (tag) sequences in a single feature, yielding complicated transducer topologies for which effective use of failure transitions may not be possible. An exact encoding using other mechanisms is required in such cases to allow for off-line representation and optimization.

In this paper, we introduce a novel use of a semiring, the lexicographic semiring (Golan, 1999), which permits an exact encoding of these sorts of models with the same compact topology as with failure transitions, but using epsilon transitions. Unlike the standard epsilon approximation, this semiring allows for an exact representation, while also allowing (unlike failure transition approaches) for off-line composition with other transducers, with all the optimizations that such representations provide. In the next section, we introduce the semiring, followed by a proof that its use yields exact representations. We then conclude with a brief evaluation of the cost of intersection relative to failure transitions in comparable situations.

2 The Lexicographic Semiring

Weighted automata are automata in which the transitions carry weight elements of a semiring (Kuich and Salomaa, 1986). A semiring is a ring that may lack negation, with two associative operations ⊕ and ⊗ and their respective identity elements 0 and 1. A common semiring in speech and language processing, and one that we will be using in this paper, is the tropical semiring (R ∪ {∞}, min, +, ∞, 0), i.e., min is the ⊕ of the semiring (with identity ∞) and + is the ⊗ of the semiring (with identity 0). This is appropriate for performing Viterbi search using negative log probabilities: we add negative logs along a path and take the min between paths.

A ⟨W1, W2, ..., Wn⟩-lexicographic weight is a tuple of weights where each of the weight classes W1, W2, ..., Wn must observe the path property (Mohri, 2002). The path property of a semiring K is defined in terms of the natural order on K, such that a <_K b iff a ⊕ b = a. For the ⟨T, T⟩-lexicographic semiring over pairs of tropical weights used in this paper, the operations are:

⟨w1, w2⟩ ⊕ ⟨w3, w4⟩ = ⟨w1, w2⟩ if w1 < w3, or if w1 = w3 and w2 ≤ w4; otherwise ⟨w3, w4⟩
⟨w1, w2⟩ ⊗ ⟨w3, w4⟩ = ⟨w1 + w3, w2 + w4⟩

The term "lexicographic" is an apt term for this semiring since the comparison for ⊕ is like the lexicographic comparison of strings, comparing the first elements, then the second, and so forth.

3 Language model encoding

3.1 Standard encoding

For language model encoding, we will differentiate between two classes of transitions: backoff arcs (labeled with a φ for failure, or with ε using our new semiring); and n-gram arcs (everything else, labeled with the word whose probability is assigned). Each state in the automaton represents an n-gram history string h, and each n-gram arc is weighted with the (negative log) conditional probability of the word w labeling the arc given the history h.
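To make the ⟨T, T⟩ operations from Section 2 concrete, here is a minimal Python sketch of the pair semiring; the names (`plus`, `times`, `Zero`, `One`) and the use of plain tuples are illustrative choices, not the OpenFst implementation.

```python
import math

INF = math.inf

# Sketch of the <T, T>-lexicographic semiring over pairs of tropical weights.
# plus() is lexicographic min ("compare first elements, then second");
# times() is componentwise tropical multiplication, i.e. pairwise addition.

Zero = (INF, INF)   # semiring 0: identity for plus
One = (0.0, 0.0)    # semiring 1: identity for times

def plus(a, b):
    """Lexicographic min of two weight pairs (Python tuples compare lexicographically)."""
    return a if a <= b else b

def times(a, b):
    """Componentwise tropical product: add the negative log costs in each dimension."""
    return (a[0] + b[0], a[1] + b[1])

# The first component decides; the second only breaks ties. This is the ordering
# that lets backoff (epsilon) arcs be penalized in dimension 1 while the true
# model costs live in dimension 2 (Section 3.2).
assert plus((1.0, 5.0), (2.0, 0.5)) == (1.0, 5.0)
assert times((1.0, 5.0), (2.0, 0.5)) == (3.0, 5.5)
```

Because each component is a path semiring and the pair comparison is total, the path property required above is preserved, which is what keeps shortest path and determinization well defined over these weights.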
For a given history h and n-gram arc labeled with a word w, the destination of the arc is the state associated with the longest suffix of the string hw that is a history in the model. This will depend on the Markov order of the n-gram model. For example, consider the trigram model schematic shown in Figure 1, in which only history sequences of length 2 are kept in the model. Thus, from history hi = wi−2 wi−1, the word wi transitions to hi+1 = wi−1 wi, which is the longest suffix of hi wi in the model.

[Figure 1: Deterministic finite-state representation of n-gram models with negative log probabilities (tropical semiring). The symbol φ labels backoff transitions. Modified from Roark and Sproat (2007), Figure 6.1.]

As detailed in the "otherwise" semantics of equation 1, backoff arcs transition from state h to a state h', typically the suffix of h of length |h| − 1, with weight (−log αh). We call the destination state a backoff state. This recursive backoff topology terminates at the unigram state, i.e., h = ε, no history. Backoff states of order k may be traversed either via φ-arcs from the higher order n-gram states of order k + 1 or via an n-gram arc from a lower order n-gram state of order k − 1. This means that no n-gram arc can enter the zeroth order state (final backoff), and that full-order states (those with history strings of length n − 1 for a model of order n) may have n-gram arcs entering from other full-order states as well as from backoff states of history size n − 2.

3.2 Encoding with lexicographic semiring

For an LM machine M on the tropical semiring with failure transitions, which is deterministic and has the path property, we can simulate φ-arcs in a standard LM topology by a topologically equivalent machine M' on the lexicographic ⟨T, T⟩ semiring, where φ has been replaced with epsilon, as follows. For every n-gram arc with label w and weight c, source state si and destination state sj, construct an n-gram arc with label w, weight ⟨0, c⟩, source state si', and destination state sj'. The exit cost of each state is constructed as follows: if the state is non-final, it is ⟨∞, ∞⟩; otherwise, if the state is final with exit cost c, it will be ⟨0, c⟩. Let n be the length of the longest history string in the model. For every φ-arc with (backoff) weight c, source state si, and destination state sj representing a history of length k, construct an ε-arc with source state si', destination state sj', and weight ⟨Φ⊗(n−k), c⟩, where Φ > 0 and Φ⊗(n−k) takes Φ to the (n − k)th power with the ⊗ operation. In the tropical semiring, ⊗ is +, so Φ⊗(n−k) = (n − k)Φ. For example, in a trigram model, if we are backing off from a bigram state h (history length = 1) to a unigram state, n − k = 2 − 0 = 2, so we set the backoff weight to ⟨2Φ, −log αh⟩ for some Φ > 0. In order to combine the model with another automaton or transducer, we would need to also convert those models to the ⟨T, T⟩ semiring.
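The arc-level part of this construction is small enough to sketch directly. The following Python fragment shows only the weight mapping (n-gram arcs get ⟨0, c⟩, backoff arcs become ε-arcs weighted ⟨(n − k)Φ, c⟩); the Arc container, the label conventions, and the value of Φ are assumptions made for illustration rather than the authors' code or the OpenFst API.

```python
from dataclasses import dataclass

EPSILON = "<eps>"   # conventional epsilon label
PHI = 1.0           # any constant > 0 works for the argument in Section 4

@dataclass
class Arc:
    label: str          # word label, or "<phi>" for a backoff arc
    weight: float       # -log P(w | h), or -log alpha_h on a backoff arc
    src: str
    dst: str
    dst_hist_len: int   # history length k of the destination state

def to_lexicographic(arc: Arc, n: int):
    """Map one tropical arc of M to (label, <T, T> weight, src, dst) in M'."""
    if arc.label == "<phi>":
        # Backoff arc: epsilon label, (n - k) * PHI in the first dimension.
        return (EPSILON, ((n - arc.dst_hist_len) * PHI, arc.weight), arc.src, arc.dst)
    # n-gram arc: keep the label, the cost moves to the second dimension.
    return (arc.label, (0.0, arc.weight), arc.src, arc.dst)

# Trigram example from the text: backing off from a bigram state (history
# length 1) to the unigram state (k = 0) with n = 2 gives first dimension 2*PHI.
backoff = Arc("<phi>", 0.69, "h=w1", "h=<eps>", 0)
print(to_lexicographic(backoff, n=2)[1])   # (2.0, 0.69)
```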
For these automata, we simply use a default transformation such that every transition with weight c is assigned weight ⟨0, c⟩. For example, given a word lattice L, we convert the lattice to L' in the lexicographic semiring using this default transformation, and then perform the intersection L' ∩ M'. By removing epsilon transitions and determinizing the result, the low cost path for any given string will be retained in the result, which will correspond to the path achieved with φ-arcs. Finally we project the second dimension of the ⟨T, T⟩ weights to produce a lattice in the tropical semiring, which is equivalent to the result of L ∩ M, i.e.,

C2(det(eps-rem(L' ∩ M'))) = L ∩ M

where C2 denotes projecting the second dimension of the ⟨T, T⟩ weights, det(·) denotes determinization, and eps-rem(·) denotes ε-removal.

4 Proof

We wish to prove that for any machine N, ShortestPath(M' ∩ N') passes through the equivalent states in M' to those passed through in M for ShortestPath(M ∩ N). Therefore determinization of the resulting intersection after ε-removal yields the same topology as intersection with the equivalent φ machine. Intuitively, since the first dimension of the ⟨T, T⟩ weights is 0 for n-gram arcs and > 0 for backoff arcs, the shortest path will traverse the fewest possible backoff arcs; further, since higher order backoff arcs cost less in the first dimension of the ⟨T, T⟩ weights in M', the shortest path will include n-gram arcs at their earliest possible point.

We prove this by induction on the state sequence of the path p/p' up to a given state si/si' in the respective machines M/M'.

Base case: If p/p' is of length 0, and therefore the states si/si' are the initial states of the respective machines, the proposition clearly holds.

Inductive step: Now suppose that p/p' visits s0...si / s0'...si' and we have therefore reached si/si' in the respective machines. Suppose the cumulated weights of p/p' are W and ⟨Ψ, W⟩, respectively. We wish to show that whichever state sj is next visited on p (i.e., the path becomes s0...si sj), the equivalent state sj' is visited on p' (i.e., the path becomes s0'...si' sj'). Let w be the next symbol to be matched leaving states si and si'. There are four cases to consider: (1) there is an n-gram arc leaving states si and si' labeled with w, but no backoff arc leaving the state; (2) there is no n-gram arc labeled with w leaving the states, but there is a backoff arc; (3) there is no n-gram arc labeled with w and no backoff arc leaving the states; and (4) there is both an n-gram arc labeled with w and a backoff arc leaving the states. In cases (1) and (2), there is only one possible transition to take in either M or M', and based on the algorithm for construction of M' given in Section 3.2, these transitions will point to sj and sj' respectively. Case (3) leads to failure of intersection with either machine. This leaves case (4) to consider. In M, since there is a transition leaving state si labeled with w, the backoff arc, which is a failure transition, cannot be traversed, hence the destination of the n-gram arc sj will be the next state in p. However, in M', both the n-gram transition labeled with w and the backoff transition, now labeled with ε, can be traversed.
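Before the formal argument for case (4), a toy numeric check of the intuition above may help. It restates the pair operations sketched in Section 2 so the fragment runs on its own; the arc costs are made up for illustration.

```python
# Toy check of case (4): the direct n-gram arc beats the backoff route in the
# lexicographic order, because the backoff (epsilon) arc pays (n - k) * PHI in
# the first dimension while every n-gram arc pays 0 there.

INF = float("inf")
Zero, One = (INF, INF), (0.0, 0.0)

def plus(a, b):               # lexicographic min
    return a if a <= b else b

def times(a, b):              # pairwise addition (tropical product per dimension)
    return (a[0] + b[0], a[1] + b[1])

def path_weight(arcs):
    total = One
    for w in arcs:
        total = times(total, w)
    return total

# Direct: emit w on the n-gram arc leaving si'.
direct = [(0.0, 2.0)]
# Backoff route: epsilon arc (first dimension 2 * PHI with PHI = 1), then an
# n-gram arc for w out of the lower order state, with a cheaper second dimension.
via_backoff = [(2.0, 0.5), (0.0, 1.3)]

best = Zero
for p in (direct, via_backoff):
    best = plus(best, path_weight(p))

print(best)        # (0.0, 2.0): the direct arc wins on dimension 1
print(best[1])     # 2.0: projecting dimension 2 recovers the failure-arc cost
```

Note that a plain epsilon approximation in the tropical semiring would pick the backoff route here (1.8 < 2.0), which is the kind of shortest-path discrepancy reported for the epsilon approximation in Section 5.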
What we will now prove is that the shortest path through M' cannot include taking the backoff arc in this case. In order to emit w by taking the backoff arc out of state si', one or more backoff (ε) transitions must be taken, followed by an n-gram arc labeled with w. Let k be the order of the history represented by state si', hence the cost of the first backoff arc is ⟨(n − k)Φ, −log(αsi')⟩ in our semiring. If we traverse m backoff arcs prior to emitting the w, the first dimension of our accumulated cost will be m(n − k + (m − 1)/2)Φ, based on our algorithm for construction of M' given in Section 3.2. Let sl' be the destination state after traversing m backoff arcs followed by an n-gram arc labeled with w. Note that, by definition, m ≤ k, and k − m + 1 is the order of state sl'. Based on the construction algorithm, the state sl' is also reachable by first emitting w from state si' to reach state sj', followed by some number of backoff transitions. The order of state sj' is either k (if k is the highest order in the model) or k + 1 (by extending the history of state si' by one word). If it is of order k, then it will require m − 1 backoff arcs to reach state sl', one fewer than the path to state sl' that begins with a backoff arc, for a total cost of (m − 1)(n − k + (m − 1)/2)Φ, which is less than m(n − k + (m − 1)/2)Φ. If state sj' is of order k + 1, there will be m backoff arcs to reach state sl', but with a total cost of m(n − (k + 1) + (m − 1)/2)Φ = m(n − k + (m − 3)/2)Φ, which is also less than m(n − k + (m − 1)/2)Φ. Hence the state sl' can always be reached from si' with a lower cost through state sj' than by first taking the backoff arc from si'. Therefore the shortest path on M' must follow s0'...si' sj'. This completes the proof.

5 Experimental Comparison of ε, φ and ⟨T, T⟩ encoded language models

For our experiments we used lattices derived from a very large vocabulary continuous speech recognition system, which was built for the 2007 GALE Arabic speech recognition task, and used in the work reported in Lehr and Shafran (2011). The lexicographic semiring was evaluated on the development set (2.6 hours of broadcast news and conversations; 18K words). The 888 word lattices for the development set were generated using a competitive baseline system with acoustic models trained on about 1000 hours of Arabic broadcast data and a 4-gram language model. The language model, consisting of 122M n-grams, was estimated by interpolation of 14 components. The vocabulary is relatively large at 737K, and the associated dictionary has only single pronunciations. The language model was converted to the automaton topology described earlier, and represented in three ways: first as an approximation of a failure machine using epsilons instead of failure arcs; second as a correct failure machine; and third using the lexicographic construction derived in this paper. The three versions of the LM were evaluated by intersecting them with the 888 lattices of the development set. The overall error rate for the systems was 24.8%, comparable to the state of the art on this task.[1] For the shortest paths, the failure and lexicographic machines always produced identical lattices (as determined by FST equivalence); in contrast, 81% of the shortest paths from the epsilon approximation are different, at least in terms of weights, from the shortest paths using the failure LM.
For full lattices, 42 (4.7%) of the lexicographic outputs differ from the failure LM outputs, due to small floating point rounding issues; 863 (97%) of the epsilon approximation outputs differ. In terms of size, the failure LM, with 5.7 million arcs, requires 97 Mb. The equivalent ⟨T, T⟩-lexicographic LM requires 120 Mb, due to the doubling of the size of the weights.[2] To measure speed, we performed the intersections 1000 times for each of our 888 lattices on a 2993 MHz Intel® Xeon® CPU, and took the mean times for each of our methods. The 888 lattices were processed with a mean of 1.62 seconds in total (1.8 msec per lattice) using the failure LM; using the ⟨T, T⟩-lexicographic LM required 1.8 seconds (2.0 msec per lattice), and is thus about 11% slower. Epsilon approximation, where the failure arcs are approximated with epsilon arcs, took 1.17 seconds (1.3 msec per lattice). The slightly slower speeds for the exact methods, using the failure LM and ⟨T, T⟩, can be related to the overhead of computing the failure function at runtime, and to determinization, respectively.

[1] The error rate is a couple of points higher than in Lehr and Shafran (2011) since we discarded non-lexical words, which are absent in the maximum likelihood estimated language model and are typically augmented to the unigram backoff state with an arbitrary cost, fine-tuned to optimize performance for a given task.

[2] If size became an issue, the first dimension of the ⟨T, T⟩ weights can be represented by a single byte.

6 Conclusion

In this paper we have introduced a novel application of the lexicographic semiring, proved that it can be used to provide an exact encoding of language model topologies with failure arcs, and provided experimental results that demonstrate its efficiency. Since the ⟨T, T⟩-lexicographic semiring is both left- and right-distributive, optimizations such as minimization are possible. The particular ⟨T, T⟩-lexicographic semiring we have used here is but one of many possible lexicographic encodings. We are currently exploring the use of a lexicographic semiring that involves different semirings in the various dimensions, for the integration of part-of-speech taggers into language models. An implementation of the lexicographic semiring by the second author is already available as part of the OpenFst package (Allauzen et al., 2007). The methods described here are part of the NGram language-model-training toolkit, soon to be released at opengrm.org.

Acknowledgments

This research was supported in part by NSF Grant #IIS-0811745 and DARPA grant #HR0011-09-10041. Any opinions, findings, conclusions or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views of the NSF or DARPA. We thank Maider Lehr for help in preparing the test data. We also thank the ACL reviewers for valuable comments.

References

Cyril Allauzen, Mehryar Mohri, and Brian Roark. 2003. Generalized algorithms for constructing statistical language models. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 40–47.

Cyril Allauzen, Michael Riley, Johan Schalkwyk, Wojciech Skut, and Mehryar Mohri. 2007. OpenFst: A general and efficient weighted finite-state transducer library. In Proceedings of the Twelfth International Conference on Implementation and Application of Automata (CIAA 2007), Lecture Notes in Computer Science, volume 4793, pages 11–23, Prague, Czech Republic. Springer.

Jonathan Golan. 1999. Semirings and their Applications. Kluwer Academic Publishers, Dordrecht.

Werner Kuich and Arto Salomaa. 1986. Semirings, Automata, Languages. Number 5 in EATCS Monographs on Theoretical Computer Science. Springer-Verlag, Berlin, Germany.

Maider Lehr and Izhak Shafran. 2011. Learning a discriminative weighted finite-state transducer for speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, July.

Mehryar Mohri, Fernando C. N. Pereira, and Michael Riley. 2002. Weighted finite-state transducers in speech recognition. Computer Speech and Language, 16(1):69–88.

Mehryar Mohri. 2002. Semiring framework and algorithms for shortest-distance problems. Journal of Automata, Languages and Combinatorics, 7(3):321–350.

Brian Roark and Richard Sproat. 2007. Computational Approaches to Morphology and Syntax. Oxford University Press, Oxford.

Brian Roark, Murat Saraclar, and Michael Collins. 2007. Discriminative n-gram language modeling. Computer Speech and Language, 21(2):373–392.

Reference: text



similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('semiring', 0.421), ('backoff', 0.361), ('failure', 0.312), ('arc', 0.289), ('lexicographic', 0.249), ('arcs', 0.178), ('transitions', 0.169), ('ht', 0.16), ('state', 0.152), ('epsilon', 0.147), ('history', 0.139), ('topology', 0.112), ('destination', 0.103), ('allauzen', 0.099), ('lattices', 0.096), ('transition', 0.093), ('encoding', 0.091), ('lehr', 0.09), ('path', 0.087), ('transducers', 0.087), ('leaving', 0.081), ('tropical', 0.079), ('mohri', 0.078), ('automata', 0.077), ('semirings', 0.073), ('mehryar', 0.073), ('roark', 0.072), ('states', 0.069), ('msec', 0.067), ('transducer', 0.066), ('ti', 0.064), ('sj', 0.063), ('shortest', 0.062), ('automaton', 0.062), ('wfst', 0.059), ('lm', 0.057), ('shafran', 0.055), ('cost', 0.055), ('exact', 0.053), ('lattice', 0.052), ('intersection', 0.05), ('dimension', 0.046), ('brian', 0.045), ('determinization', 0.045), ('kuich', 0.045), ('maider', 0.045), ('shortestpath', 0.045), ('topologies', 0.045), ('labeled', 0.044), ('approximation', 0.04), ('het', 0.04), ('prove', 0.038), ('si', 0.037), ('openfst', 0.037), ('izhak', 0.037), ('khe', 0.037), ('weight', 0.035), ('tco', 0.034), ('km', 0.034), ('gm', 0.034), ('longest', 0.033), ('traversed', 0.033), ('fst', 0.033), ('hi', 0.032), ('speech', 0.032), ('weighted', 0.031), ('savings', 0.031), ('weights', 0.031), ('implicit', 0.031), ('composition', 0.03), ('cyril', 0.03), ('equivalent', 0.03), ('deterministic', 0.029), ('paths', 0.028), ('earliest', 0.027), ('sproat', 0.027), ('stream', 0.026), ('suffix', 0.026), ('wi', 0.026), ('det', 0.026), ('ci', 0.026), ('reach', 0.025), ('respective', 0.025), ('machines', 0.024), ('proof', 0.024), ('broadcast', 0.024), ('th', 0.024), ('log', 0.022), ('unigram', 0.022), ('followed', 0.021), ('hence', 0.021), ('sra', 0.02), ('fifo', 0.02), ('mhe', 0.02), ('rco', 0.02), ('optimizations', 0.02), ('ciaa', 0.02), ('twheh', 0.02), ('nngg', 0.02), ('wojciech', 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999946 210 acl-2011-Lexicographic Semirings for Exact Automata Encoding of Sequence Models

Author: Brian Roark ; Richard Sproat ; Izhak Shafran

Abstract: In this paper we introduce a novel use of the lexicographic semiring and motivate its use for speech and language processing tasks. We prove that the semiring allows for exact encoding of backoff models with epsilon transitions. This allows for off-line optimization of exact models represented as large weighted finite-state transducers in contrast to implicit (on-line) failure transition representations. We present preliminary empirical results demonstrating that, even in simple intersection scenarios amenable to the use of failure transitions, the use of the more powerful lexicographic semiring is competitive in terms of time of intersection. 1 Introduction and Motivation Representing smoothed n-gram language models as weighted finite-state transducers (WFST) is most naturally done with a failure transition, which reflects the semantics of the “otherwise” formulation of smoothing (Allauzen et al., 2003). For example, the typical backoff formulation of the probability of a word w given a history h is as follows P(w | h) = ?αPh(Pw( |w h )| h0) oift hc(ehrwwis)e > 0 (1) where P is an empirical estimate of the probability that reserves small finite probability for unseen n-grams; αh is a backoff weight that ensures normalization; and h0 is a backoff history typically achieved by excising the earliest word in the history h. The principle benefit of encoding the WFST in this way is that it only requires explicitly storing n-gram transitions for observed n-grams, i.e., count greater than zero, as opposed to all possible n-grams of the given order which would be infeasible in for example large vocabulary speech recognition. This is a massive space savings, and such an approach is also used for non-probabilistic stochastic language 1 models, such as those trained with the perceptron algorithm (Roark et al., 2007), as the means to access all and exactly those features that should fire for a particular sequence in a deterministic automaton. Similar issues hold for other finite-state se- quence processing problems, e.g., tagging, bracketing or segmenting. Failure transitions, however, are an implicit method for representing a much larger explicit automaton in the case of n-gram models, all possible n-grams for that order. During composition with the model, the failure transition must be interpreted on the fly, keeping track of those symbols that have already been found leaving the original state, and only allowing failure transition traversal for symbols that have not been found (the semantics of “otherwise”). This compact implicit representation cannot generally be preserved when composing with other models, e.g., when combining a language model with a pronunciation lexicon as in widelyused FST approaches to speech recognition (Mohri et al., 2002). Moving from implicit to explicit representation when performing such a composition leads to an explosion in the size of the resulting transducer, frequently making the approach intractable. In practice, an off-line approximation to the model is made, typically by treating the failure transitions as epsilon transitions (Mohri et al., 2002; Allauzen et al., 2003), allowing large transducers to be composed and optimized off-line. These complex approximate transducers are then used during first-pass – decoding, and the resulting pruned search graphs (e.g., word lattices) can be rescored with exact language models encoded with failure transitions. 
Similar problems arise when building, say, POStaggers as WFST: not every pos-tag sequence will have been observed during training, hence failure transitions will achieve great savings in the size of models. Yet discriminative models may include complex features that combine both input stream (word) and output stream (tag) sequences in a single feature, yielding complicated transducer topologies for which effective use of failure transitions may not Proceedings Pofo trhtlea 4nd9,th O Arnegnouna,l J Muneeet 1in9g-2 o4f, t 2h0e1 A1s.s ?oc ci2a0t1io1n A fosrso Ccioamtiopnut faotrio Cnoaml Lpiuntgauti osntiacls: Lsihnogrutpisatipcesrs, pages 1–5, be possible. An exact encoding using other mechanisms is required in such cases to allow for off-line representation and optimization. In this paper, we introduce a novel use of a semiring the lexicographic semiring (Golan, 1999) which permits an exact encoding of these sorts of models with the same compact topology as with failure transitions, but using epsilon transitions. Unlike the standard epsilon approximation, this semiring allows for an exact representation, while also allowing (unlike failure transition approaches) for off-line – – composition with other transducers, with all the optimizations that such representations provide. In the next section, we introduce the semiring, followed by a proof that its use yields exact representations. We then conclude with a brief evaluation of the cost of intersection relative to failure transitions in comparable situations. 2 The Lexicographic Semiring Weighted automata are automata in which the transitions carry weight elements of a semiring (Kuich and Salomaa, 1986). A semiring is a ring that may lack negation, with two associative operations ⊕ and ⊗lac akn nde tghaetiiro nre,ws piethcti twveo i dasesnotictiya ievleem oepnertas t 0io annsd ⊕ ⊕ 1. a nAd ⊗com anmdo tnh esierm reirsipnegc tiivn es pideeenchti ayn edl elmanegnutasg 0e processing, and one that we will be using in this paper, is the tropical semiring (R ∪ {∞}, min, +, ∞, 0), i.e., tmhein t riosp tihcea l⊕ s omfi trhineg gs(e mRi∪rin{g∞ (w},imthi nid,e+nt,i∞ty ,∞0)), ia.end., m+ ins tihse t ⊗e o⊕f othfe t hseem seirminigri n(wg i(wth tidhe indteitnyt t0y). ∞Th)i asn ids a+pp isro thpreia ⊗te o ofof rth h pee srfeomrmiri nngg (Vwitietrhb iid seenatricthy u0s).in Tgh negative log probabilities – we add negative logs along a path and take the min between paths. A hW1 , W2 . . . Wni-lexicographic weight is a tupleA o hfW weights wherei- eeaxichco gorfa pthhiec w weeiigghhtt cisla ass teusW1, W2 . . . Wn, must observe the path property (Mohri, 2002). The path property of a semiring K is defined in terms of the natural order on K such that: a <2 ws e3 w&o4;)r The term “lexicographic” is an apt term for this semiring since the comparison for ⊕ is like the lexicseomgriaripnhgic s icnocmep thareis coonm opfa srtisrionngs f,o rco ⊕m ipsa lrikineg t thhee l feixrist- elements, then the second, and so forth. 3 Language model encoding 3.1 Standard encoding For language model encoding, we will differentiate between two classes of transitions: backoff arcs (labeled with a φ for failure, or with ? using our new semiring); and n-gram arcs (everything else, labeled with the word whose probability is assigned). Each state in the automaton represents an n-gram history string h and each n-gram arc is weighted with the (negative log) conditional probability of the word w labeling the arc given the history h. 
For a given history h and n-gram arc labeled with a word w, the destination of the arc is the state associated with the longest suffix of the string hw that is a history in the model. This will depend on the Markov order of the n-gram model. For example, consider the trigram model schematic shown in Figure 1, in which only history sequences of length 2 are kept in the model. Thus, from history hi = wi−2wi−1, the word wi transitions to hi+1 = wi−1wi, w2hii−ch1 is the longest suffix of hiwi in the modie−l1. As detailed in the “otherwise” semantics of equation 1, backoff arcs transition from state h to a state h0, typically the suffix of h of length |h| − 1, with we,i tgyhpti c(a−lllyog th αeh s)u. Wixe o cfa hll othf ele ndgestthin |hat|io −n 1s,ta wtei ah bwaecikgohtff s−taltoe.g αThis recursive backoff topology terminates at the unigram state, i.e., h = ?, no history. Backoff states of order k may be traversed either via φ-arcs from the higher order n-gram of order k + 1or via an n-gram arc from a lower order n-gram of order k −1. This means that no n-gram arc can enter tohred ezre rko−eth1. .o Trhdiesr mstaeaten s(fi tnhaalt bnaoc nk-ogfrfa),m ma andrc f cualln-o enrdteerr states history strings of length n − 1 for a model sotfa toersde —r n h may ihnagvse o n-gram a nrc −s e 1nt feorri nag m forodeml other full-order states as well as from backoff states of history size n − 2. — s—to 3.2 Encoding with lexicographic semiring For an LM machine M on the tropical semiring with failure transitions, which is deterministic and has the wih-2 =i1wφ/-logPwα(hi-1|whiφ)/-logwPhiα(+-1w i=|-1wiφ)/-logPαw(hi+)1 Figure 1: Deterministic finite-state representation of n-gram models with negative log probabilities (tropical semiring). The symbol φ labels backoff transitions. Modified from Roark and Sproat (2007), Figure 6.1. path property, we can simulate φ-arcs in a standard LM topology by a topologically equivalent machine M0 on the lexicographic hT, Ti semiring, where φ has boenen th hreep l eaxciceod gwraitphh eicps hilTo,nT, ais sfeomlloirwinsg. ,F worh every n-gram arc with label w and weight c, source state si and destination state sj, construct an n-gram arc with label w, weight h0, ci, source state si0, and deswtiniathtio lanb estla wte, s0j. gThhte h e0x,citi c, soosut rocfe e satcahte s state is constructed as follows. If the state is non-final, h∞, ∞i . sOttruhectrewdis aes fifo litl ofiwnsa.l Iwf tihthe e sxtiatt ec iosst n co nit- fwinilall ,b he∞ ∞h0,,∞ ∞cii . hLeertw n sbee tfh iet length oithf th exei longest history string iin. the model. For every φ-arc with (backoff) weight c, source state si, and destination state sj representing a history of length k, construct an ?-arc with source state si0, destination state s0j, and weight hΦ⊗(n−k) , ci, where Φ > 0 and Φ⊗(n−k) takes Φ to the (n − k)th power with the ⊗ operation. In the tropical semiring, ⊗ ris w +, so Φe⊗ ⊗(n o−pke) = (n − k)Φ. tFroorp iecxaalm sepmlei,r i nng a, t⊗rigi sra +m, msoo Φdel, if we= =ar (en b −ac kki)nΦg. off from a bigram state h (history length = 1) to a unigram state, n − k = 2 − 0 = 2, so we set the buanicgkroafmf w steaigteh,t nto −h2 kΦ, = =− l2og − α 0h) = =for 2 ,s soome w Φe s>et 0 th. cInk ofrfd were gtoh tco tom hb2iΦn,e −thleo gmαodel with another automaton or transducer, we would need to also convert those models to the hT, Ti semiring. 
For these aveutrotm thaotsae, mwoed seilmsp toly t uese hT a, Tdeif saeumlt rtrinagn.sf Foromra thtieosen such that every transition with weight c is assigned weight h0, ci . For example, given a word lattice wL,e iwghe tco h0n,vceir.t the lattice to L0 in the lexicographic semiring using this default transformation, and then perform the intersection L0 ∩ M0. By removing epsilon transitions and determ∩in Mizing the result, the low cost path for any given string will be retained in the result, which will correspond to the path achieved with Finally we project the second dimension of the hT, Ti weights to produce a lattice dini mtheen strioonpi ocfal t seem hTir,iTngi, wweihgichhts i tso e pqruoidvuacleen at ltaot tichee 3 result of L ∩ M, i.e., φ-arcs. C2(det(eps-rem(L0 ∩ M0))) = L ∩ M where C2 denotes projecting the second-dimension wofh tehree ChT, Ti weights, det(·) denotes determinizatoifon t,h aen hdT e,pTsi-r wemei(g·h) sde,n doette(s· )?- dreenmootveasl. d 4 Proof We wish to prove that for any machine N, ShortestPath(M0 ∩ N0) passes through the equivalent states in M0 to∩ t Nhose passed through in M for ShortestPath(M ∩ N) . Therefore determinization Sofh othrtee rsetsPualttihn(gM Mint ∩er Nse)c.ti Tonh rafefteorr e?- dreemteromvianl yzaiteilodns the same topology as intersection with the equivalent φ machine. Intuitively, since the first dimension of the hT, Ti weights is 0 for n-gram arcs and > 0 foofr t h beac hkTo,ffT arcs, tghhet ss ihsor 0te fostr p na-tghr awmil la rtcrasv aenrdse > >the 0 fewest possible backoff arcs; further, since higherorder backoff arcs cost less in the first dimension of the hT, Ti weights in M0, the shortest path will intchleud heT n-gram iagrhcst sa ti nth Meir earliest possible point. We prove this by induction on the state-sequence of the path p/p0 up to a given state si/si0 in the respective machines M/M0. Base case: If p/p0 is of length 0, and therefore the states si/si0 are the initial states of the respective machines, the proposition clearly holds. Inductive step: Now suppose that p/p0 visits s0...si/s00...si0 and we have therefore reached si/si0 in the respective machines. Suppose the cumulated weights of p/p0 are W and hΨ, Wi, respectively. We wish to show thaarte w Whic anhedv heΨr sj isi n reexspt evcitsiivteedly o. nW p (i.e., the path becomes s0...sisj) the equivalent state s0 is visited on p0 (i.e., the path becomes s00...si0s0j). Let w be the next symbol to be matched leaving states si and si0. There are four cases to consider: (1) there is an n-gram arc leaving states si and si0 labeled with w, but no backoff arc leaving the state; (2) there is no n-gram arc labeled with w leaving the states, but there is a backoff arc; (3) there is no ngram arc labeled with w and no backoff arc leaving the states; and (4) there is both an n-gram arc labeled with w and a backoff arc leaving the states. In cases (1) and (2), there is only one possible transition to take in either M or M0, and based on the algorithm for construction of M0 given in Section 3.2, these transitions will point to sj and s0j respectively. Case (3) leads to failure of intersection with either machine. This leaves case (4) to consider. In M, since there is a transition leaving state si labeled with w, the backoff arc, which is a failure transition, cannot be traversed, hence the destination of the n-gram arc sj will be the next state in p. However, in M0, both the n-gram transition labeled with w and the backoff transition, now labeled with ?, can be traversed. 
What we will now prove is that the shortest path through M0 cannot include taking the backoff arc in this case. In order to emit w by taking the backoff arc out of state si0, one or more backoff (?) transitions must be taken, followed by an n-gram arc labeled with w. Let k be the order of the history represented by state si0, hence the cost of the first backoff arc is h(n − k)Φ, −log(αsi0 )i in our semiring. If we tirsa vhe(rns e− km) Φ b,a−ckloofgf( αarcs) ip irnior o tro eemmiitrtiinngg. the w, the first dimension of our accumulated cost will be m(n −k + m−21)Φ, based on our algorithm for consmtr(unct−ionk +of M0 given in Section 3.2. Let sl0 be the destination state after traversing m backoff arcs followed by an n-gram arc labeled with w. Note that, by definition, m ≤ k, and k − m + 1 is the orbdeyr oeffi nstitaitoen ,sl0 m. B≤ as ked, onnd t khe − c mons +tru 1ct iiosn t ealg oor-rithm, the state sl0 is also reachable by first emitting w from state si0 to reach state s0j followed by some number of backoff transitions. The order of state s0j is either k (if k is the highest order in the model) or k + 1 (by extending the history of state si0 by one word). If it is of order k, then it will require m −1 backoff arcs to reach state sl0, one fewer tqhuainre t mhe− −pa1t hb ctok osftfat aer ss0l oth raeta cbheg sitanste w sith a backoff arc, for a total cost of (m − 1) (n − k + m−21)Φ which is less than m(n − k + m−21)Φ. If state s0j icish o ifs o lerdsser hka n+ m1,( th −er ke +will be m backoff arcs to reach state sl0, but with a total cost of m(n − (k + 1) + m−21)Φ m(n − k + m−23)Φ = which is also less than m(n − km + m−21)Φ. Hence twheh cstha ties asl0ls coa lne asl twhaayns mbe( n re −ac khe +d from si0 with a lower cost through state s0j than by first taking the backoff arc from si0. Therefore the shortest path on M0 must follow s00...si0s0j. 2 This completes the proof. 5 Experimental Comparison of ?, φ and hT, Ti encoded language models For our experiments we used lattices derived from a very large vocabulary continuous speech recognition system, which was built for the 2007 GALE Arabic speech recognition task, and used in the work reported in Lehr and Shafran (201 1). The lexicographic semiring was evaluated on the development 4 set (2.6 hours of broadcast news and conversations; 18K words). The 888 word lattices for the development set were generated using a competitive baseline system with acoustic models trained on about 1000 hrs of Arabic broadcast data and a 4-gram language model. The language model consisting of 122M n-grams was estimated by interpolation of 14 components. The vocabulary is relatively large at 737K and the associated dictionary has only single pronunciations. The language model was converted to the automaton topology described earlier, and represented in three ways: first as an approximation of a failure machine using epsilons instead of failure arcs; second as a correct failure machine; and third using the lexicographic construction derived in this paper. The three versions of the LM were evaluated by intersecting them with the 888 lattices of the development set. The overall error rate for the systems was 24.8%—comparable to the state-of-theart on this task1 . For the shortest paths, the failure and lexicographic machines always produced identical lattices (as determined by FST equivalence); in contrast, 81% of the shortest paths from the epsilon approximation are different, at least in terms of weights, from the shortest paths using the failure LM. 
For full lattices, 42 (4.7%) of the lexicographic outputs differ from the failure LM outputs, due to small floating point rounding issues; 863 (97%) of the epsilon approximation outputs differ. In terms of size, the failure LM, with 5.7 million arcs, requires 97 Mb. The equivalent ⟨T, T⟩-lexicographic LM requires 120 Mb, due to the doubling of the size of the weights.²

To measure speed, we performed the intersections 1000 times for each of our 888 lattices on a 2993 MHz Intel Xeon CPU, and took the mean times for each of our methods. The 888 lattices were processed with a mean of 1.62 seconds in total (1.8 msec per lattice) using the failure LM; using the ⟨T, T⟩-lexicographic LM required 1.8 seconds (2.0 msec per lattice), and is thus about 11% slower. Epsilon approximation, where the failure arcs are approximated with epsilon arcs, took 1.17 seconds (1.3 msec per lattice). The slightly slower speeds for the exact methods, using the failure LM and the ⟨T, T⟩-lexicographic LM, can be related to the overhead of computing the failure function at runtime, and to determinization, respectively.

¹The error rate is a couple of points higher than in Lehr and Shafran (2011) since we discarded non-lexical words, which are absent in the maximum likelihood estimated language model and are typically augmented to the unigram backoff state with an arbitrary cost, fine-tuned to optimize performance for a given task.

²If size became an issue, the first dimension of the ⟨T, T⟩ weights can be represented by a single byte.

6 Conclusion

In this paper we have introduced a novel application of the lexicographic semiring, proved that it can be used to provide an exact encoding of language model topologies with failure arcs, and provided experimental results that demonstrate its efficiency. Since the ⟨T, T⟩-lexicographic semiring is both left- and right-distributive, other optimizations such as minimization are possible. The particular ⟨T, T⟩-lexicographic semiring we have used here is but one of many possible lexicographic encodings. We are currently exploring the use of a lexicographic semiring that involves different semirings in the various dimensions, for the integration of part-of-speech taggers into language models. An implementation of the lexicographic semiring by the second author is already available as part of the OpenFst package (Allauzen et al., 2007). The methods described here are part of the NGram language-model-training toolkit, soon to be released at opengrm.org.

Acknowledgments

This research was supported in part by NSF Grant #IIS-0811745 and DARPA grant #HR0011-09-1-0041. Any opinions, findings, conclusions or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views of the NSF or DARPA. We thank Maider Lehr for help in preparing the test data. We also thank the ACL reviewers for valuable comments.

References

Cyril Allauzen, Mehryar Mohri, and Brian Roark. 2003. Generalized algorithms for constructing statistical language models. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 40–47.

Cyril Allauzen, Michael Riley, Johan Schalkwyk, Wojciech Skut, and Mehryar Mohri. 2007. OpenFst: A general and efficient weighted finite-state transducer library.
In Proceedings of the Twelfth International Conference on Implementation and Application of Automata (CIAA 2007), Lecture Notes in Computer Science, volume 4793, pages 11–23, Prague, Czech Republic. Springer.

Jonathan Golan. 1999. Semirings and their Applications. Kluwer Academic Publishers, Dordrecht.

Werner Kuich and Arto Salomaa. 1986. Semirings, Automata, Languages. Number 5 in EATCS Monographs on Theoretical Computer Science. Springer-Verlag, Berlin, Germany.

Maider Lehr and Izhak Shafran. 2011. Learning a discriminative weighted finite-state transducer for speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, July.

Mehryar Mohri, Fernando C. N. Pereira, and Michael Riley. 2002. Weighted finite-state transducers in speech recognition. Computer Speech and Language, 16(1):69–88.

Mehryar Mohri. 2002. Semiring framework and algorithms for shortest-distance problems. Journal of Automata, Languages and Combinatorics, 7(3):321–350.

Brian Roark and Richard Sproat. 2007. Computational Approaches to Morphology and Syntax. Oxford University Press, Oxford.

Brian Roark, Murat Saraclar, and Michael Collins. 2007. Discriminative n-gram language modeling. Computer Speech and Language, 21(2):373–392.

2 0.16479291 186 acl-2011-Joint Training of Dependency Parsing Filters through Latent Support Vector Machines

Author: Colin Cherry ; Shane Bergsma

Abstract: Graph-based dependency parsing can be sped up significantly if implausible arcs are eliminated from the search-space before parsing begins. State-of-the-art methods for arc filtering use separate classifiers to make pointwise decisions about the tree; they label tokens with roles such as root, leaf, or attaches-to-the-left, and then filter arcs accordingly. Because these classifiers overlap substantially in their filtering consequences, we propose to train them jointly, so that each classifier can focus on the gaps of the others. We integrate the various pointwise decisions as latent variables in a single arc-level SVM classifier. This novel framework allows us to combine nine pointwise filters, and adjust their sensitivity using a shared threshold based on arc length. Our system filters 32% more arcs than the independently-trained classifiers, without reducing filtering speed. This leads to faster parsing with no reduction in accuracy.

3 0.16461754 46 acl-2011-Automated Whole Sentence Grammar Correction Using a Noisy Channel Model

Author: Y. Albert Park ; Roger Levy

Abstract: Automated grammar correction techniques have seen improvement over the years, but there is still much room for increased performance. Current correction techniques mainly focus on identifying and correcting a specific type of error, such as verb form misuse or preposition misuse, which restricts the corrections to a limited scope. We introduce a novel technique, based on a noisy channel model, which can utilize the whole sentence context to determine proper corrections. We show how to use the EM algorithm to learn the parameters of the noise model, using only a data set of erroneous sentences, given the proper language model. This frees us from the burden of acquiring a large corpus of corrected sentences. We also present a cheap and efficient way to provide automated evaluation results for grammar corrections by using BLEU and METEOR, in contrast to the commonly used manual evaluations.

4 0.14681113 142 acl-2011-Generalized Interpolation in Decision Tree LM

Author: Denis Filimonov ; Mary Harper

Abstract: In the face of sparsity, statistical models are often interpolated with lower order (backoff) models, particularly in Language Modeling. In this paper, we argue that there is a relation between the higher order and the backoff model that must be satisfied in order for the interpolation to be effective. We show that in n-gram models, the relation is trivially held, but in models that allow arbitrary clustering of context (such as decision tree models), this relation is generally not satisfied. Based on this insight, we also propose a generalization of linear interpolation which significantly improves the performance of a decision tree language model.

5 0.098024912 243 acl-2011-Partial Parsing from Bitext Projections

Author: Prashanth Mannem ; Aswarth Dara

Abstract: Recent work has shown how a parallel corpus can be leveraged to build a syntactic parser for a target language by projecting an automatic source parse onto the target sentence using word alignments. The projected target dependency parses are not always fully connected, and thus not directly usable for training traditional dependency parsers. In this paper, we present a greedy non-directional parsing algorithm which doesn't need a fully connected parse and can learn from partial parses by utilizing available structural and syntactic information in them. Our parser achieved statistically significant improvements over a baseline system that trains on only fully connected parses for Bulgarian, Spanish and Hindi. It also gave a significant improvement over previously reported results for Bulgarian and set a benchmark for Hindi.

6 0.097468406 107 acl-2011-Dynamic Programming Algorithms for Transition-Based Dependency Parsers

7 0.07007046 143 acl-2011-Getting the Most out of Transition-based Dependency Parsing

8 0.066651195 171 acl-2011-Incremental Syntactic Language Models for Phrase-based Translation

9 0.062547065 176 acl-2011-Integrating surprisal and uncertain-input models in online sentence comprehension: formal techniques and empirical results

10 0.061712276 236 acl-2011-Optimistic Backtracking - A Backtracking Overlay for Deterministic Incremental Parsing

11 0.058023352 296 acl-2011-Terminal-Aware Synchronous Binarization

12 0.054207362 290 acl-2011-Syntax-based Statistical Machine Translation using Tree Automata and Tree Transducers

13 0.051957477 184 acl-2011-Joint Hebrew Segmentation and Parsing using a PCFGLA Lattice Parser

14 0.050703466 123 acl-2011-Exact Decoding of Syntactic Translation Models through Lagrangian Relaxation

15 0.049763598 15 acl-2011-A Hierarchical Pitman-Yor Process HMM for Unsupervised Part of Speech Induction

16 0.049506035 175 acl-2011-Integrating history-length interpolation and classes in language modeling

17 0.049136057 135 acl-2011-Faster and Smaller N-Gram Language Models

18 0.047635794 141 acl-2011-Gappy Phrasal Alignment By Agreement

19 0.046408135 272 acl-2011-Semantic Information and Derivation Rules for Robust Dialogue Act Detection in a Spoken Dialogue System

20 0.046306223 318 acl-2011-Unsupervised Bilingual Morpheme Segmentation and Alignment with Context-rich Hidden Semi-Markov Models


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.112), (1, -0.027), (2, -0.014), (3, -0.068), (4, -0.026), (5, -0.006), (6, -0.01), (7, 0.016), (8, 0.004), (9, 0.037), (10, -0.012), (11, 0.007), (12, 0.011), (13, 0.071), (14, 0.012), (15, 0.063), (16, -0.042), (17, 0.022), (18, -0.002), (19, -0.019), (20, 0.065), (21, -0.048), (22, 0.041), (23, -0.038), (24, -0.004), (25, -0.121), (26, -0.007), (27, -0.003), (28, -0.095), (29, -0.005), (30, 0.04), (31, -0.048), (32, -0.004), (33, 0.036), (34, 0.044), (35, -0.044), (36, 0.031), (37, 0.112), (38, 0.049), (39, -0.091), (40, -0.027), (41, -0.258), (42, 0.103), (43, -0.001), (44, -0.127), (45, -0.022), (46, -0.118), (47, -0.231), (48, 0.043), (49, 0.136)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96939868 210 acl-2011-Lexicographic Semirings for Exact Automata Encoding of Sequence Models

Author: Brian Roark ; Richard Sproat ; Izhak Shafran


2 0.59425884 107 acl-2011-Dynamic Programming Algorithms for Transition-Based Dependency Parsers

Author: Marco Kuhlmann ; Carlos Gomez-Rodriguez ; Giorgio Satta

Abstract: We develop a general dynamic programming technique for the tabulation of transition-based dependency parsers, and apply it to obtain novel, polynomial-time algorithms for parsing with the arc-standard and arc-eager models. We also show how to reverse our technique to obtain new transition-based dependency parsers from existing tabular methods. Additionally, we provide a detailed discussion of the conditions under which the feature models commonly used in transition-based parsing can be integrated into our algorithms.

3 0.58837545 142 acl-2011-Generalized Interpolation in Decision Tree LM

Author: Denis Filimonov ; Mary Harper

Abstract: In the face of sparsity, statistical models are often interpolated with lower order (backoff) models, particularly in Language Modeling. In this paper, we argue that there is a relation between the higher order and the backoff model that must be satisfied in order for the interpolation to be effective. We show that in n-gram models, the relation is trivially held, but in models that allow arbitrary clustering of context (such as decision tree models), this relation is generally not satisfied. Based on this insight, we also propose a generalization of linear interpolation which significantly improves the performance of a decision tree language model.

4 0.56459308 186 acl-2011-Joint Training of Dependency Parsing Filters through Latent Support Vector Machines

Author: Colin Cherry ; Shane Bergsma

Abstract: Graph-based dependency parsing can be sped up significantly if implausible arcs are eliminated from the search-space before parsing begins. State-of-the-art methods for arc filtering use separate classifiers to make pointwise decisions about the tree; they label tokens with roles such as root, leaf, or attaches-to-the-left, and then filter arcs accordingly. Because these classifiers overlap substantially in their filtering consequences, we propose to train them jointly, so that each classifier can focus on the gaps of the others. We integrate the various pointwise decisions as latent variables in a single arc-level SVM classifier. This novel framework allows us to combine nine pointwise filters, and adjust their sensitivity using a shared threshold based on arc length. Our system filters 32% more arcs than the independently-trained classifiers, without reducing filtering speed. This leads to faster parsing with no reduction in accuracy.

5 0.48309788 176 acl-2011-Integrating surprisal and uncertain-input models in online sentence comprehension: formal techniques and empirical results

Author: Roger Levy

Abstract: A system making optimal use of available information in incremental language comprehension might be expected to use linguistic knowledge together with current input to revise beliefs about previous input. Under some circumstances, such an error-correction capability might induce comprehenders to adopt grammatical analyses that are inconsistent with the true input. Here we present a formal model of how such input-unfaithful garden paths may be adopted and the difficulty incurred by their subsequent disconfirmation, combining a rational noisy-channel model of syntactic comprehension under uncertain input with the surprisal theory of incremental processing difficulty. We also present a behavioral experiment confirming the key empirical predictions of the theory.

6 0.42935246 46 acl-2011-Automated Whole Sentence Grammar Correction Using a Noisy Channel Model

7 0.41995075 295 acl-2011-Temporal Restricted Boltzmann Machines for Dependency Parsing

8 0.40909636 243 acl-2011-Partial Parsing from Bitext Projections

9 0.4068093 236 acl-2011-Optimistic Backtracking - A Backtracking Overlay for Deterministic Incremental Parsing

10 0.38673872 135 acl-2011-Faster and Smaller N-Gram Language Models

11 0.36512578 234 acl-2011-Optimal Head-Driven Parsing Complexity for Linear Context-Free Rewriting Systems

12 0.36062488 36 acl-2011-An Efficient Indexer for Large N-Gram Corpora

13 0.35756382 301 acl-2011-The impact of language models and loss functions on repair disfluency detection

14 0.34302893 171 acl-2011-Incremental Syntactic Language Models for Phrase-based Translation

15 0.34195966 17 acl-2011-A Large Scale Distributed Syntactic, Semantic and Lexical Language Model for Machine Translation

16 0.33935508 175 acl-2011-Integrating history-length interpolation and classes in language modeling

17 0.33590862 143 acl-2011-Getting the Most out of Transition-based Dependency Parsing

18 0.33457887 303 acl-2011-Tier-based Strictly Local Constraints for Phonology

19 0.32742646 35 acl-2011-An ERP-based Brain-Computer Interface for text entry using Rapid Serial Visual Presentation and Language Modeling

20 0.32299918 184 acl-2011-Joint Hebrew Segmentation and Parsing using a PCFGLA Lattice Parser


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.02), (17, 0.038), (26, 0.027), (28, 0.011), (37, 0.097), (39, 0.046), (41, 0.056), (54, 0.301), (55, 0.042), (59, 0.033), (72, 0.027), (91, 0.025), (96, 0.167), (97, 0.013), (98, 0.012)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.81407219 210 acl-2011-Lexicographic Semirings for Exact Automata Encoding of Sequence Models

Author: Brian Roark ; Richard Sproat ; Izhak Shafran

Abstract: In this paper we introduce a novel use of the lexicographic semiring and motivate its use for speech and language processing tasks. We prove that the semiring allows for exact encoding of backoff models with epsilon transitions. This allows for off-line optimization of exact models represented as large weighted finite-state transducers in contrast to implicit (on-line) failure transition representations. We present preliminary empirical results demonstrating that, even in simple intersection scenarios amenable to the use of failure transitions, the use of the more powerful lexicographic semiring is competitive in terms of time of intersection. 1 Introduction and Motivation Representing smoothed n-gram language models as weighted finite-state transducers (WFST) is most naturally done with a failure transition, which reflects the semantics of the “otherwise” formulation of smoothing (Allauzen et al., 2003). For example, the typical backoff formulation of the probability of a word w given a history h is as follows P(w | h) = ?αPh(Pw( |w h )| h0) oift hc(ehrwwis)e > 0 (1) where P is an empirical estimate of the probability that reserves small finite probability for unseen n-grams; αh is a backoff weight that ensures normalization; and h0 is a backoff history typically achieved by excising the earliest word in the history h. The principle benefit of encoding the WFST in this way is that it only requires explicitly storing n-gram transitions for observed n-grams, i.e., count greater than zero, as opposed to all possible n-grams of the given order which would be infeasible in for example large vocabulary speech recognition. This is a massive space savings, and such an approach is also used for non-probabilistic stochastic language 1 models, such as those trained with the perceptron algorithm (Roark et al., 2007), as the means to access all and exactly those features that should fire for a particular sequence in a deterministic automaton. Similar issues hold for other finite-state se- quence processing problems, e.g., tagging, bracketing or segmenting. Failure transitions, however, are an implicit method for representing a much larger explicit automaton in the case of n-gram models, all possible n-grams for that order. During composition with the model, the failure transition must be interpreted on the fly, keeping track of those symbols that have already been found leaving the original state, and only allowing failure transition traversal for symbols that have not been found (the semantics of “otherwise”). This compact implicit representation cannot generally be preserved when composing with other models, e.g., when combining a language model with a pronunciation lexicon as in widelyused FST approaches to speech recognition (Mohri et al., 2002). Moving from implicit to explicit representation when performing such a composition leads to an explosion in the size of the resulting transducer, frequently making the approach intractable. In practice, an off-line approximation to the model is made, typically by treating the failure transitions as epsilon transitions (Mohri et al., 2002; Allauzen et al., 2003), allowing large transducers to be composed and optimized off-line. These complex approximate transducers are then used during first-pass – decoding, and the resulting pruned search graphs (e.g., word lattices) can be rescored with exact language models encoded with failure transitions. 
Similar problems arise when building, say, POStaggers as WFST: not every pos-tag sequence will have been observed during training, hence failure transitions will achieve great savings in the size of models. Yet discriminative models may include complex features that combine both input stream (word) and output stream (tag) sequences in a single feature, yielding complicated transducer topologies for which effective use of failure transitions may not Proceedings Pofo trhtlea 4nd9,th O Arnegnouna,l J Muneeet 1in9g-2 o4f, t 2h0e1 A1s.s ?oc ci2a0t1io1n A fosrso Ccioamtiopnut faotrio Cnoaml Lpiuntgauti osntiacls: Lsihnogrutpisatipcesrs, pages 1–5, be possible. An exact encoding using other mechanisms is required in such cases to allow for off-line representation and optimization. In this paper, we introduce a novel use of a semiring the lexicographic semiring (Golan, 1999) which permits an exact encoding of these sorts of models with the same compact topology as with failure transitions, but using epsilon transitions. Unlike the standard epsilon approximation, this semiring allows for an exact representation, while also allowing (unlike failure transition approaches) for off-line – – composition with other transducers, with all the optimizations that such representations provide. In the next section, we introduce the semiring, followed by a proof that its use yields exact representations. We then conclude with a brief evaluation of the cost of intersection relative to failure transitions in comparable situations. 2 The Lexicographic Semiring Weighted automata are automata in which the transitions carry weight elements of a semiring (Kuich and Salomaa, 1986). A semiring is a ring that may lack negation, with two associative operations ⊕ and ⊗lac akn nde tghaetiiro nre,ws piethcti twveo i dasesnotictiya ievleem oepnertas t 0io annsd ⊕ ⊕ 1. a nAd ⊗com anmdo tnh esierm reirsipnegc tiivn es pideeenchti ayn edl elmanegnutasg 0e processing, and one that we will be using in this paper, is the tropical semiring (R ∪ {∞}, min, +, ∞, 0), i.e., tmhein t riosp tihcea l⊕ s omfi trhineg gs(e mRi∪rin{g∞ (w},imthi nid,e+nt,i∞ty ,∞0)), ia.end., m+ ins tihse t ⊗e o⊕f othfe t hseem seirminigri n(wg i(wth tidhe indteitnyt t0y). ∞Th)i asn ids a+pp isro thpreia ⊗te o ofof rth h pee srfeomrmiri nngg (Vwitietrhb iid seenatricthy u0s).in Tgh negative log probabilities – we add negative logs along a path and take the min between paths. A hW1 , W2 . . . Wni-lexicographic weight is a tupleA o hfW weights wherei- eeaxichco gorfa pthhiec w weeiigghhtt cisla ass teusW1, W2 . . . Wn, must observe the path property (Mohri, 2002). The path property of a semiring K is defined in terms of the natural order on K such that: a <2 ws e3 w&o4;)r The term “lexicographic” is an apt term for this semiring since the comparison for ⊕ is like the lexicseomgriaripnhgic s icnocmep thareis coonm opfa srtisrionngs f,o rco ⊕m ipsa lrikineg t thhee l feixrist- elements, then the second, and so forth. 3 Language model encoding 3.1 Standard encoding For language model encoding, we will differentiate between two classes of transitions: backoff arcs (labeled with a φ for failure, or with ? using our new semiring); and n-gram arcs (everything else, labeled with the word whose probability is assigned). Each state in the automaton represents an n-gram history string h and each n-gram arc is weighted with the (negative log) conditional probability of the word w labeling the arc given the history h. 
For a given history h and n-gram arc labeled with a word w, the destination of the arc is the state associated with the longest suffix of the string hw that is a history in the model. This will depend on the Markov order of the n-gram model. For example, consider the trigram model schematic shown in Figure 1, in which only history sequences of length 2 are kept in the model. Thus, from history hi = wi−2wi−1, the word wi transitions to hi+1 = wi−1wi, w2hii−ch1 is the longest suffix of hiwi in the modie−l1. As detailed in the “otherwise” semantics of equation 1, backoff arcs transition from state h to a state h0, typically the suffix of h of length |h| − 1, with we,i tgyhpti c(a−lllyog th αeh s)u. Wixe o cfa hll othf ele ndgestthin |hat|io −n 1s,ta wtei ah bwaecikgohtff s−taltoe.g αThis recursive backoff topology terminates at the unigram state, i.e., h = ?, no history. Backoff states of order k may be traversed either via φ-arcs from the higher order n-gram of order k + 1or via an n-gram arc from a lower order n-gram of order k −1. This means that no n-gram arc can enter tohred ezre rko−eth1. .o Trhdiesr mstaeaten s(fi tnhaalt bnaoc nk-ogfrfa),m ma andrc f cualln-o enrdteerr states history strings of length n − 1 for a model sotfa toersde —r n h may ihnagvse o n-gram a nrc −s e 1nt feorri nag m forodeml other full-order states as well as from backoff states of history size n − 2. — s—to 3.2 Encoding with lexicographic semiring For an LM machine M on the tropical semiring with failure transitions, which is deterministic and has the wih-2 =i1wφ/-logPwα(hi-1|whiφ)/-logwPhiα(+-1w i=|-1wiφ)/-logPαw(hi+)1 Figure 1: Deterministic finite-state representation of n-gram models with negative log probabilities (tropical semiring). The symbol φ labels backoff transitions. Modified from Roark and Sproat (2007), Figure 6.1. path property, we can simulate φ-arcs in a standard LM topology by a topologically equivalent machine M0 on the lexicographic hT, Ti semiring, where φ has boenen th hreep l eaxciceod gwraitphh eicps hilTo,nT, ais sfeomlloirwinsg. ,F worh every n-gram arc with label w and weight c, source state si and destination state sj, construct an n-gram arc with label w, weight h0, ci, source state si0, and deswtiniathtio lanb estla wte, s0j. gThhte h e0x,citi c, soosut rocfe e satcahte s state is constructed as follows. If the state is non-final, h∞, ∞i . sOttruhectrewdis aes fifo litl ofiwnsa.l Iwf tihthe e sxtiatt ec iosst n co nit- fwinilall ,b he∞ ∞h0,,∞ ∞cii . hLeertw n sbee tfh iet length oithf th exei longest history string iin. the model. For every φ-arc with (backoff) weight c, source state si, and destination state sj representing a history of length k, construct an ?-arc with source state si0, destination state s0j, and weight hΦ⊗(n−k) , ci, where Φ > 0 and Φ⊗(n−k) takes Φ to the (n − k)th power with the ⊗ operation. In the tropical semiring, ⊗ ris w +, so Φe⊗ ⊗(n o−pke) = (n − k)Φ. tFroorp iecxaalm sepmlei,r i nng a, t⊗rigi sra +m, msoo Φdel, if we= =ar (en b −ac kki)nΦg. off from a bigram state h (history length = 1) to a unigram state, n − k = 2 − 0 = 2, so we set the buanicgkroafmf w steaigteh,t nto −h2 kΦ, = =− l2og − α 0h) = =for 2 ,s soome w Φe s>et 0 th. cInk ofrfd were gtoh tco tom hb2iΦn,e −thleo gmαodel with another automaton or transducer, we would need to also convert those models to the hT, Ti semiring. 
For these aveutrotm thaotsae, mwoed seilmsp toly t uese hT a, Tdeif saeumlt rtrinagn.sf Foromra thtieosen such that every transition with weight c is assigned weight h0, ci . For example, given a word lattice wL,e iwghe tco h0n,vceir.t the lattice to L0 in the lexicographic semiring using this default transformation, and then perform the intersection L0 ∩ M0. By removing epsilon transitions and determ∩in Mizing the result, the low cost path for any given string will be retained in the result, which will correspond to the path achieved with Finally we project the second dimension of the hT, Ti weights to produce a lattice dini mtheen strioonpi ocfal t seem hTir,iTngi, wweihgichhts i tso e pqruoidvuacleen at ltaot tichee 3 result of L ∩ M, i.e., φ-arcs. C2(det(eps-rem(L0 ∩ M0))) = L ∩ M where C2 denotes projecting the second-dimension wofh tehree ChT, Ti weights, det(·) denotes determinizatoifon t,h aen hdT e,pTsi-r wemei(g·h) sde,n doette(s· )?- dreenmootveasl. d 4 Proof We wish to prove that for any machine N, ShortestPath(M0 ∩ N0) passes through the equivalent states in M0 to∩ t Nhose passed through in M for ShortestPath(M ∩ N) . Therefore determinization Sofh othrtee rsetsPualttihn(gM Mint ∩er Nse)c.ti Tonh rafefteorr e?- dreemteromvianl yzaiteilodns the same topology as intersection with the equivalent φ machine. Intuitively, since the first dimension of the hT, Ti weights is 0 for n-gram arcs and > 0 foofr t h beac hkTo,ffT arcs, tghhet ss ihsor 0te fostr p na-tghr awmil la rtcrasv aenrdse > >the 0 fewest possible backoff arcs; further, since higherorder backoff arcs cost less in the first dimension of the hT, Ti weights in M0, the shortest path will intchleud heT n-gram iagrhcst sa ti nth Meir earliest possible point. We prove this by induction on the state-sequence of the path p/p0 up to a given state si/si0 in the respective machines M/M0. Base case: If p/p0 is of length 0, and therefore the states si/si0 are the initial states of the respective machines, the proposition clearly holds. Inductive step: Now suppose that p/p0 visits s0...si/s00...si0 and we have therefore reached si/si0 in the respective machines. Suppose the cumulated weights of p/p0 are W and hΨ, Wi, respectively. We wish to show thaarte w Whic anhedv heΨr sj isi n reexspt evcitsiivteedly o. nW p (i.e., the path becomes s0...sisj) the equivalent state s0 is visited on p0 (i.e., the path becomes s00...si0s0j). Let w be the next symbol to be matched leaving states si and si0. There are four cases to consider: (1) there is an n-gram arc leaving states si and si0 labeled with w, but no backoff arc leaving the state; (2) there is no n-gram arc labeled with w leaving the states, but there is a backoff arc; (3) there is no ngram arc labeled with w and no backoff arc leaving the states; and (4) there is both an n-gram arc labeled with w and a backoff arc leaving the states. In cases (1) and (2), there is only one possible transition to take in either M or M0, and based on the algorithm for construction of M0 given in Section 3.2, these transitions will point to sj and s0j respectively. Case (3) leads to failure of intersection with either machine. This leaves case (4) to consider. In M, since there is a transition leaving state si labeled with w, the backoff arc, which is a failure transition, cannot be traversed, hence the destination of the n-gram arc sj will be the next state in p. However, in M0, both the n-gram transition labeled with w and the backoff transition, now labeled with ?, can be traversed. 
What we will now prove is that the shortest path through M0 cannot include taking the backoff arc in this case. In order to emit w by taking the backoff arc out of state si0, one or more backoff (?) transitions must be taken, followed by an n-gram arc labeled with w. Let k be the order of the history represented by state si0, hence the cost of the first backoff arc is h(n − k)Φ, −log(αsi0 )i in our semiring. If we tirsa vhe(rns e− km) Φ b,a−ckloofgf( αarcs) ip irnior o tro eemmiitrtiinngg. the w, the first dimension of our accumulated cost will be m(n −k + m−21)Φ, based on our algorithm for consmtr(unct−ionk +of M0 given in Section 3.2. Let sl0 be the destination state after traversing m backoff arcs followed by an n-gram arc labeled with w. Note that, by definition, m ≤ k, and k − m + 1 is the orbdeyr oeffi nstitaitoen ,sl0 m. B≤ as ked, onnd t khe − c mons +tru 1ct iiosn t ealg oor-rithm, the state sl0 is also reachable by first emitting w from state si0 to reach state s0j followed by some number of backoff transitions. The order of state s0j is either k (if k is the highest order in the model) or k + 1 (by extending the history of state si0 by one word). If it is of order k, then it will require m −1 backoff arcs to reach state sl0, one fewer tqhuainre t mhe− −pa1t hb ctok osftfat aer ss0l oth raeta cbheg sitanste w sith a backoff arc, for a total cost of (m − 1) (n − k + m−21)Φ which is less than m(n − k + m−21)Φ. If state s0j icish o ifs o lerdsser hka n+ m1,( th −er ke +will be m backoff arcs to reach state sl0, but with a total cost of m(n − (k + 1) + m−21)Φ m(n − k + m−23)Φ = which is also less than m(n − km + m−21)Φ. Hence twheh cstha ties asl0ls coa lne asl twhaayns mbe( n re −ac khe +d from si0 with a lower cost through state s0j than by first taking the backoff arc from si0. Therefore the shortest path on M0 must follow s00...si0s0j. 2 This completes the proof. 5 Experimental Comparison of ?, φ and hT, Ti encoded language models For our experiments we used lattices derived from a very large vocabulary continuous speech recognition system, which was built for the 2007 GALE Arabic speech recognition task, and used in the work reported in Lehr and Shafran (201 1). The lexicographic semiring was evaluated on the development 4 set (2.6 hours of broadcast news and conversations; 18K words). The 888 word lattices for the development set were generated using a competitive baseline system with acoustic models trained on about 1000 hrs of Arabic broadcast data and a 4-gram language model. The language model consisting of 122M n-grams was estimated by interpolation of 14 components. The vocabulary is relatively large at 737K and the associated dictionary has only single pronunciations. The language model was converted to the automaton topology described earlier, and represented in three ways: first as an approximation of a failure machine using epsilons instead of failure arcs; second as a correct failure machine; and third using the lexicographic construction derived in this paper. The three versions of the LM were evaluated by intersecting them with the 888 lattices of the development set. The overall error rate for the systems was 24.8%—comparable to the state-of-theart on this task1 . For the shortest paths, the failure and lexicographic machines always produced identical lattices (as determined by FST equivalence); in contrast, 81% of the shortest paths from the epsilon approximation are different, at least in terms of weights, from the shortest paths using the failure LM. 
For full lattices, 42 (4.7%) of the lexicographic outputs differ from the failure LM outputs, due to small floating point rounding issues; 863 (97%) of the epsilon approximation outputs differ. In terms of size, the failure LM, with 5.7 million arcs requires 97 Mb. The equivalent hT, Tillieoxnico argcrasp rheqicu iLreMs r 9e7qu Mireb.s 1 T20h eM ebq,u idvuael eton tth heT ,dToui-bling of the size of the weights.2 To measure speed, we performed the intersections 1000 times for each of our 888 lattices on a 2993 MHz Intel?R Xeon?R CPU, and took the mean times for each of our methods. The 888 lattices were processed with a mean of 1.62 seconds in total (1.8 msec per lattice) using the failure LM; using the hT, Ti-lexicographic iLnMg t rheequ fairieldur 1e.8 L Msec;o unsdinsg g(2 t.h0e m hTse,cT per lxaitctiocger)a, pahnidc is thus about 11% slower. Epsilon approximation, where the failure arcs are approximated with epsilon arcs took 1.17 seconds (1.3 msec per lattice). The 1The error rate is a couple of points higher than in Lehr and Shafran (2011) since we discarded non-lexical words, which are absent in maximum likelihood estimated language model and are typically augmented to the unigram backoff state with an arbitrary cost, fine-tuned to optimize performance for a given task. 2If size became an issue, the first dimension of the hT, TiweigIhft scizane bbee c raemprees aennt iesdsu eby, tah esi fnigrlste bdyimtee. slightly slower speeds for the exact method using the failure LM, and hT, Ti can be related to the overhfaeialdur eof L cMom, apnudtin hgT ,tThei f caailnur bee f urenlcattieodn aot rhuen toivmeer,and determinization, respectively. 6 Conclusion In this paper we have introduced a novel application of the lexicographic semiring, proved that it can be used to provide an exact encoding of language model topologies with failure arcs, and provided experimental results that demonstrate its efficiency. Since the hT, Ti-lexicographic semiring is both lefSt-i nacned hr iegh htT-d,iTstrii-bleuxtiicvoe,g roatphheirc o spetmimiriiznagtions such as minimization are possible. The particular hT, Ti-lexicographic semiring we have used thiceruel airs h bTu,t Toni-el oxifc many h piocss siebmlei ilnexgic woegr haapvheic u esendcodings. We are currently exploring the use of a lexicographic semiring that involves different semirings in the various dimensions, for the integration of part-of-speech taggers into language models. An implementation of the lexicographic semiring by the second author is already available as part of the OpenFst package (Allauzen et al., 2007). The methods described here are part of the NGram language-model-training toolkit, soon to be released at opengrm .org. Acknowledgments This research was supported in part by NSF Grant #IIS-081 1745 and DARPA grant #HR001 1-09-10041. Any opinions, findings, conclusions or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views of the NSF or DARPA. We thank Maider Lehr for help in preparing the test data. We also thank the ACL reviewers for valuable comments. References Cyril Allauzen, Mehryar Mohri, and Brian Roark. 2003. Generalized algorithms for constructing statistical language models. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 40–47. Cyril Allauzen, Michael Riley, Johan Schalkwyk, Wojciech Skut, and Mehryar Mohri. 2007. OpenFst: A general and efficient weighted finite-state transducer library. 
6 Conclusion

In this paper we have introduced a novel application of the lexicographic semiring, proved that it can be used to provide an exact encoding of language model topologies with failure arcs, and provided experimental results that demonstrate its efficiency. Since the ⟨T, T⟩-lexicographic semiring is both left- and right-distributive, other optimizations such as minimization are possible. The particular ⟨T, T⟩-lexicographic semiring we have used here is but one of many possible lexicographic encodings. We are currently exploring the use of a lexicographic semiring that involves different semirings in the various dimensions, for the integration of part-of-speech taggers into language models. An implementation of the lexicographic semiring by the second author is already available as part of the OpenFst package (Allauzen et al., 2007). The methods described here are part of the NGram language-model-training toolkit, soon to be released at opengrm.org.

Acknowledgments

This research was supported in part by NSF Grant #IIS-0811745 and DARPA grant #HR0011-09-1-0041. Any opinions, findings, conclusions or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views of the NSF or DARPA. We thank Maider Lehr for help in preparing the test data. We also thank the ACL reviewers for valuable comments.

References

Cyril Allauzen, Mehryar Mohri, and Brian Roark. 2003. Generalized algorithms for constructing statistical language models. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 40–47.

Cyril Allauzen, Michael Riley, Johan Schalkwyk, Wojciech Skut, and Mehryar Mohri. 2007. OpenFst: A general and efficient weighted finite-state transducer library. In Proceedings of the Twelfth International Conference on Implementation and Application of Automata (CIAA 2007), volume 4793 of Lecture Notes in Computer Science, pages 11–23, Prague, Czech Republic. Springer.

Jonathan Golan. 1999. Semirings and their Applications. Kluwer Academic Publishers, Dordrecht.

Werner Kuich and Arto Salomaa. 1986. Semirings, Automata, Languages. Number 5 in EATCS Monographs on Theoretical Computer Science. Springer-Verlag, Berlin, Germany.

Maider Lehr and Izhak Shafran. 2011. Learning a discriminative weighted finite-state transducer for speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, July.

Mehryar Mohri, Fernando C. N. Pereira, and Michael Riley. 2002. Weighted finite-state transducers in speech recognition. Computer Speech and Language, 16(1):69–88.

Mehryar Mohri. 2002. Semiring framework and algorithms for shortest-distance problems. Journal of Automata, Languages and Combinatorics, 7(3):321–350.

Brian Roark and Richard Sproat. 2007. Computational Approaches to Morphology and Syntax. Oxford University Press, Oxford.

Brian Roark, Murat Saraclar, and Michael Collins. 2007. Discriminative n-gram language modeling. Computer Speech and Language, 21(2):373–392.
