acl acl2010 acl2010-116 knowledge-graph by maker-knowledge-mining

116 acl-2010-Finding Cognate Groups Using Phylogenies


Source: pdf

Author: David Hall ; Dan Klein

Abstract: A central problem in historical linguistics is the identification of historically related cognate words. We present a generative phylogenetic model for automatically inducing cognate group structure from unaligned word lists. Our model represents the process of transformation and transmission from ancestor word to daughter word, as well as the alignment between the words lists of the observed languages. We also present a novel method for simplifying complex weighted automata created during inference to counteract the otherwise exponential growth of message sizes. On the task of identifying cognates in a dataset of Romance words, our model significantly outperforms a baseline ap- proach, increasing accuracy by as much as 80%. Finally, we demonstrate that our automatically induced groups can be used to successfully reconstruct ancestral words.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 A central problem in historical linguistics is the identification of historically related cognate words. [sent-3, score-0.802]

2 We present a generative phylogenetic model for automatically inducing cognate group structure from unaligned word lists. [sent-4, score-0.875]

3 Our model represents the process of transformation and transmission from ancestor word to daughter word, as well as the alignment between the words lists of the observed languages. [sent-5, score-0.168]

4 We also present a novel method for simplifying complex weighted automata created during inference to counteract the otherwise exponential growth of message sizes. [sent-6, score-0.211]

5 Finally, we demonstrate that our automatically induced groups can be used to successfully reconstruct ancestral words. [sent-8, score-0.217]

6 1 Introduction A crowning achievement of historical linguistics is the comparative method (Ohala, 1993), wherein linguists use word similarity to elucidate the hidden phonological and morphological processes which govern historical descent. [sent-9, score-0.169]

7 The comparative method requires reasoning about three important hidden variables: the overall phylogenetic guide tree among languages, the evolutionary parameters of the ambient changes at each branch, and the cognate group structure that specifies which words share common ancestors. [sent-10, score-0.993]

8 However, linguists are currently required to make qualitative judgments regarding the relative likelihood of certain sound changes, cognate groups, and so on. [sent-12, score-0.827]

9 These automated methods, while providing robustness and scale in the induction of ancestral word forms and evolutionary parameters, assume that cognate groups are already known. [sent-17, score-1.021]

10 In this work, we address this limitation, presenting a model in which cognate groups can be discovered automatically. [sent-18, score-0.873]

11 Finding cognate groups is not an easy task, because underlying morphological and phonological changes can obscure relationships between words, especially for distant cognates, where simple string overlap is an inadequate measure of similarity. [sent-19, score-0.905]

12 Some authors have attempted to automatically detect cognate words (Mann and Yarowsky, 2001; Lowe and Mazaudon, 1994; Oakes, 2000; Kondrak, 2001; Mulloni, 2007), but these methods … [sent-27, score-0.753]

13 To fully automate the comparative method, it is necessary to consider multiple languages, and to do so in a model which couples cognate detection with similarity learning. [sent-30, score-0.792]

14 In this paper, we present a new generative model for the automatic induction of cognate groups given only (1) a known family tree of languages and (2) word lists from those languages. [sent-31, score-0.992]

15 A prior on word survival generates a number of cognate groups and decides which groups are attested in each modern language. [sent-32, score-1.102]

16 Finally, an alignment model maps the flat word lists to cognate groups. [sent-34, score-0.847]

17 Inference requires a combination of message-passing in the evolutionary model and iterative bipartite graph matching in the alignment model. [sent-35, score-0.185]

18 Here, we present a new method for automatically compressing our message automata in a way that can take into account prior information about the expected outcome of inference. [sent-39, score-0.176]

19 In this paper, we focus on a transcribed word list of 583 cognate sets from three Romance languages (Portuguese, Italian and Spanish), as well as their common ancestor Latin (Bouchard-Côté et al. [sent-40, score-0.842]

20 We consider both the case where we know that all cognate groups have a surface form in all languages, and where we do not know that. [sent-42, score-0.873]

21 2 Model In this section, we describe a new generative model for vocabulary lists in multiple related languages given the phylogenetic relationship between the languages (their family tree). [sent-48, score-0.202]

22 Survival dictates, for each cognate group, which languages have words in that group. [sent-50, score-0.8]

23 1 Survival First, we choose a number G of ancestral cognate groups from a geometric distribution. [sent-55, score-1.003]

24 For each cognate group g, our generative process walks down the tree. [sent-56, score-0.839]

25 This process is modeled in a “death tree” with a Bernoulli random variable S_ℓg for each language ℓ and cognate group g specifying whether or not the word died before reaching that language. [sent-58, score-0.806]

26 This process captures the intuition that cognate words are more likely to be found clustered in sibling languages than scattered across unrelated languages. [sent-60, score-0.8]
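The survival prior described above can be sketched as a small sampler: a geometric number of cognate groups, then a Bernoulli "death" coin flipped on each branch of the phylogeny, with death inherited by the whole subtree. The tree encoding, the parameter names `p_death` and `p_stop`, and the function name are all illustrative assumptions, not the authors' code.

```python
import random

def sample_survival(tree, p_death=0.2, p_stop=0.01, rng=random):
    """Sketch of the survival prior: sample a number of cognate groups G
    from a geometric distribution, then for each group walk down the
    phylogeny, flipping a Bernoulli death coin S_lg on every branch.
    Once a word dies on a branch, it is absent from that whole subtree."""
    # tree: nested dict {language_name: subtree_dict}; leaves map to {}
    G = 1
    while rng.random() > p_stop:          # geometric number of groups
        G += 1
    attested = [dict() for _ in range(G)]
    def walk(node, alive, g):
        for lang, subtree in node.items():
            survives = alive and (rng.random() > p_death)  # Bernoulli S_lg
            attested[g][lang] = survives
            walk(subtree, survives, g)
    for g in range(G):
        walk(tree, True, g)
    return attested

# Toy phylogeny: Latin with three Romance daughters.
romance = {"la": {"es": {}, "it": {}, "pt": {}}}
groups = sample_survival(romance)
```

Because death propagates down the tree, attested words cluster in sibling languages, which is exactly the intuition the sentence above describes.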

27 In each cognate group, each word W_ℓ is generated from its parent according to a conditional distribution with parameter ϕ_ℓ, which is specific to that edge in the tree, but shared between all cognate groups. [sent-65, score-1.589]

28 In this paper, each ϕ_ℓ takes the form of a parameterized edit distance similar to the standard Levenshtein distance. [sent-66, score-0.205]

29 The edit transducers are represented schematically in Figure 1(b). [sent-69, score-0.271]

30 …we have data for Latin, we treat… Figure 1: (a) The process by which cognate words are generated. [sent-74, score-0.753]

31 Here, we show the derivation of Romance language words W_ℓ from their respective Latin ancestor, parameterized by transformations ϕ_ℓ and survival variables S_ℓ. [sent-75, score-0.181]

32 (b) The class of parameterized edit distances used in this paper. [sent-78, score-0.232]

33 Each pair of phonemes has a weight σ for substitution, and each phoneme has weights η and δ for insertion and deletion respectively. [sent-79, score-0.24]

34 (c) A possible alignment produced by an edit distance between the Latin word focus (“hearth”) and the Italian word fuoco (“fire”). [sent-80, score-0.223]
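A parameterized edit distance of this class can be sketched as a best-alignment dynamic program in log space. The function and parameter names below are illustrative; the fixed weights (0, -3, -4) are the match/substitution/indel settings quoted in Experiment 1, and the focus/fuoco pair is the example from the figure caption.

```python
def edit_score(wa, wd, sigma, eta, delta):
    """Viterbi (best-alignment) log score of deriving wd from wa, with
    per-pair substitution weights sigma(x, y), per-phoneme insertion
    weights eta(y), and per-phoneme deletion weights delta(x)."""
    n, m = len(wa), len(wd)
    NEG = float("-inf")
    chart = [[NEG] * (m + 1) for _ in range(n + 1)]
    chart[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if chart[i][j] == NEG:
                continue
            if i < n and j < m:   # substitute wa[i] -> wd[j]
                chart[i + 1][j + 1] = max(chart[i + 1][j + 1],
                                          chart[i][j] + sigma(wa[i], wd[j]))
            if j < m:             # insert wd[j]
                chart[i][j + 1] = max(chart[i][j + 1],
                                      chart[i][j] + eta(wd[j]))
            if i < n:             # delete wa[i]
                chart[i + 1][j] = max(chart[i + 1][j],
                                      chart[i][j] + delta(wa[i]))
    return chart[n][m]

# Fixed parameters from Experiment 1: match 0, substitution -3, indel -4.
sigma = lambda x, y: 0.0 if x == y else -3.0
score = edit_score("focus", "fuoco", sigma, lambda y: -4.0, lambda x: -4.0)
```

The best alignment for focus/fuoco under these weights uses one insertion, one deletion, and one substitution (score -11), rather than four positional substitutions (score -12).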

35 Here, we make the simplifying assumption that there is at most one word per language per cognate group. [sent-87, score-0.753]

36 Because the assignments of words to cognates are unknown, we specify an unknown alignment parameter π_ℓ for each modern language, which is an alignment of cognate groups to entries in the word list. [sent-88, score-0.356]

37 In the case that every cognate group has a word in each language, each π_ℓ is a permutation. [sent-89, score-0.806]

38 In the more general case that some cognate groups do not have words from all languages, this mapping is injective from words to cognate groups. [sent-90, score-1.626]

39 3 Inference of Cognate Assignments In this section, we discuss the inference method for determining cognate assignments under fixed parameters ϕ. [sent-94, score-0.856]

40 We are given a set of languages and a list of words in each language, and our objective is to determine which words are cognate with each other. [sent-95, score-0.8]

41 π* = argmax_π Σ_g log p(w_(ℓ, π_ℓ(g)) | ϕ, π, w_−ℓ), where w_(ℓ, π_ℓ(g)) is the word in language ℓ that π_ℓ has assigned to cognate group g. [sent-97, score-0.806]

42 Computing these messages gives a posterior marginal distribution µ_ℓ(w_ℓ) = p(w_ℓ | w_−ℓ, ϕ, π_−ℓ), which is precisely the affinity score we need for the bipartite matching. [sent-103, score-0.216]
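Given those affinity scores, the matching step finds the assignment of words to cognate groups that maximizes the total score. As a minimal sketch (with a hypothetical function name), brute-force search over permutations illustrates the objective; a real implementation would use the Hungarian algorithm instead of exhaustive enumeration.

```python
from itertools import permutations

def best_assignment(affinity):
    """Maximum-weight bipartite matching by brute force: affinity[g][w]
    is the (log) affinity of assigning word w to cognate group g.
    Returns the permutation pi maximizing the total score. Exhaustive
    search is only feasible for toy inputs; the Hungarian algorithm
    solves the same problem in polynomial time."""
    n = len(affinity)
    best, best_pi = float("-inf"), None
    for pi in permutations(range(n)):
        total = sum(affinity[g][pi[g]] for g in range(n))
        if total > best:
            best, best_pi = total, pi
    return list(best_pi)

# Three groups, three words: high affinity on the diagonal.
pi = best_assignment([[0, -5, -5], [-5, 0, -5], [-5, -5, 0]])
```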

43 In practice, it is important to estimate better parametric edit distances ϕ_ℓ and survival variables S_ℓ. [sent-109, score-0.368]

44 However, a naively constructed edit distance, which for example might penalize vowel substitutions lightly, would fail to learn that Latin words that are borrowed into English would not undergo the sound change /I/→/eI/. [sent-112, score-0.175]

45 …vowels turning into other vowels is more common than vowels turning into consonants), but which changes are appropriate for a given language. [sent-116, score-0.149]

46 At a high level, our learning algorithm is much like Expectation Maximization with hard assignments: after we update the alignment variables π and thus form new potential cognate sets, we re-estimate our model’s parameters to maximize the likelihood of those assignments. [sent-117, score-0.872]
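The hard-EM alternation just described can be sketched as a generic loop; `e_step` and `m_step` stand in for the bipartite matching and the transducer re-estimation respectively, and all names here are illustrative, not from the paper.

```python
def hard_em(words, init_params, e_step, m_step, iters=10):
    """Hard-EM sketch: alternate (1) re-matching words to cognate groups
    given the current parameters (e_step, i.e. bipartite matching) and
    (2) re-estimating parameters from the resulting groups (m_step,
    i.e. maximum-likelihood transducer estimation). Stops when the
    assignment no longer changes."""
    params = init_params
    assignment = None
    for _ in range(iters):
        new_assignment = e_step(words, params)   # matching step
        if new_assignment == assignment:         # converged
            break
        assignment = new_assignment
        params = m_step(words, assignment)       # parameter update
    return assignment, params
```

With hard assignments, each iteration weakly increases the joint likelihood, so the loop terminates once the matching stabilizes.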

47 For the transducers ϕ, we learn parameterized edit distances that model the probabilities of different sound changes. [sent-122, score-0.412]

48 These edit distances define a conditional … (Footnote 2: We note two further difficulties: our model does not handle “borrowings,” which would be necessary to capture a significant portion of English vocabulary; nor can it seamlessly handle words that are inherited later in the evolution of language than others.) [sent-124, score-0.252]

49 That is, for any fixed w_a: Σ_{w_d} p(w_d | w_a, σ) = Σ_{w_d} Σ_{z ∈ align(w_a, w_d)} score(z; σ) = Σ_{w_d} Σ_{z ∈ align(w_a, w_d)} Π_{(x,y) ∈ z} σ(x, y) = 1, where align(w_a, w_d) is the set of possible alignments between the phonemes in words w_a and w_d. [sent-131, score-0.295]

50 We are seeking the maximum likelihood estimate of each ϕ, given fixed alignments π: ϕ̂_ℓ = argmax_{ϕ_ℓ} p(w | ϕ, π). To find this maximizer for any given π_ℓ, we need to find a marginal distribution over the edges connecting any two languages a and d. [sent-132, score-0.152]

51 That is, for each pair of phonemes x and y (or the empty phoneme ε), we need to find the quantity E_{p(w_a, w_d)}[#(x, y; z)] = Σ_{w_a, w_d} Σ_{z ∈ align(w_a, w_d)} #(x, y; z) p(z | w_a, w_d) p(w_a, w_d), where #(x, y; z) denotes the number of times the pair of phonemes (x, y) is aligned in alignment z. [sent-134, score-0.292]

52 Given the expected counts, we now need to normalize them to ensure that the transducer represents a conditional probability distribution (Eisner, 2002; Oncina and Sebban, 2006). [sent-136, score-0.147]
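The normalization step can be sketched directly: divide each expected count c(x, y) by the total mass leaving input phoneme x, so the outgoing weights form a conditional distribution p(y | x). This is a simplified sketch; smoothing and the full ε-handling of the cited normalization procedures are elided, and the function name is hypothetical.

```python
from collections import defaultdict

def normalize_counts(counts):
    """Turn expected aligned-pair counts c(x, y) into conditional
    probabilities p(y | x) by renormalizing per input phoneme x.
    counts: dict mapping (x, y) pairs to nonnegative expected counts,
    where y may be the empty phoneme ''. Returns {x: {y: p(y|x)}}."""
    totals = defaultdict(float)
    for (x, y), c in counts.items():
        totals[x] += c
    probs = defaultdict(dict)
    for (x, y), c in counts.items():
        probs[x][y] = c / totals[x]
    return probs
```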

53 5 Transducers and Automata In our model, it is not just the edit distances that are finite state machines. [sent-141, score-0.195]

54 Each edge in the graph has a real-valued weight and a label, which is a single phoneme in some alphabet Σ or the empty phoneme ε (resp. [sent-150, score-0.206]

55 T1 has input alphabet Σ and output alphabet ∆, while T2 has input alphabet ∆ and output alphabet Ω. [sent-160, score-0.208]

56 For two messages µ1(w) and µ2(w), the same algorithm can be used to find the product µ(w) = µ1(w)µ2(w). The third operation is transducer minimization. [sent-165, score-0.168]

57 Repeated compositions compound the problem: iterated composition of k transducers produces O(n^k) states. [sent-167, score-0.172]
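The state blowup comes from the product construction: the result's states are pairs of input states, so composing k machines of n states each can yield O(n^k) states. A minimal sketch of weighted-automaton intersection over a toy deterministic encoding (not a general WFSA library; the tuple encoding is an assumption made for illustration):

```python
def intersect(a1, a2):
    """Intersect two toy weighted automata. Each automaton is a tuple
    (start, arcs, final) with arcs[state][symbol] = (next_state, weight)
    and final[state] = final_weight. The result's states are pairs of
    input states and arc weights multiply -- the source of the O(n1*n2)
    state count."""
    start1, arcs1, final1 = a1
    start2, arcs2, final2 = a2
    arcs, final = {}, {}
    stack = [(start1, start2)]
    seen = {(start1, start2)}
    while stack:
        s1, s2 = stack.pop()
        out = {}
        for sym, (t1, w1) in arcs1.get(s1, {}).items():
            if sym in arcs2.get(s2, {}):
                t2, w2 = arcs2[s2][sym]
                out[sym] = ((t1, t2), w1 * w2)   # weights multiply
                if (t1, t2) not in seen:
                    seen.add((t1, t2))
                    stack.append((t1, t2))
        arcs[(s1, s2)] = out
        if s1 in final1 and s2 in final2:
            final[(s1, s2)] = final1[s1] * final2[s2]
    return ((start1, start2), arcs, final)

# Two 3-state machines accepting "ab" with different weights.
a1 = (0, {0: {'a': (1, 0.5)}, 1: {'b': (2, 0.5)}}, {2: 1.0})
a2 = ('p', {'p': {'a': ('q', 0.4)}, 'q': {'b': ('r', 0.4)}}, {'r': 1.0})
start, arcs, final = intersect(a1, a2)
```

Only reachable product states are built here; without trimming or minimization, repeated products accumulate states multiplicatively, which is exactly why the message-approximation scheme below is needed.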

58 6 Message Approximation Recall that in inference, when summing out in- terior nodes wi we calculated the product over incoming messages µd→i (wi) (Equation 1), and that these products are calculated using transducer composition. [sent-172, score-0.202]

59 Unfortunately, the maximal number of states in a message is exponential in the number of words in the cognate group. [sent-173, score-0.857]

60 Thus, we need to find a way to approximate a message µ(w) using a simpler automaton µ̃(w; θ) taken from a restricted class parameterized by θ. [sent-176, score-0.213]

61 In our setting, however, n-best lists are problematic; early experiments showed that a 10,000-best list for a typical message only accounts for 50% of message log perplexity. [sent-179, score-0.185]

62 …between some approximating message µ̃(w) and the true message µ(w). [sent-183, score-0.184]

63 In this paper, µ(w) is a complex automaton with potentially many states, µ̃(w; θ) is a simple parametric automaton with forms that we discuss below, and τ(w) is an arbitrary (but hopefully fairly simple) automaton. [sent-188, score-0.161]

64 The actual method we use is … (Footnote 5: As an extreme example, suppose we have observed that W_d = w_d and that p(W_d = w_d | w_a) = 1 for all ancestral words w_a.) [sent-189, score-0.293]

65 …ner (2009), which amounts to using an expectation semiring (Eisner, 2001) to compute expected transitions in τ ◦ µ̃* under the probability distribution τ ◦ µ. [sent-195, score-0.151]

66 Since we know the topology and arc weights for τ ahead of time, this is often as simple as dividing arc weights in τ ◦ µ̃ by the corresponding arc weight in τ(w). [sent-199, score-0.223]

67 For example, if τ encodes a geometric distribution over word lengths and a uniform distribution over phonemes (that is, τ(w) ∝ …), then computing µ̃ is as simple as dividing each arc in τ ◦ µ̃ by p. [sent-200, score-0.236]

68 In our experiments, we find that τ(w) can be a geometric distribution over lengths with a uniform distribution over phonemes and still give reasonable results. [sent-205, score-0.189]

69 What remains is the selection of the topologies for the approximating message µ̃. [sent-207, score-0.159]

70 The first is a plain unigram model, the second is a bigram model, and the third is an anchored unigram topology: a position-specific unigram model for each position up to some maximum length. [sent-209, score-0.229]

71 It is similar to the unigram topology except that, instead of a single state, we have a state for each phoneme in Σ, along with a special start state. [sent-219, score-0.219]

72 The final topology we consider is the positional unigram model in Figure 2(c). [sent-222, score-0.191]

73 Estimating the parameters of this model is similar, except that the expected counts for the phonemes in the alphabet are conditioned on their position in the string. [sent-227, score-0.161]
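Estimating the anchored/positional unigram topology amounts to matching expected phoneme counts at each position and normalizing. As a sketch, the expectation is taken over an explicit weighted list of strings rather than over an automaton (which is what the expectation-semiring machinery computes in the paper); all names here are illustrative.

```python
from collections import defaultdict

def fit_anchored_unigram(weighted_words, max_len):
    """Fit a positional (anchored) unigram model by moment matching:
    for each position up to max_len, accumulate the weighted count of
    each phoneme and normalize to a per-position distribution. This is
    the minimum-KL member of the positional-unigram family for the
    given weighted sample."""
    counts = [defaultdict(float) for _ in range(max_len)]
    for word, weight in weighted_words:
        for pos, ph in enumerate(word[:max_len]):
            counts[pos][ph] += weight
    dists = []
    for pos_counts in counts:
        z = sum(pos_counts.values())
        dists.append({ph: c / z for ph, c in pos_counts.items()} if z else {})
    return dists
```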

74 The first is a “complete data” experiment, in which we reconstitute the cognate groups from the Romance data set, where all cognate groups have words in all three languages. [sent-231, score-1.746]

75 Here, only a fraction of words appear in any cognate group, so this task crucially involves the survival model. [sent-235, score-0.862]

76 The ultimate purpose of the induced cognate groups is to feed richer evolutionary models, such as full reconstruction models. [sent-236, score-0.985]

77 (2009), we compare the reconstructions produced from our automatic groups to those produced from gold cognate groups. [sent-239, score-0.903]

78 In our case, during bipartite matching the set X is the set of bigrams in the language being re-permuted, and Y is the union of bigrams in the other languages. [sent-244, score-0.153]
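The text does not spell out the baseline's exact affinity over these bigram sets; the Dice coefficient is one standard surface-similarity choice consistent with the description, and is sketched here under that assumption (the function name is hypothetical).

```python
def dice_bigrams(w1, w2):
    """Dice coefficient over character-bigram sets: 2|B1 ∩ B2| divided
    by |B1| + |B2|. A generic surface-similarity heuristic of the kind
    a baseline matcher can use as an affinity score."""
    b1 = {w1[i:i + 2] for i in range(len(w1) - 1)}
    b2 = {w2[i:i + 2] for i in range(len(w2) - 1)}
    if not b1 and not b2:
        return 1.0
    return 2 * len(b1 & b2) / (len(b1) + len(b2))
```

Such a score rewards exact substring overlap, which is why, as noted below, purely heuristic matching struggles on distant cognates where sound change has erased surface similarity.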

79 2 Experiment 1: Complete Data In this experiment, we know precisely how many cognate groups there are and that every cognate group has a word in each language. [sent-246, score-1.679]

80 We scrambled the 583 cognate groups in the Romance dataset and ran each method to convergence. [sent-248, score-0.873]

81 Besides the heuristic baseline, we tried our model-based approach using Unigrams, Bigrams and Anchored Unigrams, with and without learning the parametric edit distances. [sent-249, score-0.162]

82 When we did not use learning, we set the parameters of the edit distance to (0, -3, -4) for matches, substitutions, and deletions/insertions, respectively. [sent-250, score-0.197]

83 Levenshtein refers to the fixed-parameter edit distance transducer. [sent-257, score-0.195]

84 Accuracy measured in terms of the number of correctly, completely reconstructed cognate groups. [sent-272, score-0.796]

85 3 Experiment 2: Incomplete Data As a more realistic scenario, we consider the case where we do not know that all cognate groups have words in all languages. [sent-282, score-0.873]

86 The baseline needs to be augmented to support the fact that some words may not appear in all cognate groups. [sent-285, score-0.753]

87 This makes sense because the default edit distance parameter settings strongly favor exact matches, while the learned transducers learn more realistic substitution and deletion matrices, at the expense of making more mistakes. [sent-292, score-0.354]

88 Thus, more sophisticated transducers could learn better sound laws, which could translate into improved accuracy. [sent-298, score-0.18]

89 4 Experiment 3: Reconstructions As a final trial, we wanted to see how each automatically found cognate group fared as compared to the “true groups” for actual reconstruction of proto-words. [sent-300, score-0.867]

90 To evaluate, we matched each Latin word with the best possible cognate group for that word. [sent-307, score-0.806]

91 If two or three of the words in a constructed cognate group agreed, we assigned the Latin word associated with the true group to it. [sent-309, score-0.859]

92 As a kind of “skyline,” we compare to the edit distances reported in Bouchard-Côté et al. [sent-312, score-0.195]

93 (2009), which was based on complete knowledge of the cognate groups. [sent-313, score-0.753]

94 On this task, our reconstructed cognate groups had an average edit distance of 3. [sent-314, score-1.084]

95 This compares favorably to the edit distances reported in Bouchard-Côté et al. [sent-316, score-0.195]

96 (2009), who, using oracle cognate assignments, achieved an average Levenshtein distance of 3.9. [sent-317, score-0.835]

97 8 Conclusion We presented a new generative model of word lists that automatically finds cognate groups from scrambled vocabulary lists. [sent-319, score-0.945]

98 It jointly models the origin, propagation, and evolution of cognate groups from a common root word. [sent-320, score-0.93]

99 Finally, we demonstrate that these automatically generated cognate groups can be used to automatically reconstruct proto-words faithfully, with a small increase in error. [sent-323, score-0.873]

100 Automatic prediction of cognate orthography using support vector machines. [sent-403, score-0.753]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('cognate', 0.753), ('latin', 0.157), ('transducers', 0.138), ('edit', 0.133), ('groups', 0.12), ('survival', 0.109), ('transducer', 0.109), ('automata', 0.103), ('wd', 0.098), ('ancestral', 0.097), ('wa', 0.094), ('topology', 0.082), ('phonemes', 0.08), ('cognates', 0.079), ('bipartite', 0.079), ('phoneme', 0.077), ('dreyer', 0.075), ('italian', 0.075), ('message', 0.073), ('romance', 0.068), ('automaton', 0.066), ('mohri', 0.064), ('portuguese', 0.064), ('distances', 0.062), ('reconstruction', 0.061), ('unigram', 0.06), ('messages', 0.059), ('evolution', 0.057), ('levenshtein', 0.057), ('alignment', 0.055), ('group', 0.053), ('unigrams', 0.053), ('alphabet', 0.052), ('evolutionary', 0.051), ('positional', 0.049), ('historical', 0.049), ('anchored', 0.049), ('topologies', 0.048), ('expectation', 0.048), ('deletion', 0.048), ('languages', 0.047), ('assignments', 0.047), ('arg', 0.047), ('arc', 0.047), ('affinities', 0.045), ('alignx', 0.045), ('malay', 0.045), ('xwd', 0.045), ('parent', 0.045), ('eisner', 0.045), ('minimization', 0.043), ('kondrak', 0.043), ('reconstructed', 0.043), ('sound', 0.042), ('ancestor', 0.042), ('dice', 0.041), ('marginal', 0.04), ('spanish', 0.04), ('zz', 0.04), ('vowels', 0.039), ('comparative', 0.039), ('lists', 0.039), ('approximating', 0.038), ('distribution', 0.038), ('bigrams', 0.037), ('parameterized', 0.037), ('zw', 0.036), ('dead', 0.036), ('greek', 0.036), ('phylogenetic', 0.036), ('pthe', 0.036), ('variables', 0.035), ('insertion', 0.035), ('distance', 0.035), ('weighted', 0.035), ('transitions', 0.034), ('wi', 0.034), ('composition', 0.034), ('generative', 0.033), ('geometric', 0.033), ('changes', 0.032), ('daughter', 0.032), ('py', 0.032), ('linguists', 0.032), ('states', 0.031), ('marginals', 0.031), ('semiring', 0.031), ('bloomfield', 0.03), ('compposition', 0.03), ('enspure', 0.03), ('hearth', 0.03), ('minxw', 0.03), ('oncina', 0.03), ('reconstructions', 0.03), ('ringe', 0.03), ('vulgar', 0.03), ('daum', 0.03), 
('parametric', 0.029), ('parameters', 0.029), ('fixed', 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999976 116 acl-2010-Finding Cognate Groups Using Phylogenies

Author: David Hall ; Dan Klein


2 0.22179411 16 acl-2010-A Statistical Model for Lost Language Decipherment

Author: Benjamin Snyder ; Regina Barzilay ; Kevin Knight

Abstract: In this paper we propose a method for the automatic decipherment of lost languages. Given a non-parallel corpus in a known related language, our model produces both alphabetic mappings and translations of words into their corresponding cognates. We employ a non-parametric Bayesian framework to simultaneously capture both low-level character mappings and highlevel morphemic correspondences. This formulation enables us to encode some of the linguistic intuitions that have guided human decipherers. When applied to the ancient Semitic language Ugaritic, the model correctly maps 29 of 30 letters to their Hebrew counterparts, and deduces the correct Hebrew cognate for 60% of the Ugaritic words which have cognates in Hebrew.

3 0.11101656 170 acl-2010-Letter-Phoneme Alignment: An Exploration

Author: Sittichai Jiampojamarn ; Grzegorz Kondrak

Abstract: Letter-phoneme alignment is usually generated by a straightforward application of the EM algorithm. We explore several alternative alignment methods that employ phonetics, integer programming, and sets of constraints, and propose a novel approach of refining the EM alignment by aggregation of best alignments. We perform both intrinsic and extrinsic evaluation of the assortment of methods. We show that our proposed EM-Aggregation algorithm leads to the improvement of the state of the art in letter-to-phoneme conversion on several different data sets.

4 0.09013129 97 acl-2010-Efficient Path Counting Transducers for Minimum Bayes-Risk Decoding of Statistical Machine Translation Lattices

Author: Graeme Blackwood ; Adria de Gispert ; William Byrne

Abstract: This paper presents an efficient implementation of linearised lattice minimum Bayes-risk decoding using weighted finite state transducers. We introduce transducers to efficiently count lattice paths containing n-grams and use these to gather the required statistics. We show that these procedures can be implemented exactly through simple transformations of word sequences to sequences of n-grams. This yields a novel implementation of lattice minimum Bayes-risk decoding which is fast and exact even for very large lattices.

5 0.088689886 95 acl-2010-Efficient Inference through Cascades of Weighted Tree Transducers

Author: Jonathan May ; Kevin Knight ; Heiko Vogler

Abstract: Weighted tree transducers have been proposed as useful formal models for representing syntactic natural language processing applications, but there has been little description of inference algorithms for these automata beyond formal foundations. We give a detailed description of algorithms for application of cascades of weighted tree transducers to weighted tree acceptors, connecting formal theory with actual practice. Additionally, we present novel on-the-fly variants of these algorithms, and compare their performance on a syntax machine translation cascade based on (Yamada and Knight, 2001). 1 Motivation Weighted finite-state transducers have found recent favor as models of natural language (Mohri, 1997). In order to make actual use of systems built with these formalisms we must first calculate the set of possible weighted outputs allowed by the transducer given some input, which we call forward application, or the set of possible weighted inputs given some output, which we call backward application. After application we can do some inference on this result, such as determining its k highest weighted elements. We may also want to divide up our problems into manageable chunks, each represented by a transducer. As noted by Woods (1980), it is easier for designers to write several small transducers where each performs a simple transformation, rather than painstakingly construct a single complicated device. We would like to know, then, the result of transformation of input or output by a cascade of transducers, one operating after the other. As we will see, there are various strategies for approaching this problem. We will consider offline composition, bucket brigade applica- tion, and on-the-fly application. Application of cascades of weighted string transducers (WSTs) has been well-studied (Mohri, Heiko Vogler Technische Universit a¨t Dresden Institut f u¨r Theoretische Informatik 01062 Dresden, Germany he iko .vogle r@ tu-dre s den .de 1997). 
Less well-studied but of more recent interest is application of cascades of weighted tree transducers (WTTs). We tackle application of WTT cascades in this work, presenting: • • • explicit algorithms for application of WTT casceaxpdelisc novel algorithms for on-the-fly application of nWoTvTe lca alscgoardieths,m mansd f experiments comparing the performance of tehxepseer iamlgeonrtisthm cos.m 2 Strategies for the string case Before we discuss application of WTTs, it is helpful to recall the solution to this problem in the WST domain. We recall previous formal presentations of WSTs (Mohri, 1997) and note informally that they may be represented as directed graphs with designated start and end states and edges labeled with input symbols, output symbols, and weights.1 Fortunately, the solution for WSTs is practically trivial—we achieve application through a series of embedding, composition, and projection operations. Embedding is simply the act of representing a string or regular string language as an identity WST. Composition of WSTs, that is, generating a single WST that captures the transformations of two input WSTs used in sequence, is not at all trivial, but has been well covered in, e.g., (Mohri, 2009), where directly implementable algorithms can be found. Finally, projection is another trivial operation—the domain or range language can be obtained from a WST by ignoring the output or input symbols, respectively, on its arcs, and summing weights on otherwise identical arcs. 
By embedding an input, composing the result with the given WST, and projecting the result, forward application is accomplished.2 We are then left with a weighted string acceptor (WSA), essentially a weighted, labeled graph, which can be traversed R+1∪W {e+ as∞su}m,te ha thtro thuegh woeuitgh t hi osf p aa ppaetrh t ihsa cta wlceuilgahtetds a asre th ien prod∪uct { +of∞ ∞th}e, wtheaitgh thtes wofe i gtsh etd ogfes a, panatdh t ihsat c athlceu lwateeigdh ats so tfh ae (not necessarily finite) set T of paths is calculated as the sum of the weights of the paths of T. 2For backward applications, the roles of input and output are simply exchanged. 1058 ProceedingUsp opfs thaela 4, 8Stwhe Adnennu,a 1l1- M16ee Jtiunlgy o 2f0 t1h0e. A ?c ss2o0c1ia0ti Aosnso focria Ctioonm fpourta Ctoiomnpault Laitniognuaislt Licisn,g puaigsetisc 1s058–1066, a:a/1a:a/1 b:b/a.:5b/.1aa::ba//. 49Ea:ba/:a.6/.5 (a) InpAut strain:ag/ “a aB” emab:ae/d1dedC in an D(b) first Wa:b:aS/T./4. in cascadeE :c/.6Fa:d/.4 (c//).. second WFST in cascbaad::dde// identityA WSaT: (d)b:Oda/c:f.d3Al./6ai5n.:0c3e/.0ca:o7m/1pao :sBcdi/t.32o584n6a :p/1roa:cCa/h:db.3:/c6d.2/46.35b:a(/be.a)5 :BbA:/u.D1c/keDtbrigBadEDae:ba/p.49proach:ECDa:b/ a.:6b/5.(f)Rdc/Ae./0sD.u35b7aFl4t:c o/f.76ofcBld/iE.nD53e4F6orFbcud/.c0k7e3tapbCpd:cDli/cF.a12t38ion (g)dcI/n.A3i5tD46dcaF/lD.0oF7n3-thea f:lcBdy/E .b(12h:d)8c/O.3A5nD-4dtchF/.e0C -7f3EDlyFas:t /n.B9d-EDinF afterc /x.d3cp/6.l1o28ring C(i)ECDOcFE/.nA5-tD4hdcF/e.0- fl37ysdcta/n.35dB64-iEDnF afteBrc/Eb.3eFs6dtc/ pd./1a2.t83h46 asbeC nEDF found Cobm:bdap:c/:o/d.3/.sa6.5e05D:c tF.h0e7 tranaaaa:sd:c: cdd// /. u12..53c2846ers stacnd//d..A53-i46nDdc f.o.00r73 (f) Appaal:A:ybaD W..19ST (b)BB tEoD WST (a) aftercd p/..0/0..rDo5373Fj46ectionc o.12u28tcdg//o.A.35iDn64dgcF// e0C0dC73gEDeFFs of sBtEaDFteF ADdFc// FiBgBuEDFreF 1: Tc/ .h2.34dcr/6/e. 1e2 8 d/ iA. f53fD64edcF/ r. 0 eCC37nEDtF appBroBEaDFcFhesdc ./t2o.3d4c a..12p28plicatioCCnED tF/ h. A53r6D4ocd/uF/. 
g0073h cascBBaEdDFeFs ocdf/ .3W264dcS//.1.T282s. bydc w//..53e46ll-known aElFgorithdc/m./2.34s6 to e//.f.5f3i46cieCnEtFly finBdE the kd-c/ bedst/ p6aths. Because WSTs can be freely composed, extending application to operate on a cascade of WSTs is fairly trivial. The only question is one of composition order: whether to initially compose the cascade into a single transducer (an approach we call offline composition) or to compose the initial embedding with the first transducer, trim useless states, compose the result with the second, and so on (an approach we call bucket brigade). The appropriate strategy generally depends on the structure of the individual transducers. A third approach builds the result incrementally, as dictated by some algorithm that requests information about it. Such an approach, which we call on-the-fly, was described in (Pereira and Riley, 1997; Mohri, 2009; Mohri et al., 2000). If we can efficiently calculate the outgoing edges of a state of the result WSA on demand, without calculating all edges in the entire machine, we can maintain a stand-in for the result structure, a machine consisting at first of only the start state of the true result. As a calling algorithm (e.g., an implementation of Dijkstra’s algorithm) requests information about the result graph, such as the set of outgoing edges from a state, we replace the current stand-in with a richer version by adding the result of the request. The on-the-fly approach has a distinct advantage over the other two methods in that the entire result graph need not be built. A graphical representation of all three methods is presented in Figure 1. 3 AppCliEcdcF//a..53ti64on of treeB tFranscd/d./3.u264cers Now let us revisit these strategies in the setting of trees and tree transducers. Imagine we have a tree or set of trees as input that can be represented as a weighted regular tree grammar3 (WRTG) and a WTT that can transform that input with some weight. 
We would like to know the k-best trees the WTT can produce as output for that input, along with their weights. We already know of several methods for acquiring k-best trees from a WRTG (Huang and Chiang, 2005; Pauls and Klein, 2009), so we then must ask if, analogously to the string case, WTTs preserve recognizability4 and we can form an application WRTG. Before we begin, however, we must define WTTs and WRTGs. 3.1 Preliminaries5 A ranked alphabet is a finite set Σ such that every member σ ∈ Σ has a rank rk(σ) ∈ N. We cerayll ⊆ Σ, ∈k ∈ aNs t ahe r set rokf tσho)s ∈e σ ∈ Σe such that r⊆k(σ Σ), k= ∈k. NTh teh ese ste otf o vfa trhioasbele σs σi s∈ d eΣnoted X = {x1, x2, . . .} and is assumed to be disjnooitnetd df Xrom = any rank,e.d. a.}lp ahnadb iest aussseudm iend dth tios paper. We use to denote a symbol of rank 0 that is not iWn any e ra ⊥nk toed d eanlpohtaeb aet s yumsebdo lin o fth riasn paper. tA is tr neoet t ∈ TΣ is denoted σ(t1 , . . . , tk) where k ≥ 0, σ ∈ and t1, . . . , tk ∈ TΣ. F)o wr σ ∈ we mΣe(km) ⊥ Σ T(k), σ ∈ Σk(0 ≥) Σ 3This generates the same class of weighted tree languages as weighted tree automata, the direct analogue of WSAs, and is more useful for our purposes. 4A weighted tree language is recognizable iff it can be represented by a wrtg. 5The following formal definitions and notations are needed for understanding and reimplementation of the presented algorithms, but can be safely skipped on first reading and consulted when encountering an unfamiliar term. 1059 write σ ∈ TΣ as shorthand for σ() . For every set Sw rditiesjσo in ∈t f Trom Σ, let TΣ (S) = TΣ∪S, where, for all s ∈ S, rk(s) = 0. lW se ∈ d,e rfkin(es) th 0e. positions of a tree t = σ(t1, . . . , tk), for k 0, σ ∈ t1, . . . , tk ∈ TΣ, as a set pos(≥t) ⊂ N∗ s∈uch that {∈ε} T ∪ 1e t≤ p ois (≤t) k ⊂, ⊂v ∈ pTohse( tse)t =of {lεea}f ∪ po {siivtio |ns 1 l ≤v(t i) ≤⊆ k p,ovs(t ∈) apores t(hto)s}e. 
positions v ∈ pos(t) such that there is no i ∈ N with vi ∈ pos(t). We presume standard lexicographic orderings < and ≤ on positions. The label of t at position v, denoted by t(v), the subtree of t at v, denoted by t|v, and the replacement at v by s, denoted by t[s]v, are defined as follows:

1. For every σ ∈ Σ(0), σ(ε) = σ, σ|ε = σ, and σ[s]ε = s.
2. For every t = σ(t1, . . . , tk) such that k = rk(σ) and k ≥ 1, t(ε) = σ, t|ε = t, and t[s]ε = s. For every 1 ≤ i ≤ k and v ∈ pos(ti), t(iv) = ti(v), t|iv = ti|v, and t[s]iv = σ(t1, . . . , ti−1, ti[s]v, ti+1, . . . , tk).

The size of a tree t, size(t), is |pos(t)|, the cardinality of its position set. The yield set of a tree is the set of labels of its leaves: for a tree t, yd(t) = {t(v) | v ∈ lv(t)}.

Let A and B be sets. Let ϕ : A → TΣ(B) be a mapping. We extend ϕ to the mapping ϕ̄ : TΣ(A) → TΣ(B) such that for a ∈ A, ϕ̄(a) = ϕ(a) and for k ≥ 0, σ ∈ Σ(k), and t1, . . . , tk ∈ TΣ(A), ϕ̄(σ(t1, . . . , tk)) = σ(ϕ̄(t1), . . . , ϕ̄(tk)). We indicate such extensions by describing ϕ as a substitution mapping and then using ϕ̄ without further comment. We use R+ to denote the set {w ∈ R | w ≥ 0} and R+∞ to denote R+ ∪ {+∞}.

Definition 3.1 (cf. (Alexandrakis and Bozapalidis, 1987)) A weighted regular tree grammar (WRTG) is a 4-tuple G = (N, Σ, P, n0) where:

1. N is a finite set of nonterminals, with n0 ∈ N the start nonterminal.
2. Σ is a ranked alphabet of input symbols, where Σ ∩ N = ∅.
3. P is a tuple
(P0, π), where P0 is a finite set of productions, each production p of the form n → u, n ∈ N, u ∈ TΣ(N), and π : P0 → R+ is a weight function of the productions. We will refer to P as a finite set of weighted productions, each production p of the form n →π(p) u.

A production p is a chain production if it is of the form ni →w nj, where ni, nj ∈ N.6

6In (Alexandrakis and Bozapalidis, 1987), chain productions are forbidden in order to avoid infinite summations. We explicitly allow such summations.

A WRTG G is in normal form if each production is either a chain production or is of the form n →w σ(n1, . . . , nk) where σ ∈ Σ(k) and n1, . . . , nk ∈ N. For WRTG G = (N, Σ, P, n0), s, t, u ∈ TΣ(N), n ∈ N, and p ∈ P of the form n →w u, we obtain a derivation step from s to t by replacing some leaf nonterminal in s labeled n with u. Formally, s ⇒p t if there exists some v ∈ lv(s) such that s(v) = n and s[u]v = t. We say this derivation step is leftmost if, for all v′ ∈ lv(s) where v′ < v, s(v′) ∈ Σ. We henceforth assume all derivation steps are leftmost. If, for some m ∈ N, pi ∈ P, and ti ∈ TΣ(N) for all 1 ≤ i ≤ m, n0 ⇒p1 t1 · · · ⇒pm tm, we say the sequence d = (p1, . . . , pm) is a derivation of tm in G and that n0 ⇒∗ tm; the weight of d is wt(d) = π(p1) · . . . · π(pm). The weighted tree language recognized by G is the mapping LG : TΣ → R+∞ such that for every t ∈ TΣ, LG(t) is the sum of the weights of all (possibly infinitely many) derivations of t in G. A weighted tree language f : TΣ → R+∞ is recognizable if there is a WRTG G such that f = LG.

We define a partial ordering ⪯ on WRTGs such that for WRTGs G1 = (N1, Σ, P1, n0) and G2 = (N2, Σ, P2, n0), we say G1 ⪯ G2 iff N1 ⊆ N2 and P1 ⊆ P2, where the weights are preserved.
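For an acyclic, normal-form, chain-free grammar, the recognized weight LG(t) of Definition 3.1 can be computed directly from the definition by summing production-weight products over derivations. A minimal sketch (tuple-encoded trees; the encoding and names are ours, purely illustrative):

```python
# A normal-form WRTG as a dict: nonterminal -> list of
# ((label, child_nonterminals), weight). Trees are (label, *children).

def weight(grammar, n, t):
    """Sum of wt(d) over all derivations of tree t from nonterminal n,
    i.e. L_G(t) when n is the start nonterminal (acyclic case only)."""
    total = 0.0
    for (label, kids), w in grammar.get(n, []):
        if label == t[0] and len(kids) == len(t) - 1:
            p = w
            for kid_n, kid_t in zip(kids, t[1:]):
                p *= weight(grammar, kid_n, kid_t)
            total += p
    return total

# g0 -> s(g1, g1) with weight 0.6; g1 -> a and g1 -> b, weight 0.5 each
G = {"g0": [(("s", ["g1", "g1"]), 0.6)],
     "g1": [(("a", []), 0.5), (("b", []), 0.5)]}
print(weight(G, "g0", ("s", ("a",), ("b",))))  # 0.15
```

With cycles (e.g., chain productions back to the same nonterminal) the sum may be infinite, which is exactly why the paper's algorithms insist on chain-production removal.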
Definition 3.2 (cf. Def. 1 of (Maletti, 2008)) A weighted extended top-down tree transducer (WXTT) is a 5-tuple M = (Q, Σ, ∆, R, q0) where:

1. Q is a finite set of states.
2. Σ and ∆ are the ranked alphabets of input and output symbols, respectively, where (Σ ∪ ∆) ∩ Q = ∅.
3. R is a tuple (R0, π), where R0 is a finite set of rules, each rule r of the form q.y → u for q ∈ Q, y ∈ TΣ(X), and u ∈ T∆(Q × X). We further require that no variable x ∈ X appears more than once in y, and that each variable appearing in u is also in y. Moreover, π : R0 → R+∞ is a weight function of the rules. As for WRTGs, we refer to R as a finite set of weighted rules, each rule r of the form q.y →π(r) u.

A WXTT is linear (respectively, nondeleting) if, for each rule r of the form q.y →w u, each x ∈ yd(y) ∩ X appears at most once (respectively, at least once) in u. We denote the class of all WXTTs as wxT and add the letters L and N to signify the subclasses of linear and nondeleting WXTT, respectively. Additionally, if y is of the form σ(x1, . . . , xk), we remove the letter "x" to signify the transducer is not extended (i.e., it is a "traditional" WTT (Fülöp and Vogler, 2009)).

For WXTT M = (Q, Σ, ∆, R, q0), s, t ∈ T∆(Q × TΣ), and r ∈ R of the form q.y →w u, we obtain a derivation step from s to t by replacing some leaf of s labeled with q and a tree matching y by a transformation of u, where each instance of a variable has been replaced by a corresponding subtree of the y-matching tree.
Formally, s ⇒r t if there is a position v ∈ pos(s), a substitution mapping ϕ : X → TΣ, and a rule q.y →w u ∈ R such that s(v) = (q, ϕ̄(y)) and t = s[ϕ′(u)]v, where ϕ′ is a substitution mapping Q × X → T∆(Q × TΣ) defined such that ϕ′(q′, x) = (q′, ϕ(x)) for all q′ ∈ Q and x ∈ X. We say this derivation step is leftmost if, for all v′ ∈ lv(s) where v′ < v, s(v′) ∈ ∆. We henceforth assume all derivation steps are leftmost. If, for some s ∈ TΣ, m ∈ N, ri ∈ R, and ti ∈ T∆(Q × TΣ) for all 1 ≤ i ≤ m, (q0, s) ⇒r1 t1 · · · ⇒rm tm, we say the sequence d = (r1, . . . , rm) is a derivation of (s, tm) in M; the weight of d is wt(d) = π(r1) · . . . · π(rm). The weighted tree transformation recognized by M is the mapping τM : TΣ × T∆ → R+∞, such that for every s ∈ TΣ and t ∈ T∆, τM(s, t) is the sum of the weights of all (possibly infinitely many) derivations of (s, t) in M.

The composition of two weighted tree transformations τ : TΣ × T∆ → R+∞ and µ : T∆ × TΓ → R+∞ is the weighted tree transformation (τ ; µ) : TΣ × TΓ → R+∞ where for every s ∈ TΣ and u ∈ TΓ, (τ ; µ)(s, u) = Σ_{t ∈ T∆} τ(s, t) · µ(t, u).

3.2 Applicable classes

We now consider transducer classes where recognizability is preserved under application. Table 1 presents known results for the top-down tree transducer classes described in Section 3.1. Unlike the string case, preservation of recognizability is not universal or symmetric. This is important for us, because we can only construct an application WRTG, i.e., a WRTG representing the result of application, if we can ensure that the language generated by application is in fact recognizable.
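For intuition, the composition (τ ; µ) defined at the end of Section 3.1 is just a sum-product over intermediate trees, which is easy to state in code when the relevant intermediate set is finite. A toy sketch (trees abbreviated to strings; all names and weights invented for illustration):

```python
# (tau ; mu)(s, u) = sum over intermediate trees t of tau(s, t) * mu(t, u).
# Transformations are dicts keyed by (input, output) pairs; missing
# pairs have weight 0.

def compose(tau, mu, middle):
    """Compose two weighted transformations over a finite set `middle`
    of intermediate trees. In general T_Delta is infinite; this is only
    a finite illustration of the defining formula."""
    out = {}
    for (s, _) in tau:
        for (_, u) in mu:
            w = sum(tau.get((s, t), 0.0) * mu.get((t, u), 0.0)
                    for t in middle)
            if w:
                out[(s, u)] = w
    return out

tau = {("s", "t1"): 0.5, ("s", "t2"): 0.25}
mu = {("t1", "u"): 0.5, ("t2", "u"): 1.0}
print(compose(tau, mu, {"t1", "t2"}))  # {('s', 'u'): 0.5}
```

The syntactic composition algorithms below (Algorithms 1 and 2) achieve the effect of this sum-product without ever enumerating the intermediate trees.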
Of the types under consideration, only wxLNT and wLNT preserve forward recognizability. The two classes marked as open questions and the other classes, which are superclasses of wNT, do not or are presumed not to. All subclasses of wxLT preserve backward recognizability.7 We do not consider cases where recognizability is not preserved in the remainder of this paper. If a transducer M of a class that preserves forward recognizability is applied to a WRTG G, we can call the result the forward application WRTG M(G)▷, and if M preserves backward recognizability, we can call the result the backward application WRTG M(G)◁.

7Note that the introduction of weights limits recognizability preservation considerably. For example, (unweighted) xT preserves backward recognizability.

Now that we have explained the application problem in the context of weighted tree transducers and determined the classes for which application is possible, let us consider how to build forward and backward application WRTGs. Our basic approach mimics that taken for WSTs by using an embed-compose-project strategy. As in the string case, if we can embed the input in a transducer, compose with the given transducer, and project the result, we can obtain the application WRTG. Embedding a WRTG in a wLNT is a trivial operation—if the WRTG is in normal form and chain production-free,8 for every production of the form n →w σ(n1, . . . , nk), create a rule of the form n.σ(x1, . . . , xk) →w σ(n1.x1, . . . , nk.xk). Range projection of a wxLNT is also trivial—for every q ∈ Q and u ∈ T∆(Q × X), create a production of the form q →w u′, where u′ is formed from u by replacing all leaves of the form q.x with the leaf q, i.e., removing references to variables, and w is the sum of the weights of all rules of the form q.y → u in R.9 Domain projection for wxLT is best explained by way of example.
The left side of a rule is preserved, with variable leaves replaced by their associated states from the right side. So, the rule q1.σ(γ(x1), x2) →w δ(q2.x2, β(α, q3.x1)) would yield the production q1 →w σ(γ(q3), q2) in the domain projection. However, a deleting rule such as q1.σ(x1, x2) →w γ(q2.x2) necessitates the introduction of a new nonterminal ⊥ that can generate all of TΣ with weight 1.

The only missing piece in our embed-compose-project strategy is composition. Algorithm 1, which is based on the declarative construction of Maletti (2006), generates the syntactic composition of a wxLT and a wLNT, a generalization of the basic composition construction of Baker (1979). It calls Algorithm 2, which determines the sequences of rules in the second transducer that match the right side of a single rule in the first transducer. Since the embedded WRTG is of type wLNT, it may be either the first or second argument provided to Algorithm 1, depending on whether the application is forward or backward. We can thus use the embed-compose-project strategy for forward application of wLNT and backward application of wxLT and wxLNT. Note that we cannot use this strategy for forward application of wxLNT, even though that class preserves recognizability.

8Without loss of generality we assume this is so, since standard algorithms exist to remove chain productions (Kuich, 1998; Ésik and Kuich, 2003; Mohri, 2009) and convert into normal form (Alexandrakis and Bozapalidis, 1987).
9Finitely many such productions may be formed.

Algorithm 1 COMPOSE
1: inputs
2: wxLT M1 = (Q1, Σ, ∆, R1, q10)
3: wLNT M2 = (Q2, ∆, Γ, R2, q20)
4: outputs
5: wxLT M3 = ((Q1 × Q2), Σ, Γ, R3, (q10, q20)) such that τM3 = (τM1 ; τM2)
6: complexity
7: O(|R1| max(|R2|^size(ũ), |Q2|)), where ũ is the largest right side tree in any rule in R1
8: Let R3 be of the form (R3,0, π)
9: R3 ← (∅, ∅)
10: Ξ ← {(q10, q20)} {seen states}
11: Ψ ← {(q10, q20)} {pending states}
12: while Ψ ≠ ∅ do
13: (q1, q2) ← any element of Ψ
14: Ψ ← Ψ \ {(q1, q2)}
15: for all (q1.y →w1 u) ∈ R1 do
16: for all (z, w2) ∈ COVER(u, M2, q2) do
17: for all (q, x) ∈ yd(z) ∩ ((Q1 × Q2) × X) do
18: if q ∉ Ξ then
19: Ξ ← Ξ ∪ {q}
20: Ψ ← Ψ ∪ {q}
21: r ← ((q1, q2).y → z)
22: R3,0 ← R3,0 ∪ {r}
23: π(r) ← π(r) + (w1 · w2)
24: return M3

4 Application of tree transducer cascades

What about the case of an input WRTG and a cascade of tree transducers? We will revisit the three strategies for accomplishing application discussed above for the string case. In order for offline composition to be a viable strategy, the transducers in the cascade must be closed under composition. Unfortunately, of the classes that preserve recognizability, only wLNT is closed under composition (Gécseg and Steinby, 1984; Baker, 1979; Maletti et al., 2009; Fülöp and Vogler, 2009).

However, the general lack of composability of tree transducers does not preclude us from conducting forward application of a cascade. We revisit the bucket brigade approach, which in Section 2 appeared to be little more than a choice of composition order. As discussed previously, application of a single transducer involves an embedding, a composition, and a projection. The embedded WRTG is in the class wLNT, and the projection forms another WRTG. As long as every transducer in the cascade can be composed with a wLNT to its left or right, depending on the application type, application of a cascade is possible. Note that this embed-compose-project process is somewhat more burdensome than in the string case.
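The embedding step at the heart of this strategy is purely mechanical: each normal-form production n →w σ(n1, . . . , nk) becomes the identity rule n.σ(x1, . . . , xk) →w σ(n1.x1, . . . , nk.xk). A sketch using our own tuple encoding (the representation is illustrative, not from any toolkit):

```python
# A normal-form production is (n, label, [n1, ..., nk], w).
# The embedded identity-wLNT rule is
# (state, (label, [x1..xk]), (label, [(n1, 'x1'), ...]), w).

def embed(productions):
    """Embed a normal-form, chain-production-free WRTG as an identity
    wLNT, one rule per production."""
    rules = []
    for n, label, kids, w in productions:
        xs = ["x%d" % i for i in range(1, len(kids) + 1)]
        rhs = (label, list(zip(kids, xs)))
        rules.append((n, (label, xs), rhs, w))
    return rules

prods = [("g0", "s", ["g1", "g2"], 0.5), ("g1", "a", [], 1.0)]
for rule in embed(prods):
    print(rule)
# ('g0', ('s', ['x1', 'x2']), ('s', [('g1', 'x1'), ('g2', 'x2')]), 0.5)
# ('g1', ('a', []), ('a', []), 1.0)
```

The resulting transducer is linear and nondeleting by construction, which is what lets it serve as either argument to COMPOSE.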
For strings, application is obtained by a single embedding, a series of compositions, and a single projection, whereas application for trees is obtained by a series of (embed, compose, project) operations.

Algorithm 2 COVER
1: inputs
2: u ∈ T∆(Q1 × X)
3: wT M2 = (Q2, ∆, Γ, R2, q20)
4: state q2 ∈ Q2
5: outputs
6: set of pairs (z, w) with z ∈ TΓ((Q1 × Q2) × X) formed by one or more successful runs on u by rules in R2, starting from q2, and w ∈ R+∞ the sum of the weights of all such runs
7: complexity
8: O(|R2|^size(u))
9: if u(ε) is of the form (q1, x) ∈ Q1 × X then
10: zinit ← ((q1, q2), x)
11: else
12: zinit ← ⊥
13: Πlast ← {(zinit, {((ε, ε), q2)}, 1)}
14: for all v ∈ pos(u) such that u(v) ∈ ∆(k) for some k ≥ 0, in prefix order do
15: Πv ← ∅
16: for all (z, θ, w) ∈ Πlast do
17: for all v′ ∈ lv(z) such that z(v′) = ⊥ do
18: for all (θ(v, v′).u(v)(x1, . . . , xk) →w′ h) ∈ R2 do
19: θ′ ← θ
20: Form substitution mapping ϕ : (Q2 × X) → TΓ(((Q1 × Q2) × X) ∪ {⊥})
21: for i = 1 to k do
22: for all v′′ ∈ pos(h) such that h(v′′) = (q2′, xi) for some q2′ ∈ Q2 do
23: θ′(vi, v′v′′) ← q2′
24: if u(vi) is of the form (q1, x) ∈ Q1 × X then
25: ϕ(q2′, xi) ← ((q1, q2′), x)
26: else
27: ϕ(q2′, xi) ← ⊥
28: Πv ← Πv ∪ {(z[ϕ̄(h)]v′, θ′, w · w′)}
29: Πlast ← Πv
30: Z ← {z | (z, θ, w) ∈ Πlast}
31: return {(z, Σ_{(z,θ,w)∈Πlast} w) | z ∈ Z}

4.1 On-the-fly algorithms

We next consider on-the-fly algorithms for application. Similar to the string case, an on-the-fly approach is driven by a calling algorithm that periodically needs to know the productions in a WRTG with a common left side nonterminal. The embed-compose-project approach produces an entire application WRTG before any inference algorithm is run. In order to admit an on-the-fly approach we describe algorithms that only generate those productions in a WRTG that have a given left nonterminal.
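This query-driven discipline can be sketched generically: a grammar object that owns an expansion callback (standing in for the PRODUCE algorithm defined next; the names here are hypothetical) and materializes the productions for a nonterminal only on first request.

```python
class StandInWRTG:
    """A stand-in grammar: productions for a nonterminal are generated
    by `expand` only when first requested, and cached thereafter."""
    def __init__(self, expand):
        self.expand = expand      # nonterminal -> [(rhs, weight), ...]
        self.productions = {}     # only what has been requested so far

    def prods(self, n):
        if n not in self.productions:  # first request grows the stand-in
            self.productions[n] = self.expand(n)
        return self.productions[n]

calls = []
def expand(n):  # hypothetical stand-in for PRODUCE's real work
    calls.append(n)
    return [(("a", []), 1.0)] if n == "g1" else [(("s", ["g1"]), 0.5)]

g = StandInWRTG(expand)
g.prods("g0")
g.prods("g0")
print(calls)  # ['g0'] -- g0 expanded once, g1 never touched
```

A 1-best search driving `prods` would expand only the nonterminals it actually reaches, mirroring how PRODUCE avoids building unreachable productions in the experiments below.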
In this section we extend Definition 3.1 as follows: a WRTG is a 6-tuple G = (N, Σ, P, n0, M, G) where N, Σ, P, and n0 are defined as in Definition 3.1, and either M = G = ∅,10 or M is a wxLNT and G is a normal form, chain production-free WRTG such that G ⪯ M(G)▷. In the latter case, G is a stand-in for M(G)▷, analogous to the stand-ins for WSAs and WSTs described in Section 2.

10In which case the definition is functionally unchanged from before.

Table 1: Preservation of forward and backward recognizability for various classes of top-down tree transducers. As discussed above, of the classes considered only w[x]LNT preserves forward recognizability, while all subclasses of w[x]LT preserve backward recognizability. Here and elsewhere, the following abbreviations apply: w = weighted, x = extended LHS, L = linear, N = nondeleting, OQ = open question. Square brackets include a superposition of classes. For example, w[x]T signifies both wxT and wT.

Algorithm 3, PRODUCE, takes as input a WRTG Gin = (Nin, ∆, Pin, n0, M, G) and a desired nonterminal nin and returns another WRTG, Gout, that is different from Gin in that it has more productions, specifically those beginning with nin that are in M(G)▷. Algorithms using stand-ins should call PRODUCE to ensure the stand-in they are using has the desired productions beginning with the specific nonterminal. Note, then, that PRODUCE obtains the effect of forward application in an on-the-fly manner.11 It makes calls to REPLACE, which is presented in Algorithm 4, as well as to a NORM algorithm that ensures normal form by replacing a single production not in normal form with several normal-form productions that can be combined together (Alexandrakis and Bozapalidis, 1987) and a CHAIN-REM algorithm that replaces a WRTG containing chain productions with an equivalent WRTG that does not (Mohri, 2009).

Algorithm 3 PRODUCE
1: inputs
2: WRTG Gin = (Nin, ∆, Pin, n0, M, G) such that M = (Q, Σ, ∆, R, q0) is a wxLNT and G = (N, Σ, P, n0′, M′, G′) is a WRTG in normal form with no chain productions
3: nin ∈ Nin
4: outputs
5: WRTG Gout = (Nout, ∆, Pout, n0, M, G), such that Gin ⪯ Gout and (nin →w u) ∈ Pout ⇔ (nin →w u) ∈ M(G)▷
6: complexity
7: O(|R||P|^size(ỹ)), where ỹ is the largest left side tree in any rule in R
8: if Pin contains productions of the form nin →w u then
9: return Gin
10: Nout ← Nin
11: Pout ← Pin
12: Let nin be of the form (n, q), where n ∈ N and q ∈ Q
13: for all (q.y →w1 u) ∈ R do
14: for all (θ, w2) ∈ REPLACE(y, G, n) do
15: Form substitution mapping ϕ : Q × X → T∆(N × Q) such that, for all v ∈ yd(y) and q′ ∈ Q, if there exist n′ ∈ N and x ∈ X such that θ(v) = n′ and y(v) = x, then ϕ(q′, x) = (n′, q′)
16: p′ ← ((n, q) →w1·w2 ϕ̄(u))
17: for all p ∈ NORM(p′, Nout) do
18: Let p be of the form n′ →w δ(n1, . . . , nk) for δ ∈ ∆(k)
19: Nout ← Nout ∪ {n′, n1, . . . , nk}
20: Pout ← Pout ∪ {p}
21: return CHAIN-REM(Gout)

Algorithm 4 REPLACE
1: inputs
2: y ∈ TΣ(X)
3: WRTG G = (N, Σ, P, n0, M, G) in normal form, with no chain productions
4: n ∈ N
5: outputs
6: set Π of pairs (θ, w) where θ is a mapping pos(y) → N and w ∈ R+∞, each pair indicating a successful run on y by productions in G, starting from n, and w is the weight of the run
7: complexity
8: O(|P|^size(y))
9: Πlast ← {({(ε, n)}, 1)}
10: for all v ∈ pos(y) such that y(v) ∉ X, in prefix order do
11: Πv ← ∅
12: for all (θ, w) ∈ Πlast do
13: if M ≠ ∅ and G ≠ ∅ then
14: G ← PRODUCE(G, θ(v))
15: for all (θ(v) →w′ y(v)(n1, . . . , nk)) ∈ P do
16: Πv ← Πv ∪ {(θ ∪ {(vi, ni) | 1 ≤ i ≤ k}, w · w′)}
17: Πlast ← Πv
18: return Πlast

Algorithm 5 MAKE-EXPLICIT
1: inputs
2: WRTG G = (N, Σ, P, n0, M, G) in normal form
3: outputs
4: WRTG G′ = (N′, Σ, P′, n0, M, G), in normal form, such that if M ≠ ∅ and G ≠ ∅, LG′ = L_{M(G)▷}, and otherwise G′ = G
5: complexity
6: O(|P′|)
7: G′ ← G
8: Ξ ← {n0} {seen nonterminals}
9: Ψ ← {n0} {pending nonterminals}
10: while Ψ ≠ ∅ do
11: n ← any element of Ψ
12: Ψ ← Ψ \ {n}
13: if M ≠ ∅ and G ≠ ∅ then
14: G′ ← PRODUCE(G′, n)
15: for all (n →w σ(n1, . . . , nk)) ∈ P′ do
16: for i = 1 to k do
17: if ni ∉ Ξ then
18: Ξ ← Ξ ∪ {ni}
19: Ψ ← Ψ ∪ {ni}
20: return G′

g0 →w1 σ(g0, g1)   g0 →w2 α   g1 →w3 α
(a) Input WRTG G
a0.σ(x1, x2) →w4 σ(a0.x1, a1.x2)   a0.σ(x1, x2) →w5 ψ(a2.x1, a1.x2)
a0.α →w6 α   a1.α →w7 α   a2.α →w8 ρ
(b) First transducer MA in the cascade
b0.σ(x1, x2) →w9 σ(b0.x1, b0.x2)   b0.α →w10 α
(c) Second transducer MB in the cascade
g0a0 →w1·w4 σ(g0a0, g1a1)   g0a0 →w1·w5 ψ(g0a2, g1a1)
g0a0 →w2·w6 α   g1a1 →w3·w7 α
(d) Productions of MA(G)▷ built as a consequence of building the complete MB(MA(G)▷)▷
g0a0b0 →w1·w4·w9 σ(g0a0b0, g1a1b0)   g0a0b0 →w2·w6·w10 α   g1a1b0 →w3·w7·w10 α
(e) Complete MB(MA(G)▷)▷
Figure 2: Forward application through a cascade of tree transducers using an on-the-fly method.
As an example of stand-in construction, consider the invocation PRODUCE(G1, g0a0), where G1 = ({g0a0}, {σ, ψ, α, ρ}, ∅, g0a0, MA, G), G is in Figure 2a, and MA is in Figure 2b.12 The stand-in WRTG that is output contains the first three of the four productions in Figure 2d.

To demonstrate the use of on-the-fly application in a cascade, we next show the effect of PRODUCE when used with the cascade G ◦ MA ◦ MB, where MB is in Figure 2c. Our driving algorithm in this case is Algorithm 5, MAKE-EXPLICIT, which simply generates the full application WRTG using calls to PRODUCE. The input to MAKE-EXPLICIT is G2 = ({g0a0b0}, {σ, α}, ∅, g0a0b0, MB, G1).13 MAKE-EXPLICIT calls PRODUCE(G2, g0a0b0). PRODUCE then seeks to cover b0.σ(x1, x2) →w9 σ(b0.x1, b0.x2) with productions from G1, which is a stand-in for MA(G)▷. At line 14 of REPLACE, G1 is improved so that it has the appropriate productions. The productions of MA(G)▷ that must be built to form the complete MB(MA(G)▷)▷ are shown in Figure 2d. The complete MB(MA(G)▷)▷ is shown in Figure 2e.

11Note further that it allows forward application of class wxLNT, something the embed-compose-project approach did not allow.
12By convention the initial nonterminal and state are listed first in graphical depictions of WRTGs and WXTTs.

rJJ.JJ(x1, x2, x3) → JJ(rDT.x1, rJJ.x2, rVB.x3)
rVB.VB(x1, x2, x3) → VB(rNNPS.x1, rNN.x3, rVB.x2)
t."gentle" → "gentle"
(a) Rotation rules
iVB.NN(x1, x2) → NN(INS, iNN.x1, iNN.x2)
iVB.NN(x1, x2) → NN(iNN.x1, iNN.x2)
iVB.NN(x1, x2) → NN(iNN.x1, iNN.x2, INS)
(b) Insertion rules
t.VB(x1, x2, x3) → X(t.x1, t.x2, t.x3)
t."gentleman" → j1
t."gentleman" → EPS
t.INS → j1
t.INS → j2
(c) Translation rules
Figure 3: Example rules from transducers used in decoding experiment. j1 and j2 are Japanese words.
Note that because we used this on-the-fly approach, we were able to avoid building all the productions in MA(G)▷; in particular we did not build g0a2 →w2·w8 ρ, while a bucket brigade approach would have built this production. We have also designed an analogous on-the-fly PRODUCE algorithm for backward application on linear WTT.

We have now defined several on-the-fly and bucket brigade algorithms, and also discussed the possibility of embed-compose-project and offline composition strategies for application of cascades of tree transducers. Tables 2a and 2b summarize the available methods of forward and backward application of cascades for recognizability-preserving tree transducer classes.

13Note that G2 is the initial stand-in for MB(MA(G)▷)▷, since G1 is the initial stand-in for MA(G)▷.

Table 2: Transducer types and available methods of forward and backward application of a cascade. oc = offline composition, bb = bucket brigade, otf = on the fly. (a) Forward application. (b) Backward application.

5 Decoding Experiments

The main purpose of this paper has been to present novel algorithms for performing application. However, it is important to demonstrate these algorithms on real data. We thus demonstrate bucket-brigade and on-the-fly backward application on a typical NLP task cast as a cascade of wLNT. We adapt the Japanese-to-English translation model of Yamada and Knight (2001) by transforming it from an English-tree-to-Japanese-string model to an English-tree-to-Japanese-tree model. The Japanese trees are unlabeled, meaning they have syntactic structure but all nodes are labeled "X". We then cast this modified model as a cascade of LNT tree transducers. Space does not permit a detailed description, but some example rules are in Figure 3. The rotation transducer R, a sample of which is in Figure 3a, has 6,453 rules, the insertion transducer I, Figure 3b, has 8,122 rules, and the translation transducer T, Figure 3c, has 37,311 rules. We add an English syntax language model L to the cascade of transducers just described to better simulate an actual machine translation decoding task. The language model is cast as an identity WTT and thus fits naturally into the experimental framework. In our experiments we try several different language models to demonstrate varying performance of the application algorithms. The most realistic language model is a PCFG. Each rule captures the probability of a particular sequence of child labels given a parent label. This model has 7,765 rules. To demonstrate more extreme cases of the usefulness of the on-the-fly approach, we build a language model that recognizes exactly the 2,087 trees in the training corpus, each with equal weight. It has 39,455 rules. Finally, to be ultra-specific, we include a form of the "specific" language model just described, but only allow the English counterpart of the particular Japanese sentence being decoded in the language. The goal in our experiments is to apply a single tree t backward through the cascade L ◦ R ◦ I ◦ T ◦ t and find the 1-best path in the application WRTG. We evaluate the speed of each approach: bucket brigade and on-the-fly. The algorithm we use to obtain the 1-best path is a modification of the k-best algorithm of Pauls and Klein (2009). Our algorithm finds the 1-best path in a WRTG and admits an on-the-fly approach. The results of the experiments are shown in Table 3.
As can be seen, on-the-fly application is generally faster than the bucket brigade, about double the speed per sentence in the traditional experiment that uses an English PCFG language model. The results for the other two language models demonstrate more keenly the potential advantage that an on-the-fly approach provides—the simultaneous incorporation of information from all models allows application to be done more effectively than if each information source is considered in sequence. In the "exact" case, where a very large language model that simply recognizes each of the 2,087 trees in the training corpus is used, the final application is so large that it overwhelms the resources of a 4gb MacBook Pro, while the on-the-fly approach does not suffer from this problem. The "1-sent" case is presented to demonstrate the ripple effect caused by using on-the-fly. In the other two cases, a very large language model generally overwhelms the timing statistics, regardless of the method being used. But a language model that represents exactly one sentence is very small, and thus the effects of simultaneous inference are readily apparent—the time to retrieve the 1-best sentence is reduced by two orders of magnitude in this experiment.

Table 3: Timing results to obtain 1-best from application through a weighted tree transducer cascade, using on-the-fly vs. bucket brigade backward application techniques. pcfg = model recognizes any tree licensed by a pcfg built from observed data, exact = model recognizes each of 2,000+ trees with equal weight, 1-sent = model recognizes exactly one tree.

6 Conclusion

We have presented algorithms for forward and backward application of weighted tree transducer cascades, including on-the-fly variants, and demonstrated the benefit of an on-the-fly approach to application.
We note that a more formal approach to application of WTTs is being developed, independent from these efforts, by Fülöp et al. (2010).

Acknowledgments

We are grateful for extensive discussions with Andreas Maletti. We also appreciate the insights and advice of David Chiang, Steve DeNeefe, and others at ISI in the preparation of this work. Jonathan May and Kevin Knight were supported by NSF grants IIS-0428020 and IIS-0904684. Heiko Vogler was supported by DFG VO 1011/5-1.

References

Athanasios Alexandrakis and Symeon Bozapalidis. 1987. Weighted grammars and Kleene's theorem. Information Processing Letters, 24(1):1–4.

Brenda S. Baker. 1979. Composition of top-down and bottom-up tree transductions. Information and Control, 41(2):186–213.

Zoltán Ésik and Werner Kuich. 2003. Formal tree series. Journal of Automata, Languages and Combinatorics, 8(2):219–285.

Zoltán Fülöp and Heiko Vogler. 2009. Weighted tree automata and tree transducers. In Manfred Droste, Werner Kuich, and Heiko Vogler, editors, Handbook of Weighted Automata, chapter 9, pages 313–404. Springer-Verlag.

Zoltán Fülöp, Andreas Maletti, and Heiko Vogler. 2010. Backward and forward application of weighted extended tree transducers. Unpublished manuscript.

Ferenc Gécseg and Magnus Steinby. 1984. Tree Automata. Akadémiai Kiadó, Budapest.

Liang Huang and David Chiang. 2005. Better k-best parsing. In Harry Bunt, Robert Malouf, and Alon Lavie, editors, Proceedings of the Ninth International Workshop on Parsing Technologies (IWPT), pages 53–64, Vancouver, October. Association for Computational Linguistics.

Werner Kuich. 1998. Formal power series over trees. In Symeon Bozapalidis, editor, Proceedings of the 3rd International Conference on Developments in Language Theory (DLT), pages 61–101, Thessaloniki, Greece. Aristotle University of Thessaloniki.

Werner Kuich. 1999. Tree transducers and formal tree series. Acta Cybernetica, 14:135–149.
Andreas Maletti, Jonathan Graehl, Mark Hopkins, and Kevin Knight. 2009. The power of extended top-down tree transducers. SIAM Journal on Computing, 39(2):410–430.

Andreas Maletti. 2006. Compositions of tree series transformations. Theoretical Computer Science, 366:248–271.

Andreas Maletti. 2008. Compositions of extended top-down tree transducers. Information and Computation, 206(9–10):1187–1196.

Andreas Maletti. 2009. Personal communication.

Mehryar Mohri, Fernando C. N. Pereira, and Michael Riley. 2000. The design principles of a weighted finite-state transducer library. Theoretical Computer Science, 231:17–32.

Mehryar Mohri. 1997. Finite-state transducers in language and speech processing. Computational Linguistics, 23(2):269–312.

Mehryar Mohri. 2009. Weighted automata algorithms. In Manfred Droste, Werner Kuich, and Heiko Vogler, editors, Handbook of Weighted Automata, chapter 6, pages 213–254. Springer-Verlag.

Adam Pauls and Dan Klein. 2009. K-best A* parsing. In Keh-Yih Su, Jian Su, Janyce Wiebe, and Haizhou Li, editors, Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 958–966, Suntec, Singapore, August. Association for Computational Linguistics.

Fernando Pereira and Michael Riley. 1997. Speech recognition by composition of weighted finite automata. In Emmanuel Roche and Yves Schabes, editors, Finite-State Language Processing, chapter 15, pages 431–453. MIT Press, Cambridge, MA.

William A. Woods. 1980. Cascaded ATN grammars. American Journal of Computational Linguistics, 6(1):1–12.

Kenji Yamada and Kevin Knight. 2001. A syntax-based statistical translation model. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, pages 523–530, Toulouse, France, July. Association for Computational Linguistics.

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.90697402 116 acl-2010-Finding Cognate Groups Using Phylogenies

Author: David Hall ; Dan Klein

Abstract: A central problem in historical linguistics is the identification of historically related cognate words. We present a generative phylogenetic model for automatically inducing cognate group structure from unaligned word lists. Our model represents the process of transformation and transmission from ancestor word to daughter word, as well as the alignment between the word lists of the observed languages. We also present a novel method for simplifying complex weighted automata created during inference to counteract the otherwise exponential growth of message sizes. On the task of identifying cognates in a dataset of Romance words, our model significantly outperforms a baseline approach, increasing accuracy by as much as 80%. Finally, we demonstrate that our automatically induced groups can be used to successfully reconstruct ancestral words.

2 0.7890169 16 acl-2010-A Statistical Model for Lost Language Decipherment

Author: Benjamin Snyder ; Regina Barzilay ; Kevin Knight

Abstract: In this paper we propose a method for the automatic decipherment of lost languages. Given a non-parallel corpus in a known related language, our model produces both alphabetic mappings and translations of words into their corresponding cognates. We employ a non-parametric Bayesian framework to simultaneously capture both low-level character mappings and highlevel morphemic correspondences. This formulation enables us to encode some of the linguistic intuitions that have guided human decipherers. When applied to the ancient Semitic language Ugaritic, the model correctly maps 29 of 30 letters to their Hebrew counterparts, and deduces the correct Hebrew cognate for 60% of the Ugaritic words which have cognates in Hebrew.

3 0.58215535 95 acl-2010-Efficient Inference through Cascades of Weighted Tree Transducers

Author: Jonathan May ; Kevin Knight ; Heiko Vogler

Abstract: Weighted tree transducers have been proposed as useful formal models for representing syntactic natural language processing applications, but there has been little description of inference algorithms for these automata beyond formal foundations. We give a detailed description of algorithms for application of cascades of weighted tree transducers to weighted tree acceptors, connecting formal theory with actual practice. Additionally, we present novel on-the-fly variants of these algorithms, and compare their performance on a syntax machine translation cascade based on (Yamada and Knight, 2001).

1 Motivation

Weighted finite-state transducers have found recent favor as models of natural language (Mohri, 1997). In order to make actual use of systems built with these formalisms we must first calculate the set of possible weighted outputs allowed by the transducer given some input, which we call forward application, or the set of possible weighted inputs given some output, which we call backward application. After application we can do some inference on this result, such as determining its k highest weighted elements. We may also want to divide up our problems into manageable chunks, each represented by a transducer. As noted by Woods (1980), it is easier for designers to write several small transducers where each performs a simple transformation, rather than painstakingly construct a single complicated device. We would like to know, then, the result of transformation of input or output by a cascade of transducers, one operating after the other. As we will see, there are various strategies for approaching this problem. We will consider offline composition, bucket brigade application, and on-the-fly application. Application of cascades of weighted string transducers (WSTs) has been well-studied (Mohri, 1997).
Less well-studied but of more recent interest is application of cascades of weighted tree transducers (WTTs). We tackle application of WTT cascades in this work, presenting:

• explicit algorithms for application of WTT cascades,
• novel algorithms for on-the-fly application of WTT cascades, and
• experiments comparing the performance of these algorithms.

2 Strategies for the string case

Before we discuss application of WTTs, it is helpful to recall the solution to this problem in the WST domain. We recall previous formal presentations of WSTs (Mohri, 1997) and note informally that they may be represented as directed graphs with designated start and end states and edges labeled with input symbols, output symbols, and weights. Fortunately, the solution for WSTs is practically trivial: we achieve application through a series of embedding, composition, and projection operations. Embedding is simply the act of representing a string or regular string language as an identity WST. Composition of WSTs, that is, generating a single WST that captures the transformations of two input WSTs used in sequence, is not at all trivial, but has been well covered in, e.g., (Mohri, 2009), where directly implementable algorithms can be found. Finally, projection is another trivial operation: the domain or range language can be obtained from a WST by ignoring the output or input symbols, respectively, on its arcs, and summing weights on otherwise identical arcs.
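The embed-compose-project recipe for the string case can be sketched concretely. The following is a minimal illustrative sketch, not the paper's implementation: the representation of a WST as (start, finals, transitions) tuples and all function names are assumptions, and epsilon transitions are omitted for simplicity.

```python
# Sketch of forward application for WSTs via embed-compose-project.
# A WST is (start, finals, transitions); each transition is
# (src, in_sym, out_sym, weight, dst). Epsilon-free for simplicity.

def embed(s):
    """Represent string s as an identity WST."""
    trans = [(i, c, c, 1.0, i + 1) for i, c in enumerate(s)]
    return (0, {len(s)}, trans)

def compose(m1, m2):
    """Product construction: match m1's output symbols to m2's input symbols."""
    s1, f1, t1 = m1
    s2, f2, t2 = m2
    trans = [((p1, p2), a, c, w1 * w2, (q1, q2))
             for (p1, a, b, w1, q1) in t1
             for (p2, b2, c, w2, q2) in t2 if b == b2]
    finals = {(x, y) for x in f1 for y in f2}
    return ((s1, s2), finals, trans)

def project_range(m):
    """Drop input symbols, leaving a weighted string acceptor (WSA)."""
    start, finals, trans = m
    return (start, finals, [(p, c, w, q) for (p, _, c, w, q) in trans])

def paths(wsa):
    """Enumerate accepted strings and weights (acyclic machines only)."""
    start, finals, trans = wsa
    def walk(state, out, w):
        if state in finals:
            yield "".join(out), w
        for (p, c, wt, q) in trans:
            if p == state:
                yield from walk(q, out + [c], w * wt)
    yield from walk(start, [], 1.0)

# Forward application to the string "ab" of a one-state WST that
# rewrites a -> x (weight 0.5) and b -> y (weight 2.0).
wst = (0, {0}, [(0, "a", "x", 0.5, 0), (0, "b", "y", 2.0, 0)])
result = project_range(compose(embed("ab"), wst))
print(dict(paths(result)))  # {'xy': 1.0}
```

Extending this sketch to a cascade is then exactly the composition-order question the paper raises: compose the WSTs with each other first (offline) or fold each into the embedded input in turn (bucket brigade).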
By embedding an input, composing the result with the given WST, and projecting the result, forward application is accomplished. (For backward application, the roles of input and output are simply exchanged.) We are then left with a weighted string acceptor (WSA), essentially a weighted, labeled graph, which can be traversed by well-known algorithms to efficiently find the k-best paths. We assume throughout this paper that weights are in R+ ∪ {+∞}, that the weight of a path is calculated as the product of the weights of its edges, and that the weight of a (not necessarily finite) set T of paths is calculated as the sum of the weights of the paths of T.

[Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1058-1066, Uppsala, 11-16 July 2010.]

[Figure 1: Three different approaches to application through cascades of WSTs: (a) an input string embedded in an identity WST; (b)-(c) the two WSTs in the cascade; (d) the offline composition approach; (e) the bucket brigade approach; (f) the result of application after projection; (g)-(i) on-the-fly stand-ins as they are progressively explored.]

Because WSTs can be freely composed, extending application to operate on a cascade of WSTs is fairly trivial. The only question is one of composition order: whether to initially compose the cascade into a single transducer (an approach we call offline composition) or to compose the initial embedding with the first transducer, trim useless states, compose the result with the second, and so on (an approach we call bucket brigade). The appropriate strategy generally depends on the structure of the individual transducers. A third approach builds the result incrementally, as dictated by some algorithm that requests information about it. Such an approach, which we call on-the-fly, was described in (Pereira and Riley, 1997; Mohri, 2009; Mohri et al., 2000). If we can efficiently calculate the outgoing edges of a state of the result WSA on demand, without calculating all edges in the entire machine, we can maintain a stand-in for the result structure, a machine consisting at first of only the start state of the true result. As a calling algorithm (e.g., an implementation of Dijkstra's algorithm) requests information about the result graph, such as the set of outgoing edges from a state, we replace the current stand-in with a richer version by adding the result of the request. The on-the-fly approach has a distinct advantage over the other two methods in that the entire result graph need not be built. A graphical representation of all three methods is presented in Figure 1.

3 Application of tree transducers

Now let us revisit these strategies in the setting of trees and tree transducers. Imagine we have a tree or set of trees as input that can be represented as a weighted regular tree grammar (WRTG) and a WTT that can transform that input with some weight. We would like to know the k-best trees the WTT can produce as output for that input, along with their weights. We already know of several methods for acquiring k-best trees from a WRTG (Huang and Chiang, 2005; Pauls and Klein, 2009), so we then must ask if, analogously to the string case, WTTs preserve recognizability and we can form an application WRTG. (A WRTG generates the same class of weighted tree languages as a weighted tree automaton, the direct analogue of a WSA, and is more useful for our purposes; a weighted tree language is recognizable iff it can be represented by a WRTG.) Before we begin, however, we must define WTTs and WRTGs.

3.1 Preliminaries

The following formal definitions and notations are needed for understanding and reimplementation of the presented algorithms, but they can be safely skipped on first reading and consulted when encountering an unfamiliar term.

A ranked alphabet is a finite set Σ such that every member σ ∈ Σ has a rank rk(σ) ∈ N. We call Σ(k) ⊆ Σ, k ∈ N, the set of those σ ∈ Σ such that rk(σ) = k. The set of variables is denoted X = {x1, x2, . . .} and is assumed to be disjoint from any ranked alphabet used in this paper. We use ⊥ to denote a symbol of rank 0 that is not in any ranked alphabet used in this paper. A tree t ∈ TΣ is denoted σ(t1, . . . , tk) where k ≥ 0, σ ∈ Σ(k), and t1, . . . , tk ∈ TΣ. For σ ∈ Σ(0) we write σ ∈ TΣ as shorthand for σ(). For every set S disjoint from Σ, let TΣ(S) = TΣ∪S, where, for all s ∈ S, rk(s) = 0.

We define the positions of a tree t = σ(t1, . . . , tk), for k ≥ 0, σ ∈ Σ(k), t1, . . . , tk ∈ TΣ, as a set pos(t) ⊂ N* such that pos(t) = {ε} ∪ {iv | 1 ≤ i ≤ k, v ∈ pos(ti)}. The set of leaf positions lv(t) ⊆ pos(t) are those positions v ∈ pos(t) such that for no i ∈ N, vi ∈ pos(t). We presume standard lexicographic orderings < and ≤ on pos(t). The label of t at position v, denoted by t(v), the subtree of t at v, denoted by t|v, and the replacement at v by s, denoted by t[s]v, are defined as follows:

1. For every σ ∈ Σ(0), σ(ε) = σ, σ|ε = σ, and σ[s]ε = s.
2. For every t = σ(t1, . . . , tk) such that k = rk(σ) and k ≥ 1, t(ε) = σ, t|ε = t, and t[s]ε = s. For every 1 ≤ i ≤ k and v ∈ pos(ti), t(iv) = ti(v), t|iv = ti|v, and t[s]iv = σ(t1, . . . , ti−1, ti[s]v, ti+1, . . . , tk).

The size of a tree t, size(t), is |pos(t)|, the cardinality of its position set. The yield set of a tree is the set of labels of its leaves: for a tree t, yd(t) = {t(v) | v ∈ lv(t)}.

Let A and B be sets, and let ϕ : A → TΣ(B) be a mapping. We extend ϕ to the mapping ϕ̄ : TΣ(A) → TΣ(B) such that for a ∈ A, ϕ̄(a) = ϕ(a) and, for k ≥ 0, σ ∈ Σ(k), and t1, . . . , tk ∈ TΣ(A), ϕ̄(σ(t1, . . . , tk)) = σ(ϕ̄(t1), . . . , ϕ̄(tk)). We indicate such extensions by describing ϕ as a substitution mapping and then use ϕ without further comment. We use R+ to denote the set {w ∈ R | w ≥ 0} and R+∞ to denote the set R+ ∪ {+∞}.

Definition 3.1 (cf. Alexandrakis and Bozapalidis (1987)). A weighted regular tree grammar (WRTG) is a 4-tuple G = (N, Σ, P, n0) where:

1. N is a finite set of nonterminals, with n0 ∈ N the start nonterminal.
2. Σ is a ranked alphabet of input symbols, where Σ ∩ N = ∅.
3. P is a tuple (P0, π), where P0 is a finite set of productions, each production p of the form n → u, n ∈ N, u ∈ TΣ(N), and π : P0 → R+ is a weight function of the productions. We will refer to P as a finite set of weighted productions, each production p of the form n −π(p)→ u. A production p is a chain production if it is of the form ni −w→ nj, where ni, nj ∈ N. (In Alexandrakis and Bozapalidis (1987), chain productions are forbidden in order to avoid infinite summations; we explicitly allow such summations.)

A WRTG G is in normal form if each production is either a chain production or is of the form n −w→ σ(n1, . . . , nk) where σ ∈ Σ(k) and n1, . . . , nk ∈ N. For WRTG G = (N, Σ, P, n0), s, t, u ∈ TΣ(N), n ∈ N, and p ∈ P of the form n −w→ u, we obtain a derivation step from s to t by replacing some leaf nonterminal in s labeled n with u. Formally, s ⇒p t if there exists some v ∈ lv(s) such that s(v) = n and s[u]v = t. We say this derivation step is leftmost if, for all v′ ∈ lv(s) where v′ < v, s(v′) ∈ Σ. We henceforth assume all derivation steps are leftmost. If, for some m ∈ N, pi ∈ P, and ti ∈ TΣ(N) for all 1 ≤ i ≤ m, n0 ⇒p1 t1 · · · ⇒pm tm, we say the sequence d = (p1, . . . , pm) is a derivation of tm in G and that n0 ⇒* tm; the weight of d is wt(d) = π(p1) · . . . · π(pm). The weighted tree language recognized by G is the mapping LG : TΣ → R+∞ such that, for every t ∈ TΣ, LG(t) is the sum of the weights of all (possibly infinitely many) derivations of t in G. A weighted tree language f : TΣ → R+∞ is recognizable if there is a WRTG G such that f = LG. We define a partial ordering ⪯ on WRTGs such that, for WRTGs G1 = (N1, Σ, P1, n0) and G2 = (N2, Σ, P2, n0), we say G1 ⪯ G2 iff N1 ⊆ N2 and P1 ⊆ P2, where the weights are preserved.

Definition 3.2 (cf. Def. 1 of Maletti (2008)). A weighted extended top-down tree transducer (WXTT) is a 5-tuple M = (Q, Σ, ∆, R, q0) where:

1. Q is a finite set of states.
2. Σ and ∆ are the ranked alphabets of input and output symbols, respectively, where (Σ ∪ ∆) ∩ Q = ∅.
3. R is a tuple (R0, π), where R0 is a finite set of rules, each rule r of the form q.y → u for q ∈ Q, y ∈ TΣ(X), and u ∈ T∆(Q × X). We further require that no variable x ∈ X appears more than once in y, and that each variable appearing in u is also in y. Moreover, π : R0 → R+∞ is a weight function of the rules. As for WRTGs, we refer to R as a finite set of weighted rules, each rule r of the form q.y −π(r)→ u.

A WXTT is linear (respectively, nondeleting) if, for each rule r of the form q.y −w→ u, each x ∈ yd(y) ∩ X appears at most once (respectively, at least once) in u. We denote the class of all WXTTs as wxT and add the letters L and N to signify the subclasses of linear and nondeleting WXTTs, respectively. Additionally, if y is of the form σ(x1, . . . , xk), we remove the letter "x" to signify that the transducer is not extended (i.e., it is a "traditional" WTT (Fülöp and Vogler, 2009)).

For WXTT M = (Q, Σ, ∆, R, q0), s, t ∈ T∆(Q × TΣ), and r ∈ R of the form q.y −w→ u, we obtain a derivation step from s to t by replacing some leaf of s labeled with q and a tree matching y by a transformation of u, where each instance of a variable has been replaced by a corresponding subtree of the y-matching tree. Formally, s ⇒r t if there is a position v ∈ pos(s), a substitution mapping ϕ : X → TΣ, and a rule q.y −w→ u ∈ R such that s(v) = (q, ϕ(y)) and t = s[ϕ′(u)]v, where ϕ′ is a substitution mapping Q × X → T∆(Q × TΣ) defined such that ϕ′(q′, x) = (q′, ϕ(x)) for all q′ ∈ Q and x ∈ X. We say this derivation step is leftmost if, for all v′ ∈ lv(s) where v′ < v, s(v′) ∈ ∆. We henceforth assume all derivation steps are leftmost. If, for some s ∈ TΣ, m ∈ N, ri ∈ R, and ti ∈ T∆(Q × TΣ) for all 1 ≤ i ≤ m, (q0, s) ⇒r1 t1 · · · ⇒rm tm, we say the sequence d = (r1, . . . , rm) is a derivation of (s, tm) in M; the weight of d is wt(d) = π(r1) · . . . · π(rm). The weighted tree transformation recognized by M is the mapping τM : TΣ × T∆ → R+∞ such that, for every s ∈ TΣ and t ∈ T∆, τM(s, t) is the sum of the weights of all (possibly infinitely many) derivations of (s, t) in M. The composition of two weighted tree transformations τ : TΣ × T∆ → R+∞ and µ : T∆ × TΓ → R+∞ is the weighted tree transformation (τ ; µ) : TΣ × TΓ → R+∞ where, for every s ∈ TΣ and u ∈ TΓ, (τ ; µ)(s, u) = Σ over t ∈ T∆ of τ(s, t) · µ(t, u).

3.2 Applicable classes

We now consider transducer classes where recognizability is preserved under application. Table 1 presents known results for the top-down tree transducer classes described in Section 3.1. Unlike the string case, preservation of recognizability is not universal or symmetric. This is important for us, because we can only construct an application WRTG, i.e., a WRTG representing the result of application, if we can ensure that the language generated by application is in fact recognizable.
Of the types under consideration, only wxLNT and wLNT preserve forward recognizability. The two classes marked as open questions and the other classes, which are superclasses of wNT, do not or are presumed not to. All subclasses of wxLT preserve backward recognizability. (Note that the introduction of weights limits recognizability preservation considerably; for example, unweighted xT preserves backward recognizability.) We do not consider cases where recognizability is not preserved in the remainder of this paper. If a transducer M of a class that preserves forward recognizability is applied to a WRTG G, we call the result the forward application WRTG M(G)▶, and if M preserves backward recognizability, we call the result the backward application WRTG M(G)◀.

Now that we have explained the application problem in the context of weighted tree transducers and determined the classes for which application is possible, let us consider how to build forward and backward application WRTGs. Our basic approach mimics that taken for WSTs by using an embed-compose-project strategy. As in the string world, if we can embed the input in a transducer, compose with the given transducer, and project the result, we can obtain the application WRTG. Embedding a WRTG in a wLNT is a trivial operation: if the WRTG is in normal form and chain production-free, then for every production of the form n −w→ σ(n1, . . . , nk) we create a rule of the form n.σ(x1, . . . , xk) −w→ σ(n1.x1, . . . , nk.xk). Range projection of a wxLNT is also trivial: for every q ∈ Q and u ∈ T∆(Q × X), create a production of the form q −w→ u′, where u′ is formed from u by replacing all leaves of the form q′.x with the leaf q′, i.e., removing references to variables, and w is the sum of the weights of all rules of the form q.y → u in R. Domain projection for wxLT is best explained by way of example.
The left side of a rule is preserved, with variable leaves replaced by their associated states from the right side. So, the rule q1.σ(γ(x1), x2) −w→ δ(q2.x2, β(α, q3.x1)) would yield the production q1 −w→ σ(γ(q3), q2) in the domain projection. However, a deleting rule such as q1.σ(x1, x2) −w→ γ(q2.x2) necessitates the introduction of a new nonterminal ⊥ that can generate all of TΣ with weight 1.

The only missing piece in our embed-compose-project strategy is composition. Algorithm 1, which is based on the declarative construction of Maletti (2006), generates the syntactic composition of a wxLT and a wLNT, a generalization of the basic composition construction of Baker (1979). It calls Algorithm 2, which determines the sequences of rules in the second transducer that match the right side of a single rule in the first transducer. Since the embedded WRTG is of type wLNT, it may be either the first or second argument provided to Algorithm 1, depending on whether the application is forward or backward. We can thus use the embed-compose-project strategy for forward application of wLNT and backward application of wxLT and wxLNT. Note that we cannot use this strategy for forward application of wxLNT, even though that class preserves recognizability. (We assume without loss of generality that the input WRTG is chain production-free and in normal form, since standard algorithms exist to remove chain productions (Kuich, 1998; Ésik and Kuich, 2003; Mohri, 2009) and to convert into normal form (Alexandrakis and Bozapalidis, 1987). Also, only finitely many range-projection productions are formed.)

Algorithm 1 COMPOSE
1: inputs
2:   wxLT M1 = (Q1, Σ, ∆, R1, q10)
3:   wLNT M2 = (Q2, ∆, Γ, R2, q20)
4: outputs
5:   wxLT M3 = ((Q1 × Q2), Σ, Γ, R3, (q10, q20)) such that M3 = (τM1 ; τM2)
6: complexity
7:   O(|R1| · max(|R2|^size(ũ), |Q2|)), where ũ is the largest right side tree in any rule in R1
8: Let R3 be of the form (R3,0, π)
9: R3 ← (∅, ∅)
10: Ξ ← {(q10, q20)} {seen states}
11: Ψ ← {(q10, q20)} {pending states}
12: while Ψ ≠ ∅ do
13:   (q1, q2) ← any element of Ψ
14:   Ψ ← Ψ \ {(q1, q2)}
15:   for all (q1.y −w1→ u) ∈ R1 do
16:     for all (z, w2) ∈ COVER(u, M2, q2) do
17:       for all (q, x) ∈ yd(z) ∩ ((Q1 × Q2) × X) do
18:         if q ∉ Ξ then
19:           Ξ ← Ξ ∪ {q}
20:           Ψ ← Ψ ∪ {q}
21:       r ← ((q1, q2).y → z)
22:       R3,0 ← R3,0 ∪ {r}
23:       π(r) ← π(r) + (w1 · w2)
24: return M3

4 Application of tree transducer cascades

What about the case of an input WRTG and a cascade of tree transducers? We will revisit the three strategies for accomplishing application discussed above for the string case. In order for offline composition to be a viable strategy, the transducers in the cascade must be closed under composition. Unfortunately, of the classes that preserve recognizability, only wLNT is closed under composition (Gécseg and Steinby, 1984; Baker, 1979; Maletti et al., 2009; Fülöp and Vogler, 2009). However, the general lack of composability of tree transducers does not preclude us from conducting forward application of a cascade. We revisit the bucket brigade approach, which in Section 2 appeared to be little more than a choice of composition order. As discussed previously, application of a single transducer involves an embedding, a composition, and a projection. The embedded WRTG is in the class wLNT, and the projection forms another WRTG. As long as every transducer in the cascade can be composed with a wLNT to its left or right, depending on the application type, application of a cascade is possible. Note that this embed-compose-project process is somewhat more burdensome than in the string case.
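The trivial WRTG-to-wLNT embedding described above (each normal-form production becomes an identity rule over its children) can be sketched as follows; the rule encoding is an illustrative assumption, not the paper's data structure.

```python
# Sketch of the WRTG -> wLNT embedding: every production
# n -w-> sigma(n1, ..., nk) becomes the identity rule
# n.sigma(x1, ..., xk) -w-> sigma(n1.x1, ..., nk.xk).

def embed_wrtg(wrtg):
    """wrtg: {nonterminal: [(symbol, child_nts, weight)]}, assumed to be
    in normal form with no chain productions. Returns transducer rules
    keyed by (state, matched symbol, arity); each right side pairs the
    child state with the index of the variable x_i it consumes."""
    rules = {}
    for n, prods in wrtg.items():
        for (sym, kids, w) in prods:
            rhs = (sym, tuple((kid, i) for i, kid in enumerate(kids)))
            rules.setdefault((n, sym, len(kids)), []).append((rhs, w))
    return rules

G = {
    "g0": [("sigma", ("g0", "g1"), 0.6), ("alpha", (), 0.4)],
    "g1": [("alpha", (), 1.0)],
}
rules = embed_wrtg(G)
print(rules[("g0", "sigma", 2)])
# [(('sigma', (('g0', 0), ('g1', 1))), 0.6)]
```

The resulting transducer is linear and nondeleting by construction, which is what lets it sit on either side of Algorithm 1's composition.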
6: complexity 7: O(|R1 | max(|R2|size( ˜u), |Q2|)), where ˜u is the × lOar(g|eRst |rimgahtx s(|idRe t|ree in a,n|yQ ru|l))e in R1 8: Let R3be of the form (R30,π) 9: R3 ← (∅, ∅) 10: Ξ ←← ←{ ((q∅,10∅ , q20 )} {seen states} 11 10 : ΨΞ ←← {{((qq10 , q20 ))}} {{speeennd sintagt essta}tes} 1112:: Ψwh ←ile {Ψ( ∅ do) 1123:: (ilqe1 , Ψq26 =) ← ∅ daony element of 14: ← Ψ) \← {a(nqy1 , ql2em)}e 15: for all (q1.y q− −w →1 u) ∈ R1 do 16: for all (z, −w − →2) u∈) )C ∈O RVER(u, M2, q2) do 17: for all (q, x) )∈ ∈∈ C yOdV V(Ez)R ∩(u u(,(QM1 Q2) X) do 18: i fa lql (∈q ,Ξx )th ∈en y 19: qΞ6 ∈ ← Ξ tΞh e∪n {q} 20: ΞΨ ←← ΞΨ ∪∪ {{qq}} 21: r ← ((Ψq1 ← , q 2Ψ) .y {→q }z) 22: rR30 ← ←← (( qR03 ∪ {).ry} 23: π(r)← ←← R π(∪r) { +r} (w1 · w2) 24: return M3 = Ψ 4 Ψ Application of tree transducer cascades What about the case of an input WRTG and a cascade of tree transducers? We will revisit the three strategies for accomplishing application discussed above for the string case. In order for offline composition to be a viable strategy, the transducers in the cascade must be closed under composition. Unfortunately, of the classes that preserve recognizability, only wLNT × is closed under composition (G´ ecseg and Steinby, 1984; Baker, 1979; Maletti et al., 2009; F ¨ul ¨op and Vogler, 2009). However, the general lack of composability of tree transducers does not preclude us from conducting forward application of a cascade. We revisit the bucket brigade approach, which in Section 2 appeared to be little more than a choice of composition order. As discussed previously, application of a single transducer involves an embedding, a composition, and a projection. The embedded WRTG is in the class wLNT, and the projection forms another WRTG. As long as every transducer in the cascade can be composed with a wLNT to its left or right, depending on the application type, application of a cascade is possible. Note that this embed-compose-project process is somewhat more burdensome than in the string case. 
For strings, application is obtained by a single embedding, a series of compositions, and a single projecAlgorithm 2 COVER 1: inputs 2: u ∈ T∆ (Q1 X) 3: wuT ∈ M T2 = (Q×2, X X∆), Γ, R2, q20 ) 4: state q2 ∈ Q2 ×× × 5: outputs 6: set of pairs (z, w) with z ∈ TΓ ((Q1 Q2) X) fsoetrm ofed p ab yir so (nze, ,o wr m) worieth hsu zcc ∈es Tsful runs× ×on Q Qu )b y × ×ru Xles) in R2, starting from q2, and w ∈ R+∞ the sum of the weights of all such runs,. 7: complexity 8: O(|R2|size(u)) 9: 10: 11: 12: 13: 14: if u(ε) is of the form (q1,x) ∈ Q1× X then zinit ← ((q1 q2), x) else zinit ← ⊥ Πlast ←← ←{(z ⊥init, {((ε, ε), q2)}, 1)} for all← v ∈ pos(u) εsu,εch), tqha)t} u(v) ∈ ∆(k) for some fko ≥r 0ll li nv p ∈ref ipxo osr(ude)r sduoc 15: ≥Π v0 i←n p ∅r 16: for ←all ∅(z, θ, w) ∈ Πlast do 17: rf aorll a(zll, vθ0, ∈w )lv ∈(z Π) such that z(v0) = ⊥ do 18: 19: 20: 21: 22: 23: 24: 25: 26: 27: 28: , dfoor all(θ(v,v0).u(v)(x1,...,xk) −w →0h)∈R2 θ0 ← θ For←m sθubstitution mapping ϕ : (Q2 X) → TΓ((Q1 Q2 X) {⊥}). f→or Ti = 1to× ×k dQo for all v00 ∈ pos(h) such that h(v00) = (q02 , xi∈) for some q20 ∈ Q2 do θ0(vi, v0v00) ← q20 if u(vi) ←is q of the form (q1, x) ∈ Q1 X then ∪ ,ϕ(x)q20 ∈, x Qi) ←× X((q t1h, eqn20), x) else ϕ(q20, xi) ← ⊥ Πv ← Πv {(z[ϕ)( ←h)] ⊥v0 , θ0, w · w0)} ∪ 29: Πlast ← Πv 30: Z ← {z |← ←(z Π, θ, Xw) 31: return {(z, X ∈ (z ,θ ,wX) X∈Πl Πlast} w) | z ∈ Z} ast X tion, whereas application for trees is obtained by a series of (embed, compose, project) operations. 4.1 On-the-fly algorithms We next consider on-the-fly algorithms for application. Similar to the string case, an on-thefly approach is driven by a calling algorithm that periodically needs to know the productions in a WRTG with a common left side nonterminal. The embed-compose-project approach produces an entire application WRTG before any inference algorithm is run. In order to admit an on-the-fly approach we describe algorithms that only generate those productions in a WRTG that have a given left nonterminal. 
In this section we extend Definition 3. 1 as follows: a WRTG is a 6tuple G = (N, P, n0,M,G) where N, P, and n0 are defined as in Definition 3. 1, and either M = G = ∅,10 or M is a wxLNT and G is a normMal = =fo Grm =, c ∅h,ain production-free WRTG such that Σ, 10In which case Σ, the definition is functionally unchanged from before. 1062 w t[xLypN] LeT (a)pPresOYNer Qovsateiodn?f(Grwe´ca(Fsd¨uMrg(lKeS¨coa punelgsid cowtihuSza,[rtxlbce1.2i]9,l0Nn2tybT091), 984w [txy](pbL Ne)T PrespvatiosYnNero svfebda?ckw(rFdu¨MlSeo¨c apesgl onwtiuza[r,xlcb.2]ei,NlL0t2yT 91)0 Table 1: Preservation of forward and backward recognizability for various classes of top-down tree transducers. Here and elsewhere, the following abbreviations apply: w = weighted, x = extended LHS, L = linear, N = nondeleting, OQ = open question. Square brackets include a superposition of classes. For example, w[x]T signifies both wxT and wT. Algorithm 3 PRODUCE 1: inputs 2: WRTG Gin = (Nin, ∆, Pin, n0, M, G) such that M = (Q, Σ, ∆, R, q0) is a wxLNT and G = (N, Σ, P, n00, M0, G0) is a WRTG in normal form with no chain productions 3: nin ∈ Nin 4: outputs∈ 5: WRTG Gout = (Nout, ∆, Pout, n0, M, G), such that Gin ? Gout and (nin ?−→w G u) ∈ Pout ⇔ (nin − →w u) ∈ M(G). 6: complex−i →ty 7: O(|R| ), where ˜y is the largest left side tree iOn (a|Rny| | rPul|e in R |P|size( y˜) 8: if Pincontains productions of the form nin− →w u then 9: return Gin 10: Nout ← Nin 11: Pout ←← P Nin 12: Let ni←n b Pe of the form (n, q), where n ∈ N and q ∈ Q. × × 13: for all (q.y −f −wt → 1he u) ∈ R do 14: for all (θ, w2) ∈ ∈RE RPL doACE(y,G, n) do 15: Form subs)ti ∈tu RtiEonP LmAaCppEi(nyg, Gϕ, n: Qo X → T∆ (N Q) such that, for all v ∈ ydQ Q(y) × ×and X Xq0 → →∈ Q, (ifN Nth ×ereQ e)x sisutc nh0 h∈a tN, f aonrd a lxl v∈ ∈X y sdu(cyh) th anatd θ q(v∈) = n0 and y(v) = x, t∈he Nn aϕn(dq0 x , x ∈) X= ( snu0c,h hq t0)ha. 
16:     p' ← ((n, q) →w1·w2 ϕ(u))
17:     for all p ∈ NORM(p', Nout) do
18:       Let p be of the form n' →w δ(n1, ..., nk) for δ ∈ ∆
19:       Nout ← Nout ∪ {n', n1, ..., nk}
20:       Pout ← Pout ∪ {p}
21: return CHAIN-REM(Gout)

In the latter case, G is a stand-in for M(G), analogous to the stand-ins for WSAs and WSTs described in Section 2. Algorithm 3, PRODUCE, takes as input a WRTG Gin = (Nin, ∆, Pin, n0, M, G) and a desired nonterminal nin, and returns another WRTG, Gout, that differs from Gin in that it has more productions, specifically those beginning with nin that are in M(G). Algorithms using stand-ins should call PRODUCE to ensure the stand-in they are using has the desired productions beginning with the specific nonterminal. Note, then, that PRODUCE obtains the effect of forward application in an on-the-fly manner. It makes calls to REPLACE, which is presented in Algorithm 4, as well as to a NORM algorithm that ensures normal form by replacing a single production not in normal form with several normal-form productions that can be combined together (Alexandrakis and Bozapalidis, 1987), and to a CHAIN-REM algorithm that replaces a WRTG containing chain productions with an equivalent WRTG that does not (Mohri, 2009).

Algorithm 4 REPLACE
1: inputs
2:   y ∈ TΣ(X)
3:   WRTG G = (N, Σ, P, n0, M, G) in normal form, with no chain productions
4:   n ∈ N
5: outputs
6:   set Π of pairs (θ, w), where θ is a mapping pos(y) → N and w ∈ R+∞; each pair indicates a successful run on y by productions in G, starting from n, and w is the weight of the run.
7: complexity
8:   O(|P|^size(y))
9: Πlast ← {({(ε, n)}, 1)}
10: for all v ∈ pos(y) such that y(v) ∉ X, in prefix order do
11:   Πv ← ∅
12:   for all (θ, w) ∈ Πlast do
13:     if M ≠ ∅ and G ≠ ∅ then
14:       G ← PRODUCE(G, θ(v))
15:     for all (θ(v) →w' y(v)(n1, ..., nk)) ∈ P do
16:       Πv ← Πv ∪ {(θ ∪ {(vi, ni) | 1 ≤ i ≤ k}, w·w')}
17:   Πlast ← Πv
18: return Πlast

Algorithm 5 MAKE-EXPLICIT
1: inputs
2:   WRTG G = (N, Σ, P, n0, M, G) in normal form
3: outputs
4:   WRTG G' = (N', Σ, P', n0, M, G), in normal form, such that if M ≠ ∅ and G ≠ ∅, LG' = M(G), and otherwise G' = G.
5: complexity
6:   O(|P'|)
7: G' ← G
8: Ξ ← {n0} {seen nonterminals}
9: Ψ ← {n0} {pending nonterminals}
10: while Ψ ≠ ∅ do
11:   n ← any element of Ψ
12:   Ψ ← Ψ \ {n}
13:   if M ≠ ∅ and G ≠ ∅ then
14:     G' ← PRODUCE(G', n)
15:   for all (n →w σ(n1, ..., nk)) ∈ P' do
16:     for i = 1 to k do
17:       if ni ∉ Ξ then
18:         Ξ ← Ξ ∪ {ni}
19:         Ψ ← Ψ ∪ {ni}
20: return G'

Figure 2: Forward application through a cascade of tree transducers using an on-the-fly method.

(a) Input WRTG G:
  g0 →w1 σ(g0, g1)
  g0 →w2 α
  g1 →w3 α

(b) First transducer MA in the cascade:
  a0.σ(x1, x2) →w4 σ(a0.x1, a1.x2)
  a0.σ(x1, x2) →w5 ψ(a2.x1, a1.x2)
  a0.α →w6 α
  a1.α →w7 α
  a2.α →w8 ρ

(c) Second transducer MB in the cascade:
  b0.σ(x1, x2) →w9 σ(b0.x1, b0.x2)
  b0.α →w10 α

(d) Productions of MA(G) built as a consequence of building the complete MB(MA(G)):
  g0a0 →w1·w4 σ(g0a0, g1a1)
  g0a0 →w1·w5 ψ(g0a2, g1a1)
  g0a0 →w2·w6 α
  g1a1 →w3·w7 α

(e) Complete MB(MA(G)):
  g0a0b0 →w1·w4·w9 σ(g0a0b0, g1a1b0)
  g0a0b0 →w2·w6·w10 α
  g1a1b0 →w3·w7·w10 α
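The CHAIN-REM step that PRODUCE ends with (and the NORM step it uses along the way) are standard grammar clean-ups. As a rough illustration, here is a minimal Python sketch of chain-production removal over a Viterbi (max, ×) semiring; the tuple-based production encoding is hypothetical and far simpler than the WRTGs above, not the paper's implementation:

```python
from collections import defaultdict

def chain_removal(productions):
    """Remove chain productions n -w-> m (right side a bare nonterminal)
    from a toy WRTG. Productions are (lhs, weight, rhs), where rhs is
    either a str (a chain to another nonterminal) or a tuple
    (label, child_nonterminals)."""
    chains = defaultdict(dict)   # chains[n][m] = best direct chain weight
    proper = defaultdict(list)   # non-chain productions, grouped by lhs
    for lhs, w, rhs in productions:
        if isinstance(rhs, str):
            chains[lhs][rhs] = max(w, chains[lhs].get(rhs, 0.0))
        else:
            proper[lhs].append((w, rhs))
    # Transitive closure of chain weights in the (max, *) semiring.
    nts = set(chains) | set(proper)
    closure = {n: {n: 1.0} for n in nts}
    changed = True
    while changed:
        changed = False
        for n in nts:
            for m, w1 in list(closure[n].items()):
                for k, w2 in chains.get(m, {}).items():
                    w = w1 * w2
                    if w > closure[n].get(k, 0.0):
                        closure[n][k] = w
                        changed = True
    # Reattach proper productions through the closure.
    out = []
    for n in nts:
        for m, wc in closure[n].items():
            for w, rhs in proper.get(m, []):
                out.append((n, wc * w, rhs))
    return out
```

On the toy input [("g0", 0.5, "g1"), ("g1", 1.0, ("a", ()))], the chain g0 → g1 is eliminated by reattaching g1's production to g0 with weight 0.5, leaving no chain productions.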
As an example of stand-in construction, consider the invocation PRODUCE(G1, g0a0), where G1 = ({g0a0}, {σ, ψ, α, ρ}, ∅, g0a0, MA, G), G is in Figure 2a, and MA is in Figure 2b. The stand-in WRTG that is output contains the first three of the four productions in Figure 2d.

To demonstrate the use of on-the-fly application in a cascade, we next show the effect of PRODUCE when used with the cascade G ◦ MA ◦ MB, where MB is in Figure 2c. Our driving algorithm in this case is Algorithm 5, MAKE-EXPLICIT, which simply generates the full application WRTG using calls to PRODUCE. (Note that this allows forward application of class wxLNT, something the embed-compose-project approach did not allow. By convention, the initial nonterminal and state are listed first in graphical depictions of WRTGs and WXTTs.)

Figure 3: Example rules from transducers used in the decoding experiment. j1 and j2 are Japanese words.

(a) Rotation rules:
  rJJ.JJ(x1, x2, x3) → JJ(rDT.x1, rJJ.x2, rVB.x3)
  rVB.VB(x1, x2, x3) → VB(rNNPS.x1, rNN.x3, rVB.x2)
  t."gentle" → "gentle"

(b) Insertion rules:
  iVB.NN(x1, x2) → NN(INS, iNN.x1, iNN.x2)
  iVB.NN(x1, x2) → NN(iNN.x1, iNN.x2)
  iVB.NN(x1, x2) → NN(iNN.x1, iNN.x2, INS)

(c) Translation rules:
  t.VB(x1, x2, x3) → X(t.x1, t.x2, t.x3)
  t."gentleman" → j1
  t."gentleman" → EPS
  t."gentle" → j1
  t.INS → j1
  t.INS → j2

The input to MAKE-EXPLICIT is G2 = ({g0a0b0}, {σ, α}, ∅, g0a0b0, MB, G1). MAKE-EXPLICIT calls PRODUCE(G2, g0a0b0). PRODUCE then seeks to cover b0.σ(x1, x2) →w9 σ(b0.x1, b0.x2) with productions from G1, which is a stand-in for MA(G). At line 14 of REPLACE, G1 is improved so that it has the appropriate productions. The productions of MA(G) that must be built to form the complete MB(MA(G)) are shown in Figure 2d. The complete MB(MA(G)) is shown in Figure 2e.
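The stand-in mechanism described above, in which productions for a nonterminal are materialized only on first request and then cached, can be pictured with a small memoizing wrapper. This is a simplified illustration, not the paper's implementation; the produce_fn callback stands in for a real PRODUCE over a transducer and a grammar:

```python
class StandIn:
    """Lazy stand-in for an application WRTG M(G): productions for a
    nonterminal are generated on first request and memoized, so only
    the reachable part of M(G) is ever built."""

    def __init__(self, produce_fn):
        self._produce = produce_fn   # nonterminal -> list of productions
        self._cache = {}
        self.calls = 0               # instrumentation: expansions performed

    def productions(self, nonterminal):
        if nonterminal not in self._cache:
            self.calls += 1
            self._cache[nonterminal] = self._produce(nonterminal)
        return self._cache[nonterminal]
```

Requesting productions for g0a0 and g1a1 expands only those two nonterminals; an unrequested nonterminal such as g0a2 is never expanded, which is the saving the on-the-fly approach aims for.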
Note that because we used this on-the-fly approach, we were able to avoid building all the productions in MA(G); in particular we did not build g0a2 →w2·w8 ρ, while a bucket brigade approach would have built this production. We have also designed an analogous on-the-fly PRODUCE algorithm for backward application on linear WTT.

We have now defined several on-the-fly and bucket brigade algorithms, and also discussed the possibility of embed-compose-project and offline composition strategies for application of cascades of tree transducers. Tables 2a and 2b summarize the available methods of forward and backward application of cascades for recognizability-preserving tree transducer classes.

5 Decoding Experiments

The main purpose of this paper has been to present novel algorithms for performing application. However, it is important to demonstrate these algorithms on real data. We thus demonstrate bucket-brigade and on-the-fly backward application on a typical NLP task cast as a cascade of wLNT. We adapt the Japanese-to-English translation model of Yamada and Knight (2001) by transforming it from an English-tree-to-Japanese-string model to an English-tree-to-Japanese-tree model. (Note that G2 is the initial stand-in for MB(MA(G)), since G1 is the initial stand-in for MA(G).)

Table 2: Transducer types and available methods of forward and backward application of a cascade. oc = offline composition, bb = bucket brigade, otf = on the fly. (a) Forward application; (b) Backward application. [The per-class √/× entries could not be recovered from the extraction.]

The Japanese trees are unlabeled, meaning they have syntactic structure but all nodes are labeled "X". We then cast this modified model as a cascade of LNT tree transducers. Space does not permit a detailed description, but some example rules are in Figure 3. The rotation transducer R, a sample of which is in Figure 3a, has 6,453 rules, the insertion transducer I, Figure 3b, has 8,122 rules, and the translation transducer T, Figure 3c, has 37,311 rules.

We add an English syntax language model L to the cascade of transducers just described, to better simulate an actual machine translation decoding task. The language model is cast as an identity WTT and thus fits naturally into the experimental framework. In our experiments we try several different language models to demonstrate varying performance of the application algorithms. The most realistic language model is a PCFG: each rule captures the probability of a particular sequence of child labels given a parent label. This model has 7,765 rules. To demonstrate more extreme cases of the usefulness of the on-the-fly approach, we build a language model that recognizes exactly the 2,087 trees in the training corpus, each with equal weight. It has 39,455 rules. Finally, to be ultra-specific, we include a form of the "specific" language model just described, but only allow the English counterpart of the particular Japanese sentence being decoded in the language.

The goal in our experiments is to apply a single tree t backward through the cascade L ◦ R ◦ I ◦ T and find the 1-best path in the application WRTG. We evaluate the speed of each approach: bucket brigade and on-the-fly. The algorithm we use to obtain the 1-best path is a modification of the k-best algorithm of Pauls and Klein (2009). Our algorithm finds the 1-best path in a WRTG and admits an on-the-fly approach. The results of the experiments are shown in Table 3.
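Before turning to the results, it may help to see how a PCFG can be cast in tree-grammar form at all. The encoding below is an illustrative sketch rather than the paper's construction (the experiments cast the language model as an identity WTT, which plays the same role): each rule A → B C ... with probability p becomes a tree production in which A regenerates its own derivation-tree node.

```python
def pcfg_to_wrtg(pcfg_rules):
    """Cast PCFG rules into toy WRTG tree productions: a rule
    (parent, children, prob) becomes parent -prob-> parent(children),
    so the grammar assigns weight exactly to the derivation trees the
    PCFG licenses (hypothetical tuple encoding, for illustration)."""
    return [(parent, prob, (parent, tuple(children)))
            for parent, children, prob in pcfg_rules]
```

For example, the rule S → NP VP with probability 1.0 becomes the tree production S →1.0 S(NP, VP).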
As can be seen, on-the-fly application is generally faster than the bucket brigade, about double the speed per sentence in the experiment that uses an English PCFG language model.

Table 3: Timing results to obtain the 1-best tree from application through a weighted tree transducer cascade, using on-the-fly vs. bucket brigade backward application techniques. pcfg = model recognizes any tree licensed by a PCFG built from observed data, exact = model recognizes each of 2,000+ trees with equal weight, 1-sent = model recognizes exactly one tree. [The per-cell timing figures could not be recovered from the extraction.]

The results for the other two language models demonstrate more keenly the potential advantage that an on-the-fly approach provides: the simultaneous incorporation of information from all models allows application to be done more effectively than if each information source is considered in sequence. In the "exact" case, where a very large language model that simply recognizes each of the 2,087 trees in the training corpus is used, the final application is so large that it overwhelms the resources of a 4 GB MacBook Pro, while the on-the-fly approach does not suffer from this problem. The "1-sent" case is presented to demonstrate the ripple effect caused by using on-the-fly. In the other two cases, a very large language model generally overwhelms the timing statistics, regardless of the method being used. But a language model that represents exactly one sentence is very small, and thus the effects of simultaneous inference are readily apparent: the time to retrieve the 1-best sentence is reduced by two orders of magnitude in this experiment.

6 Conclusion

We have presented algorithms for forward and backward application of weighted tree transducer cascades, including on-the-fly variants, and demonstrated the benefit of an on-the-fly approach to application.
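The 1-best computation can be pictured with a much-simplified Viterbi-style fixpoint over WRTG productions. This toy stand-in for the modified Pauls and Klein (2009) algorithm computes only the best derivation weight (not the tree itself), and assumes probability-like weights in [0, 1] so that the iteration terminates:

```python
def best_derivation_weight(productions, start):
    """Best (max, *) derivation weight for a WRTG nonterminal.
    productions: list of (lhs, weight, (label, child_nonterminals)).
    Iterates to a fixpoint: a production contributes once all of its
    children have known best weights; with weights <= 1, cycles cannot
    improve a weight, so the loop terminates."""
    best = {}
    changed = True
    while changed:
        changed = False
        for lhs, w, (label, kids) in productions:
            if all(k in best for k in kids):
                cand = w
                for k in kids:
                    cand *= best[k]
                if cand > best.get(lhs, 0.0):
                    best[lhs] = cand
                    changed = True
    return best.get(start, 0.0)
```

With productions g1 →0.3 α and g0 →0.5 σ(g1, g1), for instance, the best weight reachable from g0 is 0.5 · 0.3 · 0.3 = 0.045.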
We note that a more formal approach to application of WTTs is being developed, independent from these efforts, by Fülöp et al. (2010).

Acknowledgments

We are grateful for extensive discussions with Andreas Maletti. We also appreciate the insights and advice of David Chiang, Steve DeNeefe, and others at ISI in the preparation of this work. Jonathan May and Kevin Knight were supported by NSF grants IIS-0428020 and IIS-0904684. Heiko Vogler was supported by DFG VO 1011/5-1.

References

Athanasios Alexandrakis and Symeon Bozapalidis. 1987. Weighted grammars and Kleene's theorem. Information Processing Letters, 24(1):1–4.

Brenda S. Baker. 1979. Composition of top-down and bottom-up tree transductions. Information and Control, 41(2):186–213.

Zoltán Ésik and Werner Kuich. 2003. Formal tree series. Journal of Automata, Languages and Combinatorics, 8(2):219–285.

Zoltán Fülöp and Heiko Vogler. 2009. Weighted tree automata and tree transducers. In Manfred Droste, Werner Kuich, and Heiko Vogler, editors, Handbook of Weighted Automata, chapter 9, pages 313–404. Springer-Verlag.

Zoltán Fülöp, Andreas Maletti, and Heiko Vogler. 2010. Backward and forward application of weighted extended tree transducers. Unpublished manuscript.

Ferenc Gécseg and Magnus Steinby. 1984. Tree Automata. Akadémiai Kiadó, Budapest.

Liang Huang and David Chiang. 2005. Better k-best parsing. In Harry Bunt, Robert Malouf, and Alon Lavie, editors, Proceedings of the Ninth International Workshop on Parsing Technologies (IWPT), pages 53–64, Vancouver, October. Association for Computational Linguistics.

Werner Kuich. 1998. Formal power series over trees. In Symeon Bozapalidis, editor, Proceedings of the 3rd International Conference on Developments in Language Theory (DLT), pages 61–101, Thessaloniki, Greece. Aristotle University of Thessaloniki.

Werner Kuich. 1999. Tree transducers and formal tree series. Acta Cybernetica, 14:135–149.
Andreas Maletti, Jonathan Graehl, Mark Hopkins, and Kevin Knight. 2009. The power of extended top-down tree transducers. SIAM Journal on Computing, 39(2):410–430.

Andreas Maletti. 2006. Compositions of tree series transformations. Theoretical Computer Science, 366:248–271.

Andreas Maletti. 2008. Compositions of extended top-down tree transducers. Information and Computation, 206(9–10):1187–1196.

Andreas Maletti. 2009. Personal communication.

Mehryar Mohri, Fernando C. N. Pereira, and Michael Riley. 2000. The design principles of a weighted finite-state transducer library. Theoretical Computer Science, 231:17–32.

Mehryar Mohri. 1997. Finite-state transducers in language and speech processing. Computational Linguistics, 23(2):269–312.

Mehryar Mohri. 2009. Weighted automata algorithms. In Manfred Droste, Werner Kuich, and Heiko Vogler, editors, Handbook of Weighted Automata, chapter 6, pages 213–254. Springer-Verlag.

Adam Pauls and Dan Klein. 2009. K-best A* parsing. In Keh-Yih Su, Jian Su, Janyce Wiebe, and Haizhou Li, editors, Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 958–966, Suntec, Singapore, August. Association for Computational Linguistics.

Fernando Pereira and Michael Riley. 1997. Speech recognition by composition of weighted finite automata. In Emmanuel Roche and Yves Schabes, editors, Finite-State Language Processing, chapter 15, pages 431–453. MIT Press, Cambridge, MA.

William A. Woods. 1980. Cascaded ATN grammars. American Journal of Computational Linguistics, 6(1):1–12.

Kenji Yamada and Kevin Knight. 2001. A syntax-based statistical translation model. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, pages 523–530, Toulouse, France, July. Association for Computational Linguistics.

4 0.565566 29 acl-2010-An Exact A* Method for Deciphering Letter-Substitution Ciphers

Author: Eric Corlett ; Gerald Penn

Abstract: Letter-substitution ciphers encode a document from a known or hypothesized language into an unknown writing system or an unknown encoding of a known writing system. It is a problem that can occur in a number of practical applications, such as in the problem of determining the encodings of electronic documents in which the language is known, but the encoding standard is not. It has also been used in relation to OCR applications. In this paper, we introduce an exact method for deciphering messages using a generalization of the Viterbi algorithm. We test this model on a set of ciphers developed from various web sites, and find that our algorithm has the potential to be a viable, practical method for efficiently solving decipherment problems.

5 0.5507977 100 acl-2010-Enhanced Word Decomposition by Calibrating the Decision Threshold of Probabilistic Models and Using a Model Ensemble

Author: Sebastian Spiegler ; Peter A. Flach

Abstract: This paper demonstrates that the use of ensemble methods and carefully calibrating the decision threshold can significantly improve the performance of machine learning methods for morphological word decomposition. We employ two algorithms which come from a family of generative probabilistic models. The models consider segment boundaries as hidden variables and include probabilities for letter transitions within segments. The advantage of this model family is that it can learn from small datasets and easily generalises to larger datasets. The first algorithm PROMODES, which participated in the Morpho Challenge 2009 (an international competition for unsupervised morphological analysis) employs a lower order model whereas the second algorithm PROMODES-H is a novel development of the first using a higher order model. We present the mathematical description for both algorithms, conduct experiments on the morphologically rich language Zulu and compare characteristics of both algorithms based on the experimental results.

6 0.45882851 40 acl-2010-Automatic Sanskrit Segmentizer Using Finite State Transducers

7 0.45774224 97 acl-2010-Efficient Path Counting Transducers for Minimum Bayes-Risk Decoding of Statistical Machine Translation Lattices

8 0.43784392 234 acl-2010-The Use of Formal Language Models in the Typology of the Morphology of Amerindian Languages

9 0.41478229 68 acl-2010-Conditional Random Fields for Word Hyphenation

10 0.39368221 13 acl-2010-A Rational Model of Eye Movement Control in Reading

11 0.38409144 170 acl-2010-Letter-Phoneme Alignment: An Exploration

12 0.37152183 21 acl-2010-A Tree Transducer Model for Synchronous Tree-Adjoining Grammars

13 0.3374272 74 acl-2010-Correcting Errors in Speech Recognition with Articulatory Dynamics

14 0.3315438 137 acl-2010-How Spoken Language Corpora Can Refine Current Speech Motor Training Methodologies

15 0.3278648 61 acl-2010-Combining Data and Mathematical Models of Language Change

16 0.32497433 254 acl-2010-Using Speech to Reply to SMS Messages While Driving: An In-Car Simulator User Study

17 0.3225379 195 acl-2010-Phylogenetic Grammar Induction

18 0.31678313 41 acl-2010-Automatic Selectional Preference Acquisition for Latin Verbs

19 0.31393799 214 acl-2010-Sparsity in Dependency Grammar Induction

20 0.29776791 186 acl-2010-Optimal Rank Reduction for Linear Context-Free Rewriting Systems with Fan-Out Two


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(1, 0.011), (3, 0.01), (4, 0.017), (7, 0.027), (14, 0.031), (20, 0.014), (25, 0.048), (39, 0.013), (41, 0.019), (42, 0.032), (44, 0.02), (59, 0.097), (69, 0.213), (73, 0.035), (76, 0.012), (78, 0.042), (80, 0.016), (83, 0.082), (84, 0.033), (98, 0.141)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.84870148 191 acl-2010-PCFGs, Topic Models, Adaptor Grammars and Learning Topical Collocations and the Structure of Proper Names

Author: Mark Johnson

Abstract: This paper establishes a connection between two apparently very different kinds of probabilistic models. Latent Dirichlet Allocation (LDA) models are used as "topic models" to produce a low-dimensional representation of documents, while Probabilistic Context-Free Grammars (PCFGs) define distributions over trees. The paper begins by showing that LDA topic models can be viewed as a special kind of PCFG, so Bayesian inference for PCFGs can be used to infer Topic Models as well. Adaptor Grammars (AGs) are a hierarchical, non-parameteric Bayesian extension of PCFGs. Exploiting the close relationship between LDA and PCFGs just described, we propose two novel probabilistic models that combine insights from LDA and AG models. The first replaces the unigram component of LDA topic models with multi-word sequences or collocations generated by an AG. The second extension builds on the first one to learn aspects of the internal structure of proper names.

same-paper 2 0.83132529 116 acl-2010-Finding Cognate Groups Using Phylogenies

Author: David Hall ; Dan Klein

Abstract: A central problem in historical linguistics is the identification of historically related cognate words. We present a generative phylogenetic model for automatically inducing cognate group structure from unaligned word lists. Our model represents the process of transformation and transmission from ancestor word to daughter word, as well as the alignment between the word lists of the observed languages. We also present a novel method for simplifying complex weighted automata created during inference to counteract the otherwise exponential growth of message sizes. On the task of identifying cognates in a dataset of Romance words, our model significantly outperforms a baseline approach, increasing accuracy by as much as 80%. Finally, we demonstrate that our automatically induced groups can be used to successfully reconstruct ancestral words.

3 0.78460735 244 acl-2010-TrustRank: Inducing Trust in Automatic Translations via Ranking

Author: Radu Soricut ; Abdessamad Echihabi

Abstract: The adoption of Machine Translation technology for commercial applications is hampered by the lack of trust associated with machine-translated output. In this paper, we describe TrustRank, an MT system enhanced with a capability to rank the quality of translation outputs from good to bad. This enables the user to set a quality threshold, granting the user control over the quality of the translations. We quantify the gains we obtain in translation quality, and show that our solution works on a wide variety of domains and language pairs.

4 0.73573011 20 acl-2010-A Transition-Based Parser for 2-Planar Dependency Structures

Author: Carlos Gomez-Rodriguez ; Joakim Nivre

Abstract: Finding a class of structures that is rich enough for adequate linguistic representation yet restricted enough for efficient computational processing is an important problem for dependency parsing. In this paper, we present a transition system for 2-planar dependency trees (trees that can be decomposed into at most two planar graphs) and show that it can be used to implement a classifier-based parser that runs in linear time and outperforms a state-of-the-art transition-based parser on four data sets from the CoNLL-X shared task. In addition, we present an efficient method for determining whether an arbitrary tree is 2-planar and show that 99% or more of the trees in existing treebanks are 2-planar.

5 0.68019569 184 acl-2010-Open-Domain Semantic Role Labeling by Modeling Word Spans

Author: Fei Huang ; Alexander Yates

Abstract: Most supervised language processing systems show a significant drop-off in performance when they are tested on text that comes from a domain significantly different from the domain of the training data. Semantic role labeling techniques are typically trained on newswire text, and in tests their performance on fiction is as much as 19% worse than their performance on newswire text. We investigate techniques for building open-domain semantic role labeling systems that approach the ideal of a train-once, use-anywhere system. We leverage recently-developed techniques for learning representations of text using latent-variable language models, and extend these techniques to ones that provide the kinds of features that are useful for semantic role labeling. In experiments, our novel system reduces error by 16% relative to the previous state of the art on out-of-domain text.

6 0.67682147 55 acl-2010-Bootstrapping Semantic Analyzers from Non-Contradictory Texts

7 0.67614615 146 acl-2010-Improving Chinese Semantic Role Labeling with Rich Syntactic Features

8 0.67308629 211 acl-2010-Simple, Accurate Parsing with an All-Fragments Grammar

9 0.67281568 218 acl-2010-Structural Semantic Relatedness: A Knowledge-Based Method to Named Entity Disambiguation

10 0.67254055 93 acl-2010-Dynamic Programming for Linear-Time Incremental Parsing

11 0.67240858 133 acl-2010-Hierarchical Search for Word Alignment

12 0.67226303 207 acl-2010-Semantics-Driven Shallow Parsing for Chinese Semantic Role Labeling

13 0.67219865 153 acl-2010-Joint Syntactic and Semantic Parsing of Chinese

14 0.6717248 172 acl-2010-Minimized Models and Grammar-Informed Initialization for Supertagging with Highly Ambiguous Lexicons

15 0.67154169 214 acl-2010-Sparsity in Dependency Grammar Induction

16 0.66956067 105 acl-2010-Evaluating Multilanguage-Comparability of Subjectivity Analysis Systems

17 0.66940641 16 acl-2010-A Statistical Model for Lost Language Decipherment

18 0.66934431 170 acl-2010-Letter-Phoneme Alignment: An Exploration

19 0.66898543 202 acl-2010-Reading between the Lines: Learning to Map High-Level Instructions to Commands

20 0.66848636 144 acl-2010-Improved Unsupervised POS Induction through Prototype Discovery