acl acl2011 acl2011-268 knowledge-graph by maker-knowledge-mining

268 acl-2011-Rule Markov Models for Fast Tree-to-String Translation


Source: pdf

Author: Ashish Vaswani ; Haitao Mi ; Liang Huang ; David Chiang

Abstract: Most statistical machine translation systems rely on composed rules (rules that can be formed out of smaller rules in the grammar). Though this practice improves translation by weakening independence assumptions in the translation model, it nevertheless results in huge, redundant grammars, making both training and decoding inefficient. Here, we take the opposite approach, where we only use minimal rules (those that cannot be formed out of other rules), and instead rely on a rule Markov model of the derivation history to capture dependencies between minimal rules. Large-scale experiments on a state-of-the-art tree-to-string translation system show that our approach leads to a slimmer model, a faster decoder, yet the same translation quality (measured using BLEU) as composed rules.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Most statistical machine translation systems rely on composed rules (rules that can be formed out of smaller rules in the grammar). [sent-5, score-0.813]

2 Though this practice improves translation by weakening independence assumptions in the translation model, it nevertheless results in huge, redundant grammars, making both training and decoding inefficient. [sent-6, score-0.496]

3 Here, we take the opposite approach, where we only use minimal rules (those that cannot be formed out of other rules), and instead rely on a rule Markov model of the derivation history to capture dependencies between minimal rules. [sent-7, score-1.174]

4 Large-scale experiments on a state-of-the-art tree-to-string translation system show that our approach leads to a slimmer model, a faster decoder, yet the same translation quality (measured using BLEU) as composed rules. [sent-8, score-0.491]

5 Introduction: Statistical machine translation systems typically model the translation process as a sequence of translation steps, each of which uses a translation rule, for example, a phrase pair in phrase-based translation or a tree-to-string rule in tree-to-string translation. [sent-9, score-1.109]

6 These rules are usually applied independently of each other, which violates the conventional wisdom that translation should be done in context. [sent-10, score-0.372]

7 Although this approach does improve translation quality dramatically by weakening the independence assumptions in the translation model, it suffers from two main problems. [sent-12, score-0.408]

8 To avoid this, ad-hoc limits are placed during composition, like upper bounds on the number of nodes in the composed rule, or the height of the rule. [sent-14, score-0.295]

9 Nevertheless, the advantages outweigh the disadvantages, and to our knowledge, all top-performing systems, both phrase-based and syntax-based, use composed rules. [sent-17, score-0.233]

10 Galley et al. (2004) initially built a syntax-based system using only minimal rules, and subsequently reported (Galley et al. [sent-19, score-0.147]

11 6 points, while increasing grammar size 60-fold and decoding time 15-fold. [sent-21, score-0.155]

12 The alternative we propose is to replace composed rules with a rule Markov model that generates rules conditioned on their context. [sent-22, score-1.1]

13 In this work, we restrict a rule’s context to the vertical chain of ancestors of the rule. [sent-23, score-0.32]

14 This ancestral context would play the same role as the context formerly provided by rule composition. [sent-24, score-0.539]

15 However, their study leaves unanswered whether a rule Markov model can take the place of composed rules. [sent-26, score-0.697]

16 In this work, we investigate the use of rule Markov models in the context of tree-to-string translation. [sent-27, score-0.486]

17 First, we carry out a detailed comparison of rule Markov models with composed rules. [sent-33, score-0.687]

18 Our experiments show that, using trigram rule Markov models, we achieve an improvement of 2. [sent-34, score-0.515]

19 When we compare against vertically composed rules, we find that our rule Markov model has the same accuracy, but our model is much smaller and decoding with our model is 30% faster. [sent-36, score-1.043]

20 When we compare against full composed rules, we find that our rule Markov model still often reaches the same level of accuracy, again with savings in space and time. [sent-37, score-0.721]

21 Second, we investigate methods for pruning rule Markov models, finding that even very simple pruning criteria actually improve the accuracy of the model, while of course decreasing its size. [sent-38, score-0.514]

22 Third, we present a very fast decoder for tree-to-string grammars with rule Markov models. [sent-39, score-0.564]

23 Huang and Mi (2010) have recently introduced an efficient incremental decoding algorithm for tree-to-string translation, which operates top-down and maintains a derivation history of translation rules encountered. [sent-40, score-0.743]

24 This history is exactly the vertical chain of ancestors corresponding to the contexts in our rule Markov model, which makes it an ideal decoder for our model. [sent-41, score-0.906]

25 We start by describing our rule Markov model (Section 2) and then how to decode using the rule Markov model (Section 3). [sent-42, score-0.928]

26 Rule Markov models: Our model conditions the generation of a rule on the vertical chain of its ancestors, which allows it to capture interactions between rules. [sent-43, score-0.703]

27 Consider the example Chinese-English tree-to-string grammar in Figure 1 and the example derivation in Figure 2. [sent-44, score-0.173]

28 Each row is a derivation step; the tree on the left is the derivation tree (in which each node is a rule and its children are the rules that substitute into it) and the tree pair on the right is the source and target derived tree. [sent-45, score-1.15]

29 For any derivation node r, let anc1(r) be the parent of r (or ǫ if it has no parent), anc2(r) be the grandparent of node r (or ǫ if it has no grandparent), and so on. [sent-46, score-0.395]

30 Let anc_1^n(r) be the chain of ancestors anc1(r) · · · ancn(r). [sent-47, score-0.151]

31 With probability P(r1 | ǫ), we generate the rule at the root node, r1. [sent-49, score-0.453]

32 We then generate rule r2 with probability P(r2 | r1), and so on, always taking the leftmost open substitution site on the English derived tree, and generating a rule ri conditioned on its chain of ancestors with probability P(ri | anc_1^n(ri)). [sent-50, score-0.713]
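
To make this generative story concrete, here is a minimal Python sketch (not the authors' code) that scores a derivation tree under a rule Markov model. The DerivationNode class, the rmm_prob lookup table, and the order parameter are assumptions introduced only for illustration.

```python
# Illustrative sketch only: score a derivation tree under a rule Markov model
# that conditions each rule on its vertical chain of ancestors.
import math
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

@dataclass
class DerivationNode:
    rule_id: str                        # identifier of the minimal rule at this node
    parent: Optional["DerivationNode"] = None
    children: List["DerivationNode"] = field(default_factory=list)

def ancestor_chain(node: DerivationNode, n: int) -> Tuple[str, ...]:
    """anc_1^n(r): rule ids of the parent, grandparent, ..., up to n ancestors."""
    chain = []
    cur = node.parent
    while cur is not None and len(chain) < n:
        chain.append(cur.rule_id)
        cur = cur.parent
    return tuple(chain)                 # shorter than n near the root (epsilon context)

def derivation_log_prob(root: DerivationNode,
                        rmm_prob: Dict[Tuple[Tuple[str, ...], str], float],
                        order: int) -> float:
    """Sum of log P(r | anc_1^{order-1}(r)) over all rules in the derivation."""
    logp = 0.0
    stack = [root]
    while stack:
        node = stack.pop()
        context = ancestor_chain(node, order - 1)
        logp += math.log(rmm_prob[(context, node.rule_id)])
        stack.extend(node.children)
    return logp
```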

33 We run the rule extraction algorithm of Galley et al. (2004) on word-aligned parallel text to obtain a single derivation of minimal rules for each sentence pair. [sent-53, score-0.463]

34 The rule Markov model can then be trained on the path set of these derivation trees. [sent-55, score-0.593]

35 We experiment with bigram and trigram rule Markov models. [sent-58, score-0.566]
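
The training step described above might look roughly like the following sketch (again not the authors' code; it reuses the DerivationNode structure from the previous sketch, and an unsmoothed relative-frequency estimate stands in for whatever smoothing the authors actually use).

```python
# Illustrative sketch only: collect vertical rule n-grams (the "path set") from
# minimal-rule derivation trees and estimate the model by relative frequency.
# Smoothing (suggested by the D1, D2 discount parameters in the tables) is omitted.
from collections import defaultdict

def count_vertical_ngrams(root, order, context_counts, event_counts):
    """For every node, record (ancestor context, rule), nearest ancestor first."""
    stack = [(root, ())]
    while stack:
        node, ancestors = stack.pop()
        context = ancestors[:order - 1]
        context_counts[context] += 1
        event_counts[(context, node.rule_id)] += 1
        for child in node.children:
            stack.append((child, (node.rule_id,) + ancestors))

def train_rule_markov_model(derivations, order=3):
    context_counts = defaultdict(int)
    event_counts = defaultdict(int)
    for root in derivations:
        count_vertical_ngrams(root, order, context_counts, event_counts)
    # Unsmoothed relative-frequency estimate of P(rule | ancestor context).
    return {key: c / context_counts[key[0]] for key, c in event_counts.items()}
```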

36 Figure 2: Example tree-to-string derivation of "Bush held talks with Sharon". [sent-74, score-0.225]

37 Each row shows how a symbol is rewritten using one of the rules in Figure 1. [sent-75, score-0.187]

38 Pruning: In addition to full n-gram Markov models, we experiment with three approaches to building smaller models, to investigate whether pruning helps. [sent-83, score-0.133]

39 Our results will show that smaller models indeed give a higher BLEU score than the full bigram and trigram models. [sent-84, score-0.228]

40 A context is added if the KL divergence between its predictive distribution and that of its parent is above a certain threshold and the probability of observing the context is above another threshold. [sent-92, score-0.133]
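
The context-selection criterion described above could be sketched as follows; this is an illustrative approximation in the style of probabilistic suffix trees, with the threshold names kl_min and p_min invented for the example.

```python
# Illustrative sketch only: keep a longer context only if its predictive
# distribution differs enough from its parent's and the context is frequent
# enough. The threshold names kl_min and p_min are assumptions.
import math

def kl_divergence(p, q):
    """KL(p || q) over rule ids; assumes q(y) > 0 wherever p(y) > 0."""
    return sum(py * math.log(py / q[y]) for y, py in p.items() if py > 0)

def keep_context(context, predictive, context_prob, kl_min, p_min):
    """context: tuple of ancestor rule ids, nearest ancestor first.
    predictive[c] is the distribution P(rule | c); context_prob[c] is P(c)."""
    if not context:
        return True                     # always keep the empty (unigram) context
    parent = context[:-1]               # drop the most distant ancestor
    divergence = kl_divergence(predictive[context], predictive[parent])
    return divergence > kl_min and context_prob[context] > p_min
```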

41 Tree-to-string decoding with rule Markov models: In this paper, we use our rule Markov model framework in the context of tree-to-string translation. [sent-93, score-1.061]

42 The input to the translation system is a source parse tree and the output is the target string. [sent-97, score-0.196]

43 Huang and Mi (2010) have recently introduced an efficient incremental decoding algorithm for tree-to-string translation. [sent-98, score-0.179]

44 The decoder operates top-down and maintains a derivation history of translation rules encountered. [sent-99, score-0.657]

45 The history is exactly the vertical chain of ancestors corresponding to the contexts in our rule Markov model. [sent-100, score-0.813]

46 Figure 3: Example input parse tree with tree addresses (leaves: yǔ Shālóng jǔxíng le huìtán). [sent-113, score-0.134]

47 makes incremental decoding a natural fit with our generative story. [sent-114, score-0.179]

48 In this section, we describe how to integrate our rule Markov model into this incremental decoding algorithm. [sent-115, score-0.643]

49 Note that it is also possible to integrate our rule Markov model with other decoding algorithms, for example, the more common non-incremental top-down/bottom-up approach (Huang et al. [sent-116, score-0.575]

50 , 2006), but it would involve a non-trivial change to the decoding algorithms to keep track of the vertical derivation history, which would result in significant overhead. [sent-117, score-0.4]

51 Algorithm Given the input parse tree in Figure 3, Figure 4 illustrates the search process of the incremental decoder with the grammar of Figure 1. [sent-118, score-0.272]

52 We write X@η for a tree node with label X at tree address η (Shieber et al. [sent-119, score-0.21]

53 The root node has address ǫ, and the ith child of node η has address η.i. [sent-121, score-0.152]

54 At each step, the decoder maintains a stack of active rules, which are rules that have not been completed yet, and the rightmost (n − 1) English words translated thus far (the hypothesis), where n is the order of the word language model (in Figure 4, n = 2). [sent-123, score-0.436]
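
The decoder state just described might be represented roughly as follows (a sketch with assumed class and field names, not the actual implementation); note that the vertical ancestor chain needed by the rule Markov model can be read directly off the stack of active rules.

```python
# Illustrative sketch only: the decoder state described above. The top of the
# stack is the rule into which the next predicted rule substitutes, i.e. its parent.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class ActiveRule:
    rule_id: str
    english_side: Tuple[str, ...]       # English words and tree-node symbols such as "NP@1"
    dot: int = 0                        # how much of english_side has been processed

    def next_symbol(self):
        return None if self.dot >= len(self.english_side) else self.english_side[self.dot]

@dataclass
class DecoderState:
    stack: List[ActiveRule] = field(default_factory=list)
    hypothesis: Tuple[str, ...] = ()    # rightmost n-1 translated English words
    score: float = 0.0
```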

55 The last column in the figure shows the rule Markov model probabilities with the conditioning context. [sent-125, score-0.501]

56 In this example, we use a trigram rule Markov model. [sent-126, score-0.515]

57 After initialization, the process starts at step 1, where we predict rule r1 (the shaded rule) with probability P(r1 | ǫ) and push its English side onto the stack, with variables replaced by the corresponding tree nodes: x1 becomes NP@1 and x2 becomes VP@2. [sent-127, score-0.585]

58 Figure 4: Simulation of incremental decoding with rule Markov model. [sent-280, score-0.607]

59 We then predict lexical rule r2 with probability P(r2 | r1) and push rule r2 onto the stack.

60 In step 3, we perform a scan operation, in which we append the English word just after the dot to the current hypothesis and move the dot after the word. [sent-293, score-0.16]

61 Since the dot is at the end of the top rule in the stack, we perform a complete operation in step 4 where we pop the finished rule at the top of the stack. [sent-294, score-0.938]

62 In the scan and complete steps, we don’t need to compute rule probabilities. [sent-295, score-0.458]
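
A rough sketch of the predict, scan, and complete operations, building on the ActiveRule/DecoderState sketch above; substitute_variables is an assumed helper that returns a rule's English side with variables replaced by tree nodes (e.g. x1 becomes "NP@1"), and only the rule Markov model score is tracked here (word-LM and other feature scores are omitted).

```python
# Illustrative sketch only: the three decoder operations with a rule Markov model.
import math

def ancestor_context(state, order):
    """Vertical ancestor chain of the next predicted rule: rules on the stack, top first."""
    return tuple(r.rule_id for r in reversed(state.stack))[:order - 1]

def predict(state, rule, rmm_prob, order):
    """Push a new rule into the leftmost open substitution site, scoring it against its ancestors."""
    context = ancestor_context(state, order)
    state.score += math.log(rmm_prob[(context, rule.rule_id)])
    state.stack.append(ActiveRule(rule.rule_id, substitute_variables(rule)))

def scan(state, n):
    """Append the English word after the dot to the hypothesis; no rule probability needed."""
    top = state.stack[-1]
    word = top.english_side[top.dot]
    top.dot += 1
    state.hypothesis = (state.hypothesis + (word,))[-(n - 1):]

def complete(state):
    """Pop the finished rule from the top of the stack; again no rule probability needed."""
    state.stack.pop()
```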

63 However, our rule Markov model has the correct preference because of the conditioning ancestral sequence (r3, r4), shown in Figure 5. [sent-300, score-0.548]

64 However, if one were to use rule Markov models with a conventional CKY-style bottom-up decoder (Liu et al. [sent-305, score-0.573]

65 , 2006), the complexity would increase to O(nC^(m−1) |V|^(4(g−1))), where C is the maximum number of outgoing hyperedges for each node in the translation forest, and m is the order of the rule Markov model. [sent-306, score-0.667]

66 We trained our rule Markov model on derivations of minimal rules as described above. [sent-315, score-0.798]

67 At decoding time, we again parse the input sentences using the Berkeley parser, and convert them into translation forests using rule pattern-matching (Mi et al. [sent-320, score-0.668]

68 We used grammars of minimal rules and composed rules of maximum height 3 as our baselines. [sent-326, score-0.859]

69 Using the best bigram rule Markov models and the minimal rule grammar gives us an improvement of 1. [sent-328, score-1.152]

70 Using the best trigram rule Markov model brings our gain up to 2. [sent-330, score-0.551]

71 Our trigram rule Markov model strongly outperforms minimal rules, and performs at the same level as composed and vertically composed rules, but is smaller and faster. [sent-333, score-1.327]

72 We find that by just using bigram context, we are able to get at least 1 BLEU point higher than the minimal rule grammar. [sent-337, score-0.626]

73 It is interesting to see that using just bigram rule interactions can give us a reasonable boost. [sent-338, score-0.479]

74 We get our highest gains from using trigram context where our best performing rule Markov model gives us 2. [sent-339, score-0.611]

75 We also compared rule Markov models against composed rules. [sent-342, score-0.687]

76 Since our models are currently limited to conditioning on vertical context, the closest comparison is against vertically composed rules. [sent-343, score-0.556]

77 Comparing against full composed rules, we find that our system matches the score of the baseline composed rule grammar of maximum height 3, while using many fewer parameters. [sent-345, score-1.024]

78 (It should be noted that a parameter in the rule Markov model is just a floating-point number, whereas a parameter in the composed-rule system is an entire rule; therefore the difference in memory usage would be even greater.) [sent-346, score-0.464]

79 These experiments clearly show that rule Markov models with minimal rules increase translation quality significantly and with lower memory requirements than composed rules. [sent-349, score-1.15]

80 One might wonder if the best performance can be obtained by combining composed rules with a rule Markov model. [sent-350, score-0.848]

81 Table 2: For rule bigrams, RM-B with D1 = 0. [sent-354, score-0.428]

82 Table 3: For rule bigrams, RM-A with D1,D2 = 0. [sent-360, score-0.428]

83 This is straightforward to implement: the rule Markov model is still defined over derivations of minimal rules, but in the decoder's prediction step, the rule Markov model's value on a composed rule is calculated by decomposing it into minimal rules and computing the product of their probabilities. [sent-362, score-2.034]
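
A sketch of this decomposition (not the authors' code): minimal_derivation is an assumed helper returning a composed rule's internal derivation tree of minimal rules, with the same rule_id/children node structure used in the earlier sketches.

```python
# Illustrative sketch only: score a composed rule by decomposing it into its
# minimal rules, each conditioned on its own vertical chain of ancestors.
import math

def composed_rule_log_prob(composed_rule, outer_context, rmm_prob, order):
    """outer_context: ancestor chain of the composed rule's root, nearest first."""
    logp = 0.0
    root = minimal_derivation(composed_rule)   # assumed helper
    stack = [(root, tuple(outer_context))]
    while stack:
        node, ancestors = stack.pop()
        context = ancestors[:order - 1]
        logp += math.log(rmm_prob[(context, node.rule_id)])
        for child in node.children:
            stack.append((child, (node.rule_id,) + ancestors))
    return logp
```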

84 We find that using our best trigram rule Markov model with composed rules gives us a 0.5 BLEU gain [sent-363, score-0.999]

85 on top of the composed rule grammar, statistically significant with p < 0. [sent-364, score-0.661]

86 Analysis: Tables 2 and 3 show how the various types of rule Markov models compare, for bigrams and trigrams. (For this experiment, a beam size of 100 was used.) [sent-368, score-0.529]

87 Table 6: Adding rule Markov models to composed-rule grammars improves their translation performance. [sent-377, score-0.626]

88 Table 5: Comparison of vertically composed rules using various settings (maximum rule height 7). [sent-391, score-1.033]

89 It is interesting that the full bigram and trigram rule Markov models do not give our highest BLEU scores; pruning the models not only saves space but improves their performance. [sent-393, score-0.685]

90 Table 5 shows the performance of vertically composed rules at various settings. [sent-396, score-0.543]

91 Table 6 shows the performance of fully composed rules and fully composed rules with a rule Markov Model at various settings. [sent-398, score-1.268]

92 9 million rules), the drop in BLEU score resulting from adding the rule Markov model is not statistically significant. [sent-400, score-0.464]

93 Prior efforts both use a rule bigram model in machine translation; that is, the probability of the current rule depends only on the immediately previous rule in the vertical context, whereas our rule Markov model can condition on longer and sparser derivation histories. [sent-402, score-2.126]

94 Outside of machine translation, the idea of weakening independence assumptions by modeling the derivation history is also found in parsing (Johnson, 1998), where rule probabilities are conditioned on parent and grand-parent nonterminals. [sent-405, score-0.82]

95 First, our work conditions rule probabilities on parent and grandparent rules, not just nonterminals. [sent-407, score-0.564]

96 Conclusion: In this paper, we have investigated whether we can eliminate composed rules without any loss in translation quality. [sent-410, score-0.549]

97 We have developed a rule Markov model that captures vertical bigrams and trigrams of minimal rules, and tested it in the framework of tree-to-string translation. [sent-411, score-0.821]

98 First, our rule Markov models dramatically improve a grammar of minimal rules, giving an improvement of 2. [sent-413, score-0.668]

99 Second, when we compare against vertically composed rules, we are able to get about the same BLEU score, but our model is much smaller and decoding with our model is faster. [sent-415, score-0.766]

100 Finally, when we compare against full composed rules, we find that we can reach the same level of performance under some conditions, but in order to do so consistently, we believe we need to extend our model to condition on horizontal context in addition to vertical context. [sent-416, score-0.462]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('rule', 0.428), ('vp', 0.333), ('markov', 0.285), ('ip', 0.283), ('composed', 0.233), ('np', 0.217), ('rules', 0.187), ('bush', 0.161), ('talks', 0.152), ('minimal', 0.147), ('vertical', 0.137), ('derivation', 0.129), ('translation', 0.129), ('vertically', 0.123), ('decoding', 0.111), ('ancestors', 0.097), ('decoder', 0.093), ('pp', 0.093), ('stack', 0.087), ('trigram', 0.087), ('huang', 0.076), ('node', 0.076), ('held', 0.073), ('grandparent', 0.07), ('weakening', 0.07), ('dn', 0.068), ('incremental', 0.068), ('tree', 0.067), ('history', 0.063), ('height', 0.062), ('mi', 0.062), ('sharon', 0.059), ('galley', 0.058), ('chain', 0.054), ('bigram', 0.051), ('menezes', 0.051), ('dop', 0.051), ('dot', 0.048), ('ancestral', 0.047), ('bejerano', 0.047), ('pabs', 0.047), ('rurl', 0.047), ('ding', 0.047), ('parent', 0.044), ('grammar', 0.044), ('grammars', 0.043), ('bigrams', 0.043), ('pruning', 0.043), ('quirk', 0.042), ('discount', 0.041), ('pst', 0.041), ('smaller', 0.04), ('haitao', 0.04), ('knight', 0.038), ('contracts', 0.038), ('liu', 0.037), ('formed', 0.037), ('conditioning', 0.037), ('model', 0.036), ('step', 0.034), ('treelet', 0.034), ('hyperedges', 0.034), ('fossum', 0.034), ('contexts', 0.034), ('maintains', 0.033), ('kevin', 0.032), ('context', 0.032), ('beam', 0.032), ('sh', 0.031), ('isi', 0.031), ('push', 0.031), ('independence', 0.031), ('trigrams', 0.03), ('steve', 0.03), ('scan', 0.03), ('vv', 0.03), ('wisdom', 0.03), ('rc', 0.03), ('nt', 0.029), ('leftmost', 0.029), ('liang', 0.029), ('conditioned', 0.029), ('gives', 0.028), ('southern', 0.027), ('arrows', 0.027), ('ri', 0.026), ('deneefe', 0.026), ('conventional', 0.026), ('assumptions', 0.026), ('models', 0.026), ('probability', 0.025), ('shieber', 0.024), ('full', 0.024), ('dramatically', 0.023), ('sciences', 0.023), ('keep', 0.023), ('forest', 0.023), ('operates', 0.023), ('conditions', 0.022), ('rt', 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000005 268 acl-2011-Rule Markov Models for Fast Tree-to-String Translation

Author: Ashish Vaswani ; Haitao Mi ; Liang Huang ; David Chiang

Abstract: Most statistical machine translation systems rely on composed rules (rules that can be formed out of smaller rules in the grammar). Though this practice improves translation by weakening independence assumptions in the translation model, it nevertheless results in huge, redundant grammars, making both training and decoding inefficient. Here, we take the opposite approach, where we only use minimal rules (those that cannot be formed out of other rules), and instead rely on a rule Markov model of the derivation history to capture dependencies between minimal rules. Large-scale experiments on a state-of-the-art tree-to-string translation system show that our approach leads to a slimmer model, a faster decoder, yet the same translation quality (measured using B ) as composed rules.

2 0.30874655 30 acl-2011-Adjoining Tree-to-String Translation

Author: Yang Liu ; Qun Liu ; Yajuan Lu

Abstract: We introduce synchronous tree adjoining grammars (TAG) into tree-to-string translation, which converts a source tree to a target string. Without reconstructing TAG derivations explicitly, our rule extraction algorithm directly learns tree-to-string rules from aligned Treebank-style trees. As tree-to-string translation casts decoding as a tree parsing problem rather than parsing, the decoder still runs fast when adjoining is included. Less than 2 times slower, the adjoining tree-tostring system improves translation quality by +0.7 BLEU over the baseline system only allowing for tree substitution on NIST ChineseEnglish test sets.

3 0.28750649 110 acl-2011-Effective Use of Function Words for Rule Generalization in Forest-Based Translation

Author: Xianchao Wu ; Takuya Matsuzaki ; Jun'ichi Tsujii

Abstract: In the present paper, we propose the effective usage of function words to generate generalized translation rules for forest-based translation. Given aligned forest-string pairs, we extract composed tree-to-string translation rules that account for multiple interpretations of both aligned and unaligned target function words. In order to constrain the exhaustive attachments of function words, we limit to bind them to the nearby syntactic chunks yielded by a target dependency parser. Therefore, the proposed approach can not only capture source-tree-to-target-chunk correspondences but can also use forest structures that compactly encode an exponential number of parse trees to properly generate target function words during decoding. Extensive experiments involving large-scale English-toJapanese translation revealed a significant im- provement of 1.8 points in BLEU score, as compared with a strong forest-to-string baseline system.

4 0.21649493 61 acl-2011-Binarized Forest to String Translation

Author: Hao Zhang ; Licheng Fang ; Peng Xu ; Xiaoyun Wu

Abstract: Tree-to-string translation is syntax-aware and efficient but sensitive to parsing errors. Forestto-string translation approaches mitigate the risk of propagating parser errors into translation errors by considering a forest of alternative trees, as generated by a source language parser. We propose an alternative approach to generating forests that is based on combining sub-trees within the first best parse through binarization. Provably, our binarization forest can cover any non-consitituent phrases in a sentence but maintains the desirable property that for each span there is at most one nonterminal so that the grammar constant for decoding is relatively small. For the purpose of reducing search errors, we apply the synchronous binarization technique to forest-tostring decoding. Combining the two techniques, we show that using a fast shift-reduce parser we can achieve significant quality gains in NIST 2008 English-to-Chinese track (1.3 BLEU points over a phrase-based system, 0.8 BLEU points over a hierarchical phrase-based system). Consistent and significant gains are also shown in WMT 2010 in the English to German, French, Spanish and Czech tracks.

5 0.19052181 202 acl-2011-Learning Hierarchical Translation Structure with Linguistic Annotations

Author: Markos Mylonakis ; Khalil Sima'an

Abstract: While it is generally accepted that many translation phenomena are correlated with linguistic structures, employing linguistic syntax for translation has proven a highly non-trivial task. The key assumption behind many approaches is that translation is guided by the source and/or target language parse, employing rules extracted from the parse tree or performing tree transformations. These approaches enforce strict constraints and might overlook important translation phenomena that cross linguistic constituents. We propose a novel flexible modelling approach to introduce linguistic information of varying granularity from the source side. Our method induces joint probability synchronous grammars and estimates their parameters, by select- ing and weighing together linguistically motivated rules according to an objective function directly targeting generalisation over future data. We obtain statistically significant improvements across 4 different language pairs with English as source, mounting up to +1.92 BLEU for Chinese as target.

6 0.18920942 171 acl-2011-Incremental Syntactic Language Models for Phrase-based Translation

7 0.1767787 206 acl-2011-Learning to Transform and Select Elementary Trees for Improved Syntax-based Machine Translations

8 0.16811095 180 acl-2011-Issues Concerning Decoding with Synchronous Context-free Grammar

9 0.1666497 29 acl-2011-A Word-Class Approach to Labeling PSCFG Rules for Machine Translation

10 0.15757641 290 acl-2011-Syntax-based Statistical Machine Translation using Tree Automata and Tree Transducers

11 0.15471932 44 acl-2011-An exponential translation model for target language morphology

12 0.14272368 217 acl-2011-Machine Translation System Combination by Confusion Forest

13 0.14103349 250 acl-2011-Prefix Probability for Probabilistic Synchronous Context-Free Grammars

14 0.13579561 155 acl-2011-Hypothesis Mixture Decoding for Statistical Machine Translation

15 0.12327066 154 acl-2011-How to train your multi bottom-up tree transducer

16 0.12194211 166 acl-2011-Improving Decoding Generalization for Tree-to-String Translation

17 0.11549632 123 acl-2011-Exact Decoding of Syntactic Translation Models through Lagrangian Relaxation

18 0.11289877 282 acl-2011-Shift-Reduce CCG Parsing

19 0.10948297 11 acl-2011-A Fast and Accurate Method for Approximate String Search

20 0.10708784 28 acl-2011-A Statistical Tree Annotator and Its Applications


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.244), (1, -0.239), (2, 0.099), (3, -0.104), (4, 0.021), (5, 0.02), (6, -0.267), (7, -0.061), (8, -0.09), (9, -0.071), (10, -0.091), (11, -0.029), (12, -0.021), (13, 0.074), (14, 0.07), (15, -0.053), (16, -0.0), (17, 0.048), (18, -0.022), (19, 0.025), (20, 0.024), (21, -0.04), (22, -0.001), (23, 0.083), (24, 0.039), (25, -0.077), (26, -0.031), (27, 0.01), (28, -0.023), (29, 0.02), (30, -0.004), (31, -0.092), (32, -0.032), (33, -0.009), (34, -0.078), (35, 0.044), (36, -0.01), (37, -0.11), (38, -0.032), (39, -0.014), (40, -0.053), (41, 0.019), (42, -0.011), (43, 0.004), (44, 0.003), (45, 0.011), (46, -0.02), (47, 0.016), (48, 0.072), (49, 0.037)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97573662 268 acl-2011-Rule Markov Models for Fast Tree-to-String Translation

Author: Ashish Vaswani ; Haitao Mi ; Liang Huang ; David Chiang

Abstract: Most statistical machine translation systems rely on composed rules (rules that can be formed out of smaller rules in the grammar). Though this practice improves translation by weakening independence assumptions in the translation model, it nevertheless results in huge, redundant grammars, making both training and decoding inefficient. Here, we take the opposite approach, where we only use minimal rules (those that cannot be formed out of other rules), and instead rely on a rule Markov model of the derivation history to capture dependencies between minimal rules. Large-scale experiments on a state-of-the-art tree-to-string translation system show that our approach leads to a slimmer model, a faster decoder, yet the same translation quality (measured using B ) as composed rules.

2 0.94981807 30 acl-2011-Adjoining Tree-to-String Translation

Author: Yang Liu ; Qun Liu ; Yajuan Lu

Abstract: We introduce synchronous tree adjoining grammars (TAG) into tree-to-string translation, which converts a source tree to a target string. Without reconstructing TAG derivations explicitly, our rule extraction algorithm directly learns tree-to-string rules from aligned Treebank-style trees. As tree-to-string translation casts decoding as a tree parsing problem rather than parsing, the decoder still runs fast when adjoining is included. Less than 2 times slower, the adjoining tree-tostring system improves translation quality by +0.7 BLEU over the baseline system only allowing for tree substitution on NIST ChineseEnglish test sets.

3 0.88979888 154 acl-2011-How to train your multi bottom-up tree transducer

Author: Andreas Maletti

Abstract: The local multi bottom-up tree transducer is introduced and related to the (non-contiguous) synchronous tree sequence substitution grammar. It is then shown how to obtain a weighted local multi bottom-up tree transducer from a bilingual and biparsed corpus. Finally, the problem of non-preservation of regularity is addressed. Three properties that ensure preservation are introduced, and it is discussed how to adjust the rule extraction process such that they are automatically fulfilled.

4 0.87018543 110 acl-2011-Effective Use of Function Words for Rule Generalization in Forest-Based Translation

Author: Xianchao Wu ; Takuya Matsuzaki ; Jun'ichi Tsujii

Abstract: In the present paper, we propose the effective usage of function words to generate generalized translation rules for forest-based translation. Given aligned forest-string pairs, we extract composed tree-to-string translation rules that account for multiple interpretations of both aligned and unaligned target function words. In order to constrain the exhaustive attachments of function words, we limit to bind them to the nearby syntactic chunks yielded by a target dependency parser. Therefore, the proposed approach can not only capture source-tree-to-target-chunk correspondences but can also use forest structures that compactly encode an exponential number of parse trees to properly generate target function words during decoding. Extensive experiments involving large-scale English-toJapanese translation revealed a significant im- provement of 1.8 points in BLEU score, as compared with a strong forest-to-string baseline system.

5 0.79899299 61 acl-2011-Binarized Forest to String Translation

Author: Hao Zhang ; Licheng Fang ; Peng Xu ; Xiaoyun Wu

Abstract: Tree-to-string translation is syntax-aware and efficient but sensitive to parsing errors. Forestto-string translation approaches mitigate the risk of propagating parser errors into translation errors by considering a forest of alternative trees, as generated by a source language parser. We propose an alternative approach to generating forests that is based on combining sub-trees within the first best parse through binarization. Provably, our binarization forest can cover any non-consitituent phrases in a sentence but maintains the desirable property that for each span there is at most one nonterminal so that the grammar constant for decoding is relatively small. For the purpose of reducing search errors, we apply the synchronous binarization technique to forest-tostring decoding. Combining the two techniques, we show that using a fast shift-reduce parser we can achieve significant quality gains in NIST 2008 English-to-Chinese track (1.3 BLEU points over a phrase-based system, 0.8 BLEU points over a hierarchical phrase-based system). Consistent and significant gains are also shown in WMT 2010 in the English to German, French, Spanish and Czech tracks.

6 0.78703755 173 acl-2011-Insertion Operator for Bayesian Tree Substitution Grammars

7 0.77453965 206 acl-2011-Learning to Transform and Select Elementary Trees for Improved Syntax-based Machine Translations

8 0.75390476 217 acl-2011-Machine Translation System Combination by Confusion Forest

9 0.74975234 290 acl-2011-Syntax-based Statistical Machine Translation using Tree Automata and Tree Transducers

10 0.7264533 250 acl-2011-Prefix Probability for Probabilistic Synchronous Context-Free Grammars

11 0.69076145 166 acl-2011-Improving Decoding Generalization for Tree-to-String Translation

12 0.67202592 180 acl-2011-Issues Concerning Decoding with Synchronous Context-free Grammar

13 0.6608429 330 acl-2011-Using Derivation Trees for Treebank Error Detection

14 0.62656677 202 acl-2011-Learning Hierarchical Translation Structure with Linguistic Annotations

15 0.61486143 171 acl-2011-Incremental Syntactic Language Models for Phrase-based Translation

16 0.59840673 28 acl-2011-A Statistical Tree Annotator and Its Applications

17 0.57977122 188 acl-2011-Judging Grammaticality with Tree Substitution Grammar Derivations

18 0.57029277 29 acl-2011-A Word-Class Approach to Labeling PSCFG Rules for Machine Translation

19 0.56550395 123 acl-2011-Exact Decoding of Syntactic Translation Models through Lagrangian Relaxation

20 0.54220802 296 acl-2011-Terminal-Aware Synchronous Binarization


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.02), (17, 0.453), (26, 0.011), (28, 0.012), (37, 0.065), (39, 0.037), (41, 0.053), (55, 0.023), (59, 0.033), (72, 0.026), (91, 0.031), (96, 0.153)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.97086227 107 acl-2011-Dynamic Programming Algorithms for Transition-Based Dependency Parsers

Author: Marco Kuhlmann ; Carlos Gomez-Rodriguez ; Giorgio Satta

Abstract: We develop a general dynamic programming technique for the tabulation of transition-based dependency parsers, and apply it to obtain novel, polynomial-time algorithms for parsing with the arc-standard and arc-eager models. We also show how to reverse our technique to obtain new transition-based dependency parsers from existing tabular methods. Additionally, we provide a detailed discussion of the conditions under which the feature models commonly used in transition-based parsing can be integrated into our algorithms.

2 0.9373318 109 acl-2011-Effective Measures of Domain Similarity for Parsing

Author: Barbara Plank ; Gertjan van Noord

Abstract: It is well known that parsing accuracy suffers when a model is applied to out-of-domain data. It is also known that the most beneficial data to parse a given domain is data that matches the domain (Sekine, 1997; Gildea, 2001). Hence, an important task is to select appropriate domains. However, most previous work on domain adaptation relied on the implicit assumption that domains are somehow given. As more and more data becomes available, automatic ways to select data that is beneficial for a new (unknown) target domain are becoming attractive. This paper evaluates various ways to automatically acquire related training data for a given test set. The results show that an unsupervised technique based on topic models is effective – it outperforms random data selection on both languages examined, English and Dutch. Moreover, the technique works better than manually assigned labels gathered from meta-data that is available for English.

3 0.93462324 19 acl-2011-A Mobile Touchable Application for Online Topic Graph Extraction and Exploration of Web Content

Author: Gunter Neumann ; Sven Schmeier

Abstract: We present a mobile touchable application for online topic graph extraction and exploration of web content. The system has been implemented for operation on an iPad. The topic graph is constructed from N web snippets which are determined by a standard search engine. We consider the extraction of a topic graph as a specific empirical collocation extraction task where collocations are extracted between chunks. Our measure of association strength is based on the pointwise mutual information between chunk pairs which explicitly takes their distance into account. An initial user evaluation shows that this system is especially helpful for finding new interesting information on topics about which the user has only a vague idea or even no idea at all.

4 0.90859842 118 acl-2011-Entrainment in Speech Preceding Backchannels.

Author: Rivka Levitan ; Agustin Gravano ; Julia Hirschberg

Abstract: In conversation, when speech is followed by a backchannel, evidence of continued engagement by one’s dialogue partner, that speech displays a combination of cues that appear to signal to one’s interlocutor that a backchannel is appropriate. We term these cues backchannel-preceding cues (BPC)s, and examine the Columbia Games Corpus for evidence of entrainment on such cues. Entrainment, the phenomenon of dialogue partners becoming more similar to each other, is widely believed to be crucial to conversation quality and success. Our results show that speaking partners entrain on BPCs; that is, they tend to use similar sets of BPCs; this similarity increases over the course of a dialogue; and this similarity is associated with measures of dialogue coordination and task success. 1

same-paper 5 0.9064393 268 acl-2011-Rule Markov Models for Fast Tree-to-String Translation

Author: Ashish Vaswani ; Haitao Mi ; Liang Huang ; David Chiang

Abstract: Most statistical machine translation systems rely on composed rules (rules that can be formed out of smaller rules in the grammar). Though this practice improves translation by weakening independence assumptions in the translation model, it nevertheless results in huge, redundant grammars, making both training and decoding inefficient. Here, we take the opposite approach, where we only use minimal rules (those that cannot be formed out of other rules), and instead rely on a rule Markov model of the derivation history to capture dependencies between minimal rules. Large-scale experiments on a state-of-the-art tree-to-string translation system show that our approach leads to a slimmer model, a faster decoder, yet the same translation quality (measured using B ) as composed rules.

6 0.87244105 43 acl-2011-An Unsupervised Model for Joint Phrase Alignment and Extraction

7 0.71548617 180 acl-2011-Issues Concerning Decoding with Synchronous Context-free Grammar

8 0.70684487 30 acl-2011-Adjoining Tree-to-String Translation

9 0.6738103 87 acl-2011-Corpus Expansion for Statistical Machine Translation with Semantic Role Label Substitution Rules

10 0.6446541 141 acl-2011-Gappy Phrasal Alignment By Agreement

11 0.64062846 32 acl-2011-Algorithm Selection and Model Adaptation for ESL Correction Tasks

12 0.63194442 154 acl-2011-How to train your multi bottom-up tree transducer

13 0.62788343 296 acl-2011-Terminal-Aware Synchronous Binarization

14 0.62745655 61 acl-2011-Binarized Forest to String Translation

15 0.60855103 110 acl-2011-Effective Use of Function Words for Rule Generalization in Forest-Based Translation

16 0.59890151 327 acl-2011-Using Bilingual Parallel Corpora for Cross-Lingual Textual Entailment

17 0.59734291 16 acl-2011-A Joint Sequence Translation Model with Integrated Reordering

18 0.59649545 277 acl-2011-Semi-supervised Relation Extraction with Large-scale Word Clustering

19 0.59598994 126 acl-2011-Exploiting Syntactico-Semantic Structures for Relation Extraction

20 0.59457791 176 acl-2011-Integrating surprisal and uncertain-input models in online sentence comprehension: formal techniques and empirical results