emnlp emnlp2011 emnlp2011-118 knowledge-graph by maker-knowledge-mining

118 emnlp-2011-SMT Helps Bitext Dependency Parsing


Source: pdf

Author: Wenliang Chen ; Jun'ichi Kazama ; Min Zhang ; Yoshimasa Tsuruoka ; Yujie Zhang ; Yiou Wang ; Kentaro Torisawa ; Haizhou Li

Abstract: We propose a method to improve the accuracy of parsing bilingual texts (bitexts) with the help of statistical machine translation (SMT) systems. Previous bitext parsing methods use human-annotated bilingual treebanks that are hard to obtain. Instead, our approach uses an auto-generated bilingual treebank to produce bilingual constraints. However, because the auto-generated bilingual treebank contains errors, the bilingual constraints are noisy. To overcome this problem, we use large-scale unannotated data to verify the constraints and design a set of effective bilingual features for parsing models based on the verified results. The experimental results show that our new parsers significantly outperform state-of-theart baselines. Moreover, our approach is still able to provide improvement when we use a larger monolingual treebank that results in a much stronger baseline. Especially notable is that our approach can be used in a purely monolingual setting with the help of SMT.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract We propose a method to improve the accuracy of parsing bilingual texts (bitexts) with the help of statistical machine translation (SMT) systems. [sent-14, score-0.642]

2 Previous bitext parsing methods use human-annotated bilingual treebanks that are hard to obtain. [sent-15, score-0.807]

3 Instead, our approach uses an auto-generated bilingual treebank to produce bilingual constraints. [sent-16, score-1.113]

4 However, because the auto-generated bilingual treebank contains errors, the bilingual constraints are noisy. [sent-17, score-1.172]

5 To overcome this problem, we use large-scale unannotated data to verify the constraints and design a set of effective bilingual features for parsing models based on the verified results. [sent-18, score-1.069]

6 1 Introduction Recently there have been several studies aiming to improve the performance of parsing bilingual texts (bitexts) (Smith and Smith, 2004; Burkett and Klein, 2008; Huang et al. [sent-22, score-0.581]

7 In bitext parsing, we can use the information based on “bilingual constraints” (Burkett and Klein, 2008), which do not exist in monolingual sentences. [sent-26, score-0.279]

8 Most previous studies rely on bilingual treebanks to provide bilingual constraints for bitext parsing. [sent-28, score-1.329]

9 Their method uses bilingual treebanks that have human-annotated tree structures on both sides. [sent-30, score-0.625]

10 It uses another type of bilingual treebanks that have tree structures on the source sentences and their human-translated sentences. [sent-33, score-0.699]

11 (2010) also used bilingual treebanks and made use of tree structures on the target side. [sent-35, score-0.69]

12 However, the bilingual treebanks are hard to obtain, partly because of the high cost of human translation. [sent-36, score-0.625]

13 On the other hand, many large-scale monolingual treebanks exist, such as the Penn English Treebank (PTB) (Marcus et al. [sent-38, score-0.259]

14 In this paper, we propose a bitext parsing approach in which we produce the bilingual constraints on existing monolingual treebanks with the help of SMT systems. [sent-40, score-1.052]

15 In our approach, we first use an SMT system to translate the sentences of a source monolingual treebank into the target language. [sent-42, score-0.396]

16 Then, the target sentences are parsed by a parser trained on a target monolingual treebank. [sent-43, score-0.362]

17 We then obtain a bilingual treebank that has human-annotated trees on the source side and auto-generated trees on the target side. [sent-44, score-0.768]
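The treebank auto-generation pipeline described in sentences 15–17 can be sketched in Python; `smt_translate` and `parse_target` are hypothetical stand-ins for a trained Moses system and the target-side parser Parsert, and the data representation is our assumption:

```python
# Sketch of building the auto-generated bilingual treebank (Section 4.1).
# smt_translate and parse_target are placeholders for the real SMT system
# and the target-side parser Parsert; both outputs may contain errors,
# which is why the resulting constraints later need verification.

def build_bilingual_treebank(source_treebank, smt_translate, parse_target):
    """source_treebank: list of (source_sentence, human_annotated_tree).
    Returns (source_sentence, source_tree, target_sentence, auto_tree) tuples."""
    bitreebank = []
    for src_sent, src_tree in source_treebank:
        tgt_sent = smt_translate(src_sent)   # SMT output, noisy
        tgt_tree = parse_target(tgt_sent)    # auto-parsed, also noisy
        bitreebank.append((src_sent, src_tree, tgt_sent, tgt_tree))
    return bitreebank
```

With dummy stand-ins, the function simply pairs each human-annotated source tree with its machine-translated, auto-parsed counterpart.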

18 Although the sentences and trees on the target side are not perfect, we expect that we can improve bitext parsing performance by using this newly auto-generated bilingual treebank. [sent-47, score-0.831]

19 Then we can produce a set of bilingual constraints between the two sides. [sent-49, score-0.581]

20 To overcome this problem, we verify the constraints by using large-scale unannotated monolingual sentences and bilingual sentence pairs. [sent-51, score-0.915]

21 Finally, we design a set of bilingual features based on the verified results for parsing models. [sent-52, score-0.856]

22 Our approach uses existing resources including monolingual treebanks to train monolingual parsers on both sides, bilingual unannotated data to train SMT systems and to extract bilingual subtrees, and target monolingual unannotated data to extract monolingual subtrees. [sent-53, score-2.018]

23 In summary, we make the following contributions: • We propose an approach that uses an auto-generated bilingual treebank rather than the human-annotated bilingual treebanks used in previous studies (Burkett and Klein, 2008; Huang et al. [sent-54, score-1.147]

24 The auto-generated bilingual treebank is built with the help of SMT systems. [sent-57, score-0.621]

25 • We verify the unreliable constraints by using the existing large-scale unannotated data and design a set of effective bilingual features over the verified results. [sent-58, score-0.933]

26 Section 4 describes a set of bilingual features based on the bilingual constraints and Section 5 describes how to use large-scale unannotated data to verify the bilingual constraints and define another set of bilingual features based on the verified results. [sent-70, score-2.669]

27 2 Motivation Here, bitext parsing is the task of parsing source sentences with the help of their corresponding translations. [sent-73, score-0.345]

28 (Figure 1: Input and output of our approach) In bitext parsing, some ambiguities exist on the source side, but they may be unambiguous on the target side. [sent-122, score-0.238]

29 ta xiwang quanti yundongyuan chongfeng fahui pingshi peiyu qilai de liliang he jiqiao [sent-153, score-1.548]

30 ta xiwang quanti yundongyuan chongfeng fahui pingshi peiyu qilai de liliang he jiqiao [sent-178, score-1.548]

31 The figure shows that the translation can provide useful bilingual constraints. [sent-203, score-0.553]

32 However, there are few human-annotated bilingual treebanks and the existing bilingual treebanks are usually small. [sent-207, score-1.279]

33 So we want to use existing resources to generate a bilingual treebank with the help of SMT systems. [sent-211, score-0.621]

34 We hope to improve source side parsing by using this newly built bilingual treebank. [sent-212, score-0.726]

35 ta xiwang quanti yundongyuan chongfeng fahui pingshi peiyu qilai de liliang he jiqiao / He hoped that all the athletes would fully demonstrate the strength and skill that they cultivate daily (Figure 3: Example of human translation) [sent-247, score-1.79]

36 ta xiwang quanti yundongyuan chongfeng fahui pingshi peiyu qilai de liliang he jiqiao [sent-281, score-1.548]

37 (Figure 4: Example of Moses translation) Figure 4 shows an example of a translation using a Moses-based system, where the target sentence is parsed by a monolingual target parser. [sent-299, score-0.433]

38 From this example, although the sentences and parse trees on the target side are not perfect, we can still exploit useful information to improve bitext parsing. [sent-303, score-0.274]

39 In this paper, we focus on how to design a method to verify such unreliable bilingual constraints. [sent-304, score-0.599]

40 One is monolingual features based on the source sentences. [sent-311, score-0.24]

41 The other one is bilingual features (described in Sections 4 and 5) that consider the bilingual constraints. [sent-313, score-1.078]

42 We call the parser with the monolingual features on the source side Parsers, and the parser with the monolingual features on the target side Parsert. [sent-314, score-0.669]

43 4 Original bilingual features In this paper, we generate two types of bilingual features, original and verified bilingual features. [sent-315, score-1.841]

44 The original bilingual features (described in this section) are based on the bilingual constraints without being verified by large-scale unannotated data. [sent-316, score-1.455]

45 The verified bilingual features (described in Section 5) are based on the bilingual constraints verified using large-scale unannotated data. [sent-317, score-1.696]

46 1 Auto-generated bilingual treebank Assume that we have monolingual treebanks on the source side, an SMT system that can translate the source sentences into the target language, and a Parsert trained on the target monolingual treebank. [sent-319, score-1.292]

47 We first translate the sentences of the source monolingual treebank into the target language using the SMT system. [sent-320, score-0.396]

48 We define a binary function for this bilingual constraint: Fbn(rsn : rtk), where n and k refer to the types of the dependencies (2 for bigram and 3 for trigram). [sent-335, score-0.64]

49 For example, in rs2 : rt3, rs2 is a bigram dependency on the source side and rt3 is a trigram dependency on the target side. [sent-336, score-0.449]
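A minimal sketch of this check for the 2-to-2 case (a source bigram dependency supported by a target bigram dependency), assuming one-to-one word-alignment links for simplicity; the paper's alignments may be many-to-many, so treat this as illustrative:

```python
# Sketch of the bigram constraint check Fb2 in the 2-to-2 case (Section 4):
# a source dependency (head, dependent) is supported if the words it aligns
# to also form a dependency on the target side.

def fb2_2to2(src_dep, align, tgt_deps):
    """src_dep: (head_idx, dep_idx) on the source side.
    align: dict mapping a source word index to its aligned target index.
    tgt_deps: set of (head_idx, dep_idx) pairs on the target side."""
    h, d = src_dep
    if h not in align or d not in align:
        return False          # unaligned words give no constraint
    return (align[h], align[d]) in tgt_deps
```

The 2-to-3 and 3-to-3 cases would extend the same idea to trigram dependencies on either side.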

50 1 Bigram constraint function: Fb2 For rs2, we consider two types of bilingual constraints. [sent-339, score-0.568]

51 ta xiwang quanti yundongyuan chongfeng fahui pingshi peiyu qilai de [sent-399, score-0.951]

52 Figure 5: Example of bilingual constraints (2to2) [sent-420, score-0.639]

53 ta xiwang quanti yundongyuan chongfeng fahui pingshi peiyu qilai de liliang he jiqiao [sent-468, score-1.548]

54 Figure 6: Example of bilingual constraints (2to3) [sent-486, score-0.639]

55 ta xiwang quanti yundongyuan chongfeng fahui pingshi peiyu qilai de [sent-543, score-0.951]

56 Figure 7: Example of bilingual constraints (3to3) [sent-564, score-0.639]

57 4 Original bilingual features We define original bilingual features based on the bilingual constraint functions and the bilingual reordering function. [sent-573, score-2.202]

58 (Table 1: Original bilingual features) We use an example to show how to generate the original bilingual features in practice. [sent-576, score-0.556]

59 In Figure 4, we want to define the bilingual features for the bigram dependency (rs2) between “发挥(fahui)” and “技巧(jiqiao)”. [sent-577, score-0.691]

60 5 Verified bilingual features However, because the bilingual treebank is generated automatically, using the bilingual constraints alone is not reliable. [sent-581, score-1.728]

61 More specifically, rtk of the constraint is verified by checking a list of target monolingual subtrees and rsn : rtk is verified by checking a list of bilingual subtrees. [sent-583, score-1.952]

62 The subtrees are extracted from the large-scale unannotated data. [sent-584, score-0.238]

63 The basic idea is as follows: if the dependency structures of a bilingual constraint can be found in the list of the target monolingual subtrees (footnote 1: for the second-order features, Dir is the combination of the directions of two dependencies) [sent-585, score-1.008]

64 or bilingual subtrees, this constraint will probably be reliable. [sent-586, score-0.568]

65 We first parse the large-scale unannotated monolingual and bilingual data. [sent-587, score-0.755]

66 Subsequently, we extract the monolingual and bilingual subtrees from the parsed data. [sent-588, score-0.866]

67 We then verify the bilingual constraints using the extracted subtrees. [sent-589, score-0.658]

68 Finally, we generate the bilingual features based on the verified results for the parsing models. [sent-590, score-0.856]

69 (2009) proposed a simple method to extract subtrees from large-scale monolingual data and used them as features to improve monolingual parsing. [sent-595, score-0.507]

70 Following their method, we parse large unannotated data with the Parsert and obtain the subtree list (STt) on the target side. [sent-596, score-0.229]

71 We extract two types of subtrees: bigram (two-word) subtrees and trigram (three-word) subtrees. [sent-597, score-0.243]

72 (Figure 8: Example of monolingual subtree extraction) From the dependency tree in Figure 8-(a), we obtain the subtrees shown in Figure 8-(b), where the first three are bigram subtrees and the last one is a trigram subtree. [sent-606, score-0.779]

73 After extraction, we obtain the subtree list STt that includes two sets, one for bigram subtrees, and the other one for trigram subtrees. [sent-607, score-0.243]
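The extraction step can be sketched as follows. The exact trigram-subtree inventory of Chen et al. (2009) is only approximated here (sibling triples and grandparent chains), so treat the definitions as illustrative:

```python
# Sketch of monolingual subtree extraction (Section 5.1). A dependency tree
# is given as a list of head indices over the words, with -1 marking the root.
# Bigram subtrees are (head word, dependent word) pairs; trigram subtrees
# here are head-with-two-children triples and grandparent-head-child chains.

def extract_subtrees(words, heads):
    """Return (bigram_subtrees, trigram_subtrees) as sets of word tuples."""
    n = len(words)
    deps = [(heads[i], i) for i in range(n) if heads[i] >= 0]
    bigrams = {(words[h], words[d]) for h, d in deps}

    children = {}
    for h, d in deps:
        children.setdefault(h, []).append(d)

    trigrams = set()
    for h, ds in children.items():
        # head with two of its children (sibling structure)
        for i in range(len(ds)):
            for j in range(i + 1, len(ds)):
                trigrams.add((words[h], words[ds[i]], words[ds[j]]))
        # grandparent -> head -> child chain
        if heads[h] >= 0:
            g = heads[h]
            for d in ds:
                trigrams.add((words[g], words[h], words[d]))
    return bigrams, trigrams
```

Running this over the large parsed corpus and pooling the results would yield the two sets of the subtree list STt.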

74 2 Verified target constraint function: Fvt(rtk) We use the extracted target subtrees to verify the rtk of the bilingual constraints. [sent-617, score-1.167]

75 If the rtk is included in STt, function Fvt(rtk) = Type(rtk), otherwise Fvt(rtk) = ZERO. [sent-619, score-0.231]

76 3 Bilingual subtrees We extract bilingual subtrees from a bilingual corpus, which is parsed by the Parsers and Parsert on both sides. [sent-624, score-1.393]

77 We extract three types of bilingual subtrees: bigram-bigram (stbi22), bigram-trigram (stbi23), and trigram-trigram (stbi33) subtrees. [sent-625, score-0.522]

78 For example, stbi22 consists of a bigram subtree on the source side and a bigram subtree on the target side. [sent-626, score-0.505]

79 (Figure 9: Example of bilingual subtree extraction) From the dependency tree in Figure 9-(a), we obtain the bilingual subtrees shown in Figure 9-(b). [sent-643, score-1.35]

80 Figure 9-(b) shows the extracted bigram-bigram bilingual subtrees. [sent-644, score-0.522]

81 4 Verified bilingual constraint function: Fvb(rbink) We use the extracted bilingual subtrees to verify the rsn : rtk (rbink in short) of the bilingual constraints. [sent-649, score-2.139]

82 rsn and rtk form a candidate bilingual subtree stbink. [sent-650, score-0.898]

83 2 Verified bilingual features Then, we define another set of bilingual features by combining the verified constraint functions. [sent-653, score-1.399]

84 We call these bilingual features ‘verified bilingual features’ . [sent-654, score-1.078]

85 Table 2 lists the verified bilingual features used in our experiments, where each line defines a feature template that is a combination of functions. [sent-655, score-0.823]

86 We use an example to show how to generate the verified bilingual features in practice. [sent-656, score-0.797]

87 In Figure 4, we want to define the verified features for the bigram dependency (rs2) between “发挥(fahui)” and “技 巧(jiqiao)”. [sent-657, score-0.41]

88 Suppose we can find rt3 in STt with label MF and cannot find the candidate bilingual subtree in STbi. [sent-660, score-0.609]

89 Note that we did not use human translation on the English side of this bilingual treebank to train our new parsers. [sent-668, score-0.684]

90 To extract bilingual subtrees, we used the FBIS corpus and an additional bilingual corpus containing 800,000 sentence pairs from the training data of the NIST MT08 evaluation campaign. [sent-712, score-1.044]

91 We used PAG to refer to our parsers trained on the auto-generated bilingual treebank. [sent-727, score-0.55]

92 29 points for the second-order model by adding the verified bilingual features. [sent-739, score-0.763]

93 If we used the original bilingual features (PAGo), the system dropped 0. [sent-743, score-0.556]

94 This indicated that the verified bilingual constraints did provide useful information for the parsing models. [sent-746, score-0.911]

95 These facts indicated that our approach obtained the benefits from the verified constraints, while using the bilingual constraints alone was enough for ORACLE. [sent-751, score-0.852]

96 Types HA and AG denote training on human-annotated and auto-generated bilingual treebanks respectively. [sent-797, score-0.625]

97 (2009), Chen2010BI refers to the result of using bilingual features in Chen et al. [sent-800, score-0.597]

98 (2010), our approach used an auto-generated bilingual treebank while theirs used a human-annotated bilingual treebank. [sent-809, score-1.113]

99 Although we trained our parser on an auto-generated bilingual treebank, we achieved an accuracy comparable to the systems trained on human-annotated bilingual treebanks on the standard test data. [sent-814, score-1.201]

100 Moreover, our approach continued to provide improvement over the baseline systems when we used a much larger monolingual treebank (over 50,000 sentences) where target human translations are not available and very hard to construct. [sent-815, score-0.29]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('bilingual', 0.522), ('jiqiao', 0.26), ('fahui', 0.245), ('verified', 0.241), ('rtk', 0.231), ('subtrees', 0.161), ('peiyu', 0.159), ('monolingual', 0.156), ('quanti', 0.144), ('yundongyuan', 0.144), ('fvt', 0.13), ('liliang', 0.13), ('rbink', 0.13), ('bitext', 0.123), ('chongfeng', 0.115), ('fvb', 0.115), ('pingshi', 0.115), ('treebanks', 0.103), ('parsert', 0.101), ('qilai', 0.101), ('xiwang', 0.101), ('subtree', 0.087), ('pag', 0.087), ('stt', 0.087), ('trigram', 0.079), ('smt', 0.079), ('verify', 0.077), ('unannotated', 0.077), ('bigram', 0.077), ('chen', 0.075), ('athletes', 0.073), ('burkett', 0.07), ('treebank', 0.069), ('fbis', 0.068), ('target', 0.065), ('ctb', 0.063), ('side', 0.062), ('constraints', 0.059), ('parsing', 0.059), ('dependency', 0.058), ('diri', 0.058), ('rsn', 0.058), ('skil', 0.058), ('skill', 0.058), ('source', 0.05), ('huang', 0.05), ('bllip', 0.05), ('constraint', 0.046), ('wenliang', 0.045), ('cultivate', 0.043), ('righti', 0.043), ('bitexts', 0.042), ('ful', 0.042), ('refers', 0.041), ('vv', 0.039), ('skills', 0.037), ('alignment', 0.037), ('strength', 0.037), ('kazama', 0.035), ('klein', 0.035), ('features', 0.034), ('secondorder', 0.034), ('ta', 0.034), ('hope', 0.033), ('translate', 0.032), ('chinese', 0.032), ('country', 0.031), ('translation', 0.031), ('indicated', 0.03), ('mcdonald', 0.03), ('kentaro', 0.03), ('help', 0.03), ('uas', 0.029), ('latest', 0.029), ('gtran', 0.029), ('humanannotated', 0.029), ('pago', 0.029), ('qilaide', 0.029), ('taxiwang', 0.029), ('yujie', 0.029), ('play', 0.028), ('carreras', 0.028), ('fro', 0.028), ('links', 0.028), ('parsers', 0.028), ('parsed', 0.027), ('template', 0.026), ('ichi', 0.026), ('aligner', 0.026), ('singapore', 0.026), ('ptb', 0.025), ('parser', 0.025), ('yoshimasa', 0.025), ('kruengkrai', 0.025), ('yiou', 0.025), ('twhee', 0.025), ('span', 0.025), ('denero', 0.024), ('checks', 0.024), ('sentences', 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999982 118 emnlp-2011-SMT Helps Bitext Dependency Parsing

Author: Wenliang Chen ; Jun'ichi Kazama ; Min Zhang ; Yoshimasa Tsuruoka ; Yujie Zhang ; Yiou Wang ; Kentaro Torisawa ; Haizhou Li

Abstract: We propose a method to improve the accuracy of parsing bilingual texts (bitexts) with the help of statistical machine translation (SMT) systems. Previous bitext parsing methods use human-annotated bilingual treebanks that are hard to obtain. Instead, our approach uses an auto-generated bilingual treebank to produce bilingual constraints. However, because the auto-generated bilingual treebank contains errors, the bilingual constraints are noisy. To overcome this problem, we use large-scale unannotated data to verify the constraints and design a set of effective bilingual features for parsing models based on the verified results. The experimental results show that our new parsers significantly outperform state-of-theart baselines. Moreover, our approach is still able to provide improvement when we use a larger monolingual treebank that results in a much stronger baseline. Especially notable is that our approach can be used in a purely monolingual setting with the help of SMT.

2 0.1882188 25 emnlp-2011-Cache-based Document-level Statistical Machine Translation

Author: Zhengxian Gong ; Min Zhang ; Guodong Zhou

Abstract: Statistical machine translation systems are usually trained on a large amount of bilingual sentence pairs and translate one sentence at a time, ignoring document-level information. In this paper, we propose a cache-based approach to document-level translation. Since caches mainly depend on relevant data to supervise subsequent decisions, it is critical to fill the caches with highly-relevant data of a reasonable size. In this paper, we present three kinds of caches to store relevant document-level information: 1) a dynamic cache, which stores bilingual phrase pairs from the best translation hypotheses of previous sentences in the test document; 2) a static cache, which stores relevant bilingual phrase pairs extracted from similar bilingual document pairs (i.e. source documents similar to the test document and their corresponding target documents) in the training parallel corpus; 3) a topic cache, which stores the target-side topic words related with the test document in the source-side. In particular, three new features are designed to explore various kinds of document-level information in the above three kinds of caches. Evaluation shows the effectiveness of our cache-based approach to document-level translation with a performance improvement of 0.81 in BLEU score over Moses. Moreover, detailed analysis and discussion are presented to give new insights into document-level translation.

3 0.12130979 44 emnlp-2011-Domain Adaptation via Pseudo In-Domain Data Selection

Author: Amittai Axelrod ; Xiaodong He ; Jianfeng Gao

Abstract: We explore efficient domain adaptation for the task of statistical machine translation based on extracting sentences from a large general-domain parallel corpus that are most relevant to the target domain. These sentences may be selected with simple cross-entropy based methods, of which we present three. As these sentences are not themselves identical to the in-domain data, we call them pseudo in-domain subcorpora. These subcorpora, 1% the size of the original, can then be used to train small domain-adapted Statistical Machine Translation (SMT) systems which outperform systems trained on the entire corpus. Performance is further improved when we use these domain-adapted models in combination with a true in-domain model. The results show that more training data is not always better, and that best results are attained via proper domain-relevant data selection, as well as combining in- and general-domain systems during decoding.

4 0.11763655 115 emnlp-2011-Relaxed Cross-lingual Projection of Constituent Syntax

Author: Wenbin Jiang ; Qun Liu ; Yajuan Lv

Abstract: We propose a relaxed correspondence assumption for cross-lingual projection of constituent syntax, which allows a supposed constituent of the target sentence to correspond to an unrestricted treelet in the source parse. Such a relaxed assumption fundamentally tolerates the syntactic non-isomorphism between languages, and enables us to learn the target-language-specific syntactic idiosyncrasy rather than a strained grammar directly projected from the source language syntax. Based on this assumption, a novel constituency projection method is also proposed in order to induce a projected constituent treebank from the source-parsed bilingual corpus. Experiments show that, the parser trained on the projected treebank dramatically outperforms previous projected and unsupervised parsers.

5 0.098109812 108 emnlp-2011-Quasi-Synchronous Phrase Dependency Grammars for Machine Translation

Author: Kevin Gimpel ; Noah A. Smith

Abstract: We present a quasi-synchronous dependency grammar (Smith and Eisner, 2006) for machine translation in which the leaves of the tree are phrases rather than words as in previous work (Gimpel and Smith, 2009). This formulation allows us to combine structural components of phrase-based and syntax-based MT in a single model. We describe a method of extracting phrase dependencies from parallel text using a target-side dependency parser. For decoding, we describe a coarse-to-fine approach based on lattice dependency parsing of phrase lattices. We demonstrate performance improvements for Chinese-English and UrduEnglish translation over a phrase-based baseline. We also investigate the use of unsupervised dependency parsers, reporting encouraging preliminary results.

6 0.096924864 73 emnlp-2011-Improving Bilingual Projections via Sparse Covariance Matrices

7 0.096312664 95 emnlp-2011-Multi-Source Transfer of Delexicalized Dependency Parsers

8 0.094618775 18 emnlp-2011-Analyzing Methods for Improving Precision of Pivot Based Bilingual Dictionaries

9 0.089800544 75 emnlp-2011-Joint Models for Chinese POS Tagging and Dependency Parsing

10 0.088766165 83 emnlp-2011-Learning Sentential Paraphrases from Bilingual Parallel Corpora for Text-to-Text Generation

11 0.08802513 136 emnlp-2011-Training a Parser for Machine Translation Reordering

12 0.083310291 93 emnlp-2011-Minimum Imputed-Risk: Unsupervised Discriminative Training for Machine Translation

13 0.079108156 15 emnlp-2011-A novel dependency-to-string model for statistical machine translation

14 0.07009238 4 emnlp-2011-A Fast, Accurate, Non-Projective, Semantically-Enriched Parser

15 0.069272399 3 emnlp-2011-A Correction Model for Word Alignments

16 0.068327792 50 emnlp-2011-Evaluating Dependency Parsing: Robust and Heuristics-Free Cross-Annotation Evaluation

17 0.064805657 137 emnlp-2011-Training dependency parsers by jointly optimizing multiple objectives

18 0.062385779 20 emnlp-2011-Augmenting String-to-Tree Translation Models with Fuzzy Use of Source-side Syntax

19 0.061035004 58 emnlp-2011-Fast Generation of Translation Forest for Large-Scale SMT Discriminative Training

20 0.060360979 123 emnlp-2011-Soft Dependency Constraints for Reordering in Hierarchical Phrase-Based Translation


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.197), (1, 0.144), (2, 0.048), (3, -0.025), (4, -0.035), (5, 0.035), (6, -0.01), (7, 0.044), (8, -0.132), (9, 0.003), (10, -0.004), (11, 0.039), (12, 0.139), (13, 0.142), (14, 0.112), (15, 0.062), (16, 0.047), (17, -0.178), (18, -0.139), (19, -0.096), (20, -0.065), (21, -0.215), (22, -0.177), (23, 0.046), (24, 0.046), (25, -0.242), (26, 0.076), (27, 0.005), (28, 0.025), (29, 0.229), (30, 0.002), (31, 0.014), (32, -0.108), (33, -0.014), (34, 0.165), (35, 0.009), (36, -0.013), (37, -0.059), (38, -0.075), (39, 0.007), (40, 0.064), (41, -0.023), (42, -0.017), (43, -0.004), (44, -0.037), (45, 0.067), (46, 0.017), (47, -0.027), (48, 0.011), (49, 0.004)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95196038 118 emnlp-2011-SMT Helps Bitext Dependency Parsing

Author: Wenliang Chen ; Jun'ichi Kazama ; Min Zhang ; Yoshimasa Tsuruoka ; Yujie Zhang ; Yiou Wang ; Kentaro Torisawa ; Haizhou Li

Abstract: We propose a method to improve the accuracy of parsing bilingual texts (bitexts) with the help of statistical machine translation (SMT) systems. Previous bitext parsing methods use human-annotated bilingual treebanks that are hard to obtain. Instead, our approach uses an auto-generated bilingual treebank to produce bilingual constraints. However, because the auto-generated bilingual treebank contains errors, the bilingual constraints are noisy. To overcome this problem, we use large-scale unannotated data to verify the constraints and design a set of effective bilingual features for parsing models based on the verified results. The experimental results show that our new parsers significantly outperform state-of-theart baselines. Moreover, our approach is still able to provide improvement when we use a larger monolingual treebank that results in a much stronger baseline. Especially notable is that our approach can be used in a purely monolingual setting with the help of SMT.

2 0.79557669 25 emnlp-2011-Cache-based Document-level Statistical Machine Translation

Author: Zhengxian Gong ; Min Zhang ; Guodong Zhou

Abstract: Statistical machine translation systems are usually trained on a large amount of bilingual sentence pairs and translate one sentence at a time, ignoring document-level information. In this paper, we propose a cache-based approach to document-level translation. Since caches mainly depend on relevant data to supervise subsequent decisions, it is critical to fill the caches with highly-relevant data of a reasonable size. In this paper, we present three kinds of caches to store relevant document-level information: 1) a dynamic cache, which stores bilingual phrase pairs from the best translation hypotheses of previous sentences in the test document; 2) a static cache, which stores relevant bilingual phrase pairs extracted from similar bilingual document pairs (i.e. source documents similar to the test document and their corresponding target documents) in the training parallel corpus; 3) a topic cache, which stores the target-side topic words related with the test document in the source-side. In particular, three new features are designed to explore various kinds of document-level information in the above three kinds of caches. Evaluation shows the effectiveness of our cache-based approach to document-level translation with a performance improvement of 0.81 in BLEU score over Moses. Moreover, detailed analysis and discussion are presented to give new insights into document-level translation.

3 0.6063605 18 emnlp-2011-Analyzing Methods for Improving Precision of Pivot Based Bilingual Dictionaries

Author: Xabier Saralegi ; Iker Manterola ; Inaki San Vicente

Abstract: An A-C bilingual dictionary can be inferred by merging A-B and B-C dictionaries using B as pivot. However, polysemous pivot words often produce wrong translation candidates. This paper analyzes two methods for pruning wrong candidates: one based on exploiting the structure of the source dictionaries, and the other based on distributional similarity computed from comparable corpora. As both methods depend exclusively on easily available resources, they are well suited to less resourced languages. We studied whether these two techniques complement each other given that they are based on different paradigms. We also researched combining them by looking for the best adequacy depending on various application scenarios. ,

4 0.59188008 73 emnlp-2011-Improving Bilingual Projections via Sparse Covariance Matrices

Author: Jagadeesh Jagarlamudi ; Raghavendra Udupa ; Hal Daume III ; Abhijit Bhole

Abstract: Mapping documents into an interlingual representation can help bridge the language barrier of cross-lingual corpora. Many existing approaches are based on word co-occurrences extracted from aligned training data, represented as a covariance matrix. In theory, such a covariance matrix should represent semantic equivalence, and should be highly sparse. Unfortunately, the presence of noise leads to dense covariance matrices which in turn leads to suboptimal document representations. In this paper, we explore techniques to recover the desired sparsity in covariance matrices in two ways. First, we explore word association measures and bilingual dictionaries to weigh the word pairs. Later, we explore different selection strategies to remove the noisy pairs based on the association scores. Our experimental results on the task of aligning comparable documents shows the efficacy of sparse covariance matrices on two data sets from two different language pairs.

5 0.49272576 44 emnlp-2011-Domain Adaptation via Pseudo In-Domain Data Selection

Author: Amittai Axelrod ; Xiaodong He ; Jianfeng Gao

Abstract: We explore efficient domain adaptation for the task of statistical machine translation based on extracting sentences from a large general-domain parallel corpus that are most relevant to the target domain. These sentences may be selected with simple cross-entropy based methods, of which we present three. As these sentences are not themselves identical to the in-domain data, we call them pseudo in-domain subcorpora. These subcorpora, 1% the size of the original, can then be used to train small domain-adapted Statistical Machine Translation (SMT) systems which outperform systems trained on the entire corpus. Performance is further improved when we use these domain-adapted models in combination with a true in-domain model. The results show that more training data is not always better, and that best results are attained via proper domain-relevant data selection, as well as combining in- and general-domain systems during decoding.
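The cross-entropy based selection the abstract describes (in the spirit of Moore-Lewis scoring: prefer sentences the in-domain language model likes more than the general-domain one does) can be sketched with add-one-smoothed unigram language models. The models and data here are deliberately tiny stand-ins:

```python
import math
from collections import Counter

def unigram_lm(sentences):
    """Add-one-smoothed unigram language model over whitespace tokens."""
    counts = Counter(w for s in sentences for w in s.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 reserves mass for unseen words
    return lambda w: (counts.get(w, 0) + 1) / (total + vocab)

def cross_entropy(lm, sentence):
    """Per-word negative log-likelihood of a sentence under lm."""
    words = sentence.split()
    return -sum(math.log(lm(w)) for w in words) / max(len(words), 1)

def select_pseudo_in_domain(general, in_domain, top_k):
    lm_in = unigram_lm(in_domain)
    lm_gen = unigram_lm(general)
    # Lower H_in(s) - H_gen(s) means s looks more in-domain than general
    scored = sorted(general,
                    key=lambda s: cross_entropy(lm_in, s) - cross_entropy(lm_gen, s))
    return scored[:top_k]
```

A real system would use higher-order language models over both sides of the bitext; the unigram model only keeps the sketch self-contained.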

6 0.47646919 115 emnlp-2011-Relaxed Cross-lingual Projection of Constituent Syntax

7 0.40152305 93 emnlp-2011-Minimum Imputed-Risk: Unsupervised Discriminative Training for Machine Translation

8 0.36298329 95 emnlp-2011-Multi-Source Transfer of Delexicalized Dependency Parsers

9 0.34375179 108 emnlp-2011-Quasi-Synchronous Phrase Dependency Grammars for Machine Translation

10 0.29955336 83 emnlp-2011-Learning Sentential Paraphrases from Bilingual Parallel Corpora for Text-to-Text Generation

11 0.29493752 4 emnlp-2011-A Fast, Accurate, Non-Projective, Semantically-Enriched Parser

12 0.29098135 74 emnlp-2011-Inducing Sentence Structure from Parallel Corpora for Reordering

13 0.27650306 3 emnlp-2011-A Correction Model for Word Alignments

14 0.26056704 38 emnlp-2011-Data-Driven Response Generation in Social Media

15 0.25812203 58 emnlp-2011-Fast Generation of Translation Forest for Large-Scale SMT Discriminative Training

16 0.25760442 15 emnlp-2011-A novel dependency-to-string model for statistical machine translation

17 0.24806891 123 emnlp-2011-Soft Dependency Constraints for Reordering in Hierarchical Phrase-Based Translation

18 0.24754862 136 emnlp-2011-Training a Parser for Machine Translation Reordering

19 0.24746679 76 emnlp-2011-Language Models for Machine Translation: Original vs. Translated Texts

20 0.24614774 137 emnlp-2011-Training dependency parsers by jointly optimizing multiple objectives


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(23, 0.106), (36, 0.017), (37, 0.019), (45, 0.05), (53, 0.013), (54, 0.018), (57, 0.025), (62, 0.017), (64, 0.06), (65, 0.422), (66, 0.015), (69, 0.014), (79, 0.04), (82, 0.021), (87, 0.021), (90, 0.015), (96, 0.034)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.68868047 118 emnlp-2011-SMT Helps Bitext Dependency Parsing

Author: Wenliang Chen ; Jun'ichi Kazama ; Min Zhang ; Yoshimasa Tsuruoka ; Yujie Zhang ; Yiou Wang ; Kentaro Torisawa ; Haizhou Li

Abstract: We propose a method to improve the accuracy of parsing bilingual texts (bitexts) with the help of statistical machine translation (SMT) systems. Previous bitext parsing methods use human-annotated bilingual treebanks that are hard to obtain. Instead, our approach uses an auto-generated bilingual treebank to produce bilingual constraints. However, because the auto-generated bilingual treebank contains errors, the bilingual constraints are noisy. To overcome this problem, we use large-scale unannotated data to verify the constraints and design a set of effective bilingual features for parsing models based on the verified results. The experimental results show that our new parsers significantly outperform state-of-the-art baselines. Moreover, our approach is still able to provide improvement when we use a larger monolingual treebank that results in a much stronger baseline. Especially notable is that our approach can be used in a purely monolingual setting with the help of SMT.

2 0.6669246 15 emnlp-2011-A novel dependency-to-string model for statistical machine translation

Author: Jun Xie ; Haitao Mi ; Qun Liu

Abstract: Dependency structure, as a first step towards semantics, is believed to be helpful for improving translation quality. However, previous work on dependency-structure-based models typically resorts to insertion operations to complete translations, which makes it difficult to specify ordering information in translation rules. In our model, we handle this problem by directly specifying the ordering information in head-dependents rules, which represent the source side as head-dependents relations and the target side as strings. The head-dependents rules require only the substitution operation, thus our model requires no heuristics or separate ordering models, as in previous work, to control the word order of translations. Large-scale experiments show that our model performs well on long-distance reordering, and outperforms the state-of-the-art constituency-to-string model (+1.47 BLEU on average) and hierarchical phrase-based model (+0.46 BLEU on average) on two Chinese-English NIST test sets without resorting to phrases or parse forests. For the first time, a source dependency structure based model catches up with and surpasses the state-of-the-art translation models.

3 0.34275243 25 emnlp-2011-Cache-based Document-level Statistical Machine Translation

Author: Zhengxian Gong ; Min Zhang ; Guodong Zhou

Abstract: Statistical machine translation systems are usually trained on a large amount of bilingual sentence pairs and translate one sentence at a time, ignoring document-level information. In this paper, we propose a cache-based approach to document-level translation. Since caches mainly depend on relevant data to supervise subsequent decisions, it is critical to fill the caches with highly relevant data of a reasonable size. In this paper, we present three kinds of caches to store relevant document-level information: 1) a dynamic cache, which stores bilingual phrase pairs from the best translation hypotheses of previous sentences in the test document; 2) a static cache, which stores relevant bilingual phrase pairs extracted from similar bilingual document pairs (i.e. source documents similar to the test document and their corresponding target documents) in the training parallel corpus; 3) a topic cache, which stores target-side topic words related to the source-side test document. In particular, three new features are designed to explore various kinds of document-level information in the above three kinds of caches. Evaluation shows the effectiveness of our cache-based approach to document-level translation, with a performance improvement of 0.81 BLEU points over Moses. Moreover, detailed analysis and discussion are presented to give new insights into document-level translation.
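A minimal sketch of the dynamic-cache idea from the abstract: a binary feature that fires when a candidate phrase pair was already used earlier in the same document. The class and method names are illustrative, not the paper's actual feature design:

```python
from collections import deque

class DynamicCache:
    """Toy dynamic cache of phrase pairs taken from the best
    hypotheses of previously translated sentences in a document."""

    def __init__(self, max_size=100):
        self.pairs = deque(maxlen=max_size)  # oldest pairs fall out first

    def update(self, phrase_pairs):
        # Called after each sentence with its 1-best phrase pairs
        self.pairs.extend(phrase_pairs)

    def feature(self, src, tgt):
        # Binary feature: 1.0 if this pair was seen earlier in the document
        return 1.0 if (src, tgt) in self.pairs else 0.0

cache = DynamicCache(max_size=10)
cache.update([("yinhang", "bank")])
print(cache.feature("yinhang", "bank"))   # 1.0
print(cache.feature("yinhang", "shore"))  # 0.0
```

The decoder would add this feature score, with a tuned weight, to each candidate phrase pair, biasing translation choices toward document-level consistency.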

4 0.34179613 20 emnlp-2011-Augmenting String-to-Tree Translation Models with Fuzzy Use of Source-side Syntax

Author: Jiajun Zhang ; Feifei Zhai ; Chengqing Zong

Abstract: Due to its explicit modeling of the grammaticality of the output via target-side syntax, the string-to-tree model has been shown to be one of the most successful syntax-based translation models. However, a major limitation of this model is that it does not utilize any useful syntactic information on the source side. In this paper, we analyze the difficulties of incorporating source syntax in a string-to-tree model. We then propose a new way to use the source syntax in a fuzzy manner, both in source syntactic annotation and in rule matching. We further explore three algorithms for rule matching: 0-1 matching, likelihood matching, and deep similarity matching. Our method not only guarantees grammatical output with an explicit target tree, but also enables the system to choose the proper translation rules via fuzzy use of the source syntax. Our extensive experiments have shown significant improvements over the state-of-the-art string-to-tree system.

5 0.34003454 123 emnlp-2011-Soft Dependency Constraints for Reordering in Hierarchical Phrase-Based Translation

Author: Yang Gao ; Philipp Koehn ; Alexandra Birch

Abstract: Long-distance reordering remains one of the biggest challenges facing machine translation. We derive soft constraints from the source dependency parse to directly address the reordering problem for the hierarchical phrase-based model. Our approach significantly improves Chinese–English machine translation on a large-scale task by 0.84 BLEU points on average. Moreover, when we switch the tuning function from BLEU to the LRscore, which promotes reordering, we observe total improvements of 1.21 BLEU, 1.30 LRscore and 3.36 TER over the baseline. On average our approach improves reordering precision and recall by 6.9 and 0.3 absolute points, respectively, and is found to be especially effective for long-distance reordering.

6 0.32343754 59 emnlp-2011-Fast and Robust Joint Models for Biomedical Event Extraction

7 0.31949005 108 emnlp-2011-Quasi-Synchronous Phrase Dependency Grammars for Machine Translation

8 0.31717765 136 emnlp-2011-Training a Parser for Machine Translation Reordering

9 0.31537208 1 emnlp-2011-A Bayesian Mixture Model for PoS Induction Using Multiple Features

10 0.31305364 137 emnlp-2011-Training dependency parsers by jointly optimizing multiple objectives

11 0.31188425 146 emnlp-2011-Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance

12 0.31107366 28 emnlp-2011-Closing the Loop: Fast, Interactive Semi-Supervised Annotation With Queries on Features and Instances

13 0.30920893 79 emnlp-2011-Lateen EM: Unsupervised Training with Multiple Objectives, Applied to Dependency Grammar Induction

14 0.3088735 95 emnlp-2011-Multi-Source Transfer of Delexicalized Dependency Parsers

15 0.30877227 85 emnlp-2011-Learning to Simplify Sentences with Quasi-Synchronous Grammar and Integer Programming

16 0.30796254 58 emnlp-2011-Fast Generation of Translation Forest for Large-Scale SMT Discriminative Training

17 0.30692109 98 emnlp-2011-Named Entity Recognition in Tweets: An Experimental Study

18 0.30571532 128 emnlp-2011-Structured Relation Discovery using Generative Models

19 0.30541641 75 emnlp-2011-Joint Models for Chinese POS Tagging and Dependency Parsing

20 0.30531096 111 emnlp-2011-Reducing Grounded Learning Tasks To Grammatical Inference