emnlp emnlp2013 emnlp2013-15 knowledge-graph by maker-knowledge-mining

15 emnlp-2013-A Systematic Exploration of Diversity in Machine Translation


Source: pdf

Author: Kevin Gimpel ; Dhruv Batra ; Chris Dyer ; Gregory Shakhnarovich

Abstract: This paper addresses the problem of producing a diverse set of plausible translations. We present a simple procedure that can be used with any statistical machine translation (MT) system. We explore three ways of using diverse translations: (1) system combination, (2) discriminative reranking with rich features, and (3) a novel post-editing scenario in which multiple translations are presented to users. We find that diversity can improve performance on these tasks, especially for sentences that are difficult for MT.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract This paper addresses the problem of producing a diverse set of plausible translations. [sent-2, score-0.495]

2 We explore three ways of using diverse translations: (1) system combination, (2) discriminative reranking with rich features, and (3) a novel post-editing scenario in which multiple translations are presented to users. [sent-4, score-0.94]

3 1 Introduction From the perspective of user interaction, the ideal machine translator is an agent that reads documents in one language and produces accurate, high quality translations in another. [sent-6, score-0.313]

4 Multiple solutions are also used for reranking (Collins, 2000; Shen and Joshi, 2003; 1100 Collins and Koo, 2005; Charniak and Johnson, 2005), tuning (Och, 2003), minimum Bayes risk decoding (Kumar and Byrne, 2004), and system combination (Rosti et al. [sent-15, score-0.466]

5 Unfortunately, M-best lists are a poor surrogate for structured output spaces (Finkel et al. [sent-20, score-0.371]

6 In MT, for example, many translations on M-best lists are extremely similar, often differing only by a single punctuation mark or minor morphological variation. [sent-22, score-0.561]

7 Our approach builds on Batra et al. (2012), which produces diverse M-best solutions from a probabilistic model using a generic dissimilarity function ∆(·, ·) that specifies how two solutions differ. [sent-31, score-0.851]

8 Our first contribution is a family of dissimilarity functions for MT that admit simple algorithms for generating diverse translations. [sent-32, score-0.792]

9 Other contributions are empirical: we show that diverse translations can lead to improvements for system combination and discriminative reranking. [sent-33, score-0.85]

10 We also conduct a post-editing evaluation in order to measure whether diverse translations can help users make sense of noisy MT output. [sent-36, score-0.775]

11 We find that diverse translations can help post-editors produce better outputs for sentences that are the most difficult for MT. [sent-37, score-0.784]

12 3 Diversity in Machine Translation We now address the task of producing a set of diverse high-scoring translations. [sent-50, score-0.469]

13 We use the algorithm of Batra et al. (2012), which constructs diverse lists via a greedy iterative procedure, as follows. [sent-53, score-0.802]

14 On the m-th iteration, the m-th best (diverse) translation is obtained as ⟨ym, hm⟩ = argmax_{⟨y,h⟩ ∈ Tx} w⊤φ(x, y, h) + Σ_{j=1}^{m−1} λj ∆(yj, y)   (2), where ∆ is a dissimilarity function and λj is the weight placed on dissimilarity to previous translation j relative to the model score. [sent-56, score-0.961]
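
As a rough illustration of this greedy loop (not the authors' implementation), the sketch below assumes a hypothetical decode(source, penalties) wrapper around the underlying MT decoder that returns the arg max of the model score plus the supplied n-gram penalties, e.g. via the ARPA language-model trick described a few entries below; the function names and the per-n-gram weighting are placeholders.

```python
from collections import Counter

def ngrams(tokens, n):
    """Counts of all n-grams of order 1..n in a token list."""
    return Counter(tuple(tokens[i:i + k])
                   for k in range(1, n + 1)
                   for i in range(len(tokens) - k + 1))

def diverse_mbest(source, decode, M, n=2, lam=0.1):
    """Greedy diverse M-best generation in the spirit of Eq. (2).

    decode(source, penalties) is assumed to return the highest-scoring
    translation (a token list) under the model score plus the accumulated
    n-gram penalties; each translation already in the list contributes a
    penalty of -lam per shared n-gram occurrence.
    """
    penalties = Counter()   # n-gram -> accumulated (negative) weight
    diverse_list = []
    for _ in range(M):
        y = decode(source, penalties)          # arg max of Eq. (2) for this iteration
        diverse_list.append(y)
        for gram, count in ngrams(y, n).items():
            penalties[gram] -= lam * count     # discourage reusing these n-grams later
    return diverse_list
```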

15 Eq. (2) is a Lagrangian relaxation for an intractable constrained objective specifying a minimum dissimilarity ∆min between translations in the list. [sent-62, score-0.647]

16 Instead of setting the dissimilarity threshold ∆min, we set the weights λj. [sent-66, score-0.323]

17 Note that if the dissimilarity function ∆ factors across the parts of the output variables ⟨y, h⟩ in the same way as the features φ, then the same decoding algorithm can be used as for Eq. [sent-70, score-0.388]

18 2 Dissimilarity Functions for MT When designing a dissimilarity function ∆(·, ·) for MT, we want to consider variation both in individual word choice and in longer-range sentence structure. [sent-74, score-0.35]

19 We propose a dissimilarity function that simply counts the number of times any n-gram is present in both translations, then negates this count. [sent-76, score-0.35]
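
A minimal sketch of this n-gram dissimilarity, assuming that "counts the number of times any n-gram is present in both translations" means summing, over all n-grams up to order n, the minimum of the two occurrence counts; that counting convention is an assumption, not taken from the paper.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Occurrence counts of all n-grams of order 1..n in a token list."""
    return Counter(tuple(tokens[i:i + k])
                   for k in range(1, n + 1)
                   for i in range(len(tokens) - k + 1))

def dissimilarity(y1, y2, n=2):
    """Negated count of n-gram occurrences shared by two translations."""
    c1, c2 = ngram_counts(y1, n), ngram_counts(y2, n)
    shared = sum(min(c1[g], c2[g]) for g in c1 if g in c2)
    return -shared   # more shared n-grams => lower (more negative) dissimilarity
```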

20 The dissimilarity terms can simply be incorporated as an additional language model in ARPA format that sets the log-probability to the negated count for each n-gram in previous diverse translations, and sets to zero all other n-grams’ log-probabilities and back-off weights. [sent-80, score-0.792]
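
The sketch below shows one way such a penalty "language model" could be written out; the ARPA details are deliberately simplified (no sentence-boundary symbols, no unigram completeness, no OOV handling), so treat it as an illustration of the idea rather than a drop-in file for a specific decoder.

```python
from collections import Counter

def write_penalty_arpa(prev_translations, n, path):
    """Write accumulated n-gram penalties as a toy ARPA-format LM.

    Each n-gram seen in the previous diverse translations gets a "log
    probability" equal to its negated count; back-off weights are zero,
    so n-grams never seen before incur no cost.
    """
    counts = Counter()
    for toks in prev_translations:
        for k in range(1, n + 1):
            for i in range(len(toks) - k + 1):
                counts[tuple(toks[i:i + k])] += 1

    by_order = {k: {g: c for g, c in counts.items() if len(g) == k}
                for k in range(1, n + 1)}
    with open(path, "w") as f:
        f.write("\\data\\\n")
        for k in range(1, n + 1):
            f.write("ngram %d=%d\n" % (k, len(by_order[k])))
        for k in range(1, n + 1):
            f.write("\n\\%d-grams:\n" % k)
            for gram, c in sorted(by_order[k].items()):
                backoff = "\t0.0" if k < n else ""   # zero back-off weight
                f.write("%.1f\t%s%s\n" % (-float(c), " ".join(gram), backoff))
        f.write("\n\\end\\\n")
```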

21 The advantage of this dissimilarity function is its simplicity. [sent-81, score-0.35]

22 Most closely related is work by Devlin and Matsoukas (2012), who proposed a way to generate diverse translations by varying particular “traits,” such as translation length, number of rules applied, etc. [sent-85, score-0.868]

23 Eq. (2) could also be used with a richer dissimilarity function that requires a special-purpose decoding algorithm. [sent-87, score-0.418]

24 We chose our n-gram dissimilarity function due to its simplicity and applicability to most MT systems without requiring any change to decoders. [sent-88, score-0.35]

25 (2013) used bagging and boosting to get diverse system outputs for system combination and Cer et al. [sent-90, score-0.642]

26 We instead seek a set of translations that, when considered as a whole, similarly express the full range of the model’s beliefs about plausible translations for the input. [sent-97, score-0.536]

27 Also related is work on determinantal point processes (DPPs; Kulesza and Taskar, 2010), an elegant probabilistic model over sets of items that naturally prefers diverse sets. [sent-98, score-0.521]

28 We used the learned parameters to generate M-best and diverse lists for TUNE2 and TEST to use for subsequent experiments. [sent-141, score-0.775]

29 3 Diverse List Generation Generating diverse translations depends on two hyperparameters: the n-gram order used by the dissimilarity function ∆n (§3.2) [sent-143, score-1.074]

30 and the λj weights on the dissimilarity terms in Eq. (2). [sent-144, score-0.323]

31 The values of n and λ were tuned on a 200-sentence subset of TUNE1 separately for each language pair (which we call TUNE200), so as to maximize the oracle BLEU score of the diverse lists. [sent-148, score-1.045]

32 Unique lists were obtained from 1,000-best lists and therefore may not contain the target number of unique translations for all sentences. [sent-149, score-0.901]

33 Many MT decoders, including the phrase-based and hierarchical implementations in Moses, permit efficient extraction of N-best lists, so we exploit this to obtain larger lists that still exhibit diversity. [sent-168, score-0.338]

34 But we note that these N-best lists for each diverse solution are not in themselves diverse; with more computational power or more efficient algorithms (Devlin and Matsoukas, 2012) we could potentially generate larger, more diverse lists. [sent-169, score-1.244]

35 6 Analysis of Diverse Lists We now characterize our diverse lists by comparing them to M-best lists. [sent-170, score-0.775]

36 Table 1 shows oracle BLEU scores on TEST for M-best lists, unique M-best lists, and diverse lists of several sizes. [sent-171, score-0.888]

37 When comparing M-best and diverse lists of comparable size, the diverse lists have higher oracle BLEU scores. (Since BLEU does not decompose additively across segments, we chose translations for individual sentences that maximized BLEU+1 (Lin and Och, 2004), then computed “oracle” corpus BLEU of these translations.) [sent-173, score-1.833]
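
A sketch of the oracle computation described in that parenthetical note, using NLTK's smoothed sentence-level BLEU as a stand-in for BLEU+1; whether this smoothing matches Lin and Och's BLEU+1 exactly is an assumption.

```python
from nltk.translate.bleu_score import sentence_bleu, corpus_bleu, SmoothingFunction

def oracle_corpus_bleu(candidate_lists, references):
    """Per sentence, pick the candidate with the best smoothed sentence BLEU
    (a BLEU+1 stand-in), then score the picks with corpus-level BLEU.

    candidate_lists holds one list of tokenized candidates per sentence;
    references holds the parallel tokenized references.
    """
    smooth = SmoothingFunction().method2   # add-one style smoothing
    picks = []
    for candidates, ref in zip(candidate_lists, references):
        best = max(candidates,
                   key=lambda hyp: sentence_bleu([ref], hyp,
                                                 smoothing_function=smooth))
        picks.append(best)
    return corpus_bleu([[ref] for ref in references], picks)
```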

38 We did not consider n-grams from previous N-best lists when computing the dissimilarity function, but only those from the previous diverse translations. [sent-174, score-1.098]

39 Figure 1: Median, min, and max BLEU+1 of 20-best and 20-diverse lists for the ZH→EN test set, divided into quartiles according to the BLEU+1 score of the 1-best translation, and averaged across sentences in each quartile. [sent-175, score-0.452]

40 The differences are largest when comparing 20-best lists and 20-diverse lists, where they range from 4 to 6 BLEU points. [sent-178, score-0.332]

41 When generating these diverse lists, we used the n and λ values that were tuned for each language pair to maximize oracle BLEU on TUNE200 for the “20 div 50 best” configuration. [sent-179, score-0.667]
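
One way such tuning could be organized is a simple grid search, reusing the oracle_corpus_bleu sketch given earlier; the candidate grids and the generate_diverse_lists(sources, n, lam) helper are hypothetical placeholders, not the authors' actual setup.

```python
def tune_diversity_hyperparams(generate_diverse_lists, tune_sources, tune_refs,
                               n_grid=(1, 2, 3, 4),
                               lam_grid=(0.05, 0.1, 0.2, 0.5)):
    """Grid-search the n-gram order and dissimilarity weight that maximize
    oracle BLEU of the generated diverse lists on a small tuning set."""
    best = None
    for n in n_grid:
        for lam in lam_grid:
            lists = generate_diverse_lists(tune_sources, n, lam)
            score = oracle_corpus_bleu(lists, tune_refs)   # from the sketch above
            if best is None or score > best[0]:
                best = (score, n, lam)
    return best   # (oracle BLEU, n, lambda)
```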

42 They suggest that for optimal oracle BLEU, translations with long spans of repeated material should be avoided, while short overlapping n-grams are permitted. [sent-183, score-0.37]

43 We divided the TEST sentences into quartiles based on BLEU+1 of the 1-best translations from the baseline system. [sent-186, score-0.374]

44 As shown in the plot, the ranges of 20-diverse lists subsume those of 20-best lists, though the medians of the diverse lists drop. [sent-188, score-0.775]

45 The medians of diverse lists drop when the baseline system has a high BLEU score. [sent-192, score-0.369]

46 This matches intuition: when the baseline system is performing well, forcing it to find different translations is likely to result in worse translations. [sent-193, score-0.318]

47 So we may expect diverse lists to be most helpful for more difficult sentences, a point we return to in our experiments below. [sent-194, score-0.775]

48 7 System Combination Experiments One way to evaluate the quality of our diverse lists is to use them in system combination, as was similarly done by Devlin and Matsoukas (2012) and Cer et al. [sent-195, score-0.806]

49 We use our baseline systems (trained on TUNE1) to generate lists for system combination on TUNE2 and TEST. [sent-198, score-0.42]

50 System combination hyperparameters (whether to use feature length normalization; the size of the k-best lists generated by the system combiner during tuning, k ∈ {300, 600}) were chosen to maximize BLEU on TUNE200. [sent-202, score-0.486]

51 But we see larger improvements with diverse lists for AR→EN and ZH→EN. [sent-207, score-0.775]

52 So we used a structured support vector machine learning framework instead (described in Section 8), using multiple iterations of learning interleaved with (system combiner) N-best list generation, and accumulating N-best lists across iterations. [sent-209, score-0.404]

53 Quartiles (numbered “qn”) are defined according to BLEU+1 of the 1-best translations of the baseline system. [sent-233, score-0.374]

54 The gains are similar to those seen by Devlin and Matsoukas, but are obtained with our simpler dissimilarity function. [sent-235, score-0.356]

55 This may be a worthwhile trade-off: a large improvement in the worst translations may be more significant to users than a smaller degradation on sentences that are already being translated well. [sent-242, score-0.306]

56 Then system combination of diverse translations might be used only when the 1-best translation is predicted to be of low quality. [sent-246, score-0.95]

57 , 2004; Hildebrand and Vogel, 2008); some have attributed its mixed results to a lack of diversity in the M-best lists traditionally used. [sent-258, score-0.403]

58 We propose diverse lists as a way to address this concern. [sent-259, score-0.775]

59 We report results using the baseline system alone (labeled “N/A (baseline)”), and reranking standard M-best lists and our diverse lists. [sent-336, score-0.979]

60 For diverse lists, we use the “20 div 50 best” lists. [sent-337, score-0.56]

61 We use the tuned dissimilarity hyperparameters reported in Section 6. [sent-339, score-0.375]

62 For AR→EN, we see the largest gains, both over the baseline and in the differences between M-best lists and diverse lists. [sent-342, score-0.801]

63 Nonetheless, diverse lists appear to be more robust for these language pairs as features are added. [sent-350, score-0.775]

64 In Table 5, we compare several sizes and types of lists for AR→EN reranking, both with no additional features and with the full set. [sent-351, score-0.447]

65 Also, retaining 50-best lists for each diverse solution improves BLEU by 0. [sent-353, score-0.775]

66 Table 6: Comparing M-best and diverse lists for training/testing (AR→EN, all features).

67 Thus far, when training the reranker on M-best lists, we tested it on M-best lists, and similarly for diverse lists. [sent-359, score-0.521]

68 When training on diverse lists, we see very little difference in BLEU whether testing on M-best or diverse lists. [sent-361, score-0.469]

69 This has a practical benefit: we can use (computationally-expensive) diverse lists during offline training and then use fast M-best lists at test time. [sent-362, score-1.081]

70 When training on M-best lists and testing on diverse lists, we see a substantial drop (51. [sent-363, score-0.775]

71 The reranker may be overfitting to the limited scope of translations present in typical M-best lists, thereby hindering its ability to correctly rank diverse lists at test time. [sent-366, score-1.082]

72 These results suggest that part of the benefit of using diverse lists comes from seeing a larger portion of the output space during training. [sent-367, score-0.813]

73 9 Human Post-Editing Experiments We wanted to determine whether diverse translations could be helpful to users struggling to understand the output of an imperfect MT system. [sent-368, score-0.85]

74 We compare the use of entries from an M-best list and entries from a diverse list. [sent-374, score-0.618]

75 Our goal is to determine whether multiple, diverse translations can help users to more accurately guess the meaning of the original sentence than entries from a standard M-best list. [sent-376, score-0.854]

76 If so, commercial MT systems might permit users to request additional diverse translations for those sentences whose model-best translations are difficult to understand. [sent-377, score-1.03]

77 Half of the time, the worker is shown 3 entries from an M-best list, and the other half of the time 3 entries from a diverse list. [sent-383, score-0.575]

78 The goal is to measure whether workers are able to produce translations that are closer in meaning to the (unseen) references when shown diverse translations. [sent-385, score-0.763]

79 To evaluate the outputs, we use a second task in which users are shown a reference translation along with two outputs from the first task: one created from M-best lists and one from diverse lists. [sent-387, score-1.03]

80 Workers in this task are asked to choose which translation is a better match to the reference in terms of meaning, or they can indicate that the translations are of the same quality. [sent-388, score-0.399]

81 2 Dissimilarity Functions To generate diverse lists for the EDITING task, we use the same dissimilarity function as in reranking, but we tune the hyperparameters n and λ differently. [sent-391, score-1.177]

82 Since our expectation here is that workers may combine information from multiple translations to produce a superior output, we are interested in the coverage of the translations in the diverse list, rather than the oracle BLEU score. [sent-392, score-1.097]

83 We maximized this metric over diverse lists of length 5, for n ∈ {2, 3, . [sent-396, score-0.803]

84 This suggests that, when maximizing coverage of a small diverse list, more dissimilarity is desired among the translations. [sent-411, score-0.792]
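
Since the coverage metric is only partly specified in these excerpts, the sketch below assumes it is the fraction of reference n-grams (up to the chosen order, counted by type) that appear in at least one translation of the list; that definition is an assumption.

```python
def reference_ngram_coverage(diverse_list, reference, n):
    """Fraction of the reference's n-gram types (orders 1..n) that occur in
    at least one translation of the diverse list (all inputs are token lists)."""
    def gram_set(tokens):
        return {tuple(tokens[i:i + k])
                for k in range(1, n + 1)
                for i in range(len(tokens) - k + 1)}
    ref_grams = gram_set(reference)
    if not ref_grams:
        return 0.0
    covered = set().union(*(gram_set(t) for t in diverse_list)) & ref_grams
    return len(covered) / len(ref_grams)
```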

85 We also generated a diverse list of length 5 using the dissimilarity function ∆, with hyperparameters tuned using the procedure from the previous section. [sent-416, score-0.941]

86 We did the same using entries 1, i, and j from the diverse list. [sent-424, score-0.522]

87 For each sentence, we had 3 post-edited outputs generated using entries in 5-best lists and 3 post-edited outputs from diverse lists. [sent-427, score-0.948]

88 In general, when the BLEU score of the baseline system is below 35, it is preferable to give diverse translations to users for post-editing. [sent-439, score-0.838]

89 But when the baseline system does very well, diverse translations do not contribute anything, and in fact hurt because they may distract users from the high-quality (and typically very similar) translations from the 5-best lists. [sent-440, score-1.093]

90 Future work could investigate whether such automatic confidence estimation could be used to identify situations in which diverse translations can be helpful for aiding user understanding. [sent-446, score-0.724]

91 10 Future Work Our dissimilarity function captures diversity in the particular phrases used by an MT system, but for certain applications we may prefer other types of diversity. [sent-447, score-0.447]

92 Defining the dissimilarity function on POS tags or word clusters would help us to capture stylistic patterns in sentence structure, as would targeting syntactic structures in syntax-based translation. [sent-448, score-0.35]

93 A weakness of our approach is its computational expense; by contrast, the method of Devlin and Matsoukas (2012) obtains diverse translations more efficiently by extracting them from a single decoding of an input sentence (albeit with a wide beam). [sent-449, score-0.792]

94 We expect their ideas to be directly applicable to our setting in order to get diverse solutions more cheaply. [sent-450, score-0.501]

95 We also plan to explore methods of explicitly targeting multiple, diverse solutions as part of the search algorithm. [sent-451, score-0.501]

96 Finally, M-best lists are currently used to approximate structured spaces for many areas of MT, including tuning (Och, 2003), minimum Bayes risk decoding (Kumar and Byrne, 2004), and pipelines (Venugopal et al. [sent-452, score-0.544]

97 Future work could replace M-best lists with diverse lists in these and related tasks, whether for MT or other areas of structured NLP. [sent-454, score-1.108]

98 Combining machine translation output with open source: The Carnegie Mellon multi-engine machine translation scheme. [sent-638, score-0.382]

99 Efficient minimum error rate training and minimum Bayes-risk decoding for translation hypergraphs and lattices. [sent-756, score-0.35]

100 An empirical study on computing consensus translations from multiple machine translation systems. [sent-795, score-0.459]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('diverse', 0.469), ('dissimilarity', 0.323), ('lists', 0.306), ('bleu', 0.264), ('translations', 0.255), ('en', 0.151), ('mt', 0.148), ('zh', 0.144), ('translation', 0.144), ('reranking', 0.141), ('batra', 0.14), ('ar', 0.12), ('devlin', 0.097), ('matsoukas', 0.097), ('diversity', 0.097), ('div', 0.091), ('och', 0.09), ('hildebrand', 0.087), ('quartiles', 0.087), ('oracle', 0.079), ('heafield', 0.078), ('lms', 0.073), ('kulesza', 0.069), ('minimum', 0.069), ('decoding', 0.068), ('kumar', 0.065), ('macherey', 0.065), ('outputs', 0.06), ('koehn', 0.056), ('specia', 0.056), ('eval', 0.055), ('entries', 0.053), ('determinantal', 0.052), ('dpps', 0.052), ('yadollahpour', 0.052), ('hyperparameters', 0.052), ('reranker', 0.052), ('combination', 0.051), ('users', 0.051), ('mert', 0.05), ('tuning', 0.047), ('editing', 0.046), ('combiner', 0.046), ('discriminative', 0.044), ('list', 0.043), ('venugopal', 0.042), ('shen', 0.039), ('workers', 0.039), ('bach', 0.039), ('byrne', 0.039), ('monz', 0.039), ('yx', 0.039), ('output', 0.038), ('dyer', 0.037), ('bojar', 0.037), ('imperfect', 0.037), ('arabic', 0.036), ('repeated', 0.036), ('atxx', 0.035), ('chatterjee', 0.035), ('min', 0.035), ('unique', 0.034), ('gains', 0.033), ('solutions', 0.032), ('consensus', 0.032), ('hierarchical', 0.032), ('max', 0.032), ('baseline', 0.032), ('system', 0.031), ('pauls', 0.031), ('tsochantaridis', 0.03), ('gertz', 0.03), ('quartile', 0.03), ('soricut', 0.03), ('translator', 0.03), ('wclm', 0.03), ('vogel', 0.03), ('cer', 0.029), ('hx', 0.029), ('statistical', 0.028), ('maximize', 0.028), ('machine', 0.028), ('maximized', 0.028), ('sentencelevel', 0.028), ('kenlm', 0.028), ('postediting', 0.028), ('tromble', 0.028), ('risk', 0.027), ('bin', 0.027), ('nc', 0.027), ('function', 0.027), ('procedure', 0.027), ('chinese', 0.027), ('structured', 0.027), ('median', 0.026), ('plausible', 0.026), ('largest', 0.026), ('guess', 0.026), ('hy', 0.026), ('rosti', 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000005 15 emnlp-2013-A Systematic Exploration of Diversity in Machine Translation

Author: Kevin Gimpel ; Dhruv Batra ; Chris Dyer ; Gregory Shakhnarovich

Abstract: This paper addresses the problem of producing a diverse set of plausible translations. We present a simple procedure that can be used with any statistical machine translation (MT) system. We explore three ways of using diverse translations: (1) system combination, (2) discriminative reranking with rich features, and (3) a novel post-editing scenario in which multiple translations are presented to users. We find that diversity can improve performance on these tasks, especially for sentences that are difficult for MT.

2 0.17707688 104 emnlp-2013-Improving Statistical Machine Translation with Word Class Models

Author: Joern Wuebker ; Stephan Peitz ; Felix Rietig ; Hermann Ney

Abstract: Automatically clustering words from a monolingual or bilingual training corpus into classes is a widely used technique in statistical natural language processing. We present a very simple and easy to implement method for using these word classes to improve translation quality. It can be applied across different machine translation paradigms and with arbitrary types of models. We show its efficacy on a small German→English and a larger French→German translation task with standard phrase-based and hierarchical phrase-based translation systems for a common set of models. Our results show that with word class models, the baseline can be improved by up to 1.4% BLEU and 1.0% TER on the French→German task and 0.3% BLEU and 1.1% TER on the German→English task.

3 0.17529313 3 emnlp-2013-A Corpus Level MIRA Tuning Strategy for Machine Translation

Author: Ming Tan ; Tian Xia ; Shaojun Wang ; Bowen Zhou

Abstract: MIRA based tuning methods have been widely used in statistical machine translation (SMT) system with a large number of features. Since the corpus-level BLEU is not decomposable, these MIRA approaches usually define a variety of heuristic-driven sentence-level BLEUs in their model losses. Instead, we present a new MIRA method, which employs an exact corpus-level BLEU to compute the model loss. Our method is simpler in implementation. Experiments on Chinese-to-English translation show its effectiveness over two state-of-the-art MIRA implementations.

4 0.15910834 127 emnlp-2013-Max-Margin Synchronous Grammar Induction for Machine Translation

Author: Xinyan Xiao ; Deyi Xiong

Abstract: Traditional synchronous grammar induction estimates parameters by maximizing likelihood, which only has a loose relation to translation quality. Alternatively, we propose a max-margin estimation approach to discriminatively inducing synchronous grammars for machine translation, which directly optimizes translation quality measured by BLEU. In the max-margin estimation of parameters, we only need to calculate Viterbi translations. This further facilitates the incorporation of various non-local features that are defined on the target side. We test the effectiveness of our max-margin estimation framework on a competitive hierarchical phrase-based system. Experiments show that our max-margin method significantly outperforms the traditional twostep pipeline for synchronous rule extraction by 1.3 BLEU points and is also better than previous max-likelihood estimation method.

5 0.14921372 135 emnlp-2013-Monolingual Marginal Matching for Translation Model Adaptation

Author: Ann Irvine ; Chris Quirk ; Hal Daume III

Abstract: When using a machine translation (MT) model trained on OLD-domain parallel data to translate NEW-domain text, one major challenge is the large number of out-of-vocabulary (OOV) and new-translation-sense words. We present a method to identify new translations of both known and unknown source language words that uses NEW-domain comparable document pairs. Starting with a joint distribution of source-target word pairs derived from the OLD-domain parallel corpus, our method recovers a new joint distribution that matches the marginal distributions of the NEW-domain comparable document pairs, while minimizing the divergence from the OLD-domain distribution. Adding learned translations to our French-English MT model results in gains of about 2 BLEU points over strong baselines.

6 0.13270231 128 emnlp-2013-Max-Violation Perceptron and Forced Decoding for Scalable MT Training

7 0.12669905 159 emnlp-2013-Regularized Minimum Error Rate Training

8 0.11710663 38 emnlp-2013-Bilingual Word Embeddings for Phrase-Based Machine Translation

9 0.11697252 84 emnlp-2013-Factored Soft Source Syntactic Constraints for Hierarchical Machine Translation

10 0.11177946 136 emnlp-2013-Multi-Domain Adaptation for SMT Using Multi-Task Learning

11 0.10960463 57 emnlp-2013-Dependency-Based Decipherment for Resource-Limited Machine Translation

12 0.1085435 175 emnlp-2013-Source-Side Classifier Preordering for Machine Translation

13 0.1018079 107 emnlp-2013-Interactive Machine Translation using Hierarchical Translation Models

14 0.10101771 71 emnlp-2013-Efficient Left-to-Right Hierarchical Phrase-Based Translation with Improved Reordering

15 0.095722243 88 emnlp-2013-Flexible and Efficient Hypergraph Interactions for Joint Hierarchical and Forest-to-String Decoding

16 0.092606962 201 emnlp-2013-What is Hidden among Translation Rules

17 0.091929622 55 emnlp-2013-Decoding with Large-Scale Neural Language Models Improves Translation

18 0.087838851 103 emnlp-2013-Improving Pivot-Based Statistical Machine Translation Using Random Walk

19 0.080612637 187 emnlp-2013-Translation with Source Constituency and Dependency Trees

20 0.078165188 145 emnlp-2013-Optimal Beam Search for Machine Translation


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.245), (1, -0.242), (2, 0.074), (3, 0.048), (4, 0.113), (5, -0.067), (6, 0.021), (7, 0.033), (8, 0.013), (9, -0.046), (10, -0.043), (11, 0.018), (12, 0.066), (13, -0.122), (14, 0.03), (15, -0.122), (16, 0.033), (17, 0.062), (18, 0.016), (19, 0.117), (20, 0.001), (21, 0.118), (22, -0.056), (23, 0.018), (24, -0.009), (25, 0.017), (26, -0.089), (27, 0.002), (28, -0.022), (29, -0.097), (30, 0.11), (31, -0.04), (32, -0.056), (33, -0.052), (34, -0.05), (35, 0.001), (36, -0.058), (37, -0.081), (38, 0.059), (39, -0.003), (40, 0.007), (41, 0.036), (42, -0.001), (43, -0.057), (44, -0.046), (45, 0.018), (46, -0.036), (47, -0.058), (48, -0.038), (49, 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97292489 15 emnlp-2013-A Systematic Exploration of Diversity in Machine Translation

Author: Kevin Gimpel ; Dhruv Batra ; Chris Dyer ; Gregory Shakhnarovich

Abstract: This paper addresses the problem of producing a diverse set of plausible translations. We present a simple procedure that can be used with any statistical machine translation (MT) system. We explore three ways of using diverse translations: (1) system combination, (2) discriminative reranking with rich features, and (3) a novel post-editing scenario in which multiple translations are presented to users. We find that diversity can improve performance on these tasks, especially for sentences that are difficult for MT.

2 0.90155482 3 emnlp-2013-A Corpus Level MIRA Tuning Strategy for Machine Translation

Author: Ming Tan ; Tian Xia ; Shaojun Wang ; Bowen Zhou

Abstract: MIRA based tuning methods have been widely used in statistical machine translation (SMT) system with a large number of features. Since the corpus-level BLEU is not decomposable, these MIRA approaches usually define a variety of heuristic-driven sentence-level BLEUs in their model losses. Instead, we present a new MIRA method, which employs an exact corpus-level BLEU to compute the model loss. Our method is simpler in implementation. Experiments on Chinese-to-English translation show its effectiveness over two state-of-the-art MIRA implementations.

3 0.76466322 104 emnlp-2013-Improving Statistical Machine Translation with Word Class Models

Author: Joern Wuebker ; Stephan Peitz ; Felix Rietig ; Hermann Ney

Abstract: Automatically clustering words from a monolingual or bilingual training corpus into classes is a widely used technique in statistical natural language processing. We present a very simple and easy to implement method for using these word classes to improve translation quality. It can be applied across different machine translation paradigms and with arbitrary types of models. We show its efficacy on a small German→English and a larger French→German translation task with standard phrase-based and hierarchical phrase-based translation systems for a common set of models. Our results show that with word class models, the baseline can be improved by up to 1.4% BLEU and 1.0% TER on the French→German task and 0.3% BLEU and 1.1% TER on the German→English task.

4 0.69673669 159 emnlp-2013-Regularized Minimum Error Rate Training

Author: Michel Galley ; Chris Quirk ; Colin Cherry ; Kristina Toutanova

Abstract: Minimum Error Rate Training (MERT) remains one of the preferred methods for tuning linear parameters in machine translation systems, yet it faces significant issues. First, MERT is an unregularized learner and is therefore prone to overfitting. Second, it is commonly used on a noisy, non-convex loss function that becomes more difficult to optimize as the number of parameters increases. To address these issues, we study the addition of a regularization term to the MERT objective function. Since standard regularizers such as ℓ2 are inapplicable to MERT due to the scale invariance of its objective function, we turn to two regularizers—ℓ0 and a modification of ℓ2—and present methods for efficiently integrating them during search. To improve search in large parameter spaces, we also present a new direction finding algorithm that uses the gradient of expected BLEU to orient MERT’s exact line searches. Experiments with up to 3600 features show that these extensions of MERT yield results comparable to PRO, a learner often used with large feature sets.

5 0.65392083 128 emnlp-2013-Max-Violation Perceptron and Forced Decoding for Scalable MT Training

Author: Heng Yu ; Liang Huang ; Haitao Mi ; Kai Zhao

Abstract: While large-scale discriminative training has triumphed in many NLP problems, its definite success on machine translation has been largely elusive. Most recent efforts along this line are not scalable (training on the small dev set with features from top ∼100 most frequent words) and overly complicated. We instead present a very simple yet theoretically motivated approach by extending the recent framework of “violation-fixing perceptron”, using forced decoding to compute the target derivations. Extensive phrase-based translation experiments on both Chinese-to-English and Spanish-to-English tasks show substantial gains in BLEU by up to +2.3/+2.0 on dev/test over MERT, thanks to 20M+ sparse features. This is the first successful effort of large-scale online discriminative training for MT. 1 Introduction Large-scale discriminative training has witnessed great success in many NLP problems such as parsing (McDonald et al., 2005) and tagging (Collins, 2002), but not yet for machine translation (MT) despite numerous recent efforts. Due to scalability issues, most of these recent methods can only train on a small dev set of about a thousand sentences rather than on the full training set, and only with 2,000–10,000 rather “dense-like” features (either unlexicalized or only considering highest-frequency words), as in MIRA (Watanabe et al., 2007; Chiang et al., 2008; Chiang, 2012), PRO (Hopkins and May, 2011), and RAMP (Gimpel and Smith, 2012). However, it is well-known that the most important features for NLP are lexicalized, most of which can not be seen on a small dataset. Furthermore, these methods often involve complicated loss functions and intricate choices of the “target” derivations to update towards or against (e.g. k-best/forest oracles, or hope/fear derivations), and are thus hard to replicate. As a result, the classical method of MERT (Och, 2003) remains the default training algorithm for MT even though it can only tune a handful of dense features. See also Section 6 for other related work. As a notable exception, Liang et al. (2006) do train a structured perceptron model on the training data with sparse features, but fail to outperform MERT. We argue this is because structured perceptron, like many structured learning algorithms such as CRF and MIRA, assumes exact search, and search errors inevitably break theoretical properties such as convergence (Huang et al., 2012). Empirically, it is now well accepted that standard perceptron performs poorly when search error is severe (Collins and Roark, 2004; Zhang et al., 2013). To address the search error problem we propose a very simple approach based on the recent framework of “violation-fixing perceptron” (Huang et al., 2012) which is designed specifically for inexact search, with a theoretical convergence guarantee and excellent empirical performance on beam search parsing and tagging. The basic idea is to update when search error happens, rather than at the end of the search. To adapt it to MT, we extend this framework to handle latent variables corresponding to the hidden derivations. We update towards “gold-standard” derivations computed by forced decoding so that each derivation leads to the exact reference translation. Forced decoding is also used as a way of data selection, since those reachable sentence pairs are generally more literal and of higher quality, which the training should focus on. 
When the reachable subset is small for some language pairs, we augment it by including reachable prefix-pairs when the full sentence pair is not. We make the following contributions: 1. Our work is the first successful effort to scale online structured learning to a large portion of the training data (as opposed to the dev set). 2. Our work is the first to use a principled learning method customized for inexact search which updates on partial derivations rather than full ones in order to fix search errors. We adapt it to MT using latent variables for derivations. 3. Contrary to the common wisdom, we show that simply updating towards the exact reference translation is helpful, which is much simpler than k-best/forest oracles or loss-augmented (e.g. hope/fear) derivations, avoiding sentence-level BLEU scores or other loss functions. 4. We present a convincing analysis that it is the search errors and standard perceptron’s inability to deal with them that prevent previous work, esp. Liang et al. (2006), from succeeding. 5. Scaling to the training data enables us to engineer a very rich feature set of sparse, lexicalized, and non-local features, and we propose various ways to alleviate overfitting. For simplicity and efficiency reasons, in this paper we use phrase-based translation, but our method has the potential to be applicable to other translation paradigms. Extensive experiments on both Chinese-to-English and Spanish-to-English tasks show statistically significant gains in BLEU by up to +2.3/+2.0 on dev/test over MERT, and up to +1.5/+1.5 over PRO, thanks to 20M+ sparse features. 2 Phrase-Based MT and Forced Decoding We first review the basic phrase-based decoding algorithm (Koehn, 2004), which will be adapted for forced decoding. 2.1 Background: Phrase-based Decoding We will use the following running example from Chinese to English from Mi et al. (2008): Figure 1: Standard beam-search phrase-based decoding. Bùshí yǔ Shālóng jǔxíng le huìtán / Bush with Sharon hold -ed meeting / ‘Bush held a meeting with Sharon’. Phrase-based decoders generate partial target-language outputs in left-to-right order in the form of hypotheses (or states) (Koehn, 2004). Each hypothesis has a coverage vector capturing the source-language words translated so far, and can be extended into a longer hypothesis by a phrase-pair translating an uncovered segment. For example, a derivation is a sequence of such states, where a • in the coverage vector indicates the source word at that position is “covered”, and each si is the score of each state, adding the rule score and the distortion cost (dc) to the score of the previous state. To compute the distortion cost we also need to maintain the ending position of the last phrase (e.g., the 3 and 6 in the coverage vectors). In phrase-based translation there is also a distortion limit which prohibits long-distance reorderings. 
The above states are called −LM states since they do not involve language model costs. To add a bigram model, we split each −LM state into a series of +LM states; each +LM state has the form (v, a), where a is the last word of the hypothesis.

6 0.652219 135 emnlp-2013-Monolingual Marginal Matching for Translation Model Adaptation

7 0.6474995 136 emnlp-2013-Multi-Domain Adaptation for SMT Using Multi-Task Learning

8 0.6221047 52 emnlp-2013-Converting Continuous-Space Language Models into N-Gram Language Models for Statistical Machine Translation

9 0.61750561 57 emnlp-2013-Dependency-Based Decipherment for Resource-Limited Machine Translation

10 0.60686916 103 emnlp-2013-Improving Pivot-Based Statistical Machine Translation Using Random Walk

11 0.59952003 55 emnlp-2013-Decoding with Large-Scale Neural Language Models Improves Translation

12 0.5889141 127 emnlp-2013-Max-Margin Synchronous Grammar Induction for Machine Translation

13 0.5851323 107 emnlp-2013-Interactive Machine Translation using Hierarchical Translation Models

14 0.53169078 22 emnlp-2013-Anchor Graph: Global Reordering Contexts for Statistical Machine Translation

15 0.5066449 2 emnlp-2013-A Convex Alternative to IBM Model 2

16 0.49842596 38 emnlp-2013-Bilingual Word Embeddings for Phrase-Based Machine Translation

17 0.49363482 71 emnlp-2013-Efficient Left-to-Right Hierarchical Phrase-Based Translation with Improved Reordering

18 0.47664496 175 emnlp-2013-Source-Side Classifier Preordering for Machine Translation

19 0.46957409 39 emnlp-2013-Boosting Cross-Language Retrieval by Learning Bilingual Phrase Associations from Relevance Rankings

20 0.46550974 201 emnlp-2013-What is Hidden among Translation Rules


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.01), (3, 0.023), (8, 0.19), (10, 0.01), (18, 0.038), (22, 0.045), (30, 0.13), (43, 0.014), (45, 0.02), (50, 0.042), (51, 0.164), (66, 0.037), (71, 0.022), (75, 0.043), (77, 0.082), (95, 0.012), (96, 0.019)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.85824484 15 emnlp-2013-A Systematic Exploration of Diversity in Machine Translation

Author: Kevin Gimpel ; Dhruv Batra ; Chris Dyer ; Gregory Shakhnarovich

Abstract: This paper addresses the problem of producing a diverse set of plausible translations. We present a simple procedure that can be used with any statistical machine translation (MT) system. We explore three ways of using diverse translations: (1) system combination, (2) discriminative reranking with rich features, and (3) a novel post-editing scenario in which multiple translations are presented to users. We find that diversity can improve performance on these tasks, especially for sentences that are difficult for MT.

2 0.84408408 137 emnlp-2013-Multi-Relational Latent Semantic Analysis

Author: Kai-Wei Chang ; Wen-tau Yih ; Christopher Meek

Abstract: We present Multi-Relational Latent Semantic Analysis (MRLSA) which generalizes Latent Semantic Analysis (LSA). MRLSA provides an elegant approach to combining multiple relations between words by constructing a 3-way tensor. Similar to LSA, a low-rank approximation of the tensor is derived using a tensor decomposition. Each word in the vocabulary is thus represented by a vector in the latent semantic space and each relation is captured by a latent square matrix. The degree of two words having a specific relation can then be measured through simple linear algebraic operations. We demonstrate that by integrating multiple relations from both homogeneous and heterogeneous information sources, MRLSA achieves state-of-the-art performance on existing benchmark datasets for two relations, antonymy and is-a.

3 0.84038651 90 emnlp-2013-Generating Coherent Event Schemas at Scale

Author: Niranjan Balasubramanian ; Stephen Soderland ; Mausam ; Oren Etzioni

Abstract: Chambers and Jurafsky (2009) demonstrated that event schemas can be automatically induced from text corpora. However, our analysis of their schemas identifies several weaknesses, e.g., some schemas lack a common topic and distinct roles are incorrectly mixed into a single actor. It is due in part to their pair-wise representation that treats subjectverb independently from verb-object. This often leads to subject-verb-object triples that are not meaningful in the real-world. We present a novel approach to inducing open-domain event schemas that overcomes these limitations. Our approach uses cooccurrence statistics of semantically typed relational triples, which we call Rel-grams (relational n-grams). In a human evaluation, our schemas outperform Chambers’s schemas by wide margins on several evaluation criteria. Both Rel-grams and event schemas are freely available to the research community.

4 0.75163591 175 emnlp-2013-Source-Side Classifier Preordering for Machine Translation

Author: Uri Lerner ; Slav Petrov

Abstract: We present a simple and novel classifier-based preordering approach. Unlike existing preordering models, we train feature-rich discriminative classifiers that directly predict the target-side word order. Our approach combines the strengths of lexical reordering and syntactic preordering models by performing long-distance reorderings using the structure of the parse tree, while utilizing a discriminative model with a rich set of features, including lexical features. We present extensive experiments on 22 language pairs, including preordering into English from 7 other languages. We obtain improvements of up to 1.4 BLEU on language pairs in the WMT 2010 shared task. For languages from different families the improvements often exceed 2 BLEU. Many of these gains are also significant in human evaluations.

5 0.75038588 56 emnlp-2013-Deep Learning for Chinese Word Segmentation and POS Tagging

Author: Xiaoqing Zheng ; Hanyang Chen ; Tianyu Xu

Abstract: This study explores the feasibility of performing Chinese word segmentation (CWS) and POS tagging by deep learning. We try to avoid task-specific feature engineering, and use deep layers of neural networks to discover relevant features to the tasks. We leverage large-scale unlabeled data to improve internal representation of Chinese characters, and use these improved representations to enhance supervised word segmentation and POS tagging models. Our networks achieved close to state-of-the-art performance with minimal computational cost. We also describe a perceptron-style algorithm for training the neural networks, as an alternative to maximum-likelihood method, to speed up the training process and make the learning algorithm easier to be implemented.

6 0.74849844 157 emnlp-2013-Recursive Autoencoders for ITG-Based Translation

7 0.74651885 107 emnlp-2013-Interactive Machine Translation using Hierarchical Translation Models

8 0.7455371 187 emnlp-2013-Translation with Source Constituency and Dependency Trees

9 0.74506128 38 emnlp-2013-Bilingual Word Embeddings for Phrase-Based Machine Translation

10 0.74006313 64 emnlp-2013-Discriminative Improvements to Distributional Sentence Similarity

11 0.73804474 135 emnlp-2013-Monolingual Marginal Matching for Translation Model Adaptation

12 0.73683548 22 emnlp-2013-Anchor Graph: Global Reordering Contexts for Statistical Machine Translation

13 0.73313677 104 emnlp-2013-Improving Statistical Machine Translation with Word Class Models

14 0.73272216 57 emnlp-2013-Dependency-Based Decipherment for Resource-Limited Machine Translation

15 0.73198128 114 emnlp-2013-Joint Learning and Inference for Grammatical Error Correction

16 0.7298339 128 emnlp-2013-Max-Violation Perceptron and Forced Decoding for Scalable MT Training

17 0.72513282 13 emnlp-2013-A Study on Bootstrapping Bilingual Vector Spaces from Non-Parallel Data (and Nothing Else)

18 0.72486877 3 emnlp-2013-A Corpus Level MIRA Tuning Strategy for Machine Translation

19 0.72460282 143 emnlp-2013-Open Domain Targeted Sentiment

20 0.72427517 53 emnlp-2013-Cross-Lingual Discriminative Learning of Sequence Models with Posterior Regularization