acl acl2013 acl2013-235 knowledge-graph by maker-knowledge-mining

235 acl-2013-Machine Translation Detection from Monolingual Web-Text


Source: pdf

Author: Yuki Arase ; Ming Zhou

Abstract: We propose a method for automatically detecting low-quality Web-text translated by statistical machine translation (SMT) systems. We focus on the phrase salad phenomenon that is observed in existing SMT results and propose a set of computationally inexpensive features to effectively detect such machine-translated sentences from a large-scale Web-mined text. Unlike previous approaches that require bilingual data, our method uses only monolingual text as input; therefore it is applicable for refining data produced by a variety of Web-mining activities. Evaluation results show that the proposed method achieves an accuracy of 95.8% for sentences and 80.6% for text in noisy Web pages.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 We propose a method for automatically detecting low-quality Web-text translated by statistical machine translation (SMT) systems. [sent-7, score-0.236]

2 We focus on the phrase salad phenomenon that is observed in existing SMT results and propose a set of computationally inexpensive features to effectively detect such machine-translated sentences from a large-scale Web-mined text. [sent-8, score-0.725]

3 Unlike previous approaches that require bilingual data, our method uses only monolingual text as input; therefore it is applicable for refining data produced by a variety of Web-mining activities. [sent-9, score-0.19]

4 Evaluation results show that the proposed method achieves an accuracy of 95.8% for sentences and 80.6% for text in noisy Web pages. [sent-10, score-0.184]

5 With recent advances in statistical machine translation (SMT) systems and their wide adoption in Web services through APIs (Microsoft Translator, 2009; Google Translate, 2006), a large amount of text in Web pages is translated by SMT systems. [sent-23, score-0.208]

6 they are of sufficient quality and indistinguishable from human-generated sentences; however, the quality of these machine-translated sentences is generally much lower than sentences generated by native speakers and professional translators. [sent-32, score-0.287]

7 To solve this problem, we propose a method for automatically detecting Web-text translated by SMT systems1 . [sent-34, score-0.195]

8 We especially target machine-translated text produced through Web APIs, which is rapidly increasing. [sent-35, score-0.305]

9 We focus on the phrase salad phenomenon (Lopez, 2008), which characterizes translations by existing SMT systems, i.e. [sent-36, score-0.572]

10 each phrase in a sentence is semantically and syntactically correct but becomes incorrect when combined with other phrases in the sentence. [sent-38, score-0.289]

11 Based on this trait, we propose features for evaluating the likelihood of machine-translated sentences and use a classifier to determine whether the sentence is generated by the SMT systems. [sent-39, score-0.222]

12 Therefore, our method can be used in monolingual Web data mining where bilingual information is unavailable. [sent-43, score-0.198]

13 Our method determines if an input sentence contains phrase salads using simple yet effective features, i.e. [sent-45, score-0.321]

14 Third, our method computes features using both human-generated text and SMT results. (Footnote 1: In this paper, the term machine-translated is used to indicate translation by SMT systems.) [sent-48, score-0.184]

15 Our method contrasts these features computed from human-generated text and SMT results to capture a phrase salad, which significantly improves detection accuracy. [sent-51, score-0.675]

16 The results show that our method achieves an accuracy of 95.8%. [sent-53, score-0.184]

17 2 Related Work Previous methods for detecting machine-translated text are mostly designed for bilingual corpus construction. [sent-56, score-0.416]

18 In contrast, our method aims at making a binary judgment to distinguish machine-translated sentences from a mixture of machine-translated and human-generated sentences. [sent-69, score-0.162]

19 In contrast, our method does not specify error types and aims to detect machine-translated sentences focusing on the phrase salad phenomenon produced by SMT systems. [sent-75, score-0.734]

20 ESL learners make spelling and grammar mistakes at the word level, but their sentences are generally structured, while SMT results are unstructured due to phrase salads. [sent-77, score-0.236]

21 Specifically, the method constructs LMs using corpora of target and non-target domains and computes a cross-entropy score of an input sentence for estimating the likelihood that the input sentence belongs to the target or non-target domains. [sent-84, score-0.198]

22 While the context is different, our work uses a similar idea of data selection for the purpose of detecting low-quality sentences translated by SMT systems. [sent-85, score-0.237]

23 In addition, human-generated and machine-translated sentences are often mixed together even in a single paragraph. [sent-90, score-0.375]

24 To observe the distribution of machine-translated sentences in such difficult cases, we examine 3K sentences collected by our in-house Web crawler. [sent-91, score-0.204]

25 Our goal is to automatically identify these sentences that cannot be simply detected by the tags, except when the sentences are of sufficient quality to be indistinguishable from human-generated sentences. [sent-94, score-0.25]

26 Figure 1 illustrates the phrase salad phenomenon that characterizes a sentence translated by an existing SMT system: | Of surprise | was up | foreigners flocked | overseas | as well, | they publicized not only | Japan, | saw an article from the news. [sent-97, score-0.829]

27 Figure 1: The phrase salad phenomenon in a sentence translated by an SMT system; each (segmented) phrase is correct and fluent, but dotted arcs show unnatural sequences of phrases and the boxed phrase shows an incomplete non-contiguous phrase. [sent-99, score-0.972]

28 In addition, a phrase salad becomes obvious by observing distant phrases. [sent-102, score-0.578]

29 Such non-contiguous phrases are difficult for most SMT systems to generate, since these phrases require insertion of subphrases in distant parts of the sentence. [sent-106, score-0.26]

30 Based on the observation of these characteristics, we define features to capture a phrase salad by examining local and distant phrases. [sent-107, score-0.67]

31 3), and (3) completeness of non-contiguous phrases in a sentence (Sec. [sent-112, score-0.197]

32 Furthermore, humans can distinguish machine-translated text because they have prior knowledge of what a human-generated sentence would look like, accumulated by observing many examples throughout their lives. [sent-115, score-0.374]

33 By contrasting these feature weights, we can effectively capture phrase salads in the sentence. [sent-122, score-0.351]

34 2 Fluency Feature In a machine-translated sentence, fluency becomes poor among phrases where a phrase salad occurs. [sent-124, score-0.718]

35 We input a sentence into both of the LMs and use the scores as the fluency features. [sent-129, score-0.181]
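
A minimal sketch of this contrast, assuming a toy add-one-smoothed trigram LM in place of the 4-gram SRILM models the paper trains; the NgramLM class, the toy corpora, and the length normalization are illustrative assumptions, not the authors' implementation.

```python
import math
from collections import defaultdict

class NgramLM:
    """Toy add-one-smoothed trigram LM, standing in for the paper's 4-gram SRILM models."""
    def __init__(self, sentences, n=3):
        self.n = n
        self.ngrams = defaultdict(int)
        self.contexts = defaultdict(int)
        self.vocab = set()
        for sent in sentences:
            tokens = ["<s>"] * (n - 1) + sent.split() + ["</s>"]
            self.vocab.update(tokens)
            for i in range(n - 1, len(tokens)):
                context = tuple(tokens[i - n + 1:i])
                self.ngrams[context + (tokens[i],)] += 1
                self.contexts[context] += 1

    def logprob(self, sentence):
        """Length-normalized log-probability of a sentence under this LM."""
        tokens = ["<s>"] * (self.n - 1) + sentence.split() + ["</s>"]
        lp = 0.0
        for i in range(self.n - 1, len(tokens)):
            context = tuple(tokens[i - self.n + 1:i])
            num = self.ngrams[context + (tokens[i],)] + 1           # add-one smoothing
            den = self.contexts[context] + len(self.vocab)
            lp += math.log(num / den)
        return lp / (len(tokens) - self.n + 1)

# One LM is trained on human-generated text and one on machine-translated text;
# the two scores of an input sentence become the fluency features f_w,H and f_w,MT.
human_lm = NgramLM(["we propose a simple method", "the quality is generally low"])
mt_lm = NgramLM(["of surprise was up foreigners flocked overseas as well"])
sentence = "we propose a simple method"
print(human_lm.logprob(sentence), mt_lm.logprob(sentence))
```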

36 3 Grammaticality Feature The grammaticality of a sentence with phrase salads is poor because tense and voice become inconsistent among phrases. [sent-131, score-0.265]

37 In a similar manner to a word-based LM, such grammatical inconsistency among phrases is detectable when computing a POS LM score, since the score becomes worse when an N-gram covers the boundary between phrases where a phrase salad occurs. [sent-133, score-0.606]
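
A sketch of how a sentence can be mapped to the POS-tag sequence that such a grammaticality LM scores; NLTK's tagger is an assumed stand-in, since the text above does not name the tagger or toolkit used in the paper.

```python
import nltk  # assumes the 'punkt' and 'averaged_perceptron_tagger' resources are installed

def pos_sequence(sentence):
    """Map a sentence to its POS-tag sequence; the grammaticality features are
    n-gram LM scores computed over these tag sequences instead of over words."""
    return [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(sentence))]

# As with the word LMs, one POS LM is trained on tag sequences from human-generated
# text and one on machine-translated text, and a sentence's tag sequence is scored
# under both (any n-gram LM, such as the toy NgramLM sketched above, can be reused).
print(pos_sequence("Of surprise was up foreigners flocked overseas as well."))
```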

38 Since a phrase salad may occur among distant phrases of a sentence, it is also effective to evaluate combinations of phrases that cannot be covered by the span of an N-gram. [sent-135, score-0.807]

39 For example, the same preposition rarely appears many times in a human-generated sentence, while it does in a machine-translated sentence due to the phrase salad. [sent-137, score-0.623]

40 When a sentence contains the phrase “not only,” the phrase “but also” is likely to appear in human-generated sentences. [sent-152, score-0.317]

41 Given a set of sequences and a user-specified min support ∈ N threshold, the sequential pattern mining finds all frequent subsequences whose occurrence frequency is no less than min support. [sent-159, score-0.168]
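
A minimal PrefixSpan-style sketch of this definition over word sequences; the max_len cutoff, gap handling, and toy corpus are illustrative assumptions, not the authors' PrefixSpan implementation.

```python
from collections import defaultdict

def prefixspan(sequences, min_support, max_len=4):
    """Return every word subsequence (gaps allowed) occurring in at least
    min_support of the input sequences, up to max_len words long."""
    patterns = {}

    def mine(prefix, projected):
        # projected: one (sequence, start_index) pair per sequence supporting `prefix`
        support = defaultdict(set)
        for idx, (seq, start) in enumerate(projected):
            for item in set(seq[start:]):
                support[item].add(idx)
        for item, idxs in support.items():
            if len(idxs) < min_support:
                continue
            new_prefix = prefix + (item,)
            patterns[new_prefix] = len(idxs)
            if len(new_prefix) < max_len:
                # project each supporting sequence past the first occurrence of `item`
                new_projected = [(projected[i][0], projected[i][0].index(item, projected[i][1]) + 1)
                                 for i in idxs]
                mine(new_prefix, new_projected)

    mine((), [(seq, 0) for seq in sequences])
    return patterns

# Gappy phrases such as ("not", "only", "but", "also") surface as frequent
# subsequences even when other words intervene in the actual sentences.
corpus = [s.split() for s in [
    "not only cheap but also fast",
    "it is not only cheap but it is also quite fast",
    "the method is fast",
]]
for pattern, support in sorted(prefixspan(corpus, min_support=2).items()):
    print(pattern, support)
```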

42 To capture a phrase salad by contrasting appearance of gappy-phrases in human-generated and machine-translated text, we independently extract gappy-phrases from each of them using PrefixSpan. [sent-166, score-0.617]

43 1 that includes 254K human-generated and 134K machine-translated sentences in Japanese, and 210K human-generated and 159K machine-translated sentences in English. [sent-171, score-0.477]

44 We also obtain about 74K and 42K phrases from human-generated and machine-translated sentences in the English dataset (21K of them are common). [sent-179, score-0.405]

45 Therefore, our method selects useful phrases for detecting machine-translated sentences. [sent-182, score-0.22]

46 Specifically, we evaluate gappy-phrases based on information gain, which measures the amount of information in bits obtained for class prediction when knowing the presence or absence of a phrase and the corresponding class distribution. [sent-186, score-0.197]
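
A sketch of this information-gain criterion under its standard definition, IG(phrase) = H(class) − H(class | phrase present/absent); the function names and the hypothetical contingency counts are illustrative, not taken from the paper.

```python
import math

def entropy(pos, neg):
    """Entropy in bits of a binary class distribution given its two counts."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h

def information_gain(pos_with, pos_without, neg_with, neg_without):
    """Bits gained about the class (human-generated vs. machine-translated)
    by knowing whether the phrase is present in a sentence."""
    n_with = pos_with + neg_with
    n_without = pos_without + neg_without
    n = n_with + n_without
    h_class = entropy(pos_with + pos_without, neg_with + neg_without)
    h_given_phrase = (n_with / n) * entropy(pos_with, neg_with) \
                   + (n_without / n) * entropy(pos_without, neg_without)
    return h_class - h_given_phrase

# A phrase seen almost only in machine-translated sentences scores high and is kept;
# a phrase spread evenly over both classes scores near zero and is discarded.
print(information_gain(pos_with=5, pos_without=95, neg_with=80, neg_without=20))
print(information_gain(pos_with=50, pos_without=50, neg_with=50, neg_without=50))
```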

47 Table 2 shows examples of gappy-phrases extracted from human-generated and machine-translated text in our development dataset that remain after feature selection. [sent-194, score-0.407]

48 Table 2: Example of gappy-phrases extracted from human-generated and machine-translated text; phrases preserving semantic meaning are extracted only from human-generated text. [sent-203, score-0.253]

49 The gappy-phrases depend on each other, and the more phrases extracted from human-generated (machine-translated) text are found in a sentence, the more likely the sentence is human-generated (machine-translated). [sent-204, score-0.197]

50 Therefore, we compute the feature as fc(s) = Σ_{i ∈ k} wi δ(i, s), where wi is the weight of the i-th phrase, and δ(i, s) is a Kronecker delta function that takes 1 if the sentence s includes the i-th phrase and 0 otherwise. [sent-205, score-0.245]
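
A minimal sketch of this weighted indicator feature; the contains_gappy helper, its matching rule (words in order with arbitrary gaps), and the example weights are assumptions for illustration, not the paper's exact matching procedure.

```python
def contains_gappy(tokens, phrase):
    """True if the words of `phrase` occur in `tokens` in order,
    allowing arbitrary gaps between them."""
    pos = 0
    for word in phrase:
        try:
            pos = tokens.index(word, pos) + 1
        except ValueError:
            return False
    return True

def completeness_feature(sentence, weighted_phrases):
    """f_c(s) = sum over phrases i of w_i * delta(i, s), where delta(i, s) is 1
    when sentence s contains the i-th gappy phrase and 0 otherwise."""
    tokens = sentence.split()
    return sum(w for phrase, w in weighted_phrases if contains_gappy(tokens, phrase))

# One such feature is computed with phrases mined from human-generated text (f_g,H)
# and one with phrases mined from machine-translated text (f_g,MT); weights are hypothetical.
human_phrases = [(("not", "only", "but", "also"), 1.0), (("as", "soon", "as"), 0.5)]
print(completeness_feature("it is not only cheap but also fast", human_phrases))
```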

51 In addition to the discussed features, we use the length of a sentence as a feature flen to avoid the bias of LM-based features that favor shorter sentences. [sent-210, score-0.24]

52 The proposed method takes a monolingual sentence from Web data as input and computes a feature vector of f = (fw,H, . [sent-211, score-0.232]
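
A sketch of assembling a feature vector and training a binary classifier on it; scikit-learn's LinearSVC and the crude placeholder features are assumptions (the text above only says a classifier is used over features such as fw,H, fw,MT, and flen), so real values would come from the LM, POS-LM, FW-LM, and gappy-phrase features.

```python
from sklearn.svm import LinearSVC  # classifier choice is an assumption; the paper only says "a classifier"

FUNCTION_WORDS = {"of", "was", "up", "as", "not", "only", "the", "a", "is"}

def extract_features(sentence):
    """Stand-in for the paper's feature vector f = (f_w,H, f_w,MT, ..., f_len):
    two crude proxies plus the sentence-length feature keep this sketch self-contained."""
    tokens = sentence.split()
    fw_ratio = sum(t in FUNCTION_WORDS for t in tokens) / len(tokens)
    avg_word_len = sum(len(t) for t in tokens) / len(tokens)
    return [fw_ratio, avg_word_len, len(tokens)]

# Hypothetical labeled data: 1 = machine-translated, 0 = human-generated.
train = [("we propose a simple method for detection", 0),
         ("of surprise was up foreigners flocked overseas as well", 1),
         ("the quality of these sentences is generally much lower", 0),
         ("they publicized not only japan saw an article from the news", 1)]
X = [extract_features(s) for s, _ in train]
y = [label for _, label in train]
classifier = LinearSVC().fit(X, y)
print(classifier.predict([extract_features("was up overseas as well not only japan")]))
```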

53 These sentences should be guaranteed to be human-generated or machine-translated, and the human-generated and machine-translated sentences should express the same content for fairness of evaluation, to avoid effects due to vocabulary differences. [sent-221, score-0.477]

54 As a dataset that meets these requirements, we use parallel text in public websites (this is for fair evaluation; our method can be trained using non-parallel text in an actual deployment). [sent-222, score-0.223]

55 The main textual content of these 131K parallel pages is extracted, and the sentences are aligned using (Ma, 2006). [sent-224, score-0.183]

56 2, the text in one language is fed to the Bing translator, Google Translate, and an in-house SMT system (footnote 4) implemented by ourselves based on (Chiang, 2005) to obtain sentences translated by SMT systems. [sent-226, score-0.205]

57 In this manner, we prepare 508K human-generated and 268K machine-translated sentences as a Japanese dataset, and 420K human-generated and 318K machine-translated sentences as an English dataset. [sent-229, score-0.361]

58 2 Experiment Setting For the fluency and grammaticality features, we train 4-gram LMs using the development dataset with the SRI toolkit (Stolcke, 2002). [sent-232, score-0.234]

59 We evaluate the performance of MT detection based on accuracy (footnote 6), a broadly used evaluation metric for classification problems: accuracy = (nTP + nTN) / n, where nTP and nTN are the numbers of true positives and true negatives, respectively, and n is the total number of exemplars. [sent-243, score-0.219]

60 Additionally, we compare our method to a method that uses a feature indicating presence or absence of unigrams, which we call Lexical Feature. [sent-258, score-0.208]

61 This feature is commonly used for translationese detection and shows the best performance as a single feature in (Baroni and Bernardini, 2005). [sent-259, score-0.242]

62 (2011) and shows the best performance by itself in detecting machine-translated sentences in English-Japanese translation in the setting of bilingual input. [sent-261, score-0.254]

63 1 Accuracy on Japanese Dataset We evaluate the sentence-level and document-level accuracy of our method using the Japanese dataset. [sent-265, score-0.221]

64 Specifically, we evaluate effects of individual features and their combinations, compare with human annotations, and assess performance variations across different sentence lengths and various settings on LM training. [sent-266, score-0.229]

65 Effect of Individual Feature Table 4 shows the accuracy scores of individual features and comparison methods. [sent-267, score-0.175]

66 We refer to features for fluency (fw,H, fw,MT) as Word LMs, grammaticality using POS LMs (fpos,H, fpos,MT) as POS LMs (Table 5: accuracy (%) of the proposed method for each feature combination) [sent-268, score-0.235]

67 and function word LMs (ffw,H, ffw,MT) as FW LMs, respectively, and for completeness of gappy-phrases (fg,H, fg,MT) as GPs. [sent-272, score-0.201]

68 This high accuracy is achieved by contrasting fluency in human-generated and machine-translated text to capture the phrase salad phenomenon. [sent-276, score-0.885]

69 The accuracy of Word LM trained only on human-generated sentences is limited to 65. [sent-277, score-0.383]

70 On the other hand, the accuracy of Word LM trained on machine-translated sentences shows a better performance (84. [sent-279, score-0.226]

71 By combining these into a single feature vector f = (fw,H, fw,MT, flen), the accuracy is largely improved. [sent-281, score-0.176]

72 This is effective for capturing a phrase salad that occurs among distant phrases, which N-gram cannot cover. [sent-285, score-0.578]

73 As for Cross-Entropy, a simple subtraction of cross-entropy scores cannot well contrast the fluency in human-generated and machine-translated text and results in poorer accuracy than Word LMs. [sent-286, score-0.339]

74 Sign tests show that the accuracy scores of these feature combinations are significantly different (p ? [sent-295, score-0.176]

75 The combination of all features reaches an accuracy of 95.8%. [sent-304, score-0.359]

76 This result supports the claim that FW LMs and GPs are effective in capturing a phrase salad occurring in distant phrases and complement the evidence in N-grams captured by LMs. [sent-307, score-0.715]

77 We also evaluate the accuracy of the proposed method at a document level. [sent-309, score-0.221]

78 Due to the high accuracy at the sentence level, we use a voting method to judge a document, i.e. [sent-310, score-0.184]
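
A minimal sketch of this document-level voting, assuming sentence-level labels are already available from the classifier; the majority threshold and tie handling are assumptions, since the exact rule is not given in the text above.

```python
def classify_document(sentence_labels, mt_label=1):
    """Majority vote over sentence-level decisions: a page is judged
    machine-translated when more than half of its sentences are."""
    mt_votes = sum(1 for label in sentence_labels if label == mt_label)
    return mt_label if mt_votes * 2 > len(sentence_labels) else 1 - mt_label

# e.g. a crawled page whose sentences were classified as [1, 1, 0, 1, 0]:
print(classify_document([1, 1, 0, 1, 0]))  # -> 1 (machine-translated)
```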

79 We sample Japanese sentences and ask three native speakers to 1) judge whether a sentence is human-generated or machine-translated and 2) list errors that the sentence contains. [sent-315, score-0.277]

80 Table 6 shows the distribution of errors on machine-translated sentences found by the annotators (on sentences that they correctly classified) with the accuracy of Word LMs and all features on these sentences (a sentence may contain multiple errors). [sent-324, score-0.422]

81 Figure 3: Accuracy (%) across different sentence lengths (the primary axis) and distribution (%) of sentence lengths in the evaluation dataset (the secondary axis); the x-axis is the number of words in a sentence. [sent-325, score-0.572]

82 It indicates that the accuracy of Word LMs is improved by feature combination; from 1. [sent-326, score-0.176]

83 Effect of Sentence Length The accuracy of the proposed method is significantly affected by sentence length (the number of words in a sentence). [sent-329, score-0.253]

84 Figure 3 shows the accuracy of the proposed method (with all features) and comparison methods w.r.t. [sent-331, score-0.184]

85 sentence lengths (with the primary axis), as well as the distribution of sentence lengths in the evaluation dataset (with the secondary axis). [sent-334, score-0.332]

86 sentence lengths, which we obtain for the 700 sentences in the human evaluation. [sent-339, score-0.171]

87 The accuracy drops on all methods when sentences are short; the accuracy of our method is 91. [sent-340, score-0.41]

88 The proposed method shows a similar trend to the human annotations, and even the accuracy of the human annotations significantly drops on such short sentences. [sent-342, score-0.216]

89 Figure 4 shows the accuracy of the LM-based features and feature combination when changing the sizes of N-grams. [sent-350, score-0.227]

90 The performance of Word LMs is stabilized after 3-gram while that of POS LMs is still improved at 4-gram (Figure 4: Effect of the sizes of N-grams on MT detection accuracy (%)). [sent-351, score-0.222]

91 When we change the size of the development dataset in 10% increments, the accuracy curve stabilizes when the size reaches 40% of the full set. [sent-354, score-0.214]

92 The combination of all features achieves the best performance, with an accuracy of 93. [sent-362, score-0.175]

93 To evaluate the accuracy of our method on real Web pages, we conduct experiments using the dataset generated by Rarrick et al. [sent-372, score-0.271]

94 (2011) that contains randomly crawled Web pages annotated by two annotators to judge if a page is human-generated or machine-translated. [sent-373, score-0.232]

95 We use Japanese sentences extracted from 69 pages (43 human-generated and 26 machine-translated pages) where the annotators’ judgments agree; 3,312 sentences consisting of 1,399 machine-translated and 1,913 human-generated sentences. [sent-374, score-0.393]

96 One factor for this performance difference is again sentence lengths, as SMT results of short phrases in Web pages can be of high quality. [sent-384, score-0.197]

97 1 where machine-translated sentences are removed by the proposed method (LM-Proposed), Lexical Feature (LM-LF), and Cross-Entropy (LM-CE), as well as an LM with all sentences, i.e. [sent-392, score-0.435]

98 6 Conclusion We propose a method for detecting machine-translated sentences from monolingual Web-text focusing on the phrase salad phenomenon produced by existing SMT systems. [sent-400, score-1.122]

99 The experimental results show that our method achieves an accuracy of 95.8% for sentences and 80.6% for text in noisy Web pages. [sent-401, score-0.184]

100 We plan to extend our method to detect machine-translated sentences produced by different MT systems, e.g. [sent-404, score-0.162]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('lms', 0.451), ('salad', 0.386), ('machinetranslated', 0.273), ('smt', 0.222), ('rarrick', 0.159), ('humangenerated', 0.157), ('lm', 0.137), ('japanese', 0.132), ('accuracy', 0.124), ('phrase', 0.124), ('fluency', 0.112), ('sentences', 0.102), ('phrases', 0.096), ('apis', 0.093), ('web', 0.086), ('fw', 0.084), ('translationese', 0.08), ('lengths', 0.072), ('grammaticality', 0.072), ('translated', 0.071), ('sentence', 0.069), ('distant', 0.068), ('flen', 0.068), ('prefixspan', 0.068), ('salads', 0.068), ('xg', 0.068), ('contrasting', 0.066), ('esl', 0.066), ('detecting', 0.064), ('sequential', 0.062), ('phenomenon', 0.062), ('pos', 0.061), ('method', 0.06), ('detection', 0.058), ('mt', 0.058), ('gamon', 0.057), ('microsoft', 0.056), ('feature', 0.052), ('axis', 0.052), ('monolingual', 0.051), ('features', 0.051), ('dataset', 0.05), ('gps', 0.05), ('parallel', 0.049), ('summit', 0.048), ('bilingual', 0.047), ('indistinguishable', 0.046), ('antonova', 0.045), ('avramidis', 0.045), ('ctne', 0.045), ('foreigners', 0.045), ('gappyphrases', 0.045), ('ilisei', 0.045), ('imt', 0.045), ('ishisaka', 0.045), ('japanes', 0.045), ('kurokawa', 0.045), ('ntn', 0.045), ('ntp', 0.045), ('uetm', 0.045), ('learners', 0.043), ('nie', 0.043), ('annotators', 0.043), ('effect', 0.043), ('translation', 0.041), ('capture', 0.041), ('cro', 0.041), ('overseas', 0.04), ('boxed', 0.04), ('danling', 0.04), ('haidian', 0.04), ('yuki', 0.04), ('erna', 0.04), ('stabilized', 0.04), ('mining', 0.04), ('google', 0.04), ('lewis', 0.038), ('native', 0.037), ('pei', 0.037), ('parton', 0.037), ('spencer', 0.037), ('evaluate', 0.037), ('absence', 0.036), ('bernardini', 0.035), ('ih', 0.035), ('baroni', 0.035), ('moore', 0.033), ('cat', 0.033), ('pattern', 0.033), ('nakashole', 0.033), ('subsequences', 0.033), ('xii', 0.033), ('text', 0.032), ('annotations', 0.032), ('pages', 0.032), ('services', 0.032), ('kudo', 0.032), ('completeness', 0.032), ('suchanek', 0.032), ('characterizes', 0.032)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 235 acl-2013-Machine Translation Detection from Monolingual Web-Text

Author: Yuki Arase ; Ming Zhou

Abstract: We propose a method for automatically detecting low-quality Web-text translated by statistical machine translation (SMT) systems. We focus on the phrase salad phenomenon that is observed in existing SMT results and propose a set of computationally inexpensive features to effectively detect such machine-translated sentences from a large-scale Web-mined text. Unlike previous approaches that require bilingual data, our method uses only monolingual text as input; therefore it is applicable for refining data produced by a variety of Web-mining activities. Evaluation results show that the proposed method achieves an accuracy of 95.8% for sentences and 80.6% for text in noisy Web pages.

2 0.26447487 35 acl-2013-Adaptation Data Selection using Neural Language Models: Experiments in Machine Translation

Author: Kevin Duh ; Graham Neubig ; Katsuhito Sudoh ; Hajime Tsukada

Abstract: Data selection is an effective approach to domain adaptation in statistical machine translation. The idea is to use language models trained on small in-domain text to select similar sentences from large general-domain corpora, which are then incorporated into the training data. Substantial gains have been demonstrated in previous works, which employ standard ngram language models. Here, we explore the use of neural language models for data selection. We hypothesize that the continuous vector representation of words in neural language models makes them more effective than n-grams for modeling un- known word contexts, which are prevalent in general-domain text. In a comprehensive evaluation of 4 language pairs (English to German, French, Russian, Spanish), we found that neural language models are indeed viable tools for data selection: while the improvements are varied (i.e. 0.1 to 1.7 gains in BLEU), they are fast to train on small in-domain data and can sometimes substantially outperform conventional n-grams.

3 0.14796671 223 acl-2013-Learning a Phrase-based Translation Model from Monolingual Data with Application to Domain Adaptation

Author: Jiajun Zhang ; Chengqing Zong

Abstract: Currently, almost all of the statistical machine translation (SMT) models are trained with the parallel corpora in some specific domains. However, when it comes to a language pair or a different domain without any bilingual resources, the traditional SMT loses its power. Recently, some research works study the unsupervised SMT for inducing a simple word-based translation model from the monolingual corpora. It successfully bypasses the constraint of bitext for SMT and obtains a relatively promising result. In this paper, we take a step forward and propose a simple but effective method to induce a phrase-based model from the monolingual corpora given an automatically-induced translation lexicon or a manually-edited translation dictionary. We apply our method for the domain adaptation task and the extensive experiments show that our proposed method can substantially improve the translation quality. 1

4 0.13518591 34 acl-2013-Accurate Word Segmentation using Transliteration and Language Model Projection

Author: Masato Hagiwara ; Satoshi Sekine

Abstract: Transliterated compound nouns not separated by whitespaces pose difficulty on word segmentation (WS) . Offline approaches have been proposed to split them using word statistics, but they rely on static lexicon, limiting their use. We propose an online approach, integrating source LM, and/or, back-transliteration and English LM. The experiments on Japanese and Chinese WS have shown that the proposed models achieve significant improvement over state-of-the-art, reducing 16% errors in Japanese.

5 0.12598573 68 acl-2013-Bilingual Data Cleaning for SMT using Graph-based Random Walk

Author: Lei Cui ; Dongdong Zhang ; Shujie Liu ; Mu Li ; Ming Zhou

Abstract: The quality of bilingual data is a key factor in Statistical Machine Translation (SMT). Low-quality bilingual data tends to produce incorrect translation knowledge and also degrades translation modeling performance. Previous work often used supervised learning methods to filter lowquality data, but a fair amount of human labeled examples are needed which are not easy to obtain. To reduce the reliance on labeled examples, we propose an unsupervised method to clean bilingual data. The method leverages the mutual reinforcement between the sentence pairs and the extracted phrase pairs, based on the observation that better sentence pairs often lead to better phrase extraction and vice versa. End-to-end experiments show that the proposed method substantially improves the performance in largescale Chinese-to-English translation tasks.

6 0.12160539 255 acl-2013-Name-aware Machine Translation

7 0.12043594 201 acl-2013-Integrating Translation Memory into Phrase-Based Machine Translation during Decoding

8 0.11462605 181 acl-2013-Hierarchical Phrase Table Combination for Machine Translation

9 0.11031114 120 acl-2013-Dirt Cheap Web-Scale Parallel Text from the Common Crawl

10 0.10281213 195 acl-2013-Improving machine translation by training against an automatic semantic frame based evaluation metric

11 0.10059348 374 acl-2013-Using Context Vectors in Improving a Machine Translation System with Bridge Language

12 0.10018808 317 acl-2013-Sentence Level Dialect Identification in Arabic

13 0.098440647 276 acl-2013-Part-of-Speech Induction in Dependency Trees for Statistical Machine Translation

14 0.096453249 289 acl-2013-QuEst - A translation quality estimation framework

15 0.095360801 338 acl-2013-Task Alternation in Parallel Sentence Retrieval for Twitter Translation

16 0.087770097 135 acl-2013-English-to-Russian MT evaluation campaign

17 0.081823364 248 acl-2013-Modelling Annotator Bias with Multi-task Gaussian Processes: An Application to Machine Translation Quality Estimation

18 0.081591584 214 acl-2013-Language Independent Connectivity Strength Features for Phrase Pivot Statistical Machine Translation

19 0.081509233 11 acl-2013-A Multi-Domain Translation Model Framework for Statistical Machine Translation

20 0.080569394 383 acl-2013-Vector Space Model for Adaptation in Statistical Machine Translation


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.246), (1, -0.07), (2, 0.123), (3, 0.046), (4, 0.028), (5, -0.005), (6, -0.012), (7, -0.006), (8, 0.041), (9, 0.021), (10, -0.03), (11, 0.022), (12, -0.033), (13, 0.067), (14, -0.095), (15, 0.041), (16, -0.061), (17, -0.046), (18, -0.074), (19, -0.015), (20, 0.083), (21, -0.038), (22, 0.017), (23, 0.056), (24, 0.091), (25, 0.069), (26, 0.013), (27, -0.035), (28, 0.061), (29, -0.043), (30, -0.091), (31, -0.002), (32, -0.056), (33, -0.076), (34, -0.007), (35, 0.002), (36, 0.054), (37, 0.048), (38, -0.067), (39, -0.033), (40, -0.057), (41, -0.062), (42, 0.085), (43, -0.019), (44, 0.01), (45, 0.059), (46, -0.079), (47, 0.094), (48, 0.008), (49, -0.083)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.91071802 235 acl-2013-Machine Translation Detection from Monolingual Web-Text

Author: Yuki Arase ; Ming Zhou

Abstract: We propose a method for automatically detecting low-quality Web-text translated by statistical machine translation (SMT) systems. We focus on the phrase salad phenomenon that is observed in existing SMT results and propose a set of computationally inexpensive features to effectively detect such machine-translated sentences from a large-scale Web-mined text. Unlike previous approaches that require bilingual data, our method uses only monolingual text as input; therefore it is applicable for refining data produced by a variety of Web-mining activities. Evaluation results show that the proposed method achieves an accuracy of 95.8% for sentences and 80.6% for text in noisy Web pages.

2 0.73831952 35 acl-2013-Adaptation Data Selection using Neural Language Models: Experiments in Machine Translation

Author: Kevin Duh ; Graham Neubig ; Katsuhito Sudoh ; Hajime Tsukada

Abstract: Data selection is an effective approach to domain adaptation in statistical machine translation. The idea is to use language models trained on small in-domain text to select similar sentences from large general-domain corpora, which are then incorporated into the training data. Substantial gains have been demonstrated in previous works, which employ standard ngram language models. Here, we explore the use of neural language models for data selection. We hypothesize that the continuous vector representation of words in neural language models makes them more effective than n-grams for modeling un- known word contexts, which are prevalent in general-domain text. In a comprehensive evaluation of 4 language pairs (English to German, French, Russian, Spanish), we found that neural language models are indeed viable tools for data selection: while the improvements are varied (i.e. 0.1 to 1.7 gains in BLEU), they are fast to train on small in-domain data and can sometimes substantially outperform conventional n-grams.

3 0.69409543 34 acl-2013-Accurate Word Segmentation using Transliteration and Language Model Projection

Author: Masato Hagiwara ; Satoshi Sekine

Abstract: Transliterated compound nouns not separated by whitespaces pose difficulty on word segmentation (WS) . Offline approaches have been proposed to split them using word statistics, but they rely on static lexicon, limiting their use. We propose an online approach, integrating source LM, and/or, back-transliteration and English LM. The experiments on Japanese and Chinese WS have shown that the proposed models achieve significant improvement over state-of-the-art, reducing 16% errors in Japanese.

4 0.68598217 122 acl-2013-Discriminative Approach to Fill-in-the-Blank Quiz Generation for Language Learners

Author: Keisuke Sakaguchi ; Yuki Arase ; Mamoru Komachi

Abstract: We propose discriminative methods to generate semantic distractors of fill-in-theblank quiz for language learners using a large-scale language learners’ corpus. Unlike previous studies, the proposed methods aim at satisfying both reliability and validity of generated distractors; distractors should be exclusive against answers to avoid multiple answers in one quiz, and distractors should discriminate learners’ proficiency. Detailed user evaluation with 3 native and 23 non-native speakers of English shows that our methods achieve better reliability and validity than previous methods.

5 0.66263074 201 acl-2013-Integrating Translation Memory into Phrase-Based Machine Translation during Decoding

Author: Kun Wang ; Chengqing Zong ; Keh-Yih Su

Abstract: Since statistical machine translation (SMT) and translation memory (TM) complement each other in matched and unmatched regions, integrated models are proposed in this paper to incorporate TM information into phrase-based SMT. Unlike previous multi-stage pipeline approaches, which directly merge TM result into the final output, the proposed models refer to the corresponding TM information associated with each phrase at SMT decoding. On a Chinese–English TM database, our experiments show that the proposed integrated Model-III is significantly better than either the SMT or the TM systems when the fuzzy match score is above 0.4. Furthermore, integrated Model-III achieves overall 3.48 BLEU points improvement and 2.62 TER points reduction in comparison with the pure SMT system. Be- . sides, the proposed models also outperform previous approaches significantly.

6 0.66260958 236 acl-2013-Mapping Source to Target Strings without Alignment by Analogical Learning: A Case Study with Transliteration

7 0.65101105 289 acl-2013-QuEst - A translation quality estimation framework

8 0.64454758 110 acl-2013-Deepfix: Statistical Post-editing of Statistical Machine Translation Using Deep Syntactic Analysis

9 0.62038952 214 acl-2013-Language Independent Connectivity Strength Features for Phrase Pivot Statistical Machine Translation

10 0.6202457 8 acl-2013-A Learner Corpus-based Approach to Verb Suggestion for ESL

11 0.61091036 338 acl-2013-Task Alternation in Parallel Sentence Retrieval for Twitter Translation

12 0.60506386 154 acl-2013-Extracting bilingual terminologies from comparable corpora

13 0.60325515 255 acl-2013-Name-aware Machine Translation

14 0.59899604 300 acl-2013-Reducing Annotation Effort for Quality Estimation via Active Learning

15 0.59505463 120 acl-2013-Dirt Cheap Web-Scale Parallel Text from the Common Crawl

16 0.58990824 276 acl-2013-Part-of-Speech Induction in Dependency Trees for Statistical Machine Translation

17 0.58862877 364 acl-2013-Typesetting for Improved Readability using Lexical and Syntactic Information

18 0.58761311 135 acl-2013-English-to-Russian MT evaluation campaign

19 0.5849942 359 acl-2013-Translating Dialectal Arabic to English

20 0.58281666 84 acl-2013-Combination of Recurrent Neural Networks and Factored Language Models for Code-Switching Language Modeling


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.04), (6, 0.05), (11, 0.06), (15, 0.016), (24, 0.042), (26, 0.062), (35, 0.108), (40, 0.236), (42, 0.062), (48, 0.034), (70, 0.046), (88, 0.032), (90, 0.04), (95, 0.092)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.87873065 308 acl-2013-Scalable Modified Kneser-Ney Language Model Estimation

Author: Kenneth Heafield ; Ivan Pouzyrevsky ; Jonathan H. Clark ; Philipp Koehn

Abstract: We present an efficient algorithm to estimate large modified Kneser-Ney models including interpolation. Streaming and sorting enables the algorithm to scale to much larger models by using a fixed amount of RAM and variable amount of disk. Using one machine with 140 GB RAM for 2.8 days, we built an unpruned model on 126 billion tokens. Machine translation experiments with this model show improvement of 0.8 BLEU point over constrained systems for the 2013 Workshop on Machine Translation task in three language pairs. Our algorithm is also faster for small models: we estimated a model on 302 million tokens using 7.7% of the RAM and 14.0% of the wall time taken by SRILM. The code is open source as part of KenLM.

2 0.85607475 94 acl-2013-Coordination Structures in Dependency Treebanks

Author: Martin Popel ; David Marecek ; Jan StÄłpanek ; Daniel Zeman ; ZdÄłnÄłk Zabokrtsky

Abstract: Paratactic syntactic structures are notoriously difficult to represent in dependency formalisms. This has painful consequences such as high frequency of parsing errors related to coordination. In other words, coordination is a pending problem in dependency analysis of natural languages. This paper tries to shed some light on this area by bringing a systematizing view of various formal means developed for encoding coordination structures. We introduce a novel taxonomy of such approaches and apply it to treebanks across a typologically diverse range of 26 languages. In addition, empirical observations on convertibility between selected styles of representations are shown too.

3 0.81769872 163 acl-2013-From Natural Language Specifications to Program Input Parsers

Author: Tao Lei ; Fan Long ; Regina Barzilay ; Martin Rinard

Abstract: We present a method for automatically generating input parsers from English specifications of input file formats. We use a Bayesian generative model to capture relevant natural language phenomena and translate the English specification into a specification tree, which is then translated into a C++ input parser. We model the problem as a joint dependency parsing and semantic role labeling task. Our method is based on two sources of information: (1) the correlation between the text and the specification tree and (2) noisy supervision as determined by the success of the generated C++ parser in reading input examples. Our results show that our approach achieves 80.0% F-Score accu- , racy compared to an F-Score of 66.7% produced by a state-of-the-art semantic parser on a dataset of input format specifications from the ACM International Collegiate Programming Contest (which were written in English for humans with no intention of providing support for automated processing).1

4 0.80944854 260 acl-2013-Nonconvex Global Optimization for Latent-Variable Models

Author: Matthew R. Gormley ; Jason Eisner

Abstract: Many models in NLP involve latent variables, such as unknown parses, tags, or alignments. Finding the optimal model parameters is then usually a difficult nonconvex optimization problem. The usual practice is to settle for local optimization methods such as EM or gradient ascent. We explore how one might instead search for a global optimum in parameter space, using branch-and-bound. Our method would eventually find the global maximum (up to a user-specified ?) if run for long enough, but at any point can return a suboptimal solution together with an upper bound on the global maximum. As an illustrative case, we study a generative model for dependency parsing. We search for the maximum-likelihood model parameters and corpus parse, subject to posterior constraints. We show how to formulate this as a mixed integer quadratic programming problem with nonlinear constraints. We use the Reformulation Linearization Technique to produce convex relaxations during branch-and-bound. Although these techniques do not yet provide a practical solution to our instance of this NP-hard problem, they sometimes find better solutions than Viterbi EM with random restarts, in the same time.

same-paper 5 0.79166114 235 acl-2013-Machine Translation Detection from Monolingual Web-Text

Author: Yuki Arase ; Ming Zhou

Abstract: We propose a method for automatically detecting low-quality Web-text translated by statistical machine translation (SMT) systems. We focus on the phrase salad phenomenon that is observed in existing SMT results and propose a set of computationally inexpensive features to effectively detect such machine-translated sentences from a large-scale Web-mined text. Unlike previous approaches that require bilingual data, our method uses only monolingual text as input; therefore it is applicable for refining data produced by a variety of Web-mining activities. Evaluation results show that the proposed method achieves an accuracy of 95.8% for sentences and 80.6% for text in noisy Web pages.

6 0.72433567 38 acl-2013-Additive Neural Networks for Statistical Machine Translation

7 0.69498557 101 acl-2013-Cut the noise: Mutually reinforcing reordering and alignments for improved machine translation

8 0.64093596 172 acl-2013-Graph-based Local Coherence Modeling

9 0.63911426 18 acl-2013-A Sentence Compression Based Framework to Query-Focused Multi-Document Summarization

10 0.63784897 46 acl-2013-An Infinite Hierarchical Bayesian Model of Phrasal Translation

11 0.6346122 250 acl-2013-Models of Translation Competitions

12 0.63434672 159 acl-2013-Filling Knowledge Base Gaps for Distant Supervision of Relation Extraction

13 0.63317692 312 acl-2013-Semantic Parsing as Machine Translation

14 0.63305169 174 acl-2013-Graph Propagation for Paraphrasing Out-of-Vocabulary Words in Statistical Machine Translation

15 0.63099176 212 acl-2013-Language-Independent Discriminative Parsing of Temporal Expressions

16 0.63067579 207 acl-2013-Joint Inference for Fine-grained Opinion Extraction

17 0.63063526 98 acl-2013-Cross-lingual Transfer of Semantic Role Labeling Models

18 0.6302371 196 acl-2013-Improving pairwise coreference models through feature space hierarchy learning

19 0.629677 194 acl-2013-Improving Text Simplification Language Modeling Using Unsimplified Text Data

20 0.62953496 134 acl-2013-Embedding Semantic Similarity in Tree Kernels for Domain Adaptation of Relation Extraction