acl acl2013 acl2013-68 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Lei Cui ; Dongdong Zhang ; Shujie Liu ; Mu Li ; Ming Zhou
Abstract: The quality of bilingual data is a key factor in Statistical Machine Translation (SMT). Low-quality bilingual data tends to produce incorrect translation knowledge and also degrades translation modeling performance. Previous work often used supervised learning methods to filter low-quality data, but a fair amount of human-labeled examples is needed, which is not easy to obtain. To reduce the reliance on labeled examples, we propose an unsupervised method to clean bilingual data. The method leverages the mutual reinforcement between the sentence pairs and the extracted phrase pairs, based on the observation that better sentence pairs often lead to better phrase extraction and vice versa. End-to-end experiments show that the proposed method substantially improves the performance in large-scale Chinese-to-English translation tasks.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract The quality of bilingual data is a key factor in Statistical Machine Translation (SMT). [sent-4, score-0.261]
2 Low-quality bilingual data tends to produce incorrect translation knowledge and also degrades translation modeling performance. [sent-5, score-0.646]
3 Previous work often used supervised learning methods to filter low-quality data, but a fair amount of human-labeled examples is needed, which is not easy to obtain. [sent-6, score-0.053]
4 To reduce the reliance on labeled examples, we propose an unsupervised method to clean bilingual data. [sent-7, score-0.318]
5 The method leverages the mutual reinforcement between the sentence pairs and the extracted phrase pairs, based on the observation that better sentence pairs often lead to better phrase extraction and vice versa. [sent-8, score-0.96]
6 End-to-end experiments show that the proposed method substantially improves the performance in large-scale Chinese-to-English translation tasks. [sent-9, score-0.153]
7 1 Introduction. Statistical machine translation (SMT) depends on the amount of bilingual data and its quality. [sent-10, score-0.414]
8 In real-world SMT systems, bilingual data is often mined from the web where low-quality data is inevitable. [sent-11, score-0.296]
9 The low-quality bilingual data degrades the quality of word alignment and leads to incorrect phrase pairs, which will hurt the translation performance of phrase-based SMT systems (Koehn et al., 2003). [sent-12, score-0.723]
10 Therefore, it is very important to exploit data quality information to improve the translation modeling. [sent-14, score-0.153]
11 Previous work on bilingual data cleaning often involves some supervised learning methods. [sent-15, score-0.393]
12 Several bilingual data mining systems (Resnik and Smith, 2003) have been developed to mine parallel data from the web. ∗This work was done while the first author was visiting Microsoft Research Asia. [sent-16, score-0.261]
13 Maximum entropy or SVM based classifiers are built to filter out some non-parallel or partially parallel data. [sent-20, score-0.053]
14 Although these methods can filter out some low-quality bilingual data, they need sufficient human-labeled training instances to build the model, which may not be easy to acquire. [sent-21, score-0.314]
15 To this end, we propose an unsupervised approach to clean the bilingual data. [sent-22, score-0.318]
16 It is intuitive that high-quality parallel data tends to produce better phrase pairs than low-quality data. [sent-23, score-0.39]
17 Meanwhile, it is also observed that the phrase pairs that appear frequently in the bilingual corpus are more reliable than less frequent ones because they are more reusable; hence, most good sentence pairs tend to contain more frequent phrase pairs (Foster et al., 2006). [sent-24, score-1.068]
18 This kind of mutual reinforcement fits well into the framework of graph-based random walk. [sent-27, score-0.281]
19 When a phrase pair p is extracted from a sentence pair s, s is considered to cast a vote for p. [sent-28, score-0.447]
20 The higher the number of votes a phrase pair has, the more reliable the phrase pair is. [sent-29, score-0.518]
21 Similarly, the quality of the sentence pair s is determined by the number of votes cast by the phrase pairs extracted from s. [sent-30, score-0.522]
22 In this paper, a PageRank-style random walk algorithm (Brin and Page, 1998; Mihalcea and Tarau, 2004; Wan et al., 2007) [sent-31, score-0.221]
23 is conducted to iteratively compute the importance score of each sentence pair that indicates its quality: the higher the better. [sent-32, score-0.318]
24 Unlike other data filtering methods, our proposed method utilizes the importance scores of sentence pairs as fractional counts to calculate the phrase translation probabilities based on Maximum Likelihood Estimation (MLE), so that none of the bilingual data is filtered out. [sent-33, score-1.225]
25 Experimental results show that our proposed approach substantially improves the performance in large-scale Chinese-to-English translation tasks. [sent-34, score-0.153]
26 2.1 Graph-based random walk. Graph-based random walk is a general algorithm to approximate the importance of a vertex within the graph in a global view. [sent-38, score-0.794]
27 In our method, the vertices denote the sentence pairs and phrase pairs. [sent-39, score-0.503]
28 The importance of each vertex is propagated to other vertices along the edges. [sent-40, score-0.45]
29 Depending on different scenarios, the graph can take directed or undirected, weighted or unweighted forms. [sent-41, score-0.057]
30 Starting from the initial scores assigned in the graph, the algorithm is applied to recursively compute the importance scores of vertices until it converges, or the difference between two consecutive iterations falls below a pre-defined threshold. [sent-42, score-0.421]
31 2.2 Graph construction. Given the sentence pairs that are word-aligned automatically, an undirected, weighted bipartite graph is constructed which maps the sentence pairs and the extracted phrase pairs to the vertices. [sent-44, score-0.791]
32 An edge between a sentence pair vertex and a phrase pair vertex is added if the phrase pair can be extracted from the sentence pair. [sent-45, score-1.2]
33 Mutual reinforcement scores are defined on edges, through which the importance scores are propagated between vertices. [sent-46, score-0.389]
34 Formally, the bipartite graph is defined as G = (V, E), where V = S ∪ P is the vertex set and S = {s_i | 1 ≤ i ≤ n} is the set of all sentence pairs. [sent-48, score-0.276]
35 P = {p_j | 1 ≤ j ≤ m} is the set of all phrase pairs which are extracted from S based on the word alignment. [sent-49, score-0.306]
36 E is the edge set in which the edges are between S and P, thereby E = {⟨s_i, p_j⟩ | s_i ∈ S, p_j ∈ P, φ(s_i, p_j) = 1}. [sent-50, score-0.71]
37 φ(s_i, p_j) = 1 if p_j can be extracted from s_i, and 0 otherwise. The weight r_ij of the edge between s_i and p_j is computed as r_ij = PF(s_i, p_j) × IPF(p_j), [sent-51, score-0.259]
38 where PF(s_i, p_j) is the phrase pair frequency in a sentence pair and IPF(p_j) is the inverse phrase pair frequency of p_j in the whole bilingual corpus. [sent-54, score-1.418]
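A minimal Python sketch of this construction follows; extract_phrase_pairs is a hypothetical stand-in for a standard alignment-based phrase extractor, and the log(n/df) form of IPF is an assumption by analogy with IDF, since the exact IPF formula is not reproduced above.

import math
from collections import defaultdict

def build_graph(sentence_pairs, extract_phrase_pairs):
    """Weighted bipartite graph: r[(i, p)] = PF(s_i, p) * IPF(p)."""
    pf = defaultdict(int)          # (i, phrase pair) -> frequency within s_i
    containing = defaultdict(set)  # phrase pair -> {i : p extracted from s_i}
    for i, s in enumerate(sentence_pairs):
        for p in extract_phrase_pairs(s):  # hypothetical extractor
            pf[(i, p)] += 1
            containing[p].add(i)
    n = len(sentence_pairs)
    # Assumed IPF, analogous to IDF: rarer phrase pairs get larger values.
    ipf = {p: math.log(n / len(ids)) for p, ids in containing.items()}
    return {(i, p): freq * ipf[p] for (i, p), freq in pf.items()}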
39 Following previous work (Mihalcea and Tarau, 2004; Wan et al., 2007), we compute the importance scores of sentence pairs and phrase pairs using a PageRank-style algorithm. [sent-57, score-0.677]
40 Let u(s_i) and v(p_j) denote the scores of a sentence pair vertex and a phrase pair vertex. They are updated iteratively as u(s_i)^(n) = (1 − d) + d · Σ_{j∈N(s_i)} (r_ij / Σ_{k∈M(p_j)} r_kj) · v(p_j)^(n−1), and symmetrically v(p_j)^(n) = (1 − d) + d · Σ_{i∈M(p_j)} (r_ij / Σ_{k∈N(s_i)} r_ik) · u(s_i)^(n). [sent-59, score-0.69]
41 Here d is a damping factor set to 0.85, the same as the original PageRank, N(s_i) = {j | ⟨s_i, p_j⟩ ∈ E}, and M(p_j) = {i | ⟨s_i, p_j⟩ ∈ E}. [sent-61, score-0.288]
42 Algorithm 1 iteratively updates the scores of sentence pairs and phrase pairs (lines 10-26). [sent-64, score-0.599]
43 The computation ends when the difference between two consecutive iterations falls below a pre-defined threshold δ (10^−12 in this study). [sent-65, score-0.066]
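A compact single-machine sketch of the iteration follows, using d = 0.85 and δ = 10^−12 as above; the in-memory dictionaries here stand in for the distributed data structures described in the next subsection.

from collections import defaultdict

def random_walk(r, d=0.85, delta=1e-12, max_iter=10000):
    # r maps edges (i, p) to weights r_ij; returns (u, v) score dictionaries.
    N, M = defaultdict(list), defaultdict(list)
    for (i, p) in r:
        N[i].append(p)   # phrase pairs extracted from s_i
        M[p].append(i)   # sentence pairs containing p_j
    sum_in = {p: sum(r[(k, p)] for k in M[p]) for p in M}    # Σ_{k∈M(p_j)} r_kj
    sum_out = {i: sum(r[(i, q)] for q in N[i]) for i in N}   # Σ_{k∈N(s_i)} r_ik
    u = {i: 1.0 for i in N}
    v = {p: 1.0 for p in M}
    for _ in range(max_iter):
        u_new = {i: (1 - d) + d * sum(r[(i, p)] / sum_in[p] * v[p] for p in N[i])
                 for i in N}
        v_new = {p: (1 - d) + d * sum(r[(i, p)] / sum_out[i] * u_new[i] for i in M[p])
                 for p in M}
        diff = max(abs(u_new[i] - u[i]) for i in u)
        u, v = u_new, v_new
        if diff < delta:
            break
    return u, v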
44 2.4 Parallelization. When the random walk runs on large bilingual corpora, even after filtering phrase pairs that appear only once, it would still require several days of CPU time for a number of iterations. [sent-67, score-0.608]
45 Algorithm 1 (fragment):
for all i ∈ {1 . . . n} do
    F(s_i) ← 0
    for all j ∈ N(s_i) do
        F(s_i) ← F(s_i) + (r_ij / Σ_{k∈M(p_j)} r_kj) · v(p_j)^(n−1)
    end for
    u(s_i)^(n) ← (1 − d) + d · F(s_i)
end for
for all j ∈ {1 . . . m} do . . . [sent-84, score-0.072]
46 Before the iterative computation starts, the sum of the outlink weights for each vertex is computed first. [sent-90, score-0.226]
47 The edges are randomly partitioned into sets of roughly equal size. [sent-91, score-0.073]
48 Each edge ⟨s_i, p_j⟩ can generate two key-value pairs in the format ⟨s_i, r_ij⟩ and ⟨p_j, r_ij⟩. [sent-92, score-0.507]
49 The pairs with the same key are summed locally and accumulated across different machines. [sent-93, score-0.115]
50 Then, in each iteration, the score of each vertex is updated according to the sum of the normalized inlink weights. [sent-94, score-0.181]
51 The key-value pairs are generated in the format ⟨s_i, (r_ij / Σ_{k∈M(p_j)} r_kj) · v(p_j)⟩ and ⟨p_j, (r_ij / Σ_{k∈N(s_i)} r_ik) · u(s_i)⟩. [sent-95, score-0.115]
52 These key-value pairs are also randomly partitioned and summed across different machines. [sent-96, score-0.154]
53 Since long sentence pairs usually extract more phrase pairs, we need to normalize the importance scores based on the sentence length. [sent-97, score-0.642]
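The key-value logic can be sketched as map and reduce functions over a shard of the edge set; this single-process version only illustrates the update for sentence pair vertices (the phrase pair side is symmetric), and dividing by sentence length is one plausible reading of the normalization mentioned above, since its exact form is not given.

from collections import defaultdict

def map_step(edge_shard, r, sum_in, v):
    # Each mapper emits <s_i, (r_ij / Σ_{k∈M(p_j)} r_kj) · v(p_j)> for its shard.
    for (i, p) in edge_shard:
        yield i, r[(i, p)] / sum_in[p] * v[p]

def reduce_step(emitted, d=0.85):
    # Values with the same key are summed locally, then combined across machines.
    acc = defaultdict(float)
    for key, val in emitted:
        acc[key] += val
    return {key: (1 - d) + d * total for key, total in acc.items()}

def length_normalize(u, sentence_lengths):
    # Assumed normalization: scale each score by the sentence pair's length.
    return {i: score / sentence_lengths[i] for i, score in u.items()}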
54 2.5 Integration into translation modeling. After a sufficient number of iterations, the importance scores of sentence pairs (i.e., u(s_i)) are obtained. [sent-100, score-0.524]
55 Instead of simple filtering, we use the scores of sentence pairs as the fractional counts to re-estimate the translation probabilities of phrase pairs. [sent-103, score-0.7]
56 Given a phrase pair p = ⟨f̄, ē⟩, A(f̄) and B(ē) indicate the sets of sentence pairs in which f̄ and ē appear. [sent-104, score-0.279]
57 Then the translation probability is defined as: P_CW(f̄ | ē) = (Σ_{i∈A(f̄)∩B(ē)} u(s_i) × c_i(f̄, ē)) / (Σ_{j∈B(ē)} u(s_j) × c_j(ē)), where c_i(·) denotes the count of the phrase or phrase pair in s_i. [sent-105, score-0.623]
58 P_CW(f̄ | ē) and P_CW(ē | f̄) are named as the Corpus Weighting (CW) based translation probabilities, which are integrated into the log-linear model in addition to the conventional phrase translation probabilities (Koehn et al., 2003). [sent-106, score-0.379]
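A sketch of the fractional-count re-estimation follows; occurrences (sets of sentence-pair indices per phrase) and counts (a per-sentence collections.Counter over phrases and phrase pairs) are hypothetical pre-computed indexes, not part of the paper.

def p_cw(f, e, u, occurrences, counts):
    # P_CW(f|e): sentence scores u(s_i) act as fractional counts in the MLE.
    A, B = occurrences[f], occurrences[e]       # sets of sentence-pair ids
    num = sum(u[i] * counts[i][(f, e)] for i in A & B)  # weighted joint counts
    den = sum(u[j] * counts[j][e] for j in B)           # weighted target counts
    return num / den if den > 0 else 0.0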
59 3.1 Setup. We evaluated our bilingual data cleaning approach on large-scale Chinese-to-English machine translation tasks. [sent-109, score-0.546]
60 The bilingual data we used was mainly mined from the web (Jiang et al., 2009)1, [sent-110, score-0.296]
61 as well as the United Nations parallel corpus released by LDC and the parallel corpus released by the China Workshop on Machine Translation (CWMT), which contain around 30 million sentence pairs in total after removing duplicated ones. [sent-111, score-0.429]
62 A phrase-based decoder was implemented based on inversion transduction grammar (Wu, 1997). [sent-114, score-0.125]
63 The performance of this decoder is similar to the state-of-the-art phrase-based decoder in Moses, but the implementation is more straightforward. [sent-115, score-0.076]
64 We use the following feature functions in the log-linear model: (Footnote 1: Although supervised data cleaning has been done in the post-processing, the corpus still contains a fair amount of noisy data based on our random sampling.) [sent-116, score-0.266]
65 Table 2: BLEU(%) of Chinese-to-English translation tasks on multiple testing datasets (p < 0.05), [sent-127, score-0.153]
66 where "-numberM" denotes we simply filter number million low-scored sentence pairs from the bilingual data and use the others to extract the phrase table. [sent-128, score-0.7]
67 "CW" means the corpus weighting feature, which incorporates sentence scores from random walk as fractional counts to re-estimate the phrase translation probabilities. [sent-129, score-0.771]
68 • phrase translation probabilities and lexical weights in both directions (4 features);
• 5-gram language model with Kneser-Ney smoothing (1 feature);
• lexicalized reordering model (1 feature);
• phrase count and word count (2 features). [sent-130, score-0.644]
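For illustration, the CW probabilities simply enter the log-linear model as two extra feature functions alongside the conventional ones; the feature values and uniform weights in this sketch are made up, and in practice the weights are tuned on development data.

def loglinear_score(feature_values, weights):
    # score(e | f) = Σ_k λ_k · h_k(f, e), with h_k as log-scaled features.
    return sum(weights[k] * h for k, h in feature_values.items())

h = {  # illustrative values only
    "log_p(f|e)": -2.3, "log_p(e|f)": -2.9,      # conventional phrase probabilities
    "log_pcw(f|e)": -2.1, "log_pcw(e|f)": -2.7,  # added Corpus Weighting features
    "log_lm": -5.0, "reordering": -0.4,
    "phrase_count": 1.0, "word_count": 3.0,
}
lam = {k: 1.0 for k in h}  # uniform placeholder weights
print(loglinear_score(h, lam))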
69 The translation model was trained over the bilingual corpus word-aligned by GIZA++ (Och and Ney, 2003) in both directions, and the grow-diag-final heuristic was used to refine the symmetric word alignment. [sent-131, score-0.414]
70 The lexicalized reordering model (Xiong et al., 2006) was trained over the 40% randomly sampled sentence pairs from our parallel data. [sent-135, score-0.279]
71 In the baseline system, the phrase pairs that appear only once in the bilingual data are simply discarded because most of them are noisy. [sent-142, score-0.567]
72 Figure 2: weijing tansuo de xin lingyu (未经 探索 的 新 领域) translated as "uncharted waters" (left) and "unexplored new areas" (right). The left one is the non-literal translation in our bilingual corpus. [sent-148, score-0.549]
73 The right one is the literal translation made by a human for comparison. [sent-149, score-0.194]
74 The results show that the "leaving-one-out" method performs almost the same as our baseline, and thereby cannot bring further benefits to the system. [sent-150, score-0.069]
75 3.3 Results. We evaluate the proposed bilingual data cleaning method by incorporating sentence scores into translation modeling. [sent-152, score-0.688]
76 In addition, we also compare with several settings that filter low-quality sentence pairs from the bilingual data based on the importance scores. [sent-153, score-0.651]
77 The N = {0.5M, 1M} lowest-scored sentence pairs are filtered out before training. [sent-156, score-0.195]
78 Although this simple bilingual data filtering can improve the performance on some datasets, it is difficult to determine the border line, and the translation performance fluctuates. [sent-158, score-0.234]
79 One main reason is that, in the proposed random walk approach, bilingual sentence pairs with non-literal translations may get lower scores because they appear less frequently compared with literal translations. [sent-159, score-0.78]
80 Crudely filtering out these data may degrade the translation performance. [sent-160, score-0.275]
81 For example, we have a sentence pair in the bilingual corpus shown in the left part of Figure 2. [sent-161, score-0.429]
82 Although the translation is correct in this situation, translating the Chinese word "lingyu" to "waters" appears very few times, since the common translations are "areas" or "fields". [sent-162, score-0.153]
83 However, simply filtering out this kind of sentence pair may lead to some loss of native English expressions, so the translation performance is unstable, since both non-parallel sentence pairs and non-literal but parallel sentence pairs are filtered. [sent-163, score-0.819]
84 Therefore, we use the importance score of each sentence pair to estimate the phrase translation probabilities. [sent-164, score-0.626]
85 It consistently brings substantial improvements compared to the baseline, which demonstrates that graph-based random walk indeed improves the translation modeling performance of our SMT system. [sent-165, score-0.374]
86 In (Goutte et al., 2012), phrase-based SMT systems trained on parallel data with different proportions of synthetic noisy data were evaluated. [sent-168, score-0.161]
87 They suggested that when collecting larger, noisy parallel data for training phrase-based SMT, cleaning up by trying to detect and remove incorrect alignments can actually degrade performance. [sent-169, score-0.366]
88 In our method as well, filtering noisy data sometimes leads to unexpected results. [sent-171, score-0.158]
89 The reason is two-fold: on the one hand, the non-literal parallel data produces false positives in noisy data detection; on the other hand, large-scale SMT systems are relatively robust and tolerant to noisy data, especially when we remove frequency-1 phrase pairs. [sent-172, score-0.238]
90 Therefore, we propose to integrate the importance scores when re-estimating phrase pair probabilities in this paper. [sent-173, score-0.49]
91 The importance scores can be considered as a kind of contribution constraint, so that high-quality parallel data contributes more while noisy parallel data contributes less. [sent-174, score-0.556]
92 4 Conclusion and Future Work. In this paper, we develop an effective approach to clean the bilingual data using graph-based random walk. [sent-175, score-0.375]
93 For future work, we will extend our method to explore the relationships of sentence-to-sentence and phrase-to-phrase, which is beyond the existing sentence-to-phrase mutual reinforcement. [sent-177, score-0.075]
94 Cyril Goutte, Marine Carpuat, and George Foster. 2012. The impact of sentence alignment errors on phrase-based machine translation performance. [sent-195, score-0.272]
95 Long Jiang, Shiquan Yang, Ming Zhou, Xiaohua Liu, and Qingsheng Zhu. 2009. Mining bilingual data from the web with adaptively learnt patterns. [sent-200, score-0.261]
96 Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. [sent-224, score-0.039]
97 Lei Shi, Cheng Niu, Ming Zhou, and Jianfeng Gao. 2006. A DOM tree alignment model for mining parallel data from the web. [sent-246, score-0.123]
98 Xiaojun Wan, Jianwu Yang, and Jianguo Xiao. 2007. Towards an iterative reinforcement approach for simultaneous document summarization and keyword extraction. [sent-251, score-0.158]
99 Dekai Wu. 1997. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. [sent-256, score-0.432]
100 Deyi Xiong, Qun Liu, and Shouxun Lin. 2006. Maximum entropy based phrase reordering model for statistical machine translation. [sent-265, score-0.229]
wordName wordTfidf (topN-words)
[('pj', 0.431), ('bilingual', 0.261), ('si', 0.259), ('phrase', 0.191), ('vertex', 0.181), ('walk', 0.164), ('translation', 0.153), ('hsi', 0.145), ('pji', 0.144), ('cleaning', 0.132), ('pk', 0.12), ('vertices', 0.117), ('pairs', 0.115), ('importance', 0.114), ('reinforcement', 0.113), ('jpj', 0.108), ('jsi', 0.108), ('mri', 0.108), ('nri', 0.108), ('wuebker', 0.108), ('smt', 0.099), ('pair', 0.088), ('parallel', 0.084), ('filtering', 0.081), ('sentence', 0.08), ('brin', 0.079), ('noisy', 0.077), ('mutual', 0.075), ('hpj', 0.072), ('ipf', 0.072), ('lingyu', 0.072), ('riji', 0.072), ('rkj', 0.072), ('thereby', 0.069), ('och', 0.066), ('fractional', 0.064), ('waters', 0.063), ('rik', 0.063), ('scores', 0.062), ('graph', 0.057), ('random', 0.057), ('clean', 0.057), ('wan', 0.057), ('munteanu', 0.055), ('dekai', 0.055), ('filter', 0.053), ('tarau', 0.052), ('foster', 0.05), ('cw', 0.048), ('mapreduce', 0.048), ('votes', 0.048), ('dean', 0.047), ('degrades', 0.047), ('goutte', 0.047), ('ming', 0.046), ('australia', 0.046), ('franz', 0.045), ('josef', 0.045), ('iterative', 0.045), ('transduction', 0.044), ('mihalcea', 0.044), ('sydney', 0.044), ('xiong', 0.043), ('inversion', 0.043), ('harbin', 0.043), ('literal', 0.041), ('degrade', 0.041), ('koehn', 0.04), ('association', 0.04), ('partitioned', 0.039), ('undirected', 0.039), ('alignment', 0.039), ('bipartite', 0.038), ('propagated', 0.038), ('jiang', 0.038), ('decoder', 0.038), ('reordering', 0.038), ('zhou', 0.037), ('end', 0.036), ('fits', 0.036), ('smoothing', 0.036), ('lei', 0.036), ('stand', 0.036), ('iteratively', 0.036), ('hermann', 0.035), ('mined', 0.035), ('probabilities', 0.035), ('spain', 0.035), ('shi', 0.035), ('edges', 0.034), ('consecutive', 0.034), ('ldc', 0.034), ('bleu', 0.033), ('contributes', 0.033), ('released', 0.033), ('edge', 0.032), ('al', 0.032), ('incorrect', 0.032), ('iterations', 0.032), ('wu', 0.032)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 68 acl-2013-Bilingual Data Cleaning for SMT using Graph-based Random Walk
Author: Lei Cui ; Dongdong Zhang ; Shujie Liu ; Mu Li ; Ming Zhou
Abstract: The quality of bilingual data is a key factor in Statistical Machine Translation (SMT). Low-quality bilingual data tends to produce incorrect translation knowledge and also degrades translation modeling performance. Previous work often used supervised learning methods to filter low-quality data, but a fair amount of human-labeled examples is needed, which is not easy to obtain. To reduce the reliance on labeled examples, we propose an unsupervised method to clean bilingual data. The method leverages the mutual reinforcement between the sentence pairs and the extracted phrase pairs, based on the observation that better sentence pairs often lead to better phrase extraction and vice versa. End-to-end experiments show that the proposed method substantially improves the performance in large-scale Chinese-to-English translation tasks.
2 0.24811092 223 acl-2013-Learning a Phrase-based Translation Model from Monolingual Data with Application to Domain Adaptation
Author: Jiajun Zhang ; Chengqing Zong
Abstract: Currently, almost all of the statistical machine translation (SMT) models are trained with the parallel corpora in some specific domains. However, when it comes to a language pair or a different domain without any bilingual resources, the traditional SMT loses its power. Recently, some research works study the unsupervised SMT for inducing a simple word-based translation model from the monolingual corpora. It successfully bypasses the constraint of bitext for SMT and obtains a relatively promising result. In this paper, we take a step forward and propose a simple but effective method to induce a phrase-based model from the monolingual corpora given an automatically-induced translation lexicon or a manually-edited translation dictionary. We apply our method for the domain adaptation task and the extensive experiments show that our proposed method can substantially improve the translation quality. 1
3 0.19745395 181 acl-2013-Hierarchical Phrase Table Combination for Machine Translation
Author: Conghui Zhu ; Taro Watanabe ; Eiichiro Sumita ; Tiejun Zhao
Abstract: Typical statistical machine translation systems are batch trained with a given training data and their performances are largely influenced by the amount of data. With the growth of the available data across different domains, it is computationally demanding to perform batch training every time when new data comes. In face of the problem, we propose an efficient phrase table combination method. In particular, we train a Bayesian phrasal inversion transduction grammars for each domain separately. The learned phrase tables are hierarchically combined as if they are drawn from a hierarchical Pitman-Yor process. The performance measured by BLEU is at least as comparable to the traditional batch training method. Furthermore, each phrase table is trained separately in each domain, and while computational overhead is significantly reduced by training them in parallel.
4 0.13721265 120 acl-2013-Dirt Cheap Web-Scale Parallel Text from the Common Crawl
Author: Jason R. Smith ; Herve Saint-Amand ; Magdalena Plamada ; Philipp Koehn ; Chris Callison-Burch ; Adam Lopez
Abstract: Parallel text is the fuel that drives modern machine translation systems. The Web is a comprehensive source of preexisting parallel text, but crawling the entire web is impossible for all but the largest companies. We bring web-scale parallel text to the masses by mining the Common Crawl, a public Web crawl hosted on Amazon’s Elastic Cloud. Starting from nothing more than a set of common two-letter language codes, our open-source extension of the STRAND algorithm mined 32 terabytes of the crawl in just under a day, at a cost of about $500. Our large-scale experiment uncovers large amounts of parallel text in dozens of language pairs across a variety of domains and genres, some previously unavailable in curated datasets. Even with minimal cleaning and filtering, the resulting data boosts translation performance across the board for five different language pairs in the news domain, and on open domain test sets we see improvements of up to 5 BLEU. We make our code and data available for other researchers seeking to mine this rich new data resource.1
5 0.12678255 11 acl-2013-A Multi-Domain Translation Model Framework for Statistical Machine Translation
Author: Rico Sennrich ; Holger Schwenk ; Walid Aransa
Abstract: While domain adaptation techniques for SMT have proven to be effective at improving translation quality, their practicality for a multi-domain environment is often limited because of the computational and human costs of developing and maintaining multiple systems adapted to different domains. We present an architecture that delays the computation of translation model features until decoding, allowing for the application of mixture-modeling techniques at decoding time. We also describe a method for unsupervised adaptation with development and test data from multiple domains. Experimental results on two language pairs demonstrate the effectiveness of both our translation model architecture and automatic clustering, with gains of up to 1 BLEU over unadapted systems and single-domain adaptation.
6 0.12669274 259 acl-2013-Non-Monotonic Sentence Alignment via Semisupervised Learning
7 0.12598573 235 acl-2013-Machine Translation Detection from Monolingual Web-Text
8 0.11902806 40 acl-2013-Advancements in Reordering Models for Statistical Machine Translation
9 0.11891647 74 acl-2013-Building Comparable Corpora Based on Bilingual LDA Model
10 0.11707742 374 acl-2013-Using Context Vectors in Improving a Machine Translation System with Bridge Language
11 0.11203414 338 acl-2013-Task Alternation in Parallel Sentence Retrieval for Twitter Translation
12 0.11166204 174 acl-2013-Graph Propagation for Paraphrasing Out-of-Vocabulary Words in Statistical Machine Translation
13 0.11091966 93 acl-2013-Context Vector Disambiguation for Bilingual Lexicon Extraction from Comparable Corpora
14 0.10919435 47 acl-2013-An Information Theoretic Approach to Bilingual Word Clustering
15 0.10846776 255 acl-2013-Name-aware Machine Translation
16 0.10814913 10 acl-2013-A Markov Model of Machine Translation using Non-parametric Bayesian Inference
17 0.10659707 38 acl-2013-Additive Neural Networks for Statistical Machine Translation
18 0.10653789 201 acl-2013-Integrating Translation Memory into Phrase-Based Machine Translation during Decoding
19 0.1062914 154 acl-2013-Extracting bilingual terminologies from comparable corpora
20 0.10526869 46 acl-2013-An Infinite Hierarchical Bayesian Model of Phrasal Translation
topicId topicWeight
[(0, 0.245), (1, -0.135), (2, 0.213), (3, 0.089), (4, 0.032), (5, 0.024), (6, -0.033), (7, 0.025), (8, -0.023), (9, -0.016), (10, 0.041), (11, -0.013), (12, -0.009), (13, 0.049), (14, 0.062), (15, 0.026), (16, 0.006), (17, -0.019), (18, -0.042), (19, 0.012), (20, -0.014), (21, -0.064), (22, 0.047), (23, 0.014), (24, -0.012), (25, 0.014), (26, -0.035), (27, 0.103), (28, 0.03), (29, 0.046), (30, -0.022), (31, -0.044), (32, -0.041), (33, 0.038), (34, 0.045), (35, -0.006), (36, -0.011), (37, -0.073), (38, -0.04), (39, -0.091), (40, 0.095), (41, -0.017), (42, -0.012), (43, 0.001), (44, 0.013), (45, -0.025), (46, 0.041), (47, -0.014), (48, -0.012), (49, -0.061)]
simIndex simValue paperId paperTitle
same-paper 1 0.96562004 68 acl-2013-Bilingual Data Cleaning for SMT using Graph-based Random Walk
3 0.81936276 174 acl-2013-Graph Propagation for Paraphrasing Out-of-Vocabulary Words in Statistical Machine Translation
Author: Majid Razmara ; Maryam Siahbani ; Reza Haffari ; Anoop Sarkar
Abstract: Out-of-vocabulary (oov) words or phrases still remain a challenge in statistical machine translation especially when a limited amount of parallel text is available for training or when there is a domain shift from training data to test data. In this paper, we propose a novel approach to finding translations for oov words. We induce a lexicon by constructing a graph on source language monolingual text and employ a graph propagation technique in order to find translations for all the source language phrases. Our method differs from previous approaches by adopting a graph propagation approach that takes into account not only one-step (from oov directly to a source language phrase that has a translation) but multi-step paraphrases from oov source language words to other source language phrases and eventually to target language translations. Experimental results show that our graph propagation method significantly improves performance over two strong baselines under intrinsic and extrinsic evaluation metrics.
4 0.80888158 181 acl-2013-Hierarchical Phrase Table Combination for Machine Translation
5 0.8075974 201 acl-2013-Integrating Translation Memory into Phrase-Based Machine Translation during Decoding
Author: Kun Wang ; Chengqing Zong ; Keh-Yih Su
Abstract: Since statistical machine translation (SMT) and translation memory (TM) complement each other in matched and unmatched regions, integrated models are proposed in this paper to incorporate TM information into phrase-based SMT. Unlike previous multi-stage pipeline approaches, which directly merge TM result into the final output, the proposed models refer to the corresponding TM information associated with each phrase at SMT decoding. On a Chinese–English TM database, our experiments show that the proposed integrated Model-III is significantly better than either the SMT or the TM systems when the fuzzy match score is above 0.4. Furthermore, integrated Model-III achieves overall 3.48 BLEU points improvement and 2.62 TER points reduction in comparison with the pure SMT system. Besides, the proposed models also outperform previous approaches significantly.
6 0.79013664 374 acl-2013-Using Context Vectors in Improving a Machine Translation System with Bridge Language
7 0.74683267 338 acl-2013-Task Alternation in Parallel Sentence Retrieval for Twitter Translation
8 0.74646229 10 acl-2013-A Markov Model of Machine Translation using Non-parametric Bayesian Inference
9 0.74609733 255 acl-2013-Name-aware Machine Translation
10 0.74090934 120 acl-2013-Dirt Cheap Web-Scale Parallel Text from the Common Crawl
11 0.73934805 154 acl-2013-Extracting bilingual terminologies from comparable corpora
12 0.72342962 307 acl-2013-Scalable Decipherment for Machine Translation via Hash Sampling
13 0.70974594 214 acl-2013-Language Independent Connectivity Strength Features for Phrase Pivot Statistical Machine Translation
14 0.69794112 92 acl-2013-Context-Dependent Multilingual Lexical Lookup for Under-Resourced Languages
15 0.69500154 11 acl-2013-A Multi-Domain Translation Model Framework for Statistical Machine Translation
16 0.69217873 77 acl-2013-Can Markov Models Over Minimal Translation Units Help Phrase-Based SMT?
17 0.68046254 355 acl-2013-TransDoop: A Map-Reduce based Crowdsourced Translation for Complex Domain
18 0.67762214 69 acl-2013-Bilingual Lexical Cohesion Trigger Model for Document-Level Machine Translation
19 0.67187661 64 acl-2013-Automatically Predicting Sentence Translation Difficulty
20 0.67083067 383 acl-2013-Vector Space Model for Adaptation in Statistical Machine Translation
topicId topicWeight
[(0, 0.049), (6, 0.054), (11, 0.087), (13, 0.141), (14, 0.015), (15, 0.011), (24, 0.066), (26, 0.034), (35, 0.046), (40, 0.012), (42, 0.082), (48, 0.048), (56, 0.018), (64, 0.015), (70, 0.053), (77, 0.012), (88, 0.029), (90, 0.067), (95, 0.09)]
simIndex simValue paperId paperTitle
1 0.90357888 386 acl-2013-What causes a causal relation? Detecting Causal Triggers in Biomedical Scientific Discourse
Author: Claudiu Mihaila ; Sophia Ananiadou
Abstract: Current domain-specific information extraction systems represent an important resource for biomedical researchers, who need to process vaster amounts of knowledge in short times. Automatic discourse causality recognition can further improve their workload by suggesting possible causal connections and aiding in the curation of pathway models. We here describe an approach to the automatic identification of discourse causality triggers in the biomedical domain using machine learning. We create several baselines and experiment with various parameter settings for three algorithms, i.e., Conditional Random Fields (CRF), Support Vector Machines (SVM) and Random Forests (RF). Also, we evaluate the impact of lexical, syntactic and semantic features on each of the algorithms and look at errors. The best performance of 79.35% F-score is achieved by CRFs when using all three feature types.
2 0.88424349 186 acl-2013-Identifying English and Hungarian Light Verb Constructions: A Contrastive Approach
Author: Veronika Vincze ; Istvan Nagy T. ; Richard Farkas
Abstract: Here, we introduce a machine learningbased approach that allows us to identify light verb constructions (LVCs) in Hungarian and English free texts. We also present the results of our experiments on the SzegedParalellFX English–Hungarian parallel corpus where LVCs were manually annotated in both languages. With our approach, we were able to contrast the performance of our method and define language-specific features for these typologically different languages. Our presented method proved to be sufficiently robust as it achieved approximately the same scores on the two typologically different languages.
same-paper 3 0.87122422 68 acl-2013-Bilingual Data Cleaning for SMT using Graph-based Random Walk
4 0.86746866 322 acl-2013-Simple, readable sub-sentences
Author: Sigrid Klerke ; Anders Sgaard
Abstract: We present experiments using a new unsupervised approach to automatic text simplification, which builds on sampling and ranking via a loss function informed by readability research. The main idea is that a loss function can distinguish good simplification candidates among randomly sampled sub-sentences of the input sentence. Our approach is rated as equally grammatical and beginner reader appropriate as a supervised SMT-based baseline system by native speakers, but our setup performs more radical changes that better resembles the variation observed in human generated simplifications.
5 0.7861734 155 acl-2013-Fast and Accurate Shift-Reduce Constituent Parsing
Author: Muhua Zhu ; Yue Zhang ; Wenliang Chen ; Min Zhang ; Jingbo Zhu
Abstract: Shift-reduce dependency parsers give comparable accuracies to their chartbased counterparts, yet the best shiftreduce constituent parsers still lag behind the state-of-the-art. One important reason is the existence of unary nodes in phrase structure trees, which leads to different numbers of shift-reduce actions between different outputs for the same input. This turns out to have a large empirical impact on the framework of global training and beam search. We propose a simple yet effective extension to the shift-reduce process, which eliminates size differences between action sequences in beam-search. Our parser gives comparable accuracies to the state-of-the-art chart parsers. With linear run-time complexity, our parser is over an order of magnitude faster than the fastest chart parser.
6 0.78303576 226 acl-2013-Learning to Prune: Context-Sensitive Pruning for Syntactic MT
7 0.77136421 223 acl-2013-Learning a Phrase-based Translation Model from Monolingual Data with Application to Domain Adaptation
8 0.76556861 123 acl-2013-Discriminative Learning with Natural Annotations: Word Segmentation as a Case Study
9 0.765558 259 acl-2013-Non-Monotonic Sentence Alignment via Semisupervised Learning
10 0.76476371 132 acl-2013-Easy-First POS Tagging and Dependency Parsing with Beam Search
11 0.7645911 174 acl-2013-Graph Propagation for Paraphrasing Out-of-Vocabulary Words in Statistical Machine Translation
12 0.76454151 333 acl-2013-Summarization Through Submodularity and Dispersion
13 0.76433694 137 acl-2013-Enlisting the Ghost: Modeling Empty Categories for Machine Translation
14 0.76248038 80 acl-2013-Chinese Parsing Exploiting Characters
15 0.76181227 18 acl-2013-A Sentence Compression Based Framework to Query-Focused Multi-Document Summarization
16 0.76149756 38 acl-2013-Additive Neural Networks for Statistical Machine Translation
17 0.75940216 181 acl-2013-Hierarchical Phrase Table Combination for Machine Translation
18 0.75890642 83 acl-2013-Collective Annotation of Linguistic Resources: Basic Principles and a Formal Model
19 0.75804454 156 acl-2013-Fast and Adaptive Online Training of Feature-Rich Translation Models
20 0.75797999 251 acl-2013-Mr. MIRA: Open-Source Large-Margin Structured Learning on MapReduce