emnlp emnlp2011 emnlp2011-25 knowledge-graph by maker-knowledge-mining

25 emnlp-2011-Cache-based Document-level Statistical Machine Translation


Source: pdf

Author: Zhengxian Gong ; Min Zhang ; Guodong Zhou

Abstract: Statistical machine translation systems are usually trained on a large amount of bilingual sentence pairs and translate one sentence at a time, ignoring document-level information. In this paper, we propose a cache-based approach to document-level translation. Since caches mainly depend on relevant data to supervise subsequent decisions, it is critical to fill the caches with highly-relevant data of a reasonable size. In this paper, we present three kinds of caches to store relevant document-level information: 1) a dynamic cache, which stores bilingual phrase pairs from the best translation hypotheses of previous sentences in the test document; 2) a static cache, which stores relevant bilingual phrase pairs extracted from similar bilingual document pairs (i.e. source documents similar to the test document and their corresponding target documents) in the training parallel corpus; 3) a topic cache, which stores the target-side topic words related to the source-side test document. In particular, three new features are designed to explore various kinds of document-level information in the above three kinds of caches. Evaluation shows the effectiveness of our cache-based approach to document-level translation, with a performance improvement of 0.81 in BLEU score over Moses. In addition, detailed analysis and discussion are presented to give new insights into document-level translation.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Statistical machine translation systems are usually trained on a large amount of bilingual sentence pairs and translate one sentence at a time, ignoring document-level information. [sent-3, score-0.53]

2 Since caches mainly depend on relevant data to supervise subsequent decisions, it is critical to fill the caches with highly-relevant data of a reasonable size. [sent-5, score-0.569]

3 source documents similar to the test document and their corresponding target documents) in the training parallel corpus; 3) a topic cache, which stores the target-side topic words related to the source-side test document. [sent-8, score-0.771]

4 In particular, three new features are designed to explore various kinds of document-level information in the above three kinds of caches. [sent-9, score-0.056]

5 Evaluation shows the effectiveness of our cache-based approach to document-level translation, with a performance improvement of 0.81 in BLEU score over Moses. [sent-10, score-0.169]

6 In addition, detailed analysis and discussion are presented to give new insights into document-level translation. [sent-12, score-0.041]

7 Bond (2002) suggested nine ways to improve machine translation by imitating the best practices of human translators (Nida, 1964), with parsing the entire document before translation as the first priority. [sent-19, score-0.579]

8 However, most SMT systems still treat parallel corpora as a list of independent sentence pairs and ignore document-level information. [sent-20, score-0.066]

9 Document-level information can and should be used to help document-level machine translation. [sent-21, score-0.034]

10 At least, the topic of a document can help choose specific translation candidates, since, when taken out of their document context, some words, phrases and even sentences may be rather ambiguous and thus difficult to understand. [sent-22, score-0.483]

11 Another advantage of document-level machine translation is its ability to keep translations consistent. [sent-23, score-0.169]

12 However, document-level translation has drawn little attention from the SMT research community. [sent-24, score-0.169]

13 First of all, most parallel corpora lack the annotation of document boundaries (Tam, 2007). [sent-26, score-0.22]

14 Thirdly, reference translations of a test document written by human translators tend to have flexible expressions in order to avoid producing monotonous texts. [sent-28, score-0.266]

15 Tiedemann (2010) showed that repetition and consistency are very important when modeling natural language and translation. [sent-30, score-0.028]

16 He proposed to employ cache-based language and translation models in a phrase-based SMT system for domain adaptation. [sent-31, score-0.169]

17 In particular, the cache in the translation model grows dynamically by adding bilingual phrase pairs from the best translation hypotheses of previous sentences. [sent-34, score-1.592]

18 One problem with the dynamic cache is that the initial sentences in a test document may not benefit from it. [sent-35, score-1.163]

19 Another problem is that the dynamic cache may be prone to noise and cause error propagation. [sent-36, score-0.88]

20 This explains why the dynamic cache fails to improve performance much. [sent-37, score-0.836]

21 This paper proposes a cache-based approach for document-level SMT using a static cache and a dynamic cache. [sent-38, score-1.123]

22 In particular, the static cache is employed to store relevant bilingual phrase pairs extracted from similar bilingual document pairs (i. [sent-40, score-2.062]

23 source documents similar to the test document and their target counterparts) in the training parallel corpus while the dynamic cache is employed to store bilingual phrase pairs from the best translation hypotheses of previous sentences in the test document. [sent-42, score-1.934]

24 In this way, our cache-based approach can provide useful data at the beginning of the translation process via the static cache. [sent-43, score-0.481]

25 As the translation process continues, the dynamic cache grows and contributes more and more to the translation of subsequent sentences. [sent-44, score-1.329]
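
To make this interplay concrete, here is a minimal Python sketch of the two caches (not the authors' implementation; the class and method names are hypothetical): a phrase-pair store that is seeded from the static cache and then grows as each sentence of the test document is decoded.

    class PhrasePairCache:
        """Sketch of the static/dynamic caches of bilingual phrase pairs.

        Static entries are extracted once from similar bilingual document
        pairs retrieved from the training corpus; dynamic entries are added
        from the best translation hypothesis of each decoded sentence.
        """

        def __init__(self, static_pairs):
            # static cache: fixed for the whole test document
            self.static = set(static_pairs)
            # dynamic cache: (source, target) pairs in decoding order
            self.dynamic = []

        def update(self, best_hypothesis_pairs):
            # called after each sentence is decoded: the dynamic cache grows
            self.dynamic.extend(best_hypothesis_pairs)

        def hit(self, src, tgt):
            # a decoder feature can fire whenever a candidate pair is cached
            return (src, tgt) in self.static or (src, tgt) in self.dynamic

At the start of a document only the static cache can fire; as decoding proceeds, the dynamic cache contributes more and more, mirroring the behavior described above.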

26 Our motivation to employ similar bilingual document pairs in the training parallel corpus is simple: a human translator often collects similar bilingual document pairs to help translation. [sent-45, score-1.208]

27 If there are translation pairs of sentences/phrases/words in similar bilingual document pairs, this makes the translation much easier. [sent-46, score-0.853]

28 Given a test document, our approach imitates this procedure by first retrieving similar bilingual document pairs from the training parallel corpus, which has often been applied in IR-based adaptation of SMT systems (Zhao et al. [sent-47, score-0.601]

29 2007) and then extracting bilingual phrase pairs from similar bilingual document pairs to store them in a static cache. [sent-50, score-1.332]
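
As an illustration of this retrieval step (a sketch assuming plain tf-idf cosine similarity; the paper only states that IR-style retrieval over the training corpus is used), one can pick the most similar source document as follows:

    import math
    from collections import Counter

    def tfidf_vectors(docs):
        """docs: list of token lists; returns one tf-idf dict per document."""
        df = Counter()
        for doc in docs:
            df.update(set(doc))
        n = len(docs)
        return [{w: tf * math.log(n / df[w]) for w, tf in Counter(doc).items()}
                for doc in docs]

    def cosine(u, v):
        dot = sum(x * v.get(w, 0.0) for w, x in u.items())
        nu = math.sqrt(sum(x * x for x in u.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    def most_similar_doc(test_doc, train_docs):
        # share the idf statistics between the test and training documents
        vecs = tfidf_vectors(train_docs + [test_doc])
        test_vec = vecs[-1]
        return max(range(len(train_docs)), key=lambda i: cosine(test_vec, vecs[i]))

The target side of the retrieved document pair then supplies the phrase pairs for the static cache (and, below, the topic words for the topic cache).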

30 However, such a cache-based approach may introduce many noisy/unnecessary bilingual phrase pairs in both the static and dynamic caches. [sent-51, score-0.907]

31 In order to resolve this problem, this paper employs a topic model to weaken those noisy/unnecessary bilingual phrase pairs by recommending that the decoder choose the most likely phrase pairs according to the topic words extracted from the target-side text of similar bilingual document pairs. [sent-52, score-1.328]

32 Just as a human translator, even with a big bilingual dictionary, is often confused when meeting a source phrase that corresponds to several possible translations. [sent-53, score-0.421]

33 In this case, some topic words can help reduce the perplexity. [sent-54, score-0.16]

34 In this paper, the topic words are stored in a topic cache. [sent-55, score-0.252]
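
One plausible reading of how the topic cache is consulted (a sketch; the paper's exact scoring is realized as a decoder feature, and the simple overlap measure here is an assumption) is to score a candidate target phrase by how many of its words appear among the cached topic words:

    def topic_cache_score(target_phrase_tokens, topic_words):
        """Fraction of target-phrase tokens found in the topic cache.

        topic_words: set of topic words extracted (e.g., by a topic model
        such as LDA) from the target side of the retrieved similar
        bilingual document pairs.
        """
        if not target_phrase_tokens:
            return 0.0
        hits = sum(1 for w in target_phrase_tokens if w in topic_words)
        return hits / len(target_phrase_tokens)

A decoder can then prefer, among several possible translations of an ambiguous source phrase, the candidate with the higher topic-cache score.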

35 In some sense, it has a similar effect to employing an adaptive language model, with the advantage of avoiding the interpolation of a global language model with a specific domain language model. [sent-56, score-0.02]

36 Section 3 presents our cache-based approach to document-level SMT. [sent-59, score-0.091]

37 Section 5 gives new insights on cache-based document-level translation. [sent-61, score-0.041]

38 Zhao and Xing (2006) assumed that the parallel sentence pairs within a document pair constitute a mixture of hidden topics and each word pair follows a topic-specific bilingual translation model. [sent-68, score-0.75]

39 It shows that the performance of word alignment can be improved with the help of document-level information, which indirectly improves the quality of SMT. [sent-69, score-0.034]

40 Tam et al. (2007) proposed a bilingual-LSA model on the basis of a parallel document corpus and built a topic-based language model for each language. [sent-71, score-0.22]

41 By automatically building the correspondence between the source and target language models, this method can match the topic-based language model and improve the performance of SMT. [sent-72, score-0.056]

42 Carpuat (2009) revisited the “one sense per discourse” hypothesis of Gale et al. [sent-73, score-0.03]

43 (1992) and gave a detailed comparison and analysis of the “one translation per discourse” hypothesis. [sent-74, score-0.169]

44 However, she failed to propose an effective way to integrate document-level information into an SMT system. [sent-75, score-0.02]

45 For example, she simply recommended some translation candidates to replace some target words in the post-processing stage. [sent-76, score-0.225]

46 Basically, the cache is analogous to “cache memory” in hardware terminology, which tracks short-term fluctuation (Iyer et al. [sent-78, score-0.747]

47 As the cache changes with different documents, the document-level information should be capable of influencing SMT. [sent-80, score-0.784]

48 Previous cache-based approaches mainly point to cache-based language modeling (Kuhn and De Mori, 1990), which uses a large global language model to mix with a small local model estimated from recent history data. [sent-81, score-0.054]
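
In its classic form, such a model interpolates the global model with a small cache estimated from recent history; a minimal Python sketch (with a unigram cache, and illustrative values for the interpolation weight and window size):

    from collections import Counter

    def cache_lm_prob(word, history, global_prob, lam=0.1, window=500):
        """P(word) = (1 - lam) * P_global(word) + lam * P_cache(word).

        history: recently observed words (most recent last);
        global_prob: callable giving the global language model probability.
        """
        recent = history[-window:]
        counts = Counter(recent)
        p_cache = counts[word] / len(recent) if recent else 0.0
        return (1.0 - lam) * global_prob(word) + lam * p_cache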

49 However, applying such a language model in SMT is very difficult due to the risk of introducing extra noise (Raab, 2007). [sent-82, score-0.024]

50 Nepveu et al. (2004) explored user-edited translations in the context of interactive machine translation. [sent-84, score-0.021]

51 Tiedemann (2010) proposed to fill the cache with bilingual phrase pairs from the best translation hypotheses of previous sentences in the test document. [sent-85, score-1.119]

52 Nepveu et al. (2004) and Tiedemann (2010) also explored traditional cache-based language models and found that a cache-based language model often contributes much more than a cache-based translation model. [sent-87, score-0.231]

53 In this way, our cache-based approach can provide useful data at the beginning of the translation process via the static cache. [sent-89, score-0.481]

54 As the translation process continues, the dynamic cache grows and contributes more and more to the translation of subsequent sentences. [sent-90, score-1.329]

55 Besides, the possibility of choosing noisy/unnecessary bilingual phrase pairs in both the static and dynamic caches is weakened with the help of the topic words in the topic cache. [sent-91, score-1.405]

56 In particular, only the most similar document pair is used to construct the static cache and the topic cache unless otherwise specified. [sent-92, score-1.893]

57 In this section, we first introduce the basic phrase-based SMT system and then present our cache-based approach to achieving document-level SMT, with a focus on constructing the caches (static, dynamic and topic) and designing their corresponding features. [sent-93, score-0.404]

58 It is well known that the translation process of SMT can be modeled as obtaining the best translation e of the source sentence f by maximizing the following posterior probability (Brown et al. [sent-95, score-0.389]

59 , 1993): e_best = argmax_e P(e|f) = argmax_e P(f|e) P_lm(e) (1), where P(e|f) is a translation model and P_lm is a language model. [sent-96, score-0.24]

60 In principle, a phrase-based SMT system can provide the best phrase segmentation and alignment that cover a bilingual sentence pair. [sent-100, score-0.399]

61 Here, a segmentation of a sentence into K phrases is defined as: (f, e, ~) ≈ Σ_{k=1..K} (f_k, e_k, ~_k) (3), where the tuple (f_k, e_k) refers to a phrase pair, and ~_k indicates the corresponding alignment information. [sent-101, score-0.111]

62 Our dynamic cache is mostly inspired by Tiedemann (2010), which adopts a dynamic cache to store relevant bilingual phrase pairs from the best translation hypotheses of previous sentences in the test document. [sent-103, score-2.51]

63 In particular, a specific feature S_cache is incorporated to capture useful document-level information in the dynamic cache: S_cache(e_c|f_c) = [Σ_{i=1..K} I(<e_c, f_c> = <e_i, f_i>) × e^(-∂i)] / [Σ_{i=1..K} I(f_c = f_i)] (4), where e^(-∂i) is a decay factor to avoid the dependence of the feature's contribution on the cache size. [sent-104, score-0.976]
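
Reading Equation (4) as reconstructed above (the decay symbol ∂, the indexing of cache entries by age, and the normalization over source-side matches are all read off a garbled extraction, so treat them as assumptions), the feature can be computed as follows:

    import math

    def s_cache(ec, fc, cache, decay=0.5):
        """Decay-weighted cache feature of Eq. (4) for a candidate pair.

        cache: list of (e_i, f_i) phrase pairs ordered newest-first,
        so recently added entries (small i) are discounted least.
        """
        num = 0.0
        den = 0.0
        for i, (ei, fi) in enumerate(cache, start=1):
            if fi == fc:
                den += 1.0
                if ei == ec:
                    num += math.exp(-decay * i)
        return num / den if den else 0.0

The exponential decay keeps the feature value from growing with the cache, so sentences late in a document are not scored on a different scale than early ones.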


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('cache', 0.663), ('bilingual', 0.288), ('static', 0.287), ('smt', 0.228), ('caches', 0.212), ('dynamic', 0.173), ('translation', 0.169), ('document', 0.154), ('topic', 0.126), ('tiedemann', 0.122), ('dds', 0.106), ('documentlevel', 0.091), ('phrase', 0.086), ('store', 0.083), ('tam', 0.083), ('pairs', 0.073), ('ebest', 0.071), ('nepveu', 0.071), ('pphr', 0.071), ('parallel', 0.066), ('contributes', 0.062), ('hypotheses', 0.061), ('translators', 0.061), ('grows', 0.059), ('stores', 0.057), ('translator', 0.055), ('argmaxp', 0.055), ('carpuat', 0.055), ('plm', 0.051), ('fills', 0.051), ('pw', 0.048), ('dx', 0.048), ('fill', 0.043), ('fc', 0.043), ('zhao', 0.042), ('insights', 0.041), ('adopts', 0.041), ('hm', 0.04), ('relevant', 0.037), ('subsequent', 0.034), ('help', 0.034), ('continues', 0.033), ('documents', 0.032), ('target', 0.032), ('mainly', 0.031), ('monotonous', 0.03), ('fluctuation', 0.03), ('hardware', 0.03), ('influencing', 0.03), ('revisited', 0.03), ('employed', 0.03), ('principle', 0.029), ('kinds', 0.028), ('discourse', 0.028), ('iyer', 0.028), ('recommending', 0.028), ('infocomm', 0.028), ('mzhang', 0.028), ('repetition', 0.028), ('retrieves', 0.028), ('obtaining', 0.027), ('decay', 0.026), ('argmax', 0.026), ('practices', 0.026), ('segmentation', 0.025), ('beginning', 0.025), ('penalty', 0.025), ('lingual', 0.024), ('expands', 0.024), ('continuously', 0.024), ('dynamically', 0.024), ('recommended', 0.024), ('tracks', 0.024), ('wp', 0.024), ('source', 0.024), ('noise', 0.024), ('besides', 0.023), ('bond', 0.023), ('dependence', 0.023), ('meets', 0.023), ('collects', 0.023), ('ki', 0.023), ('mix', 0.023), ('mori', 0.023), ('switching', 0.023), ('thirdly', 0.022), ('basically', 0.022), ('interpolate', 0.022), ('translations', 0.021), ('decade', 0.021), ('directions', 0.02), ('och', 0.02), ('failed', 0.02), ('retrieving', 0.02), ('session', 0.02), ('avoiding', 0.02), ('prone', 0.02), ('translates', 0.02), ('kuhn', 0.02), ('designing', 0.019)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999952 25 emnlp-2011-Cache-based Document-level Statistical Machine Translation

Author: Zhengxian Gong ; Min Zhang ; Guodong Zhou

Abstract: Statistical machine translation systems are usually trained on a large amount of bilingual sentence pairs and translate one sentence at a time, ignoring document-level information. In this paper, we propose a cache-based approach to document-level translation. Since caches mainly depend on relevant data to supervise subsequent decisions, it is critical to fill the caches with highly-relevant data of a reasonable size. In this paper, we present three kinds of caches to store relevant document-level information: 1) a dynamic cache, which stores bilingual phrase pairs from the best translation hypotheses of previous sentences in the test document; 2) a static cache, which stores relevant bilingual phrase pairs extracted from similar bilingual document pairs (i.e. source documents similar to the test document and their corresponding target documents) in the training parallel corpus; 3) a topic cache, which stores the target-side topic words related to the source-side test document. In particular, three new features are designed to explore various kinds of document-level information in the above three kinds of caches. Evaluation shows the effectiveness of our cache-based approach to document-level translation, with a performance improvement of 0.81 in BLEU score over Moses. In addition, detailed analysis and discussion are presented to give new insights into document-level translation.

2 0.1882188 118 emnlp-2011-SMT Helps Bitext Dependency Parsing

Author: Wenliang Chen ; Jun'ichi Kazama ; Min Zhang ; Yoshimasa Tsuruoka ; Yujie Zhang ; Yiou Wang ; Kentaro Torisawa ; Haizhou Li

Abstract: We propose a method to improve the accuracy of parsing bilingual texts (bitexts) with the help of statistical machine translation (SMT) systems. Previous bitext parsing methods use human-annotated bilingual treebanks that are hard to obtain. Instead, our approach uses an auto-generated bilingual treebank to produce bilingual constraints. However, because the auto-generated bilingual treebank contains errors, the bilingual constraints are noisy. To overcome this problem, we use large-scale unannotated data to verify the constraints and design a set of effective bilingual features for parsing models based on the verified results. The experimental results show that our new parsers significantly outperform state-of-the-art baselines. Moreover, our approach is still able to provide improvement when we use a larger monolingual treebank that results in a much stronger baseline. Especially notable is that our approach can be used in a purely monolingual setting with the help of SMT.

3 0.14236647 44 emnlp-2011-Domain Adaptation via Pseudo In-Domain Data Selection

Author: Amittai Axelrod ; Xiaodong He ; Jianfeng Gao

Abstract: We explore efficient domain adaptation for the task of statistical machine translation based on extracting sentences from a large general-domain parallel corpus that are most relevant to the target domain. These sentences may be selected with simple cross-entropy based methods, of which we present three. As these sentences are not themselves identical to the in-domain data, we call them pseudo in-domain subcorpora. These subcorpora, 1% the size of the original, can then be used to train small domain-adapted Statistical Machine Translation (SMT) systems which outperform systems trained on the entire corpus. Performance is further improved when we use these domain-adapted models in combination with a true in-domain model. The results show that more training data is not always better, and that best results are attained via proper domain-relevant data selection, as well as combining in- and general-domain systems during decoding.

4 0.089371637 119 emnlp-2011-Semantic Topic Models: Combining Word Distributional Statistics and Dictionary Definitions

Author: Weiwei Guo ; Mona Diab

Abstract: In this paper, we propose a novel topic model based on incorporating dictionary definitions. Traditional topic models treat words as surface strings without assuming predefined knowledge about word meaning. They infer topics only by observing surface word co-occurrence. However, the co-occurred words may not be semantically related in a manner that is relevant for topic coherence. Exploiting dictionary definitions explicitly in our model yields a better understanding of word semantics leading to better text modeling. We exploit WordNet as a lexical resource for sense definitions. We show that explicitly modeling word definitions helps improve performance significantly over the baseline for a text categorization task.

5 0.082946442 101 emnlp-2011-Optimizing Semantic Coherence in Topic Models

Author: David Mimno ; Hanna Wallach ; Edmund Talley ; Miriam Leenders ; Andrew McCallum

Abstract: Latent variable models have the potential to add value to large document collections by discovering interpretable, low-dimensional subspaces. In order for people to use such models, however, they must trust them. Unfortunately, typical dimensionality reduction methods for text, such as latent Dirichlet allocation, often produce low-dimensional subspaces (topics) that are obviously flawed to human domain experts. The contributions of this paper are threefold: (1) An analysis of the ways in which topics can be flawed; (2) an automated evaluation metric for identifying such topics that does not rely on human annotators or reference collections outside the training data; (3) a novel statistical topic model based on this metric that significantly improves topic quality in a large-scale document collection from the National Institutes of Health (NIH).

6 0.082699224 18 emnlp-2011-Analyzing Methods for Improving Precision of Pivot Based Bilingual Dictionaries

7 0.078433752 108 emnlp-2011-Quasi-Synchronous Phrase Dependency Grammars for Machine Translation

8 0.077900603 21 emnlp-2011-Bayesian Checking for Topic Models

9 0.077645294 83 emnlp-2011-Learning Sentential Paraphrases from Bilingual Parallel Corpora for Text-to-Text Generation

10 0.071195461 38 emnlp-2011-Data-Driven Response Generation in Social Media

11 0.067647204 73 emnlp-2011-Improving Bilingual Projections via Sparse Covariance Matrices

12 0.067638919 51 emnlp-2011-Exact Decoding of Phrase-Based Translation Models through Lagrangian Relaxation

13 0.067353554 125 emnlp-2011-Statistical Machine Translation with Local Language Models

14 0.064636141 22 emnlp-2011-Better Evaluation Metrics Lead to Better Machine Translation

15 0.064084701 93 emnlp-2011-Minimum Imputed-Risk: Unsupervised Discriminative Training for Machine Translation

16 0.057603631 58 emnlp-2011-Fast Generation of Translation Forest for Large-Scale SMT Discriminative Training

17 0.053271856 76 emnlp-2011-Language Models for Machine Translation: Original vs. Translated Texts

18 0.050364222 123 emnlp-2011-Soft Dependency Constraints for Reordering in Hierarchical Phrase-Based Translation

19 0.050269306 3 emnlp-2011-A Correction Model for Word Alignments

20 0.049264293 20 emnlp-2011-Augmenting String-to-Tree Translation Models with Fuzzy Use of Source-side Syntax


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.165), (1, 0.065), (2, 0.025), (3, -0.225), (4, -0.013), (5, 0.076), (6, 0.045), (7, 0.001), (8, -0.106), (9, -0.078), (10, 0.004), (11, 0.023), (12, 0.109), (13, 0.133), (14, 0.103), (15, 0.058), (16, 0.044), (17, -0.202), (18, -0.1), (19, -0.124), (20, -0.054), (21, -0.159), (22, -0.111), (23, 0.052), (24, -0.015), (25, -0.202), (26, 0.029), (27, -0.009), (28, -0.009), (29, 0.231), (30, -0.026), (31, -0.009), (32, -0.128), (33, 0.028), (34, 0.163), (35, 0.052), (36, -0.017), (37, -0.004), (38, -0.074), (39, 0.097), (40, 0.074), (41, 0.021), (42, 0.071), (43, 0.082), (44, 0.058), (45, 0.048), (46, 0.072), (47, -0.044), (48, -0.02), (49, 0.12)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95371908 25 emnlp-2011-Cache-based Document-level Statistical Machine Translation

Author: Zhengxian Gong ; Min Zhang ; Guodong Zhou

Abstract: Statistical machine translation systems are usually trained on a large amount of bilingual sentence pairs and translate one sentence at a time, ignoring document-level information. In this paper, we propose a cache-based approach to document-level translation. Since caches mainly depend on relevant data to supervise subsequent decisions, it is critical to fill the caches with highly-relevant data of a reasonable size. In this paper, we present three kinds of caches to store relevant document-level information: 1) a dynamic cache, which stores bilingual phrase pairs from the best translation hypotheses of previous sentences in the test document; 2) a static cache, which stores relevant bilingual phrase pairs extracted from similar bilingual document pairs (i.e. source documents similar to the test document and their corresponding target documents) in the training parallel corpus; 3) a topic cache, which stores the target-side topic words related to the source-side test document. In particular, three new features are designed to explore various kinds of document-level information in the above three kinds of caches. Evaluation shows the effectiveness of our cache-based approach to document-level translation, with a performance improvement of 0.81 in BLEU score over Moses. In addition, detailed analysis and discussion are presented to give new insights into document-level translation.

2 0.78672671 118 emnlp-2011-SMT Helps Bitext Dependency Parsing

Author: Wenliang Chen ; Jun'ichi Kazama ; Min Zhang ; Yoshimasa Tsuruoka ; Yujie Zhang ; Yiou Wang ; Kentaro Torisawa ; Haizhou Li

Abstract: We propose a method to improve the accuracy of parsing bilingual texts (bitexts) with the help of statistical machine translation (SMT) systems. Previous bitext parsing methods use human-annotated bilingual treebanks that are hard to obtain. Instead, our approach uses an auto-generated bilingual treebank to produce bilingual constraints. However, because the auto-generated bilingual treebank contains errors, the bilingual constraints are noisy. To overcome this problem, we use large-scale unannotated data to verify the constraints and design a set of effective bilingual features for parsing models based on the verified results. The experimental results show that our new parsers significantly outperform state-of-theart baselines. Moreover, our approach is still able to provide improvement when we use a larger monolingual treebank that results in a much stronger baseline. Especially notable is that our approach can be used in a purely monolingual setting with the help of SMT.

3 0.60125381 18 emnlp-2011-Analyzing Methods for Improving Precision of Pivot Based Bilingual Dictionaries

Author: Xabier Saralegi ; Iker Manterola ; Inaki San Vicente

Abstract: An A-C bilingual dictionary can be inferred by merging A-B and B-C dictionaries using B as pivot. However, polysemous pivot words often produce wrong translation candidates. This paper analyzes two methods for pruning wrong candidates: one based on exploiting the structure of the source dictionaries, and the other based on distributional similarity computed from comparable corpora. As both methods depend exclusively on easily available resources, they are well suited to less resourced languages. We studied whether these two techniques complement each other given that they are based on different paradigms. We also researched combining them by looking for the best adequacy depending on various application scenarios.

4 0.55525297 44 emnlp-2011-Domain Adaptation via Pseudo In-Domain Data Selection

Author: Amittai Axelrod ; Xiaodong He ; Jianfeng Gao

Abstract: We explore efficient domain adaptation for the task of statistical machine translation based on extracting sentences from a large general-domain parallel corpus that are most relevant to the target domain. These sentences may be selected with simple cross-entropy based methods, of which we present three. As these sentences are not themselves identical to the in-domain data, we call them pseudo in-domain subcorpora. These subcorpora, 1% the size of the original, can then be used to train small domain-adapted Statistical Machine Translation (SMT) systems which outperform systems trained on the entire corpus. Performance is further improved when we use these domain-adapted models in combination with a true in-domain model. The results show that more training data is not always better, and that best results are attained via proper domain-relevant data selection, as well as combining in- and general-domain systems during decoding.

5 0.54492301 73 emnlp-2011-Improving Bilingual Projections via Sparse Covariance Matrices

Author: Jagadeesh Jagarlamudi ; Raghavendra Udupa ; Hal Daume III ; Abhijit Bhole

Abstract: Mapping documents into an interlingual representation can help bridge the language barrier of cross-lingual corpora. Many existing approaches are based on word co-occurrences extracted from aligned training data, represented as a covariance matrix. In theory, such a covariance matrix should represent semantic equivalence, and should be highly sparse. Unfortunately, the presence of noise leads to dense covariance matrices which in turn leads to suboptimal document representations. In this paper, we explore techniques to recover the desired sparsity in covariance matrices in two ways. First, we explore word association measures and bilingual dictionaries to weigh the word pairs. Later, we explore different selection strategies to remove the noisy pairs based on the association scores. Our experimental results on the task of aligning comparable documents shows the efficacy of sparse covariance matrices on two data sets from two different language pairs.

6 0.35656372 38 emnlp-2011-Data-Driven Response Generation in Social Media

7 0.33507839 93 emnlp-2011-Minimum Imputed-Risk: Unsupervised Discriminative Training for Machine Translation

8 0.31914589 21 emnlp-2011-Bayesian Checking for Topic Models

9 0.3067795 119 emnlp-2011-Semantic Topic Models: Combining Word Distributional Statistics and Dictionary Definitions

10 0.30372772 101 emnlp-2011-Optimizing Semantic Coherence in Topic Models

11 0.29968926 83 emnlp-2011-Learning Sentential Paraphrases from Bilingual Parallel Corpora for Text-to-Text Generation

12 0.27744687 76 emnlp-2011-Language Models for Machine Translation: Original vs. Translated Texts

13 0.27425268 66 emnlp-2011-Hierarchical Phrase-based Translation Representations

14 0.25435513 51 emnlp-2011-Exact Decoding of Phrase-Based Translation Models through Lagrangian Relaxation

15 0.25071934 108 emnlp-2011-Quasi-Synchronous Phrase Dependency Grammars for Machine Translation

16 0.24406071 3 emnlp-2011-A Correction Model for Word Alignments

17 0.23323146 22 emnlp-2011-Better Evaluation Metrics Lead to Better Machine Translation

18 0.22828095 82 emnlp-2011-Learning Local Content Shift Detectors from Document-level Information

19 0.21857952 125 emnlp-2011-Statistical Machine Translation with Local Language Models

20 0.20166792 123 emnlp-2011-Soft Dependency Constraints for Reordering in Hierarchical Phrase-Based Translation


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(23, 0.111), (36, 0.029), (37, 0.037), (39, 0.324), (45, 0.089), (53, 0.062), (54, 0.026), (57, 0.02), (64, 0.028), (66, 0.026), (69, 0.015), (79, 0.048), (82, 0.023), (87, 0.012), (96, 0.013), (98, 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.85229909 43 emnlp-2011-Domain-Assisted Product Aspect Hierarchy Generation: Towards Hierarchical Organization of Unstructured Consumer Reviews

Author: Jianxing Yu ; Zheng-Jun Zha ; Meng Wang ; Kai Wang ; Tat-Seng Chua

Abstract: This paper presents a domain-assisted approach to organize various aspects of a product into a hierarchy by integrating domain knowledge (e.g., the product specifications), as well as consumer reviews. Based on the derived hierarchy, we generate a hierarchical organization of consumer reviews on various product aspects and aggregate consumer opinions on these aspects. With such organization, user can easily grasp the overview of consumer reviews. Furthermore, we apply the hierarchy to the task of implicit aspect identification which aims to infer implicit aspects of the reviews that do not explicitly express those aspects but actually comment on them. The experimental results on 11popular products in four domains demonstrate the effectiveness of our approach.

2 0.70533007 9 emnlp-2011-A Non-negative Matrix Factorization Based Approach for Active Dual Supervision from Document and Word Labels

Author: Chao Shen ; Tao Li

Abstract: In active dual supervision, not only informative examples but also features are selected for labeling to build a high quality classifier with low cost. However, how to measure the informativeness for both examples and feature on the same scale has not been well solved. In this paper, we propose a non-negative matrix factorization based approach to address this issue. We first extend the matrix factorization framework to explicitly model the corresponding relationships between feature classes and examples classes. Then by making use of the reconstruction error, we propose a unified scheme to determine which feature or example a classifier is most likely to benefit from having labeled. Empirical results demonstrate the effectiveness of our proposed methods.

same-paper 3 0.67956275 25 emnlp-2011-Cache-based Document-level Statistical Machine Translation

Author: Zhengxian Gong ; Min Zhang ; Guodong Zhou

Abstract: Statistical machine translation systems are usually trained on a large amount of bilingual sentence pairs and translate one sentence at a time, ignoring document-level information. In this paper, we propose a cache-based approach to document-level translation. Since caches mainly depend on relevant data to supervise subsequent decisions, it is critical to fill the caches with highly-relevant data of a reasonable size. In this paper, we present three kinds of caches to store relevant document-level information: 1) a dynamic cache, which stores bilingual phrase pairs from the best translation hypotheses of previous sentences in the test document; 2) a static cache, which stores relevant bilingual phrase pairs extracted from similar bilingual document pairs (i.e. source documents similar to the test document and their corresponding target documents) in the training parallel corpus; 3) a topic cache, which stores the target-side topic words related to the source-side test document. In particular, three new features are designed to explore various kinds of document-level information in the above three kinds of caches. Evaluation shows the effectiveness of our cache-based approach to document-level translation, with a performance improvement of 0.81 in BLEU score over Moses. In addition, detailed analysis and discussion are presented to give new insights into document-level translation.

4 0.45048204 123 emnlp-2011-Soft Dependency Constraints for Reordering in Hierarchical Phrase-Based Translation

Author: Yang Gao ; Philipp Koehn ; Alexandra Birch

Abstract: Long-distance reordering remains one of the biggest challenges facing machine translation. We derive soft constraints from the source dependency parsing to directly address the reordering problem for the hierarchical phrase-based model. Our approach significantly improves Chinese–English machine translation on a large-scale task by 0.84 BLEU points on average. Moreover, when we switch the tuning function from BLEU to the LRscore which promotes reordering, we observe total improvements of 1.21 BLEU, 1.30 LRscore and 3.36 TER over the baseline. On average our approach improves reordering precision and recall by 6.9 and 0.3 absolute points, respectively, and is found to be especially effective for long-distance reordering.

5 0.44632334 22 emnlp-2011-Better Evaluation Metrics Lead to Better Machine Translation

Author: Chang Liu ; Daniel Dahlmeier ; Hwee Tou Ng

Abstract: Many machine translation evaluation metrics have been proposed after the seminal BLEU metric, and many among them have been found to consistently outperform BLEU, demonstrated by their better correlations with human judgment. It has long been the hope that by tuning machine translation systems against these new generation metrics, advances in automatic machine translation evaluation can lead directly to advances in automatic machine translation. However, to date there has been no unambiguous report that these new metrics can improve a state-of-the-art machine translation system over its BLEU-tuned baseline. In this paper, we demonstrate that tuning Joshua, a hierarchical phrase-based statistical machine translation system, with the TESLA metrics results in significantly better human-judged translation quality than the BLEU-tuned baseline. TESLA-M in particular is simple and performs well in practice on large datasets. We release all our implementation under an open source license. It is our hope that this work will encourage the machine translation community to finally move away from BLEU as the unquestioned default and to consider the new generation metrics when tuning their systems.

6 0.4462671 68 emnlp-2011-Hypotheses Selection Criteria in a Reranking Framework for Spoken Language Understanding

7 0.44561881 35 emnlp-2011-Correcting Semantic Collocation Errors with L1-induced Paraphrases

8 0.44404173 46 emnlp-2011-Efficient Subsampling for Training Complex Language Models

9 0.44209716 1 emnlp-2011-A Bayesian Mixture Model for PoS Induction Using Multiple Features

10 0.43966532 108 emnlp-2011-Quasi-Synchronous Phrase Dependency Grammars for Machine Translation

11 0.43835515 136 emnlp-2011-Training a Parser for Machine Translation Reordering

12 0.43693626 98 emnlp-2011-Named Entity Recognition in Tweets: An Experimental Study

13 0.43682712 66 emnlp-2011-Hierarchical Phrase-based Translation Representations

14 0.43656781 18 emnlp-2011-Analyzing Methods for Improving Precision of Pivot Based Bilingual Dictionaries

15 0.43635616 56 emnlp-2011-Exploring Supervised LDA Models for Assigning Attributes to Adjective-Noun Phrases

16 0.43608862 28 emnlp-2011-Closing the Loop: Fast, Interactive Semi-Supervised Annotation With Queries on Features and Instances

17 0.4360263 54 emnlp-2011-Exploiting Parse Structures for Native Language Identification

18 0.43347043 8 emnlp-2011-A Model of Discourse Predictions in Human Sentence Processing

19 0.4325566 53 emnlp-2011-Experimental Support for a Categorical Compositional Distributional Model of Meaning

20 0.43220207 38 emnlp-2011-Data-Driven Response Generation in Social Media