acl acl2012 acl2012-134 knowledge-graph by maker-knowledge-mining

134 acl-2012-Learning to Find Translations and Transliterations on the Web


Source: pdf

Author: Joseph Z. Chang ; Jason S. Chang ; Roger Jyh-Shing Jang

Abstract: By retrieving and identifying such translation counterparts on the Web, we can cope with the OOV problem. In this paper, we present a new method for learning to find translations and transliterations on the Web for a given term. The approach involves using a small set of terms and translations to obtain mixed-code snippets from a search engine, and automatically annotating the snippets with tags and features for training a conditional random field model. At runtime, the model is used to extract translation candidates for a given term. Preliminary experiments and evaluation show that our method cleanly combines various features, resulting in a system that outperforms previous work.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 By retrieving and identifying such translation counterparts on the Web, we can cope with the OOV problem. [sent-12, score-0.198]

2 In this paper, we present a new method for learning to find translations and transliterations on the Web for a given term. [sent-13, score-0.488]

3 The approach involves using a small set of terms and translations to obtain mixed-code snippets from a search engine, and automatically annotating the snippets with tags and features for training a conditional random field model. [sent-14, score-1.125]

4 At runtime, the model is used to extract translation candidates for a given term. [sent-15, score-0.261]

5 Preliminary experiments and evaluation show that our method cleanly combines various features, resulting in a system that outperforms previous work. [sent-16, score-0.107]

6 1 Introduction The phrase translation problem is critical to machine translation, cross-lingual information retrieval, and multilingual terminology (Bian and Chen 2000, Kupiec 1993). [sent-17, score-0.443]

7 However, the out-of-vocabulary (OOV) problem is hard to overcome even with a very large training corpus, due to the Zipfian nature of word distribution and ever-growing new terminology and named entities. [sent-19, score-0.201]

8 Luckily, there is an abundance of webpages consisting of mixed-code text, typically written in one language but interspersed with some sentential or phrasal translations in another language. [sent-20, score-0.608]

9 Consider the technical term named-entity recognition. [sent-21, score-0.149]

10 The best places to find the Chinese translations for named-entity recognition are probably not some parallel corpus or dictionary, but rather mixed-code webpages. [sent-22, score-0.438]

11 The following example is a snippet returned by the Bing search engine for the query named entity recognition: [sent-23, score-0.476]

12 This snippet contains three technical terms in Chinese (i.e. [sent-29, score-0.079]

13 自然語言剖析 zhiran yuyan poxi, 問題分類 wenti fenlei, 專名辨識 zhuanming bianshi), followed by source terms in brackets (respectively, Natural Language Parsing, Question Classification, and Named Entity Recognition). [sent-31, score-0.037]

14 Quah (2006) points out that submitting the source term and a partial translation to a search engine is a good strategy used by many translators. [sent-32, score-0.543]

15 Unfortunately, the user still has to sift through snippets to find the translations. [sent-33, score-0.372]

16 For a given English term, such translations can be extracted by casting the problem as a sequence labeling task for classifying the Chinese characters in the snippets as either translation or non-translation. [sent-34, score-0.992]

17 Previous work has pointed out that such translations usually exhibit characteristics related to word translation, word transliteration, surface patterns, and proximity to the occurrences of the original phrase (Nagata et al.). [sent-35, score-0.465]

18 Thus, we also associate features with each Chinese token (characters or words) to reflect the likelihood of the token being part of the translation. [sent-40, score-0.138]

19 We describe how to train a CRF model for identifying translations in more detail in Section 3. [sent-41, score-0.356]

20 At run-time, the system accepts a given phrase (e.g. [sent-42, score-0.093]

21 named-entity recognition), and then queries a search engine for webpages in the target language (e.g. [sent-44, score-0.408]

22 Subsequently, we retrieve mixed-code snippets and identify the translations of the given term. [sent-47, score-0.756]

23 The system can potentially be used to assist translators to find the most common translation for a given term, or to supplement a bilingual terminology bank (e.g. [sent-48, score-0.540]

24 adding multilingual titles to existing Wikipedia articles); alternatively, the translations can be used as additional training data for a machine translation system, as described in Lin et al. [sent-50, score-0.432]

25 2 Related Work Phrase translation and transliteration are important for cross-language tasks. [sent-52, score-0.424]

26 For example, Knight and Graehl (1998) describe and evaluate a multi-stage machine translation method for back transliterating English names into Japanese, while Bian and Chen (2000) describe cross-language information access to multilingual collections on the Internet. [sent-53, score-0.34]

27 Recently, researchers have begun to exploit mixed-code webpages for word and phrase translation. [sent-54, score-0.315]

28 Nagata et al. (2001) present a system for finding English translations for a given Japanese technical term using Japanese-English snippets returned by a search engine. [sent-56, score-1.084]

29 Kwok et al. (2005) focus on named entity transliteration and implement a cross-language name finder. [sent-58, score-0.400]

30 (2005) proposed a method to learn surface patterns to find translations in mixed-code snippets. [sent-60, score-0.535]

31 (2004) propose a method for mining translations of web queries from anchor texts. [sent-63, score-0.527]

32 Cheng et al. (2004) propose a similar method for translating unknown queries with web corpora for cross-language information retrieval. [sent-64, score-0.210]

33 Gravano (2006) also proposes similar methods using anchor texts. [sent-65, score-0.045]

34 Lin et al. (2008) propose a method that performs word alignment between translations and phrases within parentheses in crawled webpages. [sent-67, score-0.356]

35 In contrast to previous work described above, we exploit surface patterns differently as a soft constraint, while requiring minimal human intervention to prepare the training data. [sent-71, score-0.049]

36 3 Method To find translations for a given term on the Web, a promising approach is to automatically learn to extract phrasal translations or transliterations using machine learning, more specifically a conditional random field (CRF) model. [sent-72, score-1.026]

37 We focus on the issue of finding translations in mixed-code snippets returned by a search engine. [sent-73, score-1.037]

38 The translations are identified, tallied, ranked, and returned as the output of the system. [sent-74, score-0.492]

39 1 Preparing Data for CRF Classifier We make use of a small set of term and translation pairs as seed data to retrieve and annotate mixed-code snippets from a search engine. [sent-76, score-0.839]

40 Features are generated based on other external knowledge sources as will be described in Section 3. [sent-77, score-0.046]

41 An example of the data generated for the given term Emmy Award, with features and translation/non-translation labels, is shown in Figure 1 using the common BIO notation. [sent-82, score-0.109]

42 We use a list of randomly selected source and target terms as seed data (e.g. [sent-86, score-0.037]

43 Wikipedia English titles and their Chinese counterparts obtained via the language links). [sent-88, score-0.178]

44 Emmy Awards) to query a search engine with the target webpage language set to the target language (e.g. [sent-91, score-0.307]

45 Chinese), biasing the search engine to return Chinese webpages interspersed with some English phrases. [sent-93, score-0.391]

46 We then automatically label each Chinese character of the returned snippets, with B, I, and O indicating, respectively, the beginning, inside, and outside of a translation. [sent-94, score-0.171]

47 In Figure 1, the translation 艾美獎 (ai mei jiang) is labeled as B I I, while all other Chinese characters are labeled as O. [sent-95, score-0.267]

48 An additional tag of E is used to indicate the occurrences of the given term (e.g. Emmy Award). [sent-96, score-0.109]
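
To make the tagging scheme concrete, here is a minimal sketch of how a returned snippet could be annotated automatically once the seed translation is known; the function, the example snippet, and the character-level treatment of the English term are illustrative assumptions, not the authors' code.

```python
# Minimal sketch (not the authors' code) of the automatic B/I/O/E annotation
# described above, assuming the seed pair (source term, translation) is known.
def bio_label_snippet(snippet, source_term, translation):
    """Label every character of a snippet:
    B/I - beginning/inside of the seed Chinese translation
    E   - part of an occurrence of the English source term
    O   - any other character
    """
    labels = ["O"] * len(snippet)

    # Mark each occurrence of the Chinese translation as B I I ...
    start = snippet.find(translation)
    while start != -1:
        labels[start] = "B"
        for i in range(start + 1, start + len(translation)):
            labels[i] = "I"
        start = snippet.find(translation, start + 1)

    # Mark each occurrence of the English source term with E
    low = snippet.lower()
    start = low.find(source_term.lower())
    while start != -1:
        for i in range(start, start + len(source_term)):
            labels[i] = "E"
        start = low.find(source_term.lower(), start + 1)

    return list(zip(snippet, labels))

# The seed pair ("Emmy Award", "艾美獎") labels 艾美獎 as B I I.
print(bio_label_snippet("第64屆艾美獎 (Emmy Award) 頒獎典禮", "Emmy Award", "艾美獎"))
```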

49 We used the publicly available English-Chinese Bilingual WordNet and NICT terminology bank to generate translation features in our implementation. [sent-105, score-0.363]

50 The bilingual WordNet has 99,642 synset entries, with a total of some 270,000 translation pairs, mainly common nouns. [sent-106, score-0.307]

51 The NICT terminology bank contains over 1 million bilingual terms in 72 categories, covering a wide variety of fields. [sent-108, score-0.146]
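
As an illustration of how such bilingual resources can be turned into a per-character translation (TL) feature, the following hedged sketch uses a toy dictionary as a stand-in for the bilingual WordNet and the NICT term bank; the data and function names are assumptions, not the paper's implementation.

```python
# Sketch of a translation (TL) feature: does this Chinese character occur in a
# dictionary translation of any word in the English query?  The toy dictionary
# stands in for resources such as the bilingual WordNet or the NICT term bank.
BILINGUAL_DICT = {            # English word -> set of Chinese translations
    "award": {"獎", "獎項"},
    "recognition": {"辨識", "識別"},
}

def tl_feature(ch_char, query_words, bilingual_dict=BILINGUAL_DICT):
    """Return 1 if ch_char appears in some translation of a query word."""
    for word in query_words:
        for translation in bilingual_dict.get(word.lower(), ()):
            if ch_char in translation:
                return 1
    return 0

print(tl_feature("獎", ["Emmy", "Award"]))   # -> 1
print(tl_feature("的", ["Emmy", "Award"]))   # -> 0
```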

52 Since many terms are transliterated, it is important to include a transliteration feature. [sent-112, score-0.263]

53 We first use a list of name transliteration pairs, then use the Expectation-Maximization (EM) algorithm to align English syllables with Romanized Chinese characters. [sent-113, score-0.130]

54 Finally, we use the alignment information to generate a transliteration feature for a Chinese token with respect to the English words in the query. [sent-114, score-0.295]

55 We extract person and location entries in Wikipedia as name transliteration pairs to generate transliteration features in our implementation. [sent-115, score-0.356]

56 A total of some 15,000 bilingual names of persons and 24,000 bilingual place names were obtained and force-aligned to obtain transliteration relationships. [sent-117, score-0.540]
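
The sketch below illustrates one way a transliteration (TR) feature could be computed once syllable-alignment probabilities have been estimated, e.g. by EM over the force-aligned name pairs; the alignment table, the pinyin lookup, and the syllabifier are toy placeholders, not the authors' trained models.

```python
# Sketch of a transliteration (TR) feature.  It assumes a table of alignment
# probabilities P(pinyin syllable | English syllable) has already been
# estimated from force-aligned bilingual names; the table, the pinyin lookup,
# and the syllabifier below are illustrative placeholders.
ALIGN_PROB = {("e", "ai"): 0.42, ("mmy", "mei"): 0.31, ("a", "ai"): 0.12}
PINYIN = {"艾": "ai", "美": "mei", "獎": "jiang"}

def english_syllables(word):
    # Toy syllabification; a real system would use a trained syllabifier.
    return ["e", "mmy"] if word.lower() == "emmy" else [word.lower()]

def tr_feature(ch_char, query_words):
    """Highest alignment probability between the character's pinyin and any
    syllable of any query word (0.0 if the character or pairing is unknown)."""
    pinyin = PINYIN.get(ch_char)
    if pinyin is None:
        return 0.0
    return max((ALIGN_PROB.get((syllable, pinyin), 0.0)
                for word in query_words
                for syllable in english_syllables(word)),
               default=0.0)

print(tr_feature("美", ["Emmy", "Award"]))   # -> 0.31
print(tr_feature("的", ["Emmy", "Award"]))   # -> 0.0
```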

57 In the final stage of preparing training data, we add the distance, i.e. [sent-121, score-0.052]

58 the number of words between a Chinese token and the English term in question, as a feature, aimed at exploiting the fact that translations tend to occur near the source term, as noted in Nagata et al. [sent-123, score-0.534]
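
A minimal sketch of such a distance feature, with an illustrative tokenisation, might look as follows.

```python
# Sketch of the distance feature: how many tokens separate a Chinese token
# from the nearest occurrence of the English source term in the snippet.
# The tokenisation is an illustrative choice, not the paper's.
def distance_feature(index, term_positions):
    """index: position of the Chinese token in the token list;
    term_positions: token indices where the English term occurs."""
    if not term_positions:
        return -1                     # the term does not occur in this snippet
    return min(abs(index - p) for p in term_positions)

tokens = ["艾", "美", "獎", "(", "Emmy", "Award", ")"]
print(distance_feature(0, [4]))   # 艾 is 4 tokens away from "Emmy" -> 4
```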

59 Finally, we use the data labeled with translation tags and three kinds of feature values to train a CRF model. [sent-126, score-0.198]
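
One possible way to assemble the label sequences and the three kinds of features and train a linear-chain CRF is sketched below, using the sklearn-crfsuite package as a stand-in, since the summary does not say which CRF toolkit the authors used; the toy training data is illustrative only.

```python
# One possible way to put the labels and the three kinds of features together
# and train a linear-chain CRF, using the sklearn-crfsuite package as a
# stand-in for whichever CRF toolkit the authors actually used.
import sklearn_crfsuite

def token_features(token, tl, tr, dist):
    # One feature dict per Chinese token: the token itself plus the
    # translation, transliteration, and distance features.
    return {"token": token, "tl": tl, "tr": tr, "dist": dist}

# Toy training data; real sequences come from the auto-annotated snippets.
X_train = [[token_features("艾", 0, 0.42, 4),
            token_features("美", 0, 0.31, 3),
            token_features("獎", 1, 0.05, 2)]]
y_train = [["B", "I", "I"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1,
                           max_iterations=100)
crf.fit(X_train, y_train)
print(crf.predict(X_train))   # predicted tag sequence for the toy example
```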

60 2 Run-Time Translation Extraction With the trained CRF model, we then attempt to find translations for a given phrase. [sent-128, score-0.391]

61 The system begins by submitting the given phrase as a query to a search engine to retrieve snippets, and generates features for each token in the same way as in the training phase. [sent-129, score-0.511]

62 We then use the trained model to tag the snippets, and extract translation candidates by identifying consecutive Chinese tokens labeled as B and I. [sent-130, score-0.304]

63 Finally, we compute the frequency of all the candidates identified in all snippets, and output the one with the highest frequency. [sent-131, score-0.063]
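
A minimal sketch of this run-time extraction and frequency-based ranking step follows; the tagged snippets are toy examples, and the helper names are not the authors'.

```python
# Sketch of the run-time step: collect consecutive B/I characters from the
# tagged snippets as candidate translations, then return the most frequent one.
from collections import Counter

def extract_candidates(tagged_snippet):
    """tagged_snippet: list of (character, predicted tag) pairs."""
    candidates, current = [], ""
    for char, tag in tagged_snippet:
        if tag == "B":
            if current:
                candidates.append(current)
            current = char
        elif tag == "I" and current:
            current += char
        else:
            if current:
                candidates.append(current)
            current = ""
    if current:
        candidates.append(current)
    return candidates

def best_translation(tagged_snippets):
    counts = Counter(c for snippet in tagged_snippets
                     for c in extract_candidates(snippet))
    return counts.most_common(1)[0][0] if counts else None

tagged = [[("艾", "B"), ("美", "I"), ("獎", "I"), ("典", "O"), ("禮", "O")],
          [("艾", "B"), ("美", "I"), ("獎", "I")]]
print(best_translation(tagged))   # -> 艾美獎
```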

64 4 Experiments and Evaluation We extracted the Wikipedia titles of English and Chinese articles connected through language links for training and testing. [sent-132, score-0.178]

65 We obtained a total of 155,310 article pairs, from which we then randomly selected 13,150 and 2,181 titles as seeds to obtain the training and test data. [sent-133, score-0.220]

66 Since we are using Wikipedia bilingual titles as the gold standard, we exclude any snippets from the Wikipedia domain. [sent-134, score-0.624]

67 The test set contains 745,734 snippets or 9,158,141 tokens (Chinese character or English word). [sent-136, score-0.415]

68 The reference answer appeared a total of 48,938 times or 180,932 tokens (2%), an average of about 22 times per test term. [sent-137, score-0.131]

69 Table 1 compares the systems Full (En-Ch), -TL, -TR, -TL-TR, LIN En-Ch, and LIN Ch-En on Coverage, Exact match, and Top-5 exact match; the numeric cells were lost in extraction. [sent-139, score-0.204]

70 Automatic evaluation results of 8 experiments: (1) Full system; (2-4) -TL, -TR, -TL-TR: the full system without the TL, TR, and TL+TR features; (5,6) LIN En-Ch and Ch-En: the results in Lin et al. [sent-160, score-0.066]

71 (2008); (7) LDC: the LDC English-Chinese dictionary; (8) NICT: the NICT term bank. [sent-161, score-0.109]

72 We ran the system and produced the translations for these 2,181 test terms, and automatically evaluated the results using the metrics of coverage, i.e. [sent-174, score-0.389]

73 whether the system was able to produce translation candidates, and exact match precision. [sent-176, score-0.362]

74 This precision rate is an under-estimate, since a term may have many alternative translations that do not match the single reference translation exactly. [sent-177, score-0.582]
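
The following sketch shows one way coverage and exact-match precision could be computed from system outputs and the single Wikipedia reference; computing precision over produced outputs only is an assumption, and the data is illustrative.

```python
# Sketch of the automatic evaluation: coverage is the share of test terms for
# which the system produced any candidate, and exact-match precision compares
# the top-ranked candidate with the single Wikipedia reference title.
# Computing precision over produced outputs only is an assumption here.
def evaluate(system_output, references):
    """system_output: term -> top-1 candidate (or None); references: term -> gold."""
    produced = [t for t, cand in system_output.items() if cand is not None]
    coverage = len(produced) / len(system_output)
    exact = sum(1 for t in produced if system_output[t] == references[t])
    precision = exact / len(produced) if produced else 0.0
    return coverage, precision

outputs = {"Emmy Award": "艾美獎", "Some Term": None}
gold = {"Emmy Award": "艾美獎", "Some Term": "某術語"}
print(evaluate(outputs, gold))   # -> (0.5, 1.0)
```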

75 To give a more accurate estimate of real precision, we resorted to manual evaluation on a small part of the 2,181 English phrases and a small set of English Wikipedia titles without a Chinese language link. [sent-178, score-0.178]

76 1 Automatic Evaluation In this section, we describe the evaluation based on English-Chinese titles extracted from Wikipedia as the gold standard. [sent-180, score-0.178]

77 Our system produces the top-1 translation by ranking candidates by frequency and outputting the most frequent one. [sent-181, score-0.452]

78 The results indicate that using external knowledge to generate features improves system performance significantly. [sent-185, score-0.079]

79 Adding the translation feature (TL) or the transliteration feature (TR) to the system with no external knowledge features (-TL-TR) improves exact match precision by about 6% and 16%, respectively. [sent-186, score-0.634]

80 Because many Wikipedia titles are named entities, the transliteration feature is the most important. [sent-187, score-0.476]

81 Overall, the system with full features performs the best, finding reasonably correct translations for 8 out of 10 phrases. [sent-188, score-0.531]

82 2 Manual Evaluation Evaluation based on exact match against a single reference answer leads to under-estimation, because an English phrase is often translated into several Chinese counterparts. [sent-190, score-0.279]

83 The judge was instructed to mark each output as A: correct translation alternative, B: correct translation but with a different sense from the reference, P: partially correct translation, and E: incorrect translation. [sent-192, score-0.587]

84 Table 2 shows some translations generated by the full system that do not match the single reference translation. [sent-193, score-0.541]

85 Half of the translations are correct translations (A and B), while a third are partially correct translations (P). [sent-194, score-1.049]

86 Therefore, some partial translations may still be considered correct (B). [sent-196, score-0.408]

87 To evaluate titles without a language link, we sampled a list of 95 terms from the unlinked portion of Wikipedia using the criteria: (1) a frequency count of over 2,000 in Google Web 1T. [sent-197, score-0.215]

88 Interestingly, our system provides correct translations for over 50% of the cases, and at least partially correct translations for almost 90% of the cases. [sent-201, score-0.528]

89 5 Conclusion and Future work We have presented a new method for finding translations on the Web for a given term. [sent-202, score-0.411]

90 In our approach, we use a small set of terms and translations as seeds to obtain and tag mixed-code snippets returned by a search engine, in order to train a CRF model for sequence labeling. [sent-203, score-1.040]

91 This CRF model is then used to tag the returned snippets for a given query term to extract translation candidates, which are then ranked and returned as output. [sent-204, score-0.992]

92 Preliminary experiments and evaluations show that our learning-based method cleanly combines various features, producing quality translations and transliterations. [sent-205, score-0.430]

93 For example, existing query expansion methods could be implemented to retrieve more webpages containing translations. [sent-207, score-0.299]

94 Additionally, an interesting direction to explore is to identify phrase types and train type-specific CRF models. [sent-208, score-0.060]

95 Cross-language information access to multilingual collections on the internet. [sent-215, score-0.094]

96 Translating unknown queries with web corpora for cross-language information retrieval. [sent-239, score-0.126]

97 Automatic extraction of named entity translingual equivalence based on multi-feature cost minimization. [sent-246, score-0.126]

98 CHINET: a Chinese name finder system for document triage. [sent-270, score-0.081]

99 Translating Chinese Romanized name into Chinese idiographic characters via corpus and web validation. [sent-284, score-0.201]

100 Mining translations of OOV terms from the web through cross-lingual query expansion. [sent-320, score-0.553]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('translations', 0.356), ('snippets', 0.337), ('transliteration', 0.226), ('chinese', 0.217), ('translation', 0.198), ('emmy', 0.184), ('titles', 0.178), ('webpages', 0.16), ('tl', 0.141), ('nict', 0.137), ('returned', 0.136), ('wikipedia', 0.133), ('terminology', 0.129), ('engine', 0.114), ('crf', 0.114), ('kuangfu', 0.11), ('bilingual', 0.109), ('term', 0.109), ('bian', 0.096), ('hsinchu', 0.096), ('tsing', 0.096), ('lin', 0.09), ('nagata', 0.087), ('web', 0.084), ('transliterated', 0.082), ('transliterations', 0.077), ('query', 0.076), ('english', 0.074), ('cleanly', 0.074), ('mixedcode', 0.074), ('palgrave', 0.074), ('match', 0.073), ('named', 0.072), ('characters', 0.069), ('token', 0.069), ('hua', 0.068), ('tr', 0.067), ('oov', 0.065), ('submitting', 0.064), ('romanized', 0.064), ('taiwan', 0.063), ('road', 0.063), ('award', 0.063), ('candidates', 0.063), ('retrieve', 0.063), ('phrase', 0.06), ('interspersed', 0.059), ('awards', 0.059), ('webpage', 0.059), ('kwok', 0.059), ('exact', 0.058), ('search', 0.058), ('multilingual', 0.056), ('finding', 0.055), ('mixed', 0.055), ('entity', 0.054), ('correct', 0.052), ('preparing', 0.052), ('surface', 0.049), ('translating', 0.048), ('names', 0.048), ('name', 0.048), ('recognition', 0.047), ('external', 0.046), ('anchor', 0.045), ('reference', 0.044), ('wu', 0.044), ('japanese', 0.044), ('answer', 0.044), ('tokens', 0.043), ('seeds', 0.042), ('snippet', 0.042), ('queries', 0.042), ('wiki', 0.041), ('proceeding', 0.041), ('code', 0.04), ('retrieving', 0.04), ('cheng', 0.04), ('ldc', 0.039), ('collections', 0.038), ('terms', 0.037), ('al', 0.036), ('bank', 0.036), ('partially', 0.035), ('full', 0.035), ('find', 0.035), ('character', 0.035), ('chang', 0.034), ('system', 0.033), ('distance', 0.033), ('phrasal', 0.033), ('luckily', 0.032), ('zipf', 0.032), ('lcd', 0.032), ('awarding', 0.032), ('kupiec', 0.032), ('casting', 0.032), ('saito', 0.032), ('waste', 0.032), ('acldemo', 0.032)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999964 134 acl-2012-Learning to Find Translations and Transliterations on the Web

Author: Joseph Z. Chang ; Jason S. Chang ; Roger Jyh-Shing Jang

Abstract: By retrieving and identifying such translation counterparts on the Web, we can cope with the OOV problem. In this paper, we present a new method for learning to find translations and transliterations on the Web for a given term. The approach involves using a small set of terms and translations to obtain mixed-code snippets from a search engine, and automatically annotating the snippets with tags and features for training a conditional random field model. At runtime, the model is used to extract translation candidates for a given term. Preliminary experiments and evaluation show that our method cleanly combines various features, resulting in a system that outperforms previous work.

2 0.24217007 212 acl-2012-Using Search-Logs to Improve Query Tagging

Author: Kuzman Ganchev ; Keith Hall ; Ryan McDonald ; Slav Petrov

Abstract: Syntactic analysis of search queries is important for a variety of information-retrieval tasks; however, the lack of annotated data makes training query analysis models difficult. We propose a simple, efficient procedure in which part-of-speech tags are transferred from retrieval-result snippets to queries at training time. Unlike previous work, our final model does not require any additional resources at run-time. Compared to a state-ofthe-art approach, we achieve more than 20% relative error reduction. Additionally, we annotate a corpus of search queries with partof-speech tags, providing a resource for future work on syntactic query analysis.

3 0.23276511 20 acl-2012-A Statistical Model for Unsupervised and Semi-supervised Transliteration Mining

Author: Hassan Sajjad ; Alexander Fraser ; Helmut Schmid

Abstract: We propose a novel model to automatically extract transliteration pairs from parallel corpora. Our model is efficient, language pair independent and mines transliteration pairs in a consistent fashion in both unsupervised and semi-supervised settings. We model transliteration mining as an interpolation of transliteration and non-transliteration sub-models. We evaluate on NEWS 2010 shared task data and on parallel corpora with competitive results.

4 0.16004267 25 acl-2012-An Exploration of Forest-to-String Translation: Does Translation Help or Hurt Parsing?

Author: Hui Zhang ; David Chiang

Abstract: Syntax-based translation models that operate on the output of a source-language parser have been shown to perform better if allowed to choose from a set of possible parses. In this paper, we investigate whether this is because it allows the translation stage to overcome parser errors or to override the syntactic structure itself. We find that it is primarily the latter, but that under the right conditions, the translation stage does correct parser errors, improving parsing accuracy on the Chinese Treebank.

5 0.15236695 140 acl-2012-Machine Translation without Words through Substring Alignment

Author: Graham Neubig ; Taro Watanabe ; Shinsuke Mori ; Tatsuya Kawahara

Abstract: In this paper, we demonstrate that accurate machine translation is possible without the concept of “words,” treating MT as a problem of transformation between character strings. We achieve this result by applying phrasal inversion transduction grammar alignment techniques to character strings to train a character-based translation model, and using this in the phrase-based MT framework. We also propose a look-ahead parsing algorithm and substring-informed prior probabilities to achieve more effective and efficient alignment. In an evaluation, we demonstrate that character-based translation can achieve results that compare to word-based systems while effectively translating unknown and uncommon words over several language pairs.

6 0.14815156 54 acl-2012-Combining Word-Level and Character-Level Models for Machine Translation Between Closely-Related Languages

7 0.1451759 66 acl-2012-DOMCAT: A Bilingual Concordancer for Domain-Specific Computer Assisted Translation

8 0.14511113 150 acl-2012-Multilingual Named Entity Recognition using Parallel Data and Metadata from Wikipedia

9 0.12647957 141 acl-2012-Maximum Expected BLEU Training of Phrase and Lexicon Translation Models

10 0.12557548 3 acl-2012-A Class-Based Agreement Model for Generating Accurately Inflected Translations

11 0.1198764 1 acl-2012-ACCURAT Toolkit for Multi-Level Alignment and Information Extraction from Comparable Corpora

12 0.11584621 131 acl-2012-Learning Translation Consensus with Structured Label Propagation

13 0.11409474 147 acl-2012-Modeling the Translation of Predicate-Argument Structure for SMT

14 0.11152286 67 acl-2012-Deciphering Foreign Language by Combining Language Models and Context Vectors

15 0.10971068 203 acl-2012-Translation Model Adaptation for Statistical Machine Translation with Monolingual Topic Information

16 0.10814214 206 acl-2012-UWN: A Large Multilingual Lexical Knowledge Base

17 0.10624491 128 acl-2012-Learning Better Rule Extraction with Translation Span Alignment

18 0.10379709 178 acl-2012-Sentence Simplification by Monolingual Machine Translation

19 0.10066696 155 acl-2012-NiuTrans: An Open Source Toolkit for Phrase-based and Syntax-based Machine Translation

20 0.097928368 81 acl-2012-Enhancing Statistical Machine Translation with Character Alignment


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.293), (1, -0.113), (2, 0.076), (3, 0.073), (4, 0.174), (5, 0.111), (6, 0.029), (7, -0.087), (8, -0.025), (9, -0.061), (10, 0.144), (11, 0.031), (12, -0.014), (13, 0.069), (14, 0.073), (15, -0.096), (16, 0.082), (17, -0.005), (18, 0.054), (19, 0.07), (20, -0.132), (21, 0.225), (22, 0.2), (23, 0.003), (24, -0.099), (25, 0.026), (26, 0.003), (27, -0.013), (28, -0.029), (29, 0.005), (30, -0.118), (31, -0.021), (32, -0.121), (33, -0.07), (34, 0.041), (35, -0.029), (36, 0.133), (37, 0.106), (38, 0.107), (39, -0.067), (40, 0.106), (41, -0.087), (42, 0.016), (43, -0.018), (44, 0.061), (45, 0.037), (46, -0.021), (47, -0.147), (48, 0.124), (49, -0.033)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94181466 134 acl-2012-Learning to Find Translations and Transliterations on the Web

Author: Joseph Z. Chang ; Jason S. Chang ; Roger Jyh-Shing Jang

Abstract: By retrieving and identifying such translation counterparts on the Web, we can cope with the OOV problem. In this paper, we present a new method for learning to find translations and transliterations on the Web for a given term. The approach involves using a small set of terms and translations to obtain mixed-code snippets from a search engine, and automatically annotating the snippets with tags and features for training a conditional random field model. At runtime, the model is used to extract translation candidates for a given term. Preliminary experiments and evaluation show that our method cleanly combines various features, resulting in a system that outperforms previous work.

2 0.6890232 20 acl-2012-A Statistical Model for Unsupervised and Semi-supervised Transliteration Mining

Author: Hassan Sajjad ; Alexander Fraser ; Helmut Schmid

Abstract: We propose a novel model to automatically extract transliteration pairs from parallel corpora. Our model is efficient, language pair independent and mines transliteration pairs in a consistent fashion in both unsupervised and semi-supervised settings. We model transliteration mining as an interpolation of transliteration and non-transliteration sub-models. We evaluate on NEWS 2010 shared task data and on parallel corpora with competitive results.

3 0.6313743 66 acl-2012-DOMCAT: A Bilingual Concordancer for Domain-Specific Computer Assisted Translation

Author: Ming-Hong Bai ; Yu-Ming Hsieh ; Keh-Jiann Chen ; Jason S. Chang

Abstract: In this paper, we propose a web-based bilingual concordancer, DOMCAT, for domain-specific computer assisted translation. Given a multi-word expression as a query, the system involves retrieving sentence pairs from a bilingual corpus, identifying translation equivalents of the query in the sentence pairs (translation spotting) and ranking the retrieved sentence pairs according to the relevance between the query and the translation equivalents. To provide high-precision translation spotting for domain-specific translation tasks, we exploited a normalized correlation method to spot the translation equivalents. To rank the retrieved sentence pairs, we propose a correlation function modified from the Dice coefficient for assessing the correlation between the query and the translation equivalents. The performances of the translation spotting module and the ranking module are evaluated in terms of precision-recall measures and coverage rate respectively.

4 0.62171787 54 acl-2012-Combining Word-Level and Character-Level Models for Machine Translation Between Closely-Related Languages

Author: Preslav Nakov ; Jorg Tiedemann

Abstract: We propose several techniques for improving statistical machine translation between closely-related languages with scarce resources. We use character-level translation trained on n-gram-character-aligned bitexts and tuned using word-level BLEU, which we further augment with character-based transliteration at the word level and combine with a word-level translation model. The evaluation on Macedonian-Bulgarian movie subtitles shows an improvement of 2.84 BLEU points over a phrase-based word-level baseline.

5 0.54893053 26 acl-2012-Applications of GPC Rules and Character Structures in Games for Learning Chinese Characters

Author: Wei-Jie Huang ; Chia-Ru Chou ; Yu-Lin Tzeng ; Chia-Ying Lee ; Chao-Lin Liu

Abstract: We demonstrate applications of psycholinguistic and sublexical information for learning Chinese characters. The knowledge about the grapheme-phoneme conversion (GPC) rules of languages has been shown to be highly correlated to the ability of reading alphabetic languages and Chinese. We build and will demo a game platform for strengthening the association of phonological components in Chinese characters with the pronunciations of the characters. Results of a preliminary evaluation of our games indicated significant improvement in learners’ response times in Chinese naming tasks. In addition, we construct a Webbased open system for teachers to prepare their own games to best meet their teaching goals. Techniques for decomposing Chinese characters and for comparing the similarity between Chinese characters were employed to recommend lists of Chinese characters for authoring the games. Evaluation of the authoring environment with 20 subjects showed that our system made the authoring of games more effective and efficient.

6 0.54495609 212 acl-2012-Using Search-Logs to Improve Query Tagging

7 0.53100109 1 acl-2012-ACCURAT Toolkit for Multi-Level Alignment and Information Extraction from Comparable Corpora

8 0.52399409 156 acl-2012-Online Plagiarized Detection Through Exploiting Lexical, Syntax, and Semantic Information

9 0.47335523 140 acl-2012-Machine Translation without Words through Substring Alignment

10 0.45834127 25 acl-2012-An Exploration of Forest-to-String Translation: Does Translation Help or Hurt Parsing?

11 0.45316586 150 acl-2012-Multilingual Named Entity Recognition using Parallel Data and Metadata from Wikipedia

12 0.45065871 92 acl-2012-FLOW: A First-Language-Oriented Writing Assistant System

13 0.44908264 35 acl-2012-Automatically Mining Question Reformulation Patterns from Search Log Data

14 0.4395746 46 acl-2012-Character-Level Machine Translation Evaluation for Languages with Ambiguous Word Boundaries

15 0.43852758 67 acl-2012-Deciphering Foreign Language by Combining Language Models and Context Vectors

16 0.43760982 81 acl-2012-Enhancing Statistical Machine Translation with Character Alignment

17 0.43593344 131 acl-2012-Learning Translation Consensus with Structured Label Propagation

18 0.43216082 27 acl-2012-Arabic Retrieval Revisited: Morphological Hole Filling

19 0.42926416 178 acl-2012-Sentence Simplification by Monolingual Machine Translation

20 0.42162809 142 acl-2012-Mining Entity Types from Query Logs via User Intent Modeling


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(26, 0.447), (28, 0.067), (30, 0.012), (37, 0.027), (39, 0.042), (74, 0.025), (82, 0.012), (85, 0.027), (90, 0.161), (92, 0.04), (94, 0.014), (99, 0.039)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.95189482 91 acl-2012-Extracting and modeling durations for habits and events from Twitter

Author: Jennifer Williams ; Graham Katz

Abstract: We seek to automatically estimate typical durations for events and habits described in Twitter tweets. A corpus of more than 14 million tweets containing temporal duration information was collected. These tweets were classified as to their habituality status using a bootstrapped decision tree. For each verb lemma, associated duration information was collected for episodic and habitual uses of the verb. Summary statistics for 483 verb lemmas and their typical habit and episode durations has been compiled and made available. This automatically generated duration information is broadly comparable to hand-annotation. 1

2 0.93630105 209 acl-2012-Unsupervised Semantic Role Induction with Global Role Ordering

Author: Nikhil Garg ; James Henderson

Abstract: We propose a probabilistic generative model for unsupervised semantic role induction, which integrates local role assignment decisions and a global role ordering decision in a unified model. The role sequence is divided into intervals based on the notion of primary roles, and each interval generates a sequence of secondary roles and syntactic constituents using local features. The global role ordering consists of the sequence of primary roles only, thus making it a partial ordering.

3 0.93337518 43 acl-2012-Building Trainable Taggers in a Web-based, UIMA-Supported NLP Workbench

Author: Rafal Rak ; BalaKrishna Kolluru ; Sophia Ananiadou

Abstract: Argo is a web-based NLP and text mining workbench with a convenient graphical user interface for designing and executing processing workflows of various complexity. The workbench is intended for specialists and nontechnical audiences alike, and provides the ever expanding library of analytics compliant with the Unstructured Information Management Architecture, a widely adopted interoperability framework. We explore the flexibility of this framework by demonstrating workflows involving three processing components capable of performing self-contained machine learning-based tagging. The three components are responsible for the three distinct tasks of 1) generating observations or features, 2) training a statistical model based on the generated features, and 3) tagging unlabelled data with the model. The learning and tagging components are based on an implementation of conditional random fields (CRF); whereas the feature generation component is an analytic capable of extending basic token information to a comprehensive set of features. Users define the features of their choice directly from Argo’s graphical interface, without resorting to programming (a commonly used approach to feature engineering). The experimental results performed on two tagging tasks, chunking and named entity recognition, showed that a tagger with a generic set of features built in Argo is capable of competing with taskspecific solutions. 121

same-paper 4 0.89990711 134 acl-2012-Learning to Find Translations and Transliterations on the Web

Author: Joseph Z. Chang ; Jason S. Chang ; Roger Jyh-Shing Jang

Abstract: By retrieving and identifying such translation counterparts on the Web, we can cope with the OOV problem. In this paper, we present a new method for learning to find translations and transliterations on the Web for a given term. The approach involves using a small set of terms and translations to obtain mixed-code snippets from a search engine, and automatically annotating the snippets with tags and features for training a conditional random field model. At runtime, the model is used to extract translation candidates for a given term. Preliminary experiments and evaluation show that our method cleanly combines various features, resulting in a system that outperforms previous work.

5 0.79513061 41 acl-2012-Bootstrapping a Unified Model of Lexical and Phonetic Acquisition

Author: Micha Elsner ; Sharon Goldwater ; Jacob Eisenstein

Abstract: During early language acquisition, infants must learn both a lexicon and a model of phonetics that explains how lexical items can vary in pronunciation—for instance “the” might be realized as [Di] or [D@]. Previous models of acquisition have generally tackled these problems in isolation, yet behavioral evidence suggests infants acquire lexical and phonetic knowledge simultaneously. We present a Bayesian model that clusters together phonetic variants of the same lexical item while learning both a language model over lexical items and a log-linear model of pronunciation variability based on articulatory features. The model is trained on transcribed surface pronunciations, and learns by bootstrapping, without access to the true lexicon. We test the model using a corpus of child-directed speech with realistic phonetic variation and either gold standard or automatically induced word boundaries. In both cases modeling variability improves the accuracy of the learned lexicon over a system that assumes each lexical item has a unique pronunciation.

6 0.6257804 187 acl-2012-Subgroup Detection in Ideological Discussions

7 0.6198566 48 acl-2012-Classifying French Verbs Using French and English Lexical Resources

8 0.59399414 11 acl-2012-A Feature-Rich Constituent Context Model for Grammar Induction

9 0.56954485 5 acl-2012-A Comparison of Chinese Parsers for Stanford Dependencies

10 0.56924349 147 acl-2012-Modeling the Translation of Predicate-Argument Structure for SMT

11 0.56696266 45 acl-2012-Capturing Paradigmatic and Syntagmatic Lexical Relations: Towards Accurate Chinese Part-of-Speech Tagging

12 0.55623358 47 acl-2012-Chinese Comma Disambiguation for Discourse Analysis

13 0.55590528 148 acl-2012-Modified Distortion Matrices for Phrase-Based Statistical Machine Translation

14 0.55588669 124 acl-2012-Joint Inference of Named Entity Recognition and Normalization for Tweets

15 0.55483705 123 acl-2012-Joint Feature Selection in Distributed Stochastic Learning for Large-Scale Discriminative Training in SMT

16 0.55458266 83 acl-2012-Error Mining on Dependency Trees

17 0.5541274 174 acl-2012-Semantic Parsing with Bayesian Tree Transducers

18 0.55242378 137 acl-2012-Lemmatisation as a Tagging Task

19 0.55207205 206 acl-2012-UWN: A Large Multilingual Lexical Knowledge Base

20 0.55163223 211 acl-2012-Using Rejuvenation to Improve Particle Filtering for Bayesian Word Segmentation