acl acl2013 acl2013-92 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Lian Tze Lim ; Lay-Ki Soon ; Tek Yong Lim ; Enya Kong Tang ; Bali Ranaivo-Malancon
Abstract: Current approaches for word sense disambiguation and translation selection typically require lexical resources or large bilingual corpora with rich information fields and annotations, which are often infeasible for under-resourced languages. We extract translation context knowledge from a bilingual comparable corpus of a richer-resourced language pair, and inject it into a multilingual lexicon. The multilingual lexicon can then be used to perform context-dependent lexical lookup on texts of any language, including under-resourced ones. Evaluations on a prototype lookup tool, trained on an English–Malay bilingual Wikipedia corpus, show a precision score of 0.65 (baseline 0.55) and mean reciprocal rank score of 0.81 (baseline 0.771). Based on the early encouraging results, the context-dependent lexical lookup tool may be developed further into an intelligent reading aid, to help users grasp the gist of a second or foreign language text.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Current approaches for word sense disambiguation and translation selection typically require lexical resources or large bilingual corpora with rich information fields and annotations, which are often infeasible for under-resourced languages. [sent-3, score-0.638]
2 We extract translation context knowledge from a bilingual comparable corpus of a richer-resourced language pair, and inject it into a multilingual lexicon. [sent-4, score-0.585]
3 The multilingual lexicon can then be used to perform context-dependent lexical lookup on texts of any language, including under-resourced ones. [sent-5, score-0.462]
4 Evaluations on a prototype lookup tool, trained on an English–Malay bilingual Wikipedia corpus, show a precision score of 0. [sent-6, score-0.582]
5 Based on the early encouraging results, the context-dependent lexical lookup tool may be developed further into an intelligent reading aid, to help users grasp the gist of a second or foreign language text. [sent-11, score-0.449]
6 1 Introduction Word sense disambiguation (WSD) is the task of assigning sense tags to ambiguous lexical items (LIs) in a text. [sent-12, score-0.295]
7 Translation selection chooses target language items for translating ambiguous LIs in a text, and can therefore be viewed as a kind of WSD task, with translations as the sense tags. [sent-13, score-0.118]
8 The translation selection task may also be modified slightly to output a ranked list of translations. [sent-14, score-0.264]
9 This then resembles a dictionary lookup process as performed by a human reader when reading or browsing a text written in a second or foreign language. [sent-15, score-0.374]
10 There is a large body of work around WSD and translation selection. [sent-23, score-0.191]
11 However, many of these approaches require lexical resources or large bilingual corpora with rich information fields and annotations, as reviewed in section 2. [sent-24, score-0.11]
12 Unfortunately, not all languages have equal amounts of digital resources for developing language technologies, and such requirements are often infeasible for under-resourced languages. [sent-25, score-0.192]
13 We are interested in leveraging richer-resourced language pairs to enable context-dependent lexical lookup for under-resourced languages. [sent-26, score-0.376]
14 For this purpose, we model translation context knowledge as a second-order co-occurrence bag-of-words model. [sent-27, score-0.234]
15 We propose a rapid approach for acquiring them from an untagged, comparable bilingual corpus of a (richer-resourced) language pair in section 3. [sent-28, score-0.209]
16 This information is then transferred into a multilingual lexicon to perform context-dependent lexical lookup on input texts, including those in an under-resourced language (section 4). [sent-29, score-0.693]
17 Section 5 describes a prototype implementation, where translation context knowledge is extracted from an English–Malay bilingual corpus to enrich a multilingual lexicon with six languages. [sent-30, score-0.597]
18 2 Typical Resource Requirements for Translation Selection WSD and translation selection approaches may be broadly classified into two categories depending [page 294 footer: Proceedings of the 51st Annual Meeting of t…, Sofia, Bulgaria, August 4-9 2013] [sent-33, score-0.223]
19 Knowledge-based approaches make use of various types of information from existing dictionaries, thesauri, or other lexical resources. [sent-36, score-0.04]
20 Possible knowledge sources include definition or gloss text (Banerjee and Pedersen, 2003), subject codes (Magnini et al. [sent-37, score-0.036]
21 Nevertheless, lexical resources of such rich content types are usually available for medium- to rich-resourced languages only, and are costly to build and verify by hand. [sent-40, score-0.086]
22 Some approaches therefore turn to corpus-based methods, using bilingual corpora as learning resources for translation selection. [sent-41, score-0.368]
23 As it is not always possible to acquire parallel corpora, comparable corpora, or even independent second-language corpora have also been shown to be suitable for training purposes, either by purely numerical means (Li and Li, 2004) or with the aid of syntactic relations (Zhou et al. [sent-45, score-0.106]
24 Vector-based models, which capture the context of a translation or meaning, have also been used (Schütze, 1998; Papp, 2009). [sent-47, score-0.234]
25 For under-resourced languages, however, bilingual corpora of sufficient size may still be unavailable. [sent-48, score-0.294]
26 3 Enriching Multilingual Lexicon with Translation Context Knowledge Corpus-driven translation selection approaches typically derive supporting semantic information from an aligned corpus, where a text and its translation are aligned at the sentence, phrase and word level. [sent-49, score-0.414]
27 However, aligned corpora can be difficult to obtain for under-resourced language pairs, and are expensive to construct. [sent-50, score-0.037]
28 On the other hand, documents in a comparable corpus comprise bilingual or multilingual text of a similar nature, and need not even be exact translations of each other. [sent-51, score-0.314]
29 Comparable corpora are relatively easier to obtain, especially for richer-resourced languages. [sent-53, score-0.037]
30 1 Overview of Multilingual Lexicon Entries in our multilingual lexicon are organised as multilingual translation sets, each corresponding to a coarse-grained concept, and whose members are LIs from different languages {L1, . [sent-55, score-0.5]
31 For example, following are two translation sets containing different senses of English «bank» (‘financial institution’ and ‘riverside land’): TS1 = {«bank»eng, «bank»msa, «银行»zho, . [sent-61, score-0.191]
32 Multilingual lexicons with under-resourced languages can be rapidly bootstrapped from simple bilingual translation lists (Lim et al. [sent-68, score-0.377]
33 Our multilingual lexicon currently contains 24371 English, 13226 Chinese, 35640 Malay, 17063 French, 14687 Thai and 5629 Iban LIs. [sent-70, score-0.158]
34 2 Extracting Translation Context Knowledge from Comparable Corpus We model translation knowledge as a bag-of-words consisting of the context of a translation equivalence in the corpus. [sent-72, score-0.458]
35 We then run latent semantic indexing (LSI) (Deerwester et al. [sent-73, score-0.04]
36 A vector is then obtained for each LI in both languages, which may be regarded as encoding some translation context knowledge. [sent-75, score-0.234]
37 While LSI is more frequently used in information retrieval, the translation knowledge acquisition task can be recast as a cross-lingual indexing task, following (Dumais et al. [sent-76, score-0.231]
38 The underlying intuition is that in a comparable bilingual corpus, a document pair about finance would be more likely to contain English «bank»eng and Malay «bank»msa (‘financial institution’), as opposed to Malay «tebing»msa (‘riverside’). [sent-78, score-0.209]
39 The words appearing in this document pair would then be an indicative context for the translation equivalence between «bank»eng and «bank»msa. [sent-79, score-0.267]
40 In other words, the translation equivalents present serve as a kind of implicit sense tag. [sent-80, score-0.308]
41 Briefly, a translation knowledge vector is obtained for each multilingual translation set from a bilingual comparable corpus as follows: 1. [sent-81, score-0.696]
42 Each bilingual pair of documents is merged as one single document, with each LI tagged with its respective language code. [sent-82, score-0.14]
43 remove closed-class words, perform stemming or lemmatisation, and word segmentation for languages without word boundaries (Chinese, Thai). [sent-86, score-0.046]
44 Set the vector associated with each translation set to be the sum of all available vectors of its member LIs. [sent-96, score-0.244]
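The first and last steps of the procedure above (merging each document pair with language-tagged LIs, and summing member LI vectors into a translation set vector) can be sketched as follows. This is a minimal illustration only, not the authors' implementation: the function names, the `token_lang` tagging scheme, and the toy vectors are all assumptions, and the LSI step that would actually produce the per-LI vectors is omitted.

```python
def merge_document_pair(tokens_by_lang):
    """Merge one comparable document pair into a single list of language-tagged tokens.

    The 'token_lang' tag format is an assumption made for this sketch.
    """
    merged = []
    for lang, tokens in tokens_by_lang.items():
        merged.extend(f"{token}_{lang}" for token in tokens)
    return merged


def translation_set_vector(members, li_vectors):
    """Sum the vectors of all member LIs that have one; return None if none do."""
    available = [li_vectors[m] for m in members if m in li_vectors]
    if not available:
        return None
    return [sum(components) for components in zip(*available)]


# Toy data with hypothetical values (real vectors would come from LSI).
pair = {"eng": ["bank", "loan"], "msa": ["bank", "pinjaman"]}
merged = merge_document_pair(pair)

li_vectors = {"bank_eng": [0.5, 0.25], "bank_msa": [0.25, 0.0]}
ts1 = ["bank_eng", "bank_msa", "银行_zho"]  # the Chinese member has no vector
ts1_vector = translation_set_vector(ts1, li_vectors)
```

Members without a vector (here the Chinese LI, which does not occur in an English–Malay corpus) are simply skipped, which matches the "sum of all available vectors" wording in the step above.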
45 4 Context-Dependent Lexical Lookup Given an input text in language Li (1 ≤ i ≤ N), the lookup module should return a list of multilingual translation set entries, which would contain L1, L2, . [sent-97, score-0.606]
46 , LN translation equivalents of LIs in the input text, wherever available. [sent-100, score-0.264]
47 For polysemous LIs, the lookup module should return translation sets that convey the appropriate meaning in context. [sent-101, score-0.564]
48 For each input text segment Q (typically a sentence), a ‘query vector’, VQ is computed by taking the vectorial sum of all open class LIs in the input Q. [sent-102, score-0.084]
49 For each LI l in the input, the list of all translation sets containing l, is retrieved into TSl. [sent-103, score-0.191]
50 the cosine similarity between the query vector VQ and the translation set candidate t’s vector) for all t ∈ TSl. [sent-106, score-0.191]
51 If the language of input Q is not present in the bilingual training corpus (e. [sent-107, score-0.182]
52 Iban, an under-resourced language spoken in Borneo), VQ is then computed as the sum of all vectors associated with all translation sets in TSl. [sent-109, score-0.361]
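The lookup described in this section can be sketched as below. This is one possible reading under stated assumptions, not the LEXICALSELECTOR code: the function names, the dictionary-based data layout, and the toy vectors are hypothetical. In particular, the fallback here is applied per LI (an LI without its own vector contributes the vectors of every translation set that contains it), which is an interpretation of the description above. `x_iba` and `y_iba` are placeholder LIs, not real Iban words.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def query_vector(input_lis, li_vectors, sets_containing, ts_vectors, dim):
    """Build V_Q by summing one or more vectors per open-class LI in the input."""
    vq = [0.0] * dim
    for li in input_lis:
        if li in li_vectors:
            contributions = [li_vectors[li]]
        else:
            # Fallback for languages absent from the training corpus.
            contributions = [ts_vectors[t] for t in sets_containing.get(li, [])]
        for vec in contributions:
            vq = [a + b for a, b in zip(vq, vec)]
    return vq


def rank_candidates(li, vq, sets_containing, ts_vectors):
    """Rank the translation sets containing `li` by cosine similarity to V_Q."""
    candidates = sets_containing.get(li, [])
    return sorted(candidates, key=lambda t: cosine(vq, ts_vectors[t]), reverse=True)


# Toy data with hypothetical values.
ts_vectors = {"TS_finance": [1.0, 0.0], "TS_river": [0.0, 1.0]}
li_vectors = {"bank_eng": [0.5, 0.5], "loan_eng": [1.0, 0.0]}
sets_containing = {
    "bank_eng": ["TS_finance", "TS_river"],
    "x_iba": ["TS_river"],
    "y_iba": ["TS_finance", "TS_river"],
}

vq_eng = query_vector(["bank_eng", "loan_eng"], li_vectors, sets_containing, ts_vectors, 2)
eng_ranking = rank_candidates("bank_eng", vq_eng, sets_containing, ts_vectors)

vq_iba = query_vector(["x_iba", "y_iba"], li_vectors, sets_containing, ts_vectors, 2)
iba_ranking = rank_candidates("y_iba", vq_iba, sets_containing, ts_vectors)
```

In the toy English input, the neighbouring LI "loan" pulls the query vector towards the finance sense of "bank"; in the placeholder Iban input, the unambiguous neighbour does the same job through its translation set vector.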
53 5 Prototype Implementation We have implemented LEXICALSELECTOR, a prototype context-dependent lexical lookup tool in Java, trained on an English–Malay bilingual corpus built from Wikipedia articles. [sent-111, score-0.616]
54 Wikipedia articles are freely available under a Creative Commons license, thus providing a convenient source of bilingual comparable corpus. [sent-112, score-0.209]
55 Note that while the training corpus is English–Malay, the trained lookup tool can be applied to texts of any language included in the multilingual dictionary. [sent-113, score-0.476]
56 To form the bilingual corpus, each Malay article is concatenated with its corresponding English article as one document. [sent-115, score-0.14]
57 The indexing process, using 1000 factors, took about 45 minutes on a MacBook Pro with a 2. [sent-118, score-0.04]
58 The vectors obtained for each English and Malay LIs were then used to populate the translation context knowledge vectors of translation set in a multilingual lexicon, which comprise six languages: English, Malay, Chinese, French, Thai and Iban. [sent-120, score-0.636]
59 As mentioned earlier, LEXICALSELECTOR can process texts in any member languages of the multilingual lexicon, instead of only the languages of the training corpus (English and Malay). [sent-121, score-0.197]
60 Figure 1 shows the context-dependent lexical lookup outputs for the Iban input ‘Lelaki nya tikah enggau emperaja iya, siko dayang ke ligung’. [sent-122, score-0.978]
61 6 Early Experimental Results 80 input sentences containing LIs with translation ambiguities were randomly selected from the Internet (English, Malay and Chinese) and contributed by a native speaker (Iban). [sent-124, score-0.233]
62 com/gensim/ Figure 1: LEXICALSELECTOR output for Iban input ‘Lelaki nya tikah enggau emperaja iya, siko dayang ke ligung’. [sent-130, score-0.602]
63 English «bank» (financial institution or riverside land), • Malay «kabinet» (governmental Cabinet or household furniture), • Malay «mangga» (mango or padlock), • Chinese «谷» (gù, valley or grain) and • Iban «emperaja» (rainbow or lover). [sent-132, score-0.124]
64 LIs of each language and their associated vectors can then be retrieved from the multilingual lexicon. [sent-138, score-0.158]
65 The prototype tool LEXICALSELECTOR then computes the CSim score and ranks potential translation sets for each LI in the input sentences (ranking strategy wiki-lsi). [sent-139, score-0.361]
66 The baseline strategy (base-freq) selects the translation set whose members occur most frequently in the bilingual Wikipedia corpus. [sent-140, score-0.359]
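A frequency baseline of this kind might look like the sketch below. The text only says that base-freq picks the translation set whose members occur most frequently in the corpus; scoring a set by the sum of its members' corpus counts is an assumption of this sketch, as are the function name and the toy token list.

```python
from collections import Counter


def base_freq_choice(candidates, members_of, corpus_counts):
    """Pick the candidate translation set with the highest total member frequency."""
    return max(candidates, key=lambda t: sum(corpus_counts[m] for m in members_of[t]))


# Toy corpus of language-tagged tokens (hypothetical).
corpus_counts = Counter(["bank_eng", "bank_msa", "bank_msa", "tebing_msa"])
members_of = {
    "TS_finance": ["bank_eng", "bank_msa"],
    "TS_river": ["bank_eng", "tebing_msa"],
}
choice = base_freq_choice(["TS_finance", "TS_river"], members_of, corpus_counts)
```

Unlike the cosine-based ranking, this choice ignores the input sentence entirely, so it returns the same translation set for every occurrence of an LI.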
67 ) The Google Translate interface makes available the ranked list of translation candidates for each word in an input sentence, one language 4http ://www-nlp . [sent-143, score-0.274]
68 The translated word for each of the input test words can therefore be noted. [sent-153, score-0.042]
69 The highest rank of the correct translation for the test words in English/Chinese/Malay are used to evaluate goog-tr. [sent-154, score-0.191]
70 The first metric is by taking the precision of the first translation set returned by each ranking strategy, i. [sent-156, score-0.259]
71 whether the top ranked translation set contains the correct translation of the ambiguous item. [sent-158, score-0.423]
72 The precision metric is important for applications like machine translation, where only the top-ranked meaning or translation is considered. [sent-159, score-0.259]
73 This is measured by the mean reciprocal rank (MRR), the average of the reciprocal ranks of the correct translation set for each input sentence in the test set T: $\mathrm{MRR} = \frac{1}{|T|} \sum_{i=1}^{|T|} \frac{1}{\mathrm{rank}_i}$. The results for the three ranking strategies are summarised in Table 1. [sent-163, score-0.315]
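Both evaluation metrics can be computed from the rank at which each test item's correct translation set appears. The sketch below assumes that a test item whose correct translation set is missing from the ranked list is recorded as `None` and contributes 0 to both metrics; that convention, the function names, and the toy ranks are assumptions rather than details taken from the paper.

```python
def precision_at_1(ranks):
    """Fraction of test items whose correct translation set is ranked first."""
    return sum(1 for r in ranks if r == 1) / len(ranks)


def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all test items; missing items (None) add 0."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


# Hypothetical ranks of the correct translation set for four test sentences.
ranks = [1, 2, None, 1]
precision = precision_at_1(ranks)    # 2 of 4 ranked first
mrr = mean_reciprocal_rank(ranks)    # (1 + 0.5 + 0 + 1) / 4
```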
74 650 when all 80 input sentences are tested, while the base-freq baseline scored 0. [sent-165, score-0.08]
75 wiki-lsi even outperforms goog-tr when only the Chinese and Malay test sentences are considered for the MRR metric, as goog-tr [Table 1: Precision and MRR scores of context-dependent lexical lookup] [sent-174, score-0.376]
76 did not present the correct translation in its list of alternative translation candidates for some test sentences. [sent-181, score-0.382]
77 This suggests that the LSI-based translation context knowledge vectors would be helpful in building an intelligent reading aid. [sent-182, score-0.325]
78 7 Discussion wiki-lsi performed better than base-freq for both the precision and the MRR metrics, although further testing is warranted, given the small size of the current test set. [sent-183, score-0.041]
79 While wiki-lsi is not yet sufficiently accurate to be used directly in an MT system, it is helpful in producing a list of ranked multilingual translation sets depending on the input context, as part of an intelligent reading aid. [sent-184, score-0.417]
80 Specifically, the lookup module would have benefited if syntactic information (e. [sent-185, score-0.373]
81 Note that even though the translation context knowledge vectors were extracted from an English– Malay corpus, the same vectors can be applied on Chinese and Iban input sentences as well. [sent-189, score-0.382]
82 This is especially significant for Iban, which otherwise lacks resources from which a lookup or disambiguation tool can be trained. [sent-190, score-0.454]
83 Translation context knowledge vectors mined via LSI from a bilingual comparable corpus therefore offer a fast, low-cost and efficient fallback strategy for acquiring multilingual translation equivalence context information. [sent-191, score-0.705]
84 In the task’s ‘best’ evaluation (which is comparable to our ‘Precision’ metric), Basile and Semeraro’s system scored 26. [sent-195, score-0.107]
85 This strategy of selecting the most frequent translation is similar to our base-freq baseline strategy. [sent-198, score-0.219]
86 (2011) also tackled the problem of cross-lingual disambiguation for under-resourced language pairs (English–Persian) using Wikipedia articles, by applying the one sense per collocation and one sense per discourse heuristics on a comparable corpus. [sent-200, score-0.441]
87 However, developing wordnets for new languages is no trivial effort, as acknowledged by the authors. [sent-203, score-0.077]
88 9 Conclusion We extracted translation context knowledge from a bilingual comparable corpus by running LSI on the corpus. [sent-204, score-0.443]
89 A context-dependent multilingual lexical lookup module was implemented, using the cosine similarity score between the vector of the input sentence and those of candidate translation sets to rank the latter in order of relevance. [sent-205, score-0.751]
90 The precision and MRR scores outperformed Google Translate’s lexical selection for medium- and under-resourced language test inputs. [sent-206, score-0.113]
91 The LSI-backed translation context knowledge vectors, mined from bilingual comparable corpora, thus provide a fast and affordable data source for building intelligent reading aids, especially for under-resourced languages. [sent-207, score-0.481]
92 A 3-Letter ISO Language Codes: eng = English, zho = Chinese, tha = Thai, msa = Malay, fra = French, iba = Iban. References Satanjeev Banerjee and Ted Pedersen. [sent-210, score-0.359]
93 UBA: Using automatic translation and Wikipedia for crosslingual lexical substitution. [sent-216, score-0.231]
94 Low cost construction of a multilingual lexicon from bilingual lists. [sent-248, score-0.298]
95 OWNS: Crosslingual word sense disambiguation using weighted overlap counts and Wordnet based similarity measures. [sent-257, score-0.169]
96 Exploiting parallel texts for word sense disambiguation: An empirical study. [sent-265, score-0.086]
97 Vector-based unsupervised word sense disambiguation for large number of contexts. [sent-269, score-0.169]
98 Cross-lingual word sense disambiguation for languages with scarce resources. [sent-274, score-0.215]
99 Learning a robust word sense disambiguation model using hypernyms in definition sentences. [sent-283, score-0.169]
100 Improving translation selection with a new translation model trained by independent monolingual corpora. [sent-288, score-0.414]
wordName wordTfidf (topN-words)
[('malay', 0.497), ('iban', 0.351), ('lookup', 0.336), ('translation', 0.191), ('emperaja', 0.141), ('iba', 0.141), ('malaysia', 0.141), ('bilingual', 0.14), ('lis', 0.129), ('underresourced', 0.117), ('lsi', 0.108), ('multilingual', 0.105), ('vq', 0.104), ('mrr', 0.101), ('lelaki', 0.094), ('lexicalselector', 0.094), ('ligung', 0.094), ('tikah', 0.094), ('bank', 0.093), ('sense', 0.086), ('dayang', 0.083), ('disambiguation', 0.083), ('msa', 0.076), ('chinese', 0.076), ('thai', 0.072), ('eng', 0.072), ('semeval', 0.071), ('enggau', 0.07), ('iya', 0.07), ('lover', 0.07), ('nya', 0.07), ('riverside', 0.07), ('sarawak', 0.07), ('siko', 0.07), ('tdm', 0.07), ('xxv', 0.07), ('zho', 0.07), ('comparable', 0.069), ('prototype', 0.065), ('institution', 0.054), ('basile', 0.054), ('lexicon', 0.053), ('vectors', 0.053), ('li', 0.049), ('wsd', 0.048), ('wikipedia', 0.048), ('lim', 0.047), ('csim', 0.047), ('enya', 0.047), ('gmail', 0.047), ('mahapatra', 0.047), ('penang', 0.047), ('rainbow', 0.047), ('sarrafzadeh', 0.047), ('semeraro', 0.047), ('shirai', 0.047), ('tebing', 0.047), ('universiti', 0.047), ('iso', 0.046), ('languages', 0.046), ('context', 0.043), ('input', 0.042), ('giovanni', 0.041), ('lefever', 0.041), ('lian', 0.041), ('english', 0.041), ('precision', 0.041), ('reciprocal', 0.041), ('ranked', 0.041), ('indexing', 0.04), ('lexical', 0.04), ('scored', 0.038), ('reading', 0.038), ('tze', 0.038), ('corpora', 0.037), ('module', 0.037), ('codes', 0.036), ('bali', 0.036), ('financial', 0.036), ('tool', 0.035), ('uppsala', 0.033), ('dumais', 0.033), ('equivalence', 0.033), ('gual', 0.033), ('ke', 0.032), ('xv', 0.032), ('selection', 0.032), ('equivalents', 0.031), ('multimedia', 0.031), ('persian', 0.031), ('wordnets', 0.031), ('land', 0.03), ('infeasible', 0.029), ('strategy', 0.028), ('deerwester', 0.027), ('metric', 0.027), ('google', 0.027), ('magnini', 0.027), ('ide', 0.027), ('ln', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999964 92 acl-2013-Context-Dependent Multilingual Lexical Lookup for Under-Resourced Languages
2 0.13048297 93 acl-2013-Context Vector Disambiguation for Bilingual Lexicon Extraction from Comparable Corpora
Author: Dhouha Bouamor ; Nasredine Semmar ; Pierre Zweigenbaum
Abstract: This paper presents an approach that extends the standard approach used for bilingual lexicon extraction from comparable corpora. We focus on the unresolved problem of polysemous words revealed by the bilingual dictionary and introduce a use of a Word Sense Disambiguation process that aims at improving the adequacy of context vectors. On two specialized French-English comparable corpora, empirical experimental results show that our method improves the results obtained by two state-of-the-art approaches.
3 0.12157873 223 acl-2013-Learning a Phrase-based Translation Model from Monolingual Data with Application to Domain Adaptation
Author: Jiajun Zhang ; Chengqing Zong
Abstract: Currently, almost all of the statistical machine translation (SMT) models are trained with the parallel corpora in some specific domains. However, when it comes to a language pair or a different domain without any bilingual resources, the traditional SMT loses its power. Recently, some research works study the unsupervised SMT for inducing a simple word-based translation model from the monolingual corpora. It successfully bypasses the constraint of bitext for SMT and obtains a relatively promising result. In this paper, we take a step forward and propose a simple but effective method to induce a phrase-based model from the monolingual corpora given an automatically-induced translation lexicon or a manually-edited translation dictionary. We apply our method for the domain adaptation task and the extensive experiments show that our proposed method can substantially improve the translation quality.
4 0.11213558 74 acl-2013-Building Comparable Corpora Based on Bilingual LDA Model
Author: Zede Zhu ; Miao Li ; Lei Chen ; Zhenxin Yang
Abstract: Comparable corpora are important basic resources in cross-language information processing. However, the existing methods of building comparable corpora, which use intertranslate words and relative features, cannot evaluate the topical relation between document pairs. This paper adopts the bilingual LDA model to predict the topical structures of the documents and proposes three algorithms of document similarity in different languages. Experiments show that the novel method can obtain similar documents with consistent topics own better adaptability and stability performance.
5 0.10190815 71 acl-2013-Bootstrapping Entity Translation on Weakly Comparable Corpora
Author: Taesung Lee ; Seung-won Hwang
Abstract: This paper studies the problem of mining named entity translations from comparable corpora with some “asymmetry”. Unlike the previous approaches relying on the “symmetry” found in parallel corpora, the proposed method is tolerant to asymmetry often found in comparable corpora, by distinguishing different semantics of relations of entity pairs to selectively propagate seed entity translations on weakly comparable corpora. Our experimental results on English-Chinese corpora show that our selective propagation approach outperforms the previous approaches in named entity translation in terms of the mean reciprocal rank by up to 0.16 for organization names, and 0.14 in a low comparability case.
6 0.10097215 258 acl-2013-Neighbors Help: Bilingual Unsupervised WSD Using Context
7 0.09734641 360 acl-2013-Translating Italian connectives into Italian Sign Language
8 0.092607282 10 acl-2013-A Markov Model of Machine Translation using Non-parametric Bayesian Inference
9 0.087908566 255 acl-2013-Name-aware Machine Translation
10 0.086451814 68 acl-2013-Bilingual Data Cleaning for SMT using Graph-based Random Walk
11 0.082700662 105 acl-2013-DKPro WSD: A Generalized UIMA-based Framework for Word Sense Disambiguation
12 0.082583487 43 acl-2013-Align, Disambiguate and Walk: A Unified Approach for Measuring Semantic Similarity
13 0.081344023 154 acl-2013-Extracting bilingual terminologies from comparable corpora
14 0.080595791 11 acl-2013-A Multi-Domain Translation Model Framework for Statistical Machine Translation
15 0.078771047 361 acl-2013-Travatar: A Forest-to-String Machine Translation Engine based on Tree Transducers
16 0.07664068 388 acl-2013-Word Alignment Modeling with Context Dependent Deep Neural Network
17 0.074470244 307 acl-2013-Scalable Decipherment for Machine Translation via Hash Sampling
18 0.074429102 120 acl-2013-Dirt Cheap Web-Scale Parallel Text from the Common Crawl
19 0.074200168 234 acl-2013-Linking and Extending an Open Multilingual Wordnet
20 0.072324805 111 acl-2013-Density Maximization in Context-Sense Metric Space for All-words WSD
topicId topicWeight
[(0, 0.18), (1, -0.051), (2, 0.109), (3, -0.009), (4, 0.058), (5, -0.073), (6, -0.1), (7, 0.012), (8, 0.051), (9, -0.012), (10, 0.016), (11, 0.023), (12, -0.025), (13, 0.033), (14, 0.063), (15, -0.022), (16, -0.009), (17, -0.047), (18, -0.09), (19, 0.013), (20, -0.005), (21, -0.046), (22, -0.03), (23, -0.007), (24, -0.053), (25, -0.069), (26, 0.003), (27, 0.04), (28, 0.082), (29, 0.03), (30, -0.015), (31, -0.022), (32, 0.074), (33, 0.026), (34, 0.033), (35, 0.003), (36, -0.056), (37, 0.045), (38, 0.031), (39, -0.046), (40, -0.026), (41, -0.006), (42, -0.006), (43, -0.071), (44, -0.031), (45, 0.061), (46, 0.013), (47, 0.029), (48, -0.063), (49, -0.023)]
simIndex simValue paperId paperTitle
same-paper 1 0.94117945 92 acl-2013-Context-Dependent Multilingual Lexical Lookup for Under-Resourced Languages
2 0.77692199 93 acl-2013-Context Vector Disambiguation for Bilingual Lexicon Extraction from Comparable Corpora
Author: Dhouha Bouamor ; Nasredine Semmar ; Pierre Zweigenbaum
Abstract: This paper presents an approach that extends the standard approach used for bilingual lexicon extraction from comparable corpora. We focus on the unresolved problem of polysemous words revealed by the bilingual dictionary and introduce a use of a Word Sense Disambiguation process that aims at improving the adequacy of context vectors. On two specialized French-English comparable corpora, empirical experimental results show that our method improves the results obtained by two state-of-the-art approaches.
3 0.71709836 255 acl-2013-Name-aware Machine Translation
Author: Haibo Li ; Jing Zheng ; Heng Ji ; Qi Li ; Wen Wang
Abstract: We propose a Name-aware Machine Translation (MT) approach which can tightly integrate name processing into MT model, by jointly annotating parallel corpora, extracting name-aware translation grammar and rules, adding name phrase table and name translation driven decoding. Additionally, we also propose a new MT metric to appropriately evaluate the translation quality of informative words, by assigning different weights to different words according to their importance values in a document. Experiments on Chinese-English translation demonstrated the effectiveness of our approach on enhancing the quality of overall translation, name translation and word alignment over a high-quality MT baseline.
4 0.69381517 360 acl-2013-Translating Italian connectives into Italian Sign Language
Author: Camillo Lugaresi ; Barbara Di Eugenio
Abstract: We present a corpus analysis of how Italian connectives are translated into LIS, the Italian Sign Language. Since corpus resources are scarce, we propose an alignment method between the syntactic trees of the Italian sentence and of its LIS translation. This method, and clustering applied to its outputs, highlight the different ways a connective can be rendered in LIS: with a corresponding sign, by affecting the location or shape of other signs, or being omitted altogether. We translate these findings into a computational model that will be integrated into the pipeline of an existing Italian-LIS rendering system. Initial experiments to learn the four possible translations with Decision Trees give promising results.
5 0.68738937 154 acl-2013-Extracting bilingual terminologies from comparable corpora
Author: Ahmet Aker ; Monica Paramita ; Rob Gaizauskas
Abstract: In this paper we present a method for extracting bilingual terminologies from comparable corpora. In our approach we treat bilingual term extraction as a classification problem. For classification we use an SVM binary classifier and training data taken from the EUROVOC thesaurus. We test our approach on a held-out test set from EUROVOC and perform precision, recall and f-measure evaluations for 20 European language pairs. The performance of our classifier reaches the 100% precision level for many language pairs. We also perform manual evaluation on bilingual terms extracted from English-German term-tagged comparable corpora. The results of this manual evaluation showed 60-83% of the term pairs generated are exact translations and over 90% exact or partial translations.
6 0.67251402 64 acl-2013-Automatically Predicting Sentence Translation Difficulty
7 0.66190946 258 acl-2013-Neighbors Help: Bilingual Unsupervised WSD Using Context
8 0.65999836 355 acl-2013-TransDoop: A Map-Reduce based Crowdsourced Translation for Complex Domain
9 0.6582396 71 acl-2013-Bootstrapping Entity Translation on Weakly Comparable Corpora
10 0.65316689 305 acl-2013-SORT: An Interactive Source-Rewriting Tool for Improved Translation
11 0.64547944 316 acl-2013-SenseSpotting: Never let your parallel data tie you to an old domain
12 0.64288902 10 acl-2013-A Markov Model of Machine Translation using Non-parametric Bayesian Inference
13 0.63133067 223 acl-2013-Learning a Phrase-based Translation Model from Monolingual Data with Application to Domain Adaptation
14 0.62304777 307 acl-2013-Scalable Decipherment for Machine Translation via Hash Sampling
15 0.62140262 374 acl-2013-Using Context Vectors in Improving a Machine Translation System with Bridge Language
16 0.6189146 16 acl-2013-A Novel Translation Framework Based on Rhetorical Structure Theory
17 0.61505675 120 acl-2013-Dirt Cheap Web-Scale Parallel Text from the Common Crawl
18 0.60519969 234 acl-2013-Linking and Extending an Open Multilingual Wordnet
19 0.60266113 180 acl-2013-Handling Ambiguities of Bilingual Predicate-Argument Structures for Statistical Machine Translation
20 0.59830213 68 acl-2013-Bilingual Data Cleaning for SMT using Graph-based Random Walk
topicId topicWeight
[(0, 0.043), (6, 0.033), (11, 0.051), (12, 0.335), (15, 0.014), (24, 0.056), (26, 0.039), (35, 0.053), (42, 0.033), (48, 0.036), (56, 0.015), (64, 0.014), (70, 0.028), (88, 0.044), (90, 0.026), (95, 0.095)]
simIndex simValue paperId paperTitle
same-paper 1 0.71967494 92 acl-2013-Context-Dependent Multilingual Lexical Lookup for Under-Resourced Languages
Author: Lian Tze Lim ; Lay-Ki Soon ; Tek Yong Lim ; Enya Kong Tang ; Bali Ranaivo-Malancon
Abstract: Current approaches for word sense disambiguation and translation selection typically require lexical resources or large bilingual corpora with rich information fields and annotations, which are often infeasible for under-resourced languages. We extract translation context knowledge from a bilingual comparable corpus of a richer-resourced language pair, and inject it into a multilingual lexicon. The multilingual lexicon can then be used to perform context-dependent lexical lookup on texts of any language, including under-resourced ones. Evaluations on a prototype lookup tool, trained on an English–Malay bilingual Wikipedia corpus, show a precision score of 0.65 (baseline 0.55) and mean reciprocal rank score of 0.81 (baseline 0.771). Based on the early encouraging results, the context-dependent lexical lookup tool may be developed further into an intelligent reading aid, to help users grasp the gist of a second or foreign language text.
2 0.62532806 34 acl-2013-Accurate Word Segmentation using Transliteration and Language Model Projection
Author: Masato Hagiwara ; Satoshi Sekine
Abstract: Transliterated compound nouns not separated by whitespace pose difficulties for word segmentation (WS). Offline approaches have been proposed to split them using word statistics, but they rely on a static lexicon, limiting their use. We propose an online approach, integrating a source LM and/or back-transliteration and an English LM. The experiments on Japanese and Chinese WS have shown that the proposed models achieve significant improvement over the state of the art, reducing errors by 16% in Japanese.
3 0.62285948 93 acl-2013-Context Vector Disambiguation for Bilingual Lexicon Extraction from Comparable Corpora
Author: Dhouha Bouamor ; Nasredine Semmar ; Pierre Zweigenbaum
Abstract: This paper presents an approach that extends the standard approach used for bilingual lexicon extraction from comparable corpora. We focus on the unresolved problem of polysemous words revealed by the bilingual dictionary and introduce the use of a Word Sense Disambiguation process that aims at improving the adequacy of context vectors. On two specialized French-English comparable corpora, empirical experimental results show that our method improves the results obtained by two state-of-the-art approaches.
4 0.42911154 267 acl-2013-PARMA: A Predicate Argument Aligner
Author: Travis Wolfe ; Benjamin Van Durme ; Mark Dredze ; Nicholas Andrews ; Charley Beller ; Chris Callison-Burch ; Jay DeYoung ; Justin Snyder ; Jonathan Weese ; Tan Xu ; Xuchen Yao
Abstract: We introduce PARMA, a system for cross-document, semantic predicate and argument alignment. Our system combines a number of linguistic resources familiar to researchers in areas such as recognizing textual entailment and question answering, integrating them into a simple discriminative model. PARMA achieves state of the art results on an existing and a new dataset. We suggest that previous efforts have focussed on data that is biased and too easy, and we provide a more difficult dataset based on translation data with a low baseline which we beat by 17% F1.
5 0.42812169 97 acl-2013-Cross-lingual Projections between Languages from Different Families
Author: Mo Yu ; Tiejun Zhao ; Yalong Bai ; Hao Tian ; Dianhai Yu
Abstract: Cross-lingual projection methods can benefit from resource-rich languages to improve the performance of NLP tasks in resource-scarce languages. However, these methods face the difficulty of syntactic differences between languages, especially when the pair of languages varies greatly. To make the projection method generalize well to diverse language pairs, we enhance the projection method based on word alignments by introducing target-language word representations as features and proposing a novel noise-removal method based on these word representations. Experiments show that our methods greatly improve performance on projections between English and Chinese.
6 0.42783755 240 acl-2013-Microblogs as Parallel Corpora
7 0.42740628 207 acl-2013-Joint Inference for Fine-grained Opinion Extraction
8 0.42564631 196 acl-2013-Improving pairwise coreference models through feature space hierarchy learning
9 0.42513183 120 acl-2013-Dirt Cheap Web-Scale Parallel Text from the Common Crawl
10 0.42473596 316 acl-2013-SenseSpotting: Never let your parallel data tie you to an old domain
11 0.42367211 117 acl-2013-Detecting Turnarounds in Sentiment Analysis: Thwarting
12 0.4219377 134 acl-2013-Embedding Semantic Similarity in Tree Kernels for Domain Adaptation of Relation Extraction
13 0.42170271 78 acl-2013-Categorization of Turkish News Documents with Morphological Analysis
14 0.42149767 333 acl-2013-Summarization Through Submodularity and Dispersion
15 0.42110157 111 acl-2013-Density Maximization in Context-Sense Metric Space for All-words WSD
16 0.42099899 187 acl-2013-Identifying Opinion Subgroups in Arabic Online Discussions
17 0.42099148 174 acl-2013-Graph Propagation for Paraphrasing Out-of-Vocabulary Words in Statistical Machine Translation
18 0.42095333 154 acl-2013-Extracting bilingual terminologies from comparable corpora
19 0.4206003 326 acl-2013-Social Text Normalization using Contextual Graph Random Walks
20 0.42020896 130 acl-2013-Domain-Specific Coreference Resolution with Lexicalized Features