emnlp emnlp2013 emnlp2013-42 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Dhouha Bouamor ; Adrian Popescu ; Nasredine Semmar ; Pierre Zweigenbaum
Abstract: Bilingual lexicons are central components of machine translation and cross-lingual information retrieval systems. Their manual construction requires strong expertise in both languages involved and is a costly process. Several automatic methods were proposed as an alternative but they often rely on resources available in a limited number of languages and their performances are still far behind the quality of manual translations. We introduce a novel approach to the creation of domain-specific bilingual lexicons that relies on Wikipedia. This massively multilingual encyclopedia makes it possible to create lexicons for a large number of language pairs. Wikipedia is used to extract domains in each language, to link domains between languages and to create generic translation dictionaries. The approach is tested on four specialized domains and is compared to three state-of-the-art approaches using two language pairs: French-English and Romanian-English. The newly introduced method compares favorably to existing methods in all configurations tested.
Reference: text
sentIndex sentText sentNum sentScore
1 Several automatic methods were proposed as an alternative but they often rely on resources available in a limited number of languages and their performances are still far behind the quality of manual translations. [sent-6, score-0.195]
2 We introduce a novel approach to the creation of domain-specific bilingual lexicons that relies on Wikipedia. [sent-7, score-0.447]
3 Wikipedia is used to extract domains in each language, to link domains between languages and to create generic translation dictionaries. [sent-9, score-0.526]
4 The approach is tested on four specialized domains and is compared to three state-of-the-art approaches using two language pairs: French-English and Romanian-English. [sent-10, score-0.385]
5 In this paper we focus on the automatic creation of domain-specific bilingual lexicons. [sent-17, score-0.242]
6 The scarcity of such corpora, in particular for specialized domains and for language pairs not involving English, pushed researchers to investigate the use of comparable corpora (Fung, 1998; Chiao and Zweigenbaum, 2003). [sent-20, score-0.518]
7 The basic intuition that underlies bilingual lexicon creation is the distributional hypothesis (Harris, 1954), which posits that words with similar meanings occur in similar contexts. [sent-22, score-0.336]
8 In a multilingual formulation, this hypothesis states that the translations of a word are likely to appear in similar lexical environments across languages (Rapp, 1995). [sent-23, score-0.256]
9 The standard approach to bilingual lexicon extraction builds context vectors for words in the source and target languages and compares them. [sent-24, score-0.324]
10 In this approach, the comparison of context vectors is conditioned by the existence of a seed bilingual dictionary. [sent-31, score-0.37]
11 Another important problem occurs whenever the seed dictionary is small, since many context words are then ignored. [sent-33, score-0.228]
12 We introduce a bilingual lexicon extraction approach that exploits Wikipedia in an innovative manner in order to tackle some of the problems mentioned above. [sent-35, score-0.324]
13 Important advantages of using Wikipedia are: • The resource is available in hundreds of languages and its content is structured as unambiguous concepts (i.e., Wikipedia articles). [sent-36, score-0.214]
14 It covers a large number of domains and is thus potentially useful in order to mine a wide array of specialized lexicons. [sent-40, score-0.331]
15 • The translation graph is partial since, when considering any language pair, only a part of the concepts are available in both languages and explicitly connected. [sent-42, score-0.325]
16 ESA was already successfully tested in different NLP tasks, such as word relatedness estimation or text classification, and we modify it to mine specialized domains, to characterize these domains and to link them across languages. [sent-46, score-0.399]
17 The evaluation of the newly introduced approach is carried out on four diverse specialized domains (Breast Cancer, Corporate Finance, Wind Energy and Mobile Technology) and for two pairs of languages: French-English and Romanian-English. [sent-47, score-0.371]
18 1 Standard Approach (SA) Most previous approaches that address bilingual lexicon extraction from comparable corpora are based on the standard approach (Fung, 1998; Chiao and Zweigenbaum, 2002; Laroche and Langlais, 2010). [sent-53, score-0.469]
19 Translation of context vectors: To enable the comparison of source and target vectors, source vectors are translated into the target language by using a seed bilingual dictionary. [sent-58, score-0.597]
20 Whenever several translations of a context word exist, all translation variants are taken into account. [sent-59, score-0.265]
21 Comparison of source and target vectors: Given Wcand, its automatically translated context vector is compared to the context vectors of all possible translations from the target language. [sent-62, score-0.442]
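To make the three steps of the standard approach concrete, here is a minimal Python sketch. It is not the authors' implementation: the window size, the use of raw co-occurrence counts rather than an association measure such as log-likelihood, and the cosine similarity are simplifying assumptions, and all function and variable names are hypothetical.

from collections import Counter
from math import sqrt

def context_vector(word, corpus_sentences, window=3):
    """Co-occurrence counts of `word` within a +/- `window` token window."""
    vec = Counter()
    for sent in corpus_sentences:
        for i, tok in enumerate(sent):
            if tok == word:
                for ctx in sent[max(0, i - window):i] + sent[i + 1:i + 1 + window]:
                    vec[ctx] += 1
    return vec

def translate_vector(vec, seed_dict):
    """Project a source-language vector into the target language.
    When several translations of a context word exist, all variants are kept."""
    translated = Counter()
    for ctx_word, weight in vec.items():
        for tgt in seed_dict.get(ctx_word, []):   # untranslatable context words are dropped
            translated[tgt] += weight
    return translated

def cosine(u, v):
    num = sum(u[k] * v[k] for k in u if k in v)
    den = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return num / den if den else 0.0

def rank_translations(w_cand, src_corpus, tgt_corpus, tgt_candidates, seed_dict):
    """Compare the translated source vector of w_cand with all target candidate vectors."""
    src_vec = translate_vector(context_vector(w_cand, src_corpus), seed_dict)
    scored = [(t, cosine(src_vec, context_vector(t, tgt_corpus))) for t in tgt_candidates]
    return sorted(scored, key=lambda x: x[1], reverse=True)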
22 2 Improvements of the Standard Approach Most of the improvements of the standard approach are based on the observation that the more representative the context vectors of a candidate word are, the better the bilingual lexicon extraction is. [sent-66, score-0.458]
23 At first, additional linguistic resources, such as specialized dictionaries (Chiao and Zweigenbaum, 2002) or transliterated words (Prochasson et al. [sent-67, score-0.261]
24 , 2009), were combined with the seed dictionary to translate context vectors. [sent-68, score-0.228]
25 The ambiguities that appear in the seed bilingual dictionary were taken into account more recently. [sent-69, score-0.375]
26 Given a context vector in the source language, the most probable translation of polysemous words is identified and used for building the corresponding vector in the target language. [sent-73, score-0.235]
27 On specialized French-English comparable corpora, this approach outperforms the one proposed in (Morin and Prochasson, 2011), which is itself better than the standard approach. [sent-75, score-0.286]
28 Also, the method is highly dependent on the coverage of the seed bilingual dictionary. [sent-78, score-0.277]
29 Some open ESA topics related to bilingual lexicon creation include: (1) the document representation, which is simply done by summing individual contributions of words, (2) the adaptation of the method to specific domains and (3) the coverage of the underlying resource in different languages. [sent-92, score-0.493]
30 As we mentioned, ESA [Figure 1: Overview of the Explicit Semantic Analysis enabled bilingual lexicon extraction.] [sent-95, score-0.282]
31 (Gabrilovich and Markovitch, 2007) was exploited in a number of NLP tasks but not in bilingual lexicon extraction. [sent-96, score-0.318]
32 Given a word to be translated and its context vector in the source language, we derive a ranked list of similar Wikipedia concepts (i. [sent-99, score-0.27]
33 Then, a translation graph is used to retrieve the corresponding concepts in the target language. [sent-103, score-0.298]
34 Candidate translations are found through a statistical processing of concept descriptions from the ESA direct index in the target language. [sent-105, score-0.244]
35 Then, we detail the three steps that compose the main bilingual lexicon extraction method illustrated in Figure 1. [sent-107, score-0.324]
36 Finally, as a complement to the main method we introduce a measure for domain word specificity and present a method for extracting generic translation lexicons. [sent-108, score-0.496]
37 1 ESA Word and Concept Representation Given a semantic space structured using a set of M concepts and including a dictionary of N words, a mapping between words and concepts can be expressed as the following matrix:
    w(W1, C1)  w(W1, C2)  ...  w(W1, CM)
    w(W2, C1)  w(W2, C2)  ...  w(W2, CM)
    ...
    w(WN, C1)  w(WN, C2)  ...  w(WN, CM)
[sent-110, score-0.386]
38 When Wikipedia is exploited, concepts are equated to Wikipedia articles and the texts of the articles are processed in order to obtain the weights that link words and concepts. [sent-125, score-0.308]
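As an illustration, here is a minimal sketch of how such a word-concept matrix can be derived from Wikipedia article texts. TF-IDF weighting follows the usual ESA formulation, but the tokenization, the absence of any pruning, and the sparse dictionary-of-dictionaries layout are assumptions made for readability; all names are hypothetical.

from collections import defaultdict
from math import log

def build_esa_index(articles):
    """articles: dict mapping a concept (article title) to its tokenized text.
    Returns w[word][concept] = TF-IDF weight, i.e. one row of the matrix per word."""
    df = defaultdict(int)                 # document frequency of each word
    tf = {}                               # term frequencies per concept
    for concept, tokens in articles.items():
        counts = defaultdict(int)
        for tok in tokens:
            counts[tok] += 1
        tf[concept] = counts
        for tok in counts:
            df[tok] += 1
    n_concepts = len(articles)
    w = defaultdict(dict)
    for concept, counts in tf.items():
        for tok, c in counts.items():
            w[tok][concept] = c * log(n_concepts / df[tok])
    return w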
39 2 Source Language Processing The objective of the source language processing is to obtain a ranked list of similar Wikipedia concepts for each candidate word (Wcand) in a specialized domain. [sent-139, score-0.433]
40 To do this, a context vector is first built for each Wcand from a specialized monolingual corpus. [sent-140, score-0.249]
41 Wikipedia concepts in the source language Cs that are similar to Wcand and to a part of its context words are extracted and ranked using equation 1. [sent-142, score-0.269]
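Equation 1 is not reproduced in this extract; the sketch below only conveys the general idea of scoring source-language concepts by summing the ESA weights of the candidate word and of its context words, which is a simplification of the actual ranking function. Names are hypothetical.

from collections import defaultdict

def rank_source_concepts(w_cand, context_words, esa_index, top_k=10):
    """Score source-language concepts by the summed ESA weights of w_cand and of
    its context words; a simplified stand-in for equation 1."""
    scores = defaultdict(float)
    for word in [w_cand] + list(context_words):
        for concept, weight in esa_index.get(word, {}).items():
            scores[concept] += weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_k]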
42 In Table 1, we present the five most similar Wikipedia concepts to the French terms action, déficit, cisaillement, turbine, cryptage, biopsie and palpation and their context vectors. [sent-145, score-0.291]
43 These terms are part of the four specialized domains we are studying here. [sent-146, score-0.331]
44 From observing these examples, we note that despite the differences between the specialized domains and the ambiguity of some words (e.g., action and protocole), our method has the advantage of successfully representing each word to be translated by a relevant conceptual space. [sent-147, score-0.474]
45 3 Translation Graph Construction To bridge the gap between the source and target languages, a concept translation graph that enables the multilingual extension of ESA is used. [sent-149, score-0.323]
46 This concept translation graph is extracted from the explicit translation links available in Wikipedia articles and is exploited in order to connect a word’s conceptual space in the source language with the corresponding conceptual space in the target language. [sent-150, score-0.585]
47 Only a part of the articles have translations and the size of the conceptual space in the target language is usually smaller than the space in the source language. [sent-151, score-0.317]
48 For instance, the French-English translation graph contains 940,215 pairs of concepts while the French and English Wikipedias contain approximately 1. [sent-152, score-0.247]
49 These candidate translations Wt are ranked using equation 2. [sent-158, score-0.206]
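A sketch of how the translation graph and the ESA direct index could be applied at this point, assuming the graph has been flattened into a source-to-target title dictionary. Equation 2 is not reproduced here either; aggregating concept-weighted word scores is only an approximation of the statistical processing described above, and all names are hypothetical.

def translate_concepts(src_concepts, langlinks):
    """src_concepts: list of (concept, score) pairs in the source language.
    langlinks: dict mapping a source article title to its target-language title.
    The graph is partial, so concepts without an explicit link are dropped."""
    return [(langlinks[c], score) for c, score in src_concepts if c in langlinks]

def candidate_translations(tgt_concepts, esa_direct_index, top_k=20):
    """esa_direct_index[concept][word] = weight of `word` in the description of `concept`
    (the ESA direct index in the target language). Aggregating these weights over the
    target conceptual space is a simplified stand-in for equation 2."""
    scores = {}
    for concept, concept_score in tgt_concepts:
        for word, weight in esa_direct_index.get(concept, {}).items():
            scores[word] = scores.get(word, 0.0) + concept_score * weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_k]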
50 5 Domain Specificity In previous works, ESA was usually exploited in generic tasks that did not require any domain adaptation. [sent-164, score-0.246]
51 Here we process information from specific domains and we need to measure the specificity of words in those domains. [sent-165, score-0.306]
52 The domain extraction is seeded by using a Wikipedia concept (noted Cseed) that best describes the domain in the target language. [sent-166, score-0.459]
53 We extract a set of 10 words with the highest TF-IDF score from this article (noted SW) and use them to retrieve a domain ranking of concepts in the target language Rankdom(Ct) with equation 3. [sent-170, score-0.35]
54 10 items), w(Wti, Ct) is the weight of the domain word Wti in the concept Ct; w(Cseed, Wti) is the weight of Wti in Cseed, the seed concept of the domain; and count(SW, Ct) is the number of distinct seed words from SW that appear in Ct. [sent-173, score-0.433]
55 The first part of equation 3 sums up the contributions of different words from SW that appear in Ct while the second part is meant to further reinforce articles that contain a larger number of domain keywords from SW. [sent-174, score-0.219]
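The exact form of equation 3 is not given in this extract. The sketch below mirrors its verbal description only: a sum of the contributions of the SW words present in Ct, reinforced by the number of distinct SW words found there; the multiplicative combination of the two parts and all names are assumptions.

def rank_domain_concepts(sw, seed_weights, esa_direct_index):
    """sw: the ~10 highest TF-IDF words extracted from the seed article (SW).
    seed_weights[w] = w(Cseed, w), the weight of w in the seed concept.
    esa_direct_index[Ct][w] = w(w, Ct), the weight of w in concept Ct."""
    ranking = {}
    for ct, word_weights in esa_direct_index.items():
        present = [w for w in sw if w in word_weights]
        if not present:
            continue
        contribution = sum(word_weights[w] * seed_weights.get(w, 0.0) for w in present)
        ranking[ct] = contribution * len(present)  # reinforce concepts containing more SW words
    return sorted(ranking.items(), key=lambda item: item[1], reverse=True)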
56 Given the delimitation obtained with equation 3, we calculate a domain specificity score (specifdom(Wt)) for each word that occurs in the domain (equation 4). [sent-177, score-0.595]
57 specifdom(Wt) = DFdom(Wt) / DFgen(Wt) (4), where DFdom and DFgen stand for the domain and the generic document frequency of the word Wt. [sent-179, score-0.21]
58 specifdom(Wt) will be used to favor words with greater domain specificity over more general ones when several translations are available in a seed generic translation lexicon. [sent-180, score-0.706]
59 In a general case, the most frequent translation is action whereas in a corporate finance context, share or stock are more relevant. [sent-182, score-0.496]
60 The specificity of the three translations, from highest to lowest, is: share, stock and action and is used to rank these potential translations. [sent-183, score-0.277]
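Equation 4 and the action/share/stock example translate directly into a few lines. Document frequencies are assumed to be pre-computed over the domain-delimited concepts and over all of Wikipedia; flooring the generic frequency at 1 to avoid division by zero is an added assumption, and names are hypothetical.

def specificity(word, df_dom, df_gen):
    """specifdom(Wt) = DFdom(Wt) / DFgen(Wt), equation 4."""
    return df_dom.get(word, 0) / max(df_gen.get(word, 1), 1)

def rank_by_specificity(translations, df_dom, df_gen):
    """Order dictionary translations by domain specificity, e.g. share and stock
    before action in the corporate finance domain."""
    return sorted(translations, key=lambda t: specificity(t, df_dom, df_gen), reverse=True)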
61 6 Generic Dictionaries Generic translation dictionaries, already used by existing bilingual lexicon extraction approaches, can also be integrated in the newly proposed approach. [sent-185, score-0.467]
62 The Wikipedia translation graph is transformed into a translation dictionary by removing the disambiguation marks from ambiguous concept titles, as well as lists, categories and other administration pages. [sent-186, score-0.411]
63 The obtained dictionaries are incomplete since: (1) Wikipedia focuses on concepts that are most often nouns, (2) specialized domain terms often do not have an associated Wikipedia entry and (3) the translation graph covers only a fraction of the concepts available in a language. [sent-189, score-0.794]
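A sketch of the dictionary extraction described above. The regular expression for disambiguation marks and the title prefixes used to filter out lists, categories and administration pages are assumptions about (English) Wikipedia title conventions, not the authors' exact filtering rules.

import re
from collections import defaultdict

NON_CONTENT_PREFIXES = ("List of", "Category:", "Wikipedia:", "Template:", "Portal:")

def clean_title(title):
    """Strip a trailing disambiguation mark, e.g. 'Action (finance)' -> 'action'."""
    return re.sub(r"\s*\([^)]*\)\s*$", "", title).strip().lower()

def graph_to_dictionary(translation_graph):
    """translation_graph: iterable of (source_title, target_title) pairs."""
    dictionary = defaultdict(set)
    for src, tgt in translation_graph:
        if src.startswith(NON_CONTENT_PREFIXES) or tgt.startswith(NON_CONTENT_PREFIXES):
            continue  # drop lists, categories and other administration pages
        dictionary[clean_title(src)].add(clean_title(tgt))
    return dictionary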
64 1 Data and Resources Comparable corpora We conducted our experiments on four French-English and Romanian-English specialized comparable corpora: Corporate Finance, Breast Cancer, Wind Energy and Mobile Technology. [sent-198, score-0.353]
65 For the Romanian-English language pair, we used Wikipedia to collect comparable corpora for all domains since they were not already available. [sent-199, score-0.268]
66 The size of the domain corpora varies within and across languages, with the corporate finance domain being the richest in both languages. [sent-210, score-0.588]
67 Bilingual dictionary The seed generic French-English dictionary used to translate French context vectors consists of an in-house manually built resource which contains approximately 120,000 entries. [sent-213, score-0.511]
68 For Romanian-English, we used the generic dictionary extracted following the procedure described in Subsection 3.6. [sent-218, score-0.197]
69 Gold standard In bilingual terminology extraction from comparable corpora, a reference list is required to evaluate the performance of the alignment. [sent-220, score-0.308]
70 For French-English, reference words from the Corporate Finance domain were extracted from the glossary of bilingual micro-finance terms. [sent-223, score-0.299]
71 Concerning Wind Energy and Mobile Technology, lists were extracted from specialized glossaries found on the Web. [sent-225, score-0.208]
72 3 Results and discussion In addition to the basic approach based on ESA (denoted ESA), we evaluate the performance of a method called DicoSpec, in which translations are extracted from a generic dictionary, and a method called ESAspec, which combines ESA and DicoSpec. [sent-244, score-0.394]
73 DicoSpec is based on the generic dictionary we presented in Subsection 3.6. [sent-245, score-0.244]
74 It proceeds as follows: we extract a list of translations for each word to be translated from the generic dictionary. [sent-246, score-0.265]
75 For instance, the French term port, referring in the Mobile Technology domain to the system that allows computers to receive and transmit information, is translated into port and seaport. [sent-249, score-0.201]
76 According to domain specificity values, the following ranking is obtained: the English term port obtains the highest specificity value (0. [sent-250, score-0.53]
77 In ESAspec, the translations appearing in the lists proposed by both ESA and the generic dictionary are weighted according to their domain specificity values. [sent-254, score-0.733]
78 The main intuition behind this method is that by adding the information about the domain specificity, we obtain a new ranking of the bilingual extraction results. [sent-255, score-0.341]
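One plausible reading of this combination as code; the paper's exact weighting scheme is not reproduced in this extract, so merging the two candidate lists and multiplying each candidate's score by its specificity are assumptions, and all names are hypothetical.

def esaspec_rank(esa_candidates, dico_candidates, df_dom, df_gen):
    """esa_candidates: {word: ESA score}; dico_candidates: translations from the
    generic dictionary. Every candidate is re-weighted by its domain specificity."""
    pool = dict(esa_candidates)
    for word in dico_candidates:
        pool.setdefault(word, 1.0)  # neutral base score for dictionary-only candidates (assumption)
    def spec(word):
        return df_dom.get(word, 0) / max(df_gen.get(word, 1), 1)
    reweighted = {w: score * spec(w) for w, score in pool.items()}
    return sorted(reweighted.items(), key=lambda item: item[1], reverse=True)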
79 The comparison of state-of-the-art methods shows that BA13 performs better than STAPP and MP11 for French-English and has comparable performances [Table 4: Results of the specialized dictionary creation on four specific domains and two pairs of languages.] [sent-257, score-0.568]
80 DicoSpec exploits a generic dictionary, combined with the use of domain specificity (see Subsection 3.5). [sent-260, score-0.393]
81 The results presented in Table 4 show that ESAspec clearly outperforms the three baselines for the four domains and the two pairs of languages tested. [sent-266, score-0.201]
82 Also, except for the Corporate Finance domain in Romanian, the performance variation across domains is much smaller for ESAspec than for the three state of the art methods. [sent-273, score-0.288]
83 The main contribution to ESAspec performances comes from ESA, a finding that validates our assumption that the adequate use of a rich multilingual resource such as Wikipedia is appropriate for specialized lexicon translation. [sent-276, score-0.516]
84 DicoSpec is a simple method that ranks the different meanings of a candidate word available in a generic dictionary, but its average performances are comparable to those of BA13 for FR-EN and higher for RO-EN. [sent-277, score-0.392]
85 This finding advocates for the importance of good quality generic dictionaries in specialized lexicon translation approaches. [sent-278, score-0.557]
86 5 Conclusion We have presented a new approach to the creation of specialized bilingual lexicons, one of the central building blocks of machine translation systems. [sent-285, score-0.553]
87 The scarcity of resources is addressed by an adequate exploitation of Wikipedia, a resource that is available in hundreds of languages. [sent-287, score-0.2]
88 The quality of automatic translations was improved by appropriate domain delimitation and linking across languages, as well as by an adequate statistical processing of concepts similar to a word in a given context. [sent-288, score-0.494]
89 The main advantages of our approach compared to state-of-the-art methods come from the increased number of languages that can be processed, the smaller sensitivity to structured resources and the appropriate domain delimitation. [sent-289, score-0.284]
90 Experimental validation is obtained through evaluation with four different domains and two pairs of languages, which shows consistent performance improvements. [sent-290, score-0.232]
91 For French-English, two languages that have rich associated Wikipedia representations, performances are very interesting and are starting to approach those of manual translations for three domains out of four (FMeasure@20 around 0. [sent-291, score-0.398]
92 First, we will pursue the integration of our method, notably through comparable corpora creation using the data-driven domain delimitation technique described in Subsection 3. [sent-297, score-0.381]
93 Equally important, the size of the domain can be adapted so as to find enough context for all the words in domain reference lists. [sent-299, score-0.263]
94 Second, given a word in a context, we currently exploit all similar concepts from the target language. [sent-300, score-0.195]
95 Given that comparability of article versions in the source and the target language varies, we will evaluate algorithms for filtering out concepts from the target language that have low alignment with their source language versions. [sent-301, score-0.357]
96 A statistical view on bilingual lexicon extraction: From parallel corpora to non-parallel corpora. [sent-321, score-0.349]
97 Adaptive dictionary for bilingual lexicon extraction from comparable corpora. [sent-345, score-0.5]
98 Bilingual lexicon extraction from comparable corpora using in-domain terms. [sent-349, score-0.281]
99 Bilingual lexicon extraction from comparable corpora enhanced with parallel corpora. [sent-362, score-0.281]
100 Anchor points for bilingual lexicon extraction from small comparable corpora. [sent-372, score-0.402]
wordName wordTfidf (topN-words)
[('esa', 0.472), ('wcand', 0.248), ('specialized', 0.208), ('bilingual', 0.188), ('specificity', 0.183), ('wikipedia', 0.166), ('esaspec', 0.16), ('corporate', 0.155), ('morin', 0.154), ('concepts', 0.144), ('finance', 0.144), ('domains', 0.123), ('translations', 0.121), ('prochasson', 0.113), ('domain', 0.111), ('chiao', 0.106), ('dicospec', 0.106), ('translation', 0.103), ('gabrilovich', 0.099), ('generic', 0.099), ('dictionary', 0.098), ('wt', 0.095), ('lexicon', 0.094), ('breast', 0.093), ('cancer', 0.092), ('seed', 0.089), ('bouamor', 0.089), ('wind', 0.089), ('cti', 0.087), ('markovitch', 0.079), ('comparable', 0.078), ('languages', 0.078), ('performances', 0.076), ('concept', 0.072), ('cseed', 0.071), ('delimitation', 0.071), ('rankdom', 0.071), ('specifdom', 0.071), ('zweigenbaum', 0.071), ('relatedness', 0.068), ('emmanuel', 0.067), ('corpora', 0.067), ('wsi', 0.066), ('ct', 0.065), ('articles', 0.064), ('laroche', 0.062), ('wti', 0.062), ('multilingual', 0.057), ('action', 0.057), ('creation', 0.054), ('art', 0.054), ('biopsie', 0.053), ('palpation', 0.053), ('port', 0.053), ('ssa', 0.053), ('turbine', 0.053), ('dictionaries', 0.053), ('vectors', 0.052), ('target', 0.051), ('french', 0.05), ('subsection', 0.047), ('adequate', 0.047), ('romanian', 0.046), ('translated', 0.045), ('equation', 0.044), ('pierre', 0.043), ('sw', 0.042), ('langlais', 0.042), ('scarcity', 0.042), ('extraction', 0.042), ('conceptual', 0.041), ('context', 0.041), ('resources', 0.041), ('candidate', 0.041), ('newly', 0.04), ('source', 0.04), ('cm', 0.037), ('energy', 0.037), ('stock', 0.037), ('hundreds', 0.036), ('exploited', 0.036), ('administration', 0.035), ('cedex', 0.035), ('cryptage', 0.035), ('dhouha', 0.035), ('eficit', 0.035), ('frenchenglish', 0.035), ('halavais', 0.035), ('hazem', 0.035), ('nasredine', 0.035), ('oddswwscaind', 0.035), ('radinsky', 0.035), ('stapp', 0.035), ('tweaks', 0.035), ('explicit', 0.034), ('resource', 0.034), ('mobile', 0.032), ('obtained', 0.031), ('cisaillement', 0.031), ('comparability', 0.031)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 42 emnlp-2013-Building Specialized Bilingual Lexicons Using Large Scale Background Knowledge
Author: Dhouha Bouamor ; Adrian Popescu ; Nasredine Semmar ; Pierre Zweigenbaum
Abstract: Bilingual lexicons are central components of machine translation and cross-lingual information retrieval systems. Their manual construction requires strong expertise in both languages involved and is a costly process. Several automatic methods were proposed as an alternative but they often rely on resources available in a limited number of languages and their performances are still far behind the quality of manual translations. We introduce a novel approach to the creation of domain-specific bilingual lexicons that relies on Wikipedia. This massively multilingual encyclopedia makes it possible to create lexicons for a large number of language pairs. Wikipedia is used to extract domains in each language, to link domains between languages and to create generic translation dictionaries. The approach is tested on four specialized domains and is compared to three state-of-the-art approaches using two language pairs: French-English and Romanian-English. The newly introduced method compares favorably to existing methods in all configurations tested.
2 0.18386456 13 emnlp-2013-A Study on Bootstrapping Bilingual Vector Spaces from Non-Parallel Data (and Nothing Else)
Author: Ivan Vulic ; Marie-Francine Moens
Abstract: We present a new language pair agnostic approach to inducing bilingual vector spaces from non-parallel data without any other resource in a bootstrapping fashion. The paper systematically introduces and describes all key elements of the bootstrapping procedure: (1) starting point or seed lexicon, (2) the confidence estimation and selection of new dimensions of the space, and (3) convergence. We test the quality of the induced bilingual vector spaces, and analyze the influence of the different components of the bootstrapping approach in the task of bilingual lexicon extraction (BLE) for two language pairs. Results reveal that, contrary to conclusions from prior work, the seeding of the bootstrapping process has a heavy impact on the quality of the learned lexicons. We also show that our approach outperforms the best performing fully corpus-based BLE methods on these test sets.
3 0.17455536 148 emnlp-2013-Orthonormal Explicit Topic Analysis for Cross-Lingual Document Matching
Author: John Philip McCrae ; Philipp Cimiano ; Roman Klinger
Abstract: Cross-lingual topic modelling has applications in machine translation, word sense disambiguation and terminology alignment. Multilingual extensions of approaches based on latent (LSI), generative (LDA, PLSI) as well as explicit (ESA) topic modelling can induce an interlingual topic space allowing documents in different languages to be mapped into the same space and thus to be compared across languages. In this paper, we present a novel approach that combines latent and explicit topic modelling approaches in the sense that it builds on a set of explicitly defined topics, but then computes latent relations between these. Thus, the method combines the benefits of both explicit and latent topic modelling approaches. We show that on a crosslingual mate retrieval task, our model significantly outperforms LDA, LSI, and ESA, as well as a baseline that translates every word in a document into the target language.
4 0.15100378 135 emnlp-2013-Monolingual Marginal Matching for Translation Model Adaptation
Author: Ann Irvine ; Chris Quirk ; Hal Daume III
Abstract: When using a machine translation (MT) model trained on OLD-domain parallel data to translate NEW-domain text, one major challenge is the large number of out-of-vocabulary (OOV) and new-translation-sense words. We present a method to identify new translations of both known and unknown source language words that uses NEW-domain comparable document pairs. Starting with a joint distribution of source-target word pairs derived from the OLD-domain parallel corpus, our method recovers a new joint distribution that matches the marginal distributions of the NEW-domain comparable document pairs, while minimizing the divergence from the OLD-domain distribution. Adding learned translations to our French-English MT model results in gains of about 2 BLEU points over strong baselines.
5 0.13236761 169 emnlp-2013-Semi-Supervised Representation Learning for Cross-Lingual Text Classification
Author: Min Xiao ; Yuhong Guo
Abstract: Cross-lingual adaptation aims to learn a prediction model in a label-scarce target language by exploiting labeled data from a labelrich source language. An effective crosslingual adaptation system can substantially reduce the manual annotation effort required in many natural language processing tasks. In this paper, we propose a new cross-lingual adaptation approach for document classification based on learning cross-lingual discriminative distributed representations of words. Specifically, we propose to maximize the loglikelihood of the documents from both language domains under a cross-lingual logbilinear document model, while minimizing the prediction log-losses of labeled documents. We conduct extensive experiments on cross-lingual sentiment classification tasks of Amazon product reviews. Our experimental results demonstrate the efficacy of the pro- posed cross-lingual adaptation approach.
6 0.12145104 136 emnlp-2013-Multi-Domain Adaptation for SMT Using Multi-Task Learning
7 0.10734521 38 emnlp-2013-Bilingual Word Embeddings for Phrase-Based Machine Translation
8 0.084059574 120 emnlp-2013-Learning Latent Word Representations for Domain Adaptation using Supervised Word Clustering
9 0.080629528 109 emnlp-2013-Is Twitter A Better Corpus for Measuring Sentiment Similarity?
10 0.078970432 96 emnlp-2013-Identifying Phrasal Verbs Using Many Bilingual Corpora
11 0.076128274 15 emnlp-2013-A Systematic Exploration of Diversity in Machine Translation
12 0.075199179 104 emnlp-2013-Improving Statistical Machine Translation with Word Class Models
13 0.074206814 204 emnlp-2013-Word Level Language Identification in Online Multilingual Communication
14 0.072958209 127 emnlp-2013-Max-Margin Synchronous Grammar Induction for Machine Translation
15 0.072903536 198 emnlp-2013-Using Soft Constraints in Joint Inference for Clinical Concept Recognition
16 0.072045527 57 emnlp-2013-Dependency-Based Decipherment for Resource-Limited Machine Translation
17 0.068526104 160 emnlp-2013-Relational Inference for Wikification
18 0.067387514 29 emnlp-2013-Automatic Domain Partitioning for Multi-Domain Learning
19 0.065908849 69 emnlp-2013-Efficient Collective Entity Linking with Stacking
20 0.065604754 92 emnlp-2013-Growing Multi-Domain Glossaries from a Few Seeds using Probabilistic Topic Models
topicId topicWeight
[(0, -0.228), (1, -0.078), (2, -0.029), (3, -0.021), (4, 0.099), (5, 0.004), (6, 0.023), (7, 0.039), (8, -0.003), (9, -0.266), (10, 0.063), (11, -0.066), (12, 0.092), (13, -0.013), (14, 0.024), (15, -0.035), (16, -0.107), (17, 0.032), (18, -0.031), (19, 0.078), (20, -0.047), (21, -0.023), (22, 0.0), (23, 0.02), (24, 0.048), (25, -0.245), (26, 0.113), (27, 0.109), (28, 0.042), (29, 0.05), (30, -0.189), (31, -0.044), (32, -0.075), (33, -0.009), (34, -0.049), (35, 0.052), (36, -0.009), (37, 0.037), (38, -0.038), (39, 0.013), (40, 0.095), (41, -0.002), (42, -0.101), (43, -0.048), (44, -0.127), (45, -0.134), (46, 0.038), (47, -0.055), (48, -0.134), (49, 0.052)]
simIndex simValue paperId paperTitle
same-paper 1 0.93860328 42 emnlp-2013-Building Specialized Bilingual Lexicons Using Large Scale Background Knowledge
Author: Dhouha Bouamor ; Adrian Popescu ; Nasredine Semmar ; Pierre Zweigenbaum
Abstract: Bilingual lexicons are central components of machine translation and cross-lingual information retrieval systems. Their manual construction requires strong expertise in both languages involved and is a costly process. Several automatic methods were proposed as an alternative but they often rely on resources available in a limited number of languages and their performances are still far behind the quality of manual translations. We introduce a novel approach to the creation of specific domain bilingual lexicon that relies on Wikipedia. This massively multilingual encyclopedia makes it possible to create lexicons for a large number of language pairs. Wikipedia is used to extract domains in each language, to link domains between languages and to create generic translation dictionaries. The approach is tested on four specialized domains and is compared to three state of the art approaches using two language pairs: FrenchEnglish and Romanian-English. The newly introduced method compares favorably to existing methods in all configurations tested.
2 0.82734448 13 emnlp-2013-A Study on Bootstrapping Bilingual Vector Spaces from Non-Parallel Data (and Nothing Else)
Author: Ivan Vulic ; Marie-Francine Moens
Abstract: We present a new language pair agnostic approach to inducing bilingual vector spaces from non-parallel data without any other resource in a bootstrapping fashion. The paper systematically introduces and describes all key elements of the bootstrapping procedure: (1) starting point or seed lexicon, (2) the confidence estimation and selection of new dimensions of the space, and (3) convergence. We test the quality of the induced bilingual vector spaces, and analyze the influence of the different components of the bootstrapping approach in the task of bilingual lexicon extraction (BLE) for two language pairs. Results reveal that, contrary to conclusions from prior work, the seeding of the bootstrapping process has a heavy impact on the quality of the learned lexicons. We also show that our approach outperforms the best performing fully corpus-based BLE methods on these test sets.
3 0.62746924 135 emnlp-2013-Monolingual Marginal Matching for Translation Model Adaptation
Author: Ann Irvine ; Chris Quirk ; Hal Daume III
Abstract: When using a machine translation (MT) model trained on OLD-domain parallel data to translate NEW-domain text, one major challenge is the large number of out-of-vocabulary (OOV) and new-translation-sense words. We present a method to identify new translations of both known and unknown source language words that uses NEW-domain comparable document pairs. Starting with a joint distribution of source-target word pairs derived from the OLD-domain parallel corpus, our method recovers a new joint distribution that matches the marginal distributions of the NEW-domain comparable document pairs, while minimizing the divergence from the OLD-domain distribution. Adding learned translations to our French-English MT model results in gains of about 2 BLEU points over strong baselines.
4 0.5371027 148 emnlp-2013-Orthonormal Explicit Topic Analysis for Cross-Lingual Document Matching
Author: John Philip McCrae ; Philipp Cimiano ; Roman Klinger
Abstract: Cross-lingual topic modelling has applications in machine translation, word sense disambiguation and terminology alignment. Multilingual extensions of approaches based on latent (LSI), generative (LDA, PLSI) as well as explicit (ESA) topic modelling can induce an interlingual topic space allowing documents in different languages to be mapped into the same space and thus to be compared across languages. In this paper, we present a novel approach that combines latent and explicit topic modelling approaches in the sense that it builds on a set of explicitly defined topics, but then computes latent relations between these. Thus, the method combines the benefits of both explicit and latent topic modelling approaches. We show that on a crosslingual mate retrieval task, our model significantly outperforms LDA, LSI, and ESA, as well as a baseline that translates every word in a document into the target language.
5 0.48398715 38 emnlp-2013-Bilingual Word Embeddings for Phrase-Based Machine Translation
Author: Will Y. Zou ; Richard Socher ; Daniel Cer ; Christopher D. Manning
Abstract: We introduce bilingual word embeddings: semantic embeddings associated across two languages in the context of neural language models. We propose a method to learn bilingual embeddings from a large unlabeled corpus, while utilizing MT word alignments to constrain translational equivalence. The new embeddings significantly out-perform baselines in word semantic similarity. A single semantic similarity feature induced with bilingual embeddings adds near half a BLEU point to the results of NIST08 Chinese-English machine translation task.
6 0.46537694 136 emnlp-2013-Multi-Domain Adaptation for SMT Using Multi-Task Learning
7 0.44610298 92 emnlp-2013-Growing Multi-Domain Glossaries from a Few Seeds using Probabilistic Topic Models
8 0.42985716 169 emnlp-2013-Semi-Supervised Representation Learning for Cross-Lingual Text Classification
9 0.38962388 57 emnlp-2013-Dependency-Based Decipherment for Resource-Limited Machine Translation
10 0.38448551 103 emnlp-2013-Improving Pivot-Based Statistical Machine Translation Using Random Walk
11 0.37954155 204 emnlp-2013-Word Level Language Identification in Online Multilingual Communication
12 0.37385669 24 emnlp-2013-Application of Localized Similarity for Web Documents
13 0.36697334 39 emnlp-2013-Boosting Cross-Language Retrieval by Learning Bilingual Phrase Associations from Relevance Rankings
14 0.35421339 107 emnlp-2013-Interactive Machine Translation using Hierarchical Translation Models
15 0.33581102 96 emnlp-2013-Identifying Phrasal Verbs Using Many Bilingual Corpora
16 0.32896888 198 emnlp-2013-Using Soft Constraints in Joint Inference for Clinical Concept Recognition
17 0.32774064 15 emnlp-2013-A Systematic Exploration of Diversity in Machine Translation
18 0.32753393 12 emnlp-2013-A Semantically Enhanced Approach to Determine Textual Similarity
19 0.31622159 139 emnlp-2013-Noise-Aware Character Alignment for Bootstrapping Statistical Machine Transliteration from Bilingual Corpora
20 0.31009835 104 emnlp-2013-Improving Statistical Machine Translation with Word Class Models
topicId topicWeight
[(3, 0.049), (18, 0.029), (22, 0.078), (30, 0.068), (43, 0.338), (50, 0.013), (51, 0.182), (66, 0.035), (71, 0.034), (75, 0.022), (77, 0.049), (96, 0.019)]
simIndex simValue paperId paperTitle
1 0.88387805 59 emnlp-2013-Deriving Adjectival Scales from Continuous Space Word Representations
Author: Joo-Kyung Kim ; Marie-Catherine de Marneffe
Abstract: Continuous space word representations extracted from neural network language models have been used effectively for natural language processing, but until recently it was not clear whether the spatial relationships of such representations were interpretable. Mikolov et al. (2013) show that these representations do capture syntactic and semantic regularities. Here, we push the interpretation of continuous space word representations further by demonstrating that vector offsets can be used to derive adjectival scales (e.g., okay < good < excellent). We evaluate the scales on the indirect answers to yes/no questions corpus (de Marneffe et al., 2010). We obtain 72.8% accuracy, which outperforms previous results (∼60%) on this corpus and highlights the quality of the scales extracted, providing further support that the continuous space word representations are meaningful.
2 0.78129768 3 emnlp-2013-A Corpus Level MIRA Tuning Strategy for Machine Translation
Author: Ming Tan ; Tian Xia ; Shaojun Wang ; Bowen Zhou
Abstract: MIRA-based tuning methods have been widely used in statistical machine translation (SMT) systems with a large number of features. Since the corpus-level BLEU is not decomposable, these MIRA approaches usually define a variety of heuristic-driven sentence-level BLEUs in their model losses. Instead, we present a new MIRA method, which employs an exact corpus-level BLEU to compute the model loss. Our method is simpler in implementation. Experiments on Chinese-to-English translation show its effectiveness over two state-of-the-art MIRA implementations.
same-paper 3 0.77146959 42 emnlp-2013-Building Specialized Bilingual Lexicons Using Large Scale Background Knowledge
Author: Dhouha Bouamor ; Adrian Popescu ; Nasredine Semmar ; Pierre Zweigenbaum
Abstract: Bilingual lexicons are central components of machine translation and cross-lingual information retrieval systems. Their manual construction requires strong expertise in both languages involved and is a costly process. Several automatic methods were proposed as an alternative but they often rely on resources available in a limited number of languages and their performances are still far behind the quality of manual translations. We introduce a novel approach to the creation of domain-specific bilingual lexicons that relies on Wikipedia. This massively multilingual encyclopedia makes it possible to create lexicons for a large number of language pairs. Wikipedia is used to extract domains in each language, to link domains between languages and to create generic translation dictionaries. The approach is tested on four specialized domains and is compared to three state-of-the-art approaches using two language pairs: French-English and Romanian-English. The newly introduced method compares favorably to existing methods in all configurations tested.
4 0.54392946 114 emnlp-2013-Joint Learning and Inference for Grammatical Error Correction
Author: Alla Rozovskaya ; Dan Roth
Abstract: State-of-the-art systems for grammatical error correction are based on a collection of independently-trained models for specific errors. Such models ignore linguistic interactions at the sentence level and thus do poorly on mistakes that involve grammatical dependencies among several words. In this paper, we identify linguistic structures with interacting grammatical properties and propose to address such dependencies via joint inference and joint learning. We show that it is possible to identify interactions well enough to facilitate a joint approach and, consequently, that joint methods correct incoherent predictions that independentlytrained classifiers tend to produce. Furthermore, because the joint learning model considers interacting phenomena during training, it is able to identify mistakes that require mak- ing multiple changes simultaneously and that standard approaches miss. Overall, our model significantly outperforms the Illinois system that placed first in the CoNLL-2013 shared task on grammatical error correction.
5 0.54015118 77 emnlp-2013-Exploiting Domain Knowledge in Aspect Extraction
Author: Zhiyuan Chen ; Arjun Mukherjee ; Bing Liu ; Meichun Hsu ; Malu Castellanos ; Riddhiman Ghosh
Abstract: Aspect extraction is one of the key tasks in sentiment analysis. In recent years, statistical models have been used for the task. However, such models without any domain knowledge often produce aspects that are not interpretable in applications. To tackle the issue, some knowledge-based topic models have been proposed, which allow the user to input some prior domain knowledge to generate coherent aspects. However, existing knowledge-based topic models have several major shortcomings, e.g., little work has been done to incorporate the cannot-link type of knowledge or to automatically adjust the number of topics based on domain knowledge. This paper proposes a more advanced topic model, called MC-LDA (LDA with m-set and c-set), to address these problems, which is based on an Extended generalized Pólya urn (E-GPU) model (which is also proposed in this paper). Experiments on real-life product reviews from a variety of domains show that MCLDA outperforms the existing state-of-the-art models markedly.
6 0.53979576 137 emnlp-2013-Multi-Relational Latent Semantic Analysis
8 0.53861636 152 emnlp-2013-Predicting the Presence of Discourse Connectives
9 0.537067 56 emnlp-2013-Deep Learning for Chinese Word Segmentation and POS Tagging
10 0.53692746 82 emnlp-2013-Exploring Representations from Unlabeled Data with Co-training for Chinese Word Segmentation
11 0.53683394 53 emnlp-2013-Cross-Lingual Discriminative Learning of Sequence Models with Posterior Regularization
12 0.53619874 21 emnlp-2013-An Empirical Study Of Semi-Supervised Chinese Word Segmentation Using Co-Training
13 0.5357542 181 emnlp-2013-The Effects of Syntactic Features in Automatic Prediction of Morphology
14 0.53549123 175 emnlp-2013-Source-Side Classifier Preordering for Machine Translation
15 0.53511977 168 emnlp-2013-Semi-Supervised Feature Transformation for Dependency Parsing
16 0.53485161 156 emnlp-2013-Recurrent Continuous Translation Models
17 0.53418392 48 emnlp-2013-Collective Personal Profile Summarization with Social Networks
19 0.53405994 157 emnlp-2013-Recursive Autoencoders for ITG-Based Translation
20 0.53401893 179 emnlp-2013-Summarizing Complex Events: a Cross-Modal Solution of Storylines Extraction and Reconstruction