acl acl2011 acl2011-259 knowledge-graph by maker-knowledge-mining

259 acl-2011-Rare Word Translation Extraction from Aligned Comparable Documents


Source: pdf

Author: Emmanuel Prochasson ; Pascale Fung

Abstract: We present a first known result of high precision rare word bilingual extraction from comparable corpora, using aligned comparable documents and supervised classification. We incorporate two features, a context-vector similarity and a co-occurrence model between words in aligned documents in a machine learning approach. We test our hypothesis on different pairs of languages and corpora. We obtain very high F-Measure between 80% and 98% for recognizing and extracting correct translations for rare terms (from 1 to 5 occurrences). Moreover, we show that our system can be trained on a pair of languages and tested on a different pair of languages, obtaining an F-Measure of 77% for the classification of Chinese-English translations using a training corpus of Spanish-French. Our method is therefore even potentially applicable to low-resource languages without training data.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 We present a first known result of high precision rare word bilingual extraction from comparable corpora, using aligned comparable documents and supervised classification. [sent-2, score-1.456]

2 We incorporate two features, a context-vector similarity and a co-occurrence model between words in aligned documents in a machine learning approach. [sent-3, score-0.656]

3 We test our hypothesis on different pairs of languages and corpora. [sent-4, score-0.266]

4 We obtain very high F-Measure between 80% and 98% for recognizing and extracting correct translations for rare terms (from 1 to 5 occurrences). [sent-5, score-0.704]

5 Moreover, we show that our system can be trained on a pair of languages and tested on a different pair of languages, obtaining an F-Measure of 77% for the classification of Chinese-English translations using a training corpus of Spanish-French. [sent-6, score-0.797]

6 The logical consequence is that in any corpus, there are very few frequent words and many rare words. [sent-10, score-0.41]

7 We propose a novel approach to extract rare word translations from comparable corpora, relying on two main features. [sent-11, score-0.903]

8 The first feature is the context-vector similarity (Fung, 2000; Chiao and Zweigenbaum, 2002; Laroche and Langlais, 2010): each word is characterized by its context in both source and target corpora, and words in translation should have similar contexts in both languages. [sent-12, score-0.416]

9 The second feature follows the assumption that specific terms and their translations should appear together often in documents on the same topic, and rarely in non-related documents. [sent-13, score-0.651]

10 This is the general assumption behind early work on bilingual lexicon extraction from parallel documents, which used the sentence boundary as the context window size for co-occurrence computation; we suggest extending it to aligned comparable documents, using the document as the context window. [sent-14, score-1.526]

11 This document context is too large for co-occurrence computation of functional words or high frequency content words, but we show through observations and experiments that this window size is appropriate for rare words. [sent-15, score-0.471]

12 Moreover, we suggest that the model trained for one pair of languages can be successfully applied to extract translations from another pair of languages. [sent-18, score-0.805]

13 In the next section, we discuss the challenge of rare lexicon extraction, explaining the reasons why classic approaches on comparable corpora fail at dealing with rare words. [sent-20, score-1.186]

14 We then discuss in section 3 the concept of aligned comparable documents and how we exploited those documents for bilingual lexicon extraction in section 4. [sent-21, score-1.246]

15 © 2011 Association for Computational Linguistics, pages 1327–1335. 2 The challenge of rare lexicon extraction There are few previous works focusing on the extraction of rare word translations, especially from comparable corpora. [sent-25, score-1.176]

16 They emphasized the fact that the context-vector based approach, used for processing comparable corpora, performs quite unreliably on all but the most frequent words. [sent-28, score-0.294]

17 In a nutshell, this approach proceeds by gathering the context of words in source and target languages inside context-vectors, then compares source and target context-vectors using similarity measures. [sent-29, score-0.426]
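
The context-vector approach summarized in the sentence above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the fixed window size, the raw co-occurrence counts and the choice of cosine as the similarity measure are all assumptions made for the example, and the toy seed lexicon is invented.

```python
from collections import Counter
from math import sqrt

def context_vector(tokens, target, window=2):
    """Count the words found within `window` positions of each occurrence of `target`."""
    vec = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vec[tokens[j]] += 1
    return vec

def translate_vector(vec, seed_lexicon):
    """Project a source-language vector into the target language with a seed lexicon.
    Context words missing from the lexicon are dropped."""
    out = Counter()
    for word, count in vec.items():
        for translation in seed_lexicon.get(word, []):
            out[translation] += count
    return out

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (assumed measure)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# Toy data, invented for illustration only.
fr_tokens = "le cancer du sein est une maladie".split()
en_tokens = "breast cancer is a disease".split()
seed = {"cancer": ["cancer"], "maladie": ["disease"], "est": ["is"]}

source = translate_vector(context_vector(fr_tokens, "sein"), seed)
for candidate in ("breast", "disease"):
    print(candidate, cosine(source, context_vector(en_tokens, candidate)))
```

With so few occurrences, a single missing or extra context word changes the score sharply, which is the instability on rare words that the summary describes.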

18 , 2006) used six pairs of comparable corpora, ranking translations according to their frequencies. [sent-34, score-0.68]

19 We ran a similar experiment using a French-English comparable corpus containing medical documents, all related to the topic of breast cancer, all manually classified as scientific discourse. [sent-36, score-0.44]

20 Using an implementation of the context-vector similarity, we show in figure 1 that frequent words (above 400 occurrences in the corpus) reach 60% precision whereas rare words (below 15 occurrences) are correctly aligned only 5% of the time. [sent-40, score-0.842]

21 Figure 1: Results for context-vector based translations extraction with respect to word frequency. [sent-43, score-0.414]

22 The vertical axis is the amount of correct translations found for Top1, and the horizontal axis is the word occurrences in the corpus. [sent-44, score-0.552]

23 Less frequent words, such as abnormality (between 70 and 16 occurrences in each sample) have very unstable context-vectors, hence a lower similarity across the subsets. [sent-48, score-0.421]

24 These comparable documents, when concatenated together in order, form an aligned comparable corpus. [sent-51, score-0.765]

25 Examples of such aligned documents can be found, for example in (Munteanu and Marcu, 2005): they aligned comparable documents with close publication dates. [sent-52, score-1.176]

26 (Tao and Zhai, 2005) used an iterative, bootstrapping approach to align comparable documents using examples of already aligned corpora. [sent-53, score-0.768]

27 , 2010) aligned documents from Wikipedia following the interlingual links provided on articles. [sent-55, score-0.519]

28 We take advantage of this alignment between documents: by looking at what is common between two aligned documents and what is different in other documents, we obtain more precise information about terms than when using a larger comparable corpus without alignment. [sent-56, score-0.773]

29 This is especially interesting in the case of rare lexicon as the classic context-vector similarity is not discriminatory enough and fails at raising interesting translation for rare words. [sent-57, score-1.193]

30 4 Rare word translations from aligned comparable documents 4. [sent-58, score-1.054]

31 1 Co-occurrence model Different approaches have been proposed for bilingual lexicon extraction from parallel corpora, relying on the assumption that a word has one sense, one translation, no missing translation, and that its translation appears in aligned parallel sentences (Fung, 2000). [sent-59, score-0.943]

32 Therefore, translations can be extracted by comparing the distribution of words across the sentences. [sent-60, score-0.383]

33 For example, (Gale and Church, 1991) used a derivative of the χ2 statistic to evaluate the association between words in aligned regions of parallel documents. [sent-61, score-0.39]

34 In the case of parallel sentences and lexicon extraction, they measure how often two words appear in aligned sentences, and how often one appears without the other. [sent-63, score-0.655]

35 We focus in this work on rare words, more precisely on specialized terminology. [sent-66, score-0.374]

36 We use a strategy similar to the one applied on parallel sentences, but rely on aligned documents. [sent-68, score-0.354]

37 Our hypothesis is very similar: words in translation should appear in aligned comparable documents. [sent-69, score-0.751]

38 1) to evaluate the association between words among aligned comparable documents. [sent-71, score-0.531]

39 In the general case, this measure would not give relevant scores because of a frequency issue: it produces the same score for any two words that always appear together and never one without the other, regardless of whether they appear 500 times or only once. [sent-72, score-0.235]
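
The frequency issue described above can be reproduced with a small sketch of a document-level co-occurrence score. It assumes the measure is the Jaccard index over the sets of aligned document pairs in which each word occurs (the summary names the Jaccard similarity for the co-occurrence model, but the exact formula is not reproduced here). The function names and the toy documents are hypothetical.

```python
def doc_ids(word, documents):
    """Indices of the documents (given as sets of tokens) that contain `word`."""
    return {i for i, doc in enumerate(documents) if word in doc}

def jaccard_cooccurrence(src_word, tgt_word, src_docs, tgt_docs):
    """Jaccard index between the document sets of two words.
    src_docs[i] and tgt_docs[i] are assumed to be an aligned pair."""
    a = doc_ids(src_word, src_docs)
    b = doc_ids(tgt_word, tgt_docs)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Three aligned document pairs, invented for illustration.
src_docs = [{"tumeur", "le"}, {"le", "sein"}, {"le"}]
tgt_docs = [{"tumor", "the"}, {"the", "breast"}, {"the"}]

for pair in [("tumeur", "tumor"), ("le", "the"), ("tumeur", "the"), ("tumeur", "breast")]:
    print(pair, jaccard_cooccurrence(*pair, src_docs, tgt_docs))
```

In this toy corpus the hapax pair (tumeur, tumor) and the frequent pair (le, the) receive the same maximal score, because each pair always occurs together. The score alone cannot tell a single coincidence from a systematic association.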

40 2 Context-vector similarity We implemented the context-vector similarity in a way similar to (Morin et al. [sent-79, score-0.302]

41 3 Binary classification of rare translations We suggest incorporating both the context-vector similarity and the co-occurrence features in a machine learning approach. [sent-89, score-0.899]

42 This approach consists of training a classifier on positive examples of translation pairs, and negative examples of non-translations pairs. [sent-90, score-0.337]

43 One potential problem for building the training set, as pointed out for example by (Zhao and Ng, 2007), is this: we have a limited number of positive examples but a very large number of non-translation examples, as is obviously the case for rare word translations in any training corpus. [sent-92, score-0.726]

44 Including too many negative examples in the training set would lead the classifier to label every pair as "Non-Translation". [sent-93, score-0.243]

45 As most of the negative examples have a null co-occurrence score and a null context-vector similarity, they are excluded from the training set. [sent-98, score-0.227]

46 The negative examples are randomly chosen among those that fulfill the following constraints: • non-null features ; • ratio of number of occurrences between source/target words higher than 0. [sent-99, score-0.299]
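
The selection of negative examples under these constraints could look like the sketch below. The ratio threshold is truncated in the sentence above, so it is exposed as a required `min_ratio` parameter with no default. Reading "non-null features" as "both features strictly positive" is an assumption, as are the dictionary keys and function names.

```python
import random

def occurrence_ratio(src_count, tgt_count):
    """Ratio of the smaller to the larger occurrence count, between 0 and 1."""
    largest = max(src_count, tgt_count)
    return min(src_count, tgt_count) / largest if largest else 0.0

def sample_negatives(candidates, n, min_ratio, seed=0):
    """Randomly pick up to `n` non-translation pairs that satisfy the constraints.

    Each candidate is a dict with the hypothetical keys
    'cooc', 'ctx_sim', 'src_count' and 'tgt_count'.
    """
    eligible = [
        c for c in candidates
        if c["cooc"] > 0 and c["ctx_sim"] > 0  # assumed reading of "non-null features"
        and occurrence_ratio(c["src_count"], c["tgt_count"]) > min_ratio
    ]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(eligible, min(n, len(eligible)))
```

Filtering before sampling keeps the training set from being dominated by trivially negative pairs, which is the imbalance problem raised earlier in the summary.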

47 Features are computed using the Jaccard similarity (section 3) for the co-occurrence model, and the implementation of the context-vector similarity presented in section 4. [sent-103, score-0.336]

48 4 Extension to another pair of languages Even though the context vector similarity has been shown to achieve different accuracy depending on the pair of languages involved, the co-occurrence model is totally language independent. [sent-106, score-0.709]

49 In the case of binary classification of translations, the two models are complementary to each other: word pairs with null co-occurrence are not considered by the context model while the context vector model gives more semantic information than the co-occurrence model. [sent-107, score-0.268]

50 For these reasons, we suggest that it is possible to use a decision tree trained on one pair of languages to extract translations from another pair of languages. [sent-108, score-0.721]

51 Our approach consists in training a decision tree on a pair of languages and applying this model to the classification of unknown pairs of words in another pair of languages. [sent-111, score-0.613]

52 Such an approach is especially useful for prospecting new translations from less known languages, using a well known language as training. [sent-112, score-0.347]

53 We used the same algorithms and same features as in the previous sections, but used the data computed from one pair of languages as the training set, and the data computed from another pair of languages as the testing set. [sent-113, score-0.592]
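
The cross-language setup works because both features are plain numbers that do not depend on the languages involved. The sketch below illustrates the idea with a one-level decision stump written from scratch. This is a deliberately simplified stand-in for the decision trees used in the paper, and every feature value in the toy training and test sets is invented.

```python
def train_stump(examples):
    """Learn a single threshold on one feature.

    examples: list of ((cooc, ctx_sim), is_translation) tuples.
    Returns (feature_index, threshold) with the fewest training errors.
    """
    best = None
    for f in range(2):
        for feats, _ in examples:
            t = feats[f]
            errors = sum((x[f] >= t) != y for x, y in examples)
            if best is None or errors < best[0]:
                best = (errors, f, t)
    return best[1], best[2]

def predict(stump, feats):
    """Label a pair as a translation when the chosen feature reaches the threshold."""
    f, t = stump
    return feats[f] >= t

# Invented feature values: train on one language pair, test on another.
train_pairs = [((0.9, 0.4), True), ((0.8, 0.1), True),
               ((0.1, 0.5), False), ((0.2, 0.2), False)]
test_pairs = [((1.0, 0.2), True), ((0.05, 0.6), False)]

stump = train_stump(train_pairs)
accuracy = sum(predict(stump, x) == y for x, y in test_pairs) / len(test_pairs)
print(stump, accuracy)
```

Nothing in the training or prediction code refers to a language, so a model fitted on one pair can be applied directly to the feature values of another.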

54 We obtained about 20,000 aligned documents for each language. [sent-120, score-0.519]

55 [...] (unpublished) that seeks comparable and parallel documents from the web. [sent-125, score-0.547]

56 Starting from a list of Chinese documents (in this case, mostly news articles), we automatically selected English target documents using Cross Language Information Retrieval. [sent-126, score-0.457]

57 About 85% of the paired documents obtained are direct translations (header/footer of web pages apart). [sent-127, score-0.609]

58 However, they will be processed just like aligned comparable documents, that is, we will not take advantage of the structure of the parallel contents to improve accuracy, but will use the exact same approach that we applied for the Wikipedia documents. [sent-128, score-0.592]

59 We gathered about 15,000 pairs of documents employing this method. [sent-129, score-0.307]

60 2 Dictionaries We need a bilingual seed lexicon for the context-vector similarity. [sent-135, score-0.332]

61 We obtained approximately 22,500 Spanish-English translations and 12,000 for Spanish-French. [sent-139, score-0.397]

62 3 Evaluation lists To evaluate our approach, we needed evaluation lists of terms for which translations are already known. [sent-152, score-0.417]

63 4 Oracle translations We looked at the corpora to evaluate how many translation pairs from the evaluation lists can be found across the aligned comparable documents. [sent-157, score-1.206]

64 Segmentation tools usually rely on a training corpus and typically fail at handling rare words which, by definition, were unlikely to be found in the training examples. [sent-162, score-0.387]

65 Therefore, some rare Chinese tokens found in our corpus are the results of faulty segmentation, and the translation of those faulty words cannot be found in related documents. [sent-163, score-0.515]

66 gov/research/umls/ Figure 2: Experiment I: comparison of accuracy obtained for the Top10 with the context-vector similarity and the co-occurrence model, for hapaxes (left) and words that appear 2 to 5 times (right). [sent-171, score-0.518]

67 Experiment III extracts translation from a pair of languages, using a classifier trained on another pair of languages. [sent-172, score-0.453]

68 context-vector similarity We split the French-English part of the Wikipedia corpus into different samples: the first sample contains 500 pairs of documents. [sent-175, score-0.318]

69 We then aggregated more documents to this initial sample to test different sizes of corpora. [sent-176, score-0.251]

70 We built the sample in order to ensure hapaxes in the whole corpus are hapaxes in all subsets. [sent-177, score-0.568]

71 That is, we ensured the 431 hapaxes in the evaluation lists are represented in the 500 documents subset. [sent-178, score-0.468]

72 The accuracy is computed on 1,000 pairs of translations from the set of oracle translations, and measures the amount of correct translations found for the 10 best ranks (Top10) after ranking the candidates according to their score (context-vector similarity or co-occurrence model). [sent-182, score-1.032]
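
The Top10 measure described above can be computed with a short helper such as the following. The data layout (a dictionary of candidate scores per source word and a single reference translation per word) and the function name are assumptions made for this sketch.

```python
def top_n_accuracy(scores, gold, n=10):
    """Share of source words whose reference translation is ranked in the top `n`.

    scores: {source_word: {candidate: score}}, higher scores rank first
    gold:   {source_word: reference_translation}
    """
    hits = 0
    for src, correct in gold.items():
        ranked = sorted(scores.get(src, {}).items(), key=lambda kv: kv[1], reverse=True)
        if correct in [cand for cand, _ in ranked[:n]]:
            hits += 1
    return hits / len(gold) if gold else 0.0
```

The same function serves both features, since it only needs a score per candidate: pass either the context-vector similarities or the co-occurrence scores as `scores`.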

73 First, the size of the corpus influences the quality of the bilingual lexicon extraction when using the co-occurrence model. [sent-185, score-0.398]

74 The accuracy is improved by adding more information to the corpus, even if this additional information does not cover the pairs of translations we are looking for. [sent-187, score-0.442]

75 The added documents will weaken the association of incorrect translations, without changing the association for rare terms translations. [sent-188, score-0.53]

76 For example, the precision for hapaxes using the co-occurrence model ranges from less than 1% when using only 500 pairs of documents, to about 13% when using all documents. [sent-189, score-0.35]

77 We used all the oracle translations to train the positive values. [sent-197, score-0.405]

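78 $\mathrm{precision}_T = \frac{|T \cap \mathrm{oracle}|}{|T|}$ (4) $\mathrm{recall}_T = \frac{|T \cap \mathrm{oracle}|}{|\mathrm{oracle}|}$ (5) $\mathrm{FMeasure} = 2 \times \frac{\mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$ (6) These results show first that one feature is generally not discriminatory enough to discern correct translation and non-translation pairs. [sent-220, score-0.232]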
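
Equations (4) to (6) translate directly into code. In this sketch, `extracted` plays the role of T (the pairs labeled as translations) and `oracle` is the set of reference pairs; the function name and the zero-division conventions are assumptions.

```python
def precision_recall_f(extracted, oracle):
    """Precision, recall and F-Measure of a set of extracted pairs against the oracle."""
    extracted, oracle = set(extracted), set(oracle)
    correct = len(extracted & oracle)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(oracle) if oracle else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure
```

For instance, extracting four pairs of which two belong to a three-pair oracle gives a precision of 1/2, a recall of 2/3 and an F-Measure of 4/7.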

79 For example with Spanish-English, by using context-vector similarity only, we obtained very high recall/precision for the classification of "Non-Translation", but null precision/recall for the classification of "Translation". [sent-221, score-0.348]

80 In some other cases, we obtained high precision but poor recall with one feature only, which is not a useful result either, since most of the correct translations are still labeled as "Non-Translation". [sent-222, score-0.431]

81 The decision trees obtained indicate that, in general, word pairs with very high co-occurrence model scores are translations, and that the context-vector similarity disambiguates candidates with lower co-occurrence model scores. [sent-227, score-0.426]
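
The general shape of the trees described above can be written as a two-level rule. This only illustrates the structure; all three thresholds are hypothetical placeholders, not values learned or reported in the paper.

```python
def classify_pair(cooc, ctx_sim, high_cooc=0.8, min_cooc=0.2, min_ctx=0.3):
    """Two-level rule mirroring the described tree shape (thresholds are made up)."""
    if cooc >= high_cooc:
        # Very high co-occurrence alone is taken as evidence of translation.
        return "Translation"
    if cooc >= min_cooc and ctx_sim >= min_ctx:
        # Lower co-occurrence: the context-vector similarity decides.
        return "Translation"
    return "Non-Translation"
```

In practice the thresholds would come from training the classifier on labeled pairs rather than being set by hand.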

82 3 Experiment III: extension to another pair of languages In the last experiment, we focused on using the knowledge acquired with a given pair of languages to recognize proper translation pairs using a different pair of languages. [sent-230, score-0.881]

83 These last results are of great interest because they show that translation pairs can be correctly classified even with a classifier trained on another pair of languages. [sent-233, score-0.446]

84 This is very promising because it allows one to prospect new languages using knowledge acquired on a known pair of languages. [sent-234, score-0.304]

85 This not only confirms the precision/recall of our approach in general, but also shows that the model obtained by training tends to be very stable and accurate across different pairs of languages and different corpora. [sent-236, score-0.284]

86 [Table: examples of translations found by our algorithm.] [sent-250, score-0.347]

87 Note that even though some words such as ”kindergarten” are not rare in general, they occur with very low frequency in the test corpus. [sent-251, score-0.401]

88 7 Conclusion We presented a new approach for extracting translations of rare words among aligned comparable documents. [sent-252, score-1.235]

89 To the best of our knowledge, this is one of the first high-accuracy extractions of rare lexicon from non-parallel documents. [sent-253, score-0.553]

90 We also obtained good results for extracting lexicon for a pair of languages, using a decision tree trained with the data computed on another pair of languages. [sent-255, score-0.624]

91 Aligned comparable documents can easily be collected and are available in large volumes. [sent-259, score-0.45]

92 Moreover, the proposed machine learning method incorporating both context-vector and co-occurrence model has been shown to give good results on pairs of languages that are very different from each other, such as Chinese-English. [sent-260, score-0.234]

93 It is also applicable across different training and testing language pairs, making it possible for us to find rare word translations even for languages without training data. [sent-261, score-0.804]

94 Acknowledgments The authors would like to thank Emmanuel Morin (LINA CNRS 6241) for providing us the comparable corpus used for the experiment in section 2, Simon Shi for extracting and providing the corpus described in section 5. [sent-263, score-0.421]

95 Looking for candidate translational equivalents in specialized, comparable corpora. [sent-272, score-0.238]

96 A statistical view on bilingual lexicon extraction–from parallel corpora to non-parallel corpora. [sent-293, score-0.463]

97 Revisiting context-based projection methods for term-translation spotting in comparable corpora. [sent-314, score-0.238]

98 Bilingual Terminology Mining Using Brain, not brawn comparable corpora. [sent-318, score-0.238]

99 Extracting parallel sentences from comparable corpora using document level alignment. [sent-331, score-0.441]

100 Mining comparable bilingual text corpora for cross-language information integration. [sent-335, score-0.436]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('translations', 0.347), ('rare', 0.318), ('aligned', 0.257), ('comparable', 0.238), ('hapaxes', 0.221), ('documents', 0.212), ('lexicon', 0.168), ('similarity', 0.151), ('languages', 0.139), ('translation', 0.128), ('chiao', 0.107), ('pekar', 0.107), ('corpora', 0.106), ('occurrences', 0.105), ('pair', 0.102), ('parallel', 0.097), ('pairs', 0.095), ('laroche', 0.095), ('morin', 0.095), ('jaccard', 0.095), ('fung', 0.092), ('bilingual', 0.092), ('emmanuel', 0.087), ('experiment', 0.078), ('munteanu', 0.074), ('abnormality', 0.072), ('contextvector', 0.072), ('discriminatory', 0.072), ('tao', 0.071), ('wikipedia', 0.07), ('extraction', 0.067), ('null', 0.063), ('examples', 0.061), ('appear', 0.06), ('oracle', 0.058), ('zweigenbaum', 0.058), ('medical', 0.057), ('ratio', 0.057), ('frequent', 0.056), ('specialized', 0.056), ('decision', 0.055), ('langlais', 0.055), ('alfonseca', 0.055), ('built', 0.054), ('cancer', 0.052), ('pascale', 0.052), ('stefan', 0.05), ('obtained', 0.05), ('axis', 0.05), ('interlingual', 0.05), ('compounding', 0.05), ('classifier', 0.047), ('frequency', 0.047), ('zhao', 0.045), ('fmeasure', 0.045), ('kluwer', 0.045), ('thesaurus', 0.042), ('classification', 0.042), ('another', 0.042), ('segmentation', 0.041), ('suggest', 0.041), ('weka', 0.041), ('negative', 0.04), ('extracting', 0.039), ('sample', 0.039), ('cooccurrence', 0.038), ('classic', 0.038), ('kong', 0.038), ('influences', 0.038), ('promising', 0.038), ('explorations', 0.038), ('french', 0.037), ('lower', 0.037), ('record', 0.037), ('appears', 0.037), ('window', 0.036), ('words', 0.036), ('lists', 0.035), ('precision', 0.034), ('ldc', 0.034), ('academic', 0.034), ('computed', 0.034), ('context', 0.034), ('articles', 0.034), ('cross', 0.034), ('ran', 0.034), ('corpus', 0.033), ('hong', 0.033), ('precise', 0.033), ('target', 0.033), ('yielded', 0.033), ('terminology', 0.032), ('trained', 0.032), ('mining', 0.032), ('hypothesis', 0.032), ('together', 0.032), ('acquired', 0.032), 
('gale', 0.032), ('disregarding', 0.032), ('discern', 0.032)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 259 acl-2011-Rare Word Translation Extraction from Aligned Comparable Documents

Author: Emmanuel Prochasson ; Pascale Fung

Abstract: We present a first known result of high precision rare word bilingual extraction from comparable corpora, using aligned comparable documents and supervised classification. We incorporate two features, a context-vector similarity and a co-occurrence model between words in aligned documents in a machine learning approach. We test our hypothesis on different pairs of languages and corpora. We obtain very high F-Measure between 80% and 98% for recognizing and extracting correct translations for rare terms (from 1 to 5 occurrences). Moreover, we show that our system can be trained on a pair of languages and tested on a different pair of languages, obtaining an F-Measure of 77% for the classification of Chinese-English translations using a training corpus of Spanish-French. Our method is therefore even potentially applicable to low-resource languages without training data.

2 0.27613506 70 acl-2011-Clustering Comparable Corpora For Bilingual Lexicon Extraction

Author: Bo Li ; Eric Gaussier ; Akiko Aizawa

Abstract: We study in this paper the problem of enhancing the comparability of bilingual corpora in order to improve the quality of bilingual lexicons extracted from comparable corpora. We introduce a clustering-based approach for enhancing corpus comparability which exploits the homogeneity feature of the corpus, and finally preserves most of the vocabulary of the original corpus. Our experiments illustrate the well-foundedness of this method and show that the bilingual lexicons obtained from the homogeneous corpus are of better quality than the lexicons obtained with previous approaches.

3 0.24575409 90 acl-2011-Crowdsourcing Translation: Professional Quality from Non-Professionals

Author: Omar F. Zaidan ; Chris Callison-Burch

Abstract: Naively collecting translations by crowdsourcing the task to non-professional translators yields disfluent, low-quality results if no quality control is exercised. We demonstrate a variety of mechanisms that increase the translation quality to near professional levels. Specifically, we solicit redundant translations and edits to them, and automatically select the best output among them. We propose a set of features that model both the translations and the translators, such as country of residence, LM perplexity of the translation, edit rate from the other translations, and (optionally) calibration against professional translators. Using these features to score the collected translations, we are able to discriminate between acceptable and unacceptable translations. We recreate the NIST 2009 Urdu-to-English evaluation set with Mechanical Turk, and quantitatively show that our models are able to select translations within the range of quality that we expect from professional translators. The total cost is more than an order of magnitude lower than professional translation.

4 0.21367958 145 acl-2011-Good Seed Makes a Good Crop: Accelerating Active Learning Using Language Modeling

Author: Dmitriy Dligach ; Martha Palmer

Abstract: Active Learning (AL) is typically initialized with a small seed of examples selected randomly. However, when the distribution of classes in the data is skewed, some classes may be missed, resulting in a slow learning progress. Our contribution is twofold: (1) we show that an unsupervised language modeling based technique is effective in selecting rare class examples, and (2) we use this technique for seeding AL and demonstrate that it leads to a higher learning rate. The evaluation is conducted in the context of word sense disambiguation.

5 0.19595315 161 acl-2011-Identifying Word Translations from Comparable Corpora Using Latent Topic Models

Author: Ivan Vulic ; Wim De Smet ; Marie-Francine Moens

Abstract: A topic model outputs a set of multinomial distributions over words for each topic. In this paper, we investigate the value of bilingual topic models, i.e., a bilingual Latent Dirichlet Allocation model for finding translations of terms in comparable corpora without using any linguistic resources. Experiments on a document-aligned English-Italian Wikipedia corpus confirm that the developed methods which only use knowledge from word-topic distributions outperform methods based on similarity measures in the original word-document space. The best results, obtained by combining knowledge from word-topic distributions with similarity measures in the original space, are also reported.

6 0.18120423 240 acl-2011-ParaSense or How to Use Parallel Corpora for Word Sense Disambiguation

7 0.17462322 104 acl-2011-Domain Adaptation for Machine Translation by Mining Unseen Words

8 0.15929359 139 acl-2011-From Bilingual Dictionaries to Interlingual Document Representations

9 0.13243237 152 acl-2011-How Much Can We Gain from Supervised Word Alignment?

10 0.12904008 183 acl-2011-Joint Bilingual Sentiment Classification with Unlabeled Parallel Corpora

11 0.12606339 311 acl-2011-Translationese and Its Dialects

12 0.12131053 313 acl-2011-Two Easy Improvements to Lexical Weighting

13 0.11861831 327 acl-2011-Using Bilingual Parallel Corpora for Cross-Lingual Textual Entailment

14 0.11435229 2 acl-2011-AM-FM: A Semantic Framework for Translation Quality Assessment

15 0.1116014 81 acl-2011-Consistent Translation using Discriminative Learning - A Translation Memory-inspired Approach

16 0.1104916 44 acl-2011-An exponential translation model for target language morphology

17 0.1058901 283 acl-2011-Simple English Wikipedia: A New Text Simplification Task

18 0.10477127 87 acl-2011-Corpus Expansion for Statistical Machine Translation with Semantic Role Label Substitution Rules

19 0.10396255 331 acl-2011-Using Large Monolingual and Bilingual Corpora to Improve Coordination Disambiguation

20 0.10090181 115 acl-2011-Engkoo: Mining the Web for Language Learning


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.305), (1, -0.053), (2, 0.04), (3, 0.189), (4, 0.055), (5, -0.016), (6, 0.115), (7, 0.033), (8, 0.019), (9, -0.007), (10, -0.006), (11, -0.074), (12, 0.067), (13, -0.072), (14, 0.089), (15, -0.052), (16, 0.085), (17, -0.026), (18, 0.124), (19, -0.126), (20, 0.08), (21, -0.067), (22, 0.041), (23, -0.015), (24, -0.021), (25, 0.045), (26, -0.044), (27, 0.132), (28, 0.128), (29, -0.152), (30, 0.056), (31, -0.131), (32, -0.043), (33, -0.026), (34, -0.013), (35, 0.066), (36, 0.062), (37, 0.014), (38, 0.022), (39, -0.028), (40, -0.029), (41, 0.094), (42, 0.064), (43, -0.05), (44, -0.049), (45, -0.004), (46, 0.065), (47, -0.08), (48, 0.067), (49, 0.109)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97157544 259 acl-2011-Rare Word Translation Extraction from Aligned Comparable Documents

Author: Emmanuel Prochasson ; Pascale Fung

Abstract: We present a first known result of high precision rare word bilingual extraction from comparable corpora, using aligned comparable documents and supervised classification. We incorporate two features, a context-vector similarity and a co-occurrence model between words in aligned documents in a machine learning approach. We test our hypothesis on different pairs of languages and corpora. We obtain very high F-Measure between 80% and 98% for recognizing and extracting correct translations for rare terms (from 1 to 5 occurrences). Moreover, we show that our system can be trained on a pair of languages and tested on a different pair of languages, obtaining an F-Measure of 77% for the classification of Chinese-English translations using a training corpus of Spanish-French. Our method is therefore even potentially applicable to low-resource languages without training data.

2 0.93272394 70 acl-2011-Clustering Comparable Corpora For Bilingual Lexicon Extraction

Author: Bo Li ; Eric Gaussier ; Akiko Aizawa

Abstract: We study in this paper the problem of enhancing the comparability of bilingual corpora in order to improve the quality of bilingual lexicons extracted from comparable corpora. We introduce a clustering-based approach for enhancing corpus comparability which exploits the homogeneity feature of the corpus, and finally preserves most of the vocabulary of the original corpus. Our experiments illustrate the well-foundedness of this method and show that the bilingual lexicons obtained from the homogeneous corpus are of better quality than the lexicons obtained with previous approaches.

3 0.74551022 139 acl-2011-From Bilingual Dictionaries to Interlingual Document Representations

Author: Jagadeesh Jagarlamudi ; Hal Daume III ; Raghavendra Udupa

Abstract: Mapping documents into an interlingual representation can help bridge the language barrier of a cross-lingual corpus. Previous approaches use aligned documents as training data to learn an interlingual representation, making them sensitive to the domain of the training data. In this paper, we learn an interlingual representation in an unsupervised manner using only a bilingual dictionary. We first use the bilingual dictionary to find candidate document alignments and then use them to find an interlingual representation. Since the candidate alignments are noisy, we develop a robust learning algorithm to learn the interlingual representation. We show that bilingual dictionaries generalize to different domains better: our approach gives better performance than either a word by word translation method or Canonical Correlation Analysis (CCA) trained on a different domain.

4 0.7263037 240 acl-2011-ParaSense or How to Use Parallel Corpora for Word Sense Disambiguation

Author: Els Lefever ; Veronique Hoste ; Martine De Cock

Abstract: This paper describes a set of exploratory experiments for a multilingual classification-based approach to Word Sense Disambiguation. Instead of using a predefined monolingual sense-inventory such as WordNet, we use a language-independent framework where the word senses are derived automatically from word alignments on a parallel corpus. We built five classifiers with English as an input language and translations in the five supported languages (viz. French, Dutch, Italian, Spanish and German) as classification output. The feature vectors incorporate both the more traditional local context features, as well as binary bag-of-words features that are extracted from the aligned translations. Our results show that the ParaSense multilingual WSD system shows very competitive results compared to the best systems that were evaluated on the SemEval-2010 Cross-Lingual Word Sense Disambiguation task for all five target languages.

5 0.7109189 311 acl-2011-Translationese and Its Dialects

Author: Moshe Koppel ; Noam Ordan

Abstract: While it has often been observed that the product of translation is somehow different than non-translated text, scholars have emphasized two distinct bases for such differences. Some have noted interference from the source language spilling over into translation in a source-language-specific way, while others have noted general effects of the process of translation that are independent of source language. Using a series of text categorization experiments, we show that both these effects exist and that, moreover, there is a continuum between them. There are many effects of translation that are consistent among texts translated from a given source language, some of which are consistent even among texts translated from families of source languages. Significantly, we find that even for widely unrelated source languages and multiple genres, differences between translated texts and non-translated texts are sufficient for a learned classifier to accurately determine if a given text is translated or original.

6 0.70644367 115 acl-2011-Engkoo: Mining the Web for Language Learning

7 0.70521939 90 acl-2011-Crowdsourcing Translation: Professional Quality from Non-Professionals

8 0.69251758 104 acl-2011-Domain Adaptation for Machine Translation by Mining Unseen Words

9 0.66641724 331 acl-2011-Using Large Monolingual and Bilingual Corpora to Improve Coordination Disambiguation

10 0.66540396 304 acl-2011-Together We Can: Bilingual Bootstrapping for WSD

11 0.6444931 157 acl-2011-I Thou Thee, Thou Traitor: Predicting Formal vs. Informal Address in English Literature

12 0.61581212 313 acl-2011-Two Easy Improvements to Lexical Weighting

13 0.61361665 233 acl-2011-On-line Language Model Biasing for Statistical Machine Translation

14 0.60146803 151 acl-2011-Hindi to Punjabi Machine Translation System

15 0.59398788 81 acl-2011-Consistent Translation using Discriminative Learning - A Translation Memory-inspired Approach

16 0.58159798 161 acl-2011-Identifying Word Translations from Comparable Corpora Using Latent Topic Models

17 0.54931414 183 acl-2011-Joint Bilingual Sentiment Classification with Unlabeled Parallel Corpora

18 0.52782857 146 acl-2011-Goodness: A Method for Measuring Machine Translation Confidence

19 0.52777773 78 acl-2011-Confidence-Weighted Learning of Factored Discriminative Language Models

20 0.51801276 327 acl-2011-Using Bilingual Parallel Corpora for Cross-Lingual Textual Entailment


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.018), (17, 0.067), (26, 0.42), (37, 0.075), (39, 0.045), (41, 0.058), (55, 0.023), (59, 0.031), (72, 0.036), (91, 0.036), (96, 0.122), (97, 0.013)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.92268026 105 acl-2011-Dr Sentiment Knows Everything!

Author: Amitava Das ; Sivaji Bandyopadhyay

Abstract: Sentiment analysis is one of the hot demanding research areas since last few decades. Although a formidable amount of research has been done, the existing reported solutions or available systems are still far from perfect or do not meet the satisfaction level of end users. The main issue is the various conceptual rules that govern sentiment and there are even more clues (possibly unlimited) that can convey these concepts from realization to verbalization of a human being. Human psychology directly relates to the unrevealed clues and governs the sentiment realization of us. Human psychology relates many things like social psychology, culture, pragmatics and many more endless intelligent aspects of civilization. Proper incorporation of human psychology into computational sentiment knowledge representation may solve the problem. In the present paper we propose a template based online interactive gaming technology, called Dr Sentiment to automatically create the PsychoSentiWordNet involving internet population. The PsychoSentiWordNet is an extension of SentiWordNet that presently holds human psychological knowledge on a few aspects along with sentiment knowledge.

2 0.87738788 115 acl-2011-Engkoo: Mining the Web for Language Learning

Author: Matthew R. Scott ; Xiaohua Liu ; Ming Zhou ; Microsoft Engkoo Team

Abstract: This paper presents Engkoo, a system for exploring and learning language. It is built primarily by mining translation knowledge from billions of web pages - using the Internet to catch language in motion. Currently Engkoo is built for Chinese users who are learning English; however the technology itself is language independent and can be extended in the future. At a system level, Engkoo is an application platform that supports a multitude of NLP technologies such as cross language retrieval, alignment, sentence classification, and statistical machine translation. The data set that supports this system is primarily built from mining a massive set of bilingual terms and sentences from across the web. Specifically, web pages that contain both Chinese and English are discovered and analyzed for parallelism, extracted and formulated into clear term definitions and sample sentences. This approach allows us to build perhaps the world’s largest lexicon linking both Chinese and English together - at the same time covering the most up-to-date terms as captured by the net.

3 0.83474499 253 acl-2011-PsychoSentiWordNet

Author: Amitava Das

Abstract: Sentiment analysis is one of the hot demanding research areas since last few decades. Although a formidable amount of research has been done but still the existing reported solutions or available systems are far from perfect or to meet the satisfaction level of end user's. The main issue may be there are many conceptual rules that govern sentiment, and there are even more clues (possibly unlimited) that can convey these concepts from realization to verbalization of a human being. Human psychology directly relates to the unrevealed clues; govern the sentiment realization of us. Human psychology relates many things like social psychology, culture, pragmatics and many more endless intelligent aspects of civilization. Proper incorporation of human psychology into computational sentiment knowledge representation may solve the problem. PsychoSentiWordNet is an extension over SentiWordNet that holds human psychological knowledge and sentiment knowledge simultaneously.

same-paper 4 0.83431941 259 acl-2011-Rare Word Translation Extraction from Aligned Comparable Documents

Author: Emmanuel Prochasson ; Pascale Fung

Abstract: We present a first known result of high precision rare word bilingual extraction from comparable corpora, using aligned comparable documents and supervised classification. We incorporate two features, a context-vector similarity and a co-occurrence model between words in aligned documents in a machine learning approach. We test our hypothesis on different pairs of languages and corpora. We obtain very high F-Measure between 80% and 98% for recognizing and extracting correct translations for rare terms (from 1 to 5 occurrences). Moreover, we show that our system can be trained on a pair of languages and test on a different pair of languages, obtaining a F-Measure of 77% for the classification of Chinese-English translations using a training corpus of Spanish-French. Our method is therefore even potentially applicable to low resources languages without training data.

5 0.77665108 333 acl-2011-Web-Scale Features for Full-Scale Parsing

Author: Mohit Bansal ; Dan Klein

Abstract: Counts from large corpora (like the web) can be powerful syntactic cues. Past work has used web counts to help resolve isolated ambiguities, such as binary noun-verb PP attachments and noun compound bracketings. In this work, we first present a method for generating web count features that address the full range of syntactic attachments. These features encode both surface evidence of lexical affinities as well as paraphrase-based cues to syntactic structure. We then integrate our features into full-scale dependency and constituent parsers. We show relative error reductions of 7.0% over the second-order dependency parser of McDonald and Pereira (2006), 9.2% over the constituent parser of Petrov et al. (2006), and 3.4% over a non-local constituent reranker.

6 0.69236231 123 acl-2011-Exact Decoding of Syntactic Translation Models through Lagrangian Relaxation

7 0.66793215 70 acl-2011-Clustering Comparable Corpora For Bilingual Lexicon Extraction

8 0.60856837 258 acl-2011-Ranking Class Labels Using Query Sessions

9 0.59715652 271 acl-2011-Search in the Lost Sense of "Query": Question Formulation in Web Search Queries and its Temporal Changes

10 0.55854785 181 acl-2011-Jigs and Lures: Associating Web Queries with Structured Entities

11 0.55840969 182 acl-2011-Joint Annotation of Search Queries

12 0.54767871 137 acl-2011-Fine-Grained Class Label Markup of Search Queries

13 0.54672956 331 acl-2011-Using Large Monolingual and Bilingual Corpora to Improve Coordination Disambiguation

14 0.53599411 256 acl-2011-Query Weighting for Ranking Model Adaptation

15 0.53336793 36 acl-2011-An Efficient Indexer for Large N-Gram Corpora

16 0.5147624 193 acl-2011-Language-independent compound splitting with morphological operations

17 0.51384252 338 acl-2011-Wikulu: An Extensible Architecture for Integrating Natural Language Processing Techniques with Wikis

18 0.51257402 34 acl-2011-An Algorithm for Unsupervised Transliteration Mining with an Application to Word Alignment

19 0.51103526 90 acl-2011-Crowdsourcing Translation: Professional Quality from Non-Professionals

20 0.50843769 191 acl-2011-Knowledge Base Population: Successful Approaches and Challenges