70 acl-2011-Clustering Comparable Corpora For Bilingual Lexicon Extraction


Source: pdf

Author: Bo Li ; Eric Gaussier ; Akiko Aizawa

Abstract: We study in this paper the problem of enhancing the comparability of bilingual corpora in order to improve the quality of bilingual lexicons extracted from comparable corpora. We introduce a clustering-based approach for enhancing corpus comparability which exploits the homogeneity feature of the corpus, and finally preserves most of the vocabulary of the original corpus. Our experiments illustrate the well-foundedness of this method and show that the bilingual lexicons obtained from the homogeneous corpus are of better quality than the lexicons obtained with previous approaches.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract We study in this paper the problem of enhancing the comparability of bilingual corpora in order to improve the quality of bilingual lexicons extracted from comparable corpora. [sent-2, score-1.962]

2 We introduce a clustering-based approach for enhancing corpus comparability which exploits the homogeneity feature of the corpus, and finally preserves most of the vocabulary of the original corpus. [sent-3, score-0.814]

3 Our experiments illustrate the well-foundedness of this method and show that the bilingual lexicons obtained from the homogeneous corpus are of better quality than the lexicons obtained with previous approaches. [sent-4, score-1.179]

4 Bilingual lexicons are an important resource in multilingual natural language processing tasks such as statistical machine translation (Och and Ney, 2003) and cross-language information retrieval (Ballesteros and Croft, 1997). [sent-6, score-0.294]

5 Because it is expensive to manually build bilingual lexicons adapted to different domains, researchers have tried to automatically extract bilingual lexicons from various corpora. [sent-7, score-1.188]

6 Compared with parallel corpora, it is much easier to build high-volume comparable corpora, i.e. [sent-8, score-0.22]

7 corpora consisting of documents in different languages covering overlapping information. [sent-10, score-0.302]

8 Several studies have focused on the extraction of bilingual lexicons from comparable corpora (Fung and McKeown, 1997; Fung and Yee, 1998; Rapp, 1999; Déjean et al. [sent-11, score-1.119]

9 , 2009; Akiko Aizawa, National Institute of Informatics, Tokyo, Japan, aizawa@nii. [sent-16, score-0.056]

10 The basic assumption behind most studies on lexicon extraction from comparable corpora is a distributional hypothesis, stating that words which are translation of each other are likely to appear in similar context across languages. [sent-19, score-0.708]

11 More recently, and departing from such traditional approaches, we have proposed in (Li and Gaussier, 2010) an approach based on improving the comparability of the corpus under consideration, prior to extracting bilingual lexicons. [sent-22, score-0.869]

12 This approach is interesting since there is no point in trying to extract lexicons from a corpus with a low degree of comparability, as the probability of finding translations of any given word is low in such cases. [sent-23, score-0.401]

13 We follow here the same general idea and aim, in a first step, at improving the comparability of a given corpus while preserving most of its vocabulary. [sent-24, score-0.521]

14 However, unlike the previous work, we show here that it is possible to guarantee a certain degree of homogeneity for the improved corpus, and that this homogeneity translates into a significant improvement of both the quality of the resulting corpora and the bilingual lexicons extracted. [sent-25, score-1.231]

15 2 Enhancing Comparable Corpora: A Clustering Approach We first introduce in this section the comparability measure proposed in former work, prior to describing the clustering-based algorithm to improve the [sent-26, score-0.472]

16 quality of a given comparable corpus. (Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: shortpapers, pages 473–478, Portland, Oregon, June 19-24, 2011.) [sent-28, score-0.28]

17 For convenience, the following discussion will be made in the context of the English-French comparable corpus. [sent-29, score-0.22]

18 Let σ be a function indicating whether a translation from the translation set $T_w$ of the word w is found in the vocabulary $P^v$ of a corpus P, i.e. [sent-32, score-0.223]

19 $\sigma(w, P) = 1$ iff $T_w \cap P^v \neq \emptyset$ and 0 otherwise, and let D be a bilingual dictionary with $D_e^v$ denoting its English vocabulary and $D_f^v$ its French vocabulary. [sent-35, score-0.453]

20 The comparability measure M can be written as: $M(P_e, P_f) = \frac{\sum_{w \in P_e \cap D_e^v} \sigma(w, P_f) + \sum_{w \in P_f \cap D_f^v} \sigma(w, P_e)}{\#_w(P_e \cap D_e^v) + \#_w(P_f \cap D_f^v)}$ (1) where $\#_w(P)$ denotes the number of different words present in P. [sent-36, score-0.472]

21 One can find from equation 1 that M directly measures the proportion of source/target words translated in the target/source vocabulary of P. [sent-37, score-0.101]
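
A minimal Python sketch of equation 1, under stated assumptions: vocabularies are plain sets of words, and the dictionary is given as two hypothetical mappings, `dict_e2f` and `dict_f2e`, from a word to the set of its translations. These names and data structures are illustrative only and are not taken from the paper.

```python
def sigma(word, vocabulary, translations):
    """Return 1 if some dictionary translation of `word` occurs in `vocabulary`, else 0."""
    return 1 if translations.get(word, set()) & vocabulary else 0


def comparability(vocab_e, vocab_f, dict_e2f, dict_f2e):
    """Proportion of dictionary-covered words with a translation on the other side."""
    covered_e = vocab_e & set(dict_e2f)  # English words of the corpus found in the dictionary
    covered_f = vocab_f & set(dict_f2e)  # French words of the corpus found in the dictionary
    denominator = len(covered_e) + len(covered_f)
    if denominator == 0:
        return 0.0  # assumption: an empty overlap is scored as 0
    found = sum(sigma(w, vocab_f, dict_e2f) for w in covered_e)
    found += sum(sigma(w, vocab_e, dict_f2e) for w in covered_f)
    return found / denominator
```

With two English and two French dictionary-covered words, of which one on each side has a translation in the other vocabulary, the function returns 0.5.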

22 2 Clustering Documents for High Quality Comparable Corpora If a corpus covers a limited set of topics, it is more likely to contain consistent information on the words used (Morin et al. [sent-39, score-0.088]

23 , 2007), leading to improved bilingual lexicons extracted with existing algorithms relying on the distributional hypothesis. [sent-40, score-0.594]

24 The term homogeneity directly refers to this fact, and we will say, in an informal manner, that a corpus is homogeneous if it covers a limited set of topics. [sent-41, score-0.417]

25 The rationale for the algorithm we introduce here to enhance corpus comparability is precisely based on the concept of homogeneity. [sent-42, score-0.5]

26 In order to find document sets which are similar with each other (i.e. [sent-43, score-0.04]

27 homogeneous), it is natural to resort to clustering techniques. [sent-45, score-0.084]

28 Furthermore, since we need homogeneous corpora for bilingual lexicon extraction, it will be convenient to rely on techniques which allow one to easily prune less relevant clusters. [sent-46, score-0.87]

29 To perform all this, we use in this work a standard hierarchical agglomerative clustering method. [sent-47, score-0.084]
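
As an illustration of what a hierarchical agglomerative procedure does, here is a minimal Python sketch. It is a generic textbook variant, not the authors' implementation: the `similarity` argument is a hypothetical callable that scores two lists of members, and the dendrogram is represented as nested tuples.

```python
def agglomerate(items, similarity):
    """Repeatedly merge the two most similar clusters; return a nested-tuple dendrogram."""
    if not items:
        raise ValueError("need at least one item to cluster")
    # Each entry pairs the subtree built so far with the flat list of its members.
    clusters = [(item, [item]) for item in items]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                score = similarity(clusters[i][1], clusters[j][1])
                if best is None or score > best[0]:
                    best = (score, i, j)
        _, i, j = best
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]
```

The quadratic search for the best pair keeps the sketch short; it is only practical for small inputs.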

30 1 Bilingual Clustering Algorithm The overall process retained to build high quality, homogeneous comparable corpora relies on the following steps: 1. [sent-50, score-0.607]

31 Using the bilingual similarity measure defined in Section 2. [sent-51, score-0.457]

32 2, cluster English and French documents so as to get bilingual dendrograms from the original corpus P by grouping documents with related content; 2. [sent-53, score-0.586]

33 Pick high quality sub-clusters by thresholding the obtained dendrograms according to the node depth, which retains nodes far from the roots of the clustering trees; 3. [sent-54, score-0.284]

34 Combine all these sub-clusters to form a new comparable corpus PH, which thus contains homogeneous, high-quality subparts; 4. [sent-55, score-0.269]

35 Use again steps (1), (2) and (3) to enrich the remaining subpart of P (denoted as PL, PL = P \ PH) with external resources. [sent-56, score-0.064]

36 The first three steps aim at extracting the most comparable and homogeneous subpart of P. [sent-57, score-0.498]

37 Once this has been done, one needs to resort to new corpora if one wants to build a homogeneous corpus with a high degree of comparability from PL. [sent-58, score-0.904]

38 The two high quality subparts obtained from these two new comparable corpora in step (4) are then combined with PH to constitute the final comparable corpus of higher quality. [sent-60, score-0.861]
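
Steps (2) and (3) above can be sketched in Python as follows. This is only an illustration under several assumptions that are not taken from the paper: the dendrogram is a nested tuple whose leaves are document identifiers (strings), the depth threshold `min_depth` keeps the subtrees rooted exactly at that depth, and a hypothetical `min_size` parameter discards groups that are too small to count as clusters.

```python
def leaves(node):
    """Collect the document ids found under a dendrogram node."""
    if not isinstance(node, tuple):
        return [node]
    return [doc for child in node for doc in leaves(child)]


def deep_subclusters(node, min_depth, min_size=2, depth=0):
    """Return the document groups of the subtrees rooted at depth `min_depth`."""
    if depth == min_depth:
        group = leaves(node)
        return [group] if len(group) >= min_size else []
    if not isinstance(node, tuple):
        return []  # a leaf sitting above the threshold is pruned
    found = []
    for child in node:
        found.extend(deep_subclusters(child, min_depth, min_size, depth + 1))
    return found


def split_corpus(dendrogram, min_depth):
    """Split documents into the retained part (PH) and the remainder (PL)."""
    high = [doc for group in deep_subclusters(dendrogram, min_depth) for doc in group]
    kept = set(high)
    low = [doc for doc in leaves(dendrogram) if doc not in kept]
    return high, low
```

In this sketch the second list plays the role of PL, the part that step (4) would then enrich with external resources.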

39 2 Similarity Measure Let us assume that we have two document sets (i. [sent-63, score-0.04]

40 In the task of bilingual lexicon extraction, two document sets are similar to each other and should be clustered if the combination of the two can complement the content of each single set, which relates to the notion of homogeneity. [sent-66, score-0.523]

41 In other words, both the English part $C_1^e$ of $C_1$ and the French part $C_1^f$ of $C_1$ should be comparable to their counterparts (respectively the French part $C_2^f$ of $C_2$ and the English part $C_2^e$ of $C_2$). [sent-67, score-0.344]

42 $(C_1^e, C_2^f)$ and $(C_2^e, C_1^f)$ should dominate the overall similarity $sim(C_1, C_2)$. [sent-69, score-0.047]

43 Since the content relatedness in the comparable corpus is basically reflected by the relations between all the possible bilingual document pairs, we use here the number of document pairs to represent the scale of the comparable corpus. [sent-70, score-0.927]

44 The weight β can thus be defined as the proportion of possible document pairs in the current comparable corpus $(C_1^e, C_2^f)$ to all the possible document pairs, which is: $\beta = \frac{\#_d(C_1^e) \cdot \#_d(C_2^f)}{\#_d(C_1^e) \cdot \#_d(C_2^f) + \#_d(C_2^e) \cdot \#_d(C_1^f)}$ where $\#_d(C)$ stands for the number of documents in C. [sent-71, score-0.444]
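
A small Python sketch of the weight β, computed from document counts only. The second function is an assumption: the text above does not show how β enters the overall similarity, so the sketch supposes a β-weighted combination of the comparability scores of the two cross-lingual pairs. The fallback value 0.5 for empty clusters is likewise an illustrative choice.

```python
def beta(n_c1_e, n_c2_f, n_c2_e, n_c1_f):
    """Share of document pairs contributed by the pair (English part of C1, French part of C2)."""
    pairs_12 = n_c1_e * n_c2_f
    pairs_21 = n_c2_e * n_c1_f
    total = pairs_12 + pairs_21
    # Assumption: with no document pairs at all, weight both sides equally.
    return pairs_12 / total if total else 0.5


def cluster_similarity(m_12, m_21, weight):
    """Assumed combination: weight * M(C1e, C2f) + (1 - weight) * M(C2e, C1f)."""
    return weight * m_12 + (1 - weight) * m_21
```

For instance, with 30 English and 10 French documents in the first cluster and 10 of each in the second, the first pair accounts for 300 of the 400 possible document pairs, so β is 0.75.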

45 However, this measure does not integrate the relative [sent-72, score-0.08]

46 length of the French and English parts, which actually impacts the performance of bilingual lexicon extraction. [sent-73, score-0.483]

47 assuming that all clusters should contain the same number of English and French documents), having completely unbalanced corpora is also not desirable. [sent-76, score-0.255]

48 In addition, two monolingual corpora Wiki-En and Wiki-Fr were built by respectively retrieving all the articles below the category Society and Société from the Wikipedia dump files3. [sent-79, score-0.308]

49 The bilingual dictionary used in the experiments is constructed from an online dictionary. [sent-80, score-0.395]

50 It consists of 33k distinct English words and 28k distinct French words, constituting 76k translation pairs. [sent-81, score-0.058]

51 In our experiments, we use the method described in this paper, as well as the one in (Li and Gaussier, 2010) which is the only alternative method to enhance corpus comparability. [sent-82, score-0.08]

52 1 Improving Corpus Quality In this subsection, the clustering algorithm described in Section 2. [sent-84, score-0.084]

53 1 is employed to improve the quality of the comparable corpus. [sent-86, score-0.28]

54 The corpora GH95 and SDA95 are used as the original corpus P0 (56k English documents and 42k French documents). [sent-87, score-0.299]

55 We consider two external corpora: PT1 (109k English documents and 87k French documents) consisting of the corpora LAT94, MON94 and SDA94; PT2 (368k English documents and 378k French documents) consisting of Wiki-En and Wiki-Fr. [sent-88, score-0.347]

56 3 The Wikipedia dump files can be downloaded at http://download. [sent-93, score-0.093]

57 In this paper, we use the English dump file on July 13, 2009 and the French dump file on July 7, 2009. [sent-96, score-0.248]

58 Table 1: Performance of the bilingual lexicon extraction from different corpora (best results in bold). After the clustering process, we obtain the resulting corpora P1 (with the external corpus PT1) and P2 (with PT2). [sent-106, score-1.132]

59 As mentioned before, we also used the method described in (Li and Gaussier, 2010) on the same data, producing resulting corpora P1′ (with PT1) and P2′ (with PT2) from P0. [sent-107, score-0.243]

60 vocabulary of the original corpus has been preserved. [sent-114, score-0.142]

61 Both corpora are more comparable than P0 of which the comparability is 0. [sent-119, score-0.855]

62 Furthermore, both P1 and P2 are more comparable than P1′ (comparability 0. [sent-121, score-0.22]

63 The intrinsic evaluation shows the efficiency of our approach which can improve the quality of the given corpus while preserving most of its vocabulary. [sent-124, score-0.161]

64 2 Bilingual Lexicon Extraction Experiments To extract bilingual lexicons from comparable corpora, we directly use here the method proposed by Fung and Yee (1998) which has been referred to as the standard approach in more recent studies (D´ ejean et al. [sent-126, score-0.937]

65 In this approach, each word w is represented as a context vector consisting of the words co-occurring with w in a certain window in the corpus. [sent-129, score-0.035]

66 The context vectors in different languages are then bridged with an existing bilingual dictionary. [sent-130, score-0.386]

67 Finally, a similarity score is given to any word pair based on the cosine of their respective context vectors. [sent-131, score-0.047]
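
The three stages just described (context vectors, bridging through a dictionary, cosine scoring) can be sketched in Python as below. This is a simplified illustration, not the exact method of Fung and Yee (1998): it uses raw co-occurrence counts rather than any particular weighting scheme, and the function names, the `window` parameter and the dictionary format (word mapped to a list of translations) are assumptions.

```python
import math
from collections import Counter


def context_vector(tokens, target, window=2):
    """Count the words co-occurring with `target` within `window` positions."""
    vec = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        start, end = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                vec[tokens[j]] += 1
    return vec


def translate_vector(vec, dictionary):
    """Map each context word to its translations; words without an entry are dropped."""
    out = Counter()
    for word, weight in vec.items():
        for translation in dictionary.get(word, ()):
            out[translation] += weight
    return out


def cosine(u, v):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0
```

A candidate translation is then scored by comparing the translated source vector with the context vector of each target word.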

68 English words not present in Pe or with no translation in Pf are excluded from the evaluation set. [sent-135, score-0.058]

69 For each English word in the evaluation set, all the French words in Pf are then ranked according to their similarity with the English word. [sent-136, score-0.047]

70 Precision and recall are then computed on the first N translation candidate lists. [sent-137, score-0.058]

71 The precision amounts in this case to the proportion of lists containing the correct translation (in case of multiple translations, a list is deemed to contain the correct translation as soon as one of the possible translations is present). [sent-138, score-0.258]

72 The recall is the proportion of correct translations found in the lists to all the translations in the corpus. [sent-139, score-0.179]
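
The precision and recall definitions above can be made concrete with a short Python sketch. The data layout is an assumption: `ranked` maps each evaluated English word to its ranked list of French candidates, and `gold` maps the same word to the set of its reference translations present in the corpus.

```python
def precision_recall_at_n(ranked, gold, n):
    """Precision: share of top-n lists with a correct translation.
    Recall: share of all reference translations found in the top-n lists."""
    lists_with_hit = 0
    translations_found = 0
    translations_total = 0
    for word, candidates in ranked.items():
        correct = gold[word] & set(candidates[:n])
        if correct:
            lists_with_hit += 1  # one correct translation is enough for the list to count
        translations_found += len(correct)
        translations_total += len(gold[word])
    precision = lists_with_hit / len(ranked) if ranked else 0.0
    recall = translations_found / translations_total if translations_total else 0.0
    return precision, recall
```

Note that a word with several reference translations counts once for precision but contributes each of its translations to the recall denominator.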

73 This evaluation procedure has been used in previous studies and is now standard. [sent-140, score-0.038]

74 As one can note, the best results (in bold) are obtained from the corpora P2 built with the method we have described in this paper. [sent-150, score-0.215]

75 The lexicons extracted from the enhanced corpora are of much higher quality than the ones obtained from the original corpus. [sent-151, score-0.629]

76 difference is more remarkable with P2, which is obtained from a large external corpus PT2. [sent-159, score-0.117]

77 Intuitively, one can expect to find, in larger corpora, more documents related to a given corpus, an intuition which seems to be confirmed by our results. [sent-160, score-0.052]

78 One can also notice, by comparing P2 and P2′ as well as P1 and P1′, a remarkable improvement when considering our approach and the early methodology. [sent-161, score-0.033]

79 In a second series of experiments, we let N vary from 1 to 300 and plot the results obtained with different evaluation measure in Figure 1. [sent-163, score-0.086]

80 recall) scores for the lexicons extracted on each of the 5 corpora P0, P1, P1′, P2 and P2′. [sent-166, score-0.236]

81 As one can note, our method consistently outperforms the previous work and also the original corpus on all the values considered for N. [sent-168, score-0.084]

82 Figure 1: Performance of bilingual lexicon extraction from different corpora with varied N values from 1 to 300; panels: (a) Precision, (b) Recall; horizontal axis: N. [sent-169, score-0.75]

83 4 Discussion As previous studies on bilingual lexicon extraction from comparable corpora radically differ on resources used and technical choices, it is very difficult to compare them in a unified framework (Laroche and Langlais, 2010). [sent-171, score-1.008]

84 enhancing bilingual corpora prior to extracting bilingual lexicons from them). [sent-174, score-1.304]

85 , 2004) and (Munteanu and Marcu, 2006) propose methods to extract parallel fragments from comparable corpora. [sent-176, score-0.22]

86 However, their approach only focuses on a very small part of the original corpus, whereas our work aims at preserving most of the vocabulary of the original corpus. [sent-177, score-0.211]

87 We have followed here the general approach in (Li and Gaussier, 2010) which consists in enhancing the quality of a comparable corpus prior to extracting information from it. [sent-178, score-0.466]

88 However, unlike this latter work, we have shown here a method which ensures homogeneity of the obtained corpus, and which finally leads to comparable corpora of higher quality. [sent-179, score-0.626]

89 In turn, such corpora yield better extracted bilingual lexicons. [sent-180, score-0.809]

90 Phrasal translation and query expansion techniques for crosslanguage information retrieval. [sent-185, score-0.058]

91 An approach based on multilingual thesauri and model combination for bilingual lexicon extraction. [sent-189, score-0.483]

92 An IR approach for translating new words from nonparallel, comparable texts. [sent-197, score-0.22]

93 Improving translation lexicon induction from monolingual corpora via dependency contexts and part-of-speech equivalences. [sent-201, score-0.398]

94 A geometric view on bilingual lexicon extraction from comparable corpora. [sent-211, score-0.755]

95 Revisiting context-based projection methods for term-translation spotting in comparable corpora. [sent-215, score-0.22]

96 Improving corpus comparability for bilingual lexicon extraction from comparable corpora. [sent-219, score-1.224]

97 Bilingual terminology mining using brain, not brawn comparable corpora. [sent-223, score-0.277]

98 Improved machine translation performance via parallel sentence extraction from comparable corpora. [sent-231, score-0.33]

99 Automatic identification of word translations from unrelated English and German corpora. [sent-240, score-0.068]

100 Extracting bilingual dictionary from comparable corpora with dependency heterogeneity. [sent-252, score-0.83]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('comparability', 0.42), ('bilingual', 0.358), ('gaussier', 0.288), ('lexicons', 0.236), ('comparable', 0.22), ('corpora', 0.215), ('homogeneous', 0.172), ('homogeneity', 0.157), ('french', 0.157), ('lexicon', 0.125), ('pf', 0.121), ('fung', 0.104), ('dfv', 0.097), ('subparts', 0.097), ('enhancing', 0.095), ('dump', 0.093), ('munteanu', 0.089), ('ejean', 0.085), ('morin', 0.085), ('clustering', 0.084), ('pl', 0.076), ('pe', 0.071), ('translations', 0.068), ('dendrograms', 0.064), ('robitaille', 0.064), ('shezaf', 0.064), ('siml', 0.064), ('subpart', 0.064), ('quality', 0.06), ('translation', 0.058), ('vocabulary', 0.058), ('laroche', 0.057), ('sda', 0.057), ('ph', 0.056), ('li', 0.055), ('sim', 0.054), ('english', 0.054), ('measure', 0.052), ('ballesteros', 0.052), ('garera', 0.052), ('nonparallel', 0.052), ('documents', 0.052), ('preserving', 0.052), ('extraction', 0.052), ('corpus', 0.049), ('degree', 0.048), ('similarity', 0.047), ('pd', 0.047), ('pascale', 0.047), ('dev', 0.045), ('dragos', 0.045), ('proportion', 0.043), ('extracting', 0.042), ('thresholding', 0.042), ('yu', 0.041), ('dd', 0.04), ('unbalanced', 0.04), ('document', 0.04), ('covers', 0.039), ('pthe', 0.039), ('july', 0.038), ('studies', 0.038), ('dictionary', 0.037), ('yee', 0.037), ('tsujii', 0.036), ('consisting', 0.035), ('original', 0.035), ('obtained', 0.034), ('external', 0.034), ('pw', 0.033), ('ws', 0.033), ('intuitively', 0.032), ('eric', 0.032), ('part', 0.031), ('file', 0.031), ('enhance', 0.031), ('precision', 0.031), ('stefan', 0.03), ('terminology', 0.029), ('fir', 0.028), ('hweit', 0.028), ('nii', 0.028), ('yasuhiro', 0.028), ('junichi', 0.028), ('dda', 0.028), ('theent', 0.028), ('akn', 0.028), ('apnd', 0.028), ('ater', 0.028), ('audrey', 0.028), ('awa', 0.028), ('brawn', 0.028), ('bridged', 0.028), ('compiling', 0.028), ('daille', 0.028), ('herald', 0.028), ('iiosn', 0.028), ('kyo', 0.028), ('matveeva', 0.028), ('nikesh', 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 70 acl-2011-Clustering Comparable Corpora For Bilingual Lexicon Extraction

2 0.27613506 259 acl-2011-Rare Word Translation Extraction from Aligned Comparable Documents

Author: Emmanuel Prochasson ; Pascale Fung

Abstract: We present a first known result of high precision rare word bilingual extraction from comparable corpora, using aligned comparable documents and supervised classification. We incorporate two features, a context-vector similarity and a co-occurrence model between words in aligned documents in a machine learning approach. We test our hypothesis on different pairs of languages and corpora. We obtain very high F-Measure between 80% and 98% for recognizing and extracting correct translations for rare terms (from 1 to 5 occurrences). Moreover, we show that our system can be trained on a pair of languages and test on a different pair of languages, obtaining a F-Measure of 77% for the classification of Chinese-English translations using a training corpus of Spanish-French. Our method is therefore even potentially applicable to low resources languages without training data.

3 0.15239531 161 acl-2011-Identifying Word Translations from Comparable Corpora Using Latent Topic Models

Author: Ivan Vulic ; Wim De Smet ; Marie-Francine Moens

Abstract: A topic model outputs a set of multinomial distributions over words for each topic. In this paper, we investigate the value of bilingual topic models, i.e., a bilingual Latent Dirichlet Allocation model for finding translations of terms in comparable corpora without using any linguistic resources. Experiments on a document-aligned English-Italian Wikipedia corpus confirm that the developed methods which only use knowledge from word-topic distributions outperform methods based on similarity measures in the original word-document space. The best results, obtained by combining knowledge from word-topic distributions with similarity measures in the original space, are also reported.

4 0.13312505 139 acl-2011-From Bilingual Dictionaries to Interlingual Document Representations

Author: Jagadeesh Jagarlamudi ; Hal Daume III ; Raghavendra Udupa

Abstract: Mapping documents into an interlingual representation can help bridge the language barrier of a cross-lingual corpus. Previous approaches use aligned documents as training data to learn an interlingual representation, making them sensitive to the domain of the training data. In this paper, we learn an interlingual representation in an unsupervised manner using only a bilingual dictionary. We first use the bilingual dictionary to find candidate document alignments and then use them to find an interlingual representation. Since the candidate alignments are noisy, we develop a robust learning algorithm to learn the interlingual representation. We show that bilingual dictionaries generalize to different domains better: our approach gives better performance than either a word by word translation method or Canonical Correlation Analysis (CCA) trained on a different domain.

5 0.12492705 115 acl-2011-Engkoo: Mining the Web for Language Learning

Author: Matthew R. Scott ; Xiaohua Liu ; Ming Zhou ; Microsoft Engkoo Team

Abstract: This paper presents Engkoo, a system for exploring and learning language. It is built primarily by mining translation knowledge from billions of web pages - using the Internet to catch language in motion. Currently Engkoo is built for Chinese users who are learning English; however the technology itself is language independent and can be extended in the future. At a system level, Engkoo is an application platform that supports a multitude of NLP technologies such as cross language retrieval, alignment, sentence classification, and statistical machine translation. The data set that supports this system is primarily built from mining a massive set of bilingual terms and sentences from across the web. Specifically, web pages that contain both Chinese and English are discovered and analyzed for parallelism, extracted and formulated into clear term definitions and sample sentences. This approach allows us to build perhaps the world’s largest lexicon linking both Chinese and English together - at the same time covering the most up-to-date terms as captured by the net.

6 0.10675991 327 acl-2011-Using Bilingual Parallel Corpora for Cross-Lingual Textual Entailment

7 0.10246957 331 acl-2011-Using Large Monolingual and Bilingual Corpora to Improve Coordination Disambiguation

8 0.10055555 90 acl-2011-Crowdsourcing Translation: Professional Quality from Non-Professionals

9 0.09689445 104 acl-2011-Domain Adaptation for Machine Translation by Mining Unseen Words

10 0.092430048 183 acl-2011-Joint Bilingual Sentiment Classification with Unlabeled Parallel Corpora

11 0.089896433 304 acl-2011-Together We Can: Bilingual Bootstrapping for WSD

12 0.085428067 240 acl-2011-ParaSense or How to Use Parallel Corpora for Word Sense Disambiguation

13 0.079036884 269 acl-2011-Scaling up Automatic Cross-Lingual Semantic Role Annotation

14 0.078854539 29 acl-2011-A Word-Class Approach to Labeling PSCFG Rules for Machine Translation

15 0.074777216 225 acl-2011-Monolingual Alignment by Edit Rate Computation on Sentential Paraphrase Pairs

16 0.069938272 37 acl-2011-An Empirical Evaluation of Data-Driven Paraphrase Generation Techniques

17 0.064165436 141 acl-2011-Gappy Phrasal Alignment By Agreement

18 0.05872504 57 acl-2011-Bayesian Word Alignment for Statistical Machine Translation

19 0.058625359 2 acl-2011-AM-FM: A Semantic Framework for Translation Quality Assessment

20 0.058577135 157 acl-2011-I Thou Thee, Thou Traitor: Predicting Formal vs. Informal Address in English Literature


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.185), (1, -0.025), (2, 0.017), (3, 0.121), (4, 0.026), (5, -0.024), (6, 0.078), (7, 0.05), (8, 0.021), (9, -0.05), (10, 0.007), (11, -0.023), (12, 0.066), (13, -0.041), (14, 0.083), (15, -0.036), (16, 0.028), (17, -0.016), (18, 0.091), (19, -0.099), (20, 0.034), (21, -0.048), (22, 0.04), (23, -0.007), (24, -0.036), (25, 0.008), (26, -0.061), (27, 0.067), (28, 0.097), (29, -0.176), (30, 0.06), (31, -0.108), (32, 0.04), (33, -0.076), (34, 0.012), (35, 0.049), (36, 0.064), (37, 0.031), (38, 0.006), (39, -0.008), (40, -0.098), (41, 0.106), (42, 0.117), (43, 0.002), (44, 0.014), (45, -0.074), (46, 0.044), (47, -0.096), (48, 0.072), (49, 0.138)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96415949 70 acl-2011-Clustering Comparable Corpora For Bilingual Lexicon Extraction

2 0.84973973 259 acl-2011-Rare Word Translation Extraction from Aligned Comparable Documents

Author: Emmanuel Prochasson ; Pascale Fung

Abstract: We present a first known result of high precision rare word bilingual extraction from comparable corpora, using aligned comparable documents and supervised classification. We incorporate two features, a context-vector similarity and a co-occurrence model between words in aligned documents in a machine learning approach. We test our hypothesis on different pairs of languages and corpora. We obtain very high F-Measure between 80% and 98% for recognizing and extracting correct translations for rare terms (from 1 to 5 occurrences). Moreover, we show that our system can be trained on a pair of languages and test on a different pair of languages, obtaining a F-Measure of 77% for the classification of Chinese-English translations using a training corpus of Spanish-French. Our method is therefore even potentially applicable to low resources languages without training data.

3 0.74791765 139 acl-2011-From Bilingual Dictionaries to Interlingual Document Representations

Author: Jagadeesh Jagarlamudi ; Hal Daume III ; Raghavendra Udupa

Abstract: Mapping documents into an interlingual representation can help bridge the language barrier of a cross-lingual corpus. Previous approaches use aligned documents as training data to learn an interlingual representation, making them sensitive to the domain of the training data. In this paper, we learn an interlingual representation in an unsupervised manner using only a bilingual dictionary. We first use the bilingual dictionary to find candidate document alignments and then use them to find an interlingual representation. Since the candidate alignments are noisy, we develop a robust learning algorithm to learn the interlingual representation. We show that bilingual dictionaries generalize to different domains better: our approach gives better performance than either a word by word translation method or Canonical Correlation Analysis (CCA) trained on a different domain.

4 0.64286107 115 acl-2011-Engkoo: Mining the Web for Language Learning

Author: Matthew R. Scott ; Xiaohua Liu ; Ming Zhou ; Microsoft Engkoo Team

Abstract: This paper presents Engkoo, a system for exploring and learning language. It is built primarily by mining translation knowledge from billions of web pages - using the Internet to catch language in motion. Currently Engkoo is built for Chinese users who are learning English; however the technology itself is language independent and can be extended in the future. At a system level, Engkoo is an application platform that supports a multitude of NLP technologies such as cross language retrieval, alignment, sentence classification, and statistical machine translation. The data set that supports this system is primarily built from mining a massive set of bilingual terms and sentences from across the web. Specifically, web pages that contain both Chinese and English are discovered and analyzed for parallelism, extracted and formulated into clear term definitions and sample sentences. This approach allows us to build perhaps the world’s largest lexicon linking both Chinese and English together - at the same time covering the most up-to-date terms as captured by the net.

5 0.61188412 311 acl-2011-Translationese and Its Dialects

Author: Moshe Koppel ; Noam Ordan

Abstract: While it has often been observed that the product of translation is somehow different than non-translated text, scholars have emphasized two distinct bases for such differences. Some have noted interference from the source language spilling over into translation in a source-language-specific way, while others have noted general effects of the process of translation that are independent of source language. Using a series of text categorization experiments, we show that both these effects exist and that, moreover, there is a continuum between them. There are many effects of translation that are consistent among texts translated from a given source language, some of which are consistent even among texts translated from families of source languages. Significantly, we find that even for widely unrelated source languages and multiple genres, differences between translated texts and non-translated texts are sufficient for a learned classifier to accurately determine if a given text is translated or original.

6 0.60726213 331 acl-2011-Using Large Monolingual and Bilingual Corpora to Improve Coordination Disambiguation

7 0.56123161 304 acl-2011-Together We Can: Bilingual Bootstrapping for WSD

8 0.55604565 240 acl-2011-ParaSense or How to Use Parallel Corpora for Word Sense Disambiguation

9 0.54584557 104 acl-2011-Domain Adaptation for Machine Translation by Mining Unseen Words

10 0.5449416 90 acl-2011-Crowdsourcing Translation: Professional Quality from Non-Professionals

11 0.51017398 161 acl-2011-Identifying Word Translations from Comparable Corpora Using Latent Topic Models

12 0.50745094 80 acl-2011-ConsentCanvas: Automatic Texturing for Improved Readability in End-User License Agreements

13 0.49707773 157 acl-2011-I Thou Thee, Thou Traitor: Predicting Formal vs. Informal Address in English Literature

14 0.49482536 327 acl-2011-Using Bilingual Parallel Corpora for Cross-Lingual Textual Entailment

15 0.48289371 183 acl-2011-Joint Bilingual Sentiment Classification with Unlabeled Parallel Corpora

16 0.46947226 151 acl-2011-Hindi to Punjabi Machine Translation System

17 0.44733778 323 acl-2011-Unsupervised Part-of-Speech Tagging with Bilingual Graph-Based Projections

18 0.43141568 313 acl-2011-Two Easy Improvements to Lexical Weighting

19 0.41738918 326 acl-2011-Using Bilingual Information for Cross-Language Document Summarization

20 0.41319221 81 acl-2011-Consistent Translation using Discriminative Learning - A Translation Memory-inspired Approach


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.012), (17, 0.037), (20, 0.234), (26, 0.149), (37, 0.056), (39, 0.039), (41, 0.083), (53, 0.013), (55, 0.035), (59, 0.053), (72, 0.028), (88, 0.018), (91, 0.035), (96, 0.142)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.78794396 70 acl-2011-Clustering Comparable Corpora For Bilingual Lexicon Extraction

Author: Bo Li ; Eric Gaussier ; Akiko Aizawa

Abstract: We study in this paper the problem of enhancing the comparability of bilingual corpora in order to improve the quality of bilingual lexicons extracted from comparable corpora. We introduce a clustering-based approach for enhancing corpus comparability which exploits the homogeneity feature of the corpus, and finally preserves most of the vocabulary of the original corpus. Our experiments illustrate the well-foundedness of this method and show that the bilingual lexicons obtained from the homogeneous corpus are of better quality than the lexicons obtained with previous approaches.

2 0.70426971 153 acl-2011-How do you pronounce your name? Improving G2P with transliterations

Author: Aditya Bhargava ; Grzegorz Kondrak

Abstract: Grapheme-to-phoneme conversion (G2P) of names is an important and challenging problem. The correct pronunciation of a name is often reflected in its transliterations, which are expressed within a different phonological inventory. We investigate the problem of using transliterations to correct errors produced by state-of-the-art G2P systems. We present a novel re-ranking approach that incorporates a variety of score and n-gram features, in order to leverage transliterations from multiple languages. Our experiments demonstrate significant accuracy improvements when re-ranking is applied to n-best lists generated by three different G2P programs.

3 0.70085055 259 acl-2011-Rare Word Translation Extraction from Aligned Comparable Documents

Author: Emmanuel Prochasson ; Pascale Fung

Abstract: We present a first known result of high precision rare word bilingual extraction from comparable corpora, using aligned comparable documents and supervised classification. We incorporate two features, a context-vector similarity and a co-occurrence model between words in aligned documents in a machine learning approach. We test our hypothesis on different pairs of languages and corpora. We obtain very high F-Measure between 80% and 98% for recognizing and extracting correct translations for rare terms (from 1 to 5 occurrences). Moreover, we show that our system can be trained on a pair of languages and test on a different pair of languages, obtaining a F-Measure of 77% for the classification of Chinese-English translations using a training corpus of Spanish-French. Our method is therefore even potentially applicable to low resources languages without training data.

4 0.70065743 115 acl-2011-Engkoo: Mining the Web for Language Learning

Author: Matthew R. Scott ; Xiaohua Liu ; Ming Zhou ; Microsoft Engkoo Team

Abstract: This paper presents Engkoo, a system for exploring and learning language. It is built primarily by mining translation knowledge from billions of web pages - using the Internet to catch language in motion. Currently Engkoo is built for Chinese users who are learning English; however the technology itself is language independent and can be extended in the future. At a system level, Engkoo is an application platform that supports a multitude of NLP technologies such as cross language retrieval, alignment, sentence classification, and statistical machine translation. The data set that supports this system is primarily built from mining a massive set of bilingual terms and sentences from across the web. Specifically, web pages that contain both Chinese and English are discovered and analyzed for parallelism, extracted and formulated into clear term definitions and sample sentences. This approach allows us to build perhaps the world’s largest lexicon linking both Chinese and English together - at the same time covering the most up-to-date terms as captured by the net.

5 0.69581401 286 acl-2011-Social Network Extraction from Texts: A Thesis Proposal

Author: Apoorv Agarwal

Abstract: In my thesis, I propose to build a system that would enable extraction of social interactions from texts. To date I have defined a comprehensive set of social events and built a preliminary system that extracts social events from news articles. I plan to improve the performance of my current system by incorporating semantic information. Using domain adaptation techniques, I propose to apply my system to a wide range of genres. By extracting linguistic constructs relevant to social interactions, I will be able to empirically analyze different kinds of linguistic constructs that people use to express social interactions. Lastly, I will attempt to make convolution kernels more scalable and interpretable.

6 0.68917882 253 acl-2011-PsychoSentiWordNet

7 0.68577921 105 acl-2011-Dr Sentiment Knows Everything!

8 0.68145299 123 acl-2011-Exact Decoding of Syntactic Translation Models through Lagrangian Relaxation

9 0.65455508 333 acl-2011-Web-Scale Features for Full-Scale Parsing

10 0.63236284 258 acl-2011-Ranking Class Labels Using Query Sessions

11 0.62968504 137 acl-2011-Fine-Grained Class Label Markup of Search Queries

12 0.6189695 34 acl-2011-An Algorithm for Unsupervised Transliteration Mining with an Application to Word Alignment

13 0.61854482 36 acl-2011-An Efficient Indexer for Large N-Gram Corpora

14 0.61457336 271 acl-2011-Search in the Lost Sense of "Query": Question Formulation in Web Search Queries and its Temporal Changes

15 0.61191201 181 acl-2011-Jigs and Lures: Associating Web Queries with Structured Entities

16 0.60972244 182 acl-2011-Joint Annotation of Search Queries

17 0.60516226 193 acl-2011-Language-independent compound splitting with morphological operations

18 0.60288858 191 acl-2011-Knowledge Base Population: Successful Approaches and Challenges

19 0.6017375 240 acl-2011-ParaSense or How to Use Parallel Corpora for Word Sense Disambiguation

20 0.60166478 67 acl-2011-Clairlib: A Toolkit for Natural Language Processing, Information Retrieval, and Network Analysis