acl acl2011 acl2011-139 knowledge-graph by maker-knowledge-mining

139 acl-2011-From Bilingual Dictionaries to Interlingual Document Representations


Source: pdf

Author: Jagadeesh Jagarlamudi ; Hal Daume III ; Raghavendra Udupa

Abstract: Mapping documents into an interlingual representation can help bridge the language barrier of a cross-lingual corpus. Previous approaches use aligned documents as training data to learn an interlingual representation, making them sensitive to the domain of the training data. In this paper, we learn an interlingual representation in an unsupervised manner using only a bilingual dictionary. We first use the bilingual dictionary to find candidate document alignments and then use them to find an interlingual representation. Since the candidate alignments are noisy, we develop a robust learning algorithm to learn the interlingual representation. We show that bilingual dictionaries generalize to different domains better: our approach gives better performance than either a word by word translation method or Canonical Correlation Analysis (CCA) trained on a different domain.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract Mapping documents into an interlingual representation can help bridge the language barrier of a cross-lingual corpus. [sent-3, score-0.555]

2 Previous approaches use aligned documents as training data to learn an interlingual representation, making them sensitive to the domain of the training data. [sent-4, score-0.512]

3 In this paper, we learn an interlingual representation in an unsupervised manner using only a bilingual dictionary. [sent-5, score-0.464]

4 We first use the bilingual dictionary to find candidate document alignments and then use them to find an interlingual representation. [sent-6, score-0.932]

5 Since the candidate alignments are noisy, we develop a robust learning algorithm to learn the interlingual representation. [sent-7, score-0.411]

6 We show that bilingual dictionaries generalize to different domains better: our approach gives better performance than either a word by word translation method or Canonical Correlation Analysis (CCA) trained on a different domain. [sent-8, score-0.304]

7 1 Introduction The growth of text corpora in different languages poses an inherent problem of aligning documents across languages. [sent-9, score-0.297]

8 Obtaining an explicit alignment, or a different way of bridging the language barrier, is an important step in many natural language processing (NLP) applications such as: document retrieval (Gale and Church, 1991 ; Rapp, 1999; Ballesteros and Croft, 1996; Munteanu and Marcu, 2005; Vu et al. [sent-10, score-0.238]

9 Hal Daumé III (University of Maryland, College Park, USA) and Raghavendra Udupa (Microsoft Research India, Bangalore, India); hal@umiacs. [sent-16, score-0.09]

10 Aligning documents from different languages arises in all the above mentioned problems. [sent-19, score-0.251]

11 In this paper, we address this problem by mapping documents into a common subspace (interlingual representation) [1]. [sent-20, score-0.487]

12 This common subspace generalizes the notion of vector space model for cross-lingual applications (Turney and Pantel, 2010). [sent-21, score-0.279]

13 There are two major approaches for solving the document alignment problem, depending on the available resources. [sent-22, score-0.325]

14 The first approach, which is widely used in the Cross-lingual Information Retrieval (CLIR) literature, uses bilingual dictionaries to translate documents from one language (source) into another (target) language (Ballesteros and Croft, 1996; Pirkola et al. [sent-23, score-0.445]

15 Then standard measures such as cosine similarity are used to identify target language documents that are close to the translated document. [sent-25, score-0.208]

16 The second approach is to use training data of aligned document pairs to find a common subspace such that the aligned document pairs are maximally correlated (Susan T. [sent-26, score-0.753]

17 , each source language document is translated independently of other documents. [sent-35, score-0.18]

18 Moreover, after translation, the relationship of a given source document with the rest of the source documents is ignored. [sent-36, score-0.388]

19 [1] We use the phrases “common subspace” and “interlingual representation” interchangeably. On the other hand, supervised approaches use all the source and target language documents to infer an interlingual [sent-37, score-0.548]

20 representation, but their strong dependency on the training data prevents them from generalizing well to test documents from a different domain. (Page footer: Association for Computational Linguistics, short papers, pages 147–152, 2011.) [sent-40, score-0.208]

21 At a broad level, our approach uses bilingual dictionaries to identify initial noisy document alignments (Sec. [sent-42, score-0.618]

22 1) and then uses these noisy alignments as training data to learn a common subspace. [sent-44, score-0.239]

23 Since the alignments are noisy, we need a learning algorithm that is robust to the errors in the training data. [sent-45, score-0.118]

24 , 2009) and develop a supervised variant of it (Sec. [sent-48, score-0.142]

25 Our supervised variant learns to modify the within language document similarities according to the given alignments. [sent-51, score-0.491]

26 Since the original algorithm is unsupervised, we hope that its supervised variant is tolerant to errors in the candidate alignments. [sent-52, score-0.188]

27 The primary advantage of our method is that it does not use any training data and thus generalizes to test documents from different domains. [sent-53, score-0.208]

28 And unlike the dictionary based approaches, we use all the documents in computing the common subspace and thus achieve better accuracies compared to the approaches which translate documents in isolation. [sent-54, score-0.907]

29 First, we propose a discriminative technique to learn an interlingual representation using only a bilingual dictionary. [sent-56, score-0.464]

30 Second, we develop a supervised variant of Kernelized Sorting algorithm (Quadrianto et al. [sent-57, score-0.142]

31 , 2009) which learns to modify within language document similarities according to a given alignment. [sent-58, score-0.221]

32 2 Approach Given a cross-lingual corpus, with an underlying unknown document alignment, we propose a technique to recover the hidden alignment. [sent-59, score-0.263]

33 This is achieved by mapping documents into an interlingual representation. [sent-60, score-0.455]

34 In the first stage, we use a bilingual dictionary to find initial candidate noisy document alignments. [sent-62, score-0.65]

35 The second stage uses a robust learning algorithm to learn a common subspace from the noisy alignments identified in the first step. [sent-63, score-0.517]

36 Subsequently, we project all the documents into the common subspace and use maximal matching to recover the hidden alignment. [sent-64, score-0.57]

37 During this stage, we also learn mappings from the document spaces onto the common subspace. [sent-65, score-0.342]

38 These mappings can be used to convert any new document into the interlingual representation. [sent-66, score-0.551]

39 1 Noisy Document Alignments Translating documents from one language into another language and finding the nearest neighbours gives potential alignments. [sent-73, score-0.255]

40 Unfortunately, the resulting alignments may differ depending on the direction of the translation owing to the asymmetry of bilingual dictionaries and the nearest neighbour property. [sent-74, score-0.512]

41 In order to overcome this asymmetry, we first turn the documents in both languages into a bag of translation pairs representation. [sent-75, score-0.39]

42 Each translation pair of the bilingual dictionary (also referred to as a dictionary entry) is treated as a new feature. [sent-77, score-0.58]

43 Given a document, every word is replaced with the set of bilingual dictionary entries that it participates in. [sent-78, score-0.397]

44 If D represents the TFIDF weighted term × document matrix and T is a binary matrix of size (no. of dictionary entries) × (vocab size), then converting documents into a bag of dictionary entries is given by the linear operation $X^{(t)} \leftarrow TD$ [2]. [sent-79, score-0.554]

45 After converting the documents into bag of dictionary entries representation, we form a bipartite graph with the documents of each language as a separate set of nodes. [sent-80, score-0.575]
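The conversion described above can be sketched with NumPy. This is only a minimal illustration, not the authors' code: the toy vocabularies, the three dictionary entries, and the helper name `build_entry_matrix` are assumptions introduced here.

```python
import numpy as np

def build_entry_matrix(vocab, entries, side):
    """Binary matrix T (num_entries x vocab_size).

    T[e, w] = 1 if vocabulary word w participates in dictionary entry e
    on the given side (0 = source word, 1 = target word).
    """
    index = {word: i for i, word in enumerate(vocab)}
    T = np.zeros((len(entries), len(vocab)))
    for e, pair in enumerate(entries):
        word = pair[side]
        if word in index:
            T[e, index[word]] = 1.0
    return T

# Hypothetical toy dictionary: (source word, target word) entries.
entries = [("house", "casa"), ("home", "casa"), ("dog", "perro")]
en_vocab = ["house", "home", "dog", "the"]
es_vocab = ["casa", "perro", "el"]

# D: TFIDF-weighted term x document matrices (one column per document).
D_en = np.array([[1.0, 0.0], [0.5, 0.0], [0.0, 2.0], [0.1, 0.1]])
D_es = np.array([[1.5, 0.0], [0.0, 2.0], [0.1, 0.1]])

# Both languages now live in the same dictionary-entry space.
X_t = build_entry_matrix(en_vocab, entries, side=0) @ D_en
Y_t = build_entry_matrix(es_vocab, entries, side=1) @ D_es
```

Words that occur in no dictionary entry (here "the" and "el") simply drop out, and a word that occurs in several entries spreads its weight over all of them.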

46 The edge weight $W_{ij}$ between a pair of documents $x_i^{(t)}$ and $y_j^{(t)}$ (in source and target language respectively) is computed as the Euclidean distance between those documents in the dictionary space. [sent-81, score-0.588]

47 Let $\pi_{ij}$ indicate the likelihood that a source document $x_i^{(t)}$ is aligned to a target document $y_j^{(t)}$. [sent-82, score-0.237]

48 We want each document to align to at least one document from the other language. [sent-83, score-0.448]

49 Moreover, we want to encourage similar documents to align to each other. [sent-84, score-0.244]

50 [2] Superscript (t) indicates that the data is in the form of bag of dictionary entries. We can formulate this objective and the constraints as the following minimum cost flow problem (Ravindra et al. [sent-85, score-0.349]

51 , 1993): $$\arg\min_{\pi} \sum_{i,j=1}^{m,n} W_{ij}\,\pi_{ij} \quad \text{s.t.} \quad \forall i\ \sum_{j} \pi_{ij} = 1; \quad \forall j\ \sum_{i} \pi_{ij} = 1; \quad \forall i,j\ \ 0 \le \pi_{ij} \le C \qquad (1)$$ where C is some user chosen constant, m and n are the number of documents in source and target languages respectively. [sent-86, score-0.251]
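The problem in Eq. 1 is a small linear program, so it can be illustrated with a generic LP solver. The sketch below uses SciPy's `linprog` in place of a dedicated min-cost flow solver; the function name `candidate_alignments` and the toy matrices are assumptions made for illustration, and the code assumes both languages have the same number of documents.

```python
import numpy as np
from scipy.optimize import linprog
from scipy.spatial.distance import cdist

def candidate_alignments(X_t, Y_t, C=1.0):
    """Solve the relaxed alignment problem of Eq. 1 as a linear program.

    X_t, Y_t: (num_entries x num_docs) matrices in the dictionary space.
    Assumes m == n; otherwise the equality constraints cannot all hold.
    """
    m, n = X_t.shape[1], Y_t.shape[1]
    W = cdist(X_t.T, Y_t.T, metric="euclidean")  # W[i, j]: edge weight

    # Flatten pi row by row: variable index i * n + j stands for pi_ij.
    A_eq = np.zeros((m + n, m * n))
    for i in range(m):
        A_eq[i, i * n:(i + 1) * n] = 1.0   # sum_j pi_ij = 1
    for j in range(n):
        A_eq[m + j, j::n] = 1.0            # sum_i pi_ij = 1

    result = linprog(W.ravel(), A_eq=A_eq, b_eq=np.ones(m + n),
                     bounds=(0, C), method="highs")
    return result.x.reshape(m, n)

# Hypothetical toy data: two documents per language, three dictionary entries.
X_t = np.array([[1.0, 0.0], [0.5, 0.0], [0.0, 2.0]])
Y_t = np.array([[1.5, 0.0], [1.5, 0.0], [0.0, 2.0]])
pi = candidate_alignments(X_t, Y_t)
```

On this toy input the cheapest flow pairs each source document with the target document that shares its dictionary entries.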

52 2 Supervised Kernelized Sorting Kernelized Sorting is an unsupervised technique to align objects of different types, such as English and Spanish documents (Quadrianto et al. [sent-95, score-0.244]

53 The main advantage of this method is that it only uses the intra-language document similarities to identify the alignments across languages. [sent-98, score-0.369]

54 In this section, we describe a supervised variant of Kernelized Sorting which takes a set of candidate alignments and learns to modify the intra-language document similarities to respect the given alignment. [sent-99, score-0.655]

55 Since Kernelized Sorting does not rely on the inter-lingual document similarities at all, we hope that its supervised version is robust to noisy alignments. [sent-100, score-0.427]

56 Let $\Pi \in \{0, 1\}^{m \times n}$ denote the permutation matrix which captures the alignment between documents of different languages, i. [sent-104, score-0.397]

57 πij = 1 indicates documents xi and yj are aligned. [sent-106, score-0.3]

58 , 2005): $$\arg\max_{\Pi}\ \operatorname{tr}(K_x \Pi K_y \Pi^T) \qquad (2)$$ $$= \arg\max_{\Pi}\ \operatorname{tr}(X^T X\, \Pi\, Y^T Y\, \Pi^T) \qquad (3)$$ In our supervised version of Kernelized Sorting, we fix the permutation matrix (to say $\hat{\Pi}$) and modify the kernel matrices $K_x$ and $K_y$ so that the objective function is maximized for the given permutation. [sent-108, score-0.362]
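To make Eqs. 2 and 3 concrete, the following sketch evaluates the objective for one fixed permutation and checks that, with linear kernels, it equals the squared Frobenius norm of $X \Pi Y^T$. The random data, the helper name `ks_objective`, and the convention that columns are documents are assumptions made here for illustration.

```python
import numpy as np

def ks_objective(K_x, K_y, Pi):
    """Value of tr(K_x Pi K_y Pi^T) for a given alignment matrix Pi."""
    return float(np.trace(K_x @ Pi @ K_y @ Pi.T))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))   # 5 features, 3 source documents (columns)
Y = rng.normal(size=(4, 3))   # 4 features, 3 target documents (columns)
Pi = np.eye(3)[[2, 0, 1]]     # one particular permutation matrix

# Linear kernels, as in Eq. 3: K_x = X^T X and K_y = Y^T Y.
via_kernels = ks_objective(X.T @ X, Y.T @ Y, Pi)

# By the cyclic property of the trace this equals ||X Pi Y^T||_F^2.
via_features = float(np.linalg.norm(X @ Pi @ Y.T, "fro") ** 2)
```

The second form only touches the feature matrices, which is what makes it natural to insert mappings for each language once the permutation is held fixed.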

59 Specifically, we find a mapping for each language, such that when the documents are projected into their common subspaces they are more likely to respect the alignment given by $\hat{\Pi}$. [sent-109, score-0.302]

60 Subsequently, the test documents are also projected into the common subspace and we return the nearest neighbors as the aligned pairs. [sent-110, score-0.591]

61 Let U and V be the mappings for the required subspace in both the languages, then we want to solve the following optimization problem: $$\arg\max_{U,V}\ \operatorname{tr}(X^T U U^T X\, \hat{\Pi}\, Y^T V V^T Y\, \hat{\Pi}^T)$$ s. [sent-111, score-0.467]

62 $U^T U = I$ & $V^T V = I$ (4) where I is an identity matrix of appropriate size. [sent-113, score-0.086]

63 For brevity, let $C_{xy}$ denote the cross-covariance matrix (i. [sent-114, score-0.086]

64 $U^T U = I$ & $V^T V = I$ (5) We have used the cyclic property of the trace function while rewriting Eq. [sent-118, score-0.122]
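The rewriting via the cyclic property of the trace can be sketched as follows. The derivation assumes that the cross-covariance matrix is $C_{xy} = X \hat{\Pi} Y^T$, which the truncated definition above does not spell out.

```latex
\begin{align*}
\operatorname{tr}\!\left(X^T U U^T X \hat{\Pi} Y^T V V^T Y \hat{\Pi}^T\right)
  &= \operatorname{tr}\!\left(U^T X \hat{\Pi} Y^T \, V V^T \, Y \hat{\Pi}^T X^T U\right) \\
  &= \operatorname{tr}\!\left(U^T C_{xy} V V^T C_{xy}^T U\right),
  \qquad C_{xy} = X \hat{\Pi} Y^T .
\end{align*}
```

Under this assumption, holding V fixed at $V_0$ turns the problem into maximizing $\operatorname{tr}(U^T C_{xy} V_0 V_0^T C_{xy}^T U)$ subject to $U^T U = I$, whose solution is given by the leading eigenvectors of $C_{xy} V_0 V_0^T C_{xy}^T$.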

65 Similarly, fixing U (to $U_0$) and solving the optimization problem for V results in: $$C_{xy}^T U_0 U_0^T C_{xy}\, V = \lambda_v V \qquad (7)$$ In the special case where both $V_0 V_0^T$ and $U_0 U_0^T$ are identity matrices, the above equations reduce to $C_{xy} C_{xy}^T U = \lambda_u U$ and $C_{xy}^T C_{xy} V = \lambda_v V$. [sent-123, score-0.169]

66 In this particular case, we can simultaneously solve for both U and V using Singular Value Decomposition (SVD) as: $$U S V^T = C_{xy} \qquad (8)$$ So for the first iteration, we do the SVD of the cross-covariance matrix and get the mappings. [sent-124, score-0.148]
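A minimal NumPy sketch of this first-iteration step is shown below. It assumes the cross-covariance matrix is $C_{xy} = X \hat{\Pi} Y^T$ with documents stored as columns; the function name `learn_mappings`, the subspace size `k`, and the random toy data are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def learn_mappings(X, Y, Pi_hat, k):
    """Return mappings U (d_x x k) and V (d_y x k) from the SVD of C_xy.

    X: d_x x m source documents, Y: d_y x n target documents (columns),
    Pi_hat: m x n (relaxed) alignment matrix.
    """
    C_xy = X @ Pi_hat @ Y.T                      # assumed cross-covariance
    U, s, Vt = np.linalg.svd(C_xy, full_matrices=False)
    return U[:, :k], Vt[:k, :].T                 # top-k singular vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))     # 4 source documents with 5 features
Y = rng.normal(size=(6, 4))     # 4 target documents with 6 features
Pi_hat = np.eye(4)              # toy alignment: document i <-> document i

U, V = learn_mappings(X, Y, Pi_hat, k=3)

# Project both languages into the common k-dimensional subspace.
X_proj = U.T @ X
Y_proj = V.T @ Y
```

The same `U` and `V` can be applied to unseen documents, which is how new documents would be converted into the interlingual representation.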

67 For the subsequent iterations, we use the mappings found by the previous iteration, as U0 and V0, and solve Eqs. [sent-125, score-0.186]

68 3 Summary In this section, we describe our procedure to recover document alignments. [sent-128, score-0.263]

69 We first convert documents into bag of dictionary entries representation (Sec. [sent-129, score-0.556]

70 We use the LEMON [3] graph library to solve the min-cost flow problem. [sent-134, score-0.111]

71 This step gives us the πij values for every cross-lingual document pair. [sent-135, score-0.18]

72 We use them to form a relaxed permutation matrix ($\hat{\Pi}$) which is, subsequently, used to find the mappings (U and V) for the documents of both the languages (i. [sent-136, score-0.508]

73 We use these mappings to project both source and target language documents into the common subspace and then solve the bipartite matching problem to recover the alignment. [sent-140, score-0.823]
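The final matching step can be sketched with SciPy's `linear_sum_assignment`. The choice of cosine similarity in the common subspace, the helper name `recover_alignment`, and the toy projections are assumptions made here; the text above only states that a bipartite matching problem is solved.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def recover_alignment(X_proj, Y_proj, eps=1e-12):
    """One-to-one matching between projected documents (columns).

    Uses cosine similarity as the edge score (an assumption) and returns
    a dict mapping each source index to its matched target index.
    """
    Xn = X_proj / (np.linalg.norm(X_proj, axis=0, keepdims=True) + eps)
    Yn = Y_proj / (np.linalg.norm(Y_proj, axis=0, keepdims=True) + eps)
    similarity = Xn.T @ Yn                      # m x n score matrix
    rows, cols = linear_sum_assignment(similarity, maximize=True)
    return dict(zip(rows.tolist(), cols.tolist()))

# Hypothetical projections: two documents per language in a 2-d subspace.
X_proj = np.array([[1.0, 0.0],
                   [0.0, 1.0]])
Y_proj = np.array([[0.1, 0.9],
                   [0.9, 0.2]])
alignment = recover_alignment(X_proj, Y_proj)
```

Because the assignment is solved globally, a document is not simply given its nearest neighbour; each target document is used at most once.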

74 3 Experiments For evaluation, we choose 2500 aligned document pairs from Wikipedia in English-Spanish and English-German language pairs. [sent-141, score-0.237]

75 Subsequently we convert the documents into TFIDF weighted vectors. [sent-145, score-0.208]

76 The bilingual dictionaries for both the language pairs are generated by running Giza++ (Och and Ney, 2003) on the Europarl data (Koehn, 2005). [sent-146, score-0.237]

77 [3] hu/trac/lemon Wikipedia documents in English-Spanish and English-German language pairs. [sent-150, score-0.208]

78 For CCA, we regularize the within language covariance matrices as $(1-\lambda)XX^T + \lambda I$ and the regularization parameter λ value is also shown. [sent-151, score-0.126]
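The regularization formula above can be illustrated in a few lines of NumPy. The function name `regularized_covariance` and the rank-deficient toy matrix are assumptions for illustration only.

```python
import numpy as np

def regularized_covariance(X, lam):
    """Return (1 - lam) * X X^T + lam * I for a (features x docs) matrix X."""
    d = X.shape[0]
    return (1.0 - lam) * (X @ X.T) + lam * np.eye(d)

# Rank-deficient toy data: 3 features, 2 documents, so X X^T is singular.
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [0.0, 0.0]])
C = regularized_covariance(X, lam=0.3)

# Every eigenvalue is now at least lam, so C is safely invertible.
smallest = float(np.linalg.eigvalsh(C).min())
```

Setting `lam` to 0 recovers the unregularized matrix, while `lam` equal to 1 replaces it with the identity.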

79 We compare our approach with a dictionary based approach, such as word-by-word translation, and supervised approaches, such as CCA (Vinokourov et al. [sent-155, score-0.265]

80 Word-by-word translation and our approach use bilingual dictionary while CCA and OPCA use a training corpus of aligned documents. [sent-158, score-0.465]

81 Since the bilingual dictionary is learnt from Europarl data set, for a fair comparison, we train supervised approaches on 3000 document pairs from Europarl data set. [sent-159, score-0.614]

82 For all the systems, we construct a bipartite graph between the documents of different languages, with edge weight being the cross-lingual similarity given by the respective method and then find maximal matching (Jonker and Volgenant, 1987). [sent-164, score-0.275]

83 For comparison purposes, we trained and tested CCA on documents from same domain (Wikipedia). [sent-167, score-0.208]

84 For both the language pairs, our model performed better than word-by-word translation method and competitively with the supervised approaches. [sent-170, score-0.16]

85 From the results, we see that solving a relaxed version of the problem gives better accuracies but the improvements are marginal (especially for English-German). [sent-175, score-0.129]

86 Thus, the improved performance of our system compared to word-by-word translation shows the effectiveness of the supervised Kernelized sorting. [sent-179, score-0.16]

87 Except, we use a cross-covariance matrix instead of a term document matrix. [sent-182, score-0.266]

88 Efficient algorithms exist for solving SVD on arbitrarily large matrices, which makes our approach scalable to large data sets (Warmuth and Kuzmin, 2006). [sent-183, score-0.089]

89 8, the mappings U and V can be improved by iteratively solving the Eqs. [sent-185, score-0.213]

90 But it leads the mappings to fit the noisy alignments exactly, so in this paper we stop after solving the SVD problem. [sent-187, score-0.414]

91 The extension of our approach to the situation with different number of documents on each side is straightforward. [sent-188, score-0.208]

92 In this case, the input to the bipartite matching problem is modified by adding dummy documents to the language that has fewer documents and assigning a very high score to edges that connect to the dummy documents. [sent-190, score-0.557]
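The dummy-document trick can be sketched as padding the score matrix to a square before matching. The sketch below treats edge weights as costs to be minimized, so "a very high score" is read as a very high cost; that reading, the helper name `match_with_dummies`, and the toy costs are assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_with_dummies(cost, big=1e6):
    """Pad an m x n cost matrix with dummy documents, then match.

    Dummy rows or columns get a very high cost on every edge. Pairs that
    involve a dummy document are dropped from the returned list.
    """
    m, n = cost.shape
    size = max(m, n)
    padded = np.full((size, size), big)
    padded[:m, :n] = cost
    rows, cols = linear_sum_assignment(padded)
    return [(int(i), int(j)) for i, j in zip(rows, cols) if i < m and j < n]

# Hypothetical costs: 2 source documents, 3 target documents.
cost = np.array([[0.1, 0.9, 0.8],
                 [0.7, 0.2, 0.9]])
pairs = match_with_dummies(cost)
```

Every real document on the smaller side still receives a real partner, and the surplus target document is absorbed by the dummy.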

93 5 Conclusion In this paper we have presented an approach to recover document alignments from comparable corpora using a bilingual dictionary. [sent-191, score-0.55]

94 First, we use the bilingual dictionary to find a set of candidate noisy alignments. [sent-192, score-0.47]

95 These noisy alignments are then fed into supervised Kernelized Sorting, which learns to modify within language document similarities to respect the given alignments. [sent-193, score-0.643]

96 The first step uses cross-lingual cues available in the form of a bilingual dictionary and the latter step exploits document structure captured in terms of within language document similarities. [sent-195, score-0.701]

97 Experimental results show that our approach performs better than dictionary based approaches such as a word-by-word translation and is also competitive with supervised approaches like CCA and OPCA. [sent-196, score-0.332]

98 Name translation in statistical machine translation learning when to transliterate. [sent-246, score-0.134]

99 Weakly supervised named entity transliteration and discovery from multilingual comparable corpora. [sent-271, score-0.174]

100 Feature-based method for document alignment in comparable news corpora. [sent-362, score-0.236]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('kernelized', 0.313), ('interlingual', 0.247), ('sorting', 0.241), ('subspace', 0.241), ('cca', 0.235), ('documents', 0.208), ('document', 0.18), ('dictionary', 0.172), ('bilingual', 0.169), ('ij', 0.14), ('mappings', 0.124), ('alignments', 0.118), ('quadrianto', 0.096), ('daum', 0.096), ('supervised', 0.093), ('jagadeesh', 0.09), ('hal', 0.09), ('solving', 0.089), ('cxy', 0.089), ('opca', 0.089), ('ravindra', 0.089), ('udupa', 0.089), ('vinokourov', 0.089), ('matrix', 0.086), ('kx', 0.086), ('noisy', 0.083), ('recover', 0.083), ('ky', 0.082), ('matrices', 0.081), ('jonker', 0.078), ('gretton', 0.078), ('platt', 0.076), ('ballesteros', 0.072), ('bag', 0.072), ('europarl', 0.072), ('similarities', 0.071), ('dictionaries', 0.068), ('translation', 0.067), ('bipartite', 0.067), ('solve', 0.062), ('asis', 0.059), ('bousquet', 0.059), ('jagaralmudi', 0.059), ('pirkola', 0.059), ('raghavendra', 0.059), ('utu', 0.059), ('vtv', 0.059), ('warmuth', 0.059), ('yty', 0.059), ('jagarlamudi', 0.059), ('argm', 0.059), ('tfidf', 0.059), ('retrieval', 0.058), ('aligned', 0.057), ('stroudsburg', 0.057), ('entries', 0.056), ('alignment', 0.056), ('modify', 0.055), ('svd', 0.054), ('yj', 0.054), ('xtx', 0.052), ('volgenant', 0.052), ('barrier', 0.052), ('rai', 0.052), ('ument', 0.052), ('subsequently', 0.052), ('variant', 0.049), ('flow', 0.049), ('representation', 0.048), ('hermjakob', 0.048), ('olivier', 0.048), ('vu', 0.048), ('gao', 0.048), ('permutation', 0.047), ('nearest', 0.047), ('aligning', 0.046), ('candidate', 0.046), ('trace', 0.045), ('umd', 0.045), ('regularize', 0.045), ('tr', 0.045), ('transliteration', 0.043), ('languages', 0.043), ('asymmetry', 0.043), ('mimno', 0.043), ('learns', 0.043), ('munteanu', 0.041), ('dfo', 0.041), ('optimization', 0.04), ('accuracies', 0.04), ('cyclic', 0.04), ('fixing', 0.04), ('neural', 0.039), ('common', 0.038), ('xi', 0.038), ('multilingual', 0.038), ('dummy', 0.037), ('rewriting', 0.037), ('stage', 0.037), ('align', 0.036)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 139 acl-2011-From Bilingual Dictionaries to Interlingual Document Representations

Author: Jagadeesh Jagarlamudi ; Hal Daume III ; Raghavendra Udupa

Abstract: Mapping documents into an interlingual representation can help bridge the language barrier of a cross-lingual corpus. Previous approaches use aligned documents as training data to learn an interlingual representation, making them sensitive to the domain of the training data. In this paper, we learn an interlingual representation in an unsupervised manner using only a bilingual dictionary. We first use the bilingual dictionary to find candidate document alignments and then use them to find an interlingual representation. Since the candidate alignments are noisy, we develop a robust learning algorithm to learn the interlingual representation. We show that bilingual dictionaries generalize to different domains better: our approach gives better performance than either a word by word translation method or Canonical Correlation Analysis (CCA) trained on a different domain.

2 0.24802259 104 acl-2011-Domain Adaptation for Machine Translation by Mining Unseen Words

Author: Hal Daume III ; Jagadeesh Jagarlamudi

Abstract: We show that unseen words account for a large part of the translation error when moving to new domains. Using an extension of a recent approach to mining translations from comparable corpora (Haghighi et al., 2008), we are able to find translations for otherwise OOV terms. We show several approaches to integrating such translations into a phrase-based translation system, yielding consistent improvements in translation quality (between 0.5 and 1.5 Bleu points) on four domains and two language pairs.

3 0.15929359 259 acl-2011-Rare Word Translation Extraction from Aligned Comparable Documents

Author: Emmanuel Prochasson ; Pascale Fung

Abstract: We present a first known result of high precision rare word bilingual extraction from comparable corpora, using aligned comparable documents and supervised classification. We incorporate two features, a context-vector similarity and a co-occurrence model between words in aligned documents in a machine learning approach. We test our hypothesis on different pairs of languages and corpora. We obtain very high F-Measure between 80% and 98% for recognizing and extracting correct translations for rare terms (from 1 to 5 occurrences). Moreover, we show that our system can be trained on a pair of languages and tested on a different pair of languages, obtaining an F-Measure of 77% for the classification of Chinese-English translations using a training corpus of Spanish-French. Our method is therefore even potentially applicable to low resource languages without training data.

4 0.13312505 70 acl-2011-Clustering Comparable Corpora For Bilingual Lexicon Extraction

Author: Bo Li ; Eric Gaussier ; Akiko Aizawa

Abstract: We study in this paper the problem of enhancing the comparability of bilingual corpora in order to improve the quality of bilingual lexicons extracted from comparable corpora. We introduce a clustering-based approach for enhancing corpus comparability which exploits the homogeneity feature of the corpus, and finally preserves most of the vocabulary of the original corpus. Our experiments illustrate the well-foundedness of this method and show that the bilingual lexicons obtained from the homogeneous corpus are of better quality than the lexicons obtained with previous approaches.

5 0.13118073 1 acl-2011-(11-06-spirl)

Author: (hal)

Abstract: unkown-abstract

6 0.10444549 115 acl-2011-Engkoo: Mining the Web for Language Learning

7 0.10317253 161 acl-2011-Identifying Word Translations from Comparable Corpora Using Latent Topic Models

8 0.10121407 57 acl-2011-Bayesian Word Alignment for Statistical Machine Translation

9 0.099623084 204 acl-2011-Learning Word Vectors for Sentiment Analysis

10 0.096517809 152 acl-2011-How Much Can We Gain from Supervised Word Alignment?

11 0.092551127 276 acl-2011-Semi-Supervised SimHash for Efficient Document Similarity Search

12 0.092097968 18 acl-2011-A Latent Topic Extracting Method based on Events in a Document and its Application

13 0.087472498 34 acl-2011-An Algorithm for Unsupervised Transliteration Mining with an Application to Word Alignment

14 0.083882168 325 acl-2011-Unsupervised Word Alignment with Arbitrary Features

15 0.08374884 183 acl-2011-Joint Bilingual Sentiment Classification with Unlabeled Parallel Corpora

16 0.081558846 331 acl-2011-Using Large Monolingual and Bilingual Corpora to Improve Coordination Disambiguation

17 0.079409048 327 acl-2011-Using Bilingual Parallel Corpora for Cross-Lingual Textual Entailment

18 0.077325329 245 acl-2011-Phrase-Based Translation Model for Question Retrieval in Community Question Answer Archives

19 0.076754354 221 acl-2011-Model-Based Aligner Combination Using Dual Decomposition

20 0.076675214 240 acl-2011-ParaSense or How to Use Parallel Corpora for Word Sense Disambiguation


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.222), (1, -0.02), (2, 0.024), (3, 0.156), (4, 0.044), (5, -0.045), (6, 0.03), (7, 0.069), (8, 0.005), (9, 0.056), (10, 0.078), (11, 0.042), (12, 0.08), (13, 0.006), (14, 0.026), (15, -0.0), (16, 0.089), (17, 0.012), (18, 0.072), (19, -0.141), (20, 0.009), (21, -0.088), (22, 0.043), (23, 0.033), (24, -0.092), (25, -0.019), (26, -0.108), (27, 0.031), (28, 0.072), (29, -0.129), (30, 0.069), (31, 0.029), (32, 0.104), (33, -0.026), (34, -0.056), (35, -0.006), (36, -0.045), (37, 0.072), (38, -0.007), (39, 0.047), (40, -0.102), (41, 0.069), (42, 0.027), (43, 0.014), (44, 0.061), (45, -0.059), (46, -0.025), (47, -0.038), (48, 0.1), (49, 0.059)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94996876 139 acl-2011-From Bilingual Dictionaries to Interlingual Document Representations

Author: Jagadeesh Jagarlamudi ; Hal Daume III ; Raghavendra Udupa

Abstract: Mapping documents into an interlingual representation can help bridge the language barrier of a cross-lingual corpus. Previous approaches use aligned documents as training data to learn an interlingual representation, making them sensitive to the domain of the training data. In this paper, we learn an interlingual representation in an unsupervised manner using only a bilingual dictionary. We first use the bilingual dictionary to find candidate document alignments and then use them to find an interlingual representation. Since the candidate alignments are noisy, we develop a robust learning algorithm to learn the interlingual representation. We show that bilingual dictionaries generalize to different domains better: our approach gives better performance than either a word by word translation method or Canonical Correlation Analysis (CCA) trained on a different domain.

2 0.77422607 70 acl-2011-Clustering Comparable Corpora For Bilingual Lexicon Extraction

Author: Bo Li ; Eric Gaussier ; Akiko Aizawa

Abstract: We study in this paper the problem of enhancing the comparability of bilingual corpora in order to improve the quality of bilingual lexicons extracted from comparable corpora. We introduce a clustering-based approach for enhancing corpus comparability which exploits the homogeneity feature of the corpus, and finally preserves most of the vocabulary of the original corpus. Our experiments illustrate the well-foundedness of this method and show that the bilingual lexicons obtained from the homogeneous corpus are of better quality than the lexicons obtained with previous approaches.

3 0.70985872 259 acl-2011-Rare Word Translation Extraction from Aligned Comparable Documents

Author: Emmanuel Prochasson ; Pascale Fung

Abstract: We present a first known result of high precision rare word bilingual extraction from comparable corpora, using aligned comparable documents and supervised classification. We incorporate two features, a context-vector similarity and a co-occurrence model between words in aligned documents in a machine learning approach. We test our hypothesis on different pairs of languages and corpora. We obtain very high F-Measure between 80% and 98% for recognizing and extracting correct translations for rare terms (from 1 to 5 occurrences). Moreover, we show that our system can be trained on a pair of languages and tested on a different pair of languages, obtaining an F-Measure of 77% for the classification of Chinese-English translations using a training corpus of Spanish-French. Our method is therefore even potentially applicable to low resource languages without training data.

4 0.65462679 115 acl-2011-Engkoo: Mining the Web for Language Learning

Author: Matthew R. Scott ; Xiaohua Liu ; Ming Zhou ; Microsoft Engkoo Team

Abstract: This paper presents Engkoo, a system for exploring and learning language. It is built primarily by mining translation knowledge from billions of web pages - using the Internet to catch language in motion. Currently Engkoo is built for Chinese users who are learning English; however the technology itself is language independent and can be extended in the future. At a system level, Engkoo is an application platform that supports a multitude of NLP technologies such as cross language retrieval, alignment, sentence classification, and statistical machine translation. The data set that supports this system is primarily built from mining a massive set of bilingual terms and sentences from across the web. Specifically, web pages that contain both Chinese and English are discovered and analyzed for parallelism, extracted and formulated into clear term definitions and sample sentences. This approach allows us to build perhaps the world’s largest lexicon linking both Chinese and English together - at the same time covering the most up-to-date terms as captured by the net.

5 0.65318459 104 acl-2011-Domain Adaptation for Machine Translation by Mining Unseen Words

Author: Hal Daume III ; Jagadeesh Jagarlamudi

Abstract: We show that unseen words account for a large part of the translation error when moving to new domains. Using an extension of a recent approach to mining translations from comparable corpora (Haghighi et al., 2008), we are able to find translations for otherwise OOV terms. We show several approaches to integrating such translations into a phrase-based translation system, yielding consistent improvements in translation quality (between 0.5 and 1.5 Bleu points) on four domains and two language pairs.

6 0.63582522 311 acl-2011-Translationese and Its Dialects

7 0.60631251 1 acl-2011-(11-06-spirl)

8 0.58431256 80 acl-2011-ConsentCanvas: Automatic Texturing for Improved Readability in End-User License Agreements

9 0.55055952 323 acl-2011-Unsupervised Part-of-Speech Tagging with Bilingual Graph-Based Projections

10 0.54055518 240 acl-2011-ParaSense or How to Use Parallel Corpora for Word Sense Disambiguation

11 0.53574628 161 acl-2011-Identifying Word Translations from Comparable Corpora Using Latent Topic Models

12 0.52639461 331 acl-2011-Using Large Monolingual and Bilingual Corpora to Improve Coordination Disambiguation

13 0.512263 325 acl-2011-Unsupervised Word Alignment with Arbitrary Features

14 0.50539219 303 acl-2011-Tier-based Strictly Local Constraints for Phonology

15 0.49343348 183 acl-2011-Joint Bilingual Sentiment Classification with Unlabeled Parallel Corpora

16 0.49165672 248 acl-2011-Predicting Clicks in a Vocabulary Learning System

17 0.49054408 335 acl-2011-Why Initialization Matters for IBM Model 1: Multiple Optima and Non-Strict Convexity

18 0.489647 212 acl-2011-Local Histograms of Character N-grams for Authorship Attribution

19 0.48795274 151 acl-2011-Hindi to Punjabi Machine Translation System

20 0.48415267 157 acl-2011-I Thou Thee, Thou Traitor: Predicting Formal vs. Informal Address in English Literature


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(17, 0.026), (26, 0.025), (37, 0.077), (39, 0.037), (41, 0.537), (55, 0.027), (59, 0.022), (72, 0.03), (91, 0.029), (96, 0.106)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.95198309 83 acl-2011-Contrasting Multi-Lingual Prosodic Cues to Predict Verbal Feedback for Rapport

Author: Siwei Wang ; Gina-Anne Levow

Abstract: Verbal feedback is an important information source in establishing interactional rapport. However, predicting verbal feedback across languages is challenging due to language-specific differences, inter-speaker variation, and the relative sparseness and optionality of verbal feedback. In this paper, we employ an approach combining classifier weighting and SMOTE algorithm oversampling to improve verbal feedback prediction in Arabic, English, and Spanish dyadic conversations. This approach improves the prediction of verbal feedback, up to 6-fold, while maintaining a high overall accuracy. Analyzing highly weighted features highlights widespread use of pitch, with more varied use of intensity and duration.

2 0.94646132 328 acl-2011-Using Cross-Entity Inference to Improve Event Extraction

Author: Yu Hong ; Jianfeng Zhang ; Bin Ma ; Jianmin Yao ; Guodong Zhou ; Qiaoming Zhu

Abstract: Event extraction is the task of detecting certain specified types of events that are mentioned in the source language data. The state-of-the-art research on the task is transductive inference (e.g. cross-event inference). In this paper, we propose a new method of event extraction that makes good use of cross-entity inference. In contrast to previous inference methods, we regard entity-type consistency as a key feature to predict event mentions. We adopt this inference method to improve the traditional sentence-level event extraction system. Experiments show that we can get 8.6% gain in trigger (event) identification, and more than 11.8% gain for argument (role) classification in ACE event extraction.

3 0.92820102 219 acl-2011-Metagrammar engineering: Towards systematic exploration of implemented grammars

Author: Antske Fokkens

Abstract: When designing grammars of natural language, typically, more than one formal analysis can account for a given phenomenon. Moreover, because analyses interact, the choices made by the engineer influence the possibilities available in further grammar development. The order in which phenomena are treated may therefore have a major impact on the resulting grammar. This paper proposes to tackle this problem by using metagrammar development as a methodology for grammar engineering. I argue that metagrammar engineering as an approach facilitates the systematic exploration of grammars through comparison of competing analyses. The idea is illustrated through a comparative study of auxiliary structures in HPSG-based grammars for German and Dutch. Auxiliaries form a central phenomenon of German and Dutch and are likely to influence many components of the grammar. This study shows that a special auxiliary+verb construction significantly improves efficiency compared to the standard argument-composition analysis for both parsing and generation.

4 0.91919738 189 acl-2011-K-means Clustering with Feature Hashing

Author: Hajime Senuma

Abstract: One of the major problems of K-means is that one must use dense vectors for its centroids, and therefore it is infeasible to store such huge vectors in memory when the feature space is high-dimensional. We address this issue by using feature hashing (Weinberger et al., 2009), a dimension-reduction technique, which can reduce the size of dense vectors while retaining sparsity of sparse vectors. Our analysis gives theoretical motivation and justification for applying feature hashing to K-means, by showing how much the objective of K-means will be (additively) distorted. Furthermore, to empirically verify our method, we experimented on a document clustering task.

same-paper 5 0.9187932 139 acl-2011-From Bilingual Dictionaries to Interlingual Document Representations

Author: Jagadeesh Jagarlamudi ; Hal Daume III ; Raghavendra Udupa

Abstract: Mapping documents into an interlingual representation can help bridge the language barrier of a cross-lingual corpus. Previous approaches use aligned documents as training data to learn an interlingual representation, making them sensitive to the domain of the training data. In this paper, we learn an interlingual representation in an unsupervised manner using only a bilingual dictionary. We first use the bilingual dictionary to find candidate document alignments and then use them to find an interlingual representation. Since the candidate alignments are noisy, we develop a robust learning algorithm to learn the interlingual representation. We show that bilingual dictionaries generalize to different domains better: our approach gives better performance than either a word-by-word translation method or Canonical Correlation Analysis (CCA) trained on a different domain.

6 0.91706574 143 acl-2011-Getting the Most out of Transition-based Dependency Parsing

7 0.9124276 232 acl-2011-Nonparametric Bayesian Machine Transliteration with Synchronous Adaptor Grammars

8 0.902035 56 acl-2011-Bayesian Inference for Zodiac and Other Homophonic Ciphers

9 0.83309561 185 acl-2011-Joint Identification and Segmentation of Domain-Specific Dialogue Acts for Conversational Dialogue Systems

10 0.6352818 65 acl-2011-Can Document Selection Help Semi-supervised Learning? A Case Study On Event Extraction

11 0.63489771 94 acl-2011-Deciphering Foreign Language

12 0.62015623 196 acl-2011-Large-Scale Cross-Document Coreference Using Distributed Inference and Hierarchical Models

13 0.61717266 223 acl-2011-Modeling Wisdom of Crowds Using Latent Mixture of Discriminative Experts

14 0.60008878 33 acl-2011-An Affect-Enriched Dialogue Act Classification Model for Task-Oriented Dialogue

15 0.59240955 135 acl-2011-Faster and Smaller N-Gram Language Models

16 0.58978045 12 acl-2011-A Generative Entity-Mention Model for Linking Entities with Knowledge Base

17 0.57936859 244 acl-2011-Peeling Back the Layers: Detecting Event Role Fillers in Secondary Contexts

18 0.57610065 40 acl-2011-An Error Analysis of Relation Extraction in Social Media Documents

19 0.57201183 316 acl-2011-Unary Constraints for Efficient Context-Free Parsing

20 0.56782466 58 acl-2011-Beam-Width Prediction for Efficient Context-Free Parsing