emnlp emnlp2010 emnlp2010-79 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Gae-won You ; Seung-won Hwang ; Young-In Song ; Long Jiang ; Zaiqing Nie
Abstract: This paper studies the problem of mining entity translation, specifically, mining English and Chinese name pairs. Existing efforts can be categorized into (a) a transliteration-based approach leveraging phonetic similarity and (b) a corpus-based approach exploiting bilingual co-occurrences, each of which suffers from inaccuracy and scarcity respectively. In clear contrast, we use unleveraged resources of monolingual entity co-occurrences, crawled from entity search engines, represented as two entity-relationship graphs extracted from two language corpora respectively. Our problem is then abstracted as finding correct mappings across two graphs. To achieve this goal, we propose a holistic approach, of exploiting both transliteration similarity and monolingual co-occurrences. This approach, building upon monolingual corpora, complements existing corpus-based work, requiring scarce resources of parallel or comparable corpus, while significantly boosting the accuracy of transliteration-based work. We validate our proposed system using real-life datasets.
Reference: text
sentIndex sentText sentNum sentScore
1 This paper studies the problem of mining entity translation, specifically, mining English and Chinese name pairs. [sent-4, score-0.311]
2 Existing efforts can be categorized into (a) a transliteration-based approach leveraging phonetic similarity and (b) a corpus-based approach exploiting bilingual co-occurrences, each of which suffers from inaccuracy and scarcity respectively. [sent-5, score-0.479]
3 In clear contrast, we use unleveraged resources of monolingual entity co-occurrences, crawled from entity search engines, represented as two entity-relationship graphs extracted from two language corpora respectively. [sent-6, score-0.543]
4 To achieve this goal, we propose a holistic approach, of exploiting both transliteration similarity and monolingual co-occurrences. [sent-8, score-0.609]
5 This approach, building upon monolingual corpora, complements existing corpus-based work, requiring scarce resources of parallel or comparable corpus, while significantly boosting the accuracy of transliteration-based work. [sent-9, score-0.304]
6 1 Introduction Entity translation aims at mapping the entity names (e. [sent-11, score-0.291]
7 While high quality entity translation is essential in cross-lingual information access and translation, (Footnote ∗: This work was done when the first two authors visited Microsoft Research Asia.) [sent-14, score-0.201]
8 it is non-trivial to achieve, due to the challenge that entity translation, though typically bearing pronunciation similarity, can also be arbitrary, e. [sent-15, score-0.214]
9 Transliteration-based approaches (Wan and Verspoor, 1998; Knight and Graehl, 1998) identify translations based on pronunciation similarity, while corpus-based approaches mine bilingual co-occurrences of translation pairs obtained from parallel (Kupiec, 1993; Feng et al. [sent-20, score-0.624]
10 , 2004) or comparable (Fung and Yee, 1998) corpora, or alternatively mined from bilingual sentences (Lin et al. [sent-21, score-0.164]
11 These two approaches have complementary strengths: transliteration-based similarity can be computed for any name pair but cannot mine translations with little (or no) phonetic similarity. [sent-24, score-0.485]
12 Corpus-based similarity can support arbitrary translations, but requires highly scarce resources of bilingual co-occurrences, obtained from parallel or comparable bilingual corpora. [sent-25, score-0.621]
13 In this paper, we propose a holistic approach, leveraging both transliteration- and corpus-based similarity. [sent-26, score-0.179]
14 Our key contribution is to replace the use of scarce resources of bilingual co-occurrences with the use of untapped and significantly larger resources of monolingual co-occurrences for translation. [sent-27, score-0.35]
15 In particular, we extract monolingual co-occurrences of entities from English and Chinese Web corpora, which are readily available from entity search engines such as PeopleEntityCube, deployed by Microsoft Research Asia. [sent-28, score-0.34]
16 [Page footer: ... Methods in Natural Language Processing, pages 430–439, © 2010 Association for Computational Linguistics] automatically extracts people names from text and their co-occurrences to retrieve related entities based on co-occurrences. [sent-33, score-0.191]
17 In particular, we propose a novel approach of abstracting entity translation as a graph matching problem of two graphs Ge and Gc in Figures 1(a) and (b). [sent-38, score-0.566]
18 Specifically, the similarity between two nodes ve and vc in Ge and Gc is initialized as their transliteration similarity, which is iteratively refined based on relational similarity obtained from monolingual cooccurrences. [sent-39, score-1.187]
19 Once these two bilingual graphs of people and [sent-50, score-0.345]
20 their relationships are harvested, entity translation can leverage these parallel relationships to further evidence the mapping between translation pairs, as Figure 1(c) illustrates. [sent-51, score-0.351]
21 In particular, Figure 2 reports the precision for two groups: “heads” that belong to the top-100 popular people (determined by the number of hits), among 304 randomly sampled people names from six graph pairs of size 1,000 each, and the remaining “tails”. [sent-53, score-0.243]
22 [Figure 2: Comparison for Head and Tail datasets] bilingual co-occurrences that are scarce for tails, show significantly lower precision for tails. [sent-61, score-0.298]
23 Meanwhile, our work, depending solely on monolingual co-occurrences, shows high precision, for both heads and tails. [sent-62, score-0.171]
24 Our focus is to boost translation accuracy for long tails with non-trivial Web occurrences in each monolingual corpus, but not with much bilingual cooccurrences, e. [sent-63, score-0.452]
25 As existing translators are already highly accurate for popular heads, this focus well addresses the remaining challenges for entity translation. [sent-66, score-0.158]
26 To summarize, we believe that this paper has the following contributions: • We abstract the entity translation problem as a graph mapping between entity-relationship graphs in two languages. [sent-67, score-0.428]
27 • We develop an effective matching algorithm leveraging both pronunciation and co-occurrence similarity. [sent-68, score-0.317]
28 • This holistic approach complements existing approaches and enhances the translation coverage and accuracy. [sent-69, score-0.278]
29 However, people are free to designate arbitrary bilingual names with little (or no) phonetic similarity, for which the transliteration-based approach is not effective. [sent-80, score-0.373]
30 2 Corpus-based Approaches Corpus-based approaches can mine arbitrary translation pairs, by mining bilingual co-occurrences from parallel and comparable bilingual corpora. [sent-82, score-0.594]
31 , bilingual Wikipedia entries on the same person, renders high accuracy but suffers from high scarcity. [sent-86, score-0.164]
32 , 2008) extracts bilingual cooccurrences from bilingual sentences, such as annotating terms with their corresponding translations in English inside parentheses. [sent-91, score-0.484]
33 , 2009) identifies potential translation pairs from bilingual sentences using lexical pattern analysis. [sent-93, score-0.313]
34 3 Holistic Approaches The complementary strength of the above two approaches naturally calls for a holistic approach, such as recent work combining transliteration- and corpus-based similarity, mining bilingual co-occurrences using general search engines. [sent-95, score-0.578]
35 Specifically, (Al-Onaizan and Knight, 2002) uses transliteration to generate candidates and then web corpora to identify translations. [sent-96, score-0.282]
36 , 2007) enhances to use transliteration to guide web mining. [sent-98, score-0.242]
37 Our work is also a holistic approach, but leveraging significantly larger corpora, specifically by exploiting monolingual co-occurrences. [sent-99, score-0.301]
38 Such expansion enables translating “long-tail” people entities with non-trivial Web occurrences in each monolingual corpus, but not many bilingual co-occurrences. [sent-100, score-0.387]
39 Specifically, we initialize name pair similarity using a transliteration-based approach, and iteratively reinforce the base similarity using relational similarity. [sent-101, score-0.598]
40 3 Our Framework Given two graphs Ge = (Ve, Ee) and Gc = (Vc, Ec) harvested from English and Chinese corpora respectively, our goal is to find translation pairs, or a set S of matching node pairs such that S ⊆ Ve × Vc. [sent-102, score-0.499]
41 Let R be a |Ve|-by-|Vc| matrix where each Rij denotes the similarity between two nodes i ∈ Ve and j ∈ Vc. [sent-103, score-0.335]
42 Initialization: computing base translation similarities Rij between two entity nodes using transliteration similarity [sent-105, score-0.689]
43 Matching extraction: extracting the matching pairs from the final translation similarities Rij [sent-107, score-0.287]
44 3.1 Initialization with Transliteration We initialize the translation similarity Rij as the transliteration similarity. [sent-108, score-0.474]
45 This section explains how to get the transliteration similarity between English and Chinese names using an unsupervised approach. [sent-109, score-0.465]
46 Formally, let an English name Ne = (e1, e2, ..., en) and a Chinese name Nc = (c1, c2, ..., cm) be given, where ei is an English word and Ne is a sequence of the words, and ci is a Chinese character and Nc is a sequence of the characters. [sent-110, score-0.335]
47 Our goal is to compute a score indicating the similarity between the pronunciations of the two names. [sent-111, score-0.226]
48 Calculation of transliteration similarity between Ne and Nc is now transformed to calculation of pronunciation similarity between Ne and PYc. [sent-118, score-0.665]
49 Because letters in Chinese Pinyins and English strings are pronounced similarly, we can further approximate pronunciation similarity between Ne and PYc using their spelling similarity. [sent-119, score-0.29]
50 , EDname(Ne, PYc), as the sum of the EDs of all component transliteration pairs, i. [sent-123, score-0.197]
51 , every ei in Ne and its corresponding transliteration (si) in PYc. [sent-125, score-0.354]
52 In other words, we need to first align all sj ’s in PYc with corresponding ei in Ne based on whether they are translations of each other. [sent-126, score-0.318]
53 $ED_{name}(N_e, PY_c) = \sum_i ED(e_i, es_i)$ (1) where $es_i$ is a string generated by concatenating all $s_j$'s that are aligned to $e_i$ and $ED(e_i, es_i)$ is the Edit Distance between $e_i$ and $es_i$, i. [sent-128, score-0.391]
54 , the minimum number of edit operations (including insertion, deletion and substitution) needed to transform ei into esi. [sent-130, score-0.196]
55 Because an English word usually consists of multiple syllables but every Chinese character consists of only one syllable, when aligning ei’s with sj ’s, we add the constraint that each ei is allowed to be aligned with 0 to 4 si’s but each si can only be aligned with 0 to 1 ei. [sent-131, score-0.226]
56 For instance, the minimal Edit Distance between the English name “Barack Obama” and the Chinese Pinyin representation “ba la ke ao ba ma” is 4, as the best alignment is: “Barack” ↔ “ba la ke” (ED: 3), “Obama” ↔ “ao ba ma” (ED: 1). [sent-133, score-0.207]
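The name-level edit distance of Equation (1) can be illustrated with a small dynamic program. This is only a sketch under stated assumptions, not the authors' implementation: the function names `edit_distance` and `ed_name` are hypothetical, the alignment is assumed to be monotone and contiguous (each English word takes 0 to 4 consecutive Pinyin syllables), and an unaligned syllable is assumed to cost its own length, which the text above does not specify.

```python
from functools import lru_cache


def edit_distance(a, b):
    """Levenshtein distance with unit-cost insertion, deletion, substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def ed_name(english_words, pinyin_syllables, max_span=4):
    """Minimal sum of ED(e_i, es_i) over monotone alignments.

    Each English word is aligned with 0..max_span consecutive syllables and
    each syllable with at most one word. Assumption (not from the paper):
    a syllable left unaligned is charged its own length.
    """
    words = [w.lower() for w in english_words]
    syls = list(pinyin_syllables)
    n, m = len(words), len(syls)

    @lru_cache(maxsize=None)
    def best(i, j):
        # Cheapest cost of handling words[i:] and syllables[j:].
        if i == n:
            return sum(len(s) for s in syls[j:])
        options = []
        for span in range(max_span + 1):
            if j + span > m:
                break
            es = "".join(syls[j:j + span])
            options.append(edit_distance(words[i], es) + best(i + 1, j + span))
        if j < m:
            # Leave syllable j unaligned and keep working on word i.
            options.append(len(syls[j]) + best(i, j + 1))
        return min(options)

    return best(0, 0)


print(ed_name(["Barack", "Obama"], ["ba", "la", "ke", "ao", "ba", "ma"]))  # 4
```

On the example from the text, the sketch recovers the alignment “Barack” ↔ “ba la ke” and “Obama” ↔ “ao ba ma” with a total distance of 4.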
57 Finally the transliteration similarity between Nc a. [sent-134, score-0.178]
58 2 Reinforcement Model From the initial similarity, we model our problem as an iterative approach that iteratively reinforces the similarity Rij of the nodes i and j from the matching similarities of their neighbor nodes u and v. [sent-140, score-0.753]
59 The basic intuition is built on exploiting the similarity between monolingual co-occurrences of two different languages. [sent-141, score-0.3]
60 In order to express this intuition, we formally define an iterative reinforcement model as follows. [sent-143, score-0.211]
61 Let $R^t_{ij}$ denote the similarity of nodes i and j at the t-th iteration: $R^{t+1}_{ij} = \lambda \sum_{(u,v)_k \in B^t(i,j,\theta)} \frac{R^t_{uv}}{2^k} + (1 - \lambda) R^0_{ij}$ (3) The model is expressed as a linear combination of (a) the relational similarity $\sum R^t_{uv}/2^k$ and (b) the transliteration similarity $R^0_{ij}$. [sent-144, score-0.956]
62 In the relational similarity, $B^t(i,j,\theta)$ is an ordered set of the best matching pairs between neighbor nodes of i and ones of j such that $\forall (u,v)_k \in B^t(i,j,\theta)$, $R^t_{uv} \geq \theta$, where $(u,v)_k$ is the matching pair with the k-th highest similarity score. [sent-146, score-0.649]
63 We consider (u, v) with similarity over some threshold θ, or Rutv ≥ θ, as a matching pair. [sent-147, score-0.316]
64 In this neighbor matching process, if many-to-many matches exist, we select only one with the greatest matching score. [sent-148, score-0.334]
65 N(i) and N(j) are the sets of neighbor nodes of i and j, respectively, and H is a priority queue sorting pairs in the decreasing order of similarity scores. [sent-150, score-0.455]
66 Meanwhile, note that, in order to express that the confidence for matching (i, j) progressively converges as the number of matched neighbors increases, we empirically use the decaying coefficient $1/2^k$ for $R^t_{uv}$, because $\sum_{k=1}^{\infty} 1/2^k = 1$. [sent-151, score-0.256]
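One update step of the reinforcement model in Equation (3) might look like the following sketch. It is not the authors' code: the names `best_neighbor_pairs` and `reinforce_once` are hypothetical, similarities are stored as nested dictionaries keyed by node name, and the values λ = 0.5 and θ = 0.5 are illustrative assumptions, since the extracted text does not state the settings used.

```python
import heapq


def best_neighbor_pairs(R, nbrs_i, nbrs_j, theta):
    """Ordered one-to-one neighbor matches with R[u][v] >= theta.

    A priority queue yields candidate pairs in decreasing similarity;
    a pair is kept only if neither endpoint is already matched.
    """
    heap = [(-R[u][v], u, v) for u in nbrs_i for v in nbrs_j if R[u][v] >= theta]
    heapq.heapify(heap)
    used_u, used_v, ordered = set(), set(), []
    while heap:
        _, u, v = heapq.heappop(heap)
        if u in used_u or v in used_v:
            continue
        used_u.add(u)
        used_v.add(v)
        ordered.append((u, v))
    return ordered


def reinforce_once(R_t, R_0, adj_e, adj_c, lam=0.5, theta=0.5):
    """Apply Equation (3) once to every (i, j) pair."""
    R_next = {}
    for i in R_0:
        R_next[i] = {}
        for j in R_0[i]:
            pairs = best_neighbor_pairs(R_t, adj_e[i], adj_c[j], theta)
            # The k-th best neighbor match is damped by 1 / 2^k.
            relational = sum(R_t[u][v] / 2 ** k
                             for k, (u, v) in enumerate(pairs, start=1))
            R_next[i][j] = lam * relational + (1 - lam) * R_0[i][j]
    return R_next


# Toy graphs (hypothetical data): English edge a-b, Chinese edge x-y.
adj_e = {"a": ["b"], "b": ["a"]}
adj_c = {"x": ["y"], "y": ["x"]}
R0 = {"a": {"x": 0.8, "y": 0.1}, "b": {"x": 0.1, "y": 0.6}}
R1 = reinforce_once(R0, R0, adj_e, adj_c)
print(R1["a"]["x"], R1["a"]["y"])
```

In the toy run, the pair (a, x) keeps a much higher score than (a, y) because its neighbors (b, y) also match above the threshold, which is the intuition the model is built on.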
67 3 Matching Extraction After the convergence of the above model, we get the |Ve|-by-|Vc| similarity matrix $R^\infty$. [sent-153, score-0.224]
68 More formally, this problem can be stated as the maximum weighted bipartite matching (West, 2000): [sent-155, score-0.194]
69 [Figure 3: How to get the ordered set Bt(i, j, θ)] Given two groups of entities Ve and Vc from the two graphs Ge and Gc, we can build a weighted bipartite graph G = (V, E), where V = Ve ∪ Vc and E is a set of edges (u, v) with weight $R^\infty_{uv}$. [sent-170, score-0.335]
70 From this bipartite graph, the maximum weighted bipartite matching problem finds a set of pairwise non-adjacent edges S ⊆ E such that $\sum_{(u,v) \in S} R^\infty_{uv}$ is the maximum. [sent-172, score-0.25]
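The matching-extraction step can be illustrated with an exhaustive search over assignments. This is a sketch for tiny inputs only and not the solver used in the paper: the function name `max_weight_matching` is hypothetical, the similarities are assumed non-negative (so matching every node on the smaller side is optimal), and a practical system would substitute a polynomial-time method such as the Hungarian algorithm.

```python
from itertools import permutations


def max_weight_matching(R, english, chinese):
    """Exhaustive maximum weighted bipartite matching.

    R[e][c] is the converged similarity of English node e and Chinese node c.
    Returns (pairs, total_weight), with pairs given as (english, chinese).
    """
    # Permute the larger side against the smaller one.
    if len(english) <= len(chinese):
        small, large, flipped = english, chinese, False
    else:
        small, large, flipped = chinese, english, True

    def weight(s, l):
        return R[l][s] if flipped else R[s][l]

    best_score, best_pairs = float("-inf"), []
    for perm in permutations(large, len(small)):
        score = sum(weight(s, l) for s, l in zip(small, perm))
        if score > best_score:
            best_score = score
            best_pairs = [(l, s) if flipped else (s, l)
                          for s, l in zip(small, perm)]
    return best_pairs, best_score


# Hypothetical similarities where a greedy choice (a-x first) is suboptimal.
R = {"a": {"x": 0.9, "y": 0.8}, "b": {"x": 0.85, "y": 0.1}}
pairs, total = max_weight_matching(R, ["a", "b"], ["x", "y"])
print(sorted(pairs), total)
```

The toy matrix shows why a global matching is used rather than picking the best candidate per node: taking a-x greedily forces b-y, for a total of 1.0, whereas the optimal matching a-y and b-x totals 1.65.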
71 Second, we validate the effectiveness and the scalability of our approach over a real-life dataset in Section 4. [sent-178, score-0.229]
72 Specifically, we built a graph pair (Ge, Gc) by expanding from a “seed pair” of nodes se ∈ Ve and sc ∈ Vc until the number of nodes for each graph becomes 1,000. [sent-183, score-0.522]
73 We then iteratively pop a node ve from Q, save ve into Ve, and push its neighbor nodes in decreasing order of co-occurrence scores with ve. [sent-186, score-0.507]
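The crawl described above can be sketched as a queue-driven expansion from the seed. Several details here are assumptions rather than facts from the paper: Q is treated as a first-in-first-out queue, the function name `expand_graph` is hypothetical, and the `cooccurrence` dictionary stands in for whatever an entity search engine would return for a name.

```python
from collections import deque


def expand_graph(seed, cooccurrence, max_nodes=1000):
    """Collect up to max_nodes node names reachable from the seed.

    cooccurrence[v] maps each neighbor of v to its co-occurrence score.
    """
    queue = deque([seed])
    seen = {seed}
    nodes = []
    while queue and len(nodes) < max_nodes:
        v = queue.popleft()
        nodes.append(v)
        # Push unseen neighbors in decreasing order of co-occurrence with v.
        ranked = sorted(cooccurrence.get(v, {}).items(),
                        key=lambda item: item[1], reverse=True)
        for neighbor, _ in ranked:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return nodes


# Hypothetical co-occurrence counts around a seed node "s".
cooc = {
    "s": {"a": 5, "b": 9},
    "a": {"s": 5},
    "b": {"s": 9, "c": 2},
    "c": {"b": 2},
}
print(expand_graph("s", cooc, max_nodes=3))  # ['s', 'b', 'a']
```

Running the same procedure once per language from the two seed names would yield the node sets Ve and Vc of one graph pair.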
74 By using this procedure, we built six graph pairs from six different seed pairs. [sent-188, score-0.24]
75 In particular, the six seed nodes are English names and its corresponding Chinese names representing a wide range of occupation domains (e. [sent-189, score-0.388]
76 Meanwhile, though we demonstrate the effectiveness of the proposed method for mining name translations in Chinese and English languages, the method can be easily adapted to other language pairs. [sent-192, score-0.327]
77 Table 1: Summary for graphs and test datasets obtained from each seed pair [sent-193, score-0.297]
78 Second, we manually searched for about 50 “ground-truth” matched translations for each graph pair to build test datasets Ti, by randomly selecting nodes within two hops from the seed pair (se, sc), since nodes outside two hops may include nodes whose neighbors are not fully crawled. [sent-199, score-0.854]
79 More specifically, due to our crawling process expanding to add neighbors from the seed, the nodes close to the seed have all the neighbors they would have in the full graph, while those far from the node may not. [sent-200, score-0.451]
80 In order to pick the nodes that well represent the actual (Footnote 6: Note, this is just a default setting, which we later increase for scalability evaluation in Figure 6.) [sent-201, score-0.199]
81 For each graph pair, we evaluated the effectiveness of (1) reinforcement model using MRR measure in Section 4. [sent-251, score-0.392]
82 We also validated (3) scalability of our framework over larger scale of graphs (with up to five thousand nodes) in Section 4. [sent-256, score-0.218]
83 By comparing MRRs for two matrices R0 and R∞, we can validate the effectiveness of the reinforcement model. [sent-264, score-0.354]
84 • Baseline matrix (R0): using only the transliteration similarity score, i. [sent-265, score-0.224]
85 , without reinforcement • Reinforced matrix (R∞): using the reinforced similarity score obtained from Equation (3) [Table 2: MRR of baseline and reinforced matrices; columns: Set, Baseline R0, Reinforced R∞] [sent-267, score-0.589]
86 As these figures show, MRR scores significantly increase after applying our reinforcement model except for the set T4 (on average from 69% to 81%), which indirectly shows the effectiveness of our reinforcement model. [sent-272, score-0.508]
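For reference, Mean Reciprocal Rank over a similarity matrix can be computed as in the sketch below. The exact ranking protocol is not visible in the extracted text, so this assumes the standard definition: for each ground-truth pair, the Chinese candidates are ranked by similarity to the English name and the reciprocal rank of the correct one is averaged. The function name and the toy matrix are hypothetical.

```python
def mean_reciprocal_rank(R, ground_truth):
    """Average of 1 / rank of the correct Chinese name for each English name.

    R[e][c] is a similarity score; ground_truth is a list of (e, c) pairs.
    """
    total = 0.0
    for e, correct_c in ground_truth:
        ranked = sorted(R[e], key=lambda c: R[e][c], reverse=True)
        rank = ranked.index(correct_c) + 1  # ranks start at 1
        total += 1.0 / rank
    return total / len(ground_truth)


# Hypothetical scores: "a" ranks its true match first, "b" ranks it second.
R = {"a": {"x": 0.9, "y": 0.2}, "b": {"x": 0.7, "y": 0.4}}
truth = [("a", "x"), ("b", "y")]
print(mean_reciprocal_rank(R, truth))  # 0.75
```

Evaluating the same ground truth against R0 and against R∞ gives the two MRR columns being compared.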
87 We compared our approach with a baseline, mapping two graphs with only transliteration similarity. [sent-276, score-0.329]
88 • Baseline: in matching extraction, using R0 as the similarity matrix by bypassing the reinforcement step • Ours: using R∞, the similarity matrix converged by Equation (3) [sent-277, score-0.751]
89 Meanwhile, in order to validate the translation accuracy over popular head and long-tail, as discussed in Section 1, we separated the test data into two groups and analyzed the effectiveness separately. [sent-304, score-0.242]
90 [Figure 5: Distribution over number of hits (axis: Number of names)] Table 4 shows the effectiveness with both datasets, respectively. [sent-307, score-0.24]
91 As difference of the effectiveness between tail and head (denoted diff) with respect to three measures shows, our approach shows stably high precision, for both heads and tails. [sent-308, score-0.191]
92 3 Scalability To validate the scalability of our approach, we evaluated the effectiveness of our approach over the number of nodes in two graphs. [sent-311, score-0.342]
93 We built six larger graph pairs (Ge, Gc) by expanding them from the seed pairs further until the number of nodes reaches 5,000. [sent-312, score-0.459]
94 Figure 6 shows the number of matched translations according to such increase. [sent-313, score-0.159]
95 Overall, the number of matched pairs linearly increases as the number of nodes increases, which suggests scalability. [sent-314, score-0.23]
96 The ratio of node overlap in two graphs is between about 7% and 9% of total node size. [sent-315, score-0.212]
97 [Figure 6: Matched translations over |Ve| and |Vc|] 5 Conclusion This paper abstracted the name translation problem as a matching problem of two entity-relationship graphs. [sent-316, score-0.456]
98 This novel approach complements existing name translation work, by not requiring rare resources of parallel or comparable corpus yet outperforming the state-of-the-art. [sent-317, score-0.306]
99 More specifically, we combine bilingual phonetic similarity and monolingual Web co-occurrence similarity, to compute a holistic notion of entity similarity. [sent-318, score-0.748]
100 [Table 4: Precision, Recall, and F1-score of Engkoo, Google, and Ours with head and tail datasets] developed a graph alignment algorithm that iteratively reinforces the matching similarity exploiting relational similarity and then extracts correct matches. [sent-349, score-0.868]
wordName wordTfidf (topN-words)
[('edname', 0.246), ('ne', 0.219), ('reinforcement', 0.211), ('pyc', 0.211), ('transliteration', 0.197), ('similarity', 0.178), ('vc', 0.175), ('gc', 0.173), ('bilingual', 0.164), ('ei', 0.157), ('matching', 0.138), ('pinyin', 0.134), ('ge', 0.133), ('graphs', 0.132), ('ve', 0.128), ('mrr', 0.128), ('monolingual', 0.122), ('nodes', 0.113), ('pronunciation', 0.112), ('bt', 0.112), ('holistic', 0.112), ('chinese', 0.106), ('rij', 0.104), ('entity', 0.102), ('translation', 0.099), ('ed', 0.098), ('barack', 0.096), ('nc', 0.096), ('seed', 0.095), ('graph', 0.095), ('translations', 0.092), ('names', 0.09), ('name', 0.089), ('scalability', 0.086), ('effectiveness', 0.086), ('esi', 0.077), ('reinforced', 0.077), ('meanwhile', 0.077), ('datasets', 0.07), ('phonetic', 0.07), ('ba', 0.07), ('sj', 0.069), ('leveraging', 0.067), ('complements', 0.067), ('engkoo', 0.067), ('pycj', 0.067), ('rutv', 0.067), ('tails', 0.067), ('matched', 0.067), ('gates', 0.064), ('cooccurrences', 0.064), ('hits', 0.064), ('scarce', 0.064), ('mining', 0.06), ('neighbor', 0.058), ('reinforces', 0.057), ('validate', 0.057), ('ming', 0.056), ('translators', 0.056), ('tail', 0.056), ('mine', 0.056), ('bipartite', 0.056), ('obama', 0.056), ('iand', 0.056), ('feng', 0.056), ('expanding', 0.056), ('relational', 0.056), ('microsoft', 0.052), ('entities', 0.052), ('jiang', 0.051), ('google', 0.051), ('neighbors', 0.051), ('parallel', 0.051), ('pairs', 0.05), ('people', 0.049), ('heads', 0.049), ('bill', 0.049), ('pronunciations', 0.048), ('len', 0.048), ('ke', 0.048), ('matrix', 0.046), ('web', 0.045), ('crawled', 0.045), ('crawling', 0.045), ('hops', 0.045), ('pohang', 0.045), ('ritj', 0.045), ('shao', 0.045), ('simtl', 0.045), ('verspoor', 0.045), ('fung', 0.045), ('asia', 0.045), ('wan', 0.042), ('ru', 0.042), ('node', 0.04), ('iteratively', 0.04), ('corpora', 0.04), ('edit', 0.039), ('abstracted', 0.038), ('kupiec', 0.038)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 79 emnlp-2010-Mining Name Translations from Entity Graph Mapping
Author: Gae-won You ; Seung-won Hwang ; Young-In Song ; Long Jiang ; Zaiqing Nie
Abstract: This paper studies the problem of mining entity translation, specifically, mining English and Chinese name pairs. Existing efforts can be categorized into (a) a transliteration-based approach leveraging phonetic similarity and (b) a corpus-based approach exploiting bilingual co-occurrences, each of which suffers from inaccuracy and scarcity respectively. In clear contrast, we use unleveraged resources of monolingual entity co-occurrences, crawled from entity search engines, represented as two entity-relationship graphs extracted from two language corpora respectively. Our problem is then abstracted as finding correct mappings across two graphs. To achieve this goal, we propose a holistic approach, of exploiting both transliteration similarity and monolingual co-occurrences. This approach, building upon monolingual corpora, complements existing corpus-based work, requiring scarce resources of parallel or comparable corpus, while significantly boosting the accuracy of transliteration-based work. We validate our proposed system using real-life datasets.
2 0.15045866 56 emnlp-2010-Hashing-Based Approaches to Spelling Correction of Personal Names
Author: Raghavendra Udupa ; Shaishav Kumar
Abstract: We propose two hashing-based solutions to the problem of fast and effective personal names spelling correction in People Search applications. The key idea behind our methods is to learn hash functions that map similar names to similar (and compact) binary codewords. The two methods differ in the data they use for learning the hash functions: the first method uses a set of names in a given language/script whereas the second uses a set of bilingual names. We show that both methods give excellent retrieval performance in comparison to several baselines on two lists of misspelled personal names. Moreover, the method that uses bilingual data for learning hash functions gives the best performance.
3 0.1390232 33 emnlp-2010-Cross Language Text Classification by Model Translation and Semi-Supervised Learning
Author: Lei Shi ; Rada Mihalcea ; Mingjun Tian
Abstract: In this paper, we introduce a method that automatically builds text classifiers in a new language by training on already labeled data in another language. Our method transfers the classification knowledge across languages by translating the model features and by using an Expectation Maximization (EM) algorithm that naturally takes into account the ambiguity associated with the translation of a word. We further exploit the readily available unlabeled data in the target language via semisupervised learning, and adapt the translated model to better fit the data distribution of the target language.
4 0.11523923 47 emnlp-2010-Example-Based Paraphrasing for Improved Phrase-Based Statistical Machine Translation
Author: Aurelien Max
Abstract: In this article, an original view on how to improve phrase translation estimates is proposed. This proposal is grounded on two main ideas: first, that appropriate examples of a given phrase should participate more in building its translation distribution; second, that paraphrases can be used to better estimate this distribution. Initial experiments provide evidence of the potential of our approach and its implementation for effectively improving translation performance.
5 0.10865881 63 emnlp-2010-Improving Translation via Targeted Paraphrasing
Author: Philip Resnik ; Olivia Buzek ; Chang Hu ; Yakov Kronrod ; Alex Quinn ; Benjamin B. Bederson
Abstract: Targeted paraphrasing is a new approach to the problem of obtaining cost-effective, reasonable quality translation that makes use of simple and inexpensive human computations by monolingual speakers in combination with machine translation. The key insight behind the process is that it is possible to spot likely translation errors with only monolingual knowledge of the target language, and it is possible to generate alternative ways to say the same thing (i.e. paraphrases) with only monolingual knowledge of the source language. Evaluations demonstrate that this approach can yield substantial improvements in translation quality.
6 0.082675472 86 emnlp-2010-Non-Isomorphic Forest Pair Translation
7 0.081165239 78 emnlp-2010-Minimum Error Rate Training by Sampling the Translation Lattice
8 0.070229411 35 emnlp-2010-Discriminative Sample Selection for Statistical Machine Translation
9 0.068961442 41 emnlp-2010-Efficient Graph-Based Semi-Supervised Learning of Structured Tagging Models
10 0.068854548 36 emnlp-2010-Discriminative Word Alignment with a Function Word Reordering Model
11 0.068645224 57 emnlp-2010-Hierarchical Phrase-Based Translation Grammars Extracted from Alignment Posterior Probabilities
12 0.068019271 69 emnlp-2010-Joint Training and Decoding Using Virtual Nodes for Cascaded Segmentation and Tagging Tasks
13 0.064886697 124 emnlp-2010-Word Sense Induction Disambiguation Using Hierarchical Random Graphs
14 0.063418292 55 emnlp-2010-Handling Noisy Queries in Cross Language FAQ Retrieval
15 0.06230573 74 emnlp-2010-Learning the Relative Usefulness of Questions in Community QA
16 0.062085539 50 emnlp-2010-Facilitating Translation Using Source Language Paraphrase Lattices
17 0.060331326 39 emnlp-2010-EMNLP 044
18 0.059850618 18 emnlp-2010-Assessing Phrase-Based Translation Models with Oracle Decoding
19 0.058529709 84 emnlp-2010-NLP on Spoken Documents Without ASR
20 0.057435103 27 emnlp-2010-Clustering-Based Stratified Seed Sampling for Semi-Supervised Relation Classification
topicId topicWeight
[(0, 0.235), (1, -0.052), (2, -0.088), (3, 0.091), (4, -0.008), (5, 0.0), (6, -0.095), (7, 0.036), (8, -0.109), (9, 0.057), (10, -0.023), (11, 0.059), (12, -0.007), (13, -0.041), (14, -0.005), (15, -0.016), (16, 0.092), (17, -0.021), (18, 0.22), (19, 0.083), (20, -0.006), (21, -0.032), (22, -0.23), (23, 0.01), (24, -0.054), (25, -0.11), (26, -0.155), (27, -0.014), (28, -0.177), (29, -0.088), (30, -0.011), (31, -0.039), (32, -0.102), (33, -0.091), (34, -0.032), (35, 0.027), (36, -0.047), (37, -0.017), (38, 0.084), (39, 0.149), (40, 0.146), (41, -0.192), (42, 0.111), (43, -0.15), (44, 0.103), (45, -0.109), (46, -0.133), (47, -0.122), (48, 0.109), (49, -0.179)]
simIndex simValue paperId paperTitle
same-paper 1 0.96652555 79 emnlp-2010-Mining Name Translations from Entity Graph Mapping
Author: Gae-won You ; Seung-won Hwang ; Young-In Song ; Long Jiang ; Zaiqing Nie
Abstract: This paper studies the problem of mining entity translation, specifically, mining English and Chinese name pairs. Existing efforts can be categorized into (a) a transliteration-based approach leveraging phonetic similarity and (b) a corpus-based approach exploiting bilingual co-occurrences, each of which suffers from inaccuracy and scarcity respectively. In clear contrast, we use unleveraged resources of monolingual entity co-occurrences, crawled from entity search engines, represented as two entity-relationship graphs extracted from two language corpora respectively. Our problem is then abstracted as finding correct mappings across two graphs. To achieve this goal, we propose a holistic approach, of exploiting both transliteration similarity and monolingual co-occurrences. This approach, building upon monolingual corpora, complements existing corpus-based work, requiring scarce resources of parallel or comparable corpus, while significantly boosting the accuracy of transliteration-based work. We validate our proposed system using real-life datasets.
2 0.55861139 56 emnlp-2010-Hashing-Based Approaches to Spelling Correction of Personal Names
Author: Raghavendra Udupa ; Shaishav Kumar
Abstract: We propose two hashing-based solutions to the problem of fast and effective personal names spelling correction in People Search applications. The key idea behind our methods is to learn hash functions that map similar names to similar (and compact) binary codewords. The two methods differ in the data they use for learning the hash functions: the first method uses a set of names in a given language/script whereas the second uses a set of bilingual names. We show that both methods give excellent retrieval performance in comparison to several baselines on two lists of misspelled personal names. Moreover, the method that uses bilingual data for learning hash functions gives the best performance.
3 0.41409922 55 emnlp-2010-Handling Noisy Queries in Cross Language FAQ Retrieval
Author: Danish Contractor ; Govind Kothari ; Tanveer Faruquie ; L V Subramaniam ; Sumit Negi
Abstract: Recent times have seen a tremendous growth in mobile based data services that allow people to use Short Message Service (SMS) to access these data services. In a multilingual society it is essential that data services that were developed for a specific language be made accessible through other local languages also. In this paper, we present a service that allows a user to query a Frequently-Asked-Questions (FAQ) database built in a local language (Hindi) using Noisy SMS English queries. The inherent noise in the SMS queries, along with the language mismatch, makes this a challenging problem. We handle these two problems by formulating the query similarity over FAQ questions as a combinatorial search problem where the search space consists of combinations of dictionary variations of the noisy query and its top-N translations. We demonstrate the effectiveness of our approach on a real-life dataset.
4 0.38732839 33 emnlp-2010-Cross Language Text Classification by Model Translation and Semi-Supervised Learning
Author: Lei Shi ; Rada Mihalcea ; Mingjun Tian
Abstract: In this paper, we introduce a method that automatically builds text classifiers in a new language by training on already labeled data in another language. Our method transfers the classification knowledge across languages by translating the model features and by using an Expectation Maximization (EM) algorithm that naturally takes into account the ambiguity associated with the translation of a word. We further exploit the readily available unlabeled data in the target language via semisupervised learning, and adapt the translated model to better fit the data distribution of the target language.
5 0.35004213 124 emnlp-2010-Word Sense Induction Disambiguation Using Hierarchical Random Graphs
Author: Ioannis Klapaftis ; Suresh Manandhar
Abstract: Graph-based methods have gained attention in many areas of Natural Language Processing (NLP) including Word Sense Disambiguation (WSD), text summarization, keyword extraction and others. Most of the work in these areas formulate their problem in a graph-based setting and apply unsupervised graph clustering to obtain a set of clusters. Recent studies suggest that graphs often exhibit a hierarchical structure that goes beyond simple flat clustering. This paper presents an unsupervised method for inferring the hierarchical grouping of the senses of a polysemous word. The inferred hierarchical structures are applied to the problem of word sense disambiguation, where we show that our method performs significantly better than traditional graph-based methods and agglomerative clustering yielding improvements over state-of-the-art WSD systems based on sense induction.
6 0.33490723 63 emnlp-2010-Improving Translation via Targeted Paraphrasing
7 0.31388727 86 emnlp-2010-Non-Isomorphic Forest Pair Translation
8 0.30284315 69 emnlp-2010-Joint Training and Decoding Using Virtual Nodes for Cascaded Segmentation and Tagging Tasks
9 0.30260095 110 emnlp-2010-Turbo Parsers: Dependency Parsing by Approximate Variational Inference
10 0.2893151 81 emnlp-2010-Modeling Perspective Using Adaptor Grammars
11 0.2882024 47 emnlp-2010-Example-Based Paraphrasing for Improved Phrase-Based Statistical Machine Translation
12 0.27217782 78 emnlp-2010-Minimum Error Rate Training by Sampling the Translation Lattice
13 0.23768392 99 emnlp-2010-Statistical Machine Translation with a Factorized Grammar
14 0.23525526 74 emnlp-2010-Learning the Relative Usefulness of Questions in Community QA
15 0.23026706 68 emnlp-2010-Joint Inference for Bilingual Semantic Role Labeling
16 0.2285794 77 emnlp-2010-Measuring Distributional Similarity in Context
17 0.22609119 19 emnlp-2010-Automatic Analysis of Rhythmic Poetry with Applications to Generation and Translation
18 0.22531511 41 emnlp-2010-Efficient Graph-Based Semi-Supervised Learning of Structured Tagging Models
19 0.21503051 84 emnlp-2010-NLP on Spoken Documents Without ASR
20 0.20853518 36 emnlp-2010-Discriminative Word Alignment with a Function Word Reordering Model
topicId topicWeight
[(12, 0.588), (29, 0.06), (30, 0.016), (32, 0.014), (52, 0.026), (56, 0.039), (66, 0.08), (72, 0.037), (76, 0.011), (82, 0.014), (89, 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 0.91624272 79 emnlp-2010-Mining Name Translations from Entity Graph Mapping
Author: Gae-won You ; Seung-won Hwang ; Young-In Song ; Long Jiang ; Zaiqing Nie
Abstract: This paper studies the problem of mining entity translation, specifically, mining English and Chinese name pairs. Existing efforts can be categorized into (a) a transliteration-based approach leveraging phonetic similarity and (b) a corpus-based approach exploiting bilingual co-occurrences, each of which suffers from inaccuracy and scarcity respectively. In clear contrast, we use unleveraged resources of monolingual entity co-occurrences, crawled from entity search engines, represented as two entity-relationship graphs extracted from two language corpora respectively. Our problem is then abstracted as finding correct mappings across two graphs. To achieve this goal, we propose a holistic approach, of exploiting both transliteration similarity and monolingual co-occurrences. This approach, building upon monolingual corpora, complements existing corpus-based work, requiring scarce resources of parallel or comparable corpus, while significantly boosting the accuracy of transliteration-based work. We validate our proposed system using real-life datasets.
2 0.88929665 101 emnlp-2010-Storing the Web in Memory: Space Efficient Language Models with Constant Time Retrieval
Author: David Guthrie ; Mark Hepple
Abstract: We present three novel methods of compactly storing very large n-gram language models. These methods use substantially less space than all known approaches and allow n-gram probabilities or counts to be retrieved in constant time, at speeds comparable to modern language modeling toolkits. Our basic approach generates an explicit minimal perfect hash function, that maps all n-grams in a model to distinct integers to enable storage of associated values. Extensions of this approach exploit distributional characteristics of n-gram data to reduce storage costs, including variable length coding of values and the use of tiered structures that partition the data for more efficient storage. We apply our approach to storing the full Google Web1T n-gram set and all 1-to-5 grams of the Gigaword newswire corpus. For the 1.5 billion n-grams of Gigaword, for example, we can store full count information at a cost of 1.66 bytes per n-gram (around 30% of the cost when using the current state-of-the-art approach), or quantized counts for 1.41 bytes per n-gram. For applications that are tolerant of a certain class of relatively innocuous errors (where unseen n-grams may be accepted as rare n-grams), we can reduce the latter cost to below 1 byte per n-gram.
3 0.85376745 95 emnlp-2010-SRL-Based Verb Selection for ESL
Author: Xiaohua Liu ; Bo Han ; Kuan Li ; Stephan Hyeonjun Stiller ; Ming Zhou
Abstract: In this paper we develop an approach to tackle the problem of verb selection for learners of English as a second language (ESL) by using features from the output of Semantic Role Labeling (SRL). Unlike existing approaches to verb selection that use local features such as n-grams, our approach exploits semantic features which explicitly model the usage context of the verb. The verb choice highly depends on its usage context which is not consistently captured by local features. We then combine these semantic features with other local features under the generalized perceptron learning framework. Experiments on both in-domain and out-of-domain corpora show that our approach outperforms the baseline and achieves state-of-the-art performance.
4 0.41297883 56 emnlp-2010-Hashing-Based Approaches to Spelling Correction of Personal Names
Author: Raghavendra Udupa ; Shaishav Kumar
Abstract: We propose two hashing-based solutions to the problem of fast and effective personal names spelling correction in People Search applications. The key idea behind our methods is to learn hash functions that map similar names to similar (and compact) binary codewords. The two methods differ in the data they use for learning the hash functions - the first method uses a set of names in a given language/script whereas the second uses a set of bilingual names. We show that both methods give excellent retrieval performance in comparison to several baselines on two lists of misspelled personal names. Moreover, the method that uses bilingual data for learning hash functions gives the best performance.
5 0.39053896 54 emnlp-2010-Generating Confusion Sets for Context-Sensitive Error Correction
Author: Alla Rozovskaya ; Dan Roth
Abstract: In this paper, we consider the problem of generating candidate corrections for the task of correcting errors in text. We focus on the task of correcting errors in preposition usage made by non-native English speakers, using discriminative classifiers. The standard approach to the problem assumes that the set of candidate corrections for a preposition consists of all preposition choices participating in the task. We determine likely preposition confusions using an annotated corpus of non-native text and use this knowledge to produce smaller sets of candidates. We propose several methods of restricting candidate sets. These methods exclude candidate prepositions that are not observed as valid corrections in the annotated corpus and take into account the likelihood of each preposition confusion in the non-native text. We find that restricting candidates to those that are observed in the non-native data improves both the precision and the recall compared to the approach that views all prepositions as possible candidates. Furthermore, the approach that takes into account the likelihood of each preposition confusion is shown to be the most effective.
6 0.3897253 35 emnlp-2010-Discriminative Sample Selection for Statistical Machine Translation
7 0.38225505 92 emnlp-2010-Predicting the Semantic Compositionality of Prefix Verbs
8 0.36305586 20 emnlp-2010-Automatic Detection and Classification of Social Events
9 0.34375393 107 emnlp-2010-Towards Conversation Entailment: An Empirical Investigation
10 0.34222662 8 emnlp-2010-A Multi-Pass Sieve for Coreference Resolution
11 0.34146231 21 emnlp-2010-Automatic Discovery of Manner Relations and its Applications
12 0.33679146 19 emnlp-2010-Automatic Analysis of Rhythmic Poetry with Applications to Generation and Translation
13 0.33185026 66 emnlp-2010-Inducing Word Senses to Improve Web Search Result Clustering
14 0.33177212 63 emnlp-2010-Improving Translation via Targeted Paraphrasing
15 0.33038533 68 emnlp-2010-Joint Inference for Bilingual Semantic Role Labeling
16 0.32910696 32 emnlp-2010-Context Comparison of Bursty Events in Web Search and Online Media
17 0.32703742 84 emnlp-2010-NLP on Spoken Documents Without ASR
18 0.32016233 49 emnlp-2010-Extracting Opinion Targets in a Single and Cross-Domain Setting with Conditional Random Fields
19 0.31824246 105 emnlp-2010-Title Generation with Quasi-Synchronous Grammar
20 0.31727144 18 emnlp-2010-Assessing Phrase-Based Translation Models with Oracle Decoding