emnlp emnlp2012 emnlp2012-39 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Atsushi Fujita ; Pierre Isabelle ; Roland Kuhn
Abstract: This paper presents a paraphrase acquisition method that uncovers and exploits generalities underlying paraphrases: paraphrase patterns are first induced and then used to collect novel instances. Unlike existing methods, ours uses both bilingual parallel and monolingual corpora. While the former are regarded as a source of high-quality seed paraphrases, the latter are searched for paraphrases that match patterns learned from the seed paraphrases. We show how one can use monolingual corpora, which are far more numerous and larger than bilingual corpora, to obtain paraphrases that rival in quality those derived directly from bilingual corpora. In our experiments, the number of paraphrase pairs obtained in this way from monolingual corpora was a large multiple of the number of seed paraphrases. Human evaluation through a paraphrase substitution test demonstrated that the newly acquired paraphrase pairs are of reasonable quality. Remaining noise can be further reduced by filtering seed paraphrases.
Reference: text
sentIndex sentText sentNum sentScore
1 This paper presents a paraphrase acquisition method that uncovers and exploits generalities underlying paraphrases: paraphrase patterns are first induced and then used to collect novel instances. [sent-3, score-1.097]
2 Unlike existing methods, ours uses both bilingual parallel and monolingual corpora. [sent-4, score-0.451]
3 While the former are regarded as a source of high-quality seed paraphrases, the latter are searched for paraphrases that match patterns learned from the seed paraphrases. [sent-5, score-0.709]
4 We show how one can use monolingual corpora, which are far more numerous and larger than bilingual corpora, to obtain paraphrases that rival in quality those derived directly from bilingual corpora. [sent-6, score-0.927]
5 In our experiments, the number of paraphrase pairs obtained in this way from monolingual corpora was a large multiple of the number of seed paraphrases. [sent-7, score-0.834]
6 Human evaluation through a paraphrase substitution test demonstrated that the newly acquired paraphrase pairs are of reasonable quality. [sent-8, score-1.109]
7 Because “equivalence” is the most fundamental semantic relationship, techniques for generating and recognizing paraphrases play an important role in a wide range of natural language processing tasks (Madnani and Dorr, 2010). [sent-11, score-0.414]
8 In the last decade, automatic acquisition of knowledge about paraphrases from corpora has been drawing the attention of many researchers. [sent-12, score-0.517]
9 The challenge in acquiring paraphrases is to ensure good coverage of the targeted classes of paraphrases along with a low proportion of incorrect pairs. [sent-20, score-0.887]
10 However, no matter what type of resource has been used, it has proven difficult to acquire paraphrase pairs with both high recall and high precision. [sent-21, score-0.548]
11 Among various types of corpora, monolingual corpora can be considered the best source for high-coverage paraphrase acquisition, because there is far more monolingual than bilingual text available. [sent-22, score-1.013]
12 However, if one uses purely distributional criteria, it is difficult to distinguish real paraphrases from pairs of expressions that are related in other ways, such as antonyms and cousin words. [sent-24, score-0.627]
13 In contrast, since the work of Bannard and Callison-Burch (2005), bilingual parallel corpora have been acknowledged as a good source of high-quality paraphrases: paraphrases are obtained by putting together expressions that receive the same translation in the other language (pivot language). [sent-25, score-0.91]
14 Because translation expresses a specific meaning more directly than context in the aforementioned approach, pairs of expressions acquired in this manner tend to be correct paraphrases. [sent-26, score-0.3]
15 However, the coverage problem remains: there is much less bilingual parallel than monolingual text available. [sent-27, score-0.477]
16 To achieve this, we propose a method that exploits general patterns underlying paraphrases and uses both bilingual parallel and monolingual sources of information. [sent-31, score-0.959]
17 Given a relatively high-quality set of paraphrases obtained from a bilingual parallel corpus, a set of paraphrase patterns is first induced. [sent-32, score-1.263]
18 Section 4 describes our experiments in acquiring paraphrases and presents statistics summarizing the coverage of our method. [sent-37, score-0.473]
19 2 Literature on Paraphrase Acquisition This section summarizes existing corpus-based methods for paraphrase acquisition, following the classification in (Hashimoto et al. [sent-40, score-0.466]
20 Because a large quantity of monolingual data is available for many languages, a large number of paraphrase candidates can be acquired (Lin and Pantel, 2001; Paşca and Dienes, 2005; Bhagat and Ravichandran, 2008, etc. [sent-44, score-0.752]
21 (2003) created monolingual parallel corpora from multiple human translations of the same source. [sent-55, score-0.346]
22 Leveraging recent advances in statistical machine translation (SMT), Bannard and Callison-Burch (2005) proposed a method for acquiring sub-sentential paraphrases from bilingual parallel corpora. [sent-57, score-0.79]
23 The likelihood of e2 being a paraphrase of e1 is given by $p(e_2 \mid e_1) = \sum_{f \in Tr(e_1, e_2)} p(e_2 \mid f)\, p(f \mid e_1)$ (1), where Tr(e1, e2) stands for the set of shared translations of e1 and e2. [sent-60, score-0.466]
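Equation (1) can be illustrated with a minimal Python sketch. The nested-dictionary layout of the translation tables, the function name `pivot_paraphrase_prob`, and the toy probabilities are assumptions made for illustration only; they are not taken from the paper or from any SMT toolkit.

```python
def pivot_paraphrase_prob(p_f_given_e, p_e_given_f, e1, e2):
    """Sum p(e2|f) * p(f|e1) over the translations f shared by e1 and e2.

    p_f_given_e[e][f] holds p(f|e); p_e_given_f[f][e] holds p(e|f).
    Both table layouts are hypothetical.
    """
    translations_of_e1 = p_f_given_e.get(e1, {})
    # Tr(e1, e2): pivot phrases f that translate both e1 and e2.
    shared = [f for f in translations_of_e1 if e2 in p_e_given_f.get(f, {})]
    return sum(p_e_given_f[f][e2] * translations_of_e1[f] for f in shared)


# Toy tables with two pivot phrases (invented values).
p_f_given_e = {"control system": {"f1": 0.5, "f2": 0.5}}
p_e_given_f = {
    "f1": {"control system": 0.75, "controller": 0.25},
    "f2": {"control system": 0.5, "controller": 0.5},
}
prob = pivot_paraphrase_prob(p_f_given_e, p_e_given_f, "control system", "controller")
```

With these toy values the sum is 0.25 × 0.5 + 0.5 × 0.5 = 0.375. A phrase pair with no shared pivot simply receives probability 0 under this sketch.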
24 Kok and Brockett (2010) showed how one can discover paraphrases that do not share any translation in one language by traversing a graph created from multiple translation tables, each corresponding to a bilingual parallel corpus. [sent-63, score-0.811]
25 This approach, however, suffers from a coverage problem, because both monolingual parallel and bilingual parallel corpora tend to be significantly smaller than monolingual non-parallel corpora. [sent-64, score-0.823]
26 One limitation of this approach is that it requires a considerable amount of labeled data for both the corpus construction and the paraphrase extraction steps. [sent-76, score-0.466]
27 3 Summary Existing methods have investigated one of the following four types of corpora as their principal resource1 : monolingual non-parallel corpora, monolingual parallel corpora, monolingual comparable corpora, and bilingual parallel corpora. [sent-78, score-0.98]
28 No matter what type of resource has been used, however, it has proven difficult to acquire paraphrases with both high recall and precision, with the possible exception of the method in (Hashimoto et al. [sent-79, score-0.438]
29 3 Proposed Method While most existing methods deal with expressions only at the surface level, ours exploits generalities underlying paraphrases to achieve better coverage while retaining high precision. [sent-81, score-0.538]
30 Furthermore, unlike existing methods, ours uses both bilingual parallel and monolingual non-parallel corpora as sources for acquiring paraphrases. [sent-82, score-0.543]
31 First, a set of high-quality seed paraphrases, PSeed, is acquired from bilingual parallel corpora by using an alignment-based method. [sent-84, score-0.437]
32 Then, our method collects further paraphrases through the following two steps. [sent-85, score-0.414]
33 Instantiation (Step 3): A novel set of paraphrase pairs, PHvst, is finally harvested from monolingual non-parallel corpora using the learned patterns; each newly acquired paraphrase pair is assessed by contextual similarity. [sent-87, score-1.343]
34 (2011) used monolingual corpora only for reranking paraphrases obtained from bilingual parallel corpora. [sent-89, score-0.924]
35 Seed Paraphrase Acquisition The goal of the first step is to obtain a set of high-quality paraphrase pairs, PSeed. [sent-94, score-0.489]
36 For this purpose, alignment-based methods with bilingual or monolingual parallel corpora are preferable to similarity-based methods applied to non-parallel corpora. [sent-95, score-0.635]
37 Among various options, in this paper, we start from the standard technique proposed by Bannard and Callison-Burch (2005) with bilingual parallel corpora (see also Section 2. [sent-96, score-0.348]
38 As a result, a naive application of the paraphrase acquisition method produces pairs of expressions that are not exact paraphrases. [sent-102, score-0.639]
39 Let PRaw be the initial set of paraphrase pairs extracted from the sanitized translation table. [sent-109, score-0.578]
40 Given a set of paraphrase pairs, RHS phrases corresponding to the same LHS phrase lp are compared. [sent-121, score-0.676]
41 A RHS phrase rp is not licensed iff lp has another RHS phrase rp′ (≠ rp) that satisfies the following two conditions (see also Figure 2). [sent-122, score-0.358]
42 • rp′ is a word sub-sequence of rp • rp′ is a more likely paraphrase than rp, p(rp′| lp) > p(rp| lp) i. [sent-123, score-0.589]
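The two RHS-filtering conditions above can be sketched in Python as follows. This is an illustrative reading, not the authors' implementation: it assumes that "word sub-sequence" means the words appear in the same order but not necessarily adjacently, and the data layout, function names, and example phrases are invented.

```python
def is_word_subsequence(short, long):
    """True if the words of `short` occur in `long` in the same order.

    Assumption: the words need not be adjacent in `long`.
    """
    remaining = iter(long.split())
    return all(word in remaining for word in short.split())


def rhs_filter(pairs):
    """Drop rp when lp has another rp2 that is a word sub-sequence of rp
    and satisfies p(rp2|lp) > p(rp|lp).

    pairs[lp][rp] holds p(rp|lp) (hypothetical layout).
    """
    licensed = {}
    for lp, candidates in pairs.items():
        licensed[lp] = {
            rp: p
            for rp, p in candidates.items()
            if not any(
                rp2 != rp and is_word_subsequence(rp2, rp) and p2 > p
                for rp2, p2 in candidates.items()
            )
        }
    return licensed


# Invented example: the longer variant is dominated by its shorter sub-sequence.
pairs = {
    "control apparatus": {
        "control device": 0.5,
        "the control device": 0.25,
        "controller": 0.125,
    }
}
filtered = rhs_filter(pairs)
```

Here "the control device" is removed because "control device" is a sub-sequence of it with a higher conditional probability, while "controller" is kept.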
43 , a LHS phrase lp is not qualified as a legitimate source of rp iff rp has another LHS phrase lp′ (≠ lp) that satisfies the following conditions (see also Figure 3). [sent-127, score-0.634]
44 Furthermore, we also require that LHS and RHS phrases exceed a threshold (ths) on their contextual similarity in a monolingual corpus. [sent-139, score-0.346]
45 Paraphrase Pattern Induction From a set of seed paraphrases, PSeed, paraphrase patterns are induced. [sent-144, score-0.649]
46 For instance, from paraphrases in (3), we induce paraphrase patterns in (4). [sent-145, score-0.974]
47 Note that our aim is to automatically capture general paraphrase patterns of the kind that have sometimes been manually described (Jacquemin, 1999; Fujita et al. [sent-157, score-0.56]
48 This is different from approaches that attach variable slots to paraphrases for calculating their similarity (Lin and Pantel, 2001; Szpektor and Dagan, 2008) or for constraining the context in which they are regarded as legitimate (Callison-Burch, 2008; Zhao et al. [sent-159, score-0.53]
49 Paraphrase Instance Acquisition Given a set of paraphrase patterns, such as those shown in (4), a set of novel instances, i. [sent-163, score-0.466]
50 We estimate how likely RHS(w) is to be a paraphrase of LHS(w) based on the contextual similarity between them using a monolingual corpus; a pair of phrases is discarded if they are used in substantially dissimilar contexts. [sent-173, score-0.766]
51 However, this is not a problem in our framework, because semantic equivalence between LHS(w) and RHS(w) is almost entirely guaranteed as a result of the way the corresponding patterns were learned from a bilingual parallel corpus. [sent-176, score-0.383]
52 8M sentence pairs (51M words in English and 56M words in French) was used as a bilingual parallel corpus, while its English side and the English side of the French-English corpus4 consisting of 23. [sent-192, score-0.391]
53 2M sentence pairs (122M morphemes in Japanese and 106M words in English) was used as a bilingual parallel corpus, while its English side and the 30. [sent-196, score-0.369]
54 Figure 4: # of paraphrase pairs in PSeed (left: Europarl, right: Patent). [sent-206, score-0.524]
55 Stop word lists for sanitizing translation pairs and paraphrase pairs were manually compiled: we enumerated 442 English words, 193 French words, and 149 Japanese morphemes, respectively. [sent-207, score-0.636]
56 From a bilingual parallel corpus, a translation table was created by our in-house phrase-based SMT system, PORTAGE (Sadat et al. [sent-208, score-0.343]
57 As contextual features for computing similarity of each paraphrase pair, all of the 1- to 4-grams of words adjacent to each occurrence of a phrase were counted. [sent-213, score-0.597]
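The feature extraction described above can be sketched as follows. The extracted text says only that 1- to 4-grams adjacent to each occurrence were counted; the left/right tagging of features and the use of cosine similarity are assumptions made for this illustration, as are all names.

```python
import math
from collections import Counter


def context_features(tokens, phrase, max_n=4):
    """Count the 1..max_n-grams immediately left and right of each occurrence."""
    feats = Counter()
    k = len(phrase)
    for i in range(len(tokens) - k + 1):
        if tokens[i:i + k] != phrase:
            continue
        for n in range(1, max_n + 1):
            if i - n >= 0:  # n-gram ending right before the phrase
                feats[("L", tuple(tokens[i - n:i]))] += 1
            if i + k + n <= len(tokens):  # n-gram starting right after it
                feats[("R", tuple(tokens[i + k:i + k + n]))] += 1
    return feats


def cosine(a, b):
    """Cosine similarity of two sparse count vectors (an assumed measure)."""
    dot = sum(v * b[key] for key, v in a.items() if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


f1 = context_features("the control device is on".split(), ["control", "device"])
f2 = context_features("the controller is on".split(), ["controller"])
f3 = context_features("a lizard ran away".split(), ["lizard"])
```

In this toy input the first two phrases occur in identical contexts, so their similarity is close to 1, while the third shares no context with them and scores 0. A pair scoring below the threshold ths would be discarded.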
58 2 Statistics on Acquired Paraphrases Seed Paraphrases (PSeed) of paraphrase pairs PSeed obtained from the bilingual parallel corpora. [sent-219, score-0.813]
59 The general trend is simply that the larger the corpus is, the more paraphrases are acquired. [sent-220, score-0.414]
60 The percentage of paraphrase pairs thereby discarded varied greatly depending on the corpus size (17-78% in Europarl and 31-82% in Patent), suggesting that the threshold value should be determined depending on the given corpus. [sent-236, score-0.57]
61 01 (“△”) to ensure the quality of PSeed that we will be u“s△ing )fo tor inducing paraphrase patterns, even though this results in discarding some less frequent but correct paraphrase pairs, such as “control apparatus” ⇒ “controlling device” in Figure 2. [sent-238, score-0.955]
62 Paraphrase Patterns Paraphrase Patterns Figures 5 and 6 show the number of paraphrase patterns that our method induced and their coverage against PSeed, respectively. [sent-239, score-0.586]
63 Figure 7: # of paraphrase pairs and unique LHS phrases in PSeed and PHvst (left: Europarl, right: Patent). [sent-248, score-0.596]
64 Novel Paraphrases (PHvst) Using the paraphrase patterns, novel paraphrase pairs, PHvst, were harvested from the monolingual non-parallel corpora. [sent-249, score-1.139]
65 Nevertheless, we managed to acquire a large number of paraphrase pairs as depicted in Figure 7, where pairs having zero similarity were excluded. [sent-251, score-0.651]
66 Figure 8 highlights the remarkably large ratio of PHvst to PSeed in terms of the number of paraphrase pairs and the number of unique LHS phrases. [sent-257, score-0.552]
67 The alignment-based method with bilingual corpora cannot produce very many RHS phrases per unique LHS phrase due to its reliance on conditional probability and the surface level processing. [sent-264, score-0.332]
68 One limitation of our method is that it cannot achieve high yield for PHvst whenever only a small number of paraphrase patterns can be extracted from the bilingual corpus (see also Figure 5). [sent-266, score-0.724]
69 Figure 10: # of acquired paraphrase pairs against threshold values (x-axis: similarity threshold ths). [sent-276, score-0.849]
70 , thp on the conditional probability and ths on the contextual similarity, respectively. [sent-283, score-0.359]
71 When the pairs were filtered only with thp, the number of paraphrase pairs in PHvst decreased more slowly than that of PSeed as the threshold value increased. [sent-285, score-0.665]
72 The same paraphrase pattern is often induced from more than one paraphrase pair in PSeed. [sent-287, score-0.952]
73 Thus, as long as at least one of them has a probability higher than the given threshold value, corresponding novel paraphrases can be harvested. [sent-288, score-0.46]
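The reason PHvst shrinks more slowly than PSeed under thp can be made concrete with a small sketch. The mapping from each pattern to the probabilities of its supporting seed pairs is a hypothetical data structure, and the pattern names and values are invented.

```python
def surviving_patterns(pattern_support, thp):
    """Keep a pattern if at least one supporting seed pair has probability above thp."""
    return {
        pattern
        for pattern, probs in pattern_support.items()
        if any(p > thp for p in probs)
    }


# Hypothetical support lists: probabilities of the seed pairs behind each pattern.
support = {"pattern_A": [0.5, 0.02, 0.01], "pattern_B": [0.02, 0.01]}
kept = surviving_patterns(support, 0.1)
```

Raising thp to 0.1 removes four of the five seed pairs in this toy example, yet pattern_A still survives through its single high-probability pair, so every instance it matches can still be harvested.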
74 On the other hand, as a result of assessing each individual paraphrase pair by the contextual similarity, many pairs in PHvst, which are supposed to be incorrect instances of their corresponding pattern, are filtered out by a larger threshold value for ths. [sent-289, score-0.656]
75 First, by substituting sub-sentential paraphrases into existing sentences in a given test corpus, pairs of slightly different sentences were automatically generated. [sent-296, score-0.472]
76 , 2011), we evaluated paraphrases acquired from the Europarl corpus on news sentences. [sent-314, score-0.51]
77 On the other hand, paraphrases acquired from patent documents are much more difficult to evaluate due to the following reasons. [sent-318, score-0.623]
78 We expect that paraphrases from a domain can be used safely in that domain. [sent-321, score-0.414]
79 To propose multiple paraphrase candidates at the same time, we also restricted phrases to be paraphrased (LHS phrases) to those having at least five paraphrases including ones from PHvst. [sent-324, score-0.96]
80 This resulted in 60,421 paraphrases for 988 phrase tokens (353 unique phrases). [sent-325, score-0.479]
81 Finally, we randomly sampled 80 unique phrase tokens and five unique paraphrases for each phrase token (400 examples in total), and asked six people having a high level of English proficiency to evaluate them. [sent-326, score-0.544]
82 The performance of paraphrases drawn from PHvst was reasonably high and similar to the scores 0. [sent-331, score-0.414]
83 The most promising way for improving the quality of PHvst is to ensure that paraphrase patterns cover only legitimate paraphrases. [sent-342, score-0.607]
84 We investigated this by filtering the manually scored paraphrase examples with two thresholds for cleaning seed paraphrases PSeed: thp on the conditional probability estimated using the bilingual parallel corpus and ths on the contextual similarity in the monolingual nonparallel corpus. [sent-343, score-1.931]
85 Figure 11 shows the average score of the examples whose corresponding paraphrase is obtainable with the given threshold values. [sent-344, score-0.512]
86 Figure 11: Average score of paraphrase examples against threshold values (x-axis: similarity threshold ths). [sent-360, score-0.695]
87 Nevertheless, it indicates that better filtering of PSeed with higher threshold values is likely to produce a better-quality set of paraphrases PHvst. [sent-368, score-0.525]
88 For instance, an inappropriate paraphrase pattern (9a) was excluded with thp = 0. [sent-369, score-0.659]
89 6 Conclusion In this paper, we exploited general patterns underlying paraphrases to acquire automatically a large number of high-quality paraphrase pairs using both bilingual parallel and monolingual non-parallel corpora. [sent-385, score-1.507]
90 Experiments using two sets of corpora demonstrated that our method is able to leverage information in a relatively small bilingual parallel corpus to exploit large amounts of information in a relatively large monolingual non-parallel corpus. [sent-386, score-0.51]
91 Human evaluation through a paraphrase substitution test revealed that the acquired paraphrases are generally of reasonable quality. [sent-387, score-0.999]
92 Our original objective was to extract from monolingual corpora a large quantity of paraphrases whose quality is as high as that of paraphrases from bilingual parallel corpora. [sent-388, score-1.389]
93 2, exploitation of patterns with more than one variable, learning curve experiments with different amounts of monolingual data, and comparison of in-domain and general-purpose monolingual corpora. [sent-395, score-0.418]
94 Second, we have an interest in exploiting sophisticated paraphrase patterns; for instance, by inducing patterns hierarchically (recursively) and incorporating lexical resources such as those exemplified in (4). [sent-396, score-0.56]
95 Finally, the developed paraphrase collection will be attested through applications, such as sentence compression (Cohn and Lapata, 2008; Ganitkevitch et al. [sent-397, score-0.466]
96 Large scale acquisition of paraphrases for learning surface patterns. [sent-418, score-0.458]
97 Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. [sent-453, score-0.591]
98 Learning sentential paraphrases from bilingual parallel corpora for text-to-text generation. [sent-469, score-0.762]
99 Filtering antonymous, trend-contrasting, and polarity-dissimilar distributional paraphrases for improving statistical machine translation. [sent-518, score-0.434]
100 Syntax-based alignment of multiple translations: Extracting paraphrases and generating new sentences. [sent-530, score-0.414]
wordName wordTfidf (topN-words)
[('paraphrase', 0.466), ('paraphrases', 0.414), ('pseed', 0.346), ('phvst', 0.293), ('lhs', 0.191), ('thp', 0.173), ('bilingual', 0.164), ('rhs', 0.163), ('monolingual', 0.162), ('ths', 0.137), ('lp', 0.129), ('parallel', 0.125), ('rp', 0.123), ('patent', 0.113), ('acquired', 0.096), ('patterns', 0.094), ('seed', 0.089), ('expressions', 0.071), ('europarl', 0.071), ('filtering', 0.065), ('apparatus', 0.062), ('corpora', 0.059), ('pairs', 0.058), ('hashimoto', 0.057), ('translation', 0.054), ('contextual', 0.049), ('threshold', 0.046), ('denkowski', 0.046), ('fujita', 0.046), ('similarity', 0.045), ('harvested', 0.045), ('acquisition', 0.044), ('phrases', 0.044), ('bannard', 0.041), ('cousin', 0.04), ('grammaticality', 0.038), ('evaluators', 0.038), ('phrase', 0.037), ('chris', 0.037), ('filtered', 0.037), ('paraphrased', 0.036), ('atsushi', 0.034), ('eastern', 0.034), ('recipe', 0.034), ('acquiring', 0.033), ('iff', 0.032), ('marton', 0.032), ('roland', 0.029), ('szpektor', 0.029), ('quantity', 0.028), ('unique', 0.028), ('device', 0.027), ('east', 0.027), ('shinyama', 0.027), ('smt', 0.027), ('dass', 0.027), ('europ', 0.027), ('generalities', 0.027), ('hakodate', 0.027), ('osungen', 0.027), ('portage', 0.027), ('praw', 0.027), ('prehistoric', 0.027), ('restraint', 0.027), ('sadat', 0.027), ('coverage', 0.026), ('stop', 0.026), ('instantiation', 0.025), ('dolan', 0.025), ('pivot', 0.025), ('slots', 0.024), ('legitimate', 0.024), ('antonyms', 0.024), ('acquire', 0.024), ('french', 0.024), ('barzilay', 0.023), ('regarded', 0.023), ('substitution', 0.023), ('chan', 0.023), ('du', 0.023), ('quality', 0.023), ('highquality', 0.023), ('lizard', 0.023), ('recipes', 0.023), ('control', 0.023), ('side', 0.022), ('lavie', 0.021), ('meaning', 0.021), ('investigated', 0.021), ('fujii', 0.021), ('wubben', 0.021), ('schools', 0.021), ('nonparallel', 0.021), ('ntcir', 0.021), ('kato', 0.021), ('xx', 0.021), ('pattern', 0.02), ('canada', 0.02), ('distributional', 0.02), 
('paraphrasing', 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999946 39 emnlp-2012-Enlarging Paraphrase Collections through Generalization and Instantiation
Author: Atsushi Fujita ; Pierre Isabelle ; Roland Kuhn
Abstract: This paper presents a paraphrase acquisition method that uncovers and exploits generalities underlying paraphrases: paraphrase patterns are first induced and then used to collect novel instances. Unlike existing methods, ours uses both bilingual parallel and monolingual corpora. While the former are regarded as a source of high-quality seed paraphrases, the latter are searched for paraphrases that match patterns learned from the seed paraphrases. We show how one can use monolingual corpora, which are far more numerous and larger than bilingual corpora, to obtain paraphrases that rival in quality those derived directly from bilingual corpora. In our experiments, the number of paraphrase pairs obtained in this way from monolingual corpora was a large multiple of the number of seed paraphrases. Human evaluation through a paraphrase substitution test demonstrated that the newly acquired paraphrase pairs are of reasonable quality. Remaining noise can be further reduced by filtering seed paraphrases.
2 0.53034103 58 emnlp-2012-Generalizing Sub-sentential Paraphrase Acquisition across Original Signal Type of Text Pairs
Author: Aurelien Max ; Houda Bouamor ; Anne Vilnat
Abstract: This paper describes a study on the impact of the original signal (text, speech, visual scene, event) of a text pair on the task of both manual and automatic sub-sentential paraphrase acquisition. A corpus of 2,500 annotated sentences in English and French is described, and performance on this corpus is reported for an efficient system combination exploiting a large set of features for paraphrase recognition. A detailed quantified typology of subsentential paraphrases found in our corpus types is given.
3 0.35527349 135 emnlp-2012-Using Discourse Information for Paraphrase Extraction
Author: Michaela Regneri ; Rui Wang
Abstract: Previous work on paraphrase extraction using parallel or comparable corpora has generally not considered the documents’ discourse structure as a useful information source. We propose a novel method for collecting paraphrases relying on the sequential event order in the discourse, using multiple sequence alignment with a semantic similarity measure. We show that adding discourse information boosts the performance of sentence-level paraphrase acquisition, which consequently gives a tremendous advantage for extracting phrase-level paraphrase fragments from matched sentences. Our system beats an informed baseline by a margin of 50%.
4 0.152771 25 emnlp-2012-Bilingual Lexicon Extraction from Comparable Corpora Using Label Propagation
Author: Akihiro Tamura ; Taro Watanabe ; Eiichiro Sumita
Abstract: This paper proposes a novel method for lexicon extraction that extracts translation pairs from comparable corpora by using graph-based label propagation. In previous work, it was established that performance drastically decreases when the coverage of a seed lexicon is small. We resolve this problem by utilizing indirect relations with the bilingual seeds together with direct relations, in which each word is represented by a distribution of translated seeds. The seed distributions are propagated over a graph representing relations among words, and translation pairs are extracted by identifying word pairs with a high similarity in the seed distributions. We propose two types of the graphs: a co-occurrence graph, representing co-occurrence relations between words, and a similarity graph, representing context similarities between words. Evaluations using English and Japanese patent comparable corpora show that our proposed graph propagation method outperforms conventional methods. Further, the similarity graph achieved improved performance by clustering synonyms into the same translation.
5 0.15227786 16 emnlp-2012-Aligning Predicates across Monolingual Comparable Texts using Graph-based Clustering
Author: Michael Roth ; Anette Frank
Abstract: Generating coherent discourse is an important aspect in natural language generation. Our aim is to learn factors that constitute coherent discourse from data, with a focus on how to realize predicate-argument structures in a model that exceeds the sentence level. We present an important subtask for this overall goal, in which we align predicates across comparable texts, admitting partial argument structure correspondence. The contribution of this work is two-fold: We first construct a large corpus resource of comparable texts, including an evaluation set with manual predicate alignments. Secondly, we present a novel approach for aligning predicates across comparable texts using graph-based clustering with Mincuts. Our method significantly outperforms other alignment techniques when applied to this novel alignment task, by a margin of at least 6.5 percentage points in F1-score.
6 0.11824635 4 emnlp-2012-A Comparison of Vector-based Representations for Semantic Composition
7 0.07589823 123 emnlp-2012-Syntactic Transfer Using a Bilingual Lexicon
8 0.07220874 11 emnlp-2012-A Systematic Comparison of Phrase Table Pruning Techniques
9 0.07050515 94 emnlp-2012-Multiple Aspect Summarization Using Integer Linear Programming
10 0.069209434 42 emnlp-2012-Entropy-based Pruning for Phrase-based Machine Translation
11 0.06867943 108 emnlp-2012-Probabilistic Finite State Machines for Regression-based MT Evaluation
12 0.063037269 109 emnlp-2012-Re-training Monolingual Parser Bilingually for Syntactic SMT
13 0.059945416 1 emnlp-2012-A Bayesian Model for Learning SCFGs with Discontiguous Rules
14 0.055237744 103 emnlp-2012-PATTY: A Taxonomy of Relational Patterns with Semantic Types
15 0.055028453 54 emnlp-2012-Forced Derivation Tree based Model Training to Statistical Machine Translation
16 0.049749464 80 emnlp-2012-Learning Verb Inference Rules from Linguistically-Motivated Evidence
17 0.048237789 104 emnlp-2012-Parse, Price and Cut-Delayed Column and Row Generation for Graph Based Parsers
18 0.046629243 128 emnlp-2012-Translation Model Based Cross-Lingual Language Model Adaptation: from Word Models to Phrase Models
19 0.046341568 127 emnlp-2012-Transforming Trees to Improve Syntactic Convergence
20 0.044724122 69 emnlp-2012-Joining Forces Pays Off: Multilingual Joint Word Sense Disambiguation
topicId topicWeight
[(0, 0.205), (1, -0.036), (2, -0.483), (3, 0.02), (4, 0.202), (5, 0.38), (6, 0.15), (7, -0.12), (8, -0.15), (9, 0.006), (10, 0.206), (11, -0.006), (12, 0.107), (13, -0.09), (14, 0.031), (15, 0.09), (16, -0.015), (17, 0.075), (18, -0.074), (19, 0.023), (20, -0.028), (21, -0.02), (22, -0.038), (23, -0.124), (24, -0.011), (25, -0.002), (26, 0.021), (27, 0.057), (28, -0.079), (29, -0.051), (30, -0.004), (31, 0.013), (32, 0.022), (33, 0.09), (34, 0.028), (35, -0.072), (36, 0.032), (37, -0.034), (38, 0.009), (39, 0.005), (40, 0.018), (41, 0.05), (42, -0.015), (43, 0.042), (44, 0.009), (45, -0.0), (46, -0.044), (47, 0.054), (48, -0.039), (49, -0.001)]
simIndex simValue paperId paperTitle
same-paper 1 0.96536267 39 emnlp-2012-Enlarging Paraphrase Collections through Generalization and Instantiation
Author: Atsushi Fujita ; Pierre Isabelle ; Roland Kuhn
Abstract: This paper presents a paraphrase acquisition method that uncovers and exploits generalities underlying paraphrases: paraphrase patterns are first induced and then used to collect novel instances. Unlike existing methods, ours uses both bilingual parallel and monolingual corpora. While the former are regarded as a source of high-quality seed paraphrases, the latter are searched for paraphrases that match patterns learned from the seed paraphrases. We show how one can use monolingual corpora, which are far more numerous and larger than bilingual corpora, to obtain paraphrases that rival in quality those derived directly from bilingual corpora. In our experiments, the number of paraphrase pairs obtained in this way from monolingual corpora was a large multiple of the number of seed paraphrases. Human evaluation through a paraphrase substitution test demonstrated that the newly acquired paraphrase pairs are of reasonable quality. Remaining noise can be further reduced by filtering seed paraphrases.
2 0.93330526 58 emnlp-2012-Generalizing Sub-sentential Paraphrase Acquisition across Original Signal Type of Text Pairs
Author: Aurelien Max ; Houda Bouamor ; Anne Vilnat
Abstract: This paper describes a study on the impact of the original signal (text, speech, visual scene, event) of a text pair on the task of both manual and automatic sub-sentential paraphrase acquisition. A corpus of 2,500 annotated sentences in English and French is described, and performance on this corpus is reported for an efficient system combination exploiting a large set of features for paraphrase recognition. A detailed quantified typology of subsentential paraphrases found in our corpus types is given.
3 0.81608701 135 emnlp-2012-Using Discourse Information for Paraphrase Extraction
Author: Michaela Regneri ; Rui Wang
Abstract: Previous work on paraphrase extraction using parallel or comparable corpora has generally not considered the documents’ discourse structure as a useful information source. We propose a novel method for collecting paraphrases relying on the sequential event order in the discourse, using multiple sequence alignment with a semantic similarity measure. We show that adding discourse information boosts the performance of sentence-level paraphrase acquisition, which consequently gives a tremendous advantage for extracting phrase-level paraphrase fragments from matched sentences. Our system beats an informed baseline by a margin of 50%.
4 0.40426204 16 emnlp-2012-Aligning Predicates across Monolingual Comparable Texts using Graph-based Clustering
Author: Michael Roth ; Anette Frank
Abstract: Generating coherent discourse is an important aspect in natural language generation. Our aim is to learn factors that constitute coherent discourse from data, with a focus on how to realize predicate-argument structures in a model that exceeds the sentence level. We present an important subtask for this overall goal, in which we align predicates across comparable texts, admitting partial argument structure correspondence. The contribution of this work is two-fold: We first construct a large corpus resource of comparable texts, including an evaluation set with manual predicate alignments. Secondly, we present a novel approach for aligning predicates across comparable texts using graph-based clustering with Mincuts. Our method significantly outperforms other alignment techniques when applied to this novel alignment task, by a margin of at least 6.5 percentage points in F1-score.
5 0.39115867 25 emnlp-2012-Bilingual Lexicon Extraction from Comparable Corpora Using Label Propagation
Author: Akihiro Tamura ; Taro Watanabe ; Eiichiro Sumita
Abstract: This paper proposes a novel method for lexicon extraction that extracts translation pairs from comparable corpora by using graph-based label propagation. In previous work, it was established that performance drastically decreases when the coverage of a seed lexicon is small. We resolve this problem by utilizing indirect relations with the bilingual seeds together with direct relations, in which each word is represented by a distribution of translated seeds. The seed distributions are propagated over a graph representing relations among words, and translation pairs are extracted by identifying word pairs with a high similarity in the seed distributions. We propose two types of graphs: a co-occurrence graph, representing co-occurrence relations between words, and a similarity graph, representing context similarities between words. Evaluations using English and Japanese patent comparable corpora show that our proposed graph propagation method outperforms conventional methods. Further, the similarity graph achieved improved performance by clustering synonyms into the same translation.
6 0.26394194 118 emnlp-2012-Source Language Adaptation for Resource-Poor Machine Translation
7 0.23430447 4 emnlp-2012-A Comparison of Vector-based Representations for Semantic Composition
8 0.22844499 108 emnlp-2012-Probabilistic Finite State Machines for Regression-based MT Evaluation
9 0.21766765 123 emnlp-2012-Syntactic Transfer Using a Bilingual Lexicon
10 0.20587119 44 emnlp-2012-Excitatory or Inhibitory: A New Semantic Orientation Extracts Contradiction and Causality from the Web
11 0.20078659 54 emnlp-2012-Forced Derivation Tree based Model Training to Statistical Machine Translation
12 0.17546481 94 emnlp-2012-Multiple Aspect Summarization Using Integer Linear Programming
13 0.16358763 74 emnlp-2012-Language Model Rest Costs and Space-Efficient Storage
14 0.16232638 103 emnlp-2012-PATTY: A Taxonomy of Relational Patterns with Semantic Types
15 0.15976457 1 emnlp-2012-A Bayesian Model for Learning SCFGs with Discontiguous Rules
16 0.15768623 109 emnlp-2012-Re-training Monolingual Parser Bilingually for Syntactic SMT
17 0.15534887 42 emnlp-2012-Entropy-based Pruning for Phrase-based Machine Translation
18 0.15112524 69 emnlp-2012-Joining Forces Pays Off: Multilingual Joint Word Sense Disambiguation
19 0.14970775 31 emnlp-2012-Cross-Lingual Language Modeling with Syntactic Reordering for Low-Resource Speech Recognition
20 0.14619969 128 emnlp-2012-Translation Model Based Cross-Lingual Language Model Adaptation: from Word Models to Phrase Models
topicId topicWeight
[(2, 0.024), (16, 0.022), (25, 0.01), (30, 0.216), (34, 0.069), (45, 0.025), (60, 0.204), (63, 0.079), (64, 0.027), (65, 0.027), (70, 0.047), (74, 0.05), (76, 0.034), (79, 0.012), (80, 0.015), (86, 0.021), (95, 0.018)]
simIndex simValue paperId paperTitle
1 0.90850735 133 emnlp-2012-Unsupervised PCFG Induction for Grounded Language Learning with Highly Ambiguous Supervision
Author: Joohyun Kim ; Raymond Mooney
Abstract: “Grounded” language learning employs training data in the form of sentences paired with relevant but ambiguous perceptual contexts. Börschinger et al. (2011) introduced an approach to grounded language learning based on unsupervised PCFG induction. Their approach works well when each sentence potentially refers to one of a small set of possible meanings, such as in the sportscasting task. However, it does not scale to problems with a large set of potential meanings for each sentence, such as the navigation instruction following task studied by Chen and Mooney (2011). This paper presents an enhancement of the PCFG approach that scales to such problems with highly-ambiguous supervision. Experimental results on the navigation task demonstrate the effectiveness of our approach.
same-paper 2 0.85476983 39 emnlp-2012-Enlarging Paraphrase Collections through Generalization and Instantiation
Author: Atsushi Fujita ; Pierre Isabelle ; Roland Kuhn
Abstract: This paper presents a paraphrase acquisition method that uncovers and exploits generalities underlying paraphrases: paraphrase patterns are first induced and then used to collect novel instances. Unlike existing methods, ours uses both bilingual parallel and monolingual corpora. While the former are regarded as a source of high-quality seed paraphrases, the latter are searched for paraphrases that match patterns learned from the seed paraphrases. We show how one can use monolingual corpora, which are far more numerous and larger than bilingual corpora, to obtain paraphrases that rival in quality those derived directly from bilingual corpora. In our experiments, the number of paraphrase pairs obtained in this way from monolingual corpora was a large multiple of the number of seed paraphrases. Human evaluation through a paraphrase substitution test demonstrated that the newly acquired paraphrase pairs are of reasonable quality. Remaining noise can be further reduced by filtering seed paraphrases.
3 0.73392552 48 emnlp-2012-Exploring Adaptor Grammars for Native Language Identification
Author: Sze-Meng Jojo Wong ; Mark Dras ; Mark Johnson
Abstract: The task of inferring the native language of an author based on texts written in a second language has generally been tackled as a classification problem, typically using as features a mix of n-grams over characters and part of speech tags (for small and fixed n) and unigram function words. To capture arbitrarily long n-grams that syntax-based approaches have suggested are useful, adaptor grammars have some promise. In this work we investigate their extension to identifying n-gram collocations of arbitrary length over a mix of PoS tags and words, using both maxent and induced syntactic language model approaches to classification. After presenting a new, simple baseline, we show that learned collocations used as features in a maxent model perform better still, but that the story is more mixed for the syntactic language model.
4 0.73287225 61 emnlp-2012-Grounded Models of Semantic Representation
Author: Carina Silberer ; Mirella Lapata
Abstract: A popular tradition of studying semantic representation has been driven by the assumption that word meaning can be learned from the linguistic environment, despite ample evidence suggesting that language is grounded in perception and action. In this paper we present a comparative study of models that represent word meaning based on linguistic and perceptual data. Linguistic information is approximated by naturally occurring corpora and sensorimotor experience by feature norms (i.e., attributes native speakers consider important in describing the meaning of a word). The models differ in terms of the mechanisms by which they integrate the two modalities. Experimental results show that a closer correspondence to human data can be obtained by uncovering latent information shared among the textual and perceptual modalities rather than arriving at semantic knowledge by concatenating the two.
5 0.72733176 138 emnlp-2012-Wiki-ly Supervised Part-of-Speech Tagging
Author: Shen Li ; Joao Graca ; Ben Taskar
Abstract: Despite significant recent work, purely unsupervised techniques for part-of-speech (POS) tagging have not achieved useful accuracies required by many language processing tasks. Use of parallel text between resource-rich and resource-poor languages is one source of weak supervision that significantly improves accuracy. However, parallel text is not always available and techniques for using it require multiple complex algorithmic steps. In this paper we show that we can build POS-taggers exceeding state-of-the-art bilingual methods by using simple hidden Markov models and a freely available and naturally growing resource, the Wiktionary. Across eight languages for which we have labeled data to evaluate results, we achieve accuracy that significantly exceeds best unsupervised and parallel text methods. We achieve highest accuracy reported for several languages and show that our approach yields better out-of-domain taggers than those trained using fully supervised Penn Treebank.
6 0.7178368 92 emnlp-2012-Multi-Domain Learning: When Do Domains Matter?
7 0.7169407 70 emnlp-2012-Joint Chinese Word Segmentation, POS Tagging and Parsing
8 0.71576309 135 emnlp-2012-Using Discourse Information for Paraphrase Extraction
9 0.71160275 98 emnlp-2012-No Noun Phrase Left Behind: Detecting and Typing Unlinkable Entities
10 0.70962423 19 emnlp-2012-An Entity-Topic Model for Entity Linking
11 0.70688999 58 emnlp-2012-Generalizing Sub-sentential Paraphrase Acquisition across Original Signal Type of Text Pairs
12 0.70480478 41 emnlp-2012-Entity based QA Retrieval
13 0.70460272 84 emnlp-2012-Linking Named Entities to Any Database
14 0.7045716 93 emnlp-2012-Multi-instance Multi-label Learning for Relation Extraction
15 0.70095521 137 emnlp-2012-Why Question Answering using Sentiment Analysis and Word Classes
16 0.69934368 71 emnlp-2012-Joint Entity and Event Coreference Resolution across Documents
17 0.69822633 3 emnlp-2012-A Coherence Model Based on Syntactic Patterns
18 0.69811213 123 emnlp-2012-Syntactic Transfer Using a Bilingual Lexicon
19 0.69801104 108 emnlp-2012-Probabilistic Finite State Machines for Regression-based MT Evaluation
20 0.69571584 18 emnlp-2012-An Empirical Investigation of Statistical Significance in NLP