emnlp emnlp2012 emnlp2012-135 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Michaela Regneri ; Rui Wang
Abstract: Previous work on paraphrase extraction using parallel or comparable corpora has generally not considered the documents’ discourse structure as a useful information source. We propose a novel method for collecting paraphrases relying on the sequential event order in the discourse, using multiple sequence alignment with a semantic similarity measure. We show that adding discourse information boosts the performance of sentence-level paraphrase acquisition, which consequently gives a tremendous advantage for extracting phrase-level paraphrase fragments from matched sentences. Our system beats an informed baseline by a margin of 50%.
Reference: text
sentIndex sentText sentNum sentScore
1 regneri@coli.uni-saarland.de Abstract Previous work on paraphrase extraction using parallel or comparable corpora has generally not considered the documents’ discourse structure as a useful information source. [sent-4, score-0.748]
2 We propose a novel method for collecting paraphrases relying on the sequential event order in the discourse, using multiple sequence alignment with a semantic similarity measure. [sent-5, score-0.46]
3 We show that adding discourse information boosts the performance of sentence-level paraphrase acquisition, which consequently gives a tremendous advantage for extracting phrase-level paraphrase fragments from matched sentences. [sent-6, score-1.184]
4 1 Introduction It is widely agreed that identifying paraphrases is a core task for natural language processing, including applications like document summarization (Barzilay et al. [sent-8, score-0.254]
5 As a consequence, many methods have been proposed for generating large paraphrase resources (Lin and Pantel, 2001; Szpektor et al. [sent-14, score-0.398]
6 Most approaches that extract paraphrases from parallel texts employ some type of pattern matching. [sent-18, score-0.407]
7 Another approach (Deléger and Zweigenbaum, 2009) matches similar paragraphs in comparable texts, creating smaller comparable documents for paraphrase extraction. [sent-23, score-0.494]
8 We believe that discourse structure delivers important information for the extraction of paraphrases. [sent-24, score-0.216]
9 Sentences that play the same role in a certain discourse and have a similar discourse context can be paraphrases, even if a semantic similarity model does not consider them very similar. [sent-25, score-0.371]
10 Based on this assumption, we propose a novel method for collecting paraphrases from parallel texts using discourse information. [sent-28, score-0.546]
11 The discourse structures of those summaries are easy to compare: they all contain the events in the same order as they have appeared on the screen. [sent-30, score-0.215]
12 This allows us to take sentence order as event-based discourse structure, which is highly parallel for recaps of the same episode. [sent-31, score-0.404]
13 The approach outperforms informed baselines on the task of sentential paraphrase identification. [sent-35, score-0.533]
14 The usage of discourse information even contributes more to the final performance than the sentence similarity measure. [sent-36, score-0.244]
15 As a second step, we extract phrase-level paraphrase fragments from the matched sentences. [sent-37, score-0.617]
16 This step relies on the alignment algorithm’s output, and we show that discourse information makes a big difference for the precision of the extraction. [sent-38, score-0.336]
17 We then add more discourse-based information by preprocessing the text with a coreference resolution system, which results in additional performance improvement. [sent-39, score-0.207]
18 2 Related Work Previous paraphrase extraction approaches can be roughly characterized under two aspects: 1) data source and 2) granularity of the output. [sent-50, score-0.446]
19 Other studies focus on extracting paraphrases from large bilingual parallel corpora, which the machine translation (MT) community provides in many varieties. [sent-59, score-0.37]
20 (2008) take one language as the pivot and match two possible translations in the other languages as paraphrases if they share a common pivot phrase. [sent-61, score-0.314]
21 As parallel corpora have many alternative ways of expressing the same foreign language concept, large quantities of paraphrase pairs can be extracted. [sent-62, score-0.519]
22 The paraphrasing task is also strongly related to cross-document event coreference resolution, which is tackled with techniques similar to those used by the available paraphrasing systems (Bagga and Baldwin, 1999; Tomadaki and Salway, 2005). [sent-63, score-0.246]
23 Most work in paraphrase acquisition has dealt with sentence-level paraphrases, e. [sent-64, score-0.398]
24 Our approach for sentential paraphrase extraction is related to the one introduced by Barzilay and Lee (2003), who also employ multiple sequence alignment (MSA). [sent-69, score-0.628]
25 However, they use MSA at the sentence level rather than at the discourse level. [sent-70, score-0.209]
26 In this current work, we target the general task of extracting paraphrases for events rather than the much more specific script-related task. [sent-76, score-0.284]
27 From an applicational point of view, sentential paraphrases are difficult to use in other NLP tasks. [sent-78, score-0.323]
28 The research on general paraphrase fragment extraction at the sub-sentential level is mainly based on phrase pair extraction techniques from the MT literature. [sent-85, score-0.772]
29 Our own work (Wang and Callison-Burch, 2011) extends the first idea to paraphrase fragment extraction on monolingual parallel and comparable corpora. [sent-89, score-0.887]
30 3 Paraphrases and Discourse Previous approaches have shown that comparable texts provide a good basis for paraphrase extraction. [sent-92, score-0.484]
31 We want to show that discourse structure is highly useful for precise and high-yield paraphrase collection from such corpora. [sent-93, score-0.566]
32 A system which recognizes that the three sentence pairs occur in the same sequential event order would have a chance of actually matching the sentences. [sent-116, score-0.215]
33 Create a corpus: First, we create a comparable corpus of texts with highly comparable discourse structures. [sent-130, score-0.302]
34 , 2002) may be very useful for paraphrase computation; however, they are hard to obtain. [sent-132, score-0.398]
35 To circumvent this problem, we assemble documents that have parallel discourse structures by default: We compile multiple plot summaries of TV show episodes. [sent-134, score-0.301]
36 We take sentence sequences of recaps as parallel discourse structures. [sent-136, score-0.404]
37 Extract sentence-level paraphrases: Our system finds sentence pairs that are either paraphrases themselves, or at least contain paraphrase fragments. [sent-138, score-0.763]
38 This procedure crucially relies on discourse knowledge: A Multiple Sequence Alignment (MSA) algorithm matches sentences if both their inherent semantic similarities and the overall similarity score of their discourse contexts are high enough. [sent-139, score-0.401]
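A minimal sketch of this matching idea follows, assuming a generic sentence_similarity helper (e.g. a vector-space cosine) that is not defined here; a pairwise Needleman-Wunsch-style alignment with a gap cost stands in for the full multiple sequence alignment used by the system, so this is an illustration rather than the actual implementation.

```python
# Sketch: discourse-based sentence matching via global sequence alignment.
# Assumed helper: sentence_similarity(s1, s2) -> float in [0, 1].

def align_recaps(recap_a, recap_b, sentence_similarity, gap_cost=0.4):
    """Needleman-Wunsch-style alignment of two sentence sequences.

    Sentences are matched only where the global alignment of the two
    discourse sequences supports the match; cheaper gaps make the system
    more restrictive (higher precision), more expensive gaps give more recall.
    """
    n, m = len(recap_a), len(recap_b)
    # score[i][j]: best score for aligning the first i sentences of A with the first j of B
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = score[i - 1][0] - gap_cost
    for j in range(1, m + 1):
        score[0][j] = score[0][j - 1] - gap_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = score[i - 1][j - 1] + sentence_similarity(recap_a[i - 1], recap_b[j - 1])
            score[i][j] = max(match, score[i - 1][j] - gap_cost, score[i][j - 1] - gap_cost)

    # Trace back to collect the matched sentence pairs with their similarities.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        sim = sentence_similarity(recap_a[i - 1], recap_b[j - 1])
        if abs(score[i][j] - (score[i - 1][j - 1] + sim)) < 1e-9:
            pairs.append((recap_a[i - 1], recap_b[j - 1], sim))
            i, j = i - 1, j - 1
        elif abs(score[i][j] - (score[i - 1][j] - gap_cost)) < 1e-9:
            i -= 1
        else:
            j -= 1
    return list(reversed(pairs))
```

Because matches must respect the overall sentence order, two sentences can end up paired even when their own similarity is only moderate, as long as their discourse context aligns well; the gap cost is the knob that trades precision against recall.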
39 Extract paraphrase fragments: Sentence-level paraphrases may be too specific for further domain-independent applications. [Excerpt of the Figure 2 alignment table: columns row, recap 1, recap 2, recap 3, recap 4, recap 5] [sent-141, score-1.197]
40 Figure 2: Excerpt from an alignment table for 5 exemplary recaps of Episode 2 (Season 6). [sent-156, score-0.253]
41 Thus we take a necessary second step and extract finer-grained paraphrase fragments from the sentence pairs matched in step 2. [sent-160, score-0.693]
42 4 Sentence Matching with MSA This section explains how we apply MSA to extract sentence-level paraphrases from a comparable corpus. [sent-164, score-0.331]
43 5 Paraphrase Fragment Extraction Taking the output of the sentence alignment as input, we next extract shorter phrase-level paraphrases (paraphrase fragments) from the matched sentence pairs. [sent-203, score-0.53]
44 1 Preprocessing Before extracting paraphrase fragments, we first preprocess all documents as follows: Stanford CoreNLP provides a set of natural language analysis tools. [sent-206, score-0.428]
45 We use the part-ofspeech (POS) tagger, the named-entity recognizer, the parser (Klein and Manning, 2003), and the coreference resolution system (Lee et al. [sent-207, score-0.213]
46 The output from the coreference resolution system is used to cluster all mentions referring to the same entity and to select one as the representative mention. [sent-215, score-0.271]
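A rough sketch of the representative-mention substitution described above, under the simplifying assumption that coreference chains are already available as token spans; the chain layout used below is hypothetical and not the actual CoreNLP output format.

```python
# Sketch: replace every mention in a coreference chain by the chain's
# representative mention. The chain format (representative token list plus
# (sentence_id, start, end) mention spans) is an assumption for illustration.

def substitute_representatives(sentences, chains):
    """sentences: list of token lists for one recap.
    chains: list of dicts with keys 'representative' (token list) and
    'mentions' (list of (sent_id, start, end) token spans)."""
    replacements = {}
    for chain in chains:
        rep = chain["representative"]
        for sent_id, start, end in chain["mentions"]:
            replacements.setdefault(sent_id, []).append((start, end, rep))
    resolved = [list(tokens) for tokens in sentences]
    # Apply replacements right-to-left so earlier token offsets stay valid.
    for sent_id, spans in replacements.items():
        for start, end, rep in sorted(spans, reverse=True):
            resolved[sent_id][start:end] = rep
    return resolved
```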
47 Note that the coreference resolution system is applied to each recap as a whole. [sent-217, score-0.322]
48 We mainly follow our previous approach (Wang and CallisonBurch, 2011), which is a modified version of an approach by Munteanu and Marcu (2006) on translation fragment extraction. [sent-225, score-0.278]
49 All the word alignments (excluding stop-words) with positive scores are selected as candidate fragment elements. [sent-235, score-0.278]
50 Provided with the candidate fragment elements, we previously (Wang and Callison-Burch, 2011) used a chunker to finalize the output fragments, in order to follow the linguistic definition of a (para-)phrase. [sent-236, score-0.278]
51 Finally, we filter out trivial fragment pairs, such as identical fragments or the original sentence pairs themselves. [sent-240, score-0.278]
52 We exclude trivial fragment pairs that are prefixes or suffixes of each other (or identical). [sent-253, score-0.313]
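A rough sketch of the candidate selection and the trivial-pair filter described above; the alignment link format and the tiny stop-word list are simplifying assumptions rather than the actual GIZA++-based pipeline.

```python
# Sketch: build candidate fragment elements from positively scored word
# alignments and discard trivial fragment pairs. The link format
# (src_index, tgt_index, score) is an assumption, not GIZA++'s output format.

STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "that", "is"}  # illustrative only

def candidate_elements(links, src_tokens, tgt_tokens):
    """Keep alignment links with positive scores whose words are not stop-words."""
    return [(i, j) for i, j, score in links
            if score > 0
            and src_tokens[i].lower() not in STOPWORDS
            and tgt_tokens[j].lower() not in STOPWORDS]

def is_trivial(frag1, frag2, sent1, sent2):
    """Discard identical fragments, copies of the original sentence pair,
    and pairs where one fragment is a prefix or suffix of the other."""
    if frag1 == frag2 or (frag1 == sent1 and frag2 == sent2):
        return True
    shorter, longer = sorted((frag1, frag2), key=len)
    return longer[:len(shorter)] == shorter or longer[-len(shorter):] == shorter
```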
53 6 Evaluation We evaluate both sentential paraphrase matching and paraphrase fragment extraction using manually labelled gold standards (provided in the supplementary material). [sent-259, score-0.398]
54 We collect recaps for all 20 episodes of season 6 of House M.D. [sent-260, score-0.201]
55 We thus sample pairs that either the system or the baselines recognized as paraphrases and try to create an evaluation set that is not biased towards the actual system or any of the baselines. [sent-272, score-0.395]
56 The evaluation set consists of 2000 sentence pairs: 400 that the system recognized as paraphrases, 400 positively labelled pairs for each of the three baselines (described in the following section) and 400 randomly selected pairs. [sent-273, score-0.18]
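A minimal sketch of how such an evaluation set could be assembled, assuming each candidate pool holds at least 400 pairs; the helper below is illustrative only and not the sampling code used in the experiments.

```python
# Sketch of the evaluation-set construction: 400 system positives,
# 400 positives per baseline, and 400 randomly selected pairs (2000 in total
# with three baselines). random.sample assumes each pool has >= k elements.
import random

def build_evaluation_set(system_pairs, baseline_pairs, all_pairs, k=400, seed=0):
    """baseline_pairs: dict mapping baseline name -> list of positively
    labelled sentence pairs."""
    rng = random.Random(seed)
    evaluation = rng.sample(system_pairs, k)
    for pairs in baseline_pairs.values():
        evaluation += rng.sample(pairs, k)
    evaluation += rng.sample(all_pairs, k)
    return evaluation
```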
57 This scheme has a double purpose: The main objective is judging whether two sentences contain paraphrases (1-3) or if they are unrelated (4). [sent-283, score-0.287]
58 The algorithm partitions the set of sentences into paraphrase clusters such that the most similar sentences end up in one cluster. [sent-305, score-0.398]
59 Note that the CLUSTER+BLEU system resembles popular n-gram overlap measures for paraphrase classification. [sent-309, score-0.469]
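The kind of measure meant here can be sketched as an average clipped n-gram precision in the spirit of BLEU; the function below illustrates such an overlap score and is not the exact scoring used in the experiments.

```python
# Sketch of an n-gram overlap score in the spirit of BLEU, used here only to
# illustrate the similarity the CLUSTER+BLEU baseline relies on.
from collections import Counter

def ngram_overlap(tokens_a, tokens_b, max_n=4):
    """Average clipped n-gram precision of tokens_a against tokens_b."""
    precisions = []
    for n in range(1, max_n + 1):
        grams_a = Counter(tuple(tokens_a[i:i + n]) for i in range(len(tokens_a) - n + 1))
        grams_b = Counter(tuple(tokens_b[i:i + n]) for i in range(len(tokens_b) - n + 1))
        total = sum(grams_a.values())
        if total == 0:
            continue
        clipped = sum(min(count, grams_b[g]) for g, count in grams_a.items())
        precisions.append(clipped / total)
    return sum(precisions) / len(precisions) if precisions else 0.0
```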
60 Results Overall, our system extracts 20379 paraphrase pairs. [sent-311, score-0.433]
61 Gap costs directly influence precision and recall: "cheap" gaps lead to a more restrictive system with higher precision, and more expensive gaps give more recall. [sent-318, score-0.187]
62 Intuitively, we expected the MSA-based systems to end up with a higher recall than the clustering baselines, because sentences can be matched even if their similarity is moderate or low, provided their discourse context is highly similar. [sent-346, score-0.283]
63 The advantage of using the vector-space model, which is still obvious for the clustering baselines, is nearly evened out when discourse knowledge is added as a backbone. [sent-350, score-0.232]
64 It is hard to do a direct comparison with state-of-the-art paraphrase recognition systems, because most are evaluated on different corpora, e. [sent-352, score-0.398]
65 , the Microsoft paraphrase corpus (Dolan and Brockett, 2005, MSR). [sent-354, score-0.398]
66 While the MSR corpus is larger than our collection, the wording variations in its paraphrase pairs are usually lower than for our examples. [sent-356, score-0.433]
67 2 Paraphrase Fragment Evaluation We also manually evaluate precision on paraphrase fragments, and additionally describe the productivity of the different setups, providing some intuition about the methods’ recall. [sent-369, score-0.514]
68 Gold-Standard We randomly collect 150 fragment pairs for each of the five system configurations (explained in the following section). [sent-370, score-0.348]
69 Each fragment pair (f1, f2) is annotated with one of the following categories: 1. [sent-371, score-0.278]
70 This labeling scheme again assesses precision as well as paraphrase granularity. [sent-379, score-0.453]
71 The agreement for the distinction between the paraphrase categories and irrelevant instances reaches a level of κ = 0. [sent-384, score-0.186]
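The kappa statistic referred to here is presumably the standard chance-corrected agreement, with $p_o$ the observed agreement and $p_e$ the agreement expected by chance:

```latex
\kappa = \frac{p_o - p_e}{1 - p_e}
```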
72 Unlike previous approaches to fragment extraction, we do not evaluate grammaticality, given that the VP-fragment method implicitly constrains the output fragments to be complete phrases. [sent-388, score-0.416]
73 Configurations & Results We take the output of the sentence matching system MSA+VEC as input for paraphrase fragment extraction. [sent-389, score-0.798]
74 5, our core fragment module uses the word-word alignments provided by GIZA++ and uses a chunker for fragment extraction. [sent-391, score-0.587]
75 In addition, we evaluate the influence of coreference resolution by preprocessing the input to the best performing configuration with pronoun resolution (COREF). [sent-393, score-0.322]
76 We mainly compute precision for this task, as the recall of paraphrase fragments is difficult to define. [sent-394, score-0.591]
77 It is defined as the ratio between the number of resulting fragment pairs and the number of sentence pairs used as input. [sent-396, score-0.389]
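In formula form, this definition of productivity reads:

```latex
\text{productivity} = \frac{\#\,\text{extracted fragment pairs}}{\#\,\text{input sentence pairs}}
```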
78 Enhancing the system with coreference resolution raises the score even further. [sent-410, score-0.213]
79 (2008) extracts paraphrase fragments from bilingual parallel corpora and reaches a precision of 0. [sent-413, score-0.707]
80 As a final comparison, we show how the performance of the sentence matching methods directly affects the fragment extraction. [sent-417, score-0.365]
81 We use the VP-based fragment extraction system (VP), and compare the performances by using either the outputs from our main system (MSA+VP) or alternatively the baseline that replaces MSA with a clustering algorithm (CLUSTER+VP). [sent-418, score-0.424]
82 Table 3: Impact of MSA on fragment extraction. As shown in Tab. [sent-424, score-0.326]
83 01 good fragment pairs per matched sentence pair, and the final system extracts 0. [sent-427, score-0.441]
84 Those numbers show that for any application that acquires paraphrases of arbitrary granularity, sequential event information provides an invaluable source to achieve a lean paraphrasing method with high precision. [sent-429, score-0.357]
85 3 shows exemplary results from our system pipeline, using the VP–FRAGMENTS method with full coreference resolution on the sentence pairs extracted by MSA. [sent-432, score-0.32]
86 The results reflect the importance of discourse information for this task: sentences are correctly matched in spite of not having much descriptive content in common. Example from Figure 3 (Sentence 1 [with fragment 1] / Sentence 2 [with fragment 2]), row 1: Taub meets House for dinner and claims [that Rachel had a pottery class]. [sent-433, score-0.807]
87 Figure 3: Example results; fragments extracted from aligned sentences are bracketed and emphasized. [sent-449, score-0.186]
88 Additionally, the coreference resolution allows us to match Rachel (1) and Wilson (5) to the correct corresponding pronouns. [sent-453, score-0.178]
89 All examples show that this technique of sentence matching could even help to improve coreference resolution, because we can easily identify Cameron with his wife, Lydia with the respective pronouns, Nash with The Patient or the nickname Thirteen with Hadley, the character’s actual name. [sent-454, score-0.265]
90 7 Conclusion and Future Work We presented our work on paraphrase extraction using discourse information, on a corpus consisting of recaps of TV show episodes. [sent-455, score-0.723]
91 Our approach first uses MSA to extract sentential paraphrases, which are then further processed to compute finer-grained paraphrase fragments using dependency trees and pronoun resolution. [sent-456, score-0.669]
92 The experimental results show great advantages from using discourse information, beating informed baselines and performing competitively with state-of-the-art systems. [sent-457, score-0.234]
93 This can also help to define the fragment boundaries more clearly. [sent-459, score-0.278]
94 In a more advanced step, we will also use the aligned paraphrases to help resolve discourse structure, e. [sent-461, score-0.47]
95 In a long-term view, it would be interesting to see how aligned discourse trees could help to extract paraphrases from arbitrary parallel text. [sent-464, score-0.585]
96 Extracting lay paraphrases of specialized expressions from monolingual comparable medical corpora. [sent-533, score-0.331]
97 Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. [sent-548, score-0.484]
98 Learning sentential paraphrases from bilingual parallel corpora for text-to-text generation. [sent-565, score-0.409]
99 Stanford’s multi-pass sieve coreference resolution system at the CoNLL-2011 shared task. [sent-584, score-0.213]
100 Generative models of noisy translations with applications to parallel fragment extraction. [sent-623, score-0.364]
wordName wordTfidf (topN-words)
[('msa', 0.524), ('paraphrase', 0.398), ('fragment', 0.278), ('paraphrases', 0.254), ('house', 0.18), ('discourse', 0.168), ('vec', 0.141), ('fragments', 0.138), ('alignment', 0.113), ('recap', 0.109), ('recaps', 0.109), ('coreference', 0.098), ('paraphrasecoll', 0.094), ('parallel', 0.086), ('resolution', 0.08), ('foreman', 0.078), ('taub', 0.078), ('sentential', 0.069), ('barzilay', 0.069), ('vp', 0.064), ('costs', 0.063), ('tells', 0.063), ('regneri', 0.063), ('thirteen', 0.061), ('episodes', 0.061), ('productivity', 0.061), ('bleu', 0.06), ('dolan', 0.06), ('event', 0.058), ('cluster', 0.058), ('precision', 0.055), ('matched', 0.052), ('chris', 0.049), ('extraction', 0.048), ('aligned', 0.048), ('comparable', 0.048), ('summaries', 0.047), ('cgap', 0.047), ('hadley', 0.047), ('munteanu', 0.047), ('matching', 0.046), ('quirk', 0.045), ('paraphrasing', 0.045), ('msr', 0.042), ('lcs', 0.042), ('shinyama', 0.042), ('sentence', 0.041), ('rui', 0.04), ('thater', 0.04), ('cameron', 0.04), ('texts', 0.038), ('manfred', 0.036), ('rachel', 0.036), ('overlap', 0.036), ('giza', 0.036), ('baselines', 0.036), ('pairs', 0.035), ('pronoun', 0.035), ('system', 0.035), ('similarity', 0.035), ('szpektor', 0.034), ('restrictive', 0.034), ('agreement', 0.034), ('labelled', 0.033), ('unrelated', 0.033), ('mt', 0.033), ('gap', 0.033), ('wilson', 0.032), ('wang', 0.032), ('tv', 0.032), ('chunker', 0.031), ('cken', 0.031), ('dinner', 0.031), ('dinu', 0.031), ('durbin', 0.031), ('exemplary', 0.031), ('interchangeable', 0.031), ('lydia', 0.031), ('michaela', 0.031), ('morphine', 0.031), ('nash', 0.031), ('needleman', 0.031), ('season', 0.031), ('tomadaki', 0.031), ('tucker', 0.031), ('mckeown', 0.031), ('similarities', 0.03), ('reaches', 0.03), ('informed', 0.03), ('pivot', 0.03), ('extracting', 0.03), ('pipeline', 0.029), ('cm', 0.029), ('entailment', 0.029), ('extract', 0.029), ('monolingual', 0.029), ('preprocessing', 0.029), ('irrelevant', 0.028), ('stefan', 0.028), ('clustering', 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999785 135 emnlp-2012-Using Discourse Information for Paraphrase Extraction
Author: Michaela Regneri ; Rui Wang
Abstract: Previous work on paraphrase extraction using parallel or comparable corpora has generally not considered the documents’ discourse structure as a useful information source. We propose a novel method for collecting paraphrases relying on the sequential event order in the discourse, using multiple sequence alignment with a semantic similarity measure. We show that adding discourse information boosts the performance of sentence-level paraphrase acquisition, which consequently gives a tremendous advantage for extracting phrase-level paraphrase fragments from matched sentences. Our system beats an informed baseline by a margin of 50%.
2 0.40558675 58 emnlp-2012-Generalizing Sub-sentential Paraphrase Acquisition across Original Signal Type of Text Pairs
Author: Aurelien Max ; Houda Bouamor ; Anne Vilnat
Abstract: This paper describes a study on the impact of the original signal (text, speech, visual scene, event) of a text pair on the task of both manual and automatic sub-sentential paraphrase acquisition. A corpus of 2,500 annotated sentences in English and French is described, and performance on this corpus is reported for an efficient system combination exploiting a large set of features for paraphrase recognition. A detailed quantified typology of subsentential paraphrases found in our corpus types is given.
3 0.35527349 39 emnlp-2012-Enlarging Paraphrase Collections through Generalization and Instantiation
Author: Atsushi Fujita ; Pierre Isabelle ; Roland Kuhn
Abstract: This paper presents a paraphrase acquisition method that uncovers and exploits generalities underlying paraphrases: paraphrase patterns are first induced and then used to collect novel instances. Unlike existing methods, ours uses both bilingual parallel and monolingual corpora. While the former are regarded as a source of high-quality seed paraphrases, the latter are searched for paraphrases that match patterns learned from the seed paraphrases. We show how one can use monolingual corpora, which are far more numerous and larger than bilingual corpora, to obtain paraphrases that rival in quality those derived directly from bilingual corpora. In our experiments, the number of paraphrase pairs obtained in this way from monolingual corpora was a large multiple of the number of seed paraphrases. Human evaluation through a paraphrase substitution test demonstrated that the newly acquired paraphrase pairs are of reasonable quality. Remaining noise can be further reduced by filtering seed paraphrases.
4 0.2074708 16 emnlp-2012-Aligning Predicates across Monolingual Comparable Texts using Graph-based Clustering
Author: Michael Roth ; Anette Frank
Abstract: Generating coherent discourse is an important aspect in natural language generation. Our aim is to learn factors that constitute coherent discourse from data, with a focus on how to realize predicate-argument structures in a model that exceeds the sentence level. We present an important subtask for this overall goal, in which we align predicates across comparable texts, admitting partial argument structure correspondence. The contribution of this work is two-fold: We first construct a large corpus resource of comparable texts, including an evaluation set with manual predicate alignments. Secondly, we present a novel approach for aligning predicates across comparable texts using graph-based clustering with Mincuts. Our method significantly outperforms other alignment techniques when applied to this novel alignment task, by a margin of at least 6.5 percentage points in F1-score.
5 0.14258482 137 emnlp-2012-Why Question Answering using Sentiment Analysis and Word Classes
Author: Jong-Hoon Oh ; Kentaro Torisawa ; Chikara Hashimoto ; Takuya Kawada ; Stijn De Saeger ; Jun'ichi Kazama ; Yiou Wang
Abstract: In this paper we explore the utility of sentiment analysis and semantic word classes for improving why-question answering on a large-scale web corpus. Our work is motivated by the observation that a why-question and its answer often follow the pattern that if something undesirable happens, the reason is also often something undesirable, and if something desirable happens, the reason is also often something desirable. To the best of our knowledge, this is the first work that introduces sentiment analysis to non-factoid question answering. We combine this simple idea with semantic word classes for ranking answers to why-questions and show that on a set of 850 why-questions our method gains 15.2% improvement in precision at the top-1 answer over a baseline state-of-the-art QA system that achieved the best performance in a shared task of Japanese non-factoid QA in NTCIR-6.
6 0.13553201 71 emnlp-2012-Joint Entity and Event Coreference Resolution across Documents
7 0.11316119 74 emnlp-2012-Language Model Rest Costs and Space-Efficient Storage
8 0.10655542 4 emnlp-2012-A Comparison of Vector-based Representations for Semantic Composition
9 0.089516841 7 emnlp-2012-A Novel Discriminative Framework for Sentence-Level Discourse Analysis
10 0.087243013 94 emnlp-2012-Multiple Aspect Summarization Using Integer Linear Programming
11 0.080479644 80 emnlp-2012-Learning Verb Inference Rules from Linguistically-Motivated Evidence
12 0.078629822 112 emnlp-2012-Resolving Complex Cases of Definite Pronouns: The Winograd Schema Challenge
13 0.076512173 72 emnlp-2012-Joint Inference for Event Timeline Construction
14 0.074628934 109 emnlp-2012-Re-training Monolingual Parser Bilingually for Syntactic SMT
15 0.074490339 73 emnlp-2012-Joint Learning for Coreference Resolution with Markov Logic
16 0.068552263 108 emnlp-2012-Probabilistic Finite State Machines for Regression-based MT Evaluation
17 0.067403823 3 emnlp-2012-A Coherence Model Based on Syntactic Patterns
18 0.066203877 127 emnlp-2012-Transforming Trees to Improve Syntactic Convergence
19 0.053415697 54 emnlp-2012-Forced Derivation Tree based Model Training to Statistical Machine Translation
20 0.052428011 76 emnlp-2012-Learning-based Multi-Sieve Co-reference Resolution with Knowledge
topicId topicWeight
[(0, 0.252), (1, 0.05), (2, -0.421), (3, -0.041), (4, 0.23), (5, 0.307), (6, 0.059), (7, -0.231), (8, -0.139), (9, 0.071), (10, 0.133), (11, -0.013), (12, 0.054), (13, -0.05), (14, 0.057), (15, 0.07), (16, -0.001), (17, 0.059), (18, -0.076), (19, 0.03), (20, 0.058), (21, 0.009), (22, 0.009), (23, -0.04), (24, 0.055), (25, -0.017), (26, 0.069), (27, 0.069), (28, 0.027), (29, -0.031), (30, 0.008), (31, -0.025), (32, 0.045), (33, -0.047), (34, 0.029), (35, 0.008), (36, 0.02), (37, -0.02), (38, 0.035), (39, -0.029), (40, -0.012), (41, -0.034), (42, 0.067), (43, 0.005), (44, -0.025), (45, 0.051), (46, 0.067), (47, -0.032), (48, 0.018), (49, 0.016)]
simIndex simValue paperId paperTitle
same-paper 1 0.95462424 135 emnlp-2012-Using Discourse Information for Paraphrase Extraction
Author: Michaela Regneri ; Rui Wang
Abstract: Previous work on paraphrase extraction using parallel or comparable corpora has generally not considered the documents’ discourse structure as a useful information source. We propose a novel method for collecting paraphrases relying on the sequential event order in the discourse, using multiple sequence alignment with a semantic similarity measure. We show that adding discourse information boosts the performance of sentence-level paraphrase acquisition, which consequently gives a tremendous advantage for extracting phrase-level paraphrase fragments from matched sentences. Our system beats an informed baseline by a margin of 50%.
Author: Aurelien Max ; Houda Bouamor ; Anne Vilnat
Abstract: This paper describes a study on the impact of the original signal (text, speech, visual scene, event) of a text pair on the task of both manual and automatic sub-sentential paraphrase acquisition. A corpus of 2,500 annotated sentences in English and French is described, and performance on this corpus is reported for an efficient system combination exploiting a large set of features for paraphrase recognition. A detailed quantified typology of subsentential paraphrases found in our corpus types is given.
3 0.84679955 39 emnlp-2012-Enlarging Paraphrase Collections through Generalization and Instantiation
Author: Atsushi Fujita ; Pierre Isabelle ; Roland Kuhn
Abstract: This paper presents a paraphrase acquisition method that uncovers and exploits generalities underlying paraphrases: paraphrase patterns are first induced and then used to collect novel instances. Unlike existing methods, ours uses both bilingual parallel and monolingual corpora. While the former are regarded as a source of high-quality seed paraphrases, the latter are searched for paraphrases that match patterns learned from the seed paraphrases. We show how one can use monolingual corpora, which are far more numerous and larger than bilingual corpora, to obtain paraphrases that rival in quality those derived directly from bilingual corpora. In our experiments, the number of paraphrase pairs obtained in this way from monolingual corpora was a large multiple of the number of seed paraphrases. Human evaluation through a paraphrase substitution test demonstrated that the newly acquired paraphrase pairs are of reasonable quality. Remaining noise can be further reduced by filtering seed paraphrases.
4 0.5895704 16 emnlp-2012-Aligning Predicates across Monolingual Comparable Texts using Graph-based Clustering
Author: Michael Roth ; Anette Frank
Abstract: Generating coherent discourse is an important aspect in natural language generation. Our aim is to learn factors that constitute coherent discourse from data, with a focus on how to realize predicate-argument structures in a model that exceeds the sentence level. We present an important subtask for this overall goal, in which we align predicates across comparable texts, admitting partial argument structure correspondence. The contribution of this work is two-fold: We first construct a large corpus resource of comparable texts, including an evaluation set with manual predicate alignments. Secondly, we present a novel approach for aligning predicates across comparable texts using graph-based clustering with Mincuts. Our method significantly outperforms other alignment techniques when applied to this novel alignment task, by a margin of at least 6.5 percentage points in F1-score.
5 0.29137641 71 emnlp-2012-Joint Entity and Event Coreference Resolution across Documents
Author: Heeyoung Lee ; Marta Recasens ; Angel Chang ; Mihai Surdeanu ; Dan Jurafsky
Abstract: We introduce a novel coreference resolution system that models entities and events jointly. Our iterative method cautiously constructs clusters of entity and event mentions using linear regression to model cluster merge operations. As clusters are built, information flows between entity and event clusters through features that model semantic role dependencies. Our system handles nominal and verbal events as well as entities, and our joint formulation allows information from event coreference to help entity coreference, and vice versa. In a cross-document domain with comparable documents, joint coreference resolution performs significantly better (over 3 CoNLL F1 points) than two strong baselines that resolve entities and events separately.
6 0.28454313 7 emnlp-2012-A Novel Discriminative Framework for Sentence-Level Discourse Analysis
7 0.27462891 108 emnlp-2012-Probabilistic Finite State Machines for Regression-based MT Evaluation
8 0.26601261 74 emnlp-2012-Language Model Rest Costs and Space-Efficient Storage
9 0.26358068 3 emnlp-2012-A Coherence Model Based on Syntactic Patterns
10 0.24362351 137 emnlp-2012-Why Question Answering using Sentiment Analysis and Word Classes
11 0.24323516 112 emnlp-2012-Resolving Complex Cases of Definite Pronouns: The Winograd Schema Challenge
12 0.23536599 94 emnlp-2012-Multiple Aspect Summarization Using Integer Linear Programming
13 0.23275697 73 emnlp-2012-Joint Learning for Coreference Resolution with Markov Logic
14 0.2284611 4 emnlp-2012-A Comparison of Vector-based Representations for Semantic Composition
15 0.22339526 118 emnlp-2012-Source Language Adaptation for Resource-Poor Machine Translation
16 0.22202051 80 emnlp-2012-Learning Verb Inference Rules from Linguistically-Motivated Evidence
17 0.22156718 26 emnlp-2012-Building a Lightweight Semantic Model for Unsupervised Information Extraction on Short Listings
18 0.21157208 54 emnlp-2012-Forced Derivation Tree based Model Training to Statistical Machine Translation
19 0.20257691 109 emnlp-2012-Re-training Monolingual Parser Bilingually for Syntactic SMT
20 0.20244975 72 emnlp-2012-Joint Inference for Event Timeline Construction
topicId topicWeight
[(2, 0.014), (11, 0.012), (14, 0.012), (16, 0.036), (25, 0.015), (34, 0.058), (45, 0.314), (60, 0.144), (63, 0.057), (64, 0.034), (65, 0.021), (70, 0.023), (73, 0.013), (74, 0.054), (76, 0.042), (80, 0.027), (86, 0.025), (95, 0.025)]
simIndex simValue paperId paperTitle
1 0.86161542 4 emnlp-2012-A Comparison of Vector-based Representations for Semantic Composition
Author: William Blacoe ; Mirella Lapata
Abstract: In this paper we address the problem of modeling compositional meaning for phrases and sentences using distributional methods. We experiment with several possible combinations of representation and composition, exhibiting varying degrees of sophistication. Some are shallow while others operate over syntactic structure, rely on parameter learning, or require access to very large corpora. We find that shallow approaches are as good as more computationally intensive alternatives with regards to two particular tests: (1) phrase similarity and (2) paraphrase detection. The sizes of the involved training corpora and the generated vectors are not as important as the fit between the meaning representation and compositional method.
same-paper 2 0.80047882 135 emnlp-2012-Using Discourse Information for Paraphrase Extraction
Author: Michaela Regneri ; Rui Wang
Abstract: Previous work on paraphrase extraction using parallel or comparable corpora has generally not considered the documents’ discourse structure as a useful information source. We propose a novel method for collecting paraphrases relying on the sequential event order in the discourse, using multiple sequence alignment with a semantic similarity measure. We show that adding discourse information boosts the performance of sentence-level paraphrase acquisition, which consequently gives a tremendous advantage for extracting phrase-level paraphrase fragments from matched sentences. Our system beats an informed baseline by a margin of 50%.
3 0.75508136 130 emnlp-2012-Unambiguity Regularization for Unsupervised Learning of Probabilistic Grammars
Author: Kewei Tu ; Vasant Honavar
Abstract: We introduce a novel approach named unambiguity regularization for unsupervised learning of probabilistic natural language grammars. The approach is based on the observation that natural language is remarkably unambiguous in the sense that only a tiny portion of the large number of possible parses of a natural language sentence are syntactically valid. We incorporate an inductive bias into grammar learning in favor of grammars that lead to unambiguous parses on natural language sentences. The resulting family of algorithms includes the expectation-maximization algorithm (EM) and its variant, Viterbi EM, as well as a so-called softmax-EM algorithm. The softmax-EM algorithm can be implemented with a simple and computationally efficient extension to standard EM. In our experiments of unsupervised dependency grammar learning, we show that unambiguity regularization is beneficial to learning, and in combination with annealing (of the regularization strength) and sparsity priors it leads to improvement over the current state of the art.
4 0.74438894 138 emnlp-2012-Wiki-ly Supervised Part-of-Speech Tagging
Author: Shen Li ; Joao Graca ; Ben Taskar
Abstract: Despite significant recent work, purely unsupervised techniques for part-of-speech (POS) tagging have not achieved useful accuracies required by many language processing tasks. Use of parallel text between resource-rich and resource-poor languages is one source of weak supervision that significantly improves accuracy. However, parallel text is not always available and techniques for using it require multiple complex algorithmic steps. In this paper we show that we can build POS-taggers exceeding state-of-the-art bilingual methods by using simple hidden Markov models and a freely available and naturally growing resource, the Wiktionary. Across eight languages for which we have labeled data to evaluate results, we achieve accuracy that significantly exceeds best unsupervised and parallel text methods. We achieve highest accuracy reported for several languages and show that our approach yields better out-of-domain taggers than those trained using fully supervised Penn Treebank.
5 0.57015026 129 emnlp-2012-Type-Supervised Hidden Markov Models for Part-of-Speech Tagging with Incomplete Tag Dictionaries
Author: Dan Garrette ; Jason Baldridge
Abstract: Past work on learning part-of-speech taggers from tag dictionaries and raw data has reported good results, but the assumptions made about those dictionaries are often unrealistic: due to historical precedents, they assume access to information about labels in the raw and test sets. Here, we demonstrate ways to learn hidden Markov model taggers from incomplete tag dictionaries. Taking the MIN-GREEDY algorithm (Ravi et al., 2010) as a starting point, we improve it with several intuitive heuristics. We also define a simple HMM emission initialization that takes advantage of the tag dictionary and raw data to capture both the openness of a given tag and its estimated prevalence in the raw data. Altogether, our augmentations produce improvements to performance over the original MIN-GREEDY algorithm for both English and Italian data.
6 0.54322994 123 emnlp-2012-Syntactic Transfer Using a Bilingual Lexicon
7 0.5367862 61 emnlp-2012-Grounded Models of Semantic Representation
8 0.53501409 39 emnlp-2012-Enlarging Paraphrase Collections through Generalization and Instantiation
9 0.53475398 22 emnlp-2012-Automatically Constructing a Normalisation Dictionary for Microblogs
10 0.5338276 124 emnlp-2012-Three Dependency-and-Boundary Models for Grammar Induction
11 0.52789348 136 emnlp-2012-Weakly Supervised Training of Semantic Parsers
12 0.52662688 24 emnlp-2012-Biased Representation Learning for Domain Adaptation
13 0.52603889 107 emnlp-2012-Polarity Inducing Latent Semantic Analysis
14 0.52209705 3 emnlp-2012-A Coherence Model Based on Syntactic Patterns
15 0.52087462 23 emnlp-2012-Besting the Quiz Master: Crowdsourcing Incremental Classification Games
16 0.51869738 81 emnlp-2012-Learning to Map into a Universal POS Tagset
17 0.51866442 52 emnlp-2012-Fast Large-Scale Approximate Graph Construction for NLP
18 0.51712084 108 emnlp-2012-Probabilistic Finite State Machines for Regression-based MT Evaluation
19 0.51217574 118 emnlp-2012-Source Language Adaptation for Resource-Poor Machine Translation
20 0.51183236 20 emnlp-2012-Answering Opinion Questions on Products by Exploiting Hierarchical Organization of Consumer Reviews