acl acl2011 acl2011-225 knowledge-graph by maker-knowledge-mining

225 acl-2011-Monolingual Alignment by Edit Rate Computation on Sentential Paraphrase Pairs


Source: pdf

Author: Houda Bouamor ; Aurelien Max ; Anne Vilnat

Abstract: In this paper, we present a novel way of tackling the monolingual alignment problem on pairs of sentential paraphrases by means of edit rate computation. In order to inform the edit rate, information in the form of subsentential paraphrases is provided by a range of techniques built for different purposes. We show that the tunable TER-PLUS metric from Machine Translation evaluation can achieve good performance on this task and that it can effectively exploit information coming from complementary sources.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract In this paper, we present a novel way of tackling the monolingual alignment problem on pairs of sentential paraphrases by means of edit rate computation. [sent-4, score-0.917]

2 In order to inform the edit rate, information in the form of subsentential paraphrases is provided by a range of techniques built for different purposes. [sent-5, score-0.68]

3 We show that the tunable TER-PLUS metric from Machine Translation evaluation can achieve good performance on this task and that it can effectively exploit information coming from complementary sources. [sent-6, score-0.111]

4 1 Introduction The acquisition of subsentential paraphrases has attracted a lot of attention recently (Madnani and Dorr, 2010). [sent-7, score-0.506]

5 These approaches face two main issues, which correspond to the typical measures of precision, or how appropriate the extracted paraphrases are, and of recall, or how many of the paraphrases present in a given corpus can be found effectively. [sent-9, score-0.574]

6 To start with, both measures are often hard to compute in practice, as 1) the definition of what makes an acceptable paraphrase pair is still a research question, and 2) it is often impractical to extract a complete set of acceptable paraphrases from most resources. [sent-10, score-1.072]

7 Second, as regards the precision of paraphrase acquisition techniques in particular, it is notable that most works on paraphrase acquisition are not based on direct observation of larger paraphrase pairs. [sent-11, score-2.347]

8 Even monolingual corpora obtained by pairing very closely related texts such as news headlines on the same topic and from the same time frame (Dolan et al. [sent-12, score-0.132]

9 , 2004) often contain unrelated segments that should not be aligned to form a subsentential paraphrase pair. [sent-13, score-0.83]

10 Using bilingual corpora to acquire paraphrases indirectly by pivoting through other languages is faced, in particular, with the issue of phrase polysemy, both in the source and in the pivot languages. [sent-14, score-0.408]

11 It has previously been noted that highly parallel monolingual corpora, typically obtained via multiple translation into the same language, constitute the most appropriate type of corpus for extracting high quality paraphrases, in spite of their rareness (Barzilay and McKeown, 2001; Cohn et al. [sent-15, score-0.231]

12 We build on this claim here to propose an original approach for the task of subsentential alignment based on the computation of a minimum edit rate between two sentential paraphrases. [sent-18, score-0.674]

13 More precisely, we concentrate on the alignment of atomic paraphrase pairs (Cohn et al. [sent-19, score-0.993]

14 , 2008), where the words from both paraphrases are aligned as a whole to the words of the other paraphrase, as opposed to composite paraphrase pairs obtained by joining together adjacent paraphrase pairs or possibly adding unaligned words. [sent-20, score-1.919]

15 Figure 1 provides examples of atomic paraphrase pairs derived from a word alignment between two English sentential paraphrases. [sent-21, score-1.182]

16 China ↔ China) will never be considered in this work. [sent-27, score-0.023]

17 We first briefly describe in section 2 how we apply edit rate computation to the task of atomic paraphrase alignment, and we explain in section 3 how we can inform such a technique with paraphrase candidates extracted by additional techniques. [sent-31, score-1.903]

18 2 Edit rate for paraphrase alignment TER-PLUS (Translation Edit Rate Plus) (Snover et al. [sent-33, score-0.829]

19 Its typical use takes a system hypothesis to compute an optimal set of word edits that can transform it into some existing reference translation. [sent-35, score-0.062]

20 Edit types include exact word matching, word insertion and deletion, block movement of contiguous words (computed as an approximation), as well as variants substitution through stemming, synonym or paraphrase matching. [sent-36, score-0.743]

21 Each edit type is parameterized by at least one weight which can be optimized using e. [sent-37, score-0.151]
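The weighted edit types described above can be illustrated with a minimal sketch. It is a plain word-level dynamic program with match, substitution, insertion and deletion only; it does not implement block shifts, stemming, synonym or paraphrase matching, so it is not TER-PLUS itself. The function name `edit_alignment` and the cost values in `COSTS` are assumptions made for illustration, not tuned weights from the paper.

```python
from typing import Dict, List, Tuple

# Hypothetical edit costs (illustrative values, not tuned TER-PLUS weights).
COSTS: Dict[str, float] = {"match": 0.0, "sub": 1.0, "ins": 1.0, "del": 1.0}


def edit_alignment(
    hyp: List[str], ref: List[str], costs: Dict[str, float] = COSTS
) -> Tuple[float, List[Tuple[str, str, str]]]:
    """Return the minimum weighted edit cost and the aligned word pairs."""
    n, m = len(hyp), len(ref)
    dist = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[""] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = dist[i - 1][0] + costs["del"]
        back[i][0] = "del"
    for j in range(1, m + 1):
        dist[0][j] = dist[0][j - 1] + costs["ins"]
        back[0][j] = "ins"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            op = "match" if hyp[i - 1] == ref[j - 1] else "sub"
            candidates = [
                (dist[i - 1][j - 1] + costs[op], op),
                (dist[i - 1][j] + costs["del"], "del"),
                (dist[i][j - 1] + costs["ins"], "ins"),
            ]
            # On ties, the first candidate (match/substitution) is preferred.
            dist[i][j], back[i][j] = min(candidates, key=lambda c: c[0])
    # Walk the backpointers to collect the aligned word pairs.
    pairs: List[Tuple[str, str, str]] = []
    i, j = n, m
    while i > 0 or j > 0:
        op = back[i][j]
        if op in ("match", "sub"):
            pairs.append((hyp[i - 1], ref[j - 1], op))
            i -= 1
            j -= 1
        elif op == "del":
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return dist[n][m], pairs
```

The pairs tagged `sub` are the by-product of interest here: they are the word-level correspondences that an edit rate computation leaves behind, which is what makes such a metric usable as an aligner.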

22 We will henceforth designate as TERMT the TER metric (basically, without variants matching) optimized for correlation with human judgment of accuracy in MT evaluation, which is to date one of the most used metrics for this task. [sent-41, score-0.081]

23 While this metric was not designed explicitly for the acquisition of word alignments, it produces as a by-product of its approximate search a list of alignments involving either individual words or phrases, potentially fitting with the previous definition of atomic paraphrase pairs. [sent-42, score-1.071]

24 When applied to an MT system hypothesis and a reference translation, it computes how much effort would be needed to obtain the reference from the hypothesis, possibly independently of the appropriateness of the alignments produced. [sent-43, score-0.225]

25 However, if we consider instead a pair of sentential paraphrases, it can be used to reveal what subsentential units can be aligned. [sent-44, score-0.35]

26 Intuitively, the more parallel two sentential paraphrases are, the more atomic paraphrase pairs will be reliably found, and the easier it will be for TER-PLUS to correctly identify the remaining pairs. [sent-47, score-1.458]

27 But in the general case, and considering less apparently parallel sentence pairs, its work can be facilitated by the incorporation of candidate paraphrase pairs in its paraphrase table. [sent-48, score-1.524]

28 We consider this possible type of hybridation in the next section. [sent-49, score-0.104]

29 3 Informing edit rate computation with other techniques In this article, we use three baseline techniques for paraphrase pair acquisition, which we will only briefly introduce (see (Bouamor et al. [sent-50, score-1.138]

30 As explained previously, we want to evaluate whether and how their candidate paraphrase pairs can be used to improve paraphrase acquisition on sentential paraphrases using TER-PLUS. [sent-52, score-2.04]

31 We selected these three techniques for the complementarity of types of information that they use: statistical word alignment without a priori linguistic knowledge, symbolic expression of linguistic variation exploiting a priori linguistic knowledge, and syntactic similarity. [sent-53, score-0.278]

32 Statistical Word Alignment The GIZA++ tool (Och and Ney, 2004) computes statistical word alignment models of increasing complexity from parallel corpora. [sent-54, score-0.141]

33 While originally developed in the bilingual context of Machine Translation, nothing prevents building such models on monolingual corpora. [sent-55, score-0.165]

34 This constitutes an advantage for this technique that the following techniques working on each sentence pair independently do not have. [sent-58, score-0.17]

35 Symbolic expression of linguistic variation The FASTR tool (Jacquemin, 1999) was designed to spot term variants in large corpora. [sent-59, score-0.076]

36 Variants are described through metarules expressing how the morphosyntactic structure of a term variant can be derived from a given term by means of regular expressions on word categories. [sent-60, score-0.05]

37 Paradigmatic variation can also be expressed by defining constraints between words to force them to belong to the same morphological or semantic family, both constraints relying on preexisting repertoires available for English and French. [sent-61, score-0.023]

38 To compute candidate paraphrase pairs using FASTR, we first consider all the phrases from the first sentence and search for variants in the other sentence, do the reverse process and take the intersection of the two sets. [sent-62, score-0.841]
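The bidirectional search and intersection described above can be sketched as follows. This does not call FASTR: the variant finder is passed in as a function, and both `bidirectional_candidates` and the toy finder `toy_variants` (which treats phrases sharing their first four letters as variants) are hypothetical names introduced only for illustration.

```python
from typing import Callable, List, Set, Tuple

Pair = Tuple[str, str]
VariantFinder = Callable[[str, List[str]], List[str]]


def bidirectional_candidates(
    phrases_a: List[str], phrases_b: List[str], find_variants: VariantFinder
) -> Set[Pair]:
    """Keep only the pairs found when searching in both directions."""
    # Search variants of each phrase of sentence A among the phrases of B.
    forward = {(p, v) for p in phrases_a for v in find_variants(p, phrases_b)}
    # Reverse process; pairs are flipped so both sets use (A-phrase, B-phrase).
    backward = {(v, p) for p in phrases_b for v in find_variants(p, phrases_a)}
    return forward & backward


def toy_variants(phrase: str, candidates: List[str]) -> List[str]:
    """Toy stand-in for a variant finder: same first four letters."""
    key = phrase[:4].lower()
    return [c for c in candidates if c != phrase and c[:4].lower() == key]
```

Taking the intersection trades recall for precision: a pair survives only if each side is recognized as a variant of the other.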

39 (2003) takes two sentences as input and merges them by top-down syntactic fusion guided by compatible syntactic substructure. [sent-64, score-0.03]

40 A lexical blocking mechanism prevents sentence constituents from fusing when there is evidence of the presence of a word in another constituent of one of the sentences. [sent-65, score-0.027]

41 Because this process is highly sensitive to syntactic parse errors, we use k-best parses (with k = 3 in our experiments) and retain the most compact fusion from any pair of candidate parses. [sent-68, score-0.076]

42 (2008) for constructing evaluation corpora and assessing the performance of various techniques on the task of paraphrase acquisition. [sent-70, score-0.822]

43 Techniques output word alignments from which atomic [sent-72, score-0.079]

44 candidate paraphrase pairs, denoted as Hatom, as well as composite paraphrase pairs, denoted as H, can be extracted. [sent-73, score-1.475]

45 In each case, a held-out development corpus of 150 paraphrase pairs was used for tuning the TERP hybrid systems towards precision (→ p), recall (→ r), or F-measure (→ f1). [sent-75, score-0.852]

46 All techniques were evaluated on the same test set consisting of 375 paraphrase pairs. [sent-76, score-0.768]

47 We used as our reference set both the alignments marked as “Sure” and “Possible”. [sent-81, score-0.113]
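The three tuning criteria (precision, recall, F-measure) can be illustrated with a generic set-overlap computation. This is a minimal sketch under a simplifying assumption: candidate and reference pairs are compared as plain sets, whereas the exact definitions used in the paper (which distinguish atomic and composite pair sets) are not given in this extract. The name `precision_recall_f1` is hypothetical.

```python
from typing import Set, Tuple

Pair = Tuple[str, str]


def precision_recall_f1(
    hypothesis: Set[Pair], reference: Set[Pair]
) -> Tuple[float, float, float]:
    """Set-overlap precision, recall and balanced F-measure."""
    correct = len(hypothesis & reference)
    precision = correct / len(hypothesis) if hypothesis else 0.0
    recall = correct / len(reference) if reference else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Empty inputs return 0.0 rather than raising, which is a design choice of this sketch so that sentence pairs with no extracted candidates can still be scored.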

48 Figure 2: Results on the test set on French and English for the individual techniques and TERP hybrid systems. [sent-90, score-0.143]

49 Column headers of the form “→ c” indicate that TERP was tuned on criterion c. [sent-91, score-0.024]

50 figures reveal that the French corpus tends to contain more literal translations, possibly due to the original languages of the sentences, which are closer to the target language than Chinese is to English. [sent-92, score-0.064]

51 We used the YAWAT (Germann, 2008) interactive alignment tool and measured inter-annotator agreement over a subset and found it to be similar to the value reported by Cohn et al. [sent-93, score-0.076]

52 Results for all individual techniques in the two languages are given in Figure 2. [sent-95, score-0.115]

53 We first note that all techniques fared better on the French corpus than on the English corpus. [sent-96, score-0.076]

54 This can certainly be explained by the fact that the former results from more literal translations, which are consequently easier to word-align. [sent-97, score-0.057]

55 TER tuned for Machine Translation evaluation) performs significantly worse on all metrics for both languages than our tuned TERP experiments, revealing that the two tasks have different objectives. [sent-100, score-0.048]

56 GIZA++ and TERPpara perform in the same range, with acceptable precision and recall, TERPpara performing overall better, with e. [sent-102, score-0.068]

57 Recall that TERP works independently on each paraphrase pair, while GIZA++ makes use of 398 artificial repetitions of paraphrases of the same sentence. [sent-107, score-1.003]

58 Figure 3 gives an indication of how well each technique performs depending on the difficulty of the task, which we estimate here as the value (1 − TER(para1, para2)), whose low values correspond to sentences which are costly to transform into the other using TER. [sent-108, score-0.072]
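The (1 − TER) estimate can be approximated with a short sketch. It assumes a simplified edit rate, namely the word-level Levenshtein distance divided by the length of the second sentence; real TER also allows block shifts, so the values below would only approximate it. The function names `word_edit_distance` and `parallelism` are hypothetical.

```python
from typing import List


def word_edit_distance(a: List[str], b: List[str]) -> int:
    """Word-level Levenshtein distance (no block shifts, unlike real TER)."""
    prev = list(range(len(b) + 1))
    for i, word_a in enumerate(a, 1):
        cur = [i]
        for j, word_b in enumerate(b, 1):
            cur.append(
                min(
                    prev[j] + 1,  # deletion
                    cur[j - 1] + 1,  # insertion
                    prev[j - 1] + (word_a != word_b),  # match or substitution
                )
            )
        prev = cur
    return prev[-1]


def parallelism(para1: str, para2: str) -> float:
    """Approximate (1 - TER), treating para2 as the reference side."""
    ref = para2.split()
    if not ref:
        raise ValueError("para2 must contain at least one word")
    return 1.0 - word_edit_distance(para1.split(), ref) / len(ref)
```

Values close to 1 indicate nearly identical sentences; the value can drop below 0 when the number of edits exceeds the reference length.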

59 Not surprisingly, TERPpara and GIZA++, and PANG to a lesser extent, perform better on “more parallel” sentential paraphrase pairs. [sent-109, score-0.911]

60 Conversely, FASTR is not affected by the degree of parallelism between sentences, and manages to extract synonyms and more generally term variants, at any level of difficulty. [sent-110, score-0.025]

61 We have further tested 4 hybrid configurations by providing TERPpara with the output of the other individual techniques and of their union, the latter simply obtained by taking paraphrase pairs output by at least one of these techniques. [sent-111, score-0.91]
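The union described above, where a pair is kept if at least one technique outputs it, can be sketched with a simple vote count. The `min_votes` parameter is a generalization added here for illustration and is not from the paper: setting it to the number of techniques gives their intersection instead. The name `combine_by_votes` is hypothetical.

```python
from collections import Counter
from typing import Iterable, Set, Tuple

Pair = Tuple[str, str]


def combine_by_votes(outputs: Iterable[Set[Pair]], min_votes: int = 1) -> Set[Pair]:
    """Keep pairs proposed by at least `min_votes` techniques.

    Each element of `outputs` is the set of pairs from one technique,
    so a technique can vote at most once for a given pair.
    """
    votes = Counter(pair for output in outputs for pair in output)
    return {pair for pair, count in votes.items() if count >= min_votes}
```

The resulting set could then be handed to the edit rate computation as additional paraphrase table entries, which is the hybrid setting the text describes.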

62 On French, where individual techniques achieve good performance, any hybridation improves the F-measure over both TERPpara and the technique used, the best performance, using FASTR, corresponding to an improvement of respectively +2. [sent-112, score-0.266]

63 Figure 3: F-measure values for our 4 individual techniques on (a) French and (b) English depending on the complexity of paraphrase pairs measured with the (1-TER) formula (horizontal axis: Difficulty (1-TER)). [sent-124, score-0.882]

64 Note that each value corresponds to the average of F-measure values for test examples falling in a given difficulty range, and that all ranges do not necessarily contain the same number of examples. [sent-125, score-0.025]

65 Successful hybridation on English seems harder to obtain, which may be partly attributed to the poor quality of the individual techniques relative to TERPpara. [sent-127, score-0.219]

66 This confirms that some types of linguistic equivalences cannot be captured using edit rate computation alone, even on this type of corpus. [sent-130, score-0.271]

67 5 Conclusion and future work In this article, we have described the use of edit rate computation for paraphrase alignment at the subsentential level from sentential paraphrases and the possibility of informing this search with paraphrase candidates coming from other techniques. [sent-131, score-2.43]

68 Our experiments have shown that in some circumstances some techniques have a good complementarity and manage to improve results significantly. [sent-132, score-0.111]

69 We are currently studying hard-to-align subsentential paraphrases from the type of corpora we used in order to get a better understanding of the types of knowledge required to improve automatic acquisition of these units. [sent-133, score-0.56]

70 Indeed, measuring the precision on the union yields a poor performance of 23. [sent-134, score-0.07]

71 Similarly, the maximum value for precision with a good recall can be obtained by taking the intersection of the results of TERPpara and GIZA++, which yields a value of 60. [sent-137, score-0.08]

72 Our future work also includes the acquisition of paraphrase patterns (e. [sent-139, score-0.773]

73 , 2008)) to generalize the acquired equivalence units to more contexts, which could be both used in applications and to attempt improving further paraphrase acquisition techniques. [sent-142, score-0.8]

74 Integrating the use of patterns within an edit rate computation technique will however raise new difficulties. [sent-143, score-0.318]

75 We are finally also in the process of conducting a careful study of the characteristics of the paraphrase pairs that each technique can extract with high confidence, so that we can improve our hybridation experiments by considering confidence values at the paraphrase level using Machine Learning. [sent-144, score-1.61]

76 This way, we may be able to use an edit rate computation algorithm such as TER-PLUS as a more efficient system combiner for paraphrase extraction methods than what was proposed here. [sent-145, score-0.986]

77 A potential application of this would be an alternative proposal to the paraphrase evaluation metric PARAMETRIC (Callison-Burch et al. [sent-146, score-0.722]

78 , 2008), where individual techniques, outputting word alignments or not, could be evaluated from the ability of the informed edit rate technique to use correct equivalence units. [sent-147, score-0.404]

79 Constructing corpora for the development and evaluation of paraphrase systems. [sent-173, score-0.746]

80 Extracting lay paraphrases of specialized expressions from monolingual comparable medical corpora. [sent-177, score-0.365]

81 Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. [sent-181, score-0.757]

82 Syntax-based alignment of multiple translations: Extracting paraphrases and generating new sentences. [sent-206, score-0.287]

83 TER-Plus: paraphrase, semantic, and alignment enhancements to Translation Edit Rate. [sent-215, score-0.076]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('paraphrase', 0.692), ('paraphrases', 0.287), ('terppara', 0.207), ('sentential', 0.189), ('edit', 0.151), ('atomic', 0.15), ('subsentential', 0.138), ('terp', 0.137), ('fastr', 0.129), ('french', 0.117), ('bouamor', 0.104), ('hybridation', 0.104), ('cohn', 0.084), ('acquisition', 0.081), ('alignments', 0.079), ('monolingual', 0.078), ('alignment', 0.076), ('techniques', 0.076), ('pairs', 0.075), ('aur', 0.068), ('lien', 0.068), ('parallel', 0.065), ('giza', 0.062), ('rate', 0.061), ('computation', 0.059), ('tunable', 0.056), ('pang', 0.054), ('corpora', 0.054), ('houda', 0.052), ('termt', 0.052), ('yawat', 0.052), ('variants', 0.051), ('madnani', 0.048), ('technique', 0.047), ('composite', 0.044), ('paradigmatic', 0.042), ('individual', 0.039), ('ger', 0.037), ('informing', 0.037), ('snover', 0.037), ('bilingual', 0.037), ('max', 0.037), ('union', 0.037), ('candito', 0.036), ('acceptable', 0.035), ('symbolic', 0.035), ('complementarity', 0.035), ('reference', 0.034), ('extracting', 0.034), ('literal', 0.033), ('precision', 0.033), ('bannard', 0.032), ('anne', 0.032), ('chris', 0.032), ('ter', 0.032), ('barzilay', 0.032), ('parametric', 0.031), ('translation', 0.031), ('possibly', 0.031), ('valletta', 0.03), ('lesser', 0.03), ('dolan', 0.03), ('metric', 0.03), ('fusion', 0.03), ('nitin', 0.03), ('pivot', 0.03), ('inform', 0.028), ('edits', 0.028), ('hybrid', 0.028), ('priori', 0.028), ('equivalence', 0.027), ('prevents', 0.027), ('mirella', 0.025), ('trevor', 0.025), ('term', 0.025), ('coming', 0.025), ('difficulty', 0.025), ('independently', 0.024), ('denoted', 0.024), ('del', 0.024), ('explained', 0.024), ('recall', 0.024), ('tuned', 0.024), ('english', 0.023), ('pair', 0.023), ('candidates', 0.023), ('intersection', 0.023), ('dsa', 0.023), ('rareness', 0.023), ('ata', 0.023), ('appropriateness', 0.023), ('mtc', 0.023), ('tphaiisr', 0.023), ('dde', 0.023), ('icetal', 0.023), ('didate', 0.023), ('preexisting', 0.023), ('joining', 0.023), ('developped', 0.023), 
('combiner', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999988 225 acl-2011-Monolingual Alignment by Edit Rate Computation on Sentential Paraphrase Pairs

Author: Houda Bouamor ; Aurelien Max ; Anne Vilnat

Abstract: In this paper, we present a novel way of tackling the monolingual alignment problem on pairs of sentential paraphrases by means of edit rate computation. In order to inform the edit rate, information in the form of subsentential paraphrases is provided by a range of techniques built for different purposes. We show that the tunable TER-PLUS metric from Machine Translation evaluation can achieve good performance on this task and that it can effectively exploit information coming from complementary sources.

2 0.49668866 132 acl-2011-Extracting Paraphrases from Definition Sentences on the Web

Author: Chikara Hashimoto ; Kentaro Torisawa ; Stijn De Saeger ; Jun'ichi Kazama ; Sadao Kurohashi

Abstract: We propose an automatic method of extracting paraphrases from definition sentences, which are also automatically acquired from the Web. We observe that a huge number of concepts are defined in Web documents, and that the sentences that define the same concept tend to convey mostly the same information using different expressions and thus contain many paraphrases. We show that a large number of paraphrases can be automatically extracted with high precision by regarding the sentences that define the same concept as parallel corpora. Experimental results indicated that with our method it was possible to extract about 300,000 paraphrases from 6 × 10^8 Web documents with a precision rate of about 94%.

3 0.45363155 37 acl-2011-An Empirical Evaluation of Data-Driven Paraphrase Generation Techniques

Author: Donald Metzler ; Eduard Hovy ; Chunliang Zhang

Abstract: Paraphrase generation is an important task that has received a great deal of interest recently. Proposed data-driven solutions to the problem have ranged from simple approaches that make minimal use of NLP tools to more complex approaches that rely on numerous language-dependent resources. Despite all of the attention, there have been very few direct empirical evaluations comparing the merits of the different approaches. This paper empirically examines the tradeoffs between simple and sophisticated paraphrase harvesting approaches to help shed light on their strengths and weaknesses. Our evaluation reveals that very simple approaches fare surprisingly well and have a number of distinct advantages, including strong precision, good coverage, and low redundancy.

4 0.36991879 72 acl-2011-Collecting Highly Parallel Data for Paraphrase Evaluation

Author: David Chen ; William Dolan

Abstract: A lack of standard datasets and evaluation metrics has prevented the field of paraphrasing from making the kind of rapid progress enjoyed by the machine translation community over the last 15 years. We address both problems by presenting a novel data collection framework that produces highly parallel text data relatively inexpensively and on a large scale. The highly parallel nature of this data allows us to use simple n-gram comparisons to measure both the semantic adequacy and lexical dissimilarity of paraphrase candidates. In addition to being simple and efficient to compute, experiments show that these metrics correlate highly with human judgments.

5 0.32763538 327 acl-2011-Using Bilingual Parallel Corpora for Cross-Lingual Textual Entailment

Author: Yashar Mehdad ; Matteo Negri ; Marcello Federico

Abstract: This paper explores the use of bilingual parallel corpora as a source of lexical knowledge for cross-lingual textual entailment. We claim that, in spite of the inherent difficulties of the task, phrase tables extracted from parallel data allow to capture both lexical relations between single words, and contextual information useful for inference. We experiment with a phrasal matching method in order to: i) build a system portable across languages, and ii) evaluate the contribution of lexical knowledge in isolation, without interaction with other inference mechanisms. Results achieved on an English-Spanish corpus obtained from the RTE3 dataset support our claim, with an overall accuracy above average scores reported by RTE participants on monolingual data. Finally, we show that using parallel corpora to extract paraphrase tables reveals their potential also in the monolingual setting, improving the results achieved with other sources of lexical knowledge.

6 0.13031662 235 acl-2011-Optimal and Syntactically-Informed Decoding for Monolingual Phrase-Based Alignment

7 0.12350806 333 acl-2011-Web-Scale Features for Full-Scale Parsing

8 0.11603864 310 acl-2011-Translating from Morphologically Complex Languages: A Paraphrase-Based Approach

9 0.098822623 152 acl-2011-How Much Can We Gain from Supervised Word Alignment?

10 0.093352765 306 acl-2011-Towards Style Transformation from Written-Style to Audio-Style

11 0.092702478 87 acl-2011-Corpus Expansion for Statistical Machine Translation with Semantic Role Label Substitution Rules

12 0.083449103 259 acl-2011-Rare Word Translation Extraction from Aligned Comparable Documents

13 0.081444398 331 acl-2011-Using Large Monolingual and Bilingual Corpora to Improve Coordination Disambiguation

14 0.075274765 57 acl-2011-Bayesian Word Alignment for Statistical Machine Translation

15 0.074777216 70 acl-2011-Clustering Comparable Corpora For Bilingual Lexicon Extraction

16 0.072888069 183 acl-2011-Joint Bilingual Sentiment Classification with Unlabeled Parallel Corpora

17 0.068500936 141 acl-2011-Gappy Phrasal Alignment By Agreement

18 0.067184955 43 acl-2011-An Unsupervised Model for Joint Phrase Alignment and Extraction

19 0.064849578 240 acl-2011-ParaSense or How to Use Parallel Corpora for Word Sense Disambiguation

20 0.062455617 62 acl-2011-Blast: A Tool for Error Analysis of Machine Translation Output


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.187), (1, -0.081), (2, 0.034), (3, 0.166), (4, 0.019), (5, 0.043), (6, 0.267), (7, 0.015), (8, 0.033), (9, -0.484), (10, -0.016), (11, 0.351), (12, 0.012), (13, 0.035), (14, 0.142), (15, 0.035), (16, -0.08), (17, 0.068), (18, 0.008), (19, 0.088), (20, -0.076), (21, -0.057), (22, 0.048), (23, 0.017), (24, 0.015), (25, 0.008), (26, 0.077), (27, 0.002), (28, -0.024), (29, -0.1), (30, -0.087), (31, 0.036), (32, -0.047), (33, 0.068), (34, 0.034), (35, 0.005), (36, -0.029), (37, 0.037), (38, -0.013), (39, -0.082), (40, -0.01), (41, -0.02), (42, 0.007), (43, 0.046), (44, 0.031), (45, -0.025), (46, -0.043), (47, -0.012), (48, 0.003), (49, -0.014)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95650268 225 acl-2011-Monolingual Alignment by Edit Rate Computation on Sentential Paraphrase Pairs

Author: Houda Bouamor ; Aurelien Max ; Anne Vilnat

Abstract: In this paper, we present a novel way of tackling the monolingual alignment problem on pairs of sentential paraphrases by means of edit rate computation. In order to inform the edit rate, information in the form of subsentential paraphrases is provided by a range of techniques built for different purposes. We show that the tunable TER-PLUS metric from Machine Translation evaluation can achieve good performance on this task and that it can effectively exploit information coming from complementary sources.

2 0.90747821 37 acl-2011-An Empirical Evaluation of Data-Driven Paraphrase Generation Techniques

Author: Donald Metzler ; Eduard Hovy ; Chunliang Zhang

Abstract: Paraphrase generation is an important task that has received a great deal of interest recently. Proposed data-driven solutions to the problem have ranged from simple approaches that make minimal use of NLP tools to more complex approaches that rely on numerous language-dependent resources. Despite all of the attention, there have been very few direct empirical evaluations comparing the merits of the different approaches. This paper empirically examines the tradeoffs between simple and sophisticated paraphrase harvesting approaches to help shed light on their strengths and weaknesses. Our evaluation reveals that very simple approaches fare surprisingly well and have a number of distinct advantages, including strong precision, good coverage, and low redundancy.

3 0.87323105 132 acl-2011-Extracting Paraphrases from Definition Sentences on the Web

Author: Chikara Hashimoto ; Kentaro Torisawa ; Stijn De Saeger ; Jun'ichi Kazama ; Sadao Kurohashi

Abstract: We propose an automatic method of extracting paraphrases from definition sentences, which are also automatically acquired from the Web. We observe that a huge number of concepts are defined in Web documents, and that the sentences that define the same concept tend to convey mostly the same information using different expressions and thus contain many paraphrases. We show that a large number of paraphrases can be automatically extracted with high precision by regarding the sentences that define the same concept as parallel corpora. Experimental results indicated that with our method it was possible to extract about 300,000 paraphrases from 6 × 10^8 Web documents with a precision rate of about 94%.

4 0.78997588 72 acl-2011-Collecting Highly Parallel Data for Paraphrase Evaluation

Author: David Chen ; William Dolan

Abstract: A lack of standard datasets and evaluation metrics has prevented the field of paraphrasing from making the kind of rapid progress enjoyed by the machine translation community over the last 15 years. We address both problems by presenting a novel data collection framework that produces highly parallel text data relatively inexpensively and on a large scale. The highly parallel nature of this data allows us to use simple n-gram comparisons to measure both the semantic adequacy and lexical dissimilarity of paraphrase candidates. In addition to being simple and efficient to compute, experiments show that these metrics correlate highly with human judgments.

5 0.65143359 327 acl-2011-Using Bilingual Parallel Corpora for Cross-Lingual Textual Entailment

Author: Yashar Mehdad ; Matteo Negri ; Marcello Federico

Abstract: This paper explores the use of bilingual parallel corpora as a source of lexical knowledge for cross-lingual textual entailment. We claim that, in spite of the inherent difficulties of the task, phrase tables extracted from parallel data allow to capture both lexical relations between single words, and contextual information useful for inference. We experiment with a phrasal matching method in order to: i) build a system portable across languages, and ii) evaluate the contribution of lexical knowledge in isolation, without interaction with other inference mechanisms. Results achieved on an English-Spanish corpus obtained from the RTE3 dataset support our claim, with an overall accuracy above average scores reported by RTE participants on monolingual data. Finally, we show that using parallel corpora to extract paraphrase tables reveals their potential also in the monolingual setting, improving the results achieved with other sources of lexical knowledge.

6 0.46534121 310 acl-2011-Translating from Morphologically Complex Languages: A Paraphrase-Based Approach

7 0.38267097 306 acl-2011-Towards Style Transformation from Written-Style to Audio-Style

8 0.34404156 87 acl-2011-Corpus Expansion for Statistical Machine Translation with Semantic Role Label Substitution Rules

9 0.31281394 235 acl-2011-Optimal and Syntactically-Informed Decoding for Monolingual Phrase-Based Alignment

10 0.28642949 74 acl-2011-Combining Indicators of Allophony

11 0.28562146 115 acl-2011-Engkoo: Mining the Web for Language Learning

12 0.27892375 70 acl-2011-Clustering Comparable Corpora For Bilingual Lexicon Extraction

13 0.27667439 331 acl-2011-Using Large Monolingual and Bilingual Corpora to Improve Coordination Disambiguation

14 0.262063 43 acl-2011-An Unsupervised Model for Joint Phrase Alignment and Extraction

15 0.23780932 333 acl-2011-Web-Scale Features for Full-Scale Parsing

16 0.22745012 321 acl-2011-Unsupervised Discovery of Rhyme Schemes

17 0.22160493 259 acl-2011-Rare Word Translation Extraction from Aligned Comparable Documents

18 0.21805595 90 acl-2011-Crowdsourcing Translation: Professional Quality from Non-Professionals

19 0.2129575 62 acl-2011-Blast: A Tool for Error Analysis of Machine Translation Output

20 0.21257664 231 acl-2011-Nonlinear Evidence Fusion and Propagation for Hyponymy Relation Mining


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.03), (13, 0.146), (17, 0.053), (26, 0.039), (31, 0.011), (37, 0.065), (39, 0.049), (41, 0.045), (53, 0.123), (55, 0.019), (59, 0.043), (72, 0.048), (73, 0.01), (75, 0.012), (91, 0.047), (96, 0.146), (97, 0.011), (98, 0.012)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.85594875 42 acl-2011-An Interface for Rapid Natural Language Processing Development in UIMA

Author: Balaji Soundrarajan ; Thomas Ginter ; Scott DuVall

Abstract: This demonstration presents the Annotation Librarian, an application programming interface that supports rapid development of natural language processing (NLP) projects built in Apache Unstructured Information Management Architecture (UIMA). The flexibility of UIMA to support all types of unstructured data – images, audio, and text – increases the complexity of some of the most common NLP development tasks. The Annotation Librarian interface handles these common functions and allows the creation and management of annotations by mirroring Java methods used to manipulate Strings. The familiar syntax and NLP-centric design allows developers to adopt and rapidly develop NLP algorithms in UIMA. The general functionality of the interface is described in relation to the use cases that necessitated its creation.

same-paper 2 0.82966799 225 acl-2011-Monolingual Alignment by Edit Rate Computation on Sentential Paraphrase Pairs

Author: Houda Bouamor ; Aurelien Max ; Anne Vilnat

Abstract: In this paper, we present a novel way of tackling the monolingual alignment problem on pairs of sentential paraphrases by means of edit rate computation. In order to inform the edit rate, information in the form of subsentential paraphrases is provided by a range of techniques built for different purposes. We show that the tunable TER-PLUS metric from Machine Translation evaluation can achieve good performance on this task and that it can effectively exploit information coming from complementary sources.

3 0.82104862 132 acl-2011-Extracting Paraphrases from Definition Sentences on the Web

Author: Chikara Hashimoto ; Kentaro Torisawa ; Stijn De Saeger ; Jun'ichi Kazama ; Sadao Kurohashi

Abstract: We propose an automatic method of extracting paraphrases from definition sentences, which are also automatically acquired from the Web. We observe that a huge number of concepts are defined in Web documents, and that the sentences that define the same concept tend to convey mostly the same information using different expressions and thus contain many paraphrases. We show that a large number of paraphrases can be automatically extracted with high precision by regarding the sentences that define the same concept as parallel corpora. Experimental results indicated that with our method it was possible to extract about 300,000 paraphrases from 6 × 10^8 Web documents with a precision rate of about 94%.

4 0.80978119 159 acl-2011-Identifying Noun Product Features that Imply Opinions

Author: Lei Zhang ; Bing Liu

Abstract: Identifying domain-dependent opinion words is a key problem in opinion mining and has been studied by several researchers. However, existing work has been focused on adjectives and to some extent verbs. Limited work has been done on nouns and noun phrases. In our work, we used the feature-based opinion mining model, and we found that in some domains nouns and noun phrases that indicate product features may also imply opinions. In many such cases, these nouns are not subjective but objective. Their involved sentences are also objective sentences and imply positive or negative opinions. Identifying such nouns and noun phrases and their polarities is very challenging but critical for effective opinion mining in these domains. To the best of our knowledge, this problem has not been studied in the literature. This paper proposes a method to deal with the problem. Experimental results based on real-life datasets show promising results.

5 0.7951932 323 acl-2011-Unsupervised Part-of-Speech Tagging with Bilingual Graph-Based Projections

Author: Dipanjan Das ; Slav Petrov

Abstract: We describe a novel approach for inducing unsupervised part-of-speech taggers for languages that have no labeled training data, but have translated text in a resource-rich language. Our method does not assume any knowledge about the target language (in particular no tagging dictionary is assumed), making it applicable to a wide array of resource-poor languages. We use graph-based label propagation for cross-lingual knowledge transfer and use the projected labels as features in an unsupervised model (Berg-Kirkpatrick et al., 2010). Across eight European languages, our approach results in an average absolute improvement of 10.4% over a state-of-the-art baseline, and 16.7% over vanilla hidden Markov models induced with the Expectation Maximization algorithm.

6 0.79137003 66 acl-2011-Chinese sentence segmentation as comma classification

7 0.77389222 87 acl-2011-Corpus Expansion for Statistical Machine Translation with Semantic Role Label Substitution Rules

8 0.76805639 63 acl-2011-Bootstrapping coreference resolution using word associations

9 0.75778586 11 acl-2011-A Fast and Accurate Method for Approximate String Search

10 0.74490714 327 acl-2011-Using Bilingual Parallel Corpora for Cross-Lingual Textual Entailment

11 0.74479133 37 acl-2011-An Empirical Evaluation of Data-Driven Paraphrase Generation Techniques

12 0.74141884 72 acl-2011-Collecting Highly Parallel Data for Paraphrase Evaluation

13 0.73227316 137 acl-2011-Fine-Grained Class Label Markup of Search Queries

14 0.72412497 131 acl-2011-Extracting Opinion Expressions and Their Polarities - Exploration of Pipelines and Joint Models

15 0.7186355 274 acl-2011-Semi-Supervised Frame-Semantic Parsing for Unknown Predicates

16 0.71688396 235 acl-2011-Optimal and Syntactically-Informed Decoding for Monolingual Phrase-Based Alignment

17 0.71567291 331 acl-2011-Using Large Monolingual and Bilingual Corpora to Improve Coordination Disambiguation

18 0.71233374 318 acl-2011-Unsupervised Bilingual Morpheme Segmentation and Alignment with Context-rich Hidden Semi-Markov Models

19 0.71165466 136 acl-2011-Finding Deceptive Opinion Spam by Any Stretch of the Imagination

20 0.70343912 128 acl-2011-Exploring Entity Relations for Named Entity Disambiguation