acl2011-72 knowledge-graph by maker-knowledge-mining

72 acl-2011-Collecting Highly Parallel Data for Paraphrase Evaluation


Source: pdf

Author: David Chen ; William Dolan

Abstract: A lack of standard datasets and evaluation metrics has prevented the field of paraphrasing from making the kind of rapid progress enjoyed by the machine translation community over the last 15 years. We address both problems by presenting a novel data collection framework that produces highly parallel text data relatively inexpensively and on a large scale. The highly parallel nature of this data allows us to use simple n-gram comparisons to measure both the semantic adequacy and lexical dissimilarity of paraphrase candidates. In addition to being simple and efficient to compute, experiments show that these metrics correlate highly with human judgments.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: A lack of standard datasets and evaluation metrics has prevented the field of paraphrasing from making the kind of rapid progress enjoyed by the machine translation community over the last 15 years. [sent-4, score-0.383]

2 We address both problems by presenting a novel data collection framework that produces highly parallel text data relatively inexpensively and on a large scale. [sent-5, score-0.322]

3 The highly parallel nature of this data allows us to use simple n-gram comparisons to measure both the semantic adequacy and lexical dissimilarity of paraphrase candidates. [sent-6, score-0.656]

4 In addition to being simple and efficient to compute, experiments show that these metrics correlate highly with human judgments. [sent-7, score-0.135]

5 1 Introduction: Machine paraphrasing has many applications for natural language processing tasks, including machine translation (MT), MT evaluation, summary evaluation, question answering, and natural language generation. [sent-8, score-0.214]

6 Despite the similarities between paraphrasing and translation, several major differences have prevented researchers from simply following standards that have been established for machine translation. [sent-18, score-0.198]

7 Professional translators produce large volumes of bilingual data according to a more or less consistent specification, indirectly fueling work on machine translation algorithms. [sent-19, score-0.1]

8 Our work introduces two novel contributions which combine to address the challenges posed by paraphrase evaluation. [sent-24, score-0.338]

9 First, we describe a framework for easily and inexpensively crowdsourcing arbitrarily large training and test sets of independent, redundant linguistic descriptions of the same semantic content. [sent-25, score-0.432]

10 We believe that this metric, along with the sentence-level paraphrases provided by our data collection approach, will make it possible [sent-27, score-0.312]

11 for researchers working on paraphrasing to compare system performance and exploit the kind of automated, rapid training-test cycle that has driven work on Statistical Machine Translation. [sent-29, score-0.201]

12 In addition to describing a mechanism for collecting large-scale sentence-level paraphrases, we are also making available to the research community 85K parallel English sentences as part of the Microsoft Research Video Description Corpus. [sent-30, score-0.182]

13 Section 3 then describes our data collection framework and the resulting data. [sent-33, score-0.097]

14 Section 4 discusses automatic evaluations of paraphrases and introduces the novel metric PINC. [sent-34, score-0.314]
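
PINC itself is compact: it scores the fraction of candidate n-grams (for n = 1 to 4) that do not appear in the source sentence, averaged over n — in effect the inverse of a BLEU-style overlap computed against the source. A minimal Python sketch of this idea; whitespace tokenization, lowercasing, and set-based overlap are simplifying assumptions of the sketch, not details taken from the paper:

```python
def ngram_set(tokens, n):
    """Set of n-grams of a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def pinc(source, candidate, max_n=4):
    """Mean fraction of candidate n-grams absent from the source sentence.

    Higher values mean more lexical dissimilarity from the source;
    identical sentences score 0.0.
    """
    src, cand = source.lower().split(), candidate.lower().split()
    scores = []
    for n in range(1, max_n + 1):
        cand_ngrams = ngram_set(cand, n)
        if not cand_ngrams:  # candidate shorter than n tokens
            break
        overlap = len(cand_ngrams & ngram_set(src, n))
        scores.append(1.0 - overlap / len(cand_ngrams))
    return sum(scores) / len(scores) if scores else 0.0

print(round(pinc("a man is slicing a potato",
                 "someone is cutting up a potato"), 3))
```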

15 Section 5 presents experimental results establishing a correlation between our automatic metric and human judgments. [sent-35, score-0.187]

16 2 Related Work: Since paraphrase data are not readily available, various methods have been used to extract parallel text from other sources. [sent-37, score-0.489]

17 Examples of this kind of data include the Multiple-Translation Chinese (MTC) Corpus, which consists of Chinese news stories translated into English by 11 translation agencies, and literary works with multiple translations into English (e.g., ...). [sent-40, score-0.198]

18 Another method for collecting monolingual paraphrase data involves aligning semantically parallel sentences from different news articles describing the same event (Shinyama et al., 2002). [sent-43, score-0.646]

19 While utilizing multiple translations of literary works or multiple news stories of the same event can yield significant numbers of parallel sentences, these data tend to be noisy, and reliably identifying good paraphrases among all possible sentence pairs remains an open problem. [sent-46, score-0.558]

20 Finally, some approaches avoid the need for monolingual paraphrase data altogether by using a second language as the pivot language (Bannard and Callison-Burch, 2005; Callison-Burch, 2008; Kok and Brockett, 2010). [sent-51, score-0.381]

21 While most work on evaluating paraphrase systems has relied on human judges (Barzilay and McKeown, 2001; Ibrahim et al., 2006), [sent-54, score-0.391]

22 there have also been a few attempts at creating automatic metrics that can be more easily replicated and used to compare different systems. [sent-56, score-0.104]

23 ParaMetric (Callison-Burch et al., 2008) compares the paraphrases discovered by an automatic system with ones annotated by humans, measuring precision and recall. [sent-58, score-0.255]

24 This approach requires additional human annotations to identify the paraphrases within parallel texts (Cohn et al., 2008). [sent-59, score-0.424]

25 PEM (Liu et al., 2010) produces a single score that captures the semantic adequacy, fluency, and lexical dissimilarity of candidate paraphrases, relying on bilingual data to learn semantic equivalences without using n-gram similarity between candidate and reference sentences. [sent-62, score-0.125]

26 In addition, the metric was shown to correlate well with human judgments. [sent-63, score-0.112]

27 However, a significant drawback of this approach is that PEM requires substantial in-domain bilingual data to train the semantic adequacy evaluator, as well as sample human judgments to train the overall metric. [sent-64, score-0.216]

28 We designed our data collection framework for use on crowdsourcing platforms such as Amazon’s Mechanical Turk. [sent-65, score-0.144]

29 Of particular relevance are the works by Buzek et al. (2010) and Denkowski et al. (2010). [sent-67, score-0.157]

30 Buzek et al. automatically identified problem regions in a translation task and had workers attempt to paraphrase them, [sent-71, score-0.581]

31 while Denkowski et al. asked workers to assess the validity of automatically extracted paraphrases. [sent-72, score-0.265]

32 Our work is distinct from these earlier efforts both in terms of the task (attempting to collect linguistic descriptions using a visual stimulus) and the dramatically larger scale of the data collected. [sent-73, score-0.306]

33 3 Data Collection: Since our goal was to collect large numbers of paraphrases quickly and inexpensively using a crowd, our framework was designed to make the tasks short, simple, easy, accessible, and somewhat fun. [sent-74, score-0.412]

34 For each task, we asked the annotators to watch a very short video clip (usually less than 10 seconds long) and describe in one sentence the main action or event that occurred in the video clip. We deployed the task on Amazon’s Mechanical Turk, with video segments selected from YouTube. [sent-75, score-1.55]

35 On average, annotators completed each task within 80 seconds, including the time required to watch the video. [sent-77, score-0.148]

36 The data thus has some similarities to parallel news descriptions of the same event, while avoiding much of the noise inherent in news. [sent-80, score-0.387]

37 Crucially, our approach allows us to gather arbitrarily many of these independent descriptions for each video, capturing nearly-exhaustive coverage of how native speakers are likely to summarize a small action. [sent-82, score-0.269]

38 It might be possible to achieve similar effects using images or panels of images as the stimulus (von Ahn and Dabbish, 2004; Fei-Fei et al., 2010), [sent-83, score-0.111]

39 but we believed that videos would be more engaging and less ambiguous in their focus. [sent-85, score-0.114]

40 In addition, videos have been shown to be more effective in prompting descriptions of motion and contact verbs, as well as verbs that are generally not imageable (Ma and Cook, 2009). [sent-86, score-0.348]

41 Watch and describe a short segment of a video: You will be shown a segment of a video clip and asked to describe the main action/event in that segment in ONE SENTENCE. [sent-87, score-1.109]

42 Things to note while completing this task: The video will play only a selected segment by default. [sent-88, score-0.481]

43 You can choose to watch the entire clip and/or with sound, although this is not necessary. [sent-89, score-0.177]

44 Figure 1: A screenshot of our annotation task as it was deployed on Mechanical Turk. [sent-97, score-0.115]

45 3.1 Quality Control: One of the main problems with collecting data using a crowd is quality control. [sent-99, score-0.134]

46 While the cost is very low compared to traditional annotation methods, workers recruited over the Internet are often unqualified for the tasks or are incentivized to cheat in order to maximize their rewards. [sent-100, score-0.261]

47 To encourage native and fluent contributions, we asked annotators to write the descriptions in the language of their choice. [sent-101, score-0.382]

48 The idea was to reward workers who had shown the ability to write quality descriptions and the willingness to work on our tasks consistently. [sent-105, score-0.54]

49 While everyone had access to the Tier-1 tasks, only workers who had been manually qualified could work on the Tier-2 tasks. [sent-106, score-0.223]

50 The tasks were identical in the two tiers but each Tier-1 task only paid 1 cent while each Tier-2 task paid 5 cents, giving the workers a strong incentive to earn the qualification. [sent-107, score-0.227]

51 We periodically evaluated the workers who had submitted the most Tier-1 tasks (usually on the order of few hundred submissions) and granted them access to the Tier-2 tasks if they had performed well. [sent-109, score-0.341]
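
The tier-promotion pass described above is easy to picture in code. The following is a hypothetical sketch only — the worker data structure, the manual-review callable, and the threshold of 300 are illustrative stand-ins (the paper says only that reviews happened after on the order of a few hundred submissions):

```python
def tier2_review(tier1_counts, passes_manual_review, review_threshold=300):
    """Periodic qualification pass for the two-tier task system (sketch).

    tier1_counts: dict mapping worker_id -> number of Tier-1 submissions.
    passes_manual_review: callable standing in for the human check of a
    worker's sampled descriptions. The threshold is illustrative.
    """
    promoted = []
    for worker_id, count in tier1_counts.items():
        if count >= review_threshold and passes_manual_review(worker_id):
            promoted.append(worker_id)  # grant access to 5-cent Tier-2 tasks
    return promoted
```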

52 Moreover, the initial effort is amortized over time as these quality workers are retained over the entire duration of the data collection. [sent-114, score-0.219]

53 Many of them annotated all the available videos we had. [sent-115, score-0.114]

54 3.2 Video Collection: To find suitable videos to annotate, we deployed a separate task. [sent-117, score-0.161]

55 Workers were asked to submit short (generally 4-10 seconds) video segments depicting single, unambiguous events by specifying links to YouTube videos, along with the start and end times. [sent-118, score-0.405]

56 We again used a tiered payment system to reward and retain workers who performed well. [sent-119, score-0.231]

57 Since the scope of this data collection effort extended beyond gathering English data alone, we [...]. (Footnote 3: Everyone who submitted descriptions in a foreign language was granted access to the Tier-2 tasks.) [sent-120, score-0.364]

58 This was done to encourage more submissions in different languages and also because we could not verify the quality of those descriptions other than using online translation services (and some of the languages were not available to be translated). [sent-121, score-0.379]

59 Table 4: Correlation between the human judges, as well as between the automatic metrics and the human judges. [sent-131, score-0.155]

60 Since PEM was trained on limited data ([...] English sentence pairs and 2400 human ratings of paraphrase pairs), it is difficult to use PEM as a general metric. [sent-132, score-0.48]

61 Adapting PEM to a new domain would require sufficient in-domain bilingual data to support paraphrase extraction. [sent-133, score-0.381]

62 Moreover, PEM requires sample human ratings in training, thereby lessening the advantage of having automatic metrics. [sent-135, score-0.142]

63 Since lexical dissimilarity is only desirable when the semantics of the original sentence is unchanged, we also computed correlation between PINC and the human ratings when BLEU is above certain thresholds. [sent-136, score-0.299]

64 As we restrict our attention to the set of paraphrases with higher BLEU scores, we see an increase in correlation between PINC and the human assessments. [sent-137, score-0.383]

65 Finally, while we do not believe any single score could adequately describe the quality of a paraphrase outside of a specific application, we experimented with different ways of combining BLEU and PINC into a single score. [sent-139, score-0.371]

66 Almost any simple combination, such as taking the average of the two, yielded decent correlation with the human ratings. [sent-140, score-0.128]
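
Both analyses above — correlating PINC with human ratings only where BLEU clears a threshold (sentences 63–64) and averaging the two metrics (sentence 66) — are a few lines of code. A sketch under the assumption that per-candidate BLEU scores, PINC scores, and human ratings are available as parallel lists:

```python
from scipy.stats import pearsonr  # SciPy's Pearson correlation

def average_combination(bleu, pinc_score):
    """One simple combination the paper mentions: the plain average."""
    return (bleu + pinc_score) / 2.0

def pinc_correlation_above_bleu(bleu_scores, pinc_scores, ratings, threshold):
    """Pearson r between PINC and human ratings, restricted to candidates
    whose BLEU exceeds `threshold` (i.e., candidates that plausibly
    preserved the source semantics)."""
    kept = [(p, r) for b, p, r in zip(bleu_scores, pinc_scores, ratings)
            if b >= threshold]
    pincs, human = zip(*kept)  # assumes at least one candidate survives
    return pearsonr(pincs, human)[0]
```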

67 a paraphrase either preserves the meaning or it does not, in which case PINC does not matter at all) than a linear function. [sent-150, score-0.338]

68 In practice, some sample human ratings would be required to tune this function. [sent-152, score-0.142]

69 We quantified the utility of our highly parallel data by computing the correlation between BLEU and human ratings when different numbers of references were available. [sent-154, score-0.366]

70 As the number of references increases, the correlation with human ratings also increases. [sent-156, score-0.217]
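
This reference-count experiment can be re-run on any multiply-described dataset. A sketch using NLTK's sentence-level BLEU, where `items` — a list of (candidate_tokens, reference_token_lists, human_rating) triples — is a hypothetical input format, not the paper's released data layout:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from scipy.stats import pearsonr

def bleu_rating_correlation(items, k):
    """Pearson r between k-reference BLEU and human adequacy ratings."""
    smooth = SmoothingFunction().method1  # avoid zeros on short sentences
    bleus = [sentence_bleu(refs[:k], cand, smoothing_function=smooth)
             for cand, refs, _ in items]
    ratings = [rating for _, _, rating in items]
    return pearsonr(bleus, ratings)[0]

# Per the paper's finding, correlation should rise with k:
# for k in (1, 2, 4, 8, 16):
#     print(k, bleu_rating_correlation(items, k))
```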

71 If we are trying to assess the overall quality of the paraphrase, it is better to exclude the source sentence, since otherwise the metric will tend to favor paraphrases that introduce fewer changes. [sent-159, score-0.384]

72 5.3 Direct paraphrasing versus video annotation: In addition to collecting paraphrases through video annotations, we also experimented with the more traditional task of presenting a sentence to an annotator and explicitly asking for a paraphrase. [sent-161, score-1.277]

73 We randomly selected a thousand sentences from our data and collected two paraphrases of each using Mechanical Turk. [sent-162, score-0.255]

74 We conducted a post-annotation survey of workers who had completed both the video description and the direct paraphrasing tasks, and found that paraphrasing was considered more difficult and less enjoyable than describing videos. [sent-163, score-0.944]

75 Of those surveyed, 92% found video annotations more enjoyable, and 75% found them easier. [sent-164, score-0.363]

76 Based on the comments, the only drawback of the video annotation task is the time required to load and watch the videos. [sent-165, score-0.485]

77 Overall, half of the workers preferred the video annotation task while only 16% of the workers preferred the paraphrasing task. [sent-166, score-0.926]

78 The data produced by the direct paraphrasing task also diverged less, since the annotators were inevitably biased by lexical choices and word order in the original sentences. [sent-167, score-0.217]

79 On average, a direct paraphrase had a PINC score of 70.08, [sent-168, score-0.338]

80 while a parallel description of the same video had a score of 78. [sent-169, score-0.515]

81 6 Discussions and Future Work: While our data collection framework yields useful parallel data, it also has some limitations. [sent-171, score-0.213]

82 Finding appropriate videos is time-consuming and remains a bottleneck in the process. [sent-172, score-0.114]

83 One possible solution is to use longer video snippets or other visual stimuli such as graphs, schemas, or illustrated storybooks to convey more complicated information. [sent-174, score-0.431]

84 However, the increased complexity is also likely to reduce the semantic closeness of the parallel descriptions. [sent-175, score-0.116]

85 Asking annotators to write multiple descriptions or longer descriptions would result in more varied data but at the cost of more noise in the alignments. [sent-177, score-0.574]

86 However, as with the difficulty of aligning news stories, finding paraphrases within these more complex responses could require additional annotation efforts. [sent-182, score-0.326]

87 However, other more advanced MT metrics that have shown higher correlation with human judgments could also be used. [sent-188, score-0.124]

88 In addition to paraphrasing, our data collection framework could also be used to produce useful data for machine translation and computer vision. [sent-189, score-0.154]

89 By pairing up descriptions of the same video in different languages, we obtain parallel data without requiring any bilingual skills. [sent-190, score-0.756]
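
The pairing step is a simple group-by over clip identifiers. A sketch, where the (video_id, language, sentence) triple format and the language codes are assumptions for illustration:

```python
from collections import defaultdict
from itertools import product

def bilingual_pairs(descriptions, lang_a="en", lang_b="de"):
    """Cross all descriptions of the same clip written in two languages
    to form sentence-aligned MT training pairs (sketch). Every
    cross-language combination describes the same video content."""
    by_clip = defaultdict(lambda: defaultdict(list))
    for video_id, lang, sentence in descriptions:
        by_clip[video_id][lang].append(sentence)
    pairs = []
    for langs in by_clip.values():
        pairs.extend(product(langs[lang_a], langs[lang_b]))
    return pairs
```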

90 Another application of our data is in computer vision tasks such as video retrieval. [sent-191, score-0.404]

91 The dataset can be readily used to train and evaluate systems that can automatically generate full descriptions of unseen videos. [sent-192, score-0.269]

92 As far as we know, there are currently no datasets that contain whole-sentence descriptions of open-domain video segments. [sent-193, score-0.632]

93 7 Conclusion: We introduced a data collection framework that produces highly parallel data by asking different annotators to describe the same video segments. [sent-194, score-0.708]

94 Deploying the framework on Mechanical Turk over a two-month period yielded 85K English descriptions for 2K videos, one of the largest paraphrase data resources publicly available. [sent-195, score-0.612]

95 In addition, the highly parallel nature of the data allows us to use standard MT metrics such as BLEU to evaluate semantic adequacy reliably. [sent-196, score-0.285]

96 Can crowds build parallel corpora for machine translation systems? [sent-202, score-0.173]

97 Constructing corpora for the development and evaluation of paraphrase systems. [sent-252, score-0.338]

98 Exploring normalization techniques for human judgments of machine translation adequacy collected using Amazon Mechanical Turk. [sent-256, score-0.23]

99 Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. [sent-264, score-0.491]

100 Syntax-based alignment of multiple translations: Extracting paraphrases and generating new sentences. [sent-301, score-0.255]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('video', 0.363), ('paraphrase', 0.338), ('paraphrases', 0.255), ('pinc', 0.253), ('descriptions', 0.234), ('mechanical', 0.201), ('workers', 0.186), ('paraphrasing', 0.157), ('amazon', 0.134), ('pem', 0.134), ('bleu', 0.127), ('parallel', 0.116), ('videos', 0.114), ('chris', 0.106), ('ratings', 0.089), ('clip', 0.089), ('denkowski', 0.088), ('watch', 0.088), ('adequacy', 0.087), ('segment', 0.084), ('barzilay', 0.082), ('dissimilarity', 0.082), ('bloodgood', 0.082), ('inexpensively', 0.076), ('correlation', 0.075), ('stories', 0.071), ('buzek', 0.067), ('collecting', 0.066), ('bannard', 0.063), ('ibrahim', 0.062), ('annotators', 0.06), ('dolan', 0.06), ('metric', 0.059), ('collection', 0.057), ('translation', 0.057), ('please', 0.057), ('creating', 0.055), ('submissions', 0.055), ('human', 0.053), ('pork', 0.051), ('rashtchian', 0.051), ('semancc', 0.051), ('metrics', 0.049), ('shinyama', 0.049), ('turk', 0.048), ('deployed', 0.047), ('austin', 0.047), ('crowdsourcing', 0.047), ('event', 0.046), ('write', 0.046), ('enjoyable', 0.045), ('mtc', 0.045), ('payment', 0.045), ('qualification', 0.045), ('rapid', 0.044), ('bilingual', 0.043), ('naacl', 0.043), ('mckeown', 0.043), ('monolingual', 0.043), ('brockett', 0.043), ('asked', 0.042), ('tasks', 0.041), ('ambati', 0.041), ('band', 0.041), ('prevented', 0.041), ('hlt', 0.041), ('microsoft', 0.04), ('framework', 0.04), ('asking', 0.039), ('rising', 0.039), ('stimulus', 0.039), ('kok', 0.039), ('granted', 0.039), ('someone', 0.038), ('assess', 0.037), ('news', 0.037), ('everyone', 0.037), ('description', 0.036), ('images', 0.036), ('cohn', 0.035), ('complicated', 0.035), ('arbitrarily', 0.035), ('crowd', 0.035), ('callisonburch', 0.035), ('pear', 0.035), ('datasets', 0.035), ('quirk', 0.035), ('readily', 0.035), ('submitted', 0.034), ('zaidan', 0.034), ('completing', 0.034), ('irvine', 0.034), ('screenshot', 0.034), ('annotation', 0.034), ('quality', 0.033), ('mt', 0.033), ('highly', 0.033), ('translations', 0.033), ('judgments', 0.033), ('visual', 0.033)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 72 acl-2011-Collecting Highly Parallel Data for Paraphrase Evaluation

Author: David Chen ; William Dolan

Abstract: A lack of standard datasets and evaluation metrics has prevented the field of paraphrasing from making the kind of rapid progress enjoyed by the machine translation community over the last 15 years. We address both problems by presenting a novel data collection framework that produces highly parallel text data relatively inexpensively and on a large scale. The highly parallel nature of this data allows us to use simple n-gram comparisons to measure both the semantic adequacy and lexical dissimilarity of paraphrase candidates. In addition to being simple and efficient to compute, experiments show that these metrics correlate highly with human judgments.

2 0.36991879 225 acl-2011-Monolingual Alignment by Edit Rate Computation on Sentential Paraphrase Pairs

Author: Houda Bouamor ; Aurelien Max ; Anne Vilnat

Abstract: In this paper, we present a novel way of tackling the monolingual alignment problem on pairs of sentential paraphrases by means of edit rate computation. In order to inform the edit rate, information in the form of subsentential paraphrases is provided by a range of techniques built for different purposes. We show that the tunable TER-PLUS metric from Machine Translation evaluation can achieve good performance on this task and that it can effectively exploit information coming from complementary sources.

3 0.36112565 37 acl-2011-An Empirical Evaluation of Data-Driven Paraphrase Generation Techniques

Author: Donald Metzler ; Eduard Hovy ; Chunliang Zhang

Abstract: Paraphrase generation is an important task that has received a great deal of interest recently. Proposed data-driven solutions to the problem have ranged from simple approaches that make minimal use of NLP tools to more complex approaches that rely on numerous language-dependent resources. Despite all of the attention, there have been very few direct empirical evaluations comparing the merits of the different approaches. This paper empirically examines the tradeoffs between simple and sophisticated paraphrase harvesting approaches to help shed light on their strengths and weaknesses. Our evaluation reveals that very simple approaches fare surprisingly well and have a number of distinct advantages, including strong precision, good coverage, and low redundancy.

4 0.32098758 132 acl-2011-Extracting Paraphrases from Definition Sentences on the Web

Author: Chikara Hashimoto ; Kentaro Torisawa ; Stijn De Saeger ; Jun'ichi Kazama ; Sadao Kurohashi

Abstract: ¶ kuro@i . We propose an automatic method of extracting paraphrases from definition sentences, which are also automatically acquired from the Web. We observe that a huge number of concepts are defined in Web documents, and that the sentences that define the same concept tend to convey mostly the same information using different expressions and thus contain many paraphrases. We show that a large number of paraphrases can be automatically extracted with high precision by regarding the sentences that define the same concept as parallel corpora. Experimental results indicated that with our method it was possible to extract about 300,000 paraphrases from 6 Web docu3m0e0n,t0s0 w0i ptha a precision oramte 6 6o ×f a 1b0out 94%. 108

5 0.22584784 327 acl-2011-Using Bilingual Parallel Corpora for Cross-Lingual Textual Entailment

Author: Yashar Mehdad ; Matteo Negri ; Marcello Federico

Abstract: This paper explores the use of bilingual parallel corpora as a source of lexical knowledge for cross-lingual textual entailment. We claim that, in spite of the inherent difficulties of the task, phrase tables extracted from parallel data allow to capture both lexical relations between single words, and contextual information useful for inference. We experiment with a phrasal matching method in order to: i) build a system portable across languages, and ii) evaluate the contribution of lexical knowledge in isolation, without interaction with other inference mechanisms. Results achieved on an English-Spanish corpus obtained from the RTE3 dataset support our claim, with an overall accuracy above average scores reported by RTE participants on monolingual data. Finally, we show that using parallel corpora to extract paraphrase tables reveals their potential also in the monolingual setting, improving the results achieved with other sources of lexical knowledge.

6 0.21595559 90 acl-2011-Crowdsourcing Translation: Professional Quality from Non-Professionals

7 0.13335183 216 acl-2011-MEANT: An inexpensive, high-accuracy, semi-automatic metric for evaluating translation utility based on semantic roles

8 0.13101551 2 acl-2011-AM-FM: A Semantic Framework for Translation Quality Assessment

9 0.12476856 310 acl-2011-Translating from Morphologically Complex Languages: A Paraphrase-Based Approach

10 0.12475086 302 acl-2011-They Can Help: Using Crowdsourcing to Improve the Evaluation of Grammatical Error Detection Systems

11 0.10746945 49 acl-2011-Automatic Evaluation of Chinese Translation Output: Word-Level or Character-Level?

12 0.10741486 87 acl-2011-Corpus Expansion for Statistical Machine Translation with Semantic Role Label Substitution Rules

13 0.10698386 306 acl-2011-Towards Style Transformation from Written-Style to Audio-Style

14 0.10270482 264 acl-2011-Reordering Metrics for MT

15 0.085512616 152 acl-2011-How Much Can We Gain from Supervised Word Alignment?

16 0.082462564 104 acl-2011-Domain Adaptation for Machine Translation by Mining Unseen Words

17 0.080542587 183 acl-2011-Joint Bilingual Sentiment Classification with Unlabeled Parallel Corpora

18 0.08041881 331 acl-2011-Using Large Monolingual and Bilingual Corpora to Improve Coordination Disambiguation

19 0.07254114 333 acl-2011-Web-Scale Features for Full-Scale Parsing

20 0.071819991 259 acl-2011-Rare Word Translation Extraction from Aligned Comparable Documents


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.227), (1, -0.06), (2, 0.033), (3, 0.186), (4, 0.002), (5, 0.092), (6, 0.214), (7, -0.015), (8, 0.093), (9, -0.399), (10, -0.074), (11, 0.182), (12, -0.003), (13, -0.054), (14, 0.105), (15, 0.043), (16, -0.102), (17, 0.054), (18, 0.022), (19, 0.067), (20, -0.038), (21, 0.024), (22, 0.052), (23, 0.022), (24, -0.009), (25, 0.012), (26, 0.04), (27, -0.031), (28, -0.001), (29, -0.066), (30, -0.081), (31, -0.002), (32, -0.067), (33, 0.111), (34, 0.057), (35, 0.05), (36, 0.019), (37, 0.052), (38, -0.025), (39, -0.103), (40, 0.062), (41, -0.066), (42, 0.042), (43, 0.032), (44, 0.036), (45, -0.023), (46, -0.052), (47, 0.047), (48, -0.018), (49, -0.018)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93186504 72 acl-2011-Collecting Highly Parallel Data for Paraphrase Evaluation

Author: David Chen ; William Dolan

Abstract: A lack of standard datasets and evaluation metrics has prevented the field of paraphrasing from making the kind of rapid progress enjoyed by the machine translation community over the last 15 years. We address both problems by presenting a novel data collection framework that produces highly parallel text data relatively inexpensively and on a large scale. The highly parallel nature of this data allows us to use simple n-gram comparisons to measure both the semantic adequacy and lexical dissimilarity of paraphrase candidates. In addition to being simple and efficient to compute, experiments show that these metrics correlate highly with human judgments.

2 0.90519345 225 acl-2011-Monolingual Alignment by Edit Rate Computation on Sentential Paraphrase Pairs

Author: Houda Bouamor ; Aurelien Max ; Anne Vilnat

Abstract: In this paper, we present a novel way of tackling the monolingual alignment problem on pairs of sentential paraphrases by means of edit rate computation. In order to inform the edit rate, information in the form of subsentential paraphrases is provided by a range of techniques built for different purposes. We show that the tunable TER-PLUS metric from Machine Translation evaluation can achieve good performance on this task and that it can effectively exploit information coming from complementary sources.

3 0.89676648 37 acl-2011-An Empirical Evaluation of Data-Driven Paraphrase Generation Techniques

Author: Donald Metzler ; Eduard Hovy ; Chunliang Zhang

Abstract: Paraphrase generation is an important task that has received a great deal of interest recently. Proposed data-driven solutions to the problem have ranged from simple approaches that make minimal use of NLP tools to more complex approaches that rely on numerous language-dependent resources. Despite all of the attention, there have been very few direct empirical evaluations comparing the merits of the different approaches. This paper empirically examines the tradeoffs between simple and sophisticated paraphrase harvesting approaches to help shed light on their strengths and weaknesses. Our evaluation reveals that very simple approaches fare surprisingly well and have a number of distinct advantages, including strong precision, good coverage, and low redundancy.

4 0.81081516 132 acl-2011-Extracting Paraphrases from Definition Sentences on the Web

Author: Chikara Hashimoto ; Kentaro Torisawa ; Stijn De Saeger ; Jun'ichi Kazama ; Sadao Kurohashi

Abstract: ¶ kuro@i . We propose an automatic method of extracting paraphrases from definition sentences, which are also automatically acquired from the Web. We observe that a huge number of concepts are defined in Web documents, and that the sentences that define the same concept tend to convey mostly the same information using different expressions and thus contain many paraphrases. We show that a large number of paraphrases can be automatically extracted with high precision by regarding the sentences that define the same concept as parallel corpora. Experimental results indicated that with our method it was possible to extract about 300,000 paraphrases from 6 Web docu3m0e0n,t0s0 w0i ptha a precision oramte 6 6o ×f a 1b0out 94%. 108

5 0.60859293 327 acl-2011-Using Bilingual Parallel Corpora for Cross-Lingual Textual Entailment

Author: Yashar Mehdad ; Matteo Negri ; Marcello Federico

Abstract: This paper explores the use of bilingual parallel corpora as a source of lexical knowledge for cross-lingual textual entailment. We claim that, in spite of the inherent difficulties of the task, phrase tables extracted from parallel data allow to capture both lexical relations between single words, and contextual information useful for inference. We experiment with a phrasal matching method in order to: i) build a system portable across languages, and ii) evaluate the contribution of lexical knowledge in isolation, without interaction with other inference mechanisms. Results achieved on an English-Spanish corpus obtained from the RTE3 dataset support our claim, with an overall accuracy above average scores reported by RTE participants on monolingual data. Finally, we show that using parallel corpora to extract paraphrase tables reveals their potential also in the monolingual setting, improving the results achieved with other sources of lexical knowledge.

6 0.51156592 310 acl-2011-Translating from Morphologically Complex Languages: A Paraphrase-Based Approach

7 0.46598473 90 acl-2011-Crowdsourcing Translation: Professional Quality from Non-Professionals

8 0.44999725 306 acl-2011-Towards Style Transformation from Written-Style to Audio-Style

9 0.43657374 2 acl-2011-AM-FM: A Semantic Framework for Translation Quality Assessment

10 0.40966091 62 acl-2011-Blast: A Tool for Error Analysis of Machine Translation Output

11 0.3916041 216 acl-2011-MEANT: An inexpensive, high-accuracy, semi-automatic metric for evaluating translation utility based on semantic roles

12 0.38470951 60 acl-2011-Better Hypothesis Testing for Statistical Machine Translation: Controlling for Optimizer Instability

13 0.38413769 99 acl-2011-Discrete vs. Continuous Rating Scales for Language Evaluation in NLP

14 0.36945042 87 acl-2011-Corpus Expansion for Statistical Machine Translation with Semantic Role Label Substitution Rules

15 0.36592478 264 acl-2011-Reordering Metrics for MT

16 0.36422974 302 acl-2011-They Can Help: Using Crowdsourcing to Improve the Evaluation of Grammatical Error Detection Systems

17 0.34689841 74 acl-2011-Combining Indicators of Allophony

18 0.34588078 115 acl-2011-Engkoo: Mining the Web for Language Learning

19 0.33471438 81 acl-2011-Consistent Translation using Discriminative Learning - A Translation Memory-inspired Approach

20 0.33203375 252 acl-2011-Prototyping virtual instructors from human-human corpora


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.045), (17, 0.045), (26, 0.042), (31, 0.014), (37, 0.052), (39, 0.043), (41, 0.057), (53, 0.074), (55, 0.023), (59, 0.045), (72, 0.068), (73, 0.014), (75, 0.188), (91, 0.034), (96, 0.171)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.93032753 303 acl-2011-Tier-based Strictly Local Constraints for Phonology

Author: Jeffrey Heinz ; Chetan Rawal ; Herbert G. Tanner

Abstract: Beginning with Goldsmith (1976), the phonological tier has a long history in phonological theory to describe non-local phenomena. This paper defines a class of formal languages, the Tier-based Strictly Local languages, which begin to describe such phenomena. Then this class is located within the Subregular Hierarchy (McNaughton and Papert, 1971). It is found that these languages contain the Strictly Local languages, are star-free, are incomparable with other known sub-star-free classes, and have other interesting properties.

2 0.88267225 299 acl-2011-The Arabic Online Commentary Dataset: an Annotated Dataset of Informal Arabic with High Dialectal Content

Author: Omar F. Zaidan ; Chris Callison-Burch

Abstract: The written form of Arabic, Modern Standard Arabic (MSA), differs quite a bit from the spoken dialects of Arabic, which are the true “native” languages of Arabic speakers used in daily life. However, due to MSA’s prevalence in written form, almost all Arabic datasets have predominantly MSA content. We present the Arabic Online Commentary Dataset, a 52M-word monolingual dataset rich in dialectal content, and we describe our long-term annotation effort to identify the dialect level (and dialect itself) in each sentence of the dataset. So far, we have labeled 108K sentences, 41% of which as having dialectal content. We also present experimental results on the task of automatic dialect identification, using the collected labels for training and evaluation.

3 0.84616613 113 acl-2011-Efficient Online Locality Sensitive Hashing via Reservoir Counting

Author: Benjamin Van Durme ; Ashwin Lall

Abstract: We describe a novel mechanism called Reservoir Counting for application in online Locality Sensitive Hashing. This technique allows for significant savings in the streaming setting, allowing for maintaining a larger number of signatures, or an increased level of approximation accuracy at a similar memory footprint.

same-paper 4 0.83196747 72 acl-2011-Collecting Highly Parallel Data for Paraphrase Evaluation

Author: David Chen ; William Dolan

Abstract: A lack of standard datasets and evaluation metrics has prevented the field of paraphrasing from making the kind of rapid progress enjoyed by the machine translation community over the last 15 years. We address both problems by presenting a novel data collection framework that produces highly parallel text data relatively inexpensively and on a large scale. The highly parallel nature of this data allows us to use simple n-gram comparisons to measure both the semantic adequacy and lexical dissimilarity of paraphrase candidates. In addition to being simple and efficient to compute, experiments show that these metrics correlate highly with human judgments.

5 0.79450822 146 acl-2011-Goodness: A Method for Measuring Machine Translation Confidence

Author: Nguyen Bach ; Fei Huang ; Yaser Al-Onaizan

Abstract: State-of-the-art statistical machine translation (MT) systems have made significant progress towards producing user-acceptable translation output. However, there is still no efficient way for MT systems to inform users which words are likely translated correctly and how confident it is about the whole sentence. We propose a novel framework to predict wordlevel and sentence-level MT errors with a large number of novel features. Experimental results show that the MT error prediction accuracy is increased from 69.1 to 72.2 in F-score. The Pearson correlation between the proposed confidence measure and the human-targeted translation edit rate (HTER) is 0.6. Improve- ments between 0.4 and 0.9 TER reduction are obtained with the n-best list reranking task using the proposed confidence measure. Also, we present a visualization prototype of MT errors at the word and sentence levels with the objective to improve post-editor productivity.

6 0.75288284 132 acl-2011-Extracting Paraphrases from Definition Sentences on the Web

7 0.74704152 225 acl-2011-Monolingual Alignment by Edit Rate Computation on Sentential Paraphrase Pairs

8 0.74288023 87 acl-2011-Corpus Expansion for Statistical Machine Translation with Semantic Role Label Substitution Rules

9 0.74052978 37 acl-2011-An Empirical Evaluation of Data-Driven Paraphrase Generation Techniques

10 0.73679602 66 acl-2011-Chinese sentence segmentation as comma classification

11 0.73461199 323 acl-2011-Unsupervised Part-of-Speech Tagging with Bilingual Graph-Based Projections

12 0.73411435 159 acl-2011-Identifying Noun Product Features that Imply Opinions

13 0.72619927 327 acl-2011-Using Bilingual Parallel Corpora for Cross-Lingual Textual Entailment

14 0.72403908 90 acl-2011-Crowdsourcing Translation: Professional Quality from Non-Professionals

15 0.71774143 235 acl-2011-Optimal and Syntactically-Informed Decoding for Monolingual Phrase-Based Alignment

16 0.71704161 88 acl-2011-Creating a manually error-tagged and shallow-parsed learner corpus

17 0.71449673 137 acl-2011-Fine-Grained Class Label Markup of Search Queries

18 0.71402717 318 acl-2011-Unsupervised Bilingual Morpheme Segmentation and Alignment with Context-rich Hidden Semi-Markov Models

19 0.71004349 131 acl-2011-Extracting Opinion Expressions and Their Polarities - Exploration of Pipelines and Joint Models

20 0.70862234 62 acl-2011-Blast: A Tool for Error Analysis of Machine Translation Output