acl acl2011 acl2011-132 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Chikara Hashimoto ; Kentaro Torisawa ; Stijn De Saeger ; Jun'ichi Kazama ; Sadao Kurohashi
Abstract: We propose an automatic method of extracting paraphrases from definition sentences, which are also automatically acquired from the Web. We observe that a huge number of concepts are defined in Web documents, and that the sentences that define the same concept tend to convey mostly the same information using different expressions and thus contain many paraphrases. We show that a large number of paraphrases can be automatically extracted with high precision by regarding the sentences that define the same concept as parallel corpora. Experimental results indicated that with our method it was possible to extract about 300,000 paraphrases from 6 × 10^8 Web documents with a precision rate of about 94%.
Reference: text
sentIndex sentText sentNum sentScore
1 We propose an automatic method of extracting paraphrases from definition sentences, which are also automatically acquired from the Web. [sent-4, score-0.66]
2 We show that a large number of paraphrases can be automatically extracted with high precision by regarding the sentences that define the same concept as parallel corpora. [sent-6, score-0.589]
3 Experimental results indicated that with our method it was possible to extract about 300,000 paraphrases from 6 × 10^8 Web documents with a precision rate of about 94%. [sent-7, score-0.442]
4 , 2006) and have tried to acquire a large amount of paraphrase knowledge, which is a key to achieving robust automatic paraphrasing, from corpora (Lin and Pantel, 2001 ;Barzilay and McKeown, 2001 ; Shinyama et al. [sent-12, score-0.486]
5 We propose a method to extract phrasal paraphrases from pairs of sentences that define the same concept. [sent-14, score-0.639]
6 This suggests that we may be able to extract a large amount of phrasal paraphrase knowledge from the definition sentences on the Web. [sent-19, score-0.872]
7 We define paraphrase as a pair of expressions between which entailment relations of both directions hold. [sent-33, score-0.758]
8 Our objective is to extract phrasal paraphrases from pairs of sentences that define the same concept. [sent-35, score-0.573]
9 On the contrary, recognizing definition sentences for the same concept is quite an easy task at least for Japanese, as we will show, and we were able to find a huge amount of definition sentence pairs from normal Web texts. [sent-45, score-0.791]
10 In our experiments, about 30 million definition sentence pairs were extracted from 6 × 10^8 Web documents, and the estimated number of paraphrases recognized in the definition sentences using our method was about 300,000, for a precision rate of about 94%. [sent-46, score-1.205]
11 Our evaluation is based on bidirectional checking of entailment relations between paraphrases that considers the context dependence of a paraphrase. [sent-49, score-0.512]
12 Note that using definition sentences is only the beginning of our research on paraphrase extraction. [sent-50, score-0.81]
13 2 Related Work The existing work for paraphrase extraction is categorized into two groups. [sent-61, score-0.486]
14 These methods can be applied to a normal monolingual corpus, and it has been shown that a large number of paraphrases or entailment rules could be extracted. [sent-67, score-0.469]
15 Another limitation of these methods is that they can find only paraphrases consisting of frequently observed expressions since they must have reliable distributional similarity values for expressions that constitute paraphrases. [sent-73, score-0.453]
16 We avoid this by using definition sentences, which can be easily acquired on a large scale from the Web, as parallel corpora. [sent-83, score-0.368]
17 (2004) used definition sentences in two manually compiled dictionaries, which contain considerably fewer definition sentences than the Web does. [sent-85, score-0.648]
18 3 Proposed method Our method, targeting the Japanese language, consists of two steps: definition sentence acquisition and paraphrase extraction. [sent-89, score-0.87]
19 Although the particle sequence tends to mark the topic of the definition sentence, it can also appear in interrogative sentences and normal assertive sentences in which a topic is strongly emphasized. [sent-101, score-0.474]
20 After adding definition sentences from Wikipedia articles, which are typically the first sentence of the body of each article (Kazama and Torisawa, 2007), we obtained a total of 2,141,878 definition sentence candidates, which covered 867,321 concepts ranging from weapons to rules of baseball. [sent-119, score-0.688]
21 Then, we coupled two definition sentences whose defined concepts were the same and obtained 29,661,812 definition sentence pairs. [sent-120, score-0.633]
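To make the acquisition and pairing steps concrete, here is a minimal sketch, under stated assumptions: the Japanese topic-particle pattern (とは) and the simple grouping by defined concept are illustrative only, and the paper's actual acquisition step also draws on Wikipedia first sentences and more careful filtering.

```python
import itertools
import re
from collections import defaultdict

# Illustrative pattern: many Japanese definition sentences start with
# "X とは ..." ("X is ..."); the exact pattern and filters used in the paper
# are not reproduced here.
DEFINITION_PATTERN = re.compile(r"^(?P<concept>.+?)とは")


def extract_definition_candidate(sentence):
    """Return (defined concept, sentence) if the sentence looks like a definition."""
    match = DEFINITION_PATTERN.match(sentence)
    if match is None:
        return None
    return match.group("concept"), sentence


def build_definition_pairs(sentences):
    """Group definition candidates by defined concept and couple every two
    definitions of the same concept into a definition sentence pair."""
    by_concept = defaultdict(list)
    for sentence in sentences:
        candidate = extract_definition_candidate(sentence)
        if candidate is not None:
            concept, definition = candidate
            by_concept[concept].append(definition)
    pairs = []
    for concept, definitions in by_concept.items():
        for s1, s2 in itertools.combinations(definitions, 2):
            pairs.append((concept, s1, s2))
    return pairs
```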
22 The extracted dependency tree fragments are called candidate phrases hereafter. [sent-126, score-0.392]
23 Then, we check all the pairs of candidate phrases between two definition sentences to find paraphrase pairs. [sent-129, score-1.188]
24 3 In (1), repeated in (3), candidate phrase pairs to be checked include (? [sent-130, score-0.393]
25 3Our method discards candidate phrase pairs in which one subsumes the other in terms of their character string, or the difference is only one proper noun like “toner cartridges that Apple Inc. [sent-153, score-0.459]
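A rough sketch of the candidate pair enumeration and the trivial-pair filter described above follows. It assumes the candidate phrases have already been extracted as dependency tree fragments and are represented as lists of morphemes; `proper_nouns` is a hypothetical set of morphemes tagged as proper nouns by a morphological analyzer, and both checks are approximations of the paper's filter.

```python
import itertools


def subsumes(p1, p2):
    """True if one candidate phrase subsumes the other as a character string."""
    s1, s2 = "".join(p1), "".join(p2)
    return s1 in s2 or s2 in s1


def differs_only_by_proper_noun(p1, p2, proper_nouns):
    """True if the two phrases are identical except for one proper noun each
    (proper_nouns is a hypothetical set of proper-noun morphemes)."""
    diff1 = [m for m in p1 if m not in p2]
    diff2 = [m for m in p2 if m not in p1]
    return (len(diff1) == 1 and len(diff2) == 1
            and diff1[0] in proper_nouns and diff2[0] in proper_nouns)


def candidate_phrase_pairs(phrases1, phrases2, proper_nouns=frozenset()):
    """Enumerate all cross pairs of candidate phrases (morpheme lists) taken
    from two definition sentences, dropping the trivial pairs."""
    for p1, p2 in itertools.product(phrases1, phrases2):
        if subsumes(p1, p2):
            continue
        if differs_only_by_proper_noun(p1, p2, proper_nouns):
            continue
        yield p1, p2
```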
26 f1: The ratio of the number of morphemes shared between two candidate phrases to the number of all of the morphemes in the two phrases. [sent-156, score-0.578]
27 f2: The ratio of the number of a candidate phrase's morphemes, for which there is a morpheme with small edit distance (1 in our experiment) in the other candidate phrase, to the number of all of the morphemes in the two phrases. [sent-157, score-0.669]
28 f3: The ratio of the number of a candidate phrase's morphemes, for which there is a morpheme with the same pronunciation in the other candidate phrase, to the number of all of the morphemes in the two phrases. [sent-159, score-0.765]
29 f4: The ratio of the number of morphemes of the shorter candidate phrase to that of the longer one. [sent-162, score-0.455]
30 f8: The ratio of the number of morphemes that appear in the candidate phrase segment of a definition sentence s1 and in a segment that is NOT a part of the candidate phrase of the other definition sentence s2, to the number of all of the morphemes of s1's candidate phrase. [sent-166, score-1.684]
31 f10: The ratio of the number of parent dependency tree fragments that are shared by two candidate phrases to the number of all of the parent dependency tree fragments of the two phrases. [sent-170, score-0.554]
32 f13: The ratio of the number of unigrams (morphemes) that appear in the child context of both candidate phrases to the number of all of the child context morphemes of both candidate phrases. [sent-174, score-0.742]
33 f16: The ratio of the number of trigrams that appear in the child context of both candidate phrases to the number of all of the child context morphemes of both candidate phrases. [sent-178, score-0.742]
34 f17: Cosine similarity between the two definition sentences from which a candidate phrase pair is extracted. [sent-180, score-0.641]
35 Table 1: Features used by the paraphrase classifier (a rough sketch of a few of these feature computations is given below). [sent-181, score-0.486]
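The following sketch shows how a few of the string-level features in Table 1 (f1, f4, and f17) might be computed. Candidate phrases and sentences are assumed to be lists of morphemes, and the exact counting conventions are not specified in the excerpt above, so this is only one plausible reading.

```python
import math
from collections import Counter


def f1_shared_morpheme_ratio(p1, p2):
    """f1: morphemes shared by the two candidate phrases, relative to all
    morphemes in the two phrases (one plausible reading of the definition)."""
    shared = sum((Counter(p1) & Counter(p2)).values())
    return shared / (len(p1) + len(p2)) if p1 or p2 else 0.0


def f4_length_ratio(p1, p2):
    """f4: number of morphemes of the shorter phrase over that of the longer one."""
    shorter, longer = sorted((len(p1), len(p2)))
    return shorter / longer if longer else 0.0


def f17_cosine_similarity(s1_morphemes, s2_morphemes):
    """f17: cosine similarity of the two definition sentences as bags of morphemes."""
    c1, c2 = Counter(s1_morphemes), Counter(s2_morphemes)
    dot = sum(c1[m] * c2[m] for m in c1)
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0
```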
36 The paraphrase checking of candidate phrase pairs is performed by an SVM classifier with a linear kernel that classifies each pair of candidate phrases as a paraphrase or a non-paraphrase. [sent-182, score-1.697]
37 Features for the classifier are based on our observation that two candidate phrases tend to be paraphrases if the candidate phrases themselves are sufficiently similar and/or their surrounding contexts are sufficiently similar. [sent-184, score-0.878]
38 In the figure, f8 has a positive value since the candidate phrase of s1 contains the morphemes "of bone", which do not appear in the candidate phrase of s2. 4 We use SVMperf, available at http://svmlight. [sent-190, score-0.445]
39 5 In the table, the parent context of a candidate phrase consists of expressions that appear in ancestor nodes of the candidate phrase in terms of the dependency structure of the sentence. [sent-194, score-0.659]
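For the context features (for example f13 and f16), the parent and child contexts of a candidate phrase would first be collected from the dependency tree and then compared as bags of n-grams. The sketch below assumes the contexts are already available as morpheme sequences and is only an approximation of the feature definitions above.

```python
def ngrams(tokens, n):
    """All n-grams (as tuples) over a morpheme sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def context_ngram_overlap(context1, context2, n=1):
    """Ratio of n-grams that appear in both contexts to all n-grams of the
    two contexts -- in the spirit of f13 (n=1) and f16 (n=3)."""
    g1, g2 = ngrams(context1, n), ngrams(context2, n)
    if not g1 and not g2:
        return 0.0
    shared = set(g1) & set(g2)
    count_shared = sum(1 for g in g1 + g2 if g in shared)
    return count_shared / (len(g1) + len(g2))
```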
40 s2: Osteoporosis is a disease that decreases the quantity of bone and increases the risk of bone fracture. s′2: Osteoporosis is a disease that reduces bone mass and increases the risk of bone fracture. [sent-205, score-1.34]
41 In preparing the training data, we faced the problem that the completely random sampling of candidate paraphrase pairs provided us with only a small number of positive examples. [sent-217, score-0.84]
42 Thus, we automatically collected candidate paraphrase pairs that were expected to have a high likelihood of being positive as examples to be labeled. [sent-218, score-0.84]
43 The likelihood was calculated by simply summing all of the 78 feature values that we have tried, since they indicate the likelihood of a given candidate paraphrase pair’s being a paraphrase. [sent-219, score-0.686]
44 Specifically, we first randomly sampled 30,000 definition sentence pairs from the 29,661,812 pairs, and collected 3,000 candidate phrase pairs that had the highest likelihood from them. [sent-221, score-0.892]
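The selection of labeling candidates described above can be sketched roughly as follows. `extract_candidate_pairs` and `feature_functions` are hypothetical stand-ins for the candidate phrase pair enumeration and the feature extractors; only the sampling-and-ranking logic is intended to mirror the description above.

```python
import random


def select_labeling_candidates(definition_pairs, extract_candidate_pairs,
                               feature_functions, n_pairs=30000, top_k=3000,
                               seed=0):
    """Rank candidate phrase pairs from a random sample of definition
    sentence pairs by the unweighted sum of their feature values, and keep
    the top ones for manual labeling."""
    rng = random.Random(seed)
    sample = rng.sample(definition_pairs, min(n_pairs, len(definition_pairs)))
    scored = []
    for s1, s2 in sample:
        for p1, p2 in extract_candidate_pairs(s1, s2):
            likelihood = sum(f(p1, p2, s1, s2) for f in feature_functions)
            scored.append((likelihood, p1, p2, s1, s2))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]
```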
45 The manual labeling of each candidate phrase pair (p1,p2) was based on bidirectional checking of entailment relations, p1 → p2 and p2 → p1, with p1 and p2 embedded in contexts. [sent-222, score-0.538]
46 We adopt this scheme since paraphrase judgment might be unstable between annotators unless they are given a particular context based on which they make a judgment. [sent-225, score-0.522]
47 First, from each candidate phrase pair (p1,p2) and its source definition sentence pair (s1, s2), we create two paraphrased sentences, s′1 and s′2, by exchanging p1 and p2 between s1 and s2. [sent-229, score-1.331]
48 In this example, both entailment relations, s1 → s′1 and s2 → s′2, hold, and thus the candidate phrase pair (p1,p2) is judged as positive. [sent-232, score-0.465]
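This labeling scheme can be illustrated with a small sketch: the candidate phrases are swapped between the two definition sentences, and the pair is labeled positive only if both resulting entailments are judged to hold. Plain string replacement is used here for simplicity, although the paper operates on dependency trees; `entails` stands in for the annotators' judgment.

```python
def make_swapped_sentences(s1, s2, p1, p2):
    """Create the two paraphrased sentences used for labeling: s1 with its
    candidate phrase p1 replaced by p2, and s2 with p2 replaced by p1."""
    s1_prime = s1.replace(p1, p2, 1)
    s2_prime = s2.replace(p2, p1, 1)
    return s1_prime, s2_prime


def is_positive(entails, s1, s2, s1_prime, s2_prime):
    """A candidate pair is labeled positive iff both entailments hold:
    s1 -> s1' (checks p1 -> p2 in context) and s2 -> s2' (checks p2 -> p1)."""
    return entails(s1, s1_prime) and entails(s2, s2_prime)
```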
49 6 We built the paraphrase classifier from the training data. [sent-234, score-0.486]
50 As mentioned, candidate phrase pairs were ranked by the distance from the SVM’s hyperplane. [sent-235, score-0.393]
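A minimal sketch of this classification and ranking step is shown below, using scikit-learn's LinearSVC as a stand-in for the SVMperf classifier used in the paper: candidate phrase pairs are represented by their feature vectors and ranked by signed distance from the learned hyperplane.

```python
import numpy as np
from sklearn.svm import LinearSVC


def train_and_rank(train_features, train_labels,
                   candidate_features, candidate_pairs):
    """Train a linear SVM on labeled candidate phrase pairs, then rank the
    remaining candidates by their signed distance from the hyperplane."""
    classifier = LinearSVC()
    classifier.fit(np.asarray(train_features), np.asarray(train_labels))
    scores = classifier.decision_function(np.asarray(candidate_features))
    ranked = sorted(zip(scores, candidate_pairs),
                    key=lambda item: item[0], reverse=True)
    return [(pair, float(score)) for score, pair in ranked]
```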
51 Definition sentences on the Web are a treasure trove of paraphrase knowledge (Section 4. [sent-238, score-0.59]
52 Our method of paraphrase acquisition from definition sentences is more accurate than wellknown competing methods (Section 4. [sent-241, score-0.885]
53 org/moses/ 8 As anonymous reviewers pointed out, they are unsupervised methods and thus unable to be adapted to definition sentences. We verify claim I by comparing definition sentence pairs with sentence pairs that are acquired from the Web using Yahoo! [sent-250, score-0.911]
54 In the latter data set, two sentences of each pair are expected to be semantically similar regardless of whether they are definition sentences. [sent-252, score-0.368]
55 competing methods In this experiment, paraphrase pairs are extracted from 100,000 definition sentence pairs that are randomly sampled from the 29,661,812 pairs. [sent-259, score-1.146]
56 First, it collects from the parallel sentences identical word pairs and their contexts (POS N-grams with indices indicating corresponding words between paired contexts) as positive examples and those of different word pairs as negative ones. [sent-262, score-0.446]
57 The most likely K positive (negative) contexts are used to extract positive (negative) paraphrases from the parallel sentences. [sent-264, score-0.522]
58 Extracted positive (negative) paraphrases and their morpho-syntactic patterns are used to collect additional positive (negative) contexts. [sent-265, score-0.389]
59 All the positive (negative) contexts are ranked, and additional paraphrases and their morpho-syntactic patterns are extracted again. [sent-266, score-0.437]
60 This iterative process finishes if no further paraphrase is extracted or the number of iterations reaches a predefined threshold T. [sent-267, score-0.527]
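The control flow of this bootstrapping procedure (following Barzilay and McKeown, 2001) can be summarized as a schematic loop. The helper functions are hypothetical stand-ins for context collection, context ranking, and paraphrase extraction; only the iteration and stopping conditions are shown.

```python
def bm_bootstrap(parallel_sentences, extract_contexts, rank_contexts,
                 extract_paraphrases, k=10, max_iterations=5):
    """Schematic bootstrapping loop: extract paraphrases with the top-k
    positive/negative contexts, use them to collect further contexts, and
    stop when nothing new is extracted or the iteration limit is reached."""
    pos_contexts, neg_contexts = extract_contexts(parallel_sentences, seeds=None)
    paraphrases = set()
    for _ in range(max_iterations):
        top_pos = rank_contexts(pos_contexts)[:k]
        top_neg = rank_contexts(neg_contexts)[:k]
        extracted = set(extract_paraphrases(parallel_sentences, top_pos, top_neg))
        new_paraphrases = extracted - paraphrases
        if not new_paraphrases:
            break  # no further paraphrase is extracted
        paraphrases |= new_paraphrases
        # extracted paraphrases and their patterns yield additional contexts
        more_pos, more_neg = extract_contexts(parallel_sentences,
                                              seeds=new_paraphrases)
        pos_contexts = pos_contexts + more_pos
        neg_contexts = neg_contexts + more_neg
    return paraphrases
```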
61 Note that paraphrases extracted by this method are not ranked. [sent-271, score-0.394]
62 If you give Moses monolingual parallel sentence pairs, it should extract pairs of phrases that are paraphrases of each other. [sent-279, score-0.526]
63 (2004) proposed a method to extract paraphrases from two manually compiled dictionaries. [sent-285, score-0.384]
64 It simply regards a difference between two definition sentences of the same word as a paraphrase candidate. [sent-286, score-0.81]
65 They assume that a paraphrase candidate tends to be a valid paraphrase if it is surrounded by infrequent strings and/or if it appears multiple times in the data. [sent-288, score-1.172]
66 The unsupervised method works in the same way as the supervised one, except that it ranks candidate phrase pairs by the sum of all 17 feature values, instead of the distance from the SVM’s hyperplane. [sent-291, score-0.425]
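The unsupervised ranking is then a one-line change relative to the supervised variant; a small sketch is given below, assuming the 17-dimensional feature vectors are precomputed.

```python
def rank_unsupervised(candidate_pairs, feature_vectors):
    """Rank candidate phrase pairs by the plain sum of their 17 feature
    values instead of the distance from the SVM's hyperplane."""
    scored = sorted(zip(candidate_pairs, feature_vectors),
                    key=lambda item: sum(item[1]), reverse=True)
    return [pair for pair, _ in scored]
```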
67 BM, SMT, Mrt, and the two versions of our method were used to extract paraphrase pairs from the same 100,000 definition sentence pairs. [sent-294, score-0.978]
68 Evaluation scheme Evaluation of each paraphrase pair (p1,p2) was based on bidirectional checking of entailment relations p1 → p2 and p2 → p1 in a way similar to the labeling of the training data. [sent-295, score-0.751]
69 The difference is that contexts for evaluation are two sentences that are retrieved from the Web and contain p1 and p2, instead of definition sentences from which p1 and p2 are extracted. [sent-296, score-0.435]
70 This is intended to check whether extracted paraphrases are also valid for contexts other than those from which they are extracted. [sent-297, score-0.403]
71 For the top m paraphrase pairs of each method (in the case of the BM method, randomly sampled m pairs were used, since the method does not rank paraphrase pairs), we retrieved a sentence pair (s1, s2) for each paraphrase pair (p1,p2) from the Web, such that s1 contains p1 and s2 contains p2. [sent-299, score-1.975]
72 For each method, we randomly sample n samples from all of the paraphrase pairs (p1,p2) for which both s1 and s2 are retrieved. [sent-301, score-0.639]
73 Then, from each (p1,p2) and (s1, s2), we create two paraphrased sentences, s′1 and s′2, by exchanging p1 and p2 between s1 and s2. [sent-302, score-0.661]
74 We regard each paraphrase pair as correct if at least two annotators judge that entailment relations of both directions hold for it. [sent-305, score-0.711]
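The evaluation bookkeeping reduces to a small amount of code: a pair counts as correct when at least two annotators accept both entailment directions, and precision is the fraction of correct pairs in the evaluated sample. The sketch below assumes the judgments are available as per-annotator boolean pairs.

```python
def is_correct(judgments, min_agreement=2):
    """A paraphrase pair counts as correct if at least `min_agreement`
    annotators judged that entailment holds in both directions.
    `judgments` is a list of (forward_holds, backward_holds) booleans,
    one per annotator."""
    votes = sum(1 for forward, backward in judgments if forward and backward)
    return votes >= min_agreement


def precision_of_sample(sample_judgments):
    """Precision over the evaluated sample of extracted paraphrase pairs."""
    if not sample_judgments:
        return 0.0
    correct = sum(1 for judgments in sample_judgments if is_correct(judgments))
    return correct / len(sample_judgments)
```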
75 You may wonder whether only one pair of sentences (s1, s2) is enough for evaluation since a correct (wrong) paraphrase pair might be judged as wrong (correct) accidentally. [sent-306, score-0.644]
76 Thus, we estimate that Sup can extract about 300,000 paraphrase pairs with a precision rate of about 94%, if we use all 29,661,812 definition sentence pairs that we acquired. [sent-315, score-1.124]
77 Furthermore, we measured precision after trivial paraphrase pairs were discarded from the evaluation samples of each method. [sent-316, score-0.772]
78 The upper half of Table 2 shows the number of extracted paraphrases with/without trivial pairs for each method. [sent-322, score-0.557]
79 As the examples indicate, many of the extracted paraphrases are not specific to definition sentences and seem very reusable. [sent-326, score-0.686]
80 However, there are few paraphrases involving metaphors or idioms in the outputs due to the nature of definition sentences. [sent-327, score-0.575]
81 11 We set no threshold for candidate phrase pairs of each method, and counted all the candidate phrase pairs in Table 2. [sent-343, score-0.786]
82 Figure 3: Precision curves of paraphrase extraction. (a) Definition sentence pairs with trivial paraphrases; (b) definition sentence pairs without trivial paraphrases; (c) Web sentence pairs with trivial paraphrases; (d) Web sentence pairs without trivial paraphrases. [sent-344, score-2.77]
83 The lower half of Table 2 shows the number of extracted paraphrases with/without trivial pairs for each method. [sent-362, score-0.557]
84 12 We think that a precision rate of at least 90% would be necessary if you apply automatically extracted paraphrases to NLP tasks without manual annotation. [sent-365, score-0.42]
85 Only the combination of Sup and definition sentence pairs achieved that precision. [sent-366, score-0.429]
86 Also note that, for all of the methods, fewer paraphrases are extracted from Web sentence pairs than from definition sentence pairs. [sent-367, score-0.846]
87 5 Conclusion We proposed a method of extracting paraphrases from definition sentences on the Web. [sent-372, score-0.677]
88 Definition sentences on the Web are a treasure trove of paraphrase knowledge. [sent-375, score-0.59]
89 Our method extracts many paraphrases from the definition sentences on the Web accurately; it can extract about 300,000 paraphrases from 6 × 10^8 Web documents with a precision rate of about 94%. [sent-377, score-1.087]
90 First, we will release extracted paraphrases from all of the 29,661,812 definition sentence pairs that we acquired, after human annotators check their validity. [sent-379, score-0.827]
91 13 Second, we plan to induce paraphrase rules from paraphrase instances. [sent-381, score-0.972]
92 Though our method can extract a variety of paraphrase instances on a large scale, their coverage might be insufficient for real NLP applications since some paraphrase phenomena are highly productive. [sent-382, score-1.035]
93 Therefore, we need paraphrase rules in addition to paraphrase instances. [sent-383, score-0.972]
94 Barzilay and McKeown (2001) induced simple POS-based paraphrase rules from paraphrase instances, which can be a good starting point. [sent-384, score-0.972]
95 Finally, as mentioned in Section 1, the work in this paper is only the beginning of our research on paraphrase extraction. [sent-385, score-0.486]
96 We are trying to extract far more paraphrases from a set of sentences fulfilling the same pragmatic function (e. [sent-386, score-0.456]
97 Unsupervised construction of large paraphrase corpora: exploiting massively parallel news sources. [sent-426, score-0.547]
98 Lexical selection and paraphrase in a meaning-text generation model. [sent-447, score-0.486]
99 Automatic paraphrase acquisition based on matching of definition sentences in plural dictionaries (written in Japanese). [sent-496, score-0.853]
100 Automatic paraphrase discovery based on context and keywords between NE pairs. [sent-513, score-0.486]
wordName wordTfidf (topN-words)
[('paraphrase', 0.486), ('bone', 0.327), ('paraphrases', 0.321), ('definition', 0.254), ('osteoporosis', 0.206), ('candidate', 0.2), ('entailment', 0.148), ('morphemes', 0.138), ('pairs', 0.12), ('bones', 0.103), ('barzilay', 0.098), ('sup', 0.098), ('pronunciation', 0.096), ('bm', 0.092), ('disease', 0.092), ('morpheme', 0.087), ('mrt', 0.086), ('web', 0.08), ('murata', 0.076), ('trivial', 0.075), ('phrase', 0.073), ('sentences', 0.07), ('sampled', 0.07), ('paraphrasing', 0.066), ('mckeown', 0.065), ('smt', 0.062), ('parallel', 0.061), ('hashimoto', 0.061), ('fragments', 0.06), ('phrases', 0.058), ('precision', 0.058), ('japanese', 0.057), ('kazama', 0.057), ('szpektor', 0.056), ('quantity', 0.055), ('sentence', 0.055), ('kentaro', 0.054), ('acquired', 0.053), ('antonymous', 0.052), ('fracture', 0.052), ('child', 0.051), ('torisawa', 0.048), ('expressions', 0.047), ('pair', 0.044), ('particle', 0.044), ('shinyama', 0.044), ('ratio', 0.044), ('claim', 0.043), ('acquisition', 0.043), ('risk', 0.043), ('bidirectional', 0.043), ('androutsopoulos', 0.042), ('contexts', 0.041), ('mass', 0.041), ('extracted', 0.041), ('svm', 0.04), ('yahoo', 0.04), ('concept', 0.038), ('distributional', 0.038), ('entails', 0.037), ('decreases', 0.036), ('annotators', 0.036), ('interrogative', 0.036), ('japan', 0.035), ('akamine', 0.034), ('cartridges', 0.034), ('chikara', 0.034), ('cuisine', 0.034), ('fulfilling', 0.034), ('geffet', 0.034), ('knp', 0.034), ('kyot', 0.034), ('saeger', 0.034), ('stijn', 0.034), ('toner', 0.034), ('trove', 0.034), ('positive', 0.034), ('chris', 0.034), ('samples', 0.033), ('snippet', 0.033), ('ido', 0.033), ('claims', 0.033), ('dependency', 0.033), ('parent', 0.033), ('directions', 0.033), ('ichi', 0.032), ('kyoto', 0.032), ('method', 0.032), ('phrasal', 0.031), ('ablation', 0.031), ('extract', 0.031), ('dolan', 0.03), ('pantel', 0.03), ('checking', 0.03), ('fragile', 0.03), ('iordanskaja', 0.03), ('kauchak', 0.03), ('recipe', 0.03), ('uns', 0.03), ('yutaka', 0.03)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 132 acl-2011-Extracting Paraphrases from Definition Sentences on the Web
Author: Chikara Hashimoto ; Kentaro Torisawa ; Stijn De Saeger ; Jun'ichi Kazama ; Sadao Kurohashi
Abstract: We propose an automatic method of extracting paraphrases from definition sentences, which are also automatically acquired from the Web. We observe that a huge number of concepts are defined in Web documents, and that the sentences that define the same concept tend to convey mostly the same information using different expressions and thus contain many paraphrases. We show that a large number of paraphrases can be automatically extracted with high precision by regarding the sentences that define the same concept as parallel corpora. Experimental results indicated that with our method it was possible to extract about 300,000 paraphrases from 6 × 10^8 Web documents with a precision rate of about 94%.
2 0.49668866 225 acl-2011-Monolingual Alignment by Edit Rate Computation on Sentential Paraphrase Pairs
Author: Houda Bouamor ; Aurelien Max ; Anne Vilnat
Abstract: In this paper, we present a novel way of tackling the monolingual alignment problem on pairs of sentential paraphrases by means of edit rate computation. In order to inform the edit rate, information in the form of subsentential paraphrases is provided by a range of techniques built for different purposes. We show that the tunable TER-PLUS metric from Machine Translation evaluation can achieve good performance on this task and that it can effectively exploit information coming from complementary sources.
3 0.42666358 37 acl-2011-An Empirical Evaluation of Data-Driven Paraphrase Generation Techniques
Author: Donald Metzler ; Eduard Hovy ; Chunliang Zhang
Abstract: Paraphrase generation is an important task that has received a great deal of interest recently. Proposed data-driven solutions to the problem have ranged from simple approaches that make minimal use of NLP tools to more complex approaches that rely on numerous language-dependent resources. Despite all of the attention, there have been very few direct empirical evaluations comparing the merits of the different approaches. This paper empirically examines the tradeoffs between simple and sophisticated paraphrase harvesting approaches to help shed light on their strengths and weaknesses. Our evaluation reveals that very simple approaches fare surprisingly well and have a number of distinct advantages, including strong precision, good coverage, and low redundancy.
4 0.32098758 72 acl-2011-Collecting Highly Parallel Data for Paraphrase Evaluation
Author: David Chen ; William Dolan
Abstract: A lack of standard datasets and evaluation metrics has prevented the field of paraphrasing from making the kind of rapid progress enjoyed by the machine translation community over the last 15 years. We address both problems by presenting a novel data collection framework that produces highly parallel text data relatively inexpensively and on a large scale. The highly parallel nature of this data allows us to use simple n-gram comparisons to measure both the semantic adequacy and lexical dissimilarity of paraphrase candidates. In addition to being simple and efficient to compute, experiments show that these metrics correlate highly with human judgments.
5 0.31058201 327 acl-2011-Using Bilingual Parallel Corpora for Cross-Lingual Textual Entailment
Author: Yashar Mehdad ; Matteo Negri ; Marcello Federico
Abstract: This paper explores the use of bilingual parallel corpora as a source of lexical knowledge for cross-lingual textual entailment. We claim that, in spite of the inherent difficulties of the task, phrase tables extracted from parallel data allow to capture both lexical relations between single words, and contextual information useful for inference. We experiment with a phrasal matching method in order to: i) build a system portable across languages, and ii) evaluate the contribution of lexical knowledge in isolation, without interaction with other inference mechanisms. Results achieved on an English-Spanish corpus obtained from the RTE3 dataset support our claim, with an overall accuracy above average scores reported by RTE participants on monolingual data. Finally, we show that using parallel corpora to extract paraphrase tables reveals their potential also in the monolingual setting, improving the results achieved with other sources of lexical knowledge.
6 0.13128865 87 acl-2011-Corpus Expansion for Statistical Machine Translation with Semantic Role Label Substitution Rules
7 0.13023113 22 acl-2011-A Probabilistic Modeling Framework for Lexical Entailment
8 0.12911786 310 acl-2011-Translating from Morphologically Complex Languages: A Paraphrase-Based Approach
9 0.12339563 144 acl-2011-Global Learning of Typed Entailment Rules
10 0.11981872 318 acl-2011-Unsupervised Bilingual Morpheme Segmentation and Alignment with Context-rich Hidden Semi-Markov Models
11 0.1193443 333 acl-2011-Web-Scale Features for Full-Scale Parsing
12 0.088869825 259 acl-2011-Rare Word Translation Extraction from Aligned Comparable Documents
13 0.084226981 43 acl-2011-An Unsupervised Model for Joint Phrase Alignment and Extraction
14 0.083435051 235 acl-2011-Optimal and Syntactically-Informed Decoding for Monolingual Phrase-Based Alignment
15 0.081559598 117 acl-2011-Entity Set Expansion using Topic information
16 0.080650948 306 acl-2011-Towards Style Transformation from Written-Style to Audio-Style
17 0.077691101 52 acl-2011-Automatic Labelling of Topic Models
18 0.072408929 29 acl-2011-A Word-Class Approach to Labeling PSCFG Rules for Machine Translation
19 0.068271615 11 acl-2011-A Fast and Accurate Method for Approximate String Search
20 0.068143204 331 acl-2011-Using Large Monolingual and Bilingual Corpora to Improve Coordination Disambiguation
topicId topicWeight
[(0, 0.225), (1, -0.04), (2, -0.012), (3, 0.143), (4, 0.011), (5, 0.024), (6, 0.241), (7, 0.013), (8, -0.004), (9, -0.482), (10, -0.043), (11, 0.336), (12, -0.016), (13, 0.049), (14, 0.182), (15, -0.012), (16, -0.089), (17, 0.068), (18, 0.008), (19, 0.108), (20, -0.055), (21, -0.065), (22, 0.045), (23, -0.003), (24, 0.0), (25, -0.01), (26, 0.058), (27, 0.005), (28, -0.044), (29, 0.019), (30, -0.053), (31, 0.026), (32, 0.001), (33, 0.034), (34, -0.013), (35, -0.032), (36, -0.062), (37, -0.001), (38, -0.014), (39, -0.01), (40, -0.004), (41, -0.038), (42, -0.068), (43, 0.024), (44, 0.001), (45, -0.017), (46, 0.006), (47, 0.001), (48, -0.028), (49, -0.032)]
simIndex simValue paperId paperTitle
same-paper 1 0.93589532 132 acl-2011-Extracting Paraphrases from Definition Sentences on the Web
Author: Chikara Hashimoto ; Kentaro Torisawa ; Stijn De Saeger ; Jun'ichi Kazama ; Sadao Kurohashi
Abstract: We propose an automatic method of extracting paraphrases from definition sentences, which are also automatically acquired from the Web. We observe that a huge number of concepts are defined in Web documents, and that the sentences that define the same concept tend to convey mostly the same information using different expressions and thus contain many paraphrases. We show that a large number of paraphrases can be automatically extracted with high precision by regarding the sentences that define the same concept as parallel corpora. Experimental results indicated that with our method it was possible to extract about 300,000 paraphrases from 6 × 10^8 Web documents with a precision rate of about 94%.
2 0.91539663 37 acl-2011-An Empirical Evaluation of Data-Driven Paraphrase Generation Techniques
Author: Donald Metzler ; Eduard Hovy ; Chunliang Zhang
Abstract: Paraphrase generation is an important task that has received a great deal of interest recently. Proposed data-driven solutions to the problem have ranged from simple approaches that make minimal use of NLP tools to more complex approaches that rely on numerous language-dependent resources. Despite all of the attention, there have been very few direct empirical evaluations comparing the merits of the different approaches. This paper empirically examines the tradeoffs between simple and sophisticated paraphrase harvesting approaches to help shed light on their strengths and weaknesses. Our evaluation reveals that very simple approaches fare surprisingly well and have a number of distinct advantages, including strong precision, good coverage, and low redundancy.
3 0.9085055 225 acl-2011-Monolingual Alignment by Edit Rate Computation on Sentential Paraphrase Pairs
Author: Houda Bouamor ; Aurelien Max ; Anne Vilnat
Abstract: In this paper, we present a novel way of tackling the monolingual alignment problem on pairs of sentential paraphrases by means of edit rate computation. In order to inform the edit rate, information in the form of subsentential paraphrases is provided by a range of techniques built for different purposes. We show that the tunable TER-PLUS metric from Machine Translation evaluation can achieve good performance on this task and that it can effectively exploit information coming from complementary sources.
4 0.73908335 72 acl-2011-Collecting Highly Parallel Data for Paraphrase Evaluation
Author: David Chen ; William Dolan
Abstract: A lack of standard datasets and evaluation metrics has prevented the field of paraphrasing from making the kind of rapid progress enjoyed by the machine translation community over the last 15 years. We address both problems by presenting a novel data collection framework that produces highly parallel text data relatively inexpensively and on a large scale. The highly parallel nature of this data allows us to use simple n-gram comparisons to measure both the semantic adequacy and lexical dissimilarity of paraphrase candidates. In addition to being simple and efficient to compute, experiments show that these metrics correlate highly with human judgments.
5 0.72319192 327 acl-2011-Using Bilingual Parallel Corpora for Cross-Lingual Textual Entailment
Author: Yashar Mehdad ; Matteo Negri ; Marcello Federico
Abstract: This paper explores the use of bilingual parallel corpora as a source of lexical knowledge for cross-lingual textual entailment. We claim that, in spite of the inherent difficulties of the task, phrase tables extracted from parallel data allow to capture both lexical relations between single words, and contextual information useful for inference. We experiment with a phrasal matching method in order to: i) build a system portable across languages, and ii) evaluate the contribution of lexical knowledge in isolation, without interaction with other inference mechanisms. Results achieved on an English-Spanish corpus obtained from the RTE3 dataset support our claim, with an overall accuracy above average scores reported by RTE participants on monolingual data. Finally, we show that using parallel corpora to extract paraphrase tables reveals their potential also in the monolingual setting, improving the results achieved with other sources of lexical knowledge.
6 0.49688938 310 acl-2011-Translating from Morphologically Complex Languages: A Paraphrase-Based Approach
7 0.38664511 306 acl-2011-Towards Style Transformation from Written-Style to Audio-Style
8 0.37028202 87 acl-2011-Corpus Expansion for Statistical Machine Translation with Semantic Role Label Substitution Rules
9 0.35712174 74 acl-2011-Combining Indicators of Allophony
10 0.32268521 22 acl-2011-A Probabilistic Modeling Framework for Lexical Entailment
11 0.29771593 231 acl-2011-Nonlinear Evidence Fusion and Propagation for Hyponymy Relation Mining
12 0.29218361 235 acl-2011-Optimal and Syntactically-Informed Decoding for Monolingual Phrase-Based Alignment
13 0.29200441 43 acl-2011-An Unsupervised Model for Joint Phrase Alignment and Extraction
14 0.2839658 331 acl-2011-Using Large Monolingual and Bilingual Corpora to Improve Coordination Disambiguation
15 0.27479222 333 acl-2011-Web-Scale Features for Full-Scale Parsing
16 0.2724497 315 acl-2011-Types of Common-Sense Knowledge Needed for Recognizing Textual Entailment
17 0.26531535 144 acl-2011-Global Learning of Typed Entailment Rules
18 0.25652939 297 acl-2011-That's What She Said: Double Entendre Identification
19 0.25533521 115 acl-2011-Engkoo: Mining the Web for Language Learning
20 0.24979275 321 acl-2011-Unsupervised Discovery of Rhyme Schemes
topicId topicWeight
[(5, 0.028), (17, 0.065), (26, 0.027), (37, 0.082), (39, 0.045), (41, 0.042), (53, 0.322), (55, 0.041), (59, 0.049), (72, 0.031), (73, 0.013), (91, 0.026), (96, 0.127), (97, 0.011), (98, 0.025)]
simIndex simValue paperId paperTitle
1 0.79529923 159 acl-2011-Identifying Noun Product Features that Imply Opinions
Author: Lei Zhang ; Bing Liu
Abstract: Identifying domain-dependent opinion words is a key problem in opinion mining and has been studied by several researchers. However, existing work has been focused on adjectives and to some extent verbs. Limited work has been done on nouns and noun phrases. In our work, we used the feature-based opinion mining model, and we found that in some domains nouns and noun phrases that indicate product features may also imply opinions. In many such cases, these nouns are not subjective but objective. Their involved sentences are also objective sentences and imply positive or negative opinions. Identifying such nouns and noun phrases and their polarities is very challenging but critical for effective opinion mining in these domains. To the best of our knowledge, this problem has not been studied in the literature. This paper proposes a method to deal with the problem. Experimental results based on real-life datasets show promising results. 1
same-paper 2 0.77758551 132 acl-2011-Extracting Paraphrases from Definition Sentences on the Web
Author: Chikara Hashimoto ; Kentaro Torisawa ; Stijn De Saeger ; Jun'ichi Kazama ; Sadao Kurohashi
Abstract: We propose an automatic method of extracting paraphrases from definition sentences, which are also automatically acquired from the Web. We observe that a huge number of concepts are defined in Web documents, and that the sentences that define the same concept tend to convey mostly the same information using different expressions and thus contain many paraphrases. We show that a large number of paraphrases can be automatically extracted with high precision by regarding the sentences that define the same concept as parallel corpora. Experimental results indicated that with our method it was possible to extract about 300,000 paraphrases from 6 × 10^8 Web documents with a precision rate of about 94%.
3 0.73829687 323 acl-2011-Unsupervised Part-of-Speech Tagging with Bilingual Graph-Based Projections
Author: Dipanjan Das ; Slav Petrov
Abstract: We describe a novel approach for inducing unsupervised part-of-speech taggers for languages that have no labeled training data, but have translated text in a resource-rich language. Our method does not assume any knowledge about the target language (in particular no tagging dictionary is assumed), making it applicable to a wide array of resource-poor languages. We use graph-based label propagation for cross-lingual knowledge transfer and use the projected labels as features in an unsupervised model (BergKirkpatrick et al., 2010). Across eight European languages, our approach results in an average absolute improvement of 10.4% over a state-of-the-art baseline, and 16.7% over vanilla hidden Markov models induced with the Expectation Maximization algorithm.
4 0.714993 66 acl-2011-Chinese sentence segmentation as comma classification
Author: Nianwen Xue ; Yaqin Yang
Abstract: We describe a method for disambiguating Chinese commas that is central to Chinese sentence segmentation. Chinese sentence segmentation is viewed as the detection of loosely coordinated clauses separated by commas. Trained and tested on data derived from the Chinese Treebank, our model achieves a classification accuracy of close to 90% overall, which translates to an F1 score of 70% for detecting commas that signal sentence boundaries.
5 0.65770996 87 acl-2011-Corpus Expansion for Statistical Machine Translation with Semantic Role Label Substitution Rules
Author: Qin Gao ; Stephan Vogel
Abstract: We present an approach of expanding parallel corpora for machine translation. By utilizing Semantic role labeling (SRL) on one side of the language pair, we extract SRL substitution rules from existing parallel corpus. The rules are then used for generating new sentence pairs. An SVM classifier is built to filter the generated sentence pairs. The filtered corpus is used for training phrase-based translation models, which can be used directly in translation tasks or combined with baseline models. Experimental results on ChineseEnglish machine translation tasks show an average improvement of 0.45 BLEU and 1.22 TER points across 5 different NIST test sets.
6 0.63354707 225 acl-2011-Monolingual Alignment by Edit Rate Computation on Sentential Paraphrase Pairs
7 0.57906079 327 acl-2011-Using Bilingual Parallel Corpora for Cross-Lingual Textual Entailment
8 0.57733786 37 acl-2011-An Empirical Evaluation of Data-Driven Paraphrase Generation Techniques
9 0.56968582 72 acl-2011-Collecting Highly Parallel Data for Paraphrase Evaluation
10 0.56681156 131 acl-2011-Extracting Opinion Expressions and Their Polarities - Exploration of Pipelines and Joint Models
11 0.55183744 331 acl-2011-Using Large Monolingual and Bilingual Corpora to Improve Coordination Disambiguation
12 0.54913443 136 acl-2011-Finding Deceptive Opinion Spam by Any Stretch of the Imagination
13 0.54076028 274 acl-2011-Semi-Supervised Frame-Semantic Parsing for Unknown Predicates
14 0.5399816 45 acl-2011-Aspect Ranking: Identifying Important Product Aspects from Online Consumer Reviews
15 0.53445834 235 acl-2011-Optimal and Syntactically-Informed Decoding for Monolingual Phrase-Based Alignment
16 0.52564132 128 acl-2011-Exploring Entity Relations for Named Entity Disambiguation
17 0.51979679 162 acl-2011-Identifying the Semantic Orientation of Foreign Words
18 0.51846743 318 acl-2011-Unsupervised Bilingual Morpheme Segmentation and Alignment with Context-rich Hidden Semi-Markov Models
19 0.51424384 234 acl-2011-Optimal Head-Driven Parsing Complexity for Linear Context-Free Rewriting Systems
20 0.51071566 183 acl-2011-Joint Bilingual Sentiment Classification with Unlabeled Parallel Corpora