acl acl2013 acl2013-12 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Sean Szumlanski ; Fernando Gomez ; Valerie K. Sims
Abstract: We have elicited human quantitative judgments of semantic relatedness for 122 pairs of nouns and compiled them into a new set of relatedness norms that we call Rel-122. Judgments from individual subjects in our study exhibit high average correlation to the resulting relatedness means (r = 0.77, σ = 0.09, N = 73), although not as high as Resnik’s (1995) upper bound for expected average human correlation to similarity means (r = 0.90). This suggests that human perceptions of relatedness are less strictly constrained than perceptions of similarity and establishes a clearer expectation for what constitutes human-like performance by a computational measure of semantic relatedness. We compare the results of several WordNet-based similarity and relatedness measures to our Rel-122 norms and demonstrate the limitations of WordNet for discovering general indications of semantic relatedness. We also offer a critique of the field’s reliance upon similarity norms to evaluate relatedness measures.
Reference: text
sentIndex sentText sentNum sentScore
1 edu Abstract We have elicited human quantitative judgments of semantic relatedness for 122 pairs of nouns and compiled them into a new set of relatedness norms that we call Rel-122. [sent-8, score-1.915]
2 Judgments from individual subjects in our study exhibit high average correlation to the resulting relatedness means (r = 0. [sent-9, score-1.031]
3 09, N = 73), although not as high as Resnik’s (1995) upper bound for expected average human correlation to similarity means (r = 0. [sent-11, score-0.392]
4 This suggests that human perceptions of relatedness are less strictly constrained than perceptions of similarity and establishes a clearer expectation for what constitutes human-like performance by a computational measure of semantic relatedness. [sent-13, score-1.264]
5 We compare the results of several WordNet-based similarity and relatedness measures to our Rel-122 norms and demonstrate the limitations of WordNet for discovering general indications of semantic relatedness. [sent-14, score-1.336]
6 We also offer a critique of the field’s reliance upon similarity norms to evaluate relatedness measures. [sent-15, score-1.272]
7 1 Introduction Despite the well-established technical distinction between semantic similarity and relatedness (Agirre et al. [sent-16, score-0.839]
8 , 2009; Budanitsky and Hirst, 2006; Resnik, 1995), comparison to established similarity norms from psychology remains part of the standard evaluative procedure for assessing computational measures of semantic relatedness. [sent-17, score-0.976]
9 Because similarity is only one particular type of relatedness, comparison to similarity norms fails to give a complete view of a relatedness measure’s efficacy. [sent-18, score-1.423]
10 In keeping with Budanitsky and Hirst’s (2006) observation that “comparison with human judgments is the ideal way to evaluate a measure of similarity or relatedness,” we have undertaken the creation of a new set of relatedness norms. [sent-19, score-0.861]
11 2 Background The similarity norms of Rubenstein and Goodenough (1965; henceforth R&G;) and Miller and Charles (1991 ; henceforth M&C;) have seen ubiquitous use in evaluation of computational measures of semantic similarity and relatedness. [sent-20, score-1.045]
12 R&G; established their similarity norms by presenting subjects with 65 slips of paper, each of which contained a pair of nouns. [sent-21, score-0.977]
13 Subjects were directed to read through all 65 noun pairs, then sort the pairs “according to amount of ‘similarity of meaning. [sent-22, score-0.139]
14 ’” Subjects then assigned similarity scores to each pair on a scale of 0. [sent-23, score-0.25]
15 M&C; repeated R&G;’s study using a subset of 30 of the original word pairs, and their resulting similarity norms correlated to the R&G; norms at r = 0. [sent-27, score-1.124]
16 Resnik’s (1995) subsequent replication of M&C;’s study similarly yielded a correlation of r = 0. [sent-29, score-0.249]
17 The M&C; pairs were also included in a similarity study by Finkelstein et al. [sent-31, score-0.357]
18 , 2002) has recently emerged as a potential surrogate dataset for evaluating relatedness measures. [sent-36, score-0.596]
19 Several studies have reported correlation to WordSim353 norms as part of their evaluation procedures, with some studies explicitly referring to it as a collection of human-assigned relatedness scores (Gabrilovich and Markovitch, 2007; Hughes and Ramage, 2007; Milne and Witten, 2008). [sent-37, score-1.12]
20 ’s subjects give us pause to reconsider WordSim353’s classification as a set of relatedness norms. [sent-41, score-0.762]
21 Jarmasz and Szpakowicz (2003) have raised further methodological concerns about the construction of WordSim353, including: (a) similarity was rated on a scale of 0. [sent-43, score-0.251]
22 0 used by R&G; and M&C;, and (b) the inclusion of proper nouns introduced an element of cultural bias into the dataset (e. [sent-47, score-0.069]
23 Cognizant of the problematic conflation of similarity and relatedness in WordSim353, Agirre et al. [sent-50, score-0.85]
24 (2009) partitioned the data into two sets: one containing noun pairs exhibiting similarity, and one containing pairs of related but dissimilar nouns. [sent-51, score-0.297]
25 However, pairs in the latter set were not assessed for scoring distribution validity to ensure that strongly related word pairs were not penalized by human subjects for being dissimilar. [sent-52, score-0.426]
26 1 3 Methodology In our experiments, we elicited human ratings of semantic relatedness for 122 noun pairs. [sent-53, score-0.844]
27 In doing so, we followed the methodology of Rubenstein and Goodenough (1965) as closely as possible: participants were instructed to read through a set of noun pairs, sort them by how strongly related they were, and then assign each pair a relatedness score on a scale of 0. [sent-54, score-0.875]
28 First, instead of asking participants to judge “amount of ‘similarity of meaning,’” we asked them to judge “how closely related in meaning” each pair of nouns was. [sent-58, score-0.271]
29 Second, we used a Web interface to collect data in our study; instead of reordering a deck of cards, participants were presented with a grid of cards that they were able 1Perhaps not surprisingly, the highest scores in WordSim353 (all ratings from 9. [sent-59, score-0.274]
30 1 Experimental Conditions Each participant in our study was randomly assigned to one of four conditions. [sent-65, score-0.056]
31 Of those pairs, 10 were randomly selected from from WordNet++ (Ponzetto and Navigli, 2010) and 10 from SGN (Szumlanski and Gomez, 2010)—two semantic networks that categorically indicate strong relatedness between WordNet noun senses. [sent-67, score-0.724]
32 10 additional pairs were generated by randomly pairing words from a list of all nouns occurring in Wikipedia. [sent-68, score-0.154]
33 The nouns in the pairs we used from each of these three sources were matched for frequency of occurrence in Wikipedia. [sent-69, score-0.154]
34 We manually selected two additional pairs that appeared across all four conditions: leaves–rake and lion–cage. [sent-70, score-0.085]
35 These control pairs were included to ensure that each condition contained examples of strong semantic relatedness, and potentially to help identify and eliminate data from participants who assigned random relatedness scores. [sent-71, score-0.997]
36 Within each condition, the 32 word pairs were presented to all subjects in the same random order. [sent-72, score-0.282]
37 Across conditions, the two control pairs were always presented in the same positions in the word pair grid. [sent-73, score-0.119]
38 Each word pair was subjected to additional scrutiny before being included in our dataset. [sent-74, score-0.034]
39 The latter were eliminated to prevent subjects from latching onto superficial lexical commonalities as indicators of strong semantic relatedness without reflecting upon meaning. [sent-78, score-0.97]
40 2 Participants Participants in our study were recruited from introductory undergraduate courses in psychology and computer science at the University of Central Florida. [sent-80, score-0.167]
41 Of these, we identified 19 as outliers, and their data were excluded from our norms to prevent interference from individuals who appeared to be assigning random scores to noun pairs. [sent-86, score-0.521]
42 We considered an outlier to be any individual whose numeric ratings fell outside two standard deviations from the means for more than 10% of the word pairs they evaluated (i. [sent-87, score-0.374]
43 , at least four word pairs, since each condition contained 32 word pairs). [sent-89, score-0.089]
44 For outlier detection, means and standard de- viations were computed using leave-one-out sampling. [sent-90, score-0.144]
45 That is, data from individual J were not incorporated into means or standard deviations when considering whether to eliminate J as an outlier. [sent-91, score-0.148]
46 3 Of the 73 participants remaining after outlier elimination, there was a near-even split between males (37) and females (35), with one individual declining to provide any demographic data. [sent-92, score-0.285]
47 Most students were freshmen (49), followed in frequency by sophomores (16), seniors (4), and juniors (3). [sent-96, score-0.073]
48 Participants earned an average score of 42% on a standardized test of advanced vocabulary (σ = 16%, N = 72) (Test I V-4 from Ekstrom et al. [sent-97, score-0.03]
49 – 4 Results Each word pair in Rel-122 was evaluated by at least 20 human subjects. [sent-99, score-0.034]
50 After outlier removal (described above), each word pair retained evaluations from 14 to 22 individuals. [sent-100, score-0.131]
51 4 An excerpt of the Rel-122 norms is shown in Table 1. [sent-102, score-0.464]
52 We note that the highest rated pairs in our dataset are not strictly similar entities; exactly half ofthe 10 most strongly related nouns in Table 1are dissimilar (e. [sent-103, score-0.374]
53 Judgments from individual subjects in our study exhibited high average correlation to the elicited relatedness means (r = 0. [sent-106, score-1.125]
54 Resnik (1995), in his replication of the 3We used this sampling method to prevent extreme outliers from masking their own aberration during outlier detection, which is potentially problematic when dealing with small populations. [sent-109, score-0.304]
55 Without leave-one-out-sampling, we would have identified fewer outliers (14 instead of 19), but the resulting means would still have correlated strongly to our final relatedness norms (r = 0. [sent-110, score-1.157]
56 M&C; study, reported average individual correlation of r = 0. [sent-117, score-0.166]
57 07, N = 10) to similarity means elicited from a population of 10 graduate students and postdoctoral researchers. [sent-119, score-0.403]
58 Presumably Resnik’s subjects had advanced knowledge of what constitutes semantic similarity, as he estab- lished r = 0. [sent-120, score-0.314]
59 90 as an upper bound for expected human correlation on that task. [sent-121, score-0.129]
60 The fact that average human correlation in our study is weaker than in previous studies suggests that human perceptions of relatedness are less strictly constrained than perceptions of similarity, and that a reasonable computational measure ofrelatedness might only approach a correlation of r = 0. [sent-122, score-1.275]
61 In Table 2, we present the performance of a variety of relatedness and similarity measures on our new set of relatedness means. [sent-124, score-1.417]
62 5 Coefficients of correlation are given for Pearson’s product-moment correlation (r), as well as Spearman’s rank correlation (ρ). [sent-125, score-0.387]
63 For comparison, we include results for the correlation of these measures to the M&C; and R&G; similarity means. [sent-126, score-0.416]
64 The generally weak performance of the WordNet-based measures on this task is not surprising, given WordNet’s strong disposition toward codifying semantic similarity, which makes it an impoverished resource for discovering general semantic relatedness. [sent-127, score-0.261]
65 892 Rel-122 M&C; R&G; (*) are considered relatedness measures. [sent-131, score-0.565]
66 All measures are WordNet-based, except for the scoring metric of Szumlanski and Gomez (2010), which is based on lexical co-occurrence frequency in Wikipedia. [sent-132, score-0.071]
67 have been hampered by their reliance upon Word- Net. [sent-133, score-0.065]
68 The disparity between their performance on Rel-122 and the M&C; and R&G; norms suggests the shortcomings of using similarity norms for evaluating measures of relatedness. [sent-134, score-1.202]
69 5 (Re-)Evaluating Similarity Norms After establishing our relatedness norms, we created two additional experimental conditions in which subjects evaluated the relatedness of noun pairs from the M&C; study. [sent-135, score-1.516]
70 Each condition again had 32 noun pairs: 15 from M&C; and 17 from Rel-122. [sent-136, score-0.108]
71 Pairs from M&C; and Rel-122 were uniformly distributed between these two new conditions based on matched normative similarity or relatedness scores from their respective datasets. [sent-137, score-0.831]
72 Results from this second phase of our study are shown in Table 3. [sent-138, score-0.056]
73 The correlation of our relatedness means on this set to the similarity means of M&C; was strong (r = 0. [sent-139, score-1.051]
74 91), but not as strong as in replications of the study that asked subjects to evaluate similarity (e. [sent-140, score-0.543]
75 That the synonymous M&C; pairs garner high relatedness ratings in our study is not surprising; strong similarity is, after all, one type of strong relatedness. [sent-146, score-1.12]
76 The more interesting result from 893 our study, shown in Table 3, is that relatedness norms for pairs that are related but dissimilar (e. [sent-147, score-1.149]
77 , journey–car and forest–graveyard) deviate significantly from established similarity norms. [sent-149, score-0.285]
78 This indicates that asking subjects to evaluate “similarity” instead of “relatedness” can significantly impact the norms established in such studies. [sent-150, score-0.736]
79 6 Conclusions We have established a new set of relatedness norms, Rel-122, that is offered as a supplementary evaluative standard for assessing semantic relatedness measures. [sent-151, score-1.332]
80 We have also demonstrated the shortcomings of using similarity norms to evaluate such measures. [sent-152, score-0.674]
81 Namely, since similarity is only one type of relatedness, comparison to similarity norms fails to provide a complete view of a measure’s ability to capture more general types of relatedness. [sent-153, score-0.858]
82 This is particularly problematic when evaluating WordNet-based measures, which naturally excel at capturing similarity, given the nature of the WordNet ontology. [sent-154, score-0.1]
83 Furthermore, we have found that asking judges to evaluate “relatedness” of terms, rather than “similarity,” has a substantive impact on resulting norms, particularly with respect to the M&C; similarity dataset. [sent-155, score-0.287]
84 Correlation of individual judges’ ratings to resulting means was also significantly lower on average in our study than in previous studies that focused on similarity (e. [sent-156, score-0.429]
85 These results suggest that human perceptions of relatedness are less strictly constrained than perceptions of similarity and validate the need for new relatedness norms to supplement existing gold standard similarity norms in the evaluation of relatedness measures. [sent-159, score-3.348]
86 A study on similarity and relatedness using distributional and WordNet-based approaches. [sent-162, score-0.837]
87 Extended gloss overlaps as a measure of semantic relatedness. [sent-166, score-0.085]
88 Semantic similarity based on corpus statistics and lexical taxonomy. [sent-205, score-0.216]
89 Combining local context and WordNet similarity for word sense identification. [sent-209, score-0.216]
90 An effective, low-cost measure of semantic relatedness obtained from Wikipedia links. [sent-225, score-0.65]
91 Using WordNet-based context vectors to estimate the semantic relatedness of concepts. [sent-229, score-0.623]
92 Using information content to evaluate semantic similarity in a taxonomy. [sent-241, score-0.274]
wordName wordTfidf (topN-words)
[('relatedness', 0.565), ('norms', 0.426), ('similarity', 0.216), ('subjects', 0.197), ('perceptions', 0.141), ('correlation', 0.129), ('participants', 0.124), ('szumlanski', 0.122), ('finkelstein', 0.112), ('outlier', 0.097), ('elicited', 0.094), ('rubenstein', 0.094), ('resnik', 0.091), ('pairs', 0.085), ('gomez', 0.081), ('dissimilar', 0.073), ('ratings', 0.073), ('wordnet', 0.072), ('measures', 0.071), ('established', 0.069), ('nouns', 0.069), ('hirst', 0.066), ('replication', 0.064), ('agirre', 0.063), ('pedersen', 0.063), ('cousin', 0.061), ('ekstrom', 0.061), ('jarmasz', 0.061), ('ucf', 0.061), ('psychology', 0.061), ('eecs', 0.06), ('florida', 0.06), ('gabrilovich', 0.06), ('outliers', 0.06), ('strongly', 0.059), ('semantic', 0.058), ('budanitsky', 0.058), ('study', 0.056), ('noun', 0.054), ('condition', 0.054), ('judgments', 0.053), ('strictly', 0.053), ('conditions', 0.05), ('patwardhan', 0.05), ('cards', 0.05), ('courses', 0.05), ('valerie', 0.05), ('goodenough', 0.047), ('hughes', 0.047), ('strong', 0.047), ('means', 0.047), ('students', 0.046), ('evaluative', 0.045), ('milne', 0.045), ('sean', 0.045), ('asking', 0.044), ('evgeniy', 0.043), ('problematic', 0.042), ('ted', 0.042), ('prevent', 0.041), ('central', 0.04), ('instructed', 0.039), ('excerpt', 0.038), ('reliance', 0.038), ('individual', 0.037), ('contained', 0.035), ('deviations', 0.035), ('rated', 0.035), ('eliminated', 0.035), ('ponzetto', 0.035), ('pair', 0.034), ('constrained', 0.034), ('siddharth', 0.033), ('shortcomings', 0.032), ('banerjee', 0.032), ('graeme', 0.031), ('synonymous', 0.031), ('evaluating', 0.031), ('advanced', 0.03), ('assessing', 0.03), ('correlates', 0.029), ('henceforth', 0.029), ('eliminate', 0.029), ('constitutes', 0.029), ('ijcai', 0.029), ('judges', 0.027), ('measure', 0.027), ('commonalities', 0.027), ('conflation', 0.027), ('critique', 0.027), ('deck', 0.027), ('declining', 0.027), ('excel', 0.027), ('freshmen', 0.027), ('gadi', 0.027), ('hampered', 0.027), ('harman', 0.027), ('impoverished', 0.027), ('journey', 0.027), ('replications', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000006 12 acl-2013-A New Set of Norms for Semantic Relatedness Measures
Author: Sean Szumlanski ; Fernando Gomez ; Valerie K. Sims
Abstract: We have elicited human quantitative judgments of semantic relatedness for 122 pairs of nouns and compiled them into a new set of relatedness norms that we call Rel-122. Judgments from individual subjects in our study exhibit high average correlation to the resulting relatedness means (r = 0.77, σ = 0.09, N = 73), although not as high as Resnik’s (1995) upper bound for expected average human correlation to similarity means (r = 0.90). This suggests that human perceptions of relatedness are less strictly constrained than perceptions of similarity and establishes a clearer expectation for what constitutes human-like performance by a computational measure of semantic relatedness. We compare the results of several WordNet-based similarity and relatedness measures to our Rel-122 norms and demonstrate the limitations of WordNet for discovering general indications of semantic relatedness. We also offer a critique of the field’s reliance upon similarity norms to evaluate relatedness measures.
2 0.2216841 43 acl-2013-Align, Disambiguate and Walk: A Unified Approach for Measuring Semantic Similarity
Author: Mohammad Taher Pilehvar ; David Jurgens ; Roberto Navigli
Abstract: Semantic similarity is an essential component of many Natural Language Processing applications. However, prior methods for computing semantic similarity often operate at different levels, e.g., single words or entire documents, which requires adapting the method for each data type. We present a unified approach to semantic similarity that operates at multiple levels, all the way from comparing word senses to comparing text documents. Our method leverages a common probabilistic representation over word senses in order to compare different types of linguistic data. This unified representation shows state-ofthe-art performance on three tasks: seman- tic textual similarity, word similarity, and word sense coarsening.
3 0.15550081 262 acl-2013-Offspring from Reproduction Problems: What Replication Failure Teaches Us
Author: Antske Fokkens ; Marieke van Erp ; Marten Postma ; Ted Pedersen ; Piek Vossen ; Nuno Freire
Abstract: Repeating experiments is an important instrument in the scientific toolbox to validate previous work and build upon existing work. We present two concrete use cases involving key techniques in the NLP domain for which we show that reproducing results is still difficult. We show that the deviation that can be found in reproduction efforts leads to questions about how our results should be interpreted. Moreover, investigating these deviations provides new insights and a deeper understanding of the examined techniques. We identify five aspects that can influence the outcomes of experiments that are typically not addressed in research papers. Our use cases show that these aspects may change the answer to research questions leading us to conclude that more care should be taken in interpreting our results and more research involving systematic testing of methods is required in our field.
4 0.13977794 104 acl-2013-DKPro Similarity: An Open Source Framework for Text Similarity
Author: Daniel Bar ; Torsten Zesch ; Iryna Gurevych
Abstract: We present DKPro Similarity, an open source framework for text similarity. Our goal is to provide a comprehensive repository of text similarity measures which are implemented using standardized interfaces. DKPro Similarity comprises a wide variety of measures ranging from ones based on simple n-grams and common subsequences to high-dimensional vector comparisons and structural, stylistic, and phonetic measures. In order to promote the reproducibility of experimental results and to provide reliable, permanent experimental conditions for future studies, DKPro Similarity additionally comes with a set of full-featured experimental setups which can be run out-of-the-box and be used for future systems to built upon.
5 0.12093594 304 acl-2013-SEMILAR: The Semantic Similarity Toolkit
Author: Vasile Rus ; Mihai Lintean ; Rajendra Banjade ; Nobal Niraula ; Dan Stefanescu
Abstract: We present in this paper SEMILAR, the SEMantic simILARity toolkit. SEMILAR implements a number of algorithms for assessing the semantic similarity between two texts. It is available as a Java library and as a Java standalone ap-plication offering GUI-based access to the implemented semantic similarity methods. Furthermore, it offers facilities for manual se-mantic similarity annotation by experts through its component SEMILAT (a SEMantic simILarity Annotation Tool). 1
6 0.11877895 31 acl-2013-A corpus-based evaluation method for Distributional Semantic Models
7 0.10880362 291 acl-2013-Question Answering Using Enhanced Lexical Semantic Models
8 0.10670208 96 acl-2013-Creating Similarity: Lateral Thinking for Vertical Similarity Judgments
9 0.10504702 185 acl-2013-Identifying Bad Semantic Neighbors for Improving Distributional Thesauri
10 0.099153794 258 acl-2013-Neighbors Help: Bilingual Unsupervised WSD Using Context
11 0.098674968 93 acl-2013-Context Vector Disambiguation for Bilingual Lexicon Extraction from Comparable Corpora
12 0.098487191 249 acl-2013-Models of Semantic Representation with Visual Attributes
13 0.082741246 242 acl-2013-Mining Equivalent Relations from Linked Data
14 0.07554242 58 acl-2013-Automated Collocation Suggestion for Japanese Second Language Learners
15 0.070474394 306 acl-2013-SPred: Large-scale Harvesting of Semantic Predicates
16 0.069629207 113 acl-2013-Derivational Smoothing for Syntactic Distributional Semantics
17 0.068770058 32 acl-2013-A relatedness benchmark to test the role of determiners in compositional distributional semantics
18 0.068320222 111 acl-2013-Density Maximization in Context-Sense Metric Space for All-words WSD
19 0.065096021 87 acl-2013-Compositional-ly Derived Representations of Morphologically Complex Words in Distributional Semantics
20 0.06416212 27 acl-2013-A Two Level Model for Context Sensitive Inference Rules
topicId topicWeight
[(0, 0.146), (1, 0.068), (2, 0.049), (3, -0.162), (4, -0.046), (5, -0.131), (6, -0.1), (7, 0.042), (8, 0.011), (9, -0.004), (10, -0.051), (11, 0.072), (12, 0.002), (13, -0.048), (14, 0.098), (15, 0.067), (16, 0.016), (17, 0.014), (18, -0.039), (19, -0.016), (20, 0.034), (21, 0.008), (22, 0.025), (23, -0.052), (24, -0.084), (25, 0.115), (26, 0.052), (27, 0.053), (28, -0.053), (29, -0.1), (30, -0.033), (31, -0.054), (32, 0.059), (33, -0.056), (34, 0.023), (35, 0.011), (36, 0.039), (37, 0.066), (38, -0.054), (39, 0.016), (40, 0.04), (41, 0.022), (42, -0.076), (43, -0.012), (44, -0.031), (45, -0.081), (46, -0.086), (47, -0.106), (48, -0.037), (49, -0.003)]
simIndex simValue paperId paperTitle
same-paper 1 0.96912527 12 acl-2013-A New Set of Norms for Semantic Relatedness Measures
Author: Sean Szumlanski ; Fernando Gomez ; Valerie K. Sims
Abstract: We have elicited human quantitative judgments of semantic relatedness for 122 pairs of nouns and compiled them into a new set of relatedness norms that we call Rel-122. Judgments from individual subjects in our study exhibit high average correlation to the resulting relatedness means (r = 0.77, σ = 0.09, N = 73), although not as high as Resnik’s (1995) upper bound for expected average human correlation to similarity means (r = 0.90). This suggests that human perceptions of relatedness are less strictly constrained than perceptions of similarity and establishes a clearer expectation for what constitutes human-like performance by a computational measure of semantic relatedness. We compare the results of several WordNet-based similarity and relatedness measures to our Rel-122 norms and demonstrate the limitations of WordNet for discovering general indications of semantic relatedness. We also offer a critique of the field’s reliance upon similarity norms to evaluate relatedness measures.
2 0.83476353 104 acl-2013-DKPro Similarity: An Open Source Framework for Text Similarity
Author: Daniel Bar ; Torsten Zesch ; Iryna Gurevych
Abstract: We present DKPro Similarity, an open source framework for text similarity. Our goal is to provide a comprehensive repository of text similarity measures which are implemented using standardized interfaces. DKPro Similarity comprises a wide variety of measures ranging from ones based on simple n-grams and common subsequences to high-dimensional vector comparisons and structural, stylistic, and phonetic measures. In order to promote the reproducibility of experimental results and to provide reliable, permanent experimental conditions for future studies, DKPro Similarity additionally comes with a set of full-featured experimental setups which can be run out-of-the-box and be used for future systems to built upon.
3 0.8271758 96 acl-2013-Creating Similarity: Lateral Thinking for Vertical Similarity Judgments
Author: Tony Veale ; Guofu Li
Abstract: Just as observing is more than just seeing, comparing is far more than mere matching. It takes understanding, and even inventiveness, to discern a useful basis for judging two ideas as similar in a particular context, especially when our perspective is shaped by an act of linguistic creativity such as metaphor, simile or analogy. Structured resources such as WordNet offer a convenient hierarchical means for converging on a common ground for comparison, but offer little support for the divergent thinking that is needed to creatively view one concept as another. We describe such a means here, by showing how the web can be used to harvest many divergent views for many familiar ideas. These lateral views complement the vertical views of WordNet, and support a system for idea exploration called Thesaurus Rex. We show also how Thesaurus Rex supports a novel, generative similarity measure for WordNet. 1 Seeing is Believing (and Creating) Similarity is a cognitive phenomenon that is both complex and subjective, yet for practical reasons it is often modeled as if it were simple and objective. This makes sense for the many situations where we want to align our similarity judgments with those of others, and thus focus on the same conventional properties that others are also likely to focus upon. This reliance on the consensus viewpoint explains why WordNet (Fellbaum, 1998) has proven so useful as a basis for computational measures of lexico-semantic similarity Guofu Li School of Computer Science and Informatics, University College Dublin, Belfield, Dublin D2, Ireland. l .guo fu . l gmai l i @ .com (e.g. see Pederson et al. 2004, Budanitsky & Hirst, 2006; Seco et al. 2006). These measures reduce the similarity of two lexical concepts to a single number, by viewing similarity as an objective estimate of the overlap in their salient qualities. This convenient perspective is poorly suited to creative or insightful comparisons, but it is sufficient for the many mundane comparisons we often perform in daily life, such as when we organize books or look for items in a supermarket. So if we do not know in which aisle to locate a given item (such as oatmeal), we may tacitly know how to locate a similar product (such as cornflakes) and orient ourselves accordingly. Yet there are occasions when the recognition of similarities spurs the creation of similarities, when the act of comparison spurs us to invent new ways of looking at an idea. By placing pop tarts in the breakfast aisle, food manufacturers encourage us to view them as a breakfast food that is not dissimilar to oatmeal or cornflakes. When ex-PM Tony Blair published his memoirs, a mischievous activist encouraged others to move his book from Biography to Fiction in bookshops, in the hope that buyers would see it in a new light. Whenever we use a novel metaphor to convey a non-obvious viewpoint on a topic, such as “cigarettes are time bombs”, the comparison may spur us to insight, to see aspects of the topic that make it more similar to the vehicle (see Ortony, 1979; Veale & Hao, 2007). In formal terms, assume agent A has an insight about concept X, and uses the metaphor X is a Y to also provoke this insight in agent B. To arrive at this insight for itself, B must intuit what X and Y have in common. But this commonality is surely more than a standard categorization of X, or else it would not count as an insight about X. To understand the metaphor, B must place X 660 Proce dingSsof oifa, th Beu 5l1gsarti Aan,An u aglu Mste 4e-ti9n2g 0 o1f3 t.he ?c A2s0s1o3ci Aatsiosonc fioartio Cno fmorpu Ctoamtiopnuatalt Lioin gauli Lsitnicgsu,i psatgices 6 0–670, in a new category, so that X can be seen as more similar to Y. Metaphors shape the way we per- ceive the world by re-shaping the way we make similarity judgments. So if we want to imbue computers with the ability to make and to understand creative metaphors, we must first give them the ability to look beyond the narrow viewpoints of conventional resources. Any measure that models similarity as an objective function of a conventional worldview employs a convergent thought process. Using WordNet, for instance, a similarity measure can vertically converge on a common superordinate category of both inputs, and generate a single numeric result based on their distance to, and the information content of, this common generalization. So to find the most conventional ways of seeing a lexical concept, one simply ascends a narrowing concept hierarchy, using a process de Bono (1970) calls vertical thinking. To find novel, non-obvious and useful ways of looking at a lexical concept, one must use what Guilford (1967) calls divergent thinking and what de Bono calls lateral thinking. These processes cut across familiar category boundaries, to simultaneously place a concept in many different categories so that we can see it in many different ways. de Bono argues that vertical thinking is selective while lateral thinking is generative. Whereas vertical thinking concerns itself with the “right” way or a single “best” way of looking at things, lateral thinking focuses on producing alternatives to the status quo. To be as useful for creative tasks as they are for conventional tasks, we need to re-imagine our computational similarity measures as generative rather than selective, expansive rather than reductive, divergent as well as convergent and lateral as well as vertical. Though WordNet is ideally structured to support vertical, convergent reasoning, its comprehensive nature means it can also be used as a solid foundation for building a more lateral and divergent model of similarity. Here we will use the web as a source of diverse perspectives on familiar ideas, to complement the conventional and often narrow views codified by WordNet. Section 2 provides a brief overview of past work in the area of similarity measurement, before section 3 describes a simple bootstrapping loop for acquiring richly diverse perspectives from the web for a wide variety of familiar ideas. These perspectives are used to enhance a Word- Net-based measure of lexico-semantic similarity in section 4, by broadening the range of informative viewpoints the measure can select from. Similarity is thus modeled as a process that is both generative and selective. This lateral-andvertical approach is evaluated in section 5, on the Miller & Charles (1991) data-set. A web app for the lateral exploration of diverse viewpoints, named Thesaurus Rex, is also presented, before closing remarks are offered in section 6. 2 Related Work and Ideas WordNet’s taxonomic organization of nounsenses and verb-senses – in which very general categories are successively divided into increasingly informative sub-categories or instancelevel ideas – allows us to gauge the overlap in information content, and thus of meaning, of two lexical concepts. We need only identify the deepest point in the taxonomy at which this content starts to diverge. This point of divergence is often called the LCS, or least common subsumer, of two concepts (Pederson et al., 2004). Since sub-categories add new properties to those they inherit from their parents – Aristotle called these properties the differentia that stop a category system from trivially collapsing into itself – the depth of a lexical concept in a taxonomy is an intuitive proxy for its information content. Wu & Palmer (1994) use the depth of a lexical concept in the WordNet hierarchy as such a proxy, and thereby estimate the similarity of two lexical concepts as twice the depth of their LCS divided by the sum of their individual depths. Leacock and Chodorow (1998) instead use the length of the shortest path between two concepts as a proxy for the conceptual distance between them. To connect any two ideas in a hierarchical system, one must vertically ascend the hierarchy from one concept, change direction at a potential LCS, and then descend the hierarchy to reach the second concept. (Aristotle was also first to suggest this approach in his Poetics). Leacock and Chodorow normalize the length of this path by dividing its size (in nodes) by twice the depth of the deepest concept in the hierarchy; the latter is an upper bound on the distance between any two concepts in the hierarchy. Negating the log of this normalized length yields a corresponding similarity score. While the role of an LCS is merely implied in Leacock and Chodorow’s use of a shortest path, the LCS is pivotal nonetheless, and like that of Wu & Palmer, the approach uses an essentially vertical reasoning process to identify a single “best” generalization. Depth is a convenient proxy for information content, but more nuanced proxies can yield 661 more rounded similarity measures. Resnick (1995) draws on information theory to define the information content of a lexical concept as the negative log likelihood of its occurrence in a corpus, either explicitly (via a direct mention) or by presupposition (via a mention of any of its sub-categories or instances). Since the likelihood of a general category occurring in a corpus is higher than that of any of its sub-categories or instances, such categories are more predictable, and less informative, than rarer categories whose occurrences are less predictable and thus more informative. The negative log likelihood of the most informative LCS of two lexical concepts offers a reliable estimate of the amount of infor- mation shared by those concepts, and thus a good estimate of their similarity. Lin (1998) combines the intuitions behind Resnick’s metric and that of Wu and Palmer to estimate the similarity of two lexical concepts as an information ratio: twice the information content of their LCS divided by the sum of their individual information contents. Jiang and Conrath (1997) consider the converse notion of dissimilarity, noting that two lexical concepts are dissimilar to the extent that each contains information that is not shared by the other. So if the information content of their most informative LCS is a good measure of what they do share, then the sum of their individual information contents, minus twice the content of their most informative LCS, is a reliable estimate of their dissimilarity. Seco et al. (2006) presents a minor innovation, showing how Resnick’s notion of information content can be calculated without the use of an external corpus. Rather, when using Resnick’s metric (or that of Lin, or Jiang and Conrath) for measuring the similarity of lexical concepts in WordNet, one can use the category structure of WordNet itself to estimate infor- mation content. Typically, the more general a concept, the more descendants it will possess. Seco et al. thus estimate the information content of a lexical concept as the log of the sum of all its unique descendants (both direct and indirect), divided by the log of the total number of concepts in the entire hierarchy. Not only is this intrinsic view of information content convenient to use, without recourse to an external corpus, Seco et al. show that it offers a better estimate of information content than its extrinsic, corpus-based alternatives, as measured relative to average human similarity ratings for the 30 word-pairs in the Miller & Charles (1991) test set. A similarity measure can draw on other sources of information besides WordNet’s category structures. One might eke out additional information from WordNet’s textual glosses, as in Lesk (1986), or use category structures other than those offered by WordNet. Looking beyond WordNet, entries in the online encyclopedia Wikipedia are not only connected by a dense topology of lateral links, they are also organized by a rich hierarchy of overlapping categories. Strube and Ponzetto (2006) show how Wikipedia can support a measure of similarity (and relatedness) that better approximates human judgments than many WordNet-based measures. Nonetheless, WordNet can be a valuable component of a hybrid measure, and Agirre et al. (2009) use an SVM (support vector machine) to combine information from WordNet with information harvested from the web. Their best similarity measure achieves a remarkable 0.93 correlation with human judgments on the Miller & Charles word-pair set. Similarity is not always applied to pairs of concepts; it is sometimes analogically applied to pairs of pairs of concepts, as in proportional analogies of the form A is to B as C is to D (e.g., hacks are to writers as mercenaries are to soldiers, or chisels are to sculptors as scalpels are to surgeons). In such analogies, one is really assessing the similarity of the unstated relationship between each pair of concepts: thus, mercenaries are soldiers whose allegiance is paid for, much as hacks are writers with income-driven loyalties; sculptors use chisels to carve stone, while surgeons use scalpels to cut or carve flesh. Veale (2004) used WordNet to assess the similarity of A:B to C:D as a function of the combined similarity of A to C and of B to D. In contrast, Turney (2005) used the web to pursue a more divergent course, to represent the tacit relationships of A to B and of C to D as points in a highdimensional space. The dimensions of this space initially correspond to linking phrases on the web, before these dimensions are significantly reduced using singular value decomposition. In the infamous SAT test, an analogy A:B::C:D has four other pairs of concepts that serve as likely distractors (e.g. singer:songwriter for hack:writer) and the goal is to choose the most appropriate C:D pair for a given A:B pairing. Using variants of Wu and Palmer (1994) on the 374 SAT analogies of Turney (2005), Veale (2004) reports a success rate of 38–44% using only WordNet-based similarity. In contrast, Turney (2005) reports up to 55% success on the same analogies, partly because his approach aims 662 to match implicit relations rather than explicit concepts, and in part because it uses a divergent process to gather from the web as rich a perspec- tive as it can on these latent relationships. 2.1 Clever Comparisons Create Similarity Each of these approaches to similarity is a user of information, rather than a creator, and each fails to capture how a creative comparison (such as a metaphor) can spur a listener to view a topic from an atypical perspective. Camac & Glucksberg (1984) provide experimental evidence for the claim that “metaphors do not use preexisting associations to achieve their effects [… ] people use metaphors to create new relations between concepts.” They also offer a salutary reminder of an often overlooked fact: every comparison exploits information, but each is also a source of new information in its own right. Thus, “this cola is acid” reveals a different perspective on cola (e.g. as a corrosive substance or an irritating food) than “this acid is cola” highlights for acid (such as e.g., a familiar substance) Veale & Keane (1994) model the role of similarity in realizing the long-term perlocutionary effect of an informative comparison. For example, to compare surgeons to butchers is to encourage one to see all surgeons as more bloody, … crude or careless. The reverse comparison, of butchers to surgeons, encourages one to see butchers as more skilled and precise. Veale & Keane present a network model of memory, called Sapper, in which activation can spread between related concepts, thus allowing one concept to prime the properties of a neighbor. To interpret an analogy, Sapper lays down new activation-carrying bridges in memory between analogical counterparts, such as between surgeon & butcher, flesh & meat, and scalpel & cleaver. Comparisons can thus have lasting effects on how Sapper sees the world, changing the pattern of activation that arises when it primes a concept. Veale (2003) adopts a similarly dynamic view of similarity in WordNet, showing how an analogical comparison can result in the automatic addition of new categories and relations to WordNet itself. Veale considers the problem of finding an analogical mapping between different parts of WordNet’s noun-sense hierarchy, such as between instances of Greek god and Norse god, or between the letters of different alphabets, such as of Greek and Hebrew. But no structural similarity measure for WordNet exhibits enough discernment to e.g. assign a higher similarity to Zeus & Odin (each is the supreme deity of its pantheon) than to a pairing of Zeus and any other Norse god, just as no structural measure will assign a higher similarity to Alpha & Aleph or to Beta & Beth than to any random letter pairing. A fine-grained category hierarchy permits fine-grained similarity judgments, and though WordNet is useful, its sense hierarchies are not especially fine-grained. However, we can automatically make WordNet subtler and more discerning, by adding new fine-grained categories to unite lexical concepts whose similarity is not reflected by any existing categories. Veale (2003) shows how a property that is found in the glosses of two lexical concepts, of the same depth, can be combined with their LCS to yield a new fine-grained parent category, so e.g. “supreme” + deity = Supreme-deity (for Odin, Zeus, Jupiter, etc.) and “1 st” + letter = 1st-letter (for Alpha, Aleph, etc.) Selected aspects of the textual similarity of two WordNet glosses – the key to similarity in Lesk (1986) – can thus be reified into an explicitly categorical WordNet form. 3 Divergent (Re)Categorization To tap into a richer source of concept properties than WordNet’s glosses, we can use web ngrams. Consider these descriptions of a cowboy from the Google n-grams (Brants & Franz, 2006). The numbers to the right are Google frequency counts. a lonesome cowboy 432 a mounted cowboy 122 a grizzled cowboy 74 a swaggering cowboy 68 To find the stable properties that can underpin a meaningful fine-grained category for cowboy, we must seek out the properties that are so often presupposed to be salient of all cowboys that one can use them to anchor a simile, such as
4 0.79447365 304 acl-2013-SEMILAR: The Semantic Similarity Toolkit
Author: Vasile Rus ; Mihai Lintean ; Rajendra Banjade ; Nobal Niraula ; Dan Stefanescu
Abstract: We present in this paper SEMILAR, the SEMantic simILARity toolkit. SEMILAR implements a number of algorithms for assessing the semantic similarity between two texts. It is available as a Java library and as a Java standalone ap-plication offering GUI-based access to the implemented semantic similarity methods. Furthermore, it offers facilities for manual se-mantic similarity annotation by experts through its component SEMILAT (a SEMantic simILarity Annotation Tool). 1
5 0.77242023 262 acl-2013-Offspring from Reproduction Problems: What Replication Failure Teaches Us
Author: Antske Fokkens ; Marieke van Erp ; Marten Postma ; Ted Pedersen ; Piek Vossen ; Nuno Freire
Abstract: Repeating experiments is an important instrument in the scientific toolbox to validate previous work and build upon existing work. We present two concrete use cases involving key techniques in the NLP domain for which we show that reproducing results is still difficult. We show that the deviation that can be found in reproduction efforts leads to questions about how our results should be interpreted. Moreover, investigating these deviations provides new insights and a deeper understanding of the examined techniques. We identify five aspects that can influence the outcomes of experiments that are typically not addressed in research papers. Our use cases show that these aspects may change the answer to research questions leading us to conclude that more care should be taken in interpreting our results and more research involving systematic testing of methods is required in our field.
6 0.7613191 43 acl-2013-Align, Disambiguate and Walk: A Unified Approach for Measuring Semantic Similarity
7 0.7462759 185 acl-2013-Identifying Bad Semantic Neighbors for Improving Distributional Thesauri
8 0.71368533 31 acl-2013-A corpus-based evaluation method for Distributional Semantic Models
9 0.61154592 238 acl-2013-Measuring semantic content in distributional vectors
10 0.55267787 93 acl-2013-Context Vector Disambiguation for Bilingual Lexicon Extraction from Comparable Corpora
11 0.53866661 76 acl-2013-Building and Evaluating a Distributional Memory for Croatian
12 0.52898782 234 acl-2013-Linking and Extending an Open Multilingual Wordnet
13 0.52828312 111 acl-2013-Density Maximization in Context-Sense Metric Space for All-words WSD
14 0.52732718 242 acl-2013-Mining Equivalent Relations from Linked Data
15 0.5233385 113 acl-2013-Derivational Smoothing for Syntactic Distributional Semantics
16 0.51706415 222 acl-2013-Learning Semantic Textual Similarity with Structural Representations
17 0.51049876 198 acl-2013-IndoNet: A Multilingual Lexical Knowledge Network for Indian Languages
18 0.50335652 58 acl-2013-Automated Collocation Suggestion for Japanese Second Language Learners
19 0.49127495 89 acl-2013-Computerized Analysis of a Verbal Fluency Test
20 0.4817085 116 acl-2013-Detecting Metaphor by Contextual Analogy
topicId topicWeight
[(0, 0.552), (6, 0.017), (11, 0.056), (15, 0.03), (24, 0.031), (26, 0.017), (35, 0.06), (42, 0.023), (48, 0.033), (70, 0.014), (88, 0.021), (90, 0.034), (95, 0.035)]
simIndex simValue paperId paperTitle
1 0.97212684 104 acl-2013-DKPro Similarity: An Open Source Framework for Text Similarity
Author: Daniel Bar ; Torsten Zesch ; Iryna Gurevych
Abstract: We present DKPro Similarity, an open source framework for text similarity. Our goal is to provide a comprehensive repository of text similarity measures which are implemented using standardized interfaces. DKPro Similarity comprises a wide variety of measures ranging from ones based on simple n-grams and common subsequences to high-dimensional vector comparisons and structural, stylistic, and phonetic measures. In order to promote the reproducibility of experimental results and to provide reliable, permanent experimental conditions for future studies, DKPro Similarity additionally comes with a set of full-featured experimental setups which can be run out-of-the-box and be used for future systems to built upon.
same-paper 2 0.96437967 12 acl-2013-A New Set of Norms for Semantic Relatedness Measures
Author: Sean Szumlanski ; Fernando Gomez ; Valerie K. Sims
Abstract: We have elicited human quantitative judgments of semantic relatedness for 122 pairs of nouns and compiled them into a new set of relatedness norms that we call Rel-122. Judgments from individual subjects in our study exhibit high average correlation to the resulting relatedness means (r = 0.77, σ = 0.09, N = 73), although not as high as Resnik’s (1995) upper bound for expected average human correlation to similarity means (r = 0.90). This suggests that human perceptions of relatedness are less strictly constrained than perceptions of similarity and establishes a clearer expectation for what constitutes human-like performance by a computational measure of semantic relatedness. We compare the results of several WordNet-based similarity and relatedness measures to our Rel-122 norms and demonstrate the limitations of WordNet for discovering general indications of semantic relatedness. We also offer a critique of the field’s reliance upon similarity norms to evaluate relatedness measures.
3 0.95986319 269 acl-2013-PLIS: a Probabilistic Lexical Inference System
Author: Eyal Shnarch ; Erel Segal-haLevi ; Jacob Goldberger ; Ido Dagan
Abstract: This paper presents PLIS, an open source Probabilistic Lexical Inference System which combines two functionalities: (i) a tool for integrating lexical inference knowledge from diverse resources, and (ii) a framework for scoring textual inferences based on the integrated knowledge. We provide PLIS with two probabilistic implementation of this framework. PLIS is available for download and developers of text processing applications can use it as an off-the-shelf component for injecting lexical knowledge into their applications. PLIS is easily configurable, components can be extended or replaced with user generated ones to enable system customization and further research. PLIS includes an online interactive viewer, which is a powerful tool for investigating lexical inference processes. 1 Introduction and background Semantic Inference is the process by which machines perform reasoning over natural language texts. A semantic inference system is expected to be able to infer the meaning of one text from the meaning of another, identify parts of texts which convey a target meaning, and manipulate text units in order to deduce new meanings. Semantic inference is needed for many Natural Language Processing (NLP) applications. For instance, a Question Answering (QA) system may encounter the following question and candidate answer (Example 1): Q: which explorer discovered the New World? A: Christopher Columbus revealed America. As there are no overlapping words between the two sentences, to identify that A holds an answer for Q, background world knowledge is needed to link Christopher Columbus with explorer and America with New World. Linguistic knowledge is also needed to identify that reveal and discover refer to the same concept. Knowledge is needed in order to bridge the gap between text fragments, which may be dissimilar on their surface form but share a common meaning. For the purpose of semantic inference, such knowledge can be derived from various resources (e.g. WordNet (Fellbaum, 1998) and others, detailed in Section 2.1) in a form which we denote as inference links (often called inference/entailment rules), each is an ordered pair of elements in which the first implies the meaning of the second. For instance, the link ship→vessel can be derived from tshtaen hypernym rkel sahtiiopn→ ovfe Wsseolr cdNanet b. Other applications can benefit from utilizing inference links to identify similarity between language expressions. In Information Retrieval, the user’s information need may be expressed in relevant documents differently than it is expressed in the query. Summarization systems should identify text snippets which convey the same meaning. Our work addresses a generic, application in- dependent, setting of lexical inference. We therefore adopt the terminology of Textual Entailment (Dagan et al., 2006), a generic paradigm for applied semantic inference which captures inference needs of many NLP applications in a common underlying task: given two textual fragments, termed hypothesis (H) and text (T), the task is to recognize whether T implies the meaning of H, denoted T→H. For instance, in a QA application, H reprTe→seHnts. Fthoer question, a innd a T Q a c aanpdpilidcaattei answer. pInthis setting, T is likely to hold an answer for the question if it entails the question. It is challenging to properly extract the needed inference knowledge from available resources, and to effectively utilize it within the inference process. The integration of resources, each has its own format, is technically complex and the quality 97 ProceedingSsof oiaf, th Beu 5lg1asrtia A,n Anuuaglu Mst 4ee-9tin 2g0 o1f3. th ?ec A20ss1o3ci Aastisoonci faotrio Cno fomrp Cuotamtipountaalti Loinnaglu Lisitnigcsu,is patigcess 97–102, Figure 1: PLIS schema - a text-hypothesis pair is processed by the Lexical Integrator which uses a set of lexical resources to extract inference chains which connect the two. The Lexical Inference component provides probability estimations for the validity of each level of the process. ofthe resulting inference links is often unknown in advance and varies considerably. For coping with this challenge we developed PLIS, a Probabilistic Lexical Inference System1 . PLIS, illustrated in Fig 1, has two main modules: the Lexical Integra- tor (Section 2) accepts a set of lexical resources and a text-hypothesis pair, and finds all the lexical inference relations between any pair of text term ti and hypothesis term hj, based on the available lexical relations found in the resources (and their combination). The Lexical Inference module (Section 3) provides validity scores for these relations. These term-level scores are used to estimate the sentence-level likelihood that the meaning of the hypothesis can be inferred from the text, thus making PLIS a complete lexical inference system. Lexical inference systems do not look into the structure of texts but rather consider them as bag ofterms (words or multi-word expressions). These systems are easy to implement, fast to run, practical across different genres and languages, while maintaining a competitive level of performance. PLIS can be used as a stand-alone efficient inference system or as the lexical component of any NLP application. PLIS is a flexible system, allowing users to choose the set of knowledge resources as well as the model by which inference 1The complete software package is available at http:// www.cs.biu.ac.il/nlp/downloads/PLIS.html and an online interactive viewer is available for examination at http://irsrv2. cs.biu.ac.il/nlp-net/PLIS.html. is done. PLIS can be easily extended with new knowledge resources and new inference models. It comes with a set of ready-to-use plug-ins for many common lexical resources (Section 2.1) as well as two implementation of the scoring framework. These implementations, described in (Shnarch et al., 2011; Shnarch et al., 2012), provide probability estimations for inference. PLIS has an interactive online viewer (Section 4) which provides a visualization of the entire inference process, and is very helpful for analysing lexical inference models and lexical resources usability. 2 Lexical integrator The input for the lexical integrator is a set of lexical resources and a pair of text T and hypothesis H. The lexical integrator extracts lexical inference links from the various lexical resources to connect each text term ti ∈ T with each hypothesis term hj ∈ H2. A lexical i∈nfTer wenicthe elianckh hinydpicoathteess a semantic∈ rHelation between two terms. It could be a directional relation (Columbus→navigator) or a bai ddiirreeccttiioonnaall one (car ←→ automobile). dSirinecceti knowledge resources vary lien) their representation methods, the lexical integrator wraps each lexical resource in a common plug-in interface which encapsulates resource’s inner representation method and exposes its knowledge as a list of inference links. The implemented plug-ins that come with PLIS are described in Section 2.1. Adding a new lexical resource and integrating it with the others only demands the implementation of the plug-in interface. As the knowledge needed to connect a pair of terms, ti and hj, may be scattered across few resources, the lexical integrator combines inference links into lexical inference chains to deduce new pieces of knowledge, such as Columbus −r −e −so −u −rc −e →2 −r −e −so −u −rc −e →1 navigator explorer. Therefore, the only assumption −t −he − l−e −x →ica elx integrator makes, regarding its input lexical resources, is that the inferential lexical relations they provide are transitive. The lexical integrator generates lexical infer- ence chains by expanding the text and hypothesis terms with inference links. These links lead to new terms (e.g. navigator in the above chain example and t0 in Fig 1) which can be further expanded, as all inference links are transitive. A transitivity 2Where iand j run from 1 to the length of the text and hypothesis respectively. 98 limit is set by the user to determine the maximal length for inference chains. The lexical integrator uses a graph-based representation for the inference chains, as illustrates in Fig 1. A node holds the lemma, part-of-speech and sense of a single term. The sense is the ordinal number of WordNet sense. Whenever we do not know the sense of a term we implement the most frequent sense heuristic.3 An edge represents an inference link and is labeled with the semantic relation of this link (e.g. cytokine→protein is larbeellaetdio wni othf tt hheis sW linokrd (Nee.gt .re clayttiookni hypernym). 2.1 Available plug-ins for lexical resources We have implemented plug-ins for the follow- ing resources: the English lexicon WordNet (Fellbaum, 1998)(based on either JWI, JWNL or extJWNL java APIs4), CatVar (Habash and Dorr, 2003), a categorial variations database, Wikipedia-based resource (Shnarch et al., 2009), which applies several extraction methods to derive inference links from the text and structure of Wikipedia, VerbOcean (Chklovski and Pantel, 2004), a knowledge base of fine-grained semantic relations between verbs, Lin’s distributional similarity thesaurus (Lin, 1998), and DIRECT (Kotlerman et al., 2010), a directional distributional similarity thesaurus geared for lexical inference. To summarize, the lexical integrator finds all possible inference chains (of a predefined length), resulting from any combination of inference links extracted from lexical resources, which link any t, h pair of a given text-hypothesis. Developers can use this tool to save the hassle of interfacing with the different lexical knowledge resources, and spare the labor of combining their knowledge via inference chains. The lexical inference model, described next, provides a mean to decide whether a given hypothesis is inferred from a given text, based on weighing the lexical inference chains extracted by the lexical integrator. 3 Lexical inference There are many ways to implement an inference model which identifies inference relations between texts. A simple model may consider the 3This disambiguation policy was better than considering all senses of an ambiguous term in preliminary experiments. However, it is a matter of changing a variable in the configuration of PLIS to switch between these two policies. 4http://wordnet.princeton.edu/wordnet/related-projects/ number of hypothesis terms for which inference chains, originated from text terms, were found. In PLIS, the inference model is a plug-in, similar to the lexical knowledge resources, and can be easily replaced to change the inference logic. We provide PLIS with two implemented baseline lexical inference models which are mathematically based. These are two Probabilistic Lexical Models (PLMs), HN-PLM and M-PLM which are described in (Shnarch et al., 2011; Shnarch et al., 2012) respectively. A PLM provides probability estimations for the three parts of the inference process (as shown in Fig 1): the validity probability of each inference chain (i.e. the probability for a valid inference relation between its endpoint terms) P(ti → hj), the probability of each hypothesis term to →b e i hnferred by the entire text P(T → hj) (term-level probability), eanntdir teh tee probability o hf the entire hypothesis to be inferred by the text P(T → H) (sentencelteov eble probability). HN-PLM describes a generative process by which the hypothesis is generated from the text. Its parameters are the reliability level of each of the resources it utilizes (that is, the prior probability that applying an arbitrary inference link derived from each resource corresponds to a valid inference). For learning these parameters HN-PLM applies a schema of the EM algorithm (Dempster et al., 1977). Its performance on the recognizing textual entailment task, RTE (Bentivogli et al., 2009; Bentivogli et al., 2010), are in line with the state of the art inference systems, including complex systems which perform syntactic analysis. This model is improved by M-PLM, which deduces sentence-level probability from term-level probabilities by a Markovian process. PLIS with this model was used for a passage retrieval for a question answering task (Wang et al., 2007), and outperformed state of the art inference systems. Both PLMs model the following prominent aspects of the lexical inference phenomenon: (i) considering the different reliability levels of the input knowledge resources, (ii) reducing inference chain probability as its length increases, and (iii) increasing term-level probability as we have more inference chains which suggest that the hypothesis term is inferred by the text. Both PLMs only need sentence-level annotations from which they derive term-level inference probabilities. To summarize, the lexical inference module 99 ?(? → ?) Figure 2: PLIS interactive viewer with Example 1 demonstrates knowledge integration of multiple inference chains and resource combination (additional explanations, which are not part of the demo, are provided in orange). provides the setting for interfacing with the lexical integrator. Additionally, the module provides the framework for probabilistic inference models which estimate term-level probabilities and integrate them into a sentence-level inference decision, while implementing prominent aspects of lexical inference. The user can choose to apply another inference logic, not necessarily probabilistic, by plugging a different lexical inference model into the provided inference infrastructure. 4 The PLIS interactive system PLIS comes with an online interactive viewer5 in which the user sets the parameters of PLIS, inserts a text-hypothesis pair and gets a visualization of the entire inference process. This is a powerful tool for investigating knowledge integration and lexical inference models. Fig 2 presents a screenshot of the processing of Example 1. On the right side, the user configures the system by selecting knowledge resources, adjusting their configuration, setting the transitivity limit, and choosing the lexical inference model to be applied by PLIS. After inserting a text and a hypothesis to the appropriate text boxes, the user clicks on the infer button and PLIS generates all lexical inference chains, of length up to the transitivity limit, that connect text terms with hypothesis terms, as available from the combination of the selected input re5http://irsrv2.cs.biu.ac.il/nlp-net/PLIS.html sources. Each inference chain is presented in a line between the text and hypothesis. PLIS also displays the probability estimations for all inference levels; the probability of each chain is presented at the end of its line. For each hypothesis term, term-level probability, which weighs all inference chains found for it, is given below the dashed line. The overall sentence-level probability integrates the probabilities of all hypothesis terms and is displayed in the box at the bottom right corner. Next, we detail the inference process of Example 1, as presented in Fig 2. In this QA example, the probability of the candidate answer (set as the text) to be relevant for the given question (the hypothesis) is estimated. When utilizing only two knowledge resources (WordNet and Wikipedia), PLIS is able to recognize that explorer is inferred by Christopher Columbus and that New World is inferred by America. Each one of these pairs has two independent inference chains, numbered 1–4, as evidence for its inference relation. Both inference chains 1 and 3 include a single inference link, each derived from a different relation of the Wikipedia-based resource. The inference model assigns a higher probability for chain 1since the BeComp relation is much more reliable than the Link relation. This comparison illustrates the ability of the inference model to learn how to differ knowledge resources by their reliability. Comparing the probability assigned by the in100 ference model for inference chain 2 with the probabilities assigned for chains 1 and 3, reveals the sophisticated way by which the inference model integrates lexical knowledge. Inference chain 2 is longer than chain 1, therefore its probability is lower. However, the inference model assigns chain 2 a higher probability than chain 3, even though the latter is shorter, since the model is sensitive enough to consider the difference in reliability levels between the two highly reliable hypernym relations (from WordNet) of chain 2 and the less reliable Link relation (from Wikipedia) of chain 3. Another aspect of knowledge integration is exemplified in Fig 2 by the three circled probabilities. The inference model takes into consideration the multiple pieces of evidence for the inference of New World (inference chains 3 and 4, whose probabilities are circled). This results in a termlevel probability estimation for New World (the third circled probability) which is higher than the probabilities of each chain separately. The third term of the hypothesis, discover, remains uncovered by the text as no inference chain was found for it. Therefore, the sentence-level inference probability is very low, 37%. In order to identify that the hypothesis is indeed inferred from the text, the inference model should be provided with indications for the inference of discover. To that end, the user may increase the transitivity limit in hope that longer inference chains provide the needed information. In addition, the user can examine other knowledge resources in search for the missing inference link. In this example, it is enough to add VerbOcean to the input of PLIS to expose two inference chains which connect reveal with discover by combining an inference link from WordNet and another one from VerbOcean. With this additional information, the sentence-level probability increases to 76%. This is a typical scenario of utilizing PLIS, either via the interactive system or via the software, for analyzing the usability of the different knowledge resources and their combination. A feature of the interactive system, which is useful for lexical resources analysis, is that each term in a chain is clickable and links to another screen which presents all the terms that are inferred from it and those from which it is inferred. Additionally, the interactive system communicates with a server which runs PLIS, in a fullduplex WebSocket connection6. This mode of operation is publicly available and provides a method for utilizing PLIS, without having to install it or the lexical resources it uses. Finally, since PLIS is a lexical system it can easily be adjusted to other languages. One only needs to replace the basic lexical text processing tools and plug in knowledge resources in the target language. If PLIS is provided with bilingual resources,7 it can operate also as a cross-lingual inference system (Negri et al., 2012). For instance, the text in Fig 3 is given in English, while the hypothesis is written in Spanish (given as a list of lemma:part-of-speech). The left side of the figure depicts a cross-lingual inference process in which the only lexical knowledge resource used is a man- ually built English-Spanish dictionary. As can be seen, two Spanish terms, jugador and casa remain uncovered since the dictionary alone cannot connect them to any of the English terms in the text. As illustrated in the right side of Fig 3, PLIS enables the combination of the bilingual dictionary with monolingual resources to produce cross-lingual inference chains, such as footballer−h −y −p −er−n y −m →player− −m −a −nu − →aljugador. Such inferenc−e − c−h −a −in − →s hpalavey trh− e− capability otro. overcome monolingual language variability (the first link in this chain) as well as to provide cross-lingual translation (the second link). 5 Conclusions To utilize PLIS one should gather lexical resources, obtain sentence-level annotations and train the inference model. Annotations are available in common data sets for task such as QA, Information Retrieval (queries are hypotheses and snippets are texts) and Student Response Analysis (reference answers are the hypotheses that should be inferred by the student answers). For developers of NLP applications, PLIS offers a ready-to-use lexical knowledge integrator which can interface with many common lexical knowledge resources and constructs lexical inference chains which combine the knowledge in them. A developer who wants to overcome lexical language variability, or to incorporate background knowledge, can utilize PLIS to inject lex6We used the socket.io implementation. 7A bilingual resource holds inference links which connect terms in different languages (e.g. an English-Spanish dictionary can provide the inference link explorer→explorador). 101 Figure 3 : PLIS as a cross-lingual inference system. Left: the process with a single manual bilingual resource. Right: PLIS composes cross-lingual inference chains to increase hypothesis coverage and increase sentence-level inference probability. ical knowledge into any text understanding application. PLIS can be used as a lightweight inference system or as the lexical component of larger, more complex inference systems. Additionally, PLIS provides scores for infer- ence chains and determines the way to combine them in order to recognize sentence-level inference. PLIS comes with two probabilistic lexical inference models which achieved competitive performance levels in the tasks of recognizing textual entailment and passage retrieval for QA. All aspects of PLIS are configurable. The user can easily switch between the built-in lexical resources, inference models and even languages, or extend the system with additional lexical resources and new inference models. Acknowledgments The authors thank Eden Erez for his help with the interactive viewer and Miquel Espl a` Gomis for the bilingual dictionaries. This work was partially supported by the European Community’s 7th Framework Programme (FP7/2007-2013) under grant agreement no. 287923 (EXCITEMENT) and the Israel Science Foundation grant 880/12. References Luisa Bentivogli, Ido Dagan, Hoa Trang Dang, Danilo Giampiccolo, and Bernardo Magnini. 2009. The fifth PASCAL recognizing textual entailment challenge. In Proc. of TAC. Luisa Bentivogli, Peter Clark, Ido Dagan, Hoa Trang Dang, and Danilo Giampiccolo. 2010. The sixth PASCAL recognizing textual entailment challenge. In Proc. of TAC. Timothy Chklovski and Patrick Pantel. 2004. VerbOcean: Mining the web for fine-grained semantic verb relations. In Proc. of EMNLP. Ido Dagan, Oren Glickman, and Bernardo Magnini. 2006. The PASCAL recognising textual entailment challenge. In Lecture Notes in Computer Science, volume 3944, pages 177–190. A. P. Dempster, N. M. Laird, and D. B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the royal statistical society, series [B], 39(1): 1–38. Christiane Fellbaum, editor. 1998. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, Massachusetts. Nizar Habash and Bonnie Dorr. 2003. A categorial variation database for English. In Proc. of NAACL. Lili Kotlerman, Ido Dagan, Idan Szpektor, and Maayan Zhitomirsky-Geffet. 2010. Directional distributional similarity for lexical inference. Natural Language Engineering, 16(4):359–389. Dekang Lin. 1998. Automatic retrieval and clustering of similar words. In Proc. of COLOING-ACL. Matteo Negri, Alessandro Marchetti, Yashar Mehdad, Luisa Bentivogli, and Danilo Giampiccolo. 2012. Semeval-2012 task 8: Cross-lingual textual entailment for content synchronization. In Proc. of SemEval. Eyal Shnarch, Libby Barak, and Ido Dagan. 2009. Extracting lexical reference rules from Wikipedia. In Proc. of ACL. Eyal Shnarch, Jacob Goldberger, and Ido Dagan. 2011. Towards a probabilistic model for lexical entailment. In Proc. of the TextInfer Workshop. Eyal Shnarch, Ido Dagan, and Jacob Goldberger. 2012. A probabilistic lexical model for ranking textual inferences. In Proc. of *SEM. Mengqiu Wang, Noah A. Smith, and Teruko Mitamura. 2007. What is the Jeopardy model? A quasisynchronous grammar for QA. In Proc. of EMNLP. 102
Author: Georgios Kontonatsios ; Paul Thompson ; Riza Theresa Batista-Navarro ; Claudiu Mihaila ; Ioannis Korkontzelos ; Sophia Ananiadou
Abstract: U-Compare is a UIMA-based workflow construction platform for building natural language processing (NLP) applications from heterogeneous language resources (LRs), without the need for programming skills. U-Compare has been adopted within the context of the METANET Network of Excellence, and over 40 LRs that process 15 European languages have been added to the U-Compare component library. In line with METANET’s aims of increasing communication between citizens of different European countries, U-Compare has been extended to facilitate the development of a wider range of applications, including both mul- tilingual and multimodal workflows. The enhancements exploit the UIMA Subject of Analysis (Sofa) mechanism, that allows different facets of the input data to be represented. We demonstrate how our customised extensions to U-Compare allow the construction and testing of NLP applications that transform the input data in different ways, e.g., machine translation, automatic summarisation and text-to-speech.
5 0.94341189 362 acl-2013-Turning on the Turbo: Fast Third-Order Non-Projective Turbo Parsers
Author: Andre Martins ; Miguel Almeida ; Noah A. Smith
Abstract: We present fast, accurate, direct nonprojective dependency parsers with thirdorder features. Our approach uses AD3, an accelerated dual decomposition algorithm which we extend to handle specialized head automata and sequential head bigram models. Experiments in fourteen languages yield parsing speeds competitive to projective parsers, with state-ofthe-art accuracies for the largest datasets (English, Czech, and German).
6 0.93929619 277 acl-2013-Part-of-speech tagging with antagonistic adversaries
7 0.88721073 284 acl-2013-Probabilistic Sense Sentiment Similarity through Hidden Emotions
8 0.81826746 307 acl-2013-Scalable Decipherment for Machine Translation via Hash Sampling
9 0.7848894 118 acl-2013-Development and Analysis of NLP Pipelines in Argo
10 0.73090255 105 acl-2013-DKPro WSD: A Generalized UIMA-based Framework for Word Sense Disambiguation
11 0.71650541 51 acl-2013-AnnoMarket: An Open Cloud Platform for NLP
12 0.66921437 304 acl-2013-SEMILAR: The Semantic Similarity Toolkit
13 0.65767711 43 acl-2013-Align, Disambiguate and Walk: A Unified Approach for Measuring Semantic Similarity
14 0.64992374 239 acl-2013-Meet EDGAR, a tutoring agent at MONSERRATE
15 0.64082575 297 acl-2013-Recognizing Partial Textual Entailment
16 0.61619949 96 acl-2013-Creating Similarity: Lateral Thinking for Vertical Similarity Judgments
17 0.60963762 385 acl-2013-WebAnno: A Flexible, Web-based and Visually Supported System for Distributed Annotations
18 0.60913104 237 acl-2013-Margin-based Decomposed Amortized Inference
19 0.60121691 157 acl-2013-Fast and Robust Compressive Summarization with Dual Decomposition and Multi-Task Learning
20 0.59442586 262 acl-2013-Offspring from Reproduction Problems: What Replication Failure Teaches Us