acl acl2013 acl2013-262 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Antske Fokkens ; Marieke van Erp ; Marten Postma ; Ted Pedersen ; Piek Vossen ; Nuno Freire
Abstract: Repeating experiments is an important instrument in the scientific toolbox to validate previous work and build upon existing work. We present two concrete use cases involving key techniques in the NLP domain for which we show that reproducing results is still difficult. We show that the deviation that can be found in reproduction efforts leads to questions about how our results should be interpreted. Moreover, investigating these deviations provides new insights and a deeper understanding of the examined techniques. We identify five aspects that can influence the outcomes of experiments that are typically not addressed in research papers. Our use cases show that these aspects may change the answer to research questions leading us to conclude that more care should be taken in interpreting our results and more research involving systematic testing of methods is required in our field.
Reference: text
sentIndex sentText sentNum sentScore
1 We present two concrete use cases involving key techniques in the NLP domain for which we show that reproducing results is still difficult. [sent-16, score-0.192]
2 We show that the deviation that can be found in reproduction efforts leads to questions about how our results should be interpreted. [sent-17, score-0.503]
3 We identify five aspects that can influence the outcomes of experiments that are typically not addressed in research papers. [sent-19, score-0.171]
4 Our use cases show that these aspects may change the answer to research questions leading us to conclude that more care should be taken in interpreting our results and more research involving systematic testing of methods is required in our field. [sent-20, score-0.198]
5 In this paper, we argue that the value of research that attempts to replicate previous approaches goes beyond simply validating what is already known. [sent-23, score-0.13]
6 Especially when validation fails or variations in results are found, systematic testing helps to obtain a clearer picture of both the approach itself and of the meaning of state-of-the-art results leading to a better insight into the quality of new approaches in relation to previous work. [sent-25, score-0.154]
7 We support our claims by presenting two use cases that aim to reproduce results of previous work in two key NLP technologies: measuring WordNet similarity and Named Entity Recognition (NER). [sent-26, score-0.182]
8 This last point shows that reproducing results is not merely part of good practice in science, but also an essential part in gaining a better understanding of the methods we use. [sent-28, score-0.155]
9 Likewise, the problems we face in reproducing previous results are not merely frustrating inconveniences, but also pointers to research questions that deserve deeper investigation. [sent-29, score-0.248]
10 We investigated five aspects that cause experimental variation that are not typically described in publications: preprocessing (e. [sent-30, score-0.209]
11 the exact features used for individual tokens in NER), and system variation (e. [sent-38, score-0.201]
12 As such, reproduction provides a platform for systematically testing individual aspects of an approach that contribute to a given result. [sent-41, score-0.609]
13 The WordNet similarity experiment use case compares the performance of different similarity measures. [sent-49, score-0.248]
14 We will show that the answer as to which measure works best changes depending on factors such as the gold standard used, the strategy towards part-of-speech or the ranking coefficient, all aspects that are typically not addressed in the literature. [sent-50, score-0.268]
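To make the role of the ranking coefficient concrete, the minimal sketch below (our illustration, not code from the paper) scores two invented similarity measures against the same gold ratings with both Spearman ρ and Kendall τ; unless a paper states which coefficient it reports, such numbers cannot be compared across studies.

```python
# Minimal sketch (ours, not the paper's code): the same outputs evaluated
# with two different ranking coefficients. Requires scipy.
from scipy.stats import spearmanr, kendalltau

gold = [4.0, 3.5, 3.3, 3.0, 2.7, 1.8, 0.9, 0.4]               # invented ratings
measure_a = [0.91, 0.88, 0.70, 0.74, 0.60, 0.30, 0.20, 0.10]
measure_b = [0.95, 0.80, 0.82, 0.65, 0.66, 0.40, 0.15, 0.05]

for name, scores in [("A", measure_a), ("B", measure_b)]:
    rho, _ = spearmanr(gold, scores)
    tau, _ = kendalltau(gold, scores)
    print(f"measure {name}: Spearman rho = {rho:.3f}, Kendall tau = {tau:.3f}")
```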
15 com/antske/WordNetSimilarity, and for the NER experiments at http://github. [sent-52, score-0.150]
16 The experiments presented in this paper have been repeated by colleagues not involved in the development of the software using the code included in these repositories. [sent-54, score-0.156]
17 2 Background This section provides a brief overview of recent work addressing reproduction and benchmark results in computer science related studies and discusses how our research fits in the overall picture. [sent-60, score-0.445]
18 Most researchers agree that validating results entails that a method should lead to the same overall conclusions rather than producing the exact same numbers (Drummond, 2009; Dalle, 2012; Buchert and Nussbaum, 2012, etc. [sent-61, score-0.178]
19 According to Drummond (2009) replication is not interesting, since it does not lead to new insights. [sent-65, score-0.283]
20 On this point we disagree with Drummond (2009) as replication allows us to: 1) validate prior research, 2) improve on prior research without having to rebuild software from scratch, and 3) compare results of reimplementations and obtain the necessary insights to perform reproduction experiments. [sent-66, score-0.834]
21 The outcome of our use cases confirms the statement that deeper insights into an approach can be obtained when all resources are available, an observation also made by Ince et al. [sent-67, score-0.132]
22 Even if exact replication is not a goal many strive for, Ince et al. [sent-69, score-0.333]
23 (2012) argue that insightful reproduction can be an (almost) impossible undertaking without the source code being available. [sent-70, score-0.516]
24 Moreover, it is not always clear where replication stops and reproduction begins. [sent-71, score-0.686]
25 Dalle (2012) distinguishes levels of reproducing results related to how close they are to the original work and how each contributes to research. [sent-72, score-0.155]
26 In general, an increasing awareness of the importance of reproduction research and open code and data can be observed based on publications in high-profile journals (e. [sent-73, score-0.556]
27 A handful of use cases on reproducing or replicating results have been published. [sent-82, score-0.192]
28 Louridas and Gousios (2012) present a use case revealing that source code alone is not enough for reproducing results. 1http://www. [sent-83, score-0.226]
29 This includes both observations that a detailed level of information is required for truly insightful reproduction research as well as the claim that such research leads to better understanding of our techniques. [sent-89, score-0.498]
30 (2012) propose to use experimental databases to systematically test variations for machine learning, but neither links the two issues together. [sent-97, score-0.18]
31 We cannot control how fellow researchers carry out their evaluation, but if we have an idea of the variations that typically occur within a system, we can better compare approaches for which not all details are known. [sent-103, score-0.146]
32 3 WordNet Similarity Measures Patwardhan and Pedersen (2006) and Pedersen (2010) present studies where the output of a variety of WordNet similarity and relatedness measures is compared. [sent-104, score-0.226]
33 They rank Miller and Charles (1991)’s set (henceforth “mc-set”) of 30 word pairs according to their semantic relatedness with several WordNet similarity measures. [sent-105, score-0.133]
34 4 The fact that results of similarity measures on WordNet can differ even while the same software and same versions are used indicates that properties which are not addressed in the literature may influence the output of similarity measures. [sent-117, score-0.5]
35 We therefore conducted a range of experiments that, in addition to searching for the right settings to replicate results of previous research, address the following questions: 1) Which properties have an impact on the performance of WordNet similarity measures? [sent-118, score-0.187]
36 3) How do commonly used measures compare when the variation of their performance is taken into account? [sent-120, score-0.203]
37 2 Methodology and first observations The questions above were addressed in two stages. [sent-122, score-0.154]
38 In the first stage, Fokkens, who was not involved in the first replication attempt, implemented a script to calculate similarity measures using WordNet::Similarity. [sent-123, score-0.431]
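The script itself used the Perl WordNet::Similarity package; as a rough Python stand-in (a hedged sketch we add here, assuming NLTK's WordNet interface is an acceptable analogue; its scores depend on the installed WordNet version and need not match the Perl package), the same measures can be computed as follows.

```python
# Hedged NLTK stand-in for a WordNet::Similarity script (the paper used
# the Perl package; scores depend on the WordNet version installed).
# Requires: pip install nltk, then nltk.download('wordnet').
from nltk.corpus import wordnet as wn

def max_similarity(word1, word2, simfunc, pos=None):
    """Best score over all synset pairs; pos=wn.NOUN restricts PoS-tags."""
    best = None
    for s1 in wn.synsets(word1, pos=pos):
        for s2 in wn.synsets(word2, pos=pos):
            if s1.pos() != s2.pos():       # lch is undefined across PoS
                continue
            score = simfunc(s1, s2)
            if score is not None and (best is None or score > best):
                best = score
    return best

measures = {
    "path": lambda a, b: a.path_similarity(b),
    "wup":  lambda a, b: a.wup_similarity(b),
    "lch":  lambda a, b: a.lch_similarity(b),
}
for name, f in measures.items():   # "car"/"journey" is an mc-set pair
    print(name, max_similarity("car", "journey", f, pos=wn.NOUN))
```

Even this small sketch contains the kinds of silent choices the paper is about: whether to take the maximum over synset pairs, whether to restrict PoS-tags, and (in NLTK) the simulate_root flag, which plays roughly the role of the root-node configuration setting discussed below.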
39 This included similarity measures introduced by Wu and Palmer (1994) (wup), 3Obtained from http://talisker. [sent-124, score-0.227]
40 edu/cgi-bin/similarity/similarity. [sent-127, score-0.228]
41 First, we made sure that the script implemented by Fokkens could produce the same WordNet similarity scores for each individual word pair as those used to calculate the ranking on the mc-set by Pedersen (2010). [sent-138, score-0.204]
42 Finally, the gold standard and exact implementation of the Spearman ranking coefficient were compared. [sent-139, score-0.226]
43 The first replication attempt had restricted PoS-tags to nouns based on the idea that most items are nouns and subjects would be primed to primarily think of the noun senses. [sent-144, score-0.241]
44 PoS-tags were not restricted in the second replication attempt, but because of a bug in the code only the first identified PoS-tag (“noun” in all cases) was considered. [sent-146, score-0.312]
45 We therefore mistakenly assumed that PoS-tag restrictions did not matter until we compared individual scores between Pedersen and the replication attempts. [sent-147, score-0.342]
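The bug class is easy to reproduce in spirit; in the sketch below (ours, not the authors' code), a loop that only ever consults the first PoS-tag WordNet returns silently behaves like a noun-only restriction for most items.

```python
# Sketch of the bug class described above (ours, illustrative):
# using only the first PoS-tag returned by WordNet silently restricts
# most comparisons to noun senses, since nouns are typically listed first.
from nltk.corpus import wordnet as wn

word = "crane"                         # has both noun and verb senses
all_pos = {s.pos() for s in wn.synsets(word)}
first_pos = wn.synsets(word)[0].pos()
print("PoS-tags available:", all_pos)             # e.g. {'n', 'v'}
print("PoS-tag the buggy loop uses:", first_pos)  # typically 'n'
```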
46 Again, there is no reason why one gold standard would be a better choice than the other, but in order to replicate results, it must be known which of the two was used. [sent-150, score-0.134]
47 The influence of the exact gold standard and calculation of Spearman ρ could only be found because Pedersen could provide the output of the similarity measures he used to calculate the coefficient. [sent-152, score-0.363]
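Why the exact implementation of a coefficient matters is easy to demonstrate: the textbook Spearman formula rho = 1 - 6*sum(d^2)/(n(n^2-1)) assumes no tied scores, whereas implementations such as scipy.stats.spearmanr average the ranks of ties. The sketch below (ours; which variant each cited study used is not documented) shows the two diverging on tied similarity scores.

```python
# Two Spearman implementations diverge once scores contain ties.
from scipy.stats import spearmanr

gold   = [4.0, 3.5, 3.0, 2.5, 2.0, 1.0]
scores = [0.9, 0.7, 0.7, 0.5, 0.5, 0.1]   # note the tied scores

def naive_rho(x, y):
    """Textbook formula; breaks ties by position instead of averaging."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i], reverse=True)
        r = [0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

rho_scipy, _ = spearmanr(gold, scores)     # tie-aware (average ranks)
print("naive:", naive_rho(gold, scores))   # 1.000
print("scipy:", round(rho_scipy, 3))       # ~0.971
```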
48 Finally, results for lch, lesk and wup changed according to measure-specific configuration settings such as including a PoS-tag specific root node or turning on normalisation. [sent-154, score-0.242]
49 In the second stage of this research, we ran experiments that systematically manipulate the influential factors described above. [sent-155, score-0.133]
50 3 Variation per measure All measures varied in their performance. [sent-160, score-0.185]
51 The complete outcome of our experiments (both the similarity measures assigned to each pair as well as the output of the ranking coefficients) are included in the data set provided at http : / / github . [sent-161, score-0.335]
52 The last column indicates the variation of performance of a measure compared to the other measures, where 1 is the best performing measure and 12 is the worst. [sent-169, score-0.218]
53 Which measure performs best depends on the evaluation set, ranking coefficient, PoS-tag restrictions and configuration settings. [sent-176, score-0.246]
54 This means that the answer to the question of which similarity measure is best to mimic human similarity scores depends on aspects that are often not even mentioned, let alone systematically compared. [sent-177, score-0.377]
55 4 Variation per category For each influential category of experimental variation, we compared the variation in Spearman ρ and Kendall τ, while similarity measure and other influential categories were kept stable. [sent-179, score-0.379]
56 The categories we varied include WordNet and WordNet::Similarity version, the gold standard used to evaluate, restrictions on PoS-tags, and measure-specific configurations. [sent-180, score-0.202]
57 Table 2 presents the maximum variation found across measures for each category. [sent-181, score-0.203]
58 The last column indicates how often the ranking of a specific measure changed as the category changed, e. [sent-182, score-0.168]
59 did the measure ranking third using specific configurations, PoS-tag restrictions and a specific gold standard using WordNet 2. [sent-184, score-0.236]
60 Note that this number changes for each category, because we compared a different number of variants per category. 5Some measures ranked differently as their individual configuration settings changed. [sent-188, score-0.182]
61 In these cases, the measure was included in the overall ranking multiple times, which is why there are more ranking positions than measures. [sent-189, score-0.235]
62 [Table 2, garbled in extraction: maximum variation in Spearman ρ and Kendall τ per category.] We compared two WordNet versions (WN version), three gold standard and PoS-tag restriction variations, and configuration settings only for the subset of scores where configuration matters. [sent-194, score-0.307]
63 We included τ, because many authors do not mention the ranking coefficient they use (cf. [sent-211, score-0.163]
64 Except for WordNet, which Budanitsky and Hirst (2006) hold accountable for minor variations in a footnote, the influential categories we investigated in this paper, to our knowledge, have not yet been addressed in the literature. [sent-213, score-0.208]
65 Cramer (2008) points out that results from WordNet-Human similarity correlations lead to scattered results reporting variations similar to ours, but she compares studies using different measures, data and experimental setup. [sent-214, score-0.29]
66 Table 1 reveals a wide variation in ranking relative to alternative approaches. [sent-216, score-0.182]
67 Results in Table 2 show that it is common for the ranking of a score to change due to variations that are not at the core of the method. [sent-217, score-0.178]
68 This study shows that it is far from clear how different WordNet similarity measures relate to each other. [sent-218, score-0.19]
69 This is also the reason why we presented the maximum variation observed, rather than the average or typical variation (mostly below 0. [sent-222, score-0.22]
70 The results of the best run in our first reproduction attempt, together with the original results from Freire et al. [sent-248, score-0.445]
71 2 Following up from reproduction Since the experiments in Van Erp and Van der Meij (2013) introduce several new research questions regarding the influence of data cleaning and the limitations of the dataset, we performed some additional experiments. [sent-252, score-0.642]
72 [Table, garbled in extraction: Freire et al. (2012) Fβ=1 results vs Van Erp and Van der Meij's Fβ=1 replication results, with precision and recall for LOC, PER, ORG and Overall; the numeric cells are not recoverable here.] [sent-255, score-0.241]
73 Compared to Freire et al. (2012), our replication of their approach as presented in Van Erp and Van der Meij (2013) differs in the number of tokens (… tokens vs 12,510) and shows a 15 point drop in overall F-score. [sent-259, score-0.307]
74 This leads to questions about the differences between the CRF implementations and the influence of their parameters, which we hope to investigate in future work. [sent-288, score-0.131]
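As a hedged illustration of where such differences hide (the paper does not say which CRF package or settings Freire et al. used, and this is not their code), the Python sklearn-crfsuite wrapper exposes the kinds of knobs that vary silently across implementations:

```python
# Hedged sketch: hyperparameters that differ silently across CRF packages
# (we do not know which implementation or settings Freire et al. used).
# Requires: pip install sklearn-crfsuite
import sklearn_crfsuite

def token_features(sent, i):
    """A minimal NER-style feature map for token i of a sentence."""
    word = sent[i]
    return {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "word.isdigit": word.isdigit(),
        "prev.lower": sent[i - 1].lower() if i > 0 else "<S>",
        "next.lower": sent[i + 1].lower() if i < len(sent) - 1 else "</S>",
    }

train_sents = [["Piet", "visited", "Amsterdam", "."]]   # toy data
train_tags = [["B-PER", "O", "B-LOC", "O"]]
X = [[token_features(s, i) for i in range(len(s))] for s in train_sents]

crf = sklearn_crfsuite.CRF(
    algorithm="lbfgs",            # optimizer; other packages use others
    c1=0.1,                       # L1 regularization weight
    c2=0.1,                       # L2 regularization weight
    max_iterations=100,           # stopping criterion varies per package
    all_possible_transitions=False,
)
crf.fit(X, train_tags)
print(crf.predict(X))
```

Two "default" CRFs from different packages can disagree on every one of these settings, which alone can account for sizeable score gaps.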
75 The novelty and replication problems lie in the first three steps. [sent-293, score-0.241]
76 These settings are not mentioned in the paper, making reproduction very difficult. [sent-304, score-0.445]
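Tokenization is a typical example of such an unstated setting: the token-count mismatch noted above (vs 12,510 tokens) is the kind of difference a single preprocessing choice can produce. A small illustration (ours; the actual pipelines are not documented at this level):

```python
# Different tokenizers give different token counts over the same text
# (illustrative; the exact preprocessing of either NER system is unknown).
# Requires: pip install nltk, then nltk.download('punkt').
from nltk.tokenize import word_tokenize, wordpunct_tokenize

text = "Dr. O'Neill visited the city's archive (est. 1837)."
for name, tokens in [("whitespace", text.split()),
                     ("word_tokenize", word_tokenize(text)),
                     ("wordpunct", wordpunct_tokenize(text))]:
    print(f"{name}: {len(tokens)} tokens -> {tokens}")
# "O'Neill" may come out as one, two, or three tokens, shifting both
# feature extraction and the evaluation counts in NER.
```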
77 5 Observations In this section, we generalise the observations from our use cases to the main categories that can influence reproduction. [sent-306, score-0.163]
78 In order to check the output of a reproduction experiment at every step of the way, system output of experiments, including intermediate steps, is vital. [sent-317, score-0.48]
79 The WordNet replication was only possible, because Pedersen could provide the similarity scores of each word pair. [sent-318, score-0.338]
80 This was not observed in our experiments, but such variations may be determined by running an experiment several times and taking the average over the different runs (cf. [sent-322, score-0.16]
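Averaging over runs is cheap to build into an evaluation loop; a minimal sketch (train_and_score is a stand-in we invent here to simulate seed-dependent system variation):

```python
# Minimal sketch of averaging over repeated runs; train_and_score is a
# stand-in that simulates seed-dependent system variation.
import random
import statistics

def train_and_score(seed):
    rng = random.Random(seed)
    return 0.80 + rng.gauss(0, 0.01)   # pretend F-score with run noise

scores = [train_and_score(seed) for seed in range(10)]
print(f"mean F = {statistics.mean(scores):.3f} "
      f"(stdev {statistics.stdev(scores):.3f} over {len(scores)} runs)")
```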
81 (2012) propose a setup that allows researchers to provide their full experimental setup, which should include exact steps followed in preprocessing the data, documentation of the experimental setup, exact versions of the software and resources used and experimental output. [sent-327, score-0.342]
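Part of such a setup can be captured automatically. The sketch below is our illustration (not the cited proposal itself): it snapshots the interpreter version, package versions, and the code's git commit to a JSON file alongside the experimental output.

```python
# Our illustration of recording an experimental setup (not the cited
# proposal): snapshot interpreter, package, and code versions to JSON.
import json
import platform
import subprocess
from importlib.metadata import version, PackageNotFoundError

def current_git_commit():
    try:  # assumes the experiment lives inside a git checkout
        out = subprocess.run(["git", "rev-parse", "HEAD"],
                             capture_output=True, text=True, check=True)
        return out.stdout.strip()
    except Exception:
        return None

record = {"python": platform.python_version(),
          "git_commit": current_git_commit(),
          "packages": {}}
for pkg in ["nltk", "scipy"]:          # whatever the experiment imports
    try:
        record["packages"][pkg] = version(pkg)
    except PackageNotFoundError:
        record["packages"][pkg] = "not installed"

with open("experiment_setup.json", "w") as f:
    json.dump(record, f, indent=2)
```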
82 Having access to such a setup allows other researchers to validate research, but also tweak the approach to investigate system variation, systematically test the approach in order to learn its limitations and strengths and ultimately improve on it. [sent-328, score-0.163]
83 6 Discussion Many of the aspects addressed in the previous section such as preprocessing are typically only mentioned in passing, or not at all. [sent-329, score-0.142]
84 However, they do represent two core technologies and our observations align with previous literature on replication and reproduction. [sent-342, score-0.294]
85 Despite the systematic variation we employed 1698 in our experiments, they do not answer all questions that the problems in reproduction evoked. [sent-343, score-0.661]
86 For the WordNet experiments, deeper analysis is required to gain full understanding of how individual influential aspects interact with each measurement. [sent-344, score-0.184]
87 This could be stimulated by instituting reproduction tracks in conferences, thus rewarding systematic investigation of research approaches. [sent-349, score-0.493]
88 7 Conclusion We have presented two reproduction use cases for the NLP domain. [sent-352, score-0.482]
89 We show that repeating other researchers’ experiments can lead to new research questions and provide new insights into and better understanding of the investigated techniques. [sent-353, score-0.2]
90 Our WordNet experiments show that the performance of similarity measures can be influenced by the PoS-tags considered, measure specific variations, the rank coefficient and the gold standard used for comparison. [sent-354, score-0.342]
91 We not only find that such variations lead to different numbers, but also different rankings of the individual measures, i. [sent-355, score-0.183]
92 these aspects lead to a different answer to the question as to which measure performs best. [sent-357, score-0.151]
93 We did not succeed in reproducing the NER results of Freire et al. [sent-358, score-0.155]
94 (2012), showing the complexity of what seems a straightforward reproduction case based on a system description and training data only. [sent-359, score-0.445]
95 Some techniques are reused so often (the papers introducing WordNet similarity measures have around 1,000-2,000 citations each as of February 2013, for example) that knowing their strengths and weaknesses is essential for optimising their use. [sent-368, score-0.19]
96 10 But most of all: when reproduction fails, regardless of whether original code or a reimplementation was used, valuable insights can emerge from investigating the cause of this failure. [sent-371, score-0.576]
97 We furthermore thank Ruben Izquierdo, Lourens van der Meij, Christoph Zwirello, Rebecca Dridan and the Semantic Web Group at VU University for their help and useful feedback. [sent-375, score-0.158]
98 Using WordNet-based context vectors to estimate the semantic relatedness of concepts. [sent-497, score-0.237]
99 Information content measures of semantic similarity perform better without sense-tagged text. [sent-505, score-0.19]
100 Piek Vossen, Isa Maks, Roxane Segers, Hennie van der Vliet, Marie-Francine Moens, Katja Hofmann, Erik Tjong Kim Sang, and Maarten de Rijke. [sent-539, score-0.158]
wordName wordTfidf (topN-words)
[('reproduction', 0.445), ('freire', 0.303), ('replication', 0.241), ('wordnet', 0.201), ('pedersen', 0.198), ('spearman', 0.158), ('kendall', 0.156), ('reproducing', 0.155), ('ner', 0.141), ('erp', 0.132), ('variation', 0.11), ('variations', 0.106), ('similarity', 0.097), ('measures', 0.093), ('meij', 0.093), ('van', 0.092), ('replicate', 0.09), ('fokkens', 0.081), ('zigglebottom', 0.081), ('netherlands', 0.078), ('systematically', 0.074), ('imi', 0.074), ('influence', 0.073), ('vu', 0.073), ('ranking', 0.072), ('ince', 0.071), ('code', 0.071), ('patwardhan', 0.066), ('der', 0.066), ('restrictions', 0.066), ('miller', 0.061), ('raeder', 0.061), ('vanschoren', 0.061), ('viaf', 0.061), ('insights', 0.06), ('influential', 0.059), ('questions', 0.058), ('exact', 0.056), ('charles', 0.056), ('aspects', 0.055), ('measure', 0.054), ('coefficient', 0.054), ('configuration', 0.054), ('experiment', 0.054), ('drummond', 0.054), ('umn', 0.054), ('observations', 0.053), ('setup', 0.049), ('versions', 0.049), ('reproduce', 0.048), ('software', 0.048), ('systematic', 0.048), ('rubenstein', 0.047), ('wup', 0.047), ('ted', 0.047), ('amsterdam', 0.046), ('points', 0.045), ('sk', 0.045), ('initiatives', 0.044), ('mallet', 0.044), ('preprocessing', 0.044), ('gold', 0.044), ('hirst', 0.043), ('addressed', 0.043), ('piek', 0.042), ('vossen', 0.042), ('changed', 0.042), ('lead', 0.042), ('buchert', 0.04), ('cornetto', 0.04), ('decimals', 0.04), ('hatton', 0.04), ('howison', 0.04), ('journals', 0.04), ('lch', 0.04), ('louridas', 0.04), ('marieke', 0.04), ('reimplementations', 0.04), ('ske', 0.04), ('researchers', 0.04), ('validating', 0.04), ('repeating', 0.04), ('arity', 0.04), ('budanitsky', 0.038), ('varied', 0.038), ('cases', 0.037), ('included', 0.037), ('github', 0.036), ('relatedness', 0.036), ('nuno', 0.036), ('neylon', 0.036), ('strive', 0.036), ('lourens', 0.036), ('dalle', 0.036), ('geonames', 0.036), ('versioning', 0.036), ('intermediate', 0.035), ('individual', 0.035), ('deeper', 0.035), ('folds', 0.035)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 262 acl-2013-Offspring from Reproduction Problems: What Replication Failure Teaches Us
Author: Antske Fokkens ; Marieke van Erp ; Marten Postma ; Ted Pedersen ; Piek Vossen ; Nuno Freire
Abstract: Repeating experiments is an important instrument in the scientific toolbox to validate previous work and build upon existing work. We present two concrete use cases involving key techniques in the NLP domain for which we show that reproducing results is still difficult. We show that the deviation that can be found in reproduction efforts leads to questions about how our results should be interpreted. Moreover, investigating these deviations provides new insights and a deeper understanding of the examined techniques. We identify five aspects that can influence the outcomes of experiments that are typically not addressed in research papers. Our use cases show that these aspects may change the answer to research questions leading us to conclude that more care should be taken in interpreting our results and more research involving systematic testing of methods is required in our field.
2 0.15550081 12 acl-2013-A New Set of Norms for Semantic Relatedness Measures
Author: Sean Szumlanski ; Fernando Gomez ; Valerie K. Sims
Abstract: We have elicited human quantitative judgments of semantic relatedness for 122 pairs of nouns and compiled them into a new set of relatedness norms that we call Rel-122. Judgments from individual subjects in our study exhibit high average correlation to the resulting relatedness means (r = 0.77, σ = 0.09, N = 73), although not as high as Resnik’s (1995) upper bound for expected average human correlation to similarity means (r = 0.90). This suggests that human perceptions of relatedness are less strictly constrained than perceptions of similarity and establishes a clearer expectation for what constitutes human-like performance by a computational measure of semantic relatedness. We compare the results of several WordNet-based similarity and relatedness measures to our Rel-122 norms and demonstrate the limitations of WordNet for discovering general indications of semantic relatedness. We also offer a critique of the field’s reliance upon similarity norms to evaluate relatedness measures.
3 0.13327445 43 acl-2013-Align, Disambiguate and Walk: A Unified Approach for Measuring Semantic Similarity
Author: Mohammad Taher Pilehvar ; David Jurgens ; Roberto Navigli
Abstract: Semantic similarity is an essential component of many Natural Language Processing applications. However, prior methods for computing semantic similarity often operate at different levels, e.g., single words or entire documents, which requires adapting the method for each data type. We present a unified approach to semantic similarity that operates at multiple levels, all the way from comparing word senses to comparing text documents. Our method leverages a common probabilistic representation over word senses in order to compare different types of linguistic data. This unified representation shows state-of-the-art performance on three tasks: semantic textual similarity, word similarity, and word sense coarsening.
4 0.11672298 104 acl-2013-DKPro Similarity: An Open Source Framework for Text Similarity
Author: Daniel Bar ; Torsten Zesch ; Iryna Gurevych
Abstract: We present DKPro Similarity, an open source framework for text similarity. Our goal is to provide a comprehensive repository of text similarity measures which are implemented using standardized interfaces. DKPro Similarity comprises a wide variety of measures ranging from ones based on simple n-grams and common subsequences to high-dimensional vector comparisons and structural, stylistic, and phonetic measures. In order to promote the reproducibility of experimental results and to provide reliable, permanent experimental conditions for future studies, DKPro Similarity additionally comes with a set of full-featured experimental setups which can be run out-of-the-box and be used for future systems to built upon.
5 0.11547811 234 acl-2013-Linking and Extending an Open Multilingual Wordnet
Author: Francis Bond ; Ryan Foster
Abstract: We create an open multilingual wordnet with large wordnets for over 26 languages and smaller ones for 57 languages. It is made by combining wordnets with open licences, data from Wiktionary and the Unicode Common Locale Data Repository. Overall there are over 2 million senses for over 100 thousand concepts, linking over 1.4 million words in hundreds of languages.
6 0.10111614 96 acl-2013-Creating Similarity: Lateral Thinking for Vertical Similarity Judgments
7 0.10088781 291 acl-2013-Question Answering Using Enhanced Lexical Semantic Models
8 0.087788589 304 acl-2013-SEMILAR: The Semantic Similarity Toolkit
9 0.082493827 93 acl-2013-Context Vector Disambiguation for Bilingual Lexicon Extraction from Comparable Corpora
10 0.081670217 210 acl-2013-Joint Word Alignment and Bilingual Named Entity Recognition Using Dual Decomposition
11 0.076057605 169 acl-2013-Generating Synthetic Comparable Questions for News Articles
12 0.073906884 31 acl-2013-A corpus-based evaluation method for Distributional Semantic Models
13 0.073889419 60 acl-2013-Automatic Coupling of Answer Extraction and Information Retrieval
14 0.069057815 9 acl-2013-A Lightweight and High Performance Monolingual Word Aligner
15 0.067567088 257 acl-2013-Natural Language Models for Predicting Programming Comments
16 0.067282945 256 acl-2013-Named Entity Recognition using Cross-lingual Resources: Arabic as an Example
17 0.066446066 139 acl-2013-Entity Linking for Tweets
18 0.064577587 390 acl-2013-Word surprisal predicts N400 amplitude during reading
19 0.063786156 185 acl-2013-Identifying Bad Semantic Neighbors for Improving Distributional Thesauri
20 0.061764631 306 acl-2013-SPred: Large-scale Harvesting of Semantic Predicates
topicId topicWeight
[(0, 0.193), (1, 0.064), (2, 0.022), (3, -0.126), (4, -0.011), (5, -0.061), (6, -0.075), (7, -0.035), (8, 0.051), (9, -0.022), (10, -0.021), (11, 0.017), (12, -0.049), (13, -0.059), (14, 0.056), (15, 0.041), (16, 0.008), (17, 0.018), (18, -0.01), (19, -0.021), (20, 0.005), (21, -0.011), (22, 0.001), (23, -0.027), (24, -0.039), (25, 0.068), (26, -0.001), (27, -0.037), (28, -0.016), (29, -0.056), (30, -0.02), (31, -0.097), (32, 0.021), (33, -0.05), (34, -0.024), (35, -0.006), (36, 0.033), (37, 0.059), (38, 0.004), (39, 0.047), (40, -0.027), (41, 0.02), (42, -0.093), (43, -0.03), (44, -0.046), (45, -0.018), (46, -0.045), (47, -0.097), (48, -0.013), (49, -0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.93682832 262 acl-2013-Offspring from Reproduction Problems: What Replication Failure Teaches Us
Author: Antske Fokkens ; Marieke van Erp ; Marten Postma ; Ted Pedersen ; Piek Vossen ; Nuno Freire
Abstract: Repeating experiments is an important instrument in the scientific toolbox to validate previous work and build upon existing work. We present two concrete use cases involving key techniques in the NLP domain for which we show that reproducing results is still difficult. We show that the deviation that can be found in reproduction efforts leads to questions about how our results should be interpreted. Moreover, investigating these deviations provides new insights and a deeper understanding of the examined techniques. We identify five aspects that can influence the outcomes of experiments that are typically not addressed in research papers. Our use cases show that these aspects may change the answer to research questions leading us to conclude that more care should be taken in interpreting our results and more research involving systematic testing of methods is required in our field.
2 0.81880593 12 acl-2013-A New Set of Norms for Semantic Relatedness Measures
Author: Sean Szumlanski ; Fernando Gomez ; Valerie K. Sims
Abstract: We have elicited human quantitative judgments of semantic relatedness for 122 pairs of nouns and compiled them into a new set of relatedness norms that we call Rel-122. Judgments from individual subjects in our study exhibit high average correlation to the resulting relatedness means (r = 0.77, σ = 0.09, N = 73), although not as high as Resnik’s (1995) upper bound for expected average human correlation to similarity means (r = 0.90). This suggests that human perceptions of relatedness are less strictly constrained than perceptions of similarity and establishes a clearer expectation for what constitutes human-like performance by a computational measure of semantic relatedness. We compare the results of several WordNet-based similarity and relatedness measures to our Rel-122 norms and demonstrate the limitations of WordNet for discovering general indications of semantic relatedness. We also offer a critique of the field’s reliance upon similarity norms to evaluate relatedness measures.
3 0.80886829 96 acl-2013-Creating Similarity: Lateral Thinking for Vertical Similarity Judgments
Author: Tony Veale ; Guofu Li
Abstract: Just as observing is more than just seeing, comparing is far more than mere matching. It takes understanding, and even inventiveness, to discern a useful basis for judging two ideas as similar in a particular context, especially when our perspective is shaped by an act of linguistic creativity such as metaphor, simile or analogy. Structured resources such as WordNet offer a convenient hierarchical means for converging on a common ground for comparison, but offer little support for the divergent thinking that is needed to creatively view one concept as another. We describe such a means here, by showing how the web can be used to harvest many divergent views for many familiar ideas. These lateral views complement the vertical views of WordNet, and support a system for idea exploration called Thesaurus Rex. We show also how Thesaurus Rex supports a novel, generative similarity measure for WordNet. 1 Seeing is Believing (and Creating) Similarity is a cognitive phenomenon that is both complex and subjective, yet for practical reasons it is often modeled as if it were simple and objective. This makes sense for the many situations where we want to align our similarity judgments with those of others, and thus focus on the same conventional properties that others are also likely to focus upon. This reliance on the consensus viewpoint explains why WordNet (Fellbaum, 1998) has proven so useful as a basis for computational measures of lexico-semantic similarity (e.g. see Pederson et al. 2004, Budanitsky & Hirst, 2006; Seco et al. 2006). These measures reduce the similarity of two lexical concepts to a single number, by viewing similarity as an objective estimate of the overlap in their salient qualities. This convenient perspective is poorly suited to creative or insightful comparisons, but it is sufficient for the many mundane comparisons we often perform in daily life, such as when we organize books or look for items in a supermarket. So if we do not know in which aisle to locate a given item (such as oatmeal), we may tacitly know how to locate a similar product (such as cornflakes) and orient ourselves accordingly. Yet there are occasions when the recognition of similarities spurs the creation of similarities, when the act of comparison spurs us to invent new ways of looking at an idea. By placing pop tarts in the breakfast aisle, food manufacturers encourage us to view them as a breakfast food that is not dissimilar to oatmeal or cornflakes. When ex-PM Tony Blair published his memoirs, a mischievous activist encouraged others to move his book from Biography to Fiction in bookshops, in the hope that buyers would see it in a new light. Whenever we use a novel metaphor to convey a non-obvious viewpoint on a topic, such as “cigarettes are time bombs”, the comparison may spur us to insight, to see aspects of the topic that make it more similar to the vehicle (see Ortony, 1979; Veale & Hao, 2007). In formal terms, assume agent A has an insight about concept X, and uses the metaphor X is a Y to also provoke this insight in agent B. To arrive at this insight for itself, B must intuit what X and Y have in common. But this commonality is surely more than a standard categorization of X, or else it would not count as an insight about X.
To understand the metaphor, B must place X in a new category, so that X can be seen as more similar to Y. Metaphors shape the way we perceive the world by re-shaping the way we make similarity judgments. So if we want to imbue computers with the ability to make and to understand creative metaphors, we must first give them the ability to look beyond the narrow viewpoints of conventional resources. Any measure that models similarity as an objective function of a conventional worldview employs a convergent thought process. Using WordNet, for instance, a similarity measure can vertically converge on a common superordinate category of both inputs, and generate a single numeric result based on their distance to, and the information content of, this common generalization. So to find the most conventional ways of seeing a lexical concept, one simply ascends a narrowing concept hierarchy, using a process de Bono (1970) calls vertical thinking. To find novel, non-obvious and useful ways of looking at a lexical concept, one must use what Guilford (1967) calls divergent thinking and what de Bono calls lateral thinking. These processes cut across familiar category boundaries, to simultaneously place a concept in many different categories so that we can see it in many different ways. de Bono argues that vertical thinking is selective while lateral thinking is generative. Whereas vertical thinking concerns itself with the “right” way or a single “best” way of looking at things, lateral thinking focuses on producing alternatives to the status quo. To be as useful for creative tasks as they are for conventional tasks, we need to re-imagine our computational similarity measures as generative rather than selective, expansive rather than reductive, divergent as well as convergent and lateral as well as vertical. Though WordNet is ideally structured to support vertical, convergent reasoning, its comprehensive nature means it can also be used as a solid foundation for building a more lateral and divergent model of similarity. Here we will use the web as a source of diverse perspectives on familiar ideas, to complement the conventional and often narrow views codified by WordNet. Section 2 provides a brief overview of past work in the area of similarity measurement, before section 3 describes a simple bootstrapping loop for acquiring richly diverse perspectives from the web for a wide variety of familiar ideas. These perspectives are used to enhance a WordNet-based measure of lexico-semantic similarity in section 4, by broadening the range of informative viewpoints the measure can select from. Similarity is thus modeled as a process that is both generative and selective. This lateral-and-vertical approach is evaluated in section 5, on the Miller & Charles (1991) data-set. A web app for the lateral exploration of diverse viewpoints, named Thesaurus Rex, is also presented, before closing remarks are offered in section 6. 2 Related Work and Ideas WordNet’s taxonomic organization of noun-senses and verb-senses – in which very general categories are successively divided into increasingly informative sub-categories or instance-level ideas – allows us to gauge the overlap in information content, and thus of meaning, of two lexical concepts.
We need only identify the deepest point in the taxonomy at which this content starts to diverge. This point of divergence is often called the LCS, or least common subsumer, of two concepts (Pederson et al., 2004). Since sub-categories add new properties to those they inherit from their parents – Aristotle called these properties the differentia that stop a category system from trivially collapsing into itself – the depth of a lexical concept in a taxonomy is an intuitive proxy for its information content. Wu & Palmer (1994) use the depth of a lexical concept in the WordNet hierarchy as such a proxy, and thereby estimate the similarity of two lexical concepts as twice the depth of their LCS divided by the sum of their individual depths. Leacock and Chodorow (1998) instead use the length of the shortest path between two concepts as a proxy for the conceptual distance between them. To connect any two ideas in a hierarchical system, one must vertically ascend the hierarchy from one concept, change direction at a potential LCS, and then descend the hierarchy to reach the second concept. (Aristotle was also first to suggest this approach in his Poetics). Leacock and Chodorow normalize the length of this path by dividing its size (in nodes) by twice the depth of the deepest concept in the hierarchy; the latter is an upper bound on the distance between any two concepts in the hierarchy. Negating the log of this normalized length yields a corresponding similarity score. While the role of an LCS is merely implied in Leacock and Chodorow’s use of a shortest path, the LCS is pivotal nonetheless, and like that of Wu & Palmer, the approach uses an essentially vertical reasoning process to identify a single “best” generalization. Depth is a convenient proxy for information content, but more nuanced proxies can yield more rounded similarity measures. Resnick (1995) draws on information theory to define the information content of a lexical concept as the negative log likelihood of its occurrence in a corpus, either explicitly (via a direct mention) or by presupposition (via a mention of any of its sub-categories or instances). Since the likelihood of a general category occurring in a corpus is higher than that of any of its sub-categories or instances, such categories are more predictable, and less informative, than rarer categories whose occurrences are less predictable and thus more informative. The negative log likelihood of the most informative LCS of two lexical concepts offers a reliable estimate of the amount of information shared by those concepts, and thus a good estimate of their similarity. Lin (1998) combines the intuitions behind Resnick’s metric and that of Wu and Palmer to estimate the similarity of two lexical concepts as an information ratio: twice the information content of their LCS divided by the sum of their individual information contents. Jiang and Conrath (1997) consider the converse notion of dissimilarity, noting that two lexical concepts are dissimilar to the extent that each contains information that is not shared by the other. So if the information content of their most informative LCS is a good measure of what they do share, then the sum of their individual information contents, minus twice the content of their most informative LCS, is a reliable estimate of their dissimilarity. Seco et al. (2006) presents a minor innovation, showing how Resnick’s notion of information content can be calculated without the use of an external corpus.
Rather, when using Resnick’s metric (or that of Lin, or Jiang and Conrath) for measuring the similarity of lexical concepts in WordNet, one can use the category structure of WordNet itself to estimate information content. Typically, the more general a concept, the more descendants it will possess. Seco et al. thus estimate the information content of a lexical concept as the log of the sum of all its unique descendants (both direct and indirect), divided by the log of the total number of concepts in the entire hierarchy. Not only is this intrinsic view of information content convenient to use, without recourse to an external corpus, Seco et al. show that it offers a better estimate of information content than its extrinsic, corpus-based alternatives, as measured relative to average human similarity ratings for the 30 word-pairs in the Miller & Charles (1991) test set. A similarity measure can draw on other sources of information besides WordNet’s category structures. One might eke out additional information from WordNet’s textual glosses, as in Lesk (1986), or use category structures other than those offered by WordNet. Looking beyond WordNet, entries in the online encyclopedia Wikipedia are not only connected by a dense topology of lateral links, they are also organized by a rich hierarchy of overlapping categories. Strube and Ponzetto (2006) show how Wikipedia can support a measure of similarity (and relatedness) that better approximates human judgments than many WordNet-based measures. Nonetheless, WordNet can be a valuable component of a hybrid measure, and Agirre et al. (2009) use an SVM (support vector machine) to combine information from WordNet with information harvested from the web. Their best similarity measure achieves a remarkable 0.93 correlation with human judgments on the Miller & Charles word-pair set. Similarity is not always applied to pairs of concepts; it is sometimes analogically applied to pairs of pairs of concepts, as in proportional analogies of the form A is to B as C is to D (e.g., hacks are to writers as mercenaries are to soldiers, or chisels are to sculptors as scalpels are to surgeons). In such analogies, one is really assessing the similarity of the unstated relationship between each pair of concepts: thus, mercenaries are soldiers whose allegiance is paid for, much as hacks are writers with income-driven loyalties; sculptors use chisels to carve stone, while surgeons use scalpels to cut or carve flesh. Veale (2004) used WordNet to assess the similarity of A:B to C:D as a function of the combined similarity of A to C and of B to D. In contrast, Turney (2005) used the web to pursue a more divergent course, to represent the tacit relationships of A to B and of C to D as points in a high-dimensional space. The dimensions of this space initially correspond to linking phrases on the web, before these dimensions are significantly reduced using singular value decomposition. In the infamous SAT test, an analogy A:B::C:D has four other pairs of concepts that serve as likely distractors (e.g. singer:songwriter for hack:writer) and the goal is to choose the most appropriate C:D pair for a given A:B pairing. Using variants of Wu and Palmer (1994) on the 374 SAT analogies of Turney (2005), Veale (2004) reports a success rate of 38–44% using only WordNet-based similarity.
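For reference, the standard formulations of the measures described in the two preceding passages (compiled by us from the cited papers; c1 and c2 are concepts, lcs their least common subsumer, D the maximum taxonomy depth, len the shortest path length in nodes, IC information content, hypo(c) the number of descendants of c, and N the total number of concepts):

```latex
\begin{align*}
\mathrm{sim}_{\mathrm{WP}}(c_1,c_2)  &= \frac{2\,\mathrm{depth}(lcs)}{\mathrm{depth}(c_1)+\mathrm{depth}(c_2)}\\
\mathrm{sim}_{\mathrm{LC}}(c_1,c_2)  &= -\log\frac{\mathrm{len}(c_1,c_2)}{2D}\\
\mathrm{sim}_{\mathrm{Res}}(c_1,c_2) &= \mathrm{IC}(lcs) = -\log p(lcs)\\
\mathrm{sim}_{\mathrm{Lin}}(c_1,c_2) &= \frac{2\,\mathrm{IC}(lcs)}{\mathrm{IC}(c_1)+\mathrm{IC}(c_2)}\\
\mathrm{dist}_{\mathrm{JC}}(c_1,c_2) &= \mathrm{IC}(c_1)+\mathrm{IC}(c_2)-2\,\mathrm{IC}(lcs)\\
\mathrm{IC}_{\mathrm{intrinsic}}(c)  &= 1-\frac{\log(\mathrm{hypo}(c)+1)}{\log N}
\end{align*}
```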
In contrast, Turney (2005) reports up to 55% success on the same analogies, partly because his approach aims to match implicit relations rather than explicit concepts, and in part because it uses a divergent process to gather from the web as rich a perspective as it can on these latent relationships. 2.1 Clever Comparisons Create Similarity Each of these approaches to similarity is a user of information, rather than a creator, and each fails to capture how a creative comparison (such as a metaphor) can spur a listener to view a topic from an atypical perspective. Camac & Glucksberg (1984) provide experimental evidence for the claim that “metaphors do not use preexisting associations to achieve their effects […] people use metaphors to create new relations between concepts.” They also offer a salutary reminder of an often overlooked fact: every comparison exploits information, but each is also a source of new information in its own right. Thus, “this cola is acid” reveals a different perspective on cola (e.g. as a corrosive substance or an irritating food) than “this acid is cola” highlights for acid (such as e.g., a familiar substance). Veale & Keane (1994) model the role of similarity in realizing the long-term perlocutionary effect of an informative comparison. For example, to compare surgeons to butchers is to encourage one to see all surgeons as more bloody, … crude or careless. The reverse comparison, of butchers to surgeons, encourages one to see butchers as more skilled and precise. Veale & Keane present a network model of memory, called Sapper, in which activation can spread between related concepts, thus allowing one concept to prime the properties of a neighbor. To interpret an analogy, Sapper lays down new activation-carrying bridges in memory between analogical counterparts, such as between surgeon & butcher, flesh & meat, and scalpel & cleaver. Comparisons can thus have lasting effects on how Sapper sees the world, changing the pattern of activation that arises when it primes a concept. Veale (2003) adopts a similarly dynamic view of similarity in WordNet, showing how an analogical comparison can result in the automatic addition of new categories and relations to WordNet itself. Veale considers the problem of finding an analogical mapping between different parts of WordNet’s noun-sense hierarchy, such as between instances of Greek god and Norse god, or between the letters of different alphabets, such as of Greek and Hebrew. But no structural similarity measure for WordNet exhibits enough discernment to e.g. assign a higher similarity to Zeus & Odin (each is the supreme deity of its pantheon) than to a pairing of Zeus and any other Norse god, just as no structural measure will assign a higher similarity to Alpha & Aleph or to Beta & Beth than to any random letter pairing. A fine-grained category hierarchy permits fine-grained similarity judgments, and though WordNet is useful, its sense hierarchies are not especially fine-grained. However, we can automatically make WordNet subtler and more discerning, by adding new fine-grained categories to unite lexical concepts whose similarity is not reflected by any existing categories. Veale (2003) shows how a property that is found in the glosses of two lexical concepts, of the same depth, can be combined with their LCS to yield a new fine-grained parent category, so e.g. “supreme” + deity = Supreme-deity (for Odin, Zeus, Jupiter, etc.) and “1st” + letter = 1st-letter (for Alpha, Aleph, etc.)
Selected aspects of the textual similarity of two WordNet glosses – the key to similarity in Lesk (1986) – can thus be reified into an explicitly categorical WordNet form. 3 Divergent (Re)Categorization To tap into a richer source of concept properties than WordNet’s glosses, we can use web ngrams. Consider these descriptions of a cowboy from the Google n-grams (Brants & Franz, 2006). The numbers to the right are Google frequency counts. a lonesome cowboy 432 a mounted cowboy 122 a grizzled cowboy 74 a swaggering cowboy 68 To find the stable properties that can underpin a meaningful fine-grained category for cowboy, we must seek out the properties that are so often presupposed to be salient of all cowboys that one can use them to anchor a simile, such as
4 0.7324819 31 acl-2013-A corpus-based evaluation method for Distributional Semantic Models
Author: Abdellah Fourtassi ; Emmanuel Dupoux
Abstract: Evaluation methods for Distributional Semantic Models typically rely on behaviorally derived gold standards. These methods are difficult to deploy in languages with scarce linguistic/behavioral resources. We introduce a corpus-based measure that evaluates the stability of the lexical semantic similarity space using a pseudo-synonym same-different detection task and no external resources. We show that it enables to predict two behaviorbased measures across a range of parameters in a Latent Semantic Analysis model.
5 0.72983962 43 acl-2013-Align, Disambiguate and Walk: A Unified Approach for Measuring Semantic Similarity
Author: Mohammad Taher Pilehvar ; David Jurgens ; Roberto Navigli
Abstract: Semantic similarity is an essential component of many Natural Language Processing applications. However, prior methods for computing semantic similarity often operate at different levels, e.g., single words or entire documents, which requires adapting the method for each data type. We present a unified approach to semantic similarity that operates at multiple levels, all the way from comparing word senses to comparing text documents. Our method leverages a common probabilistic representation over word senses in order to compare different types of linguistic data. This unified representation shows state-of-the-art performance on three tasks: semantic textual similarity, word similarity, and word sense coarsening.
6 0.71883649 104 acl-2013-DKPro Similarity: An Open Source Framework for Text Similarity
7 0.71558696 185 acl-2013-Identifying Bad Semantic Neighbors for Improving Distributional Thesauri
8 0.7104876 304 acl-2013-SEMILAR: The Semantic Similarity Toolkit
9 0.70125526 234 acl-2013-Linking and Extending an Open Multilingual Wordnet
10 0.64996278 291 acl-2013-Question Answering Using Enhanced Lexical Semantic Models
11 0.6417821 198 acl-2013-IndoNet: A Multilingual Lexical Knowledge Network for Indian Languages
12 0.61638051 76 acl-2013-Building and Evaluating a Distributional Memory for Croatian
13 0.60215873 48 acl-2013-An Open Source Toolkit for Quantitative Historical Linguistics
14 0.59067971 3 acl-2013-A Comparison of Techniques to Automatically Identify Complex Words.
15 0.58039987 238 acl-2013-Measuring semantic content in distributional vectors
16 0.578134 390 acl-2013-Word surprisal predicts N400 amplitude during reading
17 0.57630795 242 acl-2013-Mining Equivalent Relations from Linked Data
18 0.57015914 371 acl-2013-Unsupervised joke generation from big data
19 0.56269538 268 acl-2013-PATHS: A System for Accessing Cultural Heritage Collections
20 0.56140727 324 acl-2013-Smatch: an Evaluation Metric for Semantic Feature Structures
topicId topicWeight
[(0, 0.108), (6, 0.038), (11, 0.052), (15, 0.31), (24, 0.035), (26, 0.05), (28, 0.01), (35, 0.071), (42, 0.037), (48, 0.041), (70, 0.036), (88, 0.029), (90, 0.035), (95, 0.073)]
simIndex simValue paperId paperTitle
1 0.95752841 232 acl-2013-Linguistic Models for Analyzing and Detecting Biased Language
Author: Marta Recasens ; Cristian Danescu-Niculescu-Mizil ; Dan Jurafsky
Abstract: Unbiased language is a requirement for reference sources like encyclopedias and scientific texts. Bias is, nonetheless, ubiquitous, making it crucial to understand its nature and linguistic realization and hence detect bias automatically. To this end we analyze real instances of human edits designed to remove bias from Wikipedia articles. The analysis uncovers two classes of bias: framing bias, such as praising or perspective-specific words, which we link to the literature on subjectivity; and epistemological bias, related to whether propositions that are presupposed or entailed in the text are uncontroversially accepted as true. We identify common linguistic cues for these classes, including factive verbs, implicatives, hedges, and subjective intensifiers. These insights help us develop features for a model to solve a new prediction task of practical importance: given a biased sentence, identify the bias-inducing word. Our linguistically-informed model performs almost as well as humans tested on the same task.
same-paper 2 0.83755517 262 acl-2013-Offspring from Reproduction Problems: What Replication Failure Teaches Us
Author: Antske Fokkens ; Marieke van Erp ; Marten Postma ; Ted Pedersen ; Piek Vossen ; Nuno Freire
Abstract: Repeating experiments is an important instrument in the scientific toolbox to validate previous work and build upon existing work. We present two concrete use cases involving key techniques in the NLP domain for which we show that reproducing results is still difficult. We show that the deviation that can be found in reproduction efforts leads to questions about how our results should be interpreted. Moreover, investigating these deviations provides new insights and a deeper understanding of the examined techniques. We identify five aspects that can influence the outcomes of experiments that are typically not addressed in research papers. Our use cases show that these aspects may change the answer to research questions leading us to conclude that more care should be taken in interpreting our results and more research involving systematic testing of methods is required in our field.
3 0.81039864 199 acl-2013-Integrating Multiple Dependency Corpora for Inducing Wide-coverage Japanese CCG Resources
Author: Sumire Uematsu ; Takuya Matsuzaki ; Hiroki Hanaoka ; Yusuke Miyao ; Hideki Mima
Abstract: This paper describes a method of inducing wide-coverage CCG resources for Japanese. While deep parsers with corpus-induced grammars have been emerging for some languages, those for Japanese have not been widely studied, mainly because most Japanese syntactic resources are dependency-based. Our method first integrates multiple dependency-based corpora into phrase structure trees and then converts the trees into CCG derivations. The method is empirically evaluated in terms of the coverage of the obtained lexicon and the accuracy of parsing.
4 0.79254669 304 acl-2013-SEMILAR: The Semantic Similarity Toolkit
Author: Vasile Rus ; Mihai Lintean ; Rajendra Banjade ; Nobal Niraula ; Dan Stefanescu
Abstract: We present in this paper SEMILAR, the SEMantic simILARity toolkit. SEMILAR implements a number of algorithms for assessing the semantic similarity between two texts. It is available as a Java library and as a Java standalone application offering GUI-based access to the implemented semantic similarity methods. Furthermore, it offers facilities for manual semantic similarity annotation by experts through its component SEMILAT (a SEMantic simILarity Annotation Tool).
5 0.74705589 233 acl-2013-Linking Tweets to News: A Framework to Enrich Short Text Data in Social Media
Author: Weiwei Guo ; Hao Li ; Heng Ji ; Mona Diab
Abstract: Many current Natural Language Processing [NLP] techniques work well assuming a large context of text as input data. However they become ineffective when applied to short texts such as Twitter feeds. To overcome the issue, we want to find a related newswire document to a given tweet to provide contextual support for NLP tasks. This requires robust modeling and understanding of the semantics of short texts. The contribution of the paper is two-fold: 1. we introduce the Linking-Tweets-to-News task as well as a dataset of linked tweet-news pairs, which can benefit many NLP applications; 2. in contrast to previous research which focuses on lexical features within the short texts (text-to-word information), we propose a graph based latent variable model that models the inter short text correlations (text-to-text information). This is motivated by the observation that a tweet usually only covers one aspect of an event. We show that using tweet specific feature (hashtag) and news specific feature (named entities) as well as temporal constraints, we are able to extract text-to-text correlations, and thus complete the semantic picture of a short text. Our experiments show significant improvement of our new model over baselines with three evaluation metrics in the new task.
6 0.73137528 47 acl-2013-An Information Theoretic Approach to Bilingual Word Clustering
7 0.57416874 250 acl-2013-Models of Translation Competitions
8 0.57022214 187 acl-2013-Identifying Opinion Subgroups in Arabic Online Discussions
9 0.55897981 59 acl-2013-Automated Pyramid Scoring of Summaries using Distributional Semantics
10 0.55651504 83 acl-2013-Collective Annotation of Linguistic Resources: Basic Principles and a Formal Model
11 0.55130303 318 acl-2013-Sentiment Relevance
12 0.54997009 346 acl-2013-The Impact of Topic Bias on Quality Flaw Prediction in Wikipedia
13 0.54943693 43 acl-2013-Align, Disambiguate and Walk: A Unified Approach for Measuring Semantic Similarity
14 0.54481632 17 acl-2013-A Random Walk Approach to Selectional Preferences Based on Preference Ranking and Propagation
15 0.5416398 63 acl-2013-Automatic detection of deception in child-produced speech using syntactic complexity features
16 0.53590524 2 acl-2013-A Bayesian Model for Joint Unsupervised Induction of Sentiment, Aspect and Discourse Representations
17 0.53479666 99 acl-2013-Crowd Prefers the Middle Path: A New IAA Metric for Crowdsourcing Reveals Turker Biases in Query Segmentation
18 0.53032398 224 acl-2013-Learning to Extract International Relations from Political Context
19 0.52819729 148 acl-2013-Exploring Sentiment in Social Media: Bootstrapping Subjectivity Clues from Multilingual Twitter Streams
20 0.52805865 52 acl-2013-Annotating named entities in clinical text by combining pre-annotation and active learning