acl acl2013 acl2013-185 knowledge-graph by maker-knowledge-mining

185 acl-2013-Identifying Bad Semantic Neighbors for Improving Distributional Thesauri


Source: pdf

Author: Olivier Ferret

Abstract: Distributional thesauri are now widely used in a large number of Natural Language Processing tasks. However, they are far from containing only interesting semantic relations. As a consequence, improving such thesauri is an important issue that is mainly tackled indirectly through the improvement of semantic similarity measures. In this article, we propose a more direct approach focusing on the identification of the neighbors of a thesaurus entry that are not semantically linked to this entry. This identification relies on a discriminative classifier trained from examples selected in an unsupervised way to build a distributional model of the entry in texts. Its bad neighbors are found by applying this classifier to a representative set of occurrences of each of these neighbors. We evaluate the benefit of this method for a large set of English nouns with various frequencies.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 As a consequence, improving such thesauri is an important issue that is mainly tackled indirectly through the improvement of semantic similarity measures. [sent-5, score-0.708]

2 In this article, we propose a more direct approach focusing on the identification of the neighbors of a thesaurus entry that are not semantically linked to this entry. [sent-6, score-1.373]

3 This identification relies on a discriminative classifier trained from examples selected in an unsupervised way to build a distributional model of the entry in texts. [sent-7, score-0.803]

4 Its bad neighbors are found by applying this classifier to a representative set of occurrences of each of these neighbors. [sent-8, score-0.823]

5 1 Introduction The work we present in this article focuses on the automatic building of a thesaurus from a corpus. [sent-10, score-0.573]

6 As illustrated by Table 1, such a thesaurus gives for each of its entries a list of words, called semantic neighbors, that are supposed to be semantically linked to the entry. [sent-11, score-0.805]

7 Generally, each neighbor is associated with a weight that characterizes the strength of its link with the entry and all the neighbors of an entry are sorted according to the decreasing order of their weight. [sent-12, score-1.123]
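This entry/neighbor organization can be sketched in Python; the entries, neighbors and weights below are invented for illustration (cabdriver/waterworks and machination/hollowness echo examples mentioned in the paper, but the scores are not the paper's):

```python
# A distributional thesaurus: each entry maps to a list of semantic
# neighbors, each carrying a weight for the strength of its link with
# the entry. Neighbors are kept in decreasing order of weight.
# Words and weights are illustrative only.
thesaurus = {
    "cabdriver": [("taxi", 0.41), ("driver", 0.35), ("waterworks", 0.12)],
    "machination": [("scheme", 0.38), ("plot", 0.30), ("hollowness", 0.09)],
}

def neighbors_of(entry):
    """Return the neighbors of an entry, sorted by decreasing weight."""
    return sorted(thesaurus.get(entry, ()), key=lambda nw: nw[1], reverse=True)
```

In practice only the first neighbors of an entry are reliable, which is why the method below targets the bad ones.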

8 The term semantic neighbor is very generic and can have two main interpretations according to the kind of semantic relations it is based on: one relies only on paradigmatic relations, such as hypernymy or synonymy, while the other also takes syntagmatic relations into account. [sent-13, score-0.453]

9 The distinction between these two interpretations refers to the distinction between the notions of semantic similarity and semantic relatedness as it was done in (Budanitsky and Hirst, 2006) or in (Zesch and Gurevych, 2010) for instance. [sent-15, score-0.307]

10 However, the limit between these two notions is sometimes hard to find in existing work as terms semantic similarity and semantic relatedness are often used interchangeably. [sent-16, score-0.307]

11 Moreover, semantic similarity is frequently considered as included into semantic relatedness and the two problems are often tackled by using the same methods. [sent-17, score-0.343]

12 In the remainder of this article, we will use the term semantic similarity with its generic sense and the term semantic relatedness for referring more specifically to similarity based on syntagmatic relations. [sent-18, score-0.45]

13 Following work such as (Grefenstette, 1994), a widespread way to build a thesaurus from a corpus is to use a semantic similarity measure for extracting the semantic neighbors of the entries of the thesaurus. [sent-19, score-1.365]

14 Work based on WordNet-like lexical networks for building semantic similarity measures such as (Budanitsky and Hirst, 2006) or (Pedersen et al. [sent-22, score-0.253]

15 The last option is the corpus-based approach, based on the distributional hypothesis (Firth, 1957): each word is characterized by the set of contexts from a corpus in which it appears and the semantic similarity of two words is computed from the contexts they share. [sent-29, score-0.658]
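The hypothesis can be illustrated with a toy similarity computation; cosine over bag-of-words context vectors is one common choice (the context counts below are invented):

```python
from collections import Counter
from math import sqrt

def cosine(ctx_a, ctx_b):
    """Cosine similarity of two bag-of-words context vectors (Counters):
    the more contexts two words share, the higher their similarity."""
    dot = sum(ctx_a[w] * ctx_b[w] for w in set(ctx_a) & set(ctx_b))
    norm_a = sqrt(sum(v * v for v in ctx_a.values()))
    norm_b = sqrt(sum(v * v for v in ctx_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Toy context counts: words sharing contexts come out as similar.
car = Counter(drive=4, road=3, engine=2)
automobile = Counter(drive=3, road=2, wheel=1)
banana = Counter(peel=3, fruit=2)
```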

16 The problem of improving the results of the “classical” implementation of the distributional approach, as it can be found in (Curran and Moens, 2002a) for instance, was already tackled in previous work. [sent-32, score-0.313]

17 , 2012) or the redefinition of the distributional approach in a Bayesian framework (Kazama et al. [sent-37, score-0.242]

18 2 Principles Our work shares with (Zhitomirsky-Geffet and Dagan, 2009) the use of a kind of bootstrapping, as it starts from a distributional thesaurus and, to some extent, exploits it for its own improvement. [sent-40, score-0.788]

19 In Table 1, waterworks for the entry cabdriver and hollowness for the entry machination are two examples of such kind of neighbors. [sent-42, score-0.62]

20 By discarding these bad neighbors or at least by downgrading them, the rank of true semantic neighbors is expected to be lower. [sent-43, score-1.267]

21 This makes the thesaurus more interesting to use since the quality of such a thesaurus strongly decreases as the rank of the neighbors of its entries increases (see Section 4. [sent-44, score-1.589]

22 1 for an illustration), which means in practice that only the first neighbors of an entry can be generally exploited. [sent-45, score-0.762]

23 The approach we propose for identifying the bad semantic neighbors of a thesaurus entry relies on the distributional hypothesis, as the method for the initial building of the thesaurus, but implements it in a different way. [sent-46, score-1.824]

24 This hypothesis roughly specifies that from a semantic viewpoint, the meaning of a word can be characterized by the set of contexts in which this word occurs. [sent-47, score-0.264]

25 In work such as (Curran and Moens, 2002a), this hypothesis is implemented by collecting for each entry the words it co-occurs with in a large corpus. [sent-49, score-0.329]

26 This co-occurrence can be based either on the position of the word in the text in relation to the entry or on the presence of a syntactic relation between the entry and the word. [sent-50, score-0.54]

27 As a result, the distributional representation of a word takes the unstructured form of a bag of words or the more structured form of a set of pairs {syntactic relation, word}. [sent-51, score-0.268]
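A sketch of the two forms just described, with hypothetical relation labels and toy counts; `to_bag` shows how the structured form collapses into the unstructured one:

```python
# Structured form: a set of {(syntactic relation, word): count} pairs,
# e.g. for an entry like "coffee" (toy data, invented relation labels).
syntactic_pairs = {("obj_of", "drink"): 5, ("mod", "hot"): 3, ("nn", "cup"): 2}

def to_bag(pairs):
    """Collapse (relation, word) pairs into an unstructured bag of words,
    discarding the syntactic relation."""
    bag = {}
    for (_, word), count in pairs.items():
        bag[word] = bag.get(word, 0) + count
    return bag
```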

28 , 2010) where the distributional representation of a word is modeled as a multinomial distribution with Dirichlet as prior. [sent-53, score-0.242]

29 Table 1: First neighbors of some entries of the distributional thesaurus of section 3. [sent-88, score-1.302]

30 This model aims more precisely at discriminating from a semantic viewpoint a word in context, i. [sent-90, score-0.216]

31 in a sentence, from all other words and more particularly, from those of its neighbors in a distributional thesaurus that are likely to be actually not semantically similar to it. [sent-92, score-1.316]

32 The underlying hypothesis follows the distributional principles: a word and a synonym should appear in the same contexts, which means that they are characterized by the same features. [sent-93, score-0.352]

33 More precisely, we found that such model is specifically effective for discarding the bad neighbors of the entries of a distributional thesaurus. [sent-95, score-0.93]

34 1 Overview The principles presented in the previous section face one major problem compared to the “classical” distributional approach, in which the semantic similarity of two words can be evaluated directly by computing the similarity of their distributional representations. [sent-97, score-0.797]

35 As a consequence, for deciding whether a neighbor of a thesaurus entry is a bad neighbor or not, the discriminative model of the entry has to be applied to occurrences of this neighbor in texts. [sent-100, score-1.507]

36 2 Building of the initial thesaurus Before introducing our method for improving distributional thesauri, we first present the way we build such a thesaurus. [sent-103, score-0.833]

37 As in (Lin, 1998) or (Cur- ran and Moens, 2002a), this building is based on the definition of a semantic similarity measure from a corpus. [sent-104, score-0.244]

38 For the extraction of distributional data and the characteristics of the distributional similarity measure, we adopted the options of (Ferret, 2010), resulting from a kind of grid search procedure performed with the extended TOEFL test proposed in (Freitag et al. [sent-110, score-0.598]

39 More precisely, the following characteristics were taken: • distributional contexts made of the co-occurrents collected in a 3-word window centered on each occurrence of the target word in the corpus. [sent-112, score-0.382]
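A minimal sketch of this context extraction, assuming whitespace tokenization (a simplification of the actual preprocessing) and a 3-word window, i.e. one token on each side of each occurrence:

```python
from collections import Counter

def window_cooccurrents(tokens, target, half_window=1):
    """Collect the co-occurrents of `target` found in a window centered on
    each of its occurrences; half_window=1 gives a 3-word window."""
    context = Counter()
    for i, token in enumerate(tokens):
        if token == target:
            lo = max(0, i - half_window)
            hi = min(len(tokens), i + half_window + 1)
            for j in range(lo, hi):
                if j != i:
                    context[tokens[j]] += 1
    return context
```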

40 The building of our initial thesaurus from the similarity measure above was performed classically by extracting the closest semantic neighbors of each of its entries. [sent-114, score-1.292]

41 More precisely, the selected measure was computed between each entry and its possible neighbors. [sent-115, score-0.301]

42 These neighbors were then ranked in the decreasing order of the values of this measure and the first 100 neighbors were kept as the semantic neighbors of the entry. [sent-116, score-1.592]
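The ranking step just described can be sketched as follows; `char_overlap` is a stand-in toy measure, not the distributional similarity actually used, and k=100 matches the cutoff mentioned above:

```python
def top_neighbors(entry, candidates, sim, k=100):
    """Compute the similarity measure between the entry and every candidate,
    rank candidates by decreasing value and keep the first k as the entry's
    semantic neighbors."""
    scored = [(c, sim(entry, c)) for c in candidates if c != entry]
    scored.sort(key=lambda cs: cs[1], reverse=True)
    return scored[:k]

def char_overlap(a, b):
    """Toy similarity: Jaccard overlap of character sets (illustration only)."""
    return len(set(a) & set(b)) / len(set(a) | set(b))
```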

43 Both entries and possible neighbors were AQUAINT-2 nouns whose frequency was higher than 10. [sent-117, score-0.682]

44 1, the starting point of our reranking process is the definition of a model for determining to what extent a word in a sentence, which is not supposed to be known in the context of this task, corresponds or not to a reference word E. [sent-120, score-0.278]

45 In the context of our global objective, we are of course not interested in this task itself but rather in the fact that such a classifier is likely to model the contexts in which E occurs and, as a consequence, is also likely to model its meaning according to the distributional hypothesis. [sent-122, score-0.493]

46 As a consequence of this view, we adopt the same kind of features as the ones used for WSD for building our classifier. [sent-127, score-0.206]

47 1, a specific SVM classifier is trained for each entry of our initial thesaurus, which requires the unsupervised selection of a set of positive and negative examples. [sent-144, score-0.533]

48 The case of positive examples is simple: a fixed number of sentences containing at least one occurrence of the target entry are randomly chosen in the corpus used for building our initial thesaurus and the first occurrence of this entry in the sentence is taken as a positive example. [sent-145, score-1.382]

49 Since we want to characterize words as much as possible from a semantic viewpoint, the selection of negative examples is guided by our initial thesaurus. [sent-146, score-0.312]

50 In practice, taking neighbors with a rather small rank as negative examples is a better option because these examples are more useful in terms of discrimination as they are close to the transition zone between negative and positive examples. [sent-149, score-0.757]

51 Moreover, in order to limit the risk of selecting only false negative examples, three neighbors are taken as negative examples, at ranks 10, 15 and 20 (footnote 2). [sent-150, score-0.578]

52 For each of these negative examples, a fixed number of sentences is selected following the same principles as for positive examples, which means that on average, the number of negative examples is equal to three times the number of positive examples. [sent-151, score-0.241]
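A hedged sketch of this example-selection procedure; the function and parameter names, the sample size and the toy corpus are illustrative assumptions, and only the ranks 10, 15 and 20 and the resulting ~3:1 negative/positive ratio come from the description above:

```python
import random

def select_training_sentences(entry, ranked_neighbors, sentences,
                              n_per_word=5, neg_ranks=(10, 15, 20), seed=0):
    """Unsupervised selection of training sentences: positives contain the
    entry; negatives contain the neighbors at ranks 10, 15 and 20 (1-based)
    of the initial thesaurus, so negatives are ~3x positives on average."""
    rng = random.Random(seed)

    def sample(word):
        hits = [s for s in sentences if word in s.split()]
        return rng.sample(hits, min(n_per_word, len(hits)))

    positives = sample(entry)
    negatives = []
    for rank in neg_ranks:
        if rank <= len(ranked_neighbors):
            negatives.extend(sample(ranked_neighbors[rank - 1]))
    return positives, negatives
```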

53 This ratio reflects the fact that among the neighbors of an entry, the number of those that are semantically similar to the entry is far lower than the number of those that are not. [sent-152, score-0.845]

54 4 Identification of bad neighbors and thesaurus reranking Once a word-in-context classifier has been trained for an entry, it is used for identifying the bad neighbors of this entry, that is to say the neighbors that are not semantically similar to it. [sent-154, score-2.441]

55 As this classifier can only be applied to words in context, a fixed number of representative occurrences have to be selected from our reference corpus for each neighbor of the entry. [sent-155, score-0.41]

56 The application of our word-in-context classifier to each of these occurrences determines whether the context of this occurrence is likely to be compatible with the con- text of an occurrence of the entry. [sent-157, score-0.355]

57 In practice, the decision of the classifier is rarely [...] (Footnote 1: More precisely, an example here is an occurrence of a word in a text but, by extension, we also use the term example to refer to the word itself.) [sent-158, score-0.2]

58 Conversely, a neighbor is defined as “bad” if the number of its reference occurrences tagged positively by our classifier is lower than or equal to G. [sent-163, score-0.343]
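The decision rule reduces to a simple threshold test; `threshold_g` plays the role of the G mentioned above:

```python
def is_bad_neighbor(positive_tags, threshold_g):
    """A neighbor is 'bad' for an entry when the number of its reference
    occurrences tagged positively by the entry's word-in-context classifier
    is lower than or equal to the threshold G."""
    return sum(positive_tags) <= threshold_g
```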

59 The neighbors of an entry identified as bad neighbors are not fully discarded. [sent-164, score-1.355]

60 Among the downgraded neighbors, their initial order is left unchanged. [sent-166, score-0.185]
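The downgrading step amounts to a stable partition of the neighbor list: bad neighbors are moved after the others, and inside each group the initial order is preserved, as described above:

```python
def downgrade_bad_neighbors(neighbors, bad):
    """Rerank an entry's neighbor list: bad neighbors are not discarded but
    moved after all the other neighbors; inside each group the initial
    order is left unchanged (a stable partition)."""
    kept = [n for n in neighbors if n not in bad]
    downgraded = [n for n in neighbors if n in bad]
    return kept + downgraded
```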

61 It should be noted that the word-in-context classifier is not applied to the neighbors whose occurrences are used for its training, as this would frequently lead to the downgrading of these neighbors, which is not necessarily optimal since we chose them with a rather low rank. [sent-167, score-0.681]

62 1 Initial thesaurus evaluation Table 2 shows the results of the evaluation of our initial thesaurus, achieved by comparing the selected semantic neighbors with two complementary reference resources: WordNet 3. [sent-169, score-1.196]

0 synonyms (Miller, 1990) [W], which characterize a semantic similarity based on paradigmatic relations, and the Moby thesaurus (Ward, 1996) [M], which gathers a larger set of types of relations and is more representative of semantic relatedness3. [sent-170, score-0.905]

64 Reference resources were filtered to discard entries and synonyms that are not part of the AQUAINT-2 vocabulary (see the difference between the number of words in the first column and the number of evaluated words in the third column). [sent-190, score-0.203]

65 3 Evaluation of the reranked thesaurus Table 4 gives the evaluation of the application of our reranking method to the initial thesaurus according to the same principles as in section 4. [sent-195, score-1.226]

66 As the recall measure and the precision for the last rank do not change in a reranking process, they are not given again. [sent-198, score-0.19]
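Why a pure reranking leaves recall and last-rank precision unchanged can be checked with a small precision-at-rank sketch (toy neighbor lists and an invented gold set): permuting the same neighbor set changes precision only at intermediate ranks.

```python
def precision_at_k(neighbors, gold, k):
    """Fraction of the first k neighbors that appear in the gold reference."""
    return sum(1 for n in neighbors[:k] if n in gold) / k
```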

67 The first thing to notice is that at the global scale, all measures for all references are significantly improved (see footnote 6), which means that our hypothesis about the possibility for a discriminative classifier to capture the meaning of a word tends to be validated. [sent-199, score-0.242]

68 First, the improvement of results is particularly effective for middle frequency entries, then for low frequency and finally, for high frequency entries. [sent-203, score-0.205]

69 Because of their already high level in the initial thesaurus, results for high frequency entries are difficult to improve but it is important to note that our selection of bad neighbors has a very low error rate, which at least preserves these results. [sent-204, score-0.83]

70 This is confirmed by the fact that, with WordNet as reference, only 744 neighbors were found wrongly downgraded, spread over 686 entries, which represents only 5% of all downgraded neighbors. [sent-205, score-0.594]

71 The second main trend of Table 4 [...] (Footnote 5: The use of W as reference is justified by the fact that the number of synonyms for an entry in W is more compatible, especially for R-precision, with the real use of the resulting thesaurus in an application.) [sent-206, score-0.862]

72 (Footnote 6: The statistical significance of differences with the initial thesaurus was evaluated by a paired Wilcoxon test with p-value < 0. [...]) [sent-207, score-0.556]

[Table 5: WordNet and Moby reference words for the entry esteem]

73 Table 5 illustrates more precisely the impact of our reranking procedure for the middle frequency entry esteem. [sent-216, score-0.558]

74 Its WordNet row gives all the reference synonyms for this entry in WordNet while its Moby row gives the first reference related words for this entry in Moby. [sent-217, score-0.748]

75 In our initial thesaurus, the first two neighbors of esteem that are present in our reference resources are admiration (rank 3) and respect (rank 7). [sent-218, score-0.638]

76 The reranking produces a thesaurus in which these two words appear as the second and the third neighbors of the entry because neighbors without clear relation with it such as backscratching were downgraded while its third synonym in WordNet is raised from rank 22 to rank 15. [sent-219, score-0.454]

77 Moreover, the number of neighbors among the first 15 ones that are present in Moby increases from 3 to 5. [sent-220, score-0.492]

78 5 Related work The building of distributional thesauri is generally viewed as an application or a mode of evaluation of work about semantic similarity or semantic relatedness. [sent-221, score-1.013]

79 As a consequence, the improvement of such thesauri is generally not directly addressed but is a possible consequence of the improvement of semantic similarity measures. [sent-222, score-0.759]

80 However, the extent of this improvement is rarely evaluated as most of the work about semantic similarity is evaluated on datasets such as the WordSim-353 test collection (Gabrilovich and Markovitch, 2007), which are only partially representative of the results for thesaurus building. [sent-223, score-0.707]

, 2009) proposes a new weighting scheme of words in distributional contexts that replaces the weight of a word by a function of its rank in the context, which is a way to be less dependent on the values of a particular weighting function. [sent-227, score-0.463]
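A sketch of such a rank-based weighting; the choice of 1/rank as the function is a hypothetical example, not necessarily the scheme of the cited work:

```python
def rank_based_weights(context):
    """Replace each context word's raw weight by a function of its rank in
    the context (here 1/rank), making the representation less dependent on
    the values produced by a particular weighting function."""
    ranked = sorted(context, key=context.get, reverse=True)
    return {word: 1.0 / (i + 1) for i, word in enumerate(ranked)}
```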

82 (Zhitomirsky-Geffet and Dagan, 2009) shares with our work the use of bootstrapping by relying on an initial thesaurus to derive means of improving it. [sent-228, score-0.629]

83 More specifically, (Zhitomirsky-Geffet and Dagan, 2009) assumes that the first neighbors of an entry are more relevant than the others and as a consequence, that their most significant features are also representative of the meaning of the entry. [sent-229, score-0.833]

84 The neighbors of the entry are reranked according to this hypothesis by increasing the weight of these features to favor their influence in the distributional contexts that support the evaluation of the similarity between the entry and its neighbors. [sent-230, score-1.513]

85 One main difference between all these works and ours is that they assume that the initial thesaurus was built by relying on distributional contexts represented as bags-of-words. [sent-232, score-0.875]

86 Our method does not make this assumption as its reranking is based on a classifier built in an unsupervised way7 from and applied to the corpus used for building the initial thesaurus. [sent-233, score-0.339]

87 If we focus more specifically on the improvement of distributional thesauri, (Ferret, 2012) is the most comparable work to ours, both because it is specifically focused on this task and it is based on the same evaluation framework. [sent-235, score-0.242]

88 One of the objectives of (Ferret, 2012) was to rebalance the initial thesaurus in favor of low frequency entries. [sent-237, score-0.615]

89 Although this objective was reached, the resulting thesaurus tends to have a lower performance than the initial thesaurus for high frequency entries and for synonyms. [sent-238, score-1.183]

90 The problem with high frequency entries comes from the fact that applying a machine learning classifier to its training examples does not lead to a perfect result. [sent-239, score-0.303]

91 In both cases, the method proposed in (Ferret, 2012) faces the problem of relying only on the distributional thesaurus it tries to improve. [sent-242, score-0.74]

92 This is an important difference with the method presented in this article, which mainly exploits the context of the occurrences of words in the corpus used for building the initial thesaurus. [sent-243, score-0.283]

93 As a consequence, at a global scale, our reranked thesaurus outperforms the final thesaurus of (Ferret, 2012) for nearly all measures. [sent-244, score-0.996]

94 6 Conclusion and perspectives In this article, we have presented a new approach for reranking the semantic neighbors of a distributional thesaurus. [sent-247, score-0.922]

95 This approach relies on the unsupervised building of discriminative classifiers dedicated to the identification of its entries in texts, with the objective to characterize their meaning according to the distributional hypothesis. [sent-248, score-0.539]

96 The classifier built for an entry is then applied to a set of occurrences of its neighbors for identifying and downgrading those that are not semantically related to the entry. [sent-249, score-1.075]

97 The proposed method was tested on a large thesaurus of nouns for English and led to a significant improvement of this thesaurus, especially for middle and low frequency entries and for semantic relatedness. [sent-250, score-0.776]

98 Testing semantic similarity measures for extracting synonyms from a corpus. [sent-285, score-0.26]

99 Combining bootstrapping and feature selection for improving a distributional thesaurus. [sent-289, score-0.315]

100 A bayesian method for robust estimation of distributional similarities. [sent-330, score-0.242]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('neighbors', 0.492), ('thesaurus', 0.473), ('entry', 0.27), ('distributional', 0.242), ('moby', 0.163), ('ferret', 0.143), ('consequence', 0.122), ('classifier', 0.104), ('reranking', 0.103), ('downgraded', 0.102), ('bad', 0.101), ('precisely', 0.098), ('entries', 0.095), ('neighbor', 0.091), ('semantic', 0.085), ('occurrences', 0.085), ('initial', 0.083), ('semantically', 0.083), ('similarity', 0.079), ('contexts', 0.077), ('thesauri', 0.074), ('grefenstette', 0.069), ('broda', 0.066), ('occurrence', 0.063), ('reference', 0.063), ('asakura', 0.061), ('frequency', 0.059), ('relatedness', 0.058), ('dagan', 0.057), ('synonyms', 0.056), ('rank', 0.056), ('curran', 0.054), ('paradigmatic', 0.053), ('neighboring', 0.052), ('wsd', 0.052), ('article', 0.051), ('yamamoto', 0.05), ('reranked', 0.05), ('olivier', 0.049), ('building', 0.049), ('moens', 0.048), ('collocations', 0.046), ('examples', 0.045), ('relations', 0.044), ('principles', 0.044), ('wordnet', 0.043), ('classical', 0.043), ('supposed', 0.043), ('freitag', 0.043), ('negative', 0.043), ('downgrading', 0.041), ('representative', 0.041), ('measures', 0.04), ('gabrilovich', 0.04), ('reisinger', 0.04), ('context', 0.04), ('characterized', 0.039), ('budanitsky', 0.039), ('synonym', 0.038), ('bootstrapping', 0.038), ('heylen', 0.036), ('tackled', 0.036), ('nouns', 0.036), ('discriminative', 0.035), ('improving', 0.035), ('kind', 0.035), ('hypothesis', 0.033), ('kazama', 0.033), ('viewpoint', 0.033), ('referring', 0.033), ('alexandrescu', 0.033), ('positive', 0.033), ('hirst', 0.033), ('collocation', 0.031), ('pedersen', 0.031), ('syntagmatic', 0.031), ('firth', 0.031), ('halliday', 0.031), ('cea', 0.031), ('measure', 0.031), ('weighting', 0.031), ('characterize', 0.03), ('meaning', 0.03), ('dirk', 0.03), ('proposals', 0.03), ('extent', 0.029), ('identification', 0.029), ('relies', 0.029), ('markovitch', 0.028), ('middle', 0.028), ('treetagger', 0.027), ('zesch', 0.027), ('wisdom', 0.027), ('wm', 0.027), ('linked', 
0.026), ('synonymy', 0.026), ('morris', 0.026), ('words', 0.026), ('faces', 0.025), ('widespread', 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999994 185 acl-2013-Identifying Bad Semantic Neighbors for Improving Distributional Thesauri

Author: Olivier Ferret


2 0.14257373 58 acl-2013-Automated Collocation Suggestion for Japanese Second Language Learners

Author: Lis Pereira ; Erlyn Manguilimotan ; Yuji Matsumoto

Abstract: This study addresses issues of Japanese language learning concerning word combinations (collocations). Japanese learners may be able to construct grammatically correct sentences; however, these may sound “unnatural”. In this work, we analyze correct word combinations using different collocation measures and word similarity methods. While other methods use well-formed text, our approach makes use of a large Japanese language learner corpus for generating collocation candidates, in order to build a system that is more sensitive to constructions that are difficult for learners. Our results show that we get better results compared to other methods that use only well-formed text.

3 0.14249435 116 acl-2013-Detecting Metaphor by Contextual Analogy

Author: Eirini Florou

Abstract: As one of the most challenging issues in NLP, metaphor identification and its interpretation have seen many models and methods proposed. This paper presents a study on metaphor identification based on the semantic similarity between literal and non literal meanings of words that can appear at the same context.

4 0.13202636 43 acl-2013-Align, Disambiguate and Walk: A Unified Approach for Measuring Semantic Similarity

Author: Mohammad Taher Pilehvar ; David Jurgens ; Roberto Navigli

Abstract: Semantic similarity is an essential component of many Natural Language Processing applications. However, prior methods for computing semantic similarity often operate at different levels, e.g., single words or entire documents, which requires adapting the method for each data type. We present a unified approach to semantic similarity that operates at multiple levels, all the way from comparing word senses to comparing text documents. Our method leverages a common probabilistic representation over word senses in order to compare different types of linguistic data. This unified representation shows state-of-the-art performance on three tasks: semantic textual similarity, word similarity, and word sense coarsening.

5 0.13159131 96 acl-2013-Creating Similarity: Lateral Thinking for Vertical Similarity Judgments

Author: Tony Veale ; Guofu Li

Abstract: Just as observing is more than just seeing, comparing is far more than mere matching. It takes understanding, and even inventiveness, to discern a useful basis for judging two ideas as similar in a particular context, especially when our perspective is shaped by an act of linguistic creativity such as metaphor, simile or analogy. Structured resources such as WordNet offer a convenient hierarchical means for converging on a common ground for comparison, but offer little support for the divergent thinking that is needed to creatively view one concept as another. We describe such a means here, by showing how the web can be used to harvest many divergent views for many familiar ideas. These lateral views complement the vertical views of WordNet, and support a system for idea exploration called Thesaurus Rex. We show also how Thesaurus Rex supports a novel, generative similarity measure for WordNet. 1 Seeing is Believing (and Creating) Similarity is a cognitive phenomenon that is both complex and subjective, yet for practical reasons it is often modeled as if it were simple and objective. This makes sense for the many situations where we want to align our similarity judgments with those of others, and thus focus on the same conventional properties that others are also likely to focus upon. This reliance on the consensus viewpoint explains why WordNet (Fellbaum, 1998) has proven so useful as a basis for computational measures of lexico-semantic similarity Guofu Li School of Computer Science and Informatics, University College Dublin, Belfield, Dublin D2, Ireland. l .guo fu . l gmai l i @ .com (e.g. see Pederson et al. 2004, Budanitsky & Hirst, 2006; Seco et al. 2006). These measures reduce the similarity of two lexical concepts to a single number, by viewing similarity as an objective estimate of the overlap in their salient qualities. 
This convenient perspective is poorly suited to creative or insightful comparisons, but it is sufficient for the many mundane comparisons we often perform in daily life, such as when we organize books or look for items in a supermarket. So if we do not know in which aisle to locate a given item (such as oatmeal), we may tacitly know how to locate a similar product (such as cornflakes) and orient ourselves accordingly. Yet there are occasions when the recognition of similarities spurs the creation of similarities, when the act of comparison spurs us to invent new ways of looking at an idea. By placing pop tarts in the breakfast aisle, food manufacturers encourage us to view them as a breakfast food that is not dissimilar to oatmeal or cornflakes. When ex-PM Tony Blair published his memoirs, a mischievous activist encouraged others to move his book from Biography to Fiction in bookshops, in the hope that buyers would see it in a new light. Whenever we use a novel metaphor to convey a non-obvious viewpoint on a topic, such as “cigarettes are time bombs”, the comparison may spur us to insight, to see aspects of the topic that make it more similar to the vehicle (see Ortony, 1979; Veale & Hao, 2007). In formal terms, assume agent A has an insight about concept X, and uses the metaphor X is a Y to also provoke this insight in agent B. To arrive at this insight for itself, B must intuit what X and Y have in common. But this commonality is surely more than a standard categorization of X, or else it would not count as an insight about X. To understand the metaphor, B must place X 660 Proce dingSsof oifa, th Beu 5l1gsarti Aan,An u aglu Mste 4e-ti9n2g 0 o1f3 t.he ?c A2s0s1o3ci Aatsiosonc fioartio Cno fmorpu Ctoamtiopnuatalt Lioin gauli Lsitnicgsu,i psatgices 6 0–670, in a new category, so that X can be seen as more similar to Y. Metaphors shape the way we per- ceive the world by re-shaping the way we make similarity judgments. 
So if we want to imbue computers with the ability to make and to understand creative metaphors, we must first give them the ability to look beyond the narrow viewpoints of conventional resources. Any measure that models similarity as an objective function of a conventional worldview employs a convergent thought process. Using WordNet, for instance, a similarity measure can vertically converge on a common superordinate category of both inputs, and generate a single numeric result based on their distance to, and the information content of, this common generalization. So to find the most conventional ways of seeing a lexical concept, one simply ascends a narrowing concept hierarchy, using a process de Bono (1970) calls vertical thinking. To find novel, non-obvious and useful ways of looking at a lexical concept, one must use what Guilford (1967) calls divergent thinking and what de Bono calls lateral thinking. These processes cut across familiar category boundaries, to simultaneously place a concept in many different categories so that we can see it in many different ways. de Bono argues that vertical thinking is selective while lateral thinking is generative. Whereas vertical thinking concerns itself with the “right” way or a single “best” way of looking at things, lateral thinking focuses on producing alternatives to the status quo. To be as useful for creative tasks as they are for conventional tasks, we need to re-imagine our computational similarity measures as generative rather than selective, expansive rather than reductive, divergent as well as convergent and lateral as well as vertical. Though WordNet is ideally structured to support vertical, convergent reasoning, its comprehensive nature means it can also be used as a solid foundation for building a more lateral and divergent model of similarity. Here we will use the web as a source of diverse perspectives on familiar ideas, to complement the conventional and often narrow views codified by WordNet. 
Section 2 provides a brief overview of past work in the area of similarity measurement, before section 3 describes a simple bootstrapping loop for acquiring richly diverse perspectives from the web for a wide variety of familiar ideas. These perspectives are used to enhance a WordNet-based measure of lexico-semantic similarity in section 4, by broadening the range of informative viewpoints the measure can select from. Similarity is thus modeled as a process that is both generative and selective. This lateral-and-vertical approach is evaluated in section 5, on the Miller & Charles (1991) data-set. A web app for the lateral exploration of diverse viewpoints, named Thesaurus Rex, is also presented, before closing remarks are offered in section 6. 2 Related Work and Ideas WordNet’s taxonomic organization of noun-senses and verb-senses – in which very general categories are successively divided into increasingly informative sub-categories or instance-level ideas – allows us to gauge the overlap in information content, and thus of meaning, of two lexical concepts. We need only identify the deepest point in the taxonomy at which this content starts to diverge. This point of divergence is often called the LCS, or least common subsumer, of two concepts (Pedersen et al., 2004). Since sub-categories add new properties to those they inherit from their parents – Aristotle called these properties the differentia that stop a category system from trivially collapsing into itself – the depth of a lexical concept in a taxonomy is an intuitive proxy for its information content. Wu & Palmer (1994) use the depth of a lexical concept in the WordNet hierarchy as such a proxy, and thereby estimate the similarity of two lexical concepts as twice the depth of their LCS divided by the sum of their individual depths. Leacock and Chodorow (1998) instead use the length of the shortest path between two concepts as a proxy for the conceptual distance between them.
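The two path-based measures just described reduce to short formulas, which can be made concrete on a toy taxonomy. The hierarchy, function names, and depth bound below are hypothetical, chosen purely for illustration:

```python
import math

# Hypothetical is-a hierarchy: child -> parent (for illustration only)
PARENT = {
    "corgi": "dog", "poodle": "dog", "dog": "canine",
    "wolf": "canine", "canine": "mammal", "mammal": "animal",
}

def ancestors(c):
    """Chain from c up to the root, inclusive of c itself."""
    chain = [c]
    while c in PARENT:
        c = PARENT[c]
        chain.append(c)
    return chain

def depth(c):
    return len(ancestors(c))  # the root "animal" has depth 1

def lcs(a, b):
    """Least common subsumer: the deepest shared ancestor."""
    return max(set(ancestors(a)) & set(ancestors(b)), key=depth)

def wu_palmer(a, b):
    # Wu & Palmer (1994): 2 * depth(LCS) / (depth(a) + depth(b))
    return 2.0 * depth(lcs(a, b)) / (depth(a) + depth(b))

def leacock_chodorow(a, b, max_depth=5):
    # Leacock & Chodorow (1998): -log(path / (2 * max_depth)),
    # where the shortest path is counted in nodes
    path = depth(a) + depth(b) - 2 * depth(lcs(a, b)) + 1
    return -math.log(path / (2.0 * max_depth))
```

On this toy hierarchy, sibling breeds (corgi, poodle) score higher under both measures than the more distant corgi/wolf pairing, since their LCS (dog) sits deeper than canine.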
To connect any two ideas in a hierarchical system, one must vertically ascend the hierarchy from one concept, change direction at a potential LCS, and then descend the hierarchy to reach the second concept. (Aristotle was also first to suggest this approach in his Poetics). Leacock and Chodorow normalize the length of this path by dividing its size (in nodes) by twice the depth of the deepest concept in the hierarchy; the latter is an upper bound on the distance between any two concepts in the hierarchy. Negating the log of this normalized length yields a corresponding similarity score. While the role of an LCS is merely implied in Leacock and Chodorow’s use of a shortest path, the LCS is pivotal nonetheless, and like that of Wu & Palmer, the approach uses an essentially vertical reasoning process to identify a single “best” generalization. Depth is a convenient proxy for information content, but more nuanced proxies can yield more rounded similarity measures. Resnik (1995) draws on information theory to define the information content of a lexical concept as the negative log likelihood of its occurrence in a corpus, either explicitly (via a direct mention) or by presupposition (via a mention of any of its sub-categories or instances). Since the likelihood of a general category occurring in a corpus is higher than that of any of its sub-categories or instances, such categories are more predictable, and less informative, than rarer categories whose occurrences are less predictable and thus more informative. The negative log likelihood of the most informative LCS of two lexical concepts offers a reliable estimate of the amount of information shared by those concepts, and thus a good estimate of their similarity.
Lin (1998) combines the intuitions behind Resnik’s metric and that of Wu and Palmer to estimate the similarity of two lexical concepts as an information ratio: twice the information content of their LCS divided by the sum of their individual information contents. Jiang and Conrath (1997) consider the converse notion of dissimilarity, noting that two lexical concepts are dissimilar to the extent that each contains information that is not shared by the other. So if the information content of their most informative LCS is a good measure of what they do share, then the sum of their individual information contents, minus twice the content of their most informative LCS, is a reliable estimate of their dissimilarity. Seco et al. (2006) present a minor innovation, showing how Resnik’s notion of information content can be calculated without the use of an external corpus. Rather, when using Resnik’s metric (or that of Lin, or Jiang and Conrath) for measuring the similarity of lexical concepts in WordNet, one can use the category structure of WordNet itself to estimate information content. Typically, the more general a concept, the more descendants it will possess. Seco et al. thus estimate the information content of a lexical concept as the log of the sum of all its unique descendants (both direct and indirect), divided by the log of the total number of concepts in the entire hierarchy. Not only is this intrinsic view of information content convenient to use, without recourse to an external corpus, but Seco et al. also show that it offers a better estimate of information content than its extrinsic, corpus-based alternatives, as measured relative to average human similarity ratings for the 30 word-pairs in the Miller & Charles (1991) test set. A similarity measure can draw on other sources of information besides WordNet’s category structures.
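The information-content family of measures described above can be sketched in the same spirit. The hierarchy and corpus counts below are invented, and each count is assumed to already subsume the counts of its descendants, as Resnik's estimate requires:

```python
import math

# Hypothetical is-a hierarchy and made-up corpus counts; each concept's
# count includes the occurrences of all of its descendants.
PARENT = {
    "corgi": "dog", "poodle": "dog", "dog": "canine",
    "wolf": "canine", "canine": "mammal", "mammal": "animal",
}
COUNT = {"corgi": 2, "poodle": 3, "dog": 10, "wolf": 4,
         "canine": 15, "mammal": 40, "animal": 100}

def ancestors(c):
    chain = [c]
    while c in PARENT:
        c = PARENT[c]
        chain.append(c)
    return chain

def ic(c):
    """Resnik's information content: -log p(c), p from corpus counts."""
    return -math.log(COUNT[c] / COUNT["animal"])

def mis(a, b):
    """Most informative subsumer: the shared ancestor with highest IC."""
    return max(set(ancestors(a)) & set(ancestors(b)), key=ic)

def resnik(a, b):
    return ic(mis(a, b))

def lin(a, b):
    # information ratio: 2 * IC(LCS) / (IC(a) + IC(b))
    return 2.0 * resnik(a, b) / (ic(a) + ic(b))

def jiang_conrath_dist(a, b):
    # dissimilarity: IC(a) + IC(b) - 2 * IC(LCS)
    return ic(a) + ic(b) - 2.0 * resnik(a, b)

def intrinsic_ic(c):
    """Seco et al.'s corpus-free IC, from descendant counts alone:
    1 - log(hyponyms(c) + 1) / log(total concepts)."""
    hyponyms = sum(1 for x in PARENT if c in ancestors(x)[1:])
    return 1.0 - math.log(hyponyms + 1) / math.log(len(COUNT))
```

As the prose predicts, the root concept carries zero information under both the corpus-based and the intrinsic estimate, while leaves are maximally informative.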
One might eke out additional information from WordNet’s textual glosses, as in Lesk (1986), or use category structures other than those offered by WordNet. Looking beyond WordNet, entries in the online encyclopedia Wikipedia are not only connected by a dense topology of lateral links, they are also organized by a rich hierarchy of overlapping categories. Strube and Ponzetto (2006) show how Wikipedia can support a measure of similarity (and relatedness) that better approximates human judgments than many WordNet-based measures. Nonetheless, WordNet can be a valuable component of a hybrid measure, and Agirre et al. (2009) use an SVM (support vector machine) to combine information from WordNet with information harvested from the web. Their best similarity measure achieves a remarkable 0.93 correlation with human judgments on the Miller & Charles word-pair set. Similarity is not always applied to pairs of concepts; it is sometimes analogically applied to pairs of pairs of concepts, as in proportional analogies of the form A is to B as C is to D (e.g., hacks are to writers as mercenaries are to soldiers, or chisels are to sculptors as scalpels are to surgeons). In such analogies, one is really assessing the similarity of the unstated relationship between each pair of concepts: thus, mercenaries are soldiers whose allegiance is paid for, much as hacks are writers with income-driven loyalties; sculptors use chisels to carve stone, while surgeons use scalpels to cut or carve flesh. Veale (2004) used WordNet to assess the similarity of A:B to C:D as a function of the combined similarity of A to C and of B to D. In contrast, Turney (2005) used the web to pursue a more divergent course, to represent the tacit relationships of A to B and of C to D as points in a high-dimensional space. The dimensions of this space initially correspond to linking phrases on the web, before these dimensions are significantly reduced using singular value decomposition.
In the infamous SAT test, an analogy A:B::C:D has four other pairs of concepts that serve as likely distractors (e.g. singer:songwriter for hack:writer) and the goal is to choose the most appropriate C:D pair for a given A:B pairing. Using variants of Wu and Palmer (1994) on the 374 SAT analogies of Turney (2005), Veale (2004) reports a success rate of 38–44% using only WordNet-based similarity. In contrast, Turney (2005) reports up to 55% success on the same analogies, partly because his approach aims to match implicit relations rather than explicit concepts, and in part because it uses a divergent process to gather from the web as rich a perspective as it can on these latent relationships. 2.1 Clever Comparisons Create Similarity Each of these approaches to similarity is a user of information, rather than a creator, and each fails to capture how a creative comparison (such as a metaphor) can spur a listener to view a topic from an atypical perspective. Camac & Glucksberg (1984) provide experimental evidence for the claim that “metaphors do not use preexisting associations to achieve their effects [… ] people use metaphors to create new relations between concepts.” They also offer a salutary reminder of an often overlooked fact: every comparison exploits information, but each is also a source of new information in its own right. Thus, “this cola is acid” reveals a different perspective on cola (e.g., as a corrosive substance or an irritating food) than “this acid is cola” highlights for acid (e.g., as a familiar substance). Veale & Keane (1994) model the role of similarity in realizing the long-term perlocutionary effect of an informative comparison. For example, to compare surgeons to butchers is to encourage one to see all surgeons as more bloody, crude or careless. The reverse comparison, of butchers to surgeons, encourages one to see butchers as more skilled and precise.
Veale & Keane present a network model of memory, called Sapper, in which activation can spread between related concepts, thus allowing one concept to prime the properties of a neighbor. To interpret an analogy, Sapper lays down new activation-carrying bridges in memory between analogical counterparts, such as between surgeon & butcher, flesh & meat, and scalpel & cleaver. Comparisons can thus have lasting effects on how Sapper sees the world, changing the pattern of activation that arises when it primes a concept. Veale (2003) adopts a similarly dynamic view of similarity in WordNet, showing how an analogical comparison can result in the automatic addition of new categories and relations to WordNet itself. Veale considers the problem of finding an analogical mapping between different parts of WordNet’s noun-sense hierarchy, such as between instances of Greek god and Norse god, or between the letters of different alphabets, such as of Greek and Hebrew. But no structural similarity measure for WordNet exhibits enough discernment to e.g. assign a higher similarity to Zeus & Odin (each is the supreme deity of its pantheon) than to a pairing of Zeus and any other Norse god, just as no structural measure will assign a higher similarity to Alpha & Aleph or to Beta & Beth than to any random letter pairing. A fine-grained category hierarchy permits fine-grained similarity judgments, and though WordNet is useful, its sense hierarchies are not especially fine-grained. However, we can automatically make WordNet subtler and more discerning, by adding new fine-grained categories to unite lexical concepts whose similarity is not reflected by any existing categories. Veale (2003) shows how a property that is found in the glosses of two lexical concepts, of the same depth, can be combined with their LCS to yield a new fine-grained parent category, so e.g. “supreme” + deity = Supreme-deity (for Odin, Zeus, Jupiter, etc.) and “1st” + letter = 1st-letter (for Alpha, Aleph, etc.)
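This category-minting step can be sketched roughly as follows; the glosses, the stop-word list, and the assumption that "deity" is the relevant LCS are all invented for illustration:

```python
# Rough sketch of Veale's (2003) idea: when two concepts at the same
# taxonomic depth share a salient gloss property, combine that property
# with their LCS to mint a new fine-grained parent category.
# The glosses and the fixed LCS below are hypothetical.

GLOSS = {
    "Zeus": "supreme deity of the Greek pantheon",
    "Odin": "supreme deity of the Norse pantheon",
    "Loki": "trickster deity of the Norse pantheon",
}
LCS = "deity"                          # assumed least common subsumer
STOP = {LCS, "of", "the", "pantheon"}  # words that cannot act as properties

def mint_categories(a, b):
    """Return new fine-grained categories shared by concepts a and b."""
    shared = set(GLOSS[a].split()) & set(GLOSS[b].split())
    return [f"{p.capitalize()}-{LCS}" for p in sorted(shared - STOP)]

print(mint_categories("Zeus", "Odin"))  # ['Supreme-deity']
print(mint_categories("Zeus", "Loki"))  # [] -- no shared salient property
```

Under the new Supreme-deity category, Zeus & Odin now share a parent that Zeus and any other Norse god do not, which is exactly the discernment the prose above asks for.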
Selected aspects of the textual similarity of two WordNet glosses – the key to similarity in Lesk (1986) – can thus be reified into an explicitly categorical WordNet form. 3 Divergent (Re)Categorization To tap into a richer source of concept properties than WordNet’s glosses, we can use web n-grams. Consider these descriptions of a cowboy from the Google n-grams (Brants & Franz, 2006). The numbers to the right are Google frequency counts.

a lonesome cowboy 432
a mounted cowboy 122
a grizzled cowboy 74
a swaggering cowboy 68

To find the stable properties that can underpin a meaningful fine-grained category for cowboy, we must seek out the properties that are so often presupposed to be salient of all cowboys that one can use them to anchor a simile, such as

6 0.1289272 238 acl-2013-Measuring semantic content in distributional vectors

7 0.10736459 31 acl-2013-A corpus-based evaluation method for Distributional Semantic Models

8 0.10504702 12 acl-2013-A New Set of Norms for Semantic Relatedness Measures

9 0.10385488 111 acl-2013-Density Maximization in Context-Sense Metric Space for All-words WSD

10 0.10192899 291 acl-2013-Question Answering Using Enhanced Lexical Semantic Models

11 0.094416246 174 acl-2013-Graph Propagation for Paraphrasing Out-of-Vocabulary Words in Statistical Machine Translation

12 0.08680515 76 acl-2013-Building and Evaluating a Distributional Memory for Croatian

13 0.086680248 301 acl-2013-Resolving Entity Morphs in Censored Data

14 0.08294297 113 acl-2013-Derivational Smoothing for Syntactic Distributional Semantics

15 0.082448587 32 acl-2013-A relatedness benchmark to test the role of determiners in compositional distributional semantics

16 0.082079716 93 acl-2013-Context Vector Disambiguation for Bilingual Lexicon Extraction from Comparable Corpora

17 0.081460319 22 acl-2013-A Structured Distributional Semantic Model for Event Co-reference

18 0.081251651 87 acl-2013-Compositional-ly Derived Representations of Morphologically Complex Words in Distributional Semantics

19 0.080385715 154 acl-2013-Extracting bilingual terminologies from comparable corpora

20 0.079876445 27 acl-2013-A Two Level Model for Context Sensitive Inference Rules


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.188), (1, 0.061), (2, 0.028), (3, -0.153), (4, -0.046), (5, -0.109), (6, -0.1), (7, 0.046), (8, 0.0), (9, 0.011), (10, -0.037), (11, 0.046), (12, 0.048), (13, -0.075), (14, 0.066), (15, 0.074), (16, -0.003), (17, -0.045), (18, -0.049), (19, -0.027), (20, 0.045), (21, -0.015), (22, 0.081), (23, -0.008), (24, -0.0), (25, 0.062), (26, 0.026), (27, 0.033), (28, -0.085), (29, 0.006), (30, -0.047), (31, -0.017), (32, 0.01), (33, -0.053), (34, 0.035), (35, 0.075), (36, 0.084), (37, 0.056), (38, 0.035), (39, -0.043), (40, 0.003), (41, -0.015), (42, -0.017), (43, 0.019), (44, 0.008), (45, -0.003), (46, -0.026), (47, -0.044), (48, 0.037), (49, -0.094)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95701939 185 acl-2013-Identifying Bad Semantic Neighbors for Improving Distributional Thesauri

Author: Olivier Ferret

Abstract: Distributional thesauri are now widely used in a large number of Natural Language Processing tasks. However, they are far from containing only interesting semantic relations. As a consequence, improving such thesaurus is an important issue that is mainly tackled indirectly through the improvement of semantic similarity measures. In this article, we propose a more direct approach focusing on the identification of the neighbors of a thesaurus entry that are not semantically linked to this entry. This identification relies on a discriminative classifier trained from unsupervised selected examples for building a distributional model of the entry in texts. Its bad neighbors are found by applying this classifier to a representative set of occurrences of each of these neighbors. We evaluate the interest of this method for a large set of English nouns with various frequencies.

2 0.84145457 238 acl-2013-Measuring semantic content in distributional vectors

Author: Aurelie Herbelot ; Mohan Ganesalingam

Abstract: Some words are more contentful than others: for instance, make is intuitively more general than produce and fifteen is more ‘precise’ than a group. In this paper, we propose to measure the ‘semantic content’ of lexical items, as modelled by distributional representations. We investigate the hypothesis that semantic content can be computed using the Kullback-Leibler (KL) divergence, an information-theoretic measure of the relative entropy of two distributions. In a task focusing on retrieving the correct ordering of hyponym-hypernym pairs, the KL divergence achieves close to 80% precision but does not outperform a simpler (linguistically unmotivated) frequency measure. We suggest that this result illustrates the rather ‘intensional’ aspect of distributions.

3 0.80746728 12 acl-2013-A New Set of Norms for Semantic Relatedness Measures

Author: Sean Szumlanski ; Fernando Gomez ; Valerie K. Sims

Abstract: We have elicited human quantitative judgments of semantic relatedness for 122 pairs of nouns and compiled them into a new set of relatedness norms that we call Rel-122. Judgments from individual subjects in our study exhibit high average correlation to the resulting relatedness means (r = 0.77, σ = 0.09, N = 73), although not as high as Resnik’s (1995) upper bound for expected average human correlation to similarity means (r = 0.90). This suggests that human perceptions of relatedness are less strictly constrained than perceptions of similarity and establishes a clearer expectation for what constitutes human-like performance by a computational measure of semantic relatedness. We compare the results of several WordNet-based similarity and relatedness measures to our Rel-122 norms and demonstrate the limitations of WordNet for discovering general indications of semantic relatedness. We also offer a critique of the field’s reliance upon similarity norms to evaluate relatedness measures.

4 0.79980397 96 acl-2013-Creating Similarity: Lateral Thinking for Vertical Similarity Judgments

Author: Tony Veale ; Guofu Li

Abstract: Just as observing is more than just seeing, comparing is far more than mere matching. It takes understanding, and even inventiveness, to discern a useful basis for judging two ideas as similar in a particular context, especially when our perspective is shaped by an act of linguistic creativity such as metaphor, simile or analogy. Structured resources such as WordNet offer a convenient hierarchical means for converging on a common ground for comparison, but offer little support for the divergent thinking that is needed to creatively view one concept as another. We describe such a means here, by showing how the web can be used to harvest many divergent views for many familiar ideas. These lateral views complement the vertical views of WordNet, and support a system for idea exploration called Thesaurus Rex. We show also how Thesaurus Rex supports a novel, generative similarity measure for WordNet. 1 Seeing is Believing (and Creating) Similarity is a cognitive phenomenon that is both complex and subjective, yet for practical reasons it is often modeled as if it were simple and objective. This makes sense for the many situations where we want to align our similarity judgments with those of others, and thus focus on the same conventional properties that others are also likely to focus upon. This reliance on the consensus viewpoint explains why WordNet (Fellbaum, 1998) has proven so useful as a basis for computational measures of lexico-semantic similarity (e.g. see Pedersen et al. 2004, Budanitsky & Hirst, 2006; Seco et al. 2006). These measures reduce the similarity of two lexical concepts to a single number, by viewing similarity as an objective estimate of the overlap in their salient qualities.
This convenient perspective is poorly suited to creative or insightful comparisons, but it is sufficient for the many mundane comparisons we often perform in daily life, such as when we organize books or look for items in a supermarket. So if we do not know in which aisle to locate a given item (such as oatmeal), we may tacitly know how to locate a similar product (such as cornflakes) and orient ourselves accordingly. Yet there are occasions when the recognition of similarities spurs the creation of similarities, when the act of comparison spurs us to invent new ways of looking at an idea. By placing pop tarts in the breakfast aisle, food manufacturers encourage us to view them as a breakfast food that is not dissimilar to oatmeal or cornflakes. When ex-PM Tony Blair published his memoirs, a mischievous activist encouraged others to move his book from Biography to Fiction in bookshops, in the hope that buyers would see it in a new light. Whenever we use a novel metaphor to convey a non-obvious viewpoint on a topic, such as “cigarettes are time bombs”, the comparison may spur us to insight, to see aspects of the topic that make it more similar to the vehicle (see Ortony, 1979; Veale & Hao, 2007). In formal terms, assume agent A has an insight about concept X, and uses the metaphor X is a Y to also provoke this insight in agent B. To arrive at this insight for itself, B must intuit what X and Y have in common. But this commonality is surely more than a standard categorization of X, or else it would not count as an insight about X. To understand the metaphor, B must place X 660 Proce dingSsof oifa, th Beu 5l1gsarti Aan,An u aglu Mste 4e-ti9n2g 0 o1f3 t.he ?c A2s0s1o3ci Aatsiosonc fioartio Cno fmorpu Ctoamtiopnuatalt Lioin gauli Lsitnicgsu,i psatgices 6 0–670, in a new category, so that X can be seen as more similar to Y. Metaphors shape the way we per- ceive the world by re-shaping the way we make similarity judgments. 
So if we want to imbue computers with the ability to make and to understand creative metaphors, we must first give them the ability to look beyond the narrow viewpoints of conventional resources. Any measure that models similarity as an objective function of a conventional worldview employs a convergent thought process. Using WordNet, for instance, a similarity measure can vertically converge on a common superordinate category of both inputs, and generate a single numeric result based on their distance to, and the information content of, this common generalization. So to find the most conventional ways of seeing a lexical concept, one simply ascends a narrowing concept hierarchy, using a process de Bono (1970) calls vertical thinking. To find novel, non-obvious and useful ways of looking at a lexical concept, one must use what Guilford (1967) calls divergent thinking and what de Bono calls lateral thinking. These processes cut across familiar category boundaries, to simultaneously place a concept in many different categories so that we can see it in many different ways. de Bono argues that vertical thinking is selective while lateral thinking is generative. Whereas vertical thinking concerns itself with the “right” way or a single “best” way of looking at things, lateral thinking focuses on producing alternatives to the status quo. To be as useful for creative tasks as they are for conventional tasks, we need to re-imagine our computational similarity measures as generative rather than selective, expansive rather than reductive, divergent as well as convergent and lateral as well as vertical. Though WordNet is ideally structured to support vertical, convergent reasoning, its comprehensive nature means it can also be used as a solid foundation for building a more lateral and divergent model of similarity. Here we will use the web as a source of diverse perspectives on familiar ideas, to complement the conventional and often narrow views codified by WordNet. 
Section 2 provides a brief overview of past work in the area of similarity measurement, before section 3 describes a simple bootstrapping loop for acquiring richly diverse perspectives from the web for a wide variety of familiar ideas. These perspectives are used to enhance a Word- Net-based measure of lexico-semantic similarity in section 4, by broadening the range of informative viewpoints the measure can select from. Similarity is thus modeled as a process that is both generative and selective. This lateral-andvertical approach is evaluated in section 5, on the Miller & Charles (1991) data-set. A web app for the lateral exploration of diverse viewpoints, named Thesaurus Rex, is also presented, before closing remarks are offered in section 6. 2 Related Work and Ideas WordNet’s taxonomic organization of nounsenses and verb-senses – in which very general categories are successively divided into increasingly informative sub-categories or instancelevel ideas – allows us to gauge the overlap in information content, and thus of meaning, of two lexical concepts. We need only identify the deepest point in the taxonomy at which this content starts to diverge. This point of divergence is often called the LCS, or least common subsumer, of two concepts (Pederson et al., 2004). Since sub-categories add new properties to those they inherit from their parents – Aristotle called these properties the differentia that stop a category system from trivially collapsing into itself – the depth of a lexical concept in a taxonomy is an intuitive proxy for its information content. Wu & Palmer (1994) use the depth of a lexical concept in the WordNet hierarchy as such a proxy, and thereby estimate the similarity of two lexical concepts as twice the depth of their LCS divided by the sum of their individual depths. Leacock and Chodorow (1998) instead use the length of the shortest path between two concepts as a proxy for the conceptual distance between them. 
To connect any two ideas in a hierarchical system, one must vertically ascend the hierarchy from one concept, change direction at a potential LCS, and then descend the hierarchy to reach the second concept. (Aristotle was also first to suggest this approach in his Poetics). Leacock and Chodorow normalize the length of this path by dividing its size (in nodes) by twice the depth of the deepest concept in the hierarchy; the latter is an upper bound on the distance between any two concepts in the hierarchy. Negating the log of this normalized length yields a corresponding similarity score. While the role of an LCS is merely implied in Leacock and Chodorow’s use of a shortest path, the LCS is pivotal nonetheless, and like that of Wu & Palmer, the approach uses an essentially vertical reasoning process to identify a single “best” generalization. Depth is a convenient proxy for information content, but more nuanced proxies can yield 661 more rounded similarity measures. Resnick (1995) draws on information theory to define the information content of a lexical concept as the negative log likelihood of its occurrence in a corpus, either explicitly (via a direct mention) or by presupposition (via a mention of any of its sub-categories or instances). Since the likelihood of a general category occurring in a corpus is higher than that of any of its sub-categories or instances, such categories are more predictable, and less informative, than rarer categories whose occurrences are less predictable and thus more informative. The negative log likelihood of the most informative LCS of two lexical concepts offers a reliable estimate of the amount of infor- mation shared by those concepts, and thus a good estimate of their similarity. 
Lin (1998) combines the intuitions behind Resnick’s metric and that of Wu and Palmer to estimate the similarity of two lexical concepts as an information ratio: twice the information content of their LCS divided by the sum of their individual information contents. Jiang and Conrath (1997) consider the converse notion of dissimilarity, noting that two lexical concepts are dissimilar to the extent that each contains information that is not shared by the other. So if the information content of their most informative LCS is a good measure of what they do share, then the sum of their individual information contents, minus twice the content of their most informative LCS, is a reliable estimate of their dissimilarity. Seco et al. (2006) presents a minor innovation, showing how Resnick’s notion of information content can be calculated without the use of an external corpus. Rather, when using Resnick’s metric (or that of Lin, or Jiang and Conrath) for measuring the similarity of lexical concepts in WordNet, one can use the category structure of WordNet itself to estimate infor- mation content. Typically, the more general a concept, the more descendants it will possess. Seco et al. thus estimate the information content of a lexical concept as the log of the sum of all its unique descendants (both direct and indirect), divided by the log of the total number of concepts in the entire hierarchy. Not only is this intrinsic view of information content convenient to use, without recourse to an external corpus, Seco et al. show that it offers a better estimate of information content than its extrinsic, corpus-based alternatives, as measured relative to average human similarity ratings for the 30 word-pairs in the Miller & Charles (1991) test set. A similarity measure can draw on other sources of information besides WordNet’s category structures. 
One might eke out additional information from WordNet’s textual glosses, as in Lesk (1986), or use category structures other than those offered by WordNet. Looking beyond WordNet, entries in the online encyclopedia Wikipedia are not only connected by a dense topology of lateral links, they are also organized by a rich hierarchy of overlapping categories. Strube and Ponzetto (2006) show how Wikipedia can support a measure of similarity (and relatedness) that better approximates human judgments than many WordNet-based measures. Nonetheless, WordNet can be a valuable component of a hybrid measure, and Agirre et al. (2009) use an SVM (support vector machine) to combine information from WordNet with information harvested from the web. Their best similarity measure achieves a remarkable 0.93 correlation with human judgments on the Miller & Charles word-pair set. Similarity is not always applied to pairs of concepts; it is sometimes analogically applied to pairs of pairs of concepts, as in proportional analogies of the form A is to B as C is to D (e.g., hacks are to writers as mercenaries are to soldiers, or chisels are to sculptors as scalpels are to surgeons). In such analogies, one is really assessing the similarity of the unstated relationship between each pair of concepts: thus, mercenaries are soldiers whose allegiance is paid for, much as hacks are writers with income-driven loyalties; sculptors use chisels to carve stone, while surgeons use scalpels to cut or carve flesh. Veale (2004) used WordNet to assess the similarity of A:B to C:D as a function of the combined similarity of A to C and of B to D. In contrast, Turney (2005) used the web to pursue a more divergent course, to represent the tacit relationships of A to B and of C to D as points in a highdimensional space. The dimensions of this space initially correspond to linking phrases on the web, before these dimensions are significantly reduced using singular value decomposition. 
In the infamous SAT test, an analogy A:B::C:D has four other pairs of concepts that serve as likely distractors (e.g. singer:songwriter for hack:writer) and the goal is to choose the most appropriate C:D pair for a given A:B pairing. Using variants of Wu and Palmer (1994) on the 374 SAT analogies of Turney (2005), Veale (2004) reports a success rate of 38–44% using only WordNet-based similarity. In contrast, Turney (2005) reports up to 55% success on the same analogies, partly because his approach aims 662 to match implicit relations rather than explicit concepts, and in part because it uses a divergent process to gather from the web as rich a perspec- tive as it can on these latent relationships. 2.1 Clever Comparisons Create Similarity Each of these approaches to similarity is a user of information, rather than a creator, and each fails to capture how a creative comparison (such as a metaphor) can spur a listener to view a topic from an atypical perspective. Camac & Glucksberg (1984) provide experimental evidence for the claim that “metaphors do not use preexisting associations to achieve their effects [… ] people use metaphors to create new relations between concepts.” They also offer a salutary reminder of an often overlooked fact: every comparison exploits information, but each is also a source of new information in its own right. Thus, “this cola is acid” reveals a different perspective on cola (e.g. as a corrosive substance or an irritating food) than “this acid is cola” highlights for acid (such as e.g., a familiar substance) Veale & Keane (1994) model the role of similarity in realizing the long-term perlocutionary effect of an informative comparison. For example, to compare surgeons to butchers is to encourage one to see all surgeons as more bloody, … crude or careless. The reverse comparison, of butchers to surgeons, encourages one to see butchers as more skilled and precise. 
Veale & Keane present a network model of memory, called Sapper, in which activation can spread between related concepts, thus allowing one concept to prime the properties of a neighbor. To interpret an analogy, Sapper lays down new activation-carrying bridges in memory between analogical counterparts, such as between surgeon & butcher, flesh & meat, and scalpel & cleaver. Comparisons can thus have lasting effects on how Sapper sees the world, changing the pattern of activation that arises when it primes a concept.

Veale (2003) adopts a similarly dynamic view of similarity in WordNet, showing how an analogical comparison can result in the automatic addition of new categories and relations to WordNet itself. Veale considers the problem of finding an analogical mapping between different parts of WordNet’s noun-sense hierarchy, such as between instances of Greek god and Norse god, or between the letters of different alphabets, such as those of Greek and Hebrew. But no structural similarity measure for WordNet exhibits enough discernment to, e.g., assign a higher similarity to Zeus & Odin (each is the supreme deity of its pantheon) than to a pairing of Zeus and any other Norse god, just as no structural measure will assign a higher similarity to Alpha & Aleph or to Beta & Beth than to any random letter pairing. A fine-grained category hierarchy permits fine-grained similarity judgments, and though WordNet is useful, its sense hierarchies are not especially fine-grained. However, we can automatically make WordNet subtler and more discerning by adding new fine-grained categories to unite lexical concepts whose similarity is not reflected by any existing categories. Veale (2003) shows how a property that is found in the glosses of two lexical concepts of the same depth can be combined with their least common subsumer (LCS) to yield a new fine-grained parent category, so e.g. “supreme” + deity = Supreme-deity (for Odin, Zeus, Jupiter, etc.) and “1st” + letter = 1st-letter (for Alpha, Aleph, etc.).
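The recategorization step just described ("supreme" + deity = Supreme-deity) can be sketched in a few lines. The gloss properties and hypernym table below are invented for illustration, not drawn from WordNet.

```python
# Sketch of Veale (2003)-style fine-grained recategorization: when two
# concepts share a distinctive gloss property and a common hypernym
# (their LCS), mint a new parent category "property-LCS" for both.
GLOSS_PROPS = {
    "Zeus": {"supreme", "Greek"},
    "Odin": {"supreme", "Norse"},
    "Ares": {"war", "Greek"},
}
# Least common subsumer of each concept pair (all deities in this toy data).
LCS = {frozenset(p): "deity" for p in
       [("Zeus", "Odin"), ("Zeus", "Ares"), ("Odin", "Ares")]}

def new_categories(a, b):
    shared = GLOSS_PROPS[a] & GLOSS_PROPS[b]
    lcs = LCS[frozenset((a, b))]
    # Each shared property yields a fine-grained parent, e.g. "supreme-deity".
    return {f"{prop}-{lcs}" for prop in shared}

print(new_categories("Zeus", "Odin"))  # {'supreme-deity'}
print(new_categories("Zeus", "Ares"))  # {'Greek-deity'}
```

Once "supreme-deity" exists as a category, a structural measure will correctly rate Zeus & Odin as closer than Zeus and an arbitrary Norse god.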
Selected aspects of the textual similarity of two WordNet glosses (the key to similarity in Lesk, 1986) can thus be reified into an explicitly categorical WordNet form.

3 Divergent (Re)Categorization

To tap into a richer source of concept properties than WordNet’s glosses, we can use web n-grams. Consider these descriptions of a cowboy from the Google n-grams (Brants & Franz, 2006); the numbers to the right are Google frequency counts:

a lonesome cowboy      432
a mounted cowboy       122
a grizzled cowboy       74
a swaggering cowboy     68

To find the stable properties that can underpin a meaningful fine-grained category for cowboy, we must seek out the properties that are so often presupposed to be salient of all cowboys that one can use them to anchor a simile, such as
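A minimal sketch of this filtering step, using the cowboy counts above: a description is kept as a stable property only if the simile frame "as ADJ as a cowboy" is actually attested. The simile-attestation counts below are invented for illustration.

```python
# Sketch: filter frequent n-gram descriptions of a concept down to the
# properties salient enough to anchor a simile.
# Counts for "a ADJ cowboy", mirroring the Google n-gram example above.
ngram_counts = {
    "lonesome": 432, "mounted": 122, "grizzled": 74, "swaggering": 68,
}
# Invented web counts for the simile frame "as ADJ as a cowboy".
simile_counts = {"lonesome": 15, "mounted": 0, "grizzled": 7, "swaggering": 3}

def stable_properties(min_simile=1):
    # Simile attestation signals presupposed salience, not mere frequency
    # of description: "mounted" is frequent but drops out here.
    return [adj for adj, n in ngram_counts.items()
            if simile_counts.get(adj, 0) >= min_simile]

print(stable_properties())  # ['lonesome', 'grizzled', 'swaggering']
```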

5 0.78354454 31 acl-2013-A corpus-based evaluation method for Distributional Semantic Models

Author: Abdellah Fourtassi ; Emmanuel Dupoux

Abstract: Evaluation methods for Distributional Semantic Models typically rely on behaviorally derived gold standards. These methods are difficult to deploy in languages with scarce linguistic/behavioral resources. We introduce a corpus-based measure that evaluates the stability of the lexical semantic similarity space using a pseudo-synonym same-different detection task and no external resources. We show that it can predict two behavior-based measures across a range of parameters in a Latent Semantic Analysis model.

6 0.73487097 76 acl-2013-Building and Evaluating a Distributional Memory for Croatian

7 0.70836323 262 acl-2013-Offspring from Reproduction Problems: What Replication Failure Teaches Us

8 0.7054714 304 acl-2013-SEMILAR: The Semantic Similarity Toolkit

9 0.69379944 43 acl-2013-Align, Disambiguate and Walk: A Unified Approach for Measuring Semantic Similarity

10 0.69213682 58 acl-2013-Automated Collocation Suggestion for Japanese Second Language Learners

11 0.69174314 104 acl-2013-DKPro Similarity: An Open Source Framework for Text Similarity

12 0.6836853 116 acl-2013-Detecting Metaphor by Contextual Analogy

13 0.67288804 32 acl-2013-A relatedness benchmark to test the role of determiners in compositional distributional semantics

14 0.66346568 93 acl-2013-Context Vector Disambiguation for Bilingual Lexicon Extraction from Comparable Corpora

15 0.65416312 103 acl-2013-DISSECT - DIStributional SEmantics Composition Toolkit

16 0.63031924 113 acl-2013-Derivational Smoothing for Syntactic Distributional Semantics

17 0.58140588 242 acl-2013-Mining Equivalent Relations from Linked Data

18 0.57324648 111 acl-2013-Density Maximization in Context-Sense Metric Space for All-words WSD

19 0.57298684 302 acl-2013-Robust Automated Natural Language Processing with Multiword Expressions and Collocations

20 0.55262184 61 acl-2013-Automatic Interpretation of the English Possessive


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.073), (6, 0.065), (11, 0.092), (15, 0.012), (24, 0.116), (26, 0.04), (28, 0.011), (35, 0.109), (42, 0.052), (48, 0.058), (64, 0.012), (70, 0.028), (71, 0.01), (83, 0.128), (88, 0.037), (90, 0.026), (95, 0.064)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.90674269 202 acl-2013-Is a 204 cm Man Tall or Small ? Acquisition of Numerical Common Sense from the Web

Author: Katsuma Narisawa ; Yotaro Watanabe ; Junta Mizuno ; Naoaki Okazaki ; Kentaro Inui

Abstract: This paper presents novel methods for modeling numerical common sense: the ability to infer whether a given number (e.g., three billion) is large, small, or normal for a given context (e.g., the number of people facing a water shortage). We first discuss the necessity of numerical common sense in solving textual entailment problems. We explore two approaches for acquiring numerical common sense. Both approaches start by extracting numerical expressions and their context from the Web. One approach estimates the distribution of numbers co-occurring within a context and examines whether a given value is large, small, or normal, based on the distribution. The other approach utilizes textual patterns with which speakers explicitly express their judgment about the value of a numerical expression. Experimental results demonstrate the effectiveness of both approaches.

same-paper 2 0.90207756 185 acl-2013-Identifying Bad Semantic Neighbors for Improving Distributional Thesauri

Author: Olivier Ferret

Abstract: Distributional thesauri are now widely used in a large number of Natural Language Processing tasks. However, they are far from containing only interesting semantic relations. As a consequence, improving such thesauri is an important issue that is mainly tackled indirectly through the improvement of semantic similarity measures. In this article, we propose a more direct approach focusing on the identification of the neighbors of a thesaurus entry that are not semantically linked to this entry. This identification relies on a discriminative classifier trained from unsupervised selected examples for building a distributional model of the entry in texts. Its bad neighbors are found by applying this classifier to a representative set of occurrences of each of these neighbors. We evaluate the interest of this method for a large set of English nouns with various frequencies.

3 0.89672863 90 acl-2013-Conditional Random Fields for Responsive Surface Realisation using Global Features

Author: Nina Dethlefs ; Helen Hastie ; Heriberto Cuayahuitl ; Oliver Lemon

Abstract: Surface realisers in spoken dialogue systems need to be more responsive than conventional surface realisers. They need to be sensitive to the utterance context as well as robust to partial or changing generator inputs. We formulate surface realisation as a sequence labelling task and combine the use of conditional random fields (CRFs) with semantic trees. Due to their extended notion of context, CRFs are able to take the global utterance context into account and are less constrained by local features than other realisers. This leads to more natural and less repetitive surface realisation. It also allows generation from partial and modified inputs and is therefore applicable to incremental surface realisation. Results from a human rating study confirm that users are sensitive to this extended notion of context and assign ratings that are significantly higher (up to 14%) than those for taking only local context into account.

4 0.83707279 99 acl-2013-Crowd Prefers the Middle Path: A New IAA Metric for Crowdsourcing Reveals Turker Biases in Query Segmentation

Author: Rohan Ramanath ; Monojit Choudhury ; Kalika Bali ; Rishiraj Saha Roy

Abstract: Query segmentation, like text chunking, is the first step towards query understanding. In this study, we explore the effectiveness of crowdsourcing for this task. Through carefully designed control experiments and Inter Annotator Agreement metrics for analysis of experimental data, we show that crowdsourcing may not be a suitable approach for query segmentation because the crowd seems to have a very strong bias towards dividing the query into roughly equal (often only two) parts. Similarly, in the case of hierarchical or nested segmentation, turkers have a strong preference towards balanced binary trees.

5 0.83579534 207 acl-2013-Joint Inference for Fine-grained Opinion Extraction

Author: Bishan Yang ; Claire Cardie

Abstract: This paper addresses the task of fine-grained opinion extraction: the identification of opinion-related entities (the opinion expressions, the opinion holders, and the targets of the opinions) and the relations between opinion expressions and their targets and holders. Most existing approaches tackle the extraction of opinion entities and opinion relations in a pipelined manner, where the interdependencies among different extraction stages are not captured. We propose a joint inference model that leverages knowledge from predictors that optimize subtasks of opinion extraction, and seeks a globally optimal solution. Experimental results demonstrate that our joint inference approach significantly outperforms traditional pipeline methods and baselines that tackle subtasks in isolation for the problem of opinion extraction.

6 0.83402646 377 acl-2013-Using Supervised Bigram-based ILP for Extractive Summarization

7 0.82631856 83 acl-2013-Collective Annotation of Linguistic Resources: Basic Principles and a Formal Model

8 0.82578087 85 acl-2013-Combining Intra- and Multi-sentential Rhetorical Parsing for Document-level Discourse Analysis

9 0.82348597 2 acl-2013-A Bayesian Model for Joint Unsupervised Induction of Sentiment, Aspect and Discourse Representations

10 0.81767225 194 acl-2013-Improving Text Simplification Language Modeling Using Unsimplified Text Data

11 0.81602705 318 acl-2013-Sentiment Relevance

12 0.81586945 389 acl-2013-Word Association Profiles and their Use for Automated Scoring of Essays

13 0.81455708 272 acl-2013-Paraphrase-Driven Learning for Open Question Answering

14 0.81373233 230 acl-2013-Lightly Supervised Learning of Procedural Dialog Systems

15 0.81234354 231 acl-2013-Linggle: a Web-scale Linguistic Search Engine for Words in Context

16 0.81038278 215 acl-2013-Large-scale Semantic Parsing via Schema Matching and Lexicon Extension

17 0.80754995 169 acl-2013-Generating Synthetic Comparable Questions for News Articles

18 0.80729103 298 acl-2013-Recognizing Rare Social Phenomena in Conversation: Empowerment Detection in Support Group Chatrooms

19 0.80721319 341 acl-2013-Text Classification based on the Latent Topics of Important Sentences extracted by the PageRank Algorithm

20 0.80613136 183 acl-2013-ICARUS - An Extensible Graphical Search Tool for Dependency Treebanks