acl acl2013 acl2013-198 knowledge-graph by maker-knowledge-mining

198 acl-2013-IndoNet: A Multilingual Lexical Knowledge Network for Indian Languages

Source: pdf

Author: Brijesh Bhatt ; Lahari Poddar ; Pushpak Bhattacharyya

Abstract: We present IndoNet, a multilingual lexical knowledge base for Indian languages. It is a linked structure of wordnets of 18 different Indian languages, Universal Word dictionary and the Suggested Upper Merged Ontology (SUMO). We discuss various benefits of the network and challenges involved in the development. The system is encoded in Lexical Markup Framework (LMF) and we propose modifications in LMF to accommodate Universal Word Dictionary and SUMO. This standardized version of lexical knowledge base of Indian Languages can now easily be linked to similar global resources.

Reference: text

Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 It is a linked structure of wordnets of 18 different Indian languages, Universal Word dictionary and the Suggested Upper Merged Ontology (SUMO). [sent-2, score-0.401]

2 The system is encoded in Lexical Markup Framework (LMF) and we propose modifications in LMF to accommodate Universal Word Dictionary and SUMO. [sent-4, score-0.061]

3 This standardized version of lexical knowledge base of Indian Languages can now easily be linked to similar global resources. [sent-5, score-0.25]

4 1 Introduction Lexical resources play an important role in natural language processing tasks. [sent-6, score-0.062]

5 The past couple of decades have shown immense growth in the development of lexical resources such as wordnet, Wikipedia, ontologies etc. [sent-7, score-0.202]

6 These resources vary significantly in structure and representation formalism. [sent-8, score-0.091]

7 In order to develop applications that can make use of different resources, it is essential to link these heterogeneous resources and develop a common representation framework. [sent-9, score-0.228]

8 However, the differences in encoding of knowledge and multilinguality are the major road blocks in development of such a framework. [sent-10, score-0.048]

9 Particularly, in a multilingual country like India, information is available in many different languages. [sent-11, score-0.092]

10 In order to exchange information across cultures and languages, it is essential to create an architecture to share various lexical resources across languages. [sent-12, score-0.17]

11 In this paper we present IndoNet, a lexical resource created by merging wordnets of 18 different Indian languages [sent-13, score-0.257]

12 , 1999) and an upper ontology, SUMO (Niles and Pease, 2001). [sent-17, score-0.06]

13 Universal Word (UW), defined by a headword and a set of restrictions which give an unambiguous representation of the concept, forms the vocabulary of Universal Networking Language. [sent-18, score-0.156]

14 Suggested Upper Merged Ontology (SUMO) is the largest freely available ontology which is linked to the entire English WordNet (Niles and Pease, 2003). [sent-19, score-0.343]

15 Though UNL is a graph based representation and SUMO is a formal ontology, both provide language independent conceptualization. [sent-20, score-0.029]

16 IndoNet is encoded in Lexical Markup Framework (LMF), an ISO standard (ISO-24613) for encoding lexical resources (Francopoulo et al. [sent-22, score-0.142]

17 We propose an architecture to link lexical resources of Indian languages. [sent-25, score-0.149]

18 We propose modifications in Lexical Markup Framework to create a linked structure of multilingual lexical resources and ontology. [sent-27, score-0.383]

19 2 Related Work Over the years wordnet has emerged as the most widely used lexical resource. [sent-28, score-0.15]

20 Though most of the wordnets are built by following the standards laid by English Wordnet (Fellbaum, 1998), their conceptualizations differ because of the differences in lexicalization of concepts across languages. [sent-29, score-0.372]

21 1 Wordnets for Indian languages are developed in the IndoWordNet project. [sent-30, score-0.054]

22 These languages cover 3 different language families: Indo-Aryan, Sino-Tibetan and Dravidian. [sent-32, score-0.054]

23 in/ indowordnet itb Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 268–272, Sofia, Bulgaria, August 4-9 2013. [sent-37, score-0.144]

24 © 2013 Association for Computational Linguistics. ‘Not only that, there exist lexical gaps where a word in one language has no correspondence in another language, but there are differences in the ways languages structure their words and concepts’. [sent-39, score-0.134]

25 The challenge of constructing a unified multilingual resource was first addressed in EuroWordNet (Vossen, 1998). [sent-41, score-0.13]

26 EuroWordNet linked wordnets of 8 different European languages through a common interlingual index (ILI). [sent-42, score-0.5]

27 ILI consists of English synsets and serves as a pivot to link other wordnets. [sent-43, score-0.15]

28 While ILI allows each language wordnet to preserve its semantic structure, it has two basic drawbacks as described in Fellbaum and Vossen (2012), 1. [sent-44, score-0.097]

29 An ILI tied to one specific language clearly reflects only the inventory of the language it is based on, and gaps show up when lexicons of different languages are mapped to it. [sent-45, score-0.119]

30 Subsequently, in the KYOTO project (footnote 2), ontologies are preferred over ILI for linking concepts of different languages. [sent-48, score-0.264]

31 Ontologies provide language independent conceptualization, hence the linking remains unbiased to a particular language. [sent-49, score-0.042]

32 Top level ontology SUMO is used to link common base concepts across languages. [sent-50, score-0.438]

33 Because of the small size of the top level ontology, only a few wordnet synsets can be linked directly to the ontological concept and most of the synsets get linked through subsumption relation. [sent-51, score-0.679]

34 ‘LMF provides a common model for the creation and use of lexical resources, to manage the exchange of data among these resources, and to enable the merging of a large number of individual electronic resources to form extensive global electronic resources’ (Francopoulo et al. [sent-55, score-0.265]

35 (2009) proposed WordNet-LMF to represent wordnets in LMF format. [sent-58, score-0.173]

36 Henrich and Hinrichs (2010) have further modified Wordnet-LMF to accommodate lexical 2 http://kyoto-project. [sent-59, score-0.135]

37 LMF also provides extensions for multilingual lexicons and for linking external resources, such as ontology. [sent-65, score-0.172]

38 However, LMF does not explicitly define standards to share a common ontology among multilingual lexicons. [sent-66, score-0.372]

39 Our work falls in line with EuroWordNet and Kyoto except for the following key differences: • Instead of using ILI, we use a ‘common concept hierarchy’ as a backbone to link lexicons of different languages. [sent-67, score-0.159]

40 • In addition to an upper ontology, a concept in common concept hierarchy is also linked to Universal Word Dictionary. [sent-68, score-0.563]

41 Universal Word dictionary provides additional semantic information regarding argument types of verbs, that can be used to provide clues for selectional preference of a verb. [sent-69, score-0.074]

42 • We refine LMF to link external resources (e.g. [sent-70, score-0.128]

43 ontologies) with multilingual lexicons and to represent Universal Word Dictionary. [sent-72, score-0.092]

44 3 IndoNet IndoNet uses a common concept hierarchy to link various heterogeneous lexical resources. [sent-73, score-0.465]

45 As shown in figure 1, concepts of different wordnets, Universal Word Dictionary and Upper Ontology are merged to form the common concept hierarchy. [sent-74, score-0.391]

46 Figure 1 shows how concepts of English WordNet (EWN), Hindi Wordnet (HWN), upper ontology (SUMO) and Universal Word Dictionary (UWD) are linked through common concept hierarchy (CCH). [sent-75, score-0.861]

47 This section provides details of Common Concept Hierarchy and LMF encoding for different resources. [sent-76, score-0.027]

48 Figure 1: An Example of IndoNet Structure. Figure 2: LMF representation for Universal Word Dictionary. [sent-77, score-0.029]

49 3.1 Common Concept Hierarchy (CCH) The common concept hierarchy is an abstract pivot index to link lexical resources of all languages. [sent-78, score-0.553]

50 An element of a common concept hierarchy is defined as $\langle sinid_1, sinid_2, \ldots, uwid, sumoid \rangle$ [sent-79, score-0.323]

51 where $sinid_i$ is the synset id of the $i$-th wordnet, $uwid$ is the universal word id, and $sumoid$ is the SUMO term id of the concept. [sent-82, score-0.848]
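
The tuple definition above can be pictured as a small record type. The following is only a minimal sketch under stated assumptions: the class name `CCHElement`, its field names, and the example ids are hypothetical and are not taken from the IndoNet data or its LMF encoding.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class CCHElement:
    """One element of the common concept hierarchy (illustrative layout only)."""
    # Maps a wordnet name to the synset id of this concept in that wordnet.
    synset_ids: Dict[str, str] = field(default_factory=dict)
    # Universal Word id and SUMO term id; None when the concept is not linked.
    uw_id: Optional[str] = None
    sumo_id: Optional[str] = None

    def linked_wordnets(self) -> List[str]:
        """Names of the wordnets that lexicalize this concept, sorted."""
        return sorted(self.synset_ids)

    def has_gap(self, wordnet: str) -> bool:
        """True when the given wordnet has no synset for this concept."""
        return wordnet not in self.synset_ids


# Hypothetical ids, used only to show the shape of one element.
element = CCHElement(
    synset_ids={"EWN": "ewn-0001"},
    uw_id="uw-0001",
    sumo_id="sumo-0001",
)
```

Keeping the synset ids in a mapping rather than a fixed-length tuple is a design choice of this sketch: a wordnet with no matching synset is simply absent from the mapping.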

52 Unlike ILI, the hypernymy-hyponymy relations from different wordnets are merged to construct the concept hierarchy. [sent-83, score-0.403]

53 Each synset of wordnet is directly linked to a concept in ‘common concept hierarchy’ . [sent-84, score-0.595]

54 However IndoWordnet encodes more lexical relations compared to EuroWordnet. [sent-88, score-0.075]

55 We enhanced the Wordnet-LMF to accommodate the following relations: antonym, gradation, hypernymy, meronym, troponymy, entailment and cross part of speech links for ability and capability. [sent-89, score-0.039]

56 3 LMF for Universal Word Dictionary A Universal Word is composed of a headword and a list of restrictions, that provide unique meaning of the UW. [sent-91, score-0.071]

57 In our architecture we allow each sense of a headword to have more than one set of restrictions (defined by different UW dictionaries) and be linked to lemmas of multiple languages with a confidence score. [sent-92, score-0.41]

58 This allows us to merge multiple UW dictionaries and represent it in LMF format. [sent-93, score-0.029]

59 We introduce four new LMF classes (Restrictions, Restriction, Lemmas and Lemma) and add two new attributes (headword and mapping score) to existing LMF classes. [sent-94, score-0.071]

60 Figure 2 shows an example of LMF representation of UW Dictionary. [sent-95, score-0.029]

61 At present, the dictionary is created by merging two dictionaries, UW++ (Boguslavsky et al. [sent-96, score-0.105]

62 Lemmas from different languages are mapped to universal words and stored under the Lemmas class. [sent-98, score-0.218]
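
To make the Universal Word encoding described above more concrete, here is a minimal sketch that builds an XML fragment with Python's standard `xml.etree.ElementTree`. The class names Restrictions, Restriction, Lemmas and Lemma come from the text. Everything else is an assumption made for illustration: the enclosing `Sense` element, the attribute spellings (`source`, `value`, `lang`, `writtenForm`, `mappingScore`), the dictionary label `dict_a`, and the score value.

```python
import xml.etree.ElementTree as ET


def build_uw_sense(headword, restriction_sets, lemmas):
    """Build one UW sense with per-dictionary restriction sets and scored lemmas."""
    sense = ET.Element("Sense", {"headword": headword})
    # One Restrictions group per source dictionary (several sets are allowed).
    for source, restrictions in restriction_sets.items():
        group = ET.SubElement(sense, "Restrictions", {"source": source})
        for text in restrictions:
            ET.SubElement(group, "Restriction", {"value": text})
    # Lemmas of different languages, each carrying a mapping (confidence) score.
    lemma_group = ET.SubElement(sense, "Lemmas")
    for lang, form, score in lemmas:
        ET.SubElement(
            lemma_group,
            "Lemma",
            {"lang": lang, "writtenForm": form, "mappingScore": f"{score:.2f}"},
        )
    return sense


# Illustrative values only; the score is invented for the example.
sense = build_uw_sense(
    "uncle",
    {"dict_a": ["icl>brother(mod>father)"]},
    [("hi", "chacha", 0.9)],
)
xml_text = ET.tostring(sense, encoding="unicode")
```

Merging a second UW dictionary would, in this sketch, amount to adding another key to `restriction_sets` for the same headword sense.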

63 4 LMF to link ontology with Common Concept Hierarchy Figure 3 shows an example LMF representation of CCH. [sent-100, score-0.284]

64 Concepts in different resources are linked to the SenseAxis in such a way that concepts linked to the same SenseAxis convey the same Sense. [sent-102, score-0.443]

65 Using LMF class MonolingualExternalRefs, ontology can be integrated with a monolingual lexicon. [sent-103, score-0.189]

66 In order to share an ontology among multilingual resources, we modify the original core package of LMF. [sent-104, score-0.281]

67 As shown in figure 3, a SUMO term is shared across multiple lexicons via the SenseAxis. [sent-105, score-0.038]

68 SUMO is linked with concept hierarchy using the following relations: antonym, hypernym, instance and equivalent. [sent-106, score-0.429]

69 3 http://www. in/~hdict/ itb webinterface_user/ Figure 3: LMF representation for Common Concept Hierarchy. [sent-110, score-0.1]

70 In order to support these relations, Reltype attribute is added to the interlingual Sense class. [sent-111, score-0.071]
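
The role of the SenseAxis and the added relation-type attribute can be sketched as follows. This is a hedged illustration, not the LMF schema: the class `SenseAxis` here is a plain Python record, the names `axis_id`, `senses`, `sumo_term`, `rel_type` and `link_sumo` are hypothetical, and the ids are invented. Only the four relation names are taken from the text.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# The four relations the text lists for linking SUMO to the concept hierarchy.
SUMO_REL_TYPES = {"antonym", "hypernym", "instance", "equivalent"}


@dataclass
class SenseAxis:
    """Groups senses of different lexicons that convey the same sense."""
    axis_id: str
    # Maps a lexicon name to the id of its sense on this axis.
    senses: Dict[str, str] = field(default_factory=dict)
    sumo_term: Optional[str] = None
    rel_type: Optional[str] = None

    def link_sumo(self, term: str, rel_type: str) -> None:
        """Attach one SUMO term, shared by every lexicon on this axis."""
        if rel_type not in SUMO_REL_TYPES:
            raise ValueError(f"unsupported relation type: {rel_type}")
        self.sumo_term = term
        self.rel_type = rel_type


# Hypothetical ids; the choice of "hypernym" is illustrative only.
axis = SenseAxis("axis-1", {"EWN": "ewn-0001", "HWN": "hwn-0001"})
axis.link_sumo("sumo-0001", "hypernym")
```

Because the SUMO link sits on the axis rather than on each monolingual sense, every lexicon attached to the axis shares it without duplicating the reference.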

71 4 Observation Table 1 shows part of speech wise status of linked concepts4. [sent-112, score-0.154]

72 The concept hierarchy contains 53848 concepts which are shared among wordnets of Indian languages, SUMO and Universal Word Dictionary. [sent-113, score-0.583]

73 Out of the total 53848 concepts, 21984 are linked to SUMO, 34114 are linked to HWN and 44119 are linked to UW. [sent-114, score-0.462]

74 Among these, 12,254 are common between UW and SUMO and 21984 are common between wordnet and SUMO. [sent-115, score-0.193]
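
The counts reported above can be turned into coverage shares with a few lines of arithmetic. This sketch uses only the figures quoted in the text; the variable and function names are made up for the example.

```python
# Figures quoted in the text: total concepts and concepts linked per resource.
TOTAL_CONCEPTS = 53848
LINKED = {"SUMO": 21984, "HWN": 34114, "UW": 44119}


def coverage(count: int, total: int = TOTAL_CONCEPTS) -> float:
    """Percentage of concepts in the hierarchy linked to one resource."""
    return 100.0 * count / total


# Coverage per resource, rounded to one decimal place.
shares = {name: round(coverage(count), 1) for name, count in LINKED.items()}

for name, share in shares.items():
    print(f"{name}: {share}% of {TOTAL_CONCEPTS} concepts")
```

Read this way, roughly two fifths of the concepts carry a SUMO link, while about four fifths carry a Universal Word link.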

75 Statistics for other wordnets can be found at http : / /www . [sent-119, score-0.173]

76 As shown in Figure 1, ‘uncle’ is an English language concept defined as ‘the brother of your father or mother’. [sent-124, score-0.36]

77 Hindi has no concept equivalent to ‘uncle’ but there are two more specific concepts: ‘kaka’, ‘brother of father’, and ‘mama’, ‘brother of mother’. [sent-125, score-0.294]

78 The lexical gap is captured when these concepts are linked to CCH. [sent-127, score-0.342]

79 Through CCH, these concepts are linked to SUMO term ‘FamilyRelation ’ which shows relation between these concepts. [sent-128, score-0.289]

80 Universal Word Dictionary captures exact relation between these concepts by applying restrictions [chacha] uncle(icl>brother (mod>father)) and [mama] uncle(icl>brother (mod>mother)). [sent-129, score-0.191]

81 This makes it possible to link concepts across languages. [sent-130, score-0.201]
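
The ‘uncle’ example can be walked through in code. The sketch below assumes, for illustration only, that the two Hindi concepts sit below the English concept as hyponyms in the common concept hierarchy; the concept ids, the two dictionaries and the function `lexicalizations` are hypothetical. The Hindi labels follow the UW entries quoted above.

```python
from typing import Dict, List

# Concept id -> {language: word}. Ids and layout are illustrative.
CONCEPTS: Dict[str, Dict[str, str]] = {
    "c_uncle": {"en": "uncle"},
    "c_paternal_uncle": {"hi": "chacha"},
    "c_maternal_uncle": {"hi": "mama"},
}

# Assumed hyponym links inside the common concept hierarchy.
HYPONYMS: Dict[str, List[str]] = {
    "c_uncle": ["c_paternal_uncle", "c_maternal_uncle"],
}


def lexicalizations(concept: str, lang: str) -> List[str]:
    """Words for a concept in a language, descending to hyponyms on a gap."""
    words = CONCEPTS[concept]
    if lang in words:
        return [words[lang]]
    # Lexical gap: collect the words of the more specific concepts instead.
    found: List[str] = []
    for child in HYPONYMS.get(concept, []):
        found.extend(lexicalizations(child, lang))
    return found


hindi_words = lexicalizations("c_uncle", "hi")
```

In this sketch the gap is visible as a missing `"hi"` key on `c_uncle`, and the hierarchy supplies the two narrower Hindi words in its place.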

82 5 Conclusion We have presented a multilingual lexical resource for Indian languages. [sent-131, score-0.183]

83 The proposed architecture handles the ‘lexical gap’ and ‘structural divergence’ among languages, by building a common concept hierarchy. [sent-132, score-0.237]

84 In order to encode this resource in LMF, we developed standards to represent UW in LMF. [sent-133, score-0.081]

85 IndoNet is emerging as the largest multilingual resource covering 18 languages of 3 different language families and it is possible to link or merge other standardized lexical resources with it. [sent-134, score-0.432]

86 Since Universal Word dictionary is an integral part of the system, it can be used for UNL based Machine Translation tasks. [sent-135, score-0.074]

87 Ontological structure of the system can be used for multilingual information retrieval and extraction. [sent-136, score-0.092]

88 In future, we aim to address ontological issues of the common concept hierarchy and integrate domain ontologies with the system. [sent-137, score-0.455]

89 We are also aiming to develop standards to evaluate such multilingual resources and to validate axiomatic foundation of the same. [sent-138, score-0.218]

90 We plan to make this resource freely available to researchers. [sent-139, score-0.038]

91 Multilingual resources for NLP in the lexical markup framework (LMF). [sent-160, score-0.178]

92 Standardizing wordnets in the ISO standard LMF: WordnetLMF for GermaNet. [sent-164, score-0.173]

93 Formal ontology as interlingua: The SUMO and WordNet linking project and global wordnet. [sent-177, score-0.231]

94 Wordnet-LMF: fleshing out a standardized format for wordnet interoperability. [sent-182, score-0.14]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('lmf', 0.58), ('sumo', 0.363), ('indonet', 0.193), ('ontology', 0.189), ('wordnets', 0.173), ('ili', 0.169), ('universal', 0.164), ('concept', 0.159), ('indian', 0.158), ('linked', 0.154), ('concepts', 0.135), ('uw', 0.119), ('hierarchy', 0.116), ('brother', 0.099), ('wordnet', 0.097), ('francopoulo', 0.097), ('niles', 0.097), ('uncle', 0.097), ('multilingual', 0.092), ('ontologies', 0.087), ('pease', 0.085), ('eurowordnet', 0.079), ('dictionary', 0.074), ('cch', 0.073), ('indowordnet', 0.073), ('soria', 0.073), ('headword', 0.071), ('itb', 0.071), ('interlingual', 0.071), ('link', 0.066), ('markup', 0.063), ('resources', 0.062), ('upper', 0.06), ('restrictions', 0.056), ('languages', 0.054), ('lexical', 0.053), ('piek', 0.051), ('vossen', 0.051), ('hindi', 0.05), ('pivot', 0.049), ('merged', 0.049), ('hwn', 0.048), ('senseaxis', 0.048), ('common', 0.048), ('lemmas', 0.045), ('ontological', 0.045), ('id', 0.044), ('standardized', 0.043), ('standards', 0.043), ('henrich', 0.043), ('filt', 0.043), ('icl', 0.043), ('kyot', 0.043), ('mama', 0.043), ('uchida', 0.043), ('linking', 0.042), ('unl', 0.039), ('accommodate', 0.039), ('lexicons', 0.038), ('kyoto', 0.038), ('resource', 0.038), ('monachini', 0.037), ('fellbaum', 0.036), ('india', 0.036), ('boguslavsky', 0.035), ('synsets', 0.035), ('monica', 0.033), ('iso', 0.031), ('merging', 0.031), ('father', 0.031), ('antonym', 0.031), ('architecture', 0.03), ('christiane', 0.03), ('representation', 0.029), ('dictionaries', 0.029), ('mod', 0.029), ('mother', 0.028), ('gaps', 0.027), ('encoding', 0.027), ('claudia', 0.026), ('synset', 0.026), ('suggested', 0.025), ('exchange', 0.025), ('families', 0.024), ('ministry', 0.024), ('adam', 0.024), ('electronic', 0.023), ('heterogeneous', 0.023), ('relations', 0.022), ('modifications', 0.022), ('lig', 0.021), ('multilinguality', 0.021), ('kinship', 0.021), ('axiomatic', 0.021), ('ike', 0.021), ('norwell', 0.021), ('aso', 0.021), ('assamese', 0.021), ('conceptualizations', 0.021), 
('gujarati', 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 198 acl-2013-IndoNet: A Multilingual Lexical Knowledge Network for Indian Languages

Author: Brijesh Bhatt ; Lahari Poddar ; Pushpak Bhattacharyya

2 0.17848065 234 acl-2013-Linking and Extending an Open Multilingual Wordnet

Author: Francis Bond ; Ryan Foster

Abstract: We create an open multilingual wordnet with large wordnets for over 26 languages and smaller ones for 57 languages. It is made by combining wordnets with open licences, data from Wiktionary and the Unicode Common Locale Data Repository. Overall there are over 2 million senses for over 100 thousand concepts, linking over 1.4 million words in hundreds of languages.

3 0.11404039 242 acl-2013-Mining Equivalent Relations from Linked Data

Author: Ziqi Zhang ; Anna Lisa Gentile ; Isabelle Augenstein ; Eva Blomqvist ; Fabio Ciravegna

Abstract: Linking heterogeneous resources is a major research challenge in the Semantic Web. This paper studies the task of mining equivalent relations from Linked Data, which was insufficiently addressed before. We introduce an unsupervised method to measure equivalency of relation pairs and cluster equivalent relations. Early experiments have shown encouraging results with an average of 0.75~0.87 precision in predicting relation pair equivalency and 0.78~0.98 precision in relation clustering.

4 0.077100649 366 acl-2013-Understanding Verbs based on Overlapping Verbs Senses

Author: Kavitha Rajan

Abstract: Natural language can be easily understood by everyone irrespective of their differences in age or region or qualification. The existence of a conceptual base that underlies all natural languages is an accepted claim as pointed out by Schank in his Conceptual Dependency (CD) theory. Inspired by the CD theory and theories in Indian grammatical tradition, we propose a new set of meaning primitives in this paper. We claim that this new set of primitives captures the meaning inherent in verbs and help in forming an inter-lingual and computable ontological classification of verbs. We have identified seven primitive overlapping verb senses which substantiate our claim. The percentage of coverage of these primitives is 100% for all verbs in Sanskrit and Hindi and 3750 verbs in English.

5 0.075090125 117 acl-2013-Detecting Turnarounds in Sentiment Analysis: Thwarting

Author: Ankit Ramteke ; Akshat Malu ; Pushpak Bhattacharyya ; J. Saketha Nath

Abstract: Thwarting and sarcasm are two uncharted territories in sentiment analysis, the former because of the lack of training corpora and the latter because of the enormous amount of world knowledge it demands. In this paper, we propose a working definition of thwarting amenable to machine learning and create a system that detects if the document is thwarted or not. We focus on identifying thwarting in product reviews, especially in the camera domain. An ontology of the camera domain is created. Thwarting is looked upon as the phenomenon of polarity reversal at a higher level of ontology compared to the polarity expressed at the lower level. This notion of thwarting defined with respect to an ontology is novel, to the best of our knowledge. A rule based implementation building upon this idea forms our baseline. We show that machine learning with annotated corpora (thwarted/nonthwarted) is more effective than the rule based system. Because of the skewed distribution of thwarting, we adopt the Area-under-the-Curve measure of performance. To the best of our knowledge, this is the first attempt at the difficult problem of thwarting detection, which we hope will at least provide a baseline system to compare against. 1 Credits The authors thank the lexicographers at Center for Indian Language Technology (CFILT) at IIT Bombay for their support for this work.

6 0.071532063 96 acl-2013-Creating Similarity: Lateral Thinking for Vertical Similarity Judgments

7 0.070223548 306 acl-2013-SPred: Large-scale Harvesting of Semantic Predicates

8 0.068039559 345 acl-2013-The Haves and the Have-Nots: Leveraging Unlabelled Corpora for Sentiment Analysis

9 0.063194059 368 acl-2013-Universal Dependency Annotation for Multilingual Parsing

10 0.05039826 374 acl-2013-Using Context Vectors in Improving a Machine Translation System with Bridge Language

11 0.046042569 249 acl-2013-Models of Semantic Representation with Visual Attributes

12 0.045859151 258 acl-2013-Neighbors Help: Bilingual Unsupervised WSD Using Context

13 0.044446014 162 acl-2013-FrameNet on the Way to Babel: Creating a Bilingual FrameNet Using Wiktionary as Interlingual Connection

14 0.043217055 43 acl-2013-Align, Disambiguate and Walk: A Unified Approach for Measuring Semantic Similarity

15 0.042539686 262 acl-2013-Offspring from Reproduction Problems: What Replication Failure Teaches Us

16 0.041070495 62 acl-2013-Automatic Term Ambiguity Detection

17 0.040867701 291 acl-2013-Question Answering Using Enhanced Lexical Semantic Models

18 0.040128581 105 acl-2013-DKPro WSD: A Generalized UIMA-based Framework for Word Sense Disambiguation

19 0.040005621 28 acl-2013-A Unified Morpho-Syntactic Scheme of Stanford Dependencies

20 0.0388413 98 acl-2013-Cross-lingual Transfer of Semantic Role Labeling Models

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.091), (1, 0.033), (2, 0.008), (3, -0.066), (4, -0.045), (5, -0.062), (6, -0.081), (7, 0.039), (8, 0.075), (9, -0.038), (10, -0.009), (11, -0.014), (12, -0.047), (13, 0.022), (14, 0.026), (15, 0.013), (16, 0.023), (17, -0.006), (18, -0.033), (19, 0.009), (20, -0.047), (21, 0.027), (22, -0.01), (23, -0.016), (24, 0.055), (25, -0.039), (26, -0.009), (27, -0.046), (28, -0.003), (29, 0.005), (30, 0.055), (31, 0.02), (32, 0.021), (33, 0.006), (34, 0.078), (35, -0.037), (36, -0.011), (37, -0.006), (38, 0.043), (39, 0.008), (40, -0.01), (41, 0.012), (42, -0.031), (43, 0.042), (44, -0.073), (45, -0.052), (46, -0.004), (47, -0.087), (48, -0.043), (49, -0.017)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92515934 198 acl-2013-IndoNet: A Multilingual Lexical Knowledge Network for Indian Languages

Author: Brijesh Bhatt ; Lahari Poddar ; Pushpak Bhattacharyya

2 0.79501969 234 acl-2013-Linking and Extending an Open Multilingual Wordnet

Author: Francis Bond ; Ryan Foster

3 0.61993343 162 acl-2013-FrameNet on the Way to Babel: Creating a Bilingual FrameNet Using Wiktionary as Interlingual Connection

Author: Silvana Hartmann ; Iryna Gurevych

Abstract: We present a new bilingual FrameNet lexicon for English and German. It is created through a simple, but powerful approach to construct a FrameNet in any language using Wiktionary as an interlingual representation. Our approach is based on a sense alignment of FrameNet and Wiktionary, and subsequent translation disambiguation into the target language. We perform a detailed evaluation of the created resource and a discussion of Wiktionary as an interlingual connection for the cross-language transfer of lexical-semantic resources. The created resource is publicly available at http://www.ukp.tu-darmstadt.de/fnwkde/.

4 0.56209481 366 acl-2013-Understanding Verbs based on Overlapping Verbs Senses

Author: Kavitha Rajan

5 0.55674237 96 acl-2013-Creating Similarity: Lateral Thinking for Vertical Similarity Judgments

Author: Tony Veale ; Guofu Li

Abstract: Just as observing is more than just seeing, comparing is far more than mere matching. It takes understanding, and even inventiveness, to discern a useful basis for judging two ideas as similar in a particular context, especially when our perspective is shaped by an act of linguistic creativity such as metaphor, simile or analogy. Structured resources such as WordNet offer a convenient hierarchical means for converging on a common ground for comparison, but offer little support for the divergent thinking that is needed to creatively view one concept as another. We describe such a means here, by showing how the web can be used to harvest many divergent views for many familiar ideas. These lateral views complement the vertical views of WordNet, and support a system for idea exploration called Thesaurus Rex. We show also how Thesaurus Rex supports a novel, generative similarity measure for WordNet. 1 Seeing is Believing (and Creating) Similarity is a cognitive phenomenon that is both complex and subjective, yet for practical reasons it is often modeled as if it were simple and objective. This makes sense for the many situations where we want to align our similarity judgments with those of others, and thus focus on the same conventional properties that others are also likely to focus upon. This reliance on the consensus viewpoint explains why WordNet (Fellbaum, 1998) has proven so useful as a basis for computational measures of lexico-semantic similarity (e.g. see Pedersen et al. 2004, Budanitsky & Hirst, 2006; Seco et al. 2006). These measures reduce the similarity of two lexical concepts to a single number, by viewing similarity as an objective estimate of the overlap in their salient qualities. 
This convenient perspective is poorly suited to creative or insightful comparisons, but it is sufficient for the many mundane comparisons we often perform in daily life, such as when we organize books or look for items in a supermarket. So if we do not know in which aisle to locate a given item (such as oatmeal), we may tacitly know how to locate a similar product (such as cornflakes) and orient ourselves accordingly. Yet there are occasions when the recognition of similarities spurs the creation of similarities, when the act of comparison spurs us to invent new ways of looking at an idea. By placing pop tarts in the breakfast aisle, food manufacturers encourage us to view them as a breakfast food that is not dissimilar to oatmeal or cornflakes. When ex-PM Tony Blair published his memoirs, a mischievous activist encouraged others to move his book from Biography to Fiction in bookshops, in the hope that buyers would see it in a new light. Whenever we use a novel metaphor to convey a non-obvious viewpoint on a topic, such as “cigarettes are time bombs”, the comparison may spur us to insight, to see aspects of the topic that make it more similar to the vehicle (see Ortony, 1979; Veale & Hao, 2007). In formal terms, assume agent A has an insight about concept X, and uses the metaphor X is a Y to also provoke this insight in agent B. To arrive at this insight for itself, B must intuit what X and Y have in common. But this commonality is surely more than a standard categorization of X, or else it would not count as an insight about X. To understand the metaphor, B must place X in a new category, so that X can be seen as more similar to Y. Metaphors shape the way we perceive the world by re-shaping the way we make similarity judgments. 
So if we want to imbue computers with the ability to make and to understand creative metaphors, we must first give them the ability to look beyond the narrow viewpoints of conventional resources. Any measure that models similarity as an objective function of a conventional worldview employs a convergent thought process. Using WordNet, for instance, a similarity measure can vertically converge on a common superordinate category of both inputs, and generate a single numeric result based on their distance to, and the information content of, this common generalization. So to find the most conventional ways of seeing a lexical concept, one simply ascends a narrowing concept hierarchy, using a process de Bono (1970) calls vertical thinking. To find novel, non-obvious and useful ways of looking at a lexical concept, one must use what Guilford (1967) calls divergent thinking and what de Bono calls lateral thinking. These processes cut across familiar category boundaries, to simultaneously place a concept in many different categories so that we can see it in many different ways. de Bono argues that vertical thinking is selective while lateral thinking is generative. Whereas vertical thinking concerns itself with the “right” way or a single “best” way of looking at things, lateral thinking focuses on producing alternatives to the status quo. To be as useful for creative tasks as they are for conventional tasks, we need to re-imagine our computational similarity measures as generative rather than selective, expansive rather than reductive, divergent as well as convergent and lateral as well as vertical. Though WordNet is ideally structured to support vertical, convergent reasoning, its comprehensive nature means it can also be used as a solid foundation for building a more lateral and divergent model of similarity. Here we will use the web as a source of diverse perspectives on familiar ideas, to complement the conventional and often narrow views codified by WordNet. 
Section 2 provides a brief overview of past work in the area of similarity measurement, before section 3 describes a simple bootstrapping loop for acquiring richly diverse perspectives from the web for a wide variety of familiar ideas. These perspectives are used to enhance a WordNet-based measure of lexico-semantic similarity in section 4, by broadening the range of informative viewpoints the measure can select from. Similarity is thus modeled as a process that is both generative and selective. This lateral-and-vertical approach is evaluated in section 5, on the Miller & Charles (1991) data-set. A web app for the lateral exploration of diverse viewpoints, named Thesaurus Rex, is also presented, before closing remarks are offered in section 6. 2 Related Work and Ideas WordNet’s taxonomic organization of noun-senses and verb-senses – in which very general categories are successively divided into increasingly informative sub-categories or instance-level ideas – allows us to gauge the overlap in information content, and thus of meaning, of two lexical concepts. We need only identify the deepest point in the taxonomy at which this content starts to diverge. This point of divergence is often called the LCS, or least common subsumer, of two concepts (Pedersen et al., 2004). Since sub-categories add new properties to those they inherit from their parents – Aristotle called these properties the differentia that stop a category system from trivially collapsing into itself – the depth of a lexical concept in a taxonomy is an intuitive proxy for its information content. Wu & Palmer (1994) use the depth of a lexical concept in the WordNet hierarchy as such a proxy, and thereby estimate the similarity of two lexical concepts as twice the depth of their LCS divided by the sum of their individual depths. Leacock and Chodorow (1998) instead use the length of the shortest path between two concepts as a proxy for the conceptual distance between them. 
To connect any two ideas in a hierarchical system, one must vertically ascend the hierarchy from one concept, change direction at a potential LCS, and then descend the hierarchy to reach the second concept. (Aristotle was also the first to suggest this approach in his Poetics.) Leacock and Chodorow normalize the length of this path by dividing its size (in nodes) by twice the depth of the deepest concept in the hierarchy; the latter is an upper bound on the distance between any two concepts in the hierarchy. Negating the log of this normalized length yields a corresponding similarity score. While the role of an LCS is merely implied in Leacock and Chodorow’s use of a shortest path, the LCS is pivotal nonetheless, and like that of Wu & Palmer, the approach uses an essentially vertical reasoning process to identify a single “best” generalization. Depth is a convenient proxy for information content, but more nuanced proxies can yield more rounded similarity measures. Resnik (1995) draws on information theory to define the information content of a lexical concept as the negative log likelihood of its occurrence in a corpus, either explicitly (via a direct mention) or by presupposition (via a mention of any of its sub-categories or instances). Since the likelihood of a general category occurring in a corpus is higher than that of any of its sub-categories or instances, such categories are more predictable, and less informative, than rarer categories whose occurrences are less predictable and thus more informative. The negative log likelihood of the most informative LCS of two lexical concepts offers a reliable estimate of the amount of information shared by those concepts, and thus a good estimate of their similarity.
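The three vertical measures described so far can be sketched on a toy hierarchy. This is a minimal illustration rather than any author's implementation: the taxonomy, the mention counts and all function names are hypothetical, depth is counted in nodes (the root has depth 1), and the hierarchy is a simple tree, so each pair of concepts has exactly one LCS.

```python
import math

# Hypothetical toy taxonomy (child -> parent); it is not taken from WordNet.
PARENT = {
    "entity": None,
    "animal": "entity", "artifact": "entity",
    "dog": "animal", "cat": "animal",
    "poodle": "dog", "hammer": "artifact",
}
# Hypothetical counts of direct corpus mentions, invented for illustration.
COUNTS = {"entity": 0, "animal": 10, "artifact": 5,
          "dog": 40, "cat": 30, "poodle": 5, "hammer": 10}

def ancestors(c):
    """Return the chain [c, parent(c), ..., root]."""
    chain = []
    while c is not None:
        chain.append(c)
        c = PARENT[c]
    return chain

def depth(c):
    """Depth counted in nodes, so the root has depth 1."""
    return len(ancestors(c))

def lcs(a, b):
    """Least common subsumer: the deepest concept above both a and b."""
    above_a = set(ancestors(a))
    # Walking upward from b, the first shared concept is the deepest one.
    return next(c for c in ancestors(b) if c in above_a)

def wu_palmer(a, b):
    """Twice the depth of the LCS over the sum of the two depths."""
    return 2 * depth(lcs(a, b)) / (depth(a) + depth(b))

def path_nodes(a, b):
    """Number of nodes on the path a -> LCS -> b."""
    return depth(a) + depth(b) - 2 * depth(lcs(a, b)) + 1

def leacock_chodorow(a, b):
    """Negated log of the path length normalized by twice the maximum depth."""
    max_depth = max(depth(c) for c in PARENT)
    return -math.log(path_nodes(a, b) / (2 * max_depth))

def information_content(c):
    """-log p(c), where a mention of anything below c also counts for c."""
    mentions = sum(n for x, n in COUNTS.items() if c in ancestors(x))
    return math.log(sum(COUNTS.values()) / mentions)

def resnik(a, b):
    """Information content of the LCS."""
    return information_content(lcs(a, b))

for pair in [("dog", "cat"), ("poodle", "cat"), ("dog", "hammer")]:
    print(pair, round(wu_palmer(*pair), 3),
          round(leacock_chodorow(*pair), 3), round(resnik(*pair), 3))
```

In each case the score depends on a single generalization, the LCS, which is what makes these measures vertical and convergent in the sense used here.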
Lin (1998) combines the intuitions behind Resnik’s metric and that of Wu and Palmer to estimate the similarity of two lexical concepts as an information ratio: twice the information content of their LCS divided by the sum of their individual information contents. Jiang and Conrath (1997) consider the converse notion of dissimilarity, noting that two lexical concepts are dissimilar to the extent that each contains information that is not shared by the other. So if the information content of their most informative LCS is a good measure of what they do share, then the sum of their individual information contents, minus twice the content of their most informative LCS, is a reliable estimate of their dissimilarity. Seco et al. (2006) present a minor innovation, showing how Resnik’s notion of information content can be calculated without the use of an external corpus. Rather, when using Resnik’s metric (or that of Lin, or Jiang and Conrath) for measuring the similarity of lexical concepts in WordNet, one can use the category structure of WordNet itself to estimate information content. Typically, the more general a concept, the more descendants it will possess. Seco et al. thus estimate the information content of a lexical concept as the log of the number of its unique descendants (both direct and indirect), divided by the log of the total number of concepts in the entire hierarchy. Not only is this intrinsic view of information content convenient to use, without recourse to an external corpus, but Seco et al. also show that it offers a better estimate of information content than its extrinsic, corpus-based alternatives, as measured relative to average human similarity ratings for the 30 word-pairs in the Miller & Charles (1991) test set. A similarity measure can draw on other sources of information besides WordNet’s category structures.
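The information-ratio measures above can be sketched in the same spirit, using an intrinsic estimate of information content so that no corpus is needed. The toy taxonomy and all function names are hypothetical. The sketch also makes one assumption beyond the description above: the log ratio is subtracted from 1 (with 1 added to the descendant count to avoid taking the log of zero), so that leaf concepts score 1 and the root scores 0, in keeping with the principle that more descendants mean less information.

```python
import math

# Hypothetical toy taxonomy (child -> parent); it is not taken from WordNet.
PARENT = {
    "entity": None,
    "animal": "entity", "artifact": "entity",
    "dog": "animal", "cat": "animal",
    "poodle": "dog", "hammer": "artifact",
}

def ancestors(c):
    """Return the chain [c, parent(c), ..., root]."""
    chain = []
    while c is not None:
        chain.append(c)
        c = PARENT[c]
    return chain

def lcs(a, b):
    """Least common subsumer: the deepest concept above both a and b."""
    above_a = set(ancestors(a))
    return next(c for c in ancestors(b) if c in above_a)

def descendants(c):
    """All concepts strictly below c, direct and indirect."""
    return [x for x in PARENT if x != c and c in ancestors(x)]

def intrinsic_ic(c):
    # Assumed form: 1 - log(descendants + 1) / log(total concepts).
    return 1 - math.log(len(descendants(c)) + 1) / math.log(len(PARENT))

def lin(a, b):
    """Twice the IC of the LCS over the sum of the individual ICs."""
    total = intrinsic_ic(a) + intrinsic_ic(b)
    # Guard for the degenerate root-vs-root case, where both ICs are 0.
    return 2 * intrinsic_ic(lcs(a, b)) / total if total else 0.0

def jiang_conrath_distance(a, b):
    """Sum of the individual ICs minus twice the IC of the LCS."""
    return intrinsic_ic(a) + intrinsic_ic(b) - 2 * intrinsic_ic(lcs(a, b))

for pair in [("dog", "cat"), ("dog", "hammer"), ("dog", "dog")]:
    print(pair, round(lin(*pair), 3), round(jiang_conrath_distance(*pair), 3))
```

Lin's ratio is a similarity (higher means closer), whereas the Jiang and Conrath quantity is a distance (lower means closer), so the two must be inverted before they can be compared directly.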
One might eke out additional information from WordNet’s textual glosses, as in Lesk (1986), or use category structures other than those offered by WordNet. Looking beyond WordNet, entries in the online encyclopedia Wikipedia are not only connected by a dense topology of lateral links, they are also organized by a rich hierarchy of overlapping categories. Strube and Ponzetto (2006) show how Wikipedia can support a measure of similarity (and relatedness) that better approximates human judgments than many WordNet-based measures. Nonetheless, WordNet can be a valuable component of a hybrid measure, and Agirre et al. (2009) use an SVM (support vector machine) to combine information from WordNet with information harvested from the web. Their best similarity measure achieves a remarkable 0.93 correlation with human judgments on the Miller & Charles word-pair set. Similarity is not always applied to pairs of concepts; it is sometimes analogically applied to pairs of pairs of concepts, as in proportional analogies of the form A is to B as C is to D (e.g., hacks are to writers as mercenaries are to soldiers, or chisels are to sculptors as scalpels are to surgeons). In such analogies, one is really assessing the similarity of the unstated relationship between each pair of concepts: thus, mercenaries are soldiers whose allegiance is paid for, much as hacks are writers with income-driven loyalties; sculptors use chisels to carve stone, while surgeons use scalpels to cut or carve flesh. Veale (2004) used WordNet to assess the similarity of A:B to C:D as a function of the combined similarity of A to C and of B to D. In contrast, Turney (2005) used the web to pursue a more divergent course, to represent the tacit relationships of A to B and of C to D as points in a high-dimensional space. The dimensions of this space initially correspond to linking phrases on the web, before these dimensions are significantly reduced using singular value decomposition.
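The concept-to-concept view of analogy attributed to Veale (2004) above can be sketched as follows. The text says only that A:B is compared to C:D as a function of the combined similarity of A to C and of B to D; the unweighted mean used here is an assumption, as are the similarity values, which are invented for illustration and do not come from WordNet.

```python
# Hypothetical pairwise similarity scores, invented for illustration only.
SIM = {
    frozenset({"hack", "mercenary"}): 0.6,
    frozenset({"writer", "soldier"}): 0.5,
    frozenset({"hack", "singer"}): 0.3,
    frozenset({"writer", "songwriter"}): 0.7,
}

def sim(x, y):
    """Symmetric lookup; unknown pairs default to 0, identical words to 1."""
    if x == y:
        return 1.0
    return SIM.get(frozenset({x, y}), 0.0)

def analogy_score(a, b, c, d):
    """Score A:B :: C:D by combining sim(A, C) and sim(B, D).

    The unweighted mean is an assumed combination function."""
    return (sim(a, c) + sim(b, d)) / 2

def best_completion(a, b, candidates):
    """Pick the candidate (C, D) pair that best matches A:B."""
    return max(candidates, key=lambda cd: analogy_score(a, b, *cd))

candidates = [("singer", "songwriter"), ("mercenary", "soldier")]
print(best_completion("hack", "writer", candidates))
```

A scheme of this kind compares explicit concepts only. It never represents the relation between A and B itself, which is the gap the relational, web-based approach of Turney (2005) is designed to fill.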
In the infamous SAT test, an analogy A:B::C:D has four other pairs of concepts that serve as likely distractors (e.g. singer:songwriter for hack:writer) and the goal is to choose the most appropriate C:D pair for a given A:B pairing. Using variants of Wu and Palmer (1994) on the 374 SAT analogies of Turney (2005), Veale (2004) reports a success rate of 38–44% using only WordNet-based similarity. In contrast, Turney (2005) reports up to 55% success on the same analogies, partly because his approach aims to match implicit relations rather than explicit concepts, and in part because it uses a divergent process to gather from the web as rich a perspective as it can on these latent relationships.
2.1 Clever Comparisons Create Similarity
Each of these approaches to similarity is a user of information, rather than a creator, and each fails to capture how a creative comparison (such as a metaphor) can spur a listener to view a topic from an atypical perspective. Camac & Glucksberg (1984) provide experimental evidence for the claim that “metaphors do not use preexisting associations to achieve their effects […] people use metaphors to create new relations between concepts.” They also offer a salutary reminder of an often overlooked fact: every comparison exploits information, but each is also a source of new information in its own right. Thus, “this cola is acid” reveals a different perspective on cola (e.g. as a corrosive substance or an irritating food) than “this acid is cola” highlights for acid (e.g., as a familiar substance). Veale & Keane (1994) model the role of similarity in realizing the long-term perlocutionary effect of an informative comparison. For example, to compare surgeons to butchers is to encourage one to see all surgeons as more bloody, … crude or careless. The reverse comparison, of butchers to surgeons, encourages one to see butchers as more skilled and precise.
Veale & Keane present a network model of memory, called Sapper, in which activation can spread between related concepts, thus allowing one concept to prime the properties of a neighbor. To interpret an analogy, Sapper lays down new activation-carrying bridges in memory between analogical counterparts, such as between surgeon & butcher, flesh & meat, and scalpel & cleaver. Comparisons can thus have lasting effects on how Sapper sees the world, changing the pattern of activation that arises when it primes a concept. Veale (2003) adopts a similarly dynamic view of similarity in WordNet, showing how an analogical comparison can result in the automatic addition of new categories and relations to WordNet itself. Veale considers the problem of finding an analogical mapping between different parts of WordNet’s noun-sense hierarchy, such as between instances of Greek god and Norse god, or between the letters of different alphabets, such as of Greek and Hebrew. But no structural similarity measure for WordNet exhibits enough discernment to e.g. assign a higher similarity to Zeus & Odin (each is the supreme deity of its pantheon) than to a pairing of Zeus and any other Norse god, just as no structural measure will assign a higher similarity to Alpha & Aleph or to Beta & Beth than to any random letter pairing. A fine-grained category hierarchy permits fine-grained similarity judgments, and though WordNet is useful, its sense hierarchies are not especially fine-grained. However, we can automatically make WordNet subtler and more discerning, by adding new fine-grained categories to unite lexical concepts whose similarity is not reflected by any existing categories. Veale (2003) shows how a property that is found in the glosses of two lexical concepts, of the same depth, can be combined with their LCS to yield a new fine-grained parent category, so e.g. “supreme” + deity = Supreme-deity (for Odin, Zeus, Jupiter, etc.) and “1st” + letter = 1st-letter (for Alpha, Aleph, etc.).
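The gloss-based recategorization just described can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure of Veale (2003): the taxonomy fragment and glosses are hypothetical, a "property" is taken to be any shared gloss word that is neither a stopword nor part of the LCS label, and all function names are invented.

```python
# Hypothetical taxonomy fragment and glosses, invented for illustration only.
PARENT = {
    "deity": None,
    "Greek_deity": "deity", "Norse_deity": "deity",
    "Zeus": "Greek_deity", "Odin": "Norse_deity", "Thor": "Norse_deity",
}
GLOSS = {
    "Zeus": "the supreme Greek deity",
    "Odin": "the supreme Norse deity",
    "Thor": "the Norse deity of thunder",
}
STOPWORDS = {"the", "of", "a", "an"}

def ancestors(c):
    """Return the chain [c, parent(c), ..., root]."""
    chain = []
    while c is not None:
        chain.append(c)
        c = PARENT[c]
    return chain

def lcs(a, b):
    """Least common subsumer: the deepest concept above both a and b."""
    above_a = set(ancestors(a))
    return next(c for c in ancestors(b) if c in above_a)

def gloss_words(c):
    return set(GLOSS[c].lower().split())

def new_categories(a, b):
    """Propose parent categories of the form Property-LCS for a and b."""
    if len(ancestors(a)) != len(ancestors(b)):
        return []  # only concepts of the same depth are compared
    parent = lcs(a, b)
    parent_words = set(parent.lower().split("_"))
    # A shared property must add something the LCS does not already say.
    shared = gloss_words(a) & gloss_words(b)
    properties = sorted(shared - STOPWORDS - parent_words)
    return [f"{p.capitalize()}-{parent}" for p in properties]

print(new_categories("Zeus", "Odin"))
print(new_categories("Odin", "Thor"))
```

With these toy glosses, Zeus and Odin share a property that their LCS lacks, so a finer category can be proposed for them, whereas Odin and Thor share only what their existing parent already expresses.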
Selected aspects of the textual similarity of two WordNet glosses – the key to similarity in Lesk (1986) – can thus be reified into an explicitly categorical WordNet form.
3 Divergent (Re)Categorization
To tap into a richer source of concept properties than WordNet’s glosses, we can use web n-grams. Consider these descriptions of a cowboy from the Google n-grams (Brants & Franz, 2006). The numbers to the right are Google frequency counts.
a lonesome cowboy 432
a mounted cowboy 122
a grizzled cowboy 74
a swaggering cowboy 68
To find the stable properties that can underpin a meaningful fine-grained category for cowboy, we must seek out the properties that are so often presupposed to be salient of all cowboys that one can use them to anchor a simile, such as

6 0.54313648 242 acl-2013-Mining Equivalent Relations from Linked Data

7 0.50807148 306 acl-2013-SPred: Large-scale Harvesting of Semantic Predicates

8 0.50517631 268 acl-2013-PATHS: A System for Accessing Cultural Heritage Collections

9 0.4955062 43 acl-2013-Align, Disambiguate and Walk: A Unified Approach for Measuring Semantic Similarity

10 0.47694412 262 acl-2013-Offspring from Reproduction Problems: What Replication Failure Teaches Us

11 0.47550821 344 acl-2013-The Effects of Lexical Resource Quality on Preference Violation Detection

12 0.4571813 270 acl-2013-ParGramBank: The ParGram Parallel Treebank

13 0.4552688 61 acl-2013-Automatic Interpretation of the English Possessive

14 0.44601533 161 acl-2013-Fluid Construction Grammar for Historical and Evolutionary Linguistics

15 0.43949458 186 acl-2013-Identifying English and Hungarian Light Verb Constructions: A Contrastive Approach

16 0.43177015 105 acl-2013-DKPro WSD: A Generalized UIMA-based Framework for Word Sense Disambiguation

17 0.42446798 258 acl-2013-Neighbors Help: Bilingual Unsupervised WSD Using Context

18 0.41918966 367 acl-2013-Universal Conceptual Cognitive Annotation (UCCA)

19 0.41587022 116 acl-2013-Detecting Metaphor by Contextual Analogy

20 0.41238788 6 acl-2013-A Java Framework for Multilingual Definition and Hypernym Extraction

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.064), (6, 0.017), (11, 0.047), (15, 0.019), (24, 0.034), (26, 0.08), (31, 0.032), (35, 0.055), (42, 0.037), (48, 0.058), (62, 0.3), (70, 0.024), (88, 0.05), (90, 0.018), (95, 0.072)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.75408727 198 acl-2013-IndoNet: A Multilingual Lexical Knowledge Network for Indian Languages

Author: Brijesh Bhatt ; Lahari Poddar ; Pushpak Bhattacharyya

2 0.68838215 116 acl-2013-Detecting Metaphor by Contextual Analogy

Author: Eirini Florou

Abstract: As one of the most challenging issues in NLP, metaphor identification and its interpretation have seen many models and methods proposed. This paper presents a study on metaphor identification based on the semantic similarity between literal and non-literal meanings of words that can appear in the same context.

3 0.64509577 118 acl-2013-Development and Analysis of NLP Pipelines in Argo

Author: Rafal Rak ; Andrew Rowley ; Jacob Carter ; Sophia Ananiadou

Abstract: Developing sophisticated NLP pipelines composed of multiple processing tools and components available through different providers may pose a challenge in terms of their interoperability. The Unstructured Information Management Architecture (UIMA) is an industry standard whose aim is to ensure such interoperability by defining common data structures and interfaces. The architecture has been gaining attention from industry and academia alike, resulting in a large volume of UIMA-compliant processing components. In this paper, we demonstrate Argo, a Web-based workbench for the development and processing of NLP pipelines/workflows. The workbench is based upon UIMA, and thus has the potential of using many of the existing UIMA resources. We present features, and show examples, of facilitating the distributed development of components and the analysis of processing results. The latter includes annotation visualisers and editors, as well as serialisation to RDF format, which enables flexible querying in addition to data manipulation thanks to the semantic query language SPARQL. The distributed development feature allows users to seamlessly connect their tools to workflows running in Argo, and thus take advantage of both the available library of components (without the need of installing them locally) and the analytical tools.

4 0.48259395 211 acl-2013-LABR: A Large Scale Arabic Book Reviews Dataset

Author: Mohamed Aly ; Amir Atiya

Abstract: We introduce LABR, the largest sentiment analysis dataset to-date for the Arabic language. It consists of over 63,000 book reviews, each rated on a scale of 1 to 5 stars. We investigate the properties of the dataset, and present its statistics. We explore using the dataset for two tasks: sentiment polarity classification and rating classification. We provide standard splits of the dataset into training and testing, for both polarity and rating classification, in both balanced and unbalanced settings. We run baseline experiments on the dataset to establish a benchmark.

5 0.48090589 131 acl-2013-Dual Training and Dual Prediction for Polarity Classification

Author: Rui Xia ; Tao Wang ; Xuelei Hu ; Shoushan Li ; Chengqing Zong

Abstract: Bag-of-words (BOW) is now the most popular way to model text in machine learning based sentiment classification. However, the performance of such an approach sometimes remains rather limited due to some fundamental deficiencies of the BOW model. In this paper, we focus on the polarity shift problem, and propose a novel approach, called dual training and dual prediction (DTDP), to address it. The basic idea of DTDP is to first generate artificial samples that are polarity-opposite to the original samples by polarity reversion, and then leverage both the original and opposite samples for (dual) training and (dual) prediction. Experimental results on four datasets demonstrate the effectiveness of the proposed approach for polarity classification.

6 0.47958148 234 acl-2013-Linking and Extending an Open Multilingual Wordnet

7 0.47812164 70 acl-2013-Bilingually-Guided Monolingual Dependency Grammar Induction

8 0.4740904 318 acl-2013-Sentiment Relevance

9 0.47392488 369 acl-2013-Unsupervised Consonant-Vowel Prediction over Hundreds of Languages

10 0.47366539 117 acl-2013-Detecting Turnarounds in Sentiment Analysis: Thwarting

11 0.47060597 236 acl-2013-Mapping Source to Target Strings without Alignment by Analogical Learning: A Case Study with Transliteration

12 0.47040865 81 acl-2013-Co-Regression for Cross-Language Review Rating Prediction

13 0.46977341 2 acl-2013-A Bayesian Model for Joint Unsupervised Induction of Sentiment, Aspect and Discourse Representations

14 0.4696272 97 acl-2013-Cross-lingual Projections between Languages from Different Families

15 0.46871114 216 acl-2013-Large tagset labeling using Feed Forward Neural Networks. Case study on Romanian Language

16 0.46828732 164 acl-2013-FudanNLP: A Toolkit for Chinese Natural Language Processing

17 0.46789092 276 acl-2013-Part-of-Speech Induction in Dependency Trees for Statistical Machine Translation

18 0.46639323 316 acl-2013-SenseSpotting: Never let your parallel data tie you to an old domain

19 0.46636415 144 acl-2013-Explicit and Implicit Syntactic Features for Text Classification

20 0.46633577 7 acl-2013-A Lattice-based Framework for Joint Chinese Word Segmentation, POS Tagging and Parsing