acl acl2013 acl2013-234 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Francis Bond ; Ryan Foster
Abstract: We create an open multilingual wordnet with large wordnets for over 26 languages and smaller ones for 57 languages. It is made by combining wordnets with open licences, data from Wiktionary and the Unicode Common Locale Data Repository. Overall there are over 2 million senses for over 100 thousand concepts, linking over 1.4 million words in hundreds of languages.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We create an open multilingual wordnet with large wordnets for over 26 languages and smaller ones for 57 languages. [sent-2, score-0.93]
2 It is made by combining wordnets with open licences, data from Wiktionary and the Unicode Common Locale Data Repository. [sent-3, score-0.481]
3 Overall there are over 2 million senses for over 100 thousand concepts, linking over 1.4 million words in hundreds of languages. [sent-4, score-0.193]
4 One of the many attractions of the semantic network WordNet (Fellbaum, 1998), is that there are numerous wordnets being built for different languages. [sent-9, score-0.425]
5 Although there are over 60 languages for which wordnets exist in some state of development (Fellbaum and Vossen, 2012, 316), less than half of these have released any data, and for those that have, the data is often not freely accessible (Bond and Paik, 2012). [sent-13, score-0.576]
6 For those wordnets that are available, they are of widely varying size and quality, both in terms of accuracy and richness. [sent-14, score-0.425]
7 languages to be able to access wordnets for those languages with a minimum of legal and technical barriers. [sent-17, score-0.574]
8 In practice this means making it possible to access multiple wordnets with a common interface. [sent-18, score-0.425]
9 We also use sources of semi-structured data that have minimal legal restrictions to automatically extend existing freely available wordnets and to create additional wordnets which can be added to our open wordnet grid. [sent-19, score-1.278]
10 Previous studies have leveraged multiple wordnets and Wiktionary (Wikimedia, 2013) to extend existing wordnets or create new ones (de Melo and Weikum, 2009; Hanoka and Sagot, 2012). [sent-20, score-0.85]
11 These studies passed over the valuable sense groupings of translations within Wiktionary and merely used Wiktionary as a source of translations that were not disambiguated according to sense. [sent-21, score-0.205]
12 The present study built and extended wordnets by directly linking Wiktionary senses to WordNet senses. [sent-22, score-0.618]
13 Meyer and Gurevych (2011) demonstrated the ability to automatically identify many matching senses in Wiktionary and WordNet based on the similarity of monolingual features. [sent-23, score-0.131]
14 Other large scale multilingual lexicons have been made by linking wordnet to Wikipedia (Wikipedia, 2013; de Melo and Weikum, 2010; Navigli and Ponzetto, 2012). [sent-26, score-0.461]
15 In Section 2 we discuss linking freely available wordnets to form a single multilingual semantic network. [sent-28, score-0.563]
16 In Section 3 we extend the wordnets with data from two sources. [sent-29, score-0.425]
17 2 Linking Multiple Wordnets In order to make the data from existing wordnet projects more accessible, we have built a simple database with information from those wordnets with licenses that allow redistribution of the data. [sent-33, score-1.051]
18 These wordnets, their licenses and recent activity are summarized in Table 1 (sizes for most of them are shown in Table 2). [sent-34, score-0.193]
19 Table 1: Linked Open Wordnets. The first wordnet developed is the Princeton WordNet (PWN: Fellbaum, 1998). [sent-36, score-0.323]
20 PWN is released under an open license (allowing one to use, copy, modify and distribute it so long as you properly acknowledge the copyright). [sent-40, score-0.195]
21 The majority of freely available wordnets take the basic structure of the PWN and add new lemmas (words) to the existing synsets: the extend model (Vossen, 2005). [sent-41, score-0.487]
22 For example, dogn:1 is linked to the lemmas chien in French, anjing in Malay, and so on. [sent-42, score-0.138]
23 In theory, such wordnets can easily be combined into a single resource by using the PWN synsets as pivots. [sent-47, score-0.554]
24 Because they are linked at the synset level, the problem of ambiguity one gets when linking bilingual dictionaries through a common language is resolved: we are linking senses to senses. [sent-49, score-0.371]
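To make the pivoting step concrete, here is a minimal sketch (not the project's actual build code) of merging several extend-model wordnets through shared PWN synset identifiers; the tab-separated file format and the file names are assumptions.

    # Sketch: merge extend-model wordnets by pivoting on PWN synset IDs.
    # Assumes tab-separated files of "synset_offset-pos<TAB>lemma" lines,
    # e.g. "02084071-n<TAB>chien" for French; the file names are illustrative.
    from collections import defaultdict

    def load_wordnet(path, lang):
        entries = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                parts = line.strip().split("\t")
                if len(parts) < 2 or parts[0].startswith("#"):
                    continue
                entries.append((parts[0], lang, parts[1]))
        return entries

    def merge(wordnet_files):
        # grid[synset][lang] -> set of lemmas, keyed by the PWN synset (the pivot)
        grid = defaultdict(lambda: defaultdict(set))
        for lang, path in wordnet_files.items():
            for synset, l, lemma in load_wordnet(path, lang):
                grid[synset][l].add(lemma)
        return grid

    grid = merge({"fra": "wn-data-fra.tab", "zsm": "wn-data-zsm.tab"})
    print(sorted(grid["02084071-n"]["fra"]))  # e.g. ['chien'] for dog n:1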
25 In practice, linking a new language’s wordnet into the grid could be problematic for three reasons. [sent-50, score-0.421]
26 The first problem was that the wordnets were linked to various versions of the Princeton WordNet. [sent-51, score-0.501]
27 The second problem was the incredible variety of formats that the wordnets are distributed in. [sent-53, score-0.425]
28 The final problem was legal: not all wordnets have been released under licenses that allow reuse. [sent-56, score-0.679]
29 Mapping introduces some distortions: in particular, when a synset is split, we chose to map the translations only to the most probable mapping, so some new synsets will have no translations. [sent-59, score-0.251]
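As a hedged illustration of that version-mapping step, the sketch below assumes a mapping table from old synset IDs to (new synset, probability) pairs, as distributed by inter-version mapping projects; the data shapes are assumptions, not the project's actual code.

    # Sketch: carry translations from an older PWN version onto 3.0 synsets.
    # When an old synset was split, keep only the most probable target,
    # so some new synsets end up with no translations.
    def remap(translations, mapping):
        """translations: {old_synset: set(lemmas)}
           mapping: {old_synset: [(new_synset, probability), ...]}"""
        remapped = {}
        for old_synset, lemmas in translations.items():
            candidates = mapping.get(old_synset)
            if not candidates:
                continue  # no mapping for this synset; report it upstream
            best_synset, _ = max(candidates, key=lambda c: c[1])
            remapped.setdefault(best_synset, set()).update(lemmas)
        return remapped

    mapping = {"00001740-n": [("00001930-n", 0.8), ("00002137-n", 0.2)]}
    print(remap({"00001740-n": {"entité"}}, mapping))  # maps to the 0.8 target only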
30 Any problems or bugs found when converting the wordnets have been reported back to the original projects, with many of them fixed in newer releases. [sent-62, score-0.425]
31 The third, legal, problem is being solved by an ongoing campaign to encourage projects to (re-)release their data under open licenses. [sent-64, score-0.134]
32 Since Bond and Paik (2012) surveyed wordnet licenses in 2011, six projects have newly released data under open licenses and eight projects have updated their data. [sent-65, score-0.982]
33 Our combined wordnet includes English (Fellbaum, 1998); Albanian (Ruci, 2008); Arabic (Black et al. [sent-66, score-0.323]
34 On our server, the wordnets are all in a shared sqlite database using the schema produced by the Japanese WordNet project (Isahara et al. [sent-79, score-0.457]
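For readers who want to query such a database, the following sketch shows a lookup of all lemmas attached to one synset across languages; the table and column names are assumptions about a schema of this general shape, not a verbatim copy of the Japanese WordNet schema.

    # Sketch: list every (language, lemma) pair for a given synset in a
    # shared sqlite database.  Table and column names are illustrative.
    import sqlite3

    def lemmas_for_synset(db_path, synset):
        con = sqlite3.connect(db_path)
        try:
            rows = con.execute(
                "SELECT sense.lang, word.lemma "
                "FROM sense JOIN word ON sense.wordid = word.wordid "
                "WHERE sense.synset = ? ORDER BY sense.lang", (synset,))
            return list(rows)
        finally:
            con.close()

    # e.g. lemmas_for_synset("wn-multi.db", "02084071-n")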
35 The Scandinavian and Polish wordnets are based on the merge approach, where independent language specific structures are built and then some synsets linked to PWN. [sent-84, score-0.63]
36 (2006) created a list of 5,000 core word senses in Princeton WordNet which represent approximately the 5,000 most frequently used word senses. [sent-88, score-0.179]
37 As a very rough measure of useful coverage, we report the percentage of synsets covered from this core list. [sent-90, score-0.177]
38 Note that some wordnet projects have deliberately targeted the core concepts, which of course boosts their coverage scores. [sent-92, score-0.449]
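Computing the coverage figure itself is a one-liner over sets; a small sketch follows (the core-list file name is hypothetical).

    # Sketch: percentage of the 5,000 core PWN synsets covered by a wordnet.
    def core_coverage(core_synsets, covered_synsets):
        core = set(core_synsets)
        return 100.0 * len(core & set(covered_synsets)) / len(core)

    # e.g. with the merged grid from the earlier sketch:
    # core = open("core-synsets.txt").read().split()      # hypothetical file
    # print(core_coverage(core, {s for s in grid if grid[s]["fra"]}))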
39 2 License Types The licenses fall into four broad categories: (u) completely unrestricted, (a) attribution required, (s) share alike, and (n) non-commercial. [sent-98, score-0.193]
40 The WordNet, MIT, and CC BY licenses are all in this category. [sent-101, score-0.193]
41 The third category allows anyone to adapt and improve the licensed work and redistribute it, but the redistributed work must be released under the same license. [sent-102, score-0.132]
42 The CC BY-SA, GPL, GFDL, and CeCILL-C licenses are of this type. [sent-103, score-0.193]
43 Because derivative works can only be redistributed under the same license, works licensed under any two of these licenses cannot be combined with each other and legally redistributed. [sent-104, score-0.193]
44 The CC BY-NC and the CC BY-NC-SA licenses are in this category; they are also incompatible with licenses in category (s). [sent-108, score-0.386]
45 Releasing a work under the more restrictive licenses in categories (s) and (n) above substantially limits and complicates the ability to extend and combine a work into other useful forms. [sent-109, score-0.193]
46 We can currently combine those with licenses in groups (u) and (a) and the CC BY-SA wordnets (now everything except French and Basque). [sent-112, score-0.618]
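The compatibility reasoning above can be encoded in a few lines; the sketch below only models the two constraints stated in the text (two different share-alike licenses cannot be combined, and non-commercial licenses cannot be combined with category (s)) and is an illustration, not legal advice.

    # Sketch: can works under these licenses be combined and redistributed?
    # Categories follow the text: (u) unrestricted, (a) attribution,
    # (s) share-alike, (n) non-commercial.
    CATEGORY = {
        "WordNet": "a", "MIT": "a", "CC BY": "a",
        "CC BY-SA": "s", "GPL": "s", "GFDL": "s", "CeCILL-C": "s",
        "CC BY-NC": "n", "CC BY-NC-SA": "n",
    }

    def combinable(licenses):
        cats = [CATEGORY[l] for l in licenses]
        share_alike = {l for l, c in zip(licenses, cats) if c == "s"}
        if len(share_alike) > 1:            # two different share-alike licenses
            return False
        if "n" in cats and "s" in cats:     # non-commercial vs. share-alike
            return False
        return True

    print(combinable(["WordNet", "CC BY", "CC BY-SA"]))  # True
    print(combinable(["CC BY-SA", "GFDL"]))              # False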
47 This is a collection of data maintained by the Unicode Consortium to support software internationalization and localization, with locale information on formatting dates, numbers, currencies, times, and time zones, as well as help for choosing languages and countries by name. [sent-121, score-0.19]
48 It is released under an open license that allows redistribution with proper attribution (Unicode, Inc. [sent-123, score-0.195]
49 Most had around 550 senses (synsets and their lemmas): for example, for Portuguese: English n:1 inglês. [sent-126, score-0.131]
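The language names come from CLDR's per-locale XML files; a sketch of extracting them follows. The localeDisplayNames/languages layout is CLDR's documented structure, but the file path and the code-to-synset table are assumptions.

    # Sketch: read localized language names from a CLDR main-locale file,
    # e.g. common/main/pt.xml, and attach them to wordnet synsets.
    import xml.etree.ElementTree as ET

    def cldr_language_names(path):
        root = ET.parse(path).getroot()
        names = {}
        for el in root.findall("./localeDisplayNames/languages/language"):
            if el.get("alt") is None:        # skip short/variant alternates
                names[el.get("type")] = el.text
        return names

    code_to_synset = {"en": "06947032-n"}    # hypothetical ISO-code -> synset map
    names = cldr_language_names("common/main/pt.xml")
    print(names.get("en"))                   # expected: "inglês"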
50 de Melo and Weikum (2009) also use this data (and data from a variety of other sources) to build an enhanced wordnet, in addition adding new synsets for concepts that are not in wordnet. [sent-133, score-0.192]
51 The current version of the parser is capable of extracting headwords, parts of speech, definitions, synonyms and translations from the XML Wiktionary database dumps provided by the Wikimedia Foundation. [sent-151, score-0.143]
52 Within the English Wiktionary, synonyms and translations are both grouped into sense groups that correspond with definitions in the main section. [sent-154, score-0.218]
53 These sense groups are marked by a short text gloss (short gloss), which is usually an abbreviated version of one of the full definitions (full definition). [sent-155, score-0.229]
54 • Finnish: {{t+|fi|sanakirja}} • French: {{t+|fr|dictionnaire|m}} To enable later processing, it is necessary to tie synonyms and translations to their corresponding short gloss via a unique key. [sent-165, score-0.233]
55 Once a link is established between this defkey and a particular synset, translations added to Wiktionary at a later date can be automatically integrated into our multilingual wordnet. [sent-171, score-0.188]
56 Conversely, if a Wiktionary contributor changes a short gloss, historical data connected to the old defkey is preserved while new data imported at a later time will not be incorrectly linked to an older definition. [sent-172, score-0.141]
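The two mechanisms just described, parsing translation templates and keying sense groups, can be sketched as below. The template shapes come from the examples above; the hashing scheme for the defkey is an assumption, since the exact key format is not given here.

    # Sketch: pull (language, lemma) pairs out of wikitext translation
    # templates such as {{t+|fi|sanakirja}} or {{t|fr|dictionnaire|m}},
    # and derive a stable "defkey" for their sense group from its short gloss.
    import hashlib
    import re

    T_TEMPLATE = re.compile(r"\{\{t\+?\|([a-z-]+)\|([^|}]+)")

    def parse_translations(wikitext_line):
        return [(m.group(1), m.group(2).strip())
                for m in T_TEMPLATE.finditer(wikitext_line)]

    def defkey(headword, pos, short_gloss):
        norm = " ".join(short_gloss.lower().split())      # normalize case/whitespace
        digest = hashlib.sha1(norm.encode("utf-8")).hexdigest()[:10]
        return f"{headword}-{pos}-{digest}"

    line = "* Finnish: {{t+|fi|sanakirja}} * French: {{t+|fr|dictionnaire|m}}"
    print(parse_translations(line))   # [('fi', 'sanakirja'), ('fr', 'dictionnaire')]
    print(defkey("dictionary", "n", "A reference work listing words"))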
57 In our study we evaluated the potential for aligning senses based on common translations in combination with monolingual similarity features. [sent-178, score-0.213]
58 In this study we used 20 of the wordnets described in Section 2,9 and the Wiktionary data obtained using the parser described in Section 3. [sent-179, score-0.425]
59 First, article headwords were included as English translations of Wiktionary senses (along with synonyms). [sent-183, score-0.213]
60 4 million sense translations in 20 languages in our wordnet grid, and nearly 1. [sent-186, score-0.496]
61 We then created a list of all possible alignments where at least one translation of a wordnet sense matched a translation of a Wiktionary sense. [sent-188, score-0.364]
62 This represented a small percentage of the possible alignments, because definitions in Wiktionary that do not contain any translations were ignored in our study. [sent-189, score-0.148]
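A minimal sketch of that candidate-generation step, assuming both resources have already been reduced to maps from sense IDs to sets of (language, lemma) translations (the data structures are assumptions):

    # Sketch: propose wordnet/Wiktionary sense pairs sharing >= 1 translation.
    # Wiktionary senses with no translations never produce candidates.
    from collections import defaultdict

    def candidate_alignments(wn_senses, wkt_senses):
        """wn_senses, wkt_senses: {sense_id: set of (lang, lemma) pairs}"""
        index = defaultdict(set)                  # (lang, lemma) -> wordnet sense ids
        for sid, translations in wn_senses.items():
            for t in translations:
                index[t].add(sid)
        candidates = set()
        for kid, translations in wkt_senses.items():
            for t in translations:
                for sid in index.get(t, ()):
                    candidates.add((sid, kid))
        return candidates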
63 We didn’t use Chinese or Polish, as the wordnets were added after we had started the evaluation. [sent-205, score-0.425]
64 sim_t(s_n, s_k) = p^α · |L(s_n) ∩ L(s_k)| / |L(s_n) ∪ L(s_k)| (3); sim_d(s_n, s_k) = BoW(wndef) · BoW(wkdef) / (‖BoW(wndef)‖ · ‖BoW(wkdef)‖) (4); sim_c(s_n, s_k) = sim_t(s_n, s_k) + β · sim_d(s_n, s_k) (5). sim_t gives higher weight to concepts that link through more lemmas, not just a higher proportion of lemmas. [sent-208, score-0.345]
65 simd measures the similarity of the definitions in the two resources, using a cosine similarity score. [sent-209, score-0.14]
66 We initially used the WordNet gloss and example sentence(s) for wndef and the short gloss from Wiktionary for wkdef. [sent-210, score-0.239]
67 This improved the accuracy of the combined ranking score (simc), but since many of the short glosses are only one or two words, the sparse input often produced a simd score of zero even when the candidate alignment was correct. [sent-211, score-0.144]
68 We expect that an improved alignment of short glosses to full definitions together with more accurate measures of lexical similarity such as described by Meyer and Gurevych (201 1) would further improve the accuracy of a combined ranking score. [sent-222, score-0.136]
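A sketch of the ranking scores in equations (3)-(5) as reconstructed above; note that p is taken here to be the number of shared translations, and alpha and beta are illustrative values — both assumptions not fixed by this excerpt.

    # Sketch of sim_t, sim_d and sim_c for one candidate alignment.
    import math
    from collections import Counter

    def sim_t(wn_lemmas, wkt_lemmas, alpha=0.1):
        shared = wn_lemmas & wkt_lemmas
        if not shared:
            return 0.0
        # weight by how many lemmas link, not just by their proportion
        return (len(shared) ** alpha) * len(shared) / len(wn_lemmas | wkt_lemmas)

    def sim_d(wn_def, wkt_def):
        # cosine similarity of bag-of-words vectors of the two definitions
        a, b = Counter(wn_def.lower().split()), Counter(wkt_def.lower().split())
        dot = sum(a[w] * b[w] for w in a)
        norm = (math.sqrt(sum(v * v for v in a.values()))
                * math.sqrt(sum(v * v for v in b.values())))
        return dot / norm if norm else 0.0

    def sim_c(wn_lemmas, wkt_lemmas, wn_def, wkt_def, beta=0.5):
        return sim_t(wn_lemmas, wkt_lemmas) + beta * sim_d(wn_def, wkt_def)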
69 4 Results and Evaluation We give the data for the 26 wordnets with more than 10,000 synsets in Table 2. [sent-243, score-0.554]
70 Individual totals are shown for synsets and senses from the original wordnets, the data extracted from Wiktionary, and the merged data of the wordnets, Wiktionary and CLDR. [sent-245, score-0.26]
71 Overall there are 2,040,805 senses for 117,659 concepts, using over 1,400,000 words in over 1,000 languages. [sent-247, score-0.131]
72 The bigger wordnets show the data from Wiktionary (and to a lesser extent CLDR) having only a small increase in the number of senses. [sent-250, score-0.425]
73 Major languages such as German or Russian, which currently do not have open wordnets, get good coverage as well. [sent-252, score-0.531]
74 The size of the mapping table is the same as the number of English senses linked (49,951 senses). [sent-253, score-0.235]
75 We evaluated a random sample of 160 alignments and found the accuracy to be 90% (Wiktionary sense maps to the best possible wordnet sense). [sent-254, score-0.364]
76 We then evaluated samples of the wordnet created from Wiktionary for several languages. [sent-255, score-0.323]
77 The sense accuracy is higher than the mapping accuracy: in general, entries with more translations are linked more accurately, thus raising the average precision. [sent-259, score-0.227]
78 During the extraction and evaluation … (For Chinese we use the wordnet from Xu et al.) [sent-260, score-0.323]
79 Table 3: Precision of Wiktionary-based Wordnets. ∗ Not used to build the mapping from wordnet to Wiktionary. [sent-264, score-0.351]
80 Each language has up to three files: the data from the wordnet project (if it exists), the data from the CLDR and the data from Wiktionary. [sent-270, score-0.323]
81 They are kept separate in order to keep the licenses as free as possible. [sent-271, score-0.193]
82 5 Discussion and Future Work We have created a large open wordnet of high quality (85%–99% measured on senses). [sent-278, score-0.379]
83 Twenty six languages have more than 10,000 concepts covered, with 42–100% coverage of the most common core concepts. [sent-279, score-0.161]
84 The overall accuracy is estimated at over 94%, as most of the original wordnets are hand verified (and so should be 100% accurate). [sent-281, score-0.425]
85 The high accuracy is largely thanks to the disambiguating power of the multiple translations, made possible by the many open wordnets we have access to. [sent-282, score-0.481]
86 Because we link senses between wordnet and Wiktionary and then use the translations of the sense, manually validating this mapping will improve the entries in multiple languages simultaneously. [sent-283, score-0.614]
87 Because we use a different approach, it would be possible to merge the two if the licenses allowed us to. [sent-289, score-0.193]
88 However, since the CC BY-SA and CC BY-NC-SA licenses are mutually exclusive, the two works cannot be combined and rereleased unless relevant parties can relicense the works. [sent-290, score-0.193]
89 Improvements in either the wordnet projects or Wiktionary (or both) can also result in improved mappings. [sent-296, score-0.401]
90 We further hope to take advantage of ongoing initiatives in the global wordnet grid to add new concepts not in the Princeton WordNet, so that we can expand beyond an English-centered world view. [sent-297, score-0.422]
91 In particular, we make the data easily accessible to the original wordnet projects, some of whom have already started to merge it into their own resources. [sent-299, score-0.363]
92 We cannot check the accuracy of data in all languages, nor, for example, check that synsets have the most appropriate lemmas associated with them. [sent-300, score-0.191]
93 This kind of language specific quality control is best done by the individual wordnet projects. [sent-303, score-0.323]
94 It should be possible to expand the multilingual wordnet in the same way using Wiktionaries in other languages, which we would expect to improve coverage. [sent-311, score-0.399]
95 We consider it important that our wordnet is not just large and accurate but also maintainable and as accessible as possible. [sent-318, score-0.363]
96 6 Conclusions We have created an open multilingual wordnet with over 26 languages. [sent-319, score-0.455]
97 It is made by combining wordnets with open licences, data from the Unicode Common Locale Data Repository and Wiktionary. [sent-320, score-0.481]
98 Towards a universal wordnet by learning from combined evidence. [sent-364, score-0.323]
99 What psycholinguists know about chemistry: Aligning wiktionary and wordnet for increased domain coverage. [sent-436, score-0.842]
100 DanNet the challenge of compiling a wordnet for Danish by reusing a monolingual dictionary. [sent-467, score-0.323]
wordName wordTfidf (topN-words)
[('wiktionary', 0.519), ('wordnets', 0.425), ('wordnet', 0.323), ('licenses', 0.193), ('senses', 0.131), ('synsets', 0.129), ('cc', 0.123), ('cldr', 0.119), ('locale', 0.104), ('unicode', 0.104), ('bond', 0.097), ('pwn', 0.092), ('simc', 0.089), ('gloss', 0.087), ('vossen', 0.083), ('translations', 0.082), ('projects', 0.078), ('license', 0.078), ('multilingual', 0.076), ('linked', 0.076), ('simd', 0.074), ('definitions', 0.066), ('concepts', 0.063), ('linking', 0.062), ('lemmas', 0.062), ('melo', 0.062), ('released', 0.061), ('princeton', 0.061), ('paik', 0.059), ('open', 0.056), ('francis', 0.055), ('simt', 0.052), ('sima', 0.052), ('fellbaum', 0.05), ('languages', 0.05), ('japanese', 0.049), ('legal', 0.049), ('gwc', 0.048), ('persian', 0.048), ('core', 0.048), ('thai', 0.046), ('charoenporn', 0.044), ('hirfana', 0.044), ('nurril', 0.044), ('sime', 0.044), ('piek', 0.041), ('sense', 0.041), ('accessible', 0.04), ('synset', 0.04), ('mccrae', 0.039), ('redistribute', 0.039), ('uwn', 0.039), ('weikum', 0.037), ('meyer', 0.037), ('polish', 0.037), ('formatting', 0.036), ('indonesian', 0.036), ('pianta', 0.036), ('euro', 0.036), ('grid', 0.036), ('glosses', 0.035), ('repository', 0.035), ('short', 0.035), ('hitoshi', 0.035), ('sagot', 0.034), ('norwegian', 0.034), ('isahara', 0.034), ('anyone', 0.032), ('database', 0.032), ('url', 0.031), ('basque', 0.03), ('french', 0.03), ('format', 0.03), ('balkanet', 0.03), ('chumpol', 0.03), ('daude', 0.03), ('defkey', 0.03), ('dogn', 0.03), ('eus', 0.03), ('gfdl', 0.03), ('glg', 0.03), ('hanoka', 0.03), ('jpn', 0.03), ('kyonghee', 0.03), ('licences', 0.03), ('mokarat', 0.03), ('montazery', 0.03), ('quah', 0.03), ('sskk', 0.03), ('ssnn', 0.03), ('thatsanee', 0.03), ('thoongsup', 0.03), ('virach', 0.03), ('wndef', 0.03), ('wordneto', 0.03), ('wordnetu', 0.03), ('zsm', 0.03), ('synonyms', 0.029), ('arabic', 0.029), ('mapping', 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999952 234 acl-2013-Linking and Extending an Open Multilingual Wordnet
Author: Francis Bond ; Ryan Foster
Abstract: We create an open multilingual wordnet with large wordnets for over 26 languages and smaller ones for 57 languages. It is made by combining wordnets with open licences, data from Wiktionary and the Unicode Common Locale Data Repository. Overall there are over 2 million senses for over 100 thousand concepts, linking over 1.4 million words in hundreds of languages.
2 0.34295231 162 acl-2013-FrameNet on the Way to Babel: Creating a Bilingual FrameNet Using Wiktionary as Interlingual Connection
Author: Silvana Hartmann ; Iryna Gurevych
Abstract: We present a new bilingual FrameNet lexicon for English and German. It is created through a simple, but powerful approach to construct a FrameNet in any language using Wiktionary as an interlingual representation. Our approach is based on a sense alignment of FrameNet and Wiktionary, and subsequent translation disambiguation into the target language. We perform a detailed evaluation of the created resource and a discussion of Wiktionary as an interlingual connection for the cross-language transfer of lexicalsemantic resources. The created resource is publicly available at http : / /www . ukp .tu-darmst adt .de / fnwkde / .
3 0.17848065 198 acl-2013-IndoNet: A Multilingual Lexical Knowledge Network for Indian Languages
Author: Brijesh Bhatt ; Lahari Poddar ; Pushpak Bhattacharyya
Abstract: We present IndoNet, a multilingual lexical knowledge base for Indian languages. It is a linked structure of wordnets of 18 different Indian languages, Universal Word dictionary and the Suggested Upper Merged Ontology (SUMO). We discuss various benefits of the network and challenges involved in the development. The system is encoded in Lexical Markup Framework (LMF) and we propose modifications in LMF to accommodate Universal Word Dictionary and SUMO. This standardized version of lexical knowledge base of Indian Languages can now easily , be linked to similar global resources.
4 0.16813271 43 acl-2013-Align, Disambiguate and Walk: A Unified Approach for Measuring Semantic Similarity
Author: Mohammad Taher Pilehvar ; David Jurgens ; Roberto Navigli
Abstract: Semantic similarity is an essential component of many Natural Language Processing applications. However, prior methods for computing semantic similarity often operate at different levels, e.g., single words or entire documents, which requires adapting the method for each data type. We present a unified approach to semantic similarity that operates at multiple levels, all the way from comparing word senses to comparing text documents. Our method leverages a common probabilistic representation over word senses in order to compare different types of linguistic data. This unified representation shows state-ofthe-art performance on three tasks: seman- tic textual similarity, word similarity, and word sense coarsening.
5 0.11547811 262 acl-2013-Offspring from Reproduction Problems: What Replication Failure Teaches Us
Author: Antske Fokkens ; Marieke van Erp ; Marten Postma ; Ted Pedersen ; Piek Vossen ; Nuno Freire
Abstract: Repeating experiments is an important instrument in the scientific toolbox to validate previous work and build upon existing work. We present two concrete use cases involving key techniques in the NLP domain for which we show that reproducing results is still difficult. We show that the deviation that can be found in reproduction efforts leads to questions about how our results should be interpreted. Moreover, investigating these deviations provides new insights and a deeper understanding of the examined techniques. We identify five aspects that can influence the outcomes of experiments that are typically not addressed in research papers. Our use cases show that these aspects may change the answer to research questions leading us to conclude that more care should be taken in interpreting our results and more research involving systematic testing of methods is required in our field.
6 0.11063398 96 acl-2013-Creating Similarity: Lateral Thinking for Vertical Similarity Judgments
7 0.10845632 306 acl-2013-SPred: Large-scale Harvesting of Semantic Predicates
8 0.10579256 345 acl-2013-The Haves and the Have-Nots: Leveraging Unlabelled Corpora for Sentiment Analysis
9 0.087576136 105 acl-2013-DKPro WSD: A Generalized UIMA-based Framework for Word Sense Disambiguation
10 0.085885249 258 acl-2013-Neighbors Help: Bilingual Unsupervised WSD Using Context
11 0.082394943 93 acl-2013-Context Vector Disambiguation for Bilingual Lexicon Extraction from Comparable Corpora
12 0.074200168 92 acl-2013-Context-Dependent Multilingual Lexical Lookup for Under-Resourced Languages
13 0.074094959 366 acl-2013-Understanding Verbs based on Overlapping Verbs Senses
14 0.073369764 9 acl-2013-A Lightweight and High Performance Monolingual Word Aligner
15 0.066833287 291 acl-2013-Question Answering Using Enhanced Lexical Semantic Models
16 0.061590038 111 acl-2013-Density Maximization in Context-Sense Metric Space for All-words WSD
17 0.0614691 316 acl-2013-SenseSpotting: Never let your parallel data tie you to an old domain
18 0.061039079 116 acl-2013-Detecting Metaphor by Contextual Analogy
19 0.05960374 290 acl-2013-Question Analysis for Polish Question Answering
20 0.059415646 368 acl-2013-Universal Dependency Annotation for Multilingual Parsing
topicId topicWeight
[(0, 0.161), (1, 0.023), (2, 0.047), (3, -0.107), (4, -0.036), (5, -0.13), (6, -0.16), (7, 0.055), (8, 0.124), (9, -0.079), (10, -0.003), (11, 0.02), (12, -0.114), (13, -0.016), (14, 0.1), (15, -0.012), (16, 0.048), (17, 0.015), (18, -0.024), (19, -0.01), (20, -0.05), (21, -0.043), (22, -0.03), (23, -0.022), (24, 0.05), (25, -0.074), (26, -0.035), (27, -0.058), (28, 0.065), (29, 0.0), (30, 0.054), (31, 0.033), (32, 0.057), (33, 0.026), (34, 0.001), (35, -0.115), (36, -0.012), (37, 0.047), (38, 0.102), (39, -0.001), (40, -0.057), (41, 0.021), (42, -0.031), (43, 0.02), (44, -0.048), (45, -0.01), (46, -0.064), (47, -0.135), (48, -0.046), (49, 0.001)]
simIndex simValue paperId paperTitle
same-paper 1 0.95167381 234 acl-2013-Linking and Extending an Open Multilingual Wordnet
Author: Francis Bond ; Ryan Foster
Abstract: We create an open multilingual wordnet with large wordnets for over 26 languages and smaller ones for 57 languages. It is made by combining wordnets with open licences, data from Wiktionary and the Unicode Common Locale Data Repository. Overall there are over 2 million senses for over 100 thousand concepts, linking over 1.4 million words in hundreds of languages.
2 0.86557192 162 acl-2013-FrameNet on the Way to Babel: Creating a Bilingual FrameNet Using Wiktionary as Interlingual Connection
Author: Silvana Hartmann ; Iryna Gurevych
Abstract: We present a new bilingual FrameNet lexicon for English and German. It is created through a simple, but powerful approach to construct a FrameNet in any language using Wiktionary as an interlingual representation. Our approach is based on a sense alignment of FrameNet and Wiktionary, and subsequent translation disambiguation into the target language. We perform a detailed evaluation of the created resource and a discussion of Wiktionary as an interlingual connection for the cross-language transfer of lexicalsemantic resources. The created resource is publicly available at http : / /www . ukp .tu-darmst adt .de / fnwkde / .
3 0.82400894 198 acl-2013-IndoNet: A Multilingual Lexical Knowledge Network for Indian Languages
Author: Brijesh Bhatt ; Lahari Poddar ; Pushpak Bhattacharyya
Abstract: We present IndoNet, a multilingual lexical knowledge base for Indian languages. It is a linked structure of wordnets of 18 different Indian languages, Universal Word dictionary and the Suggested Upper Merged Ontology (SUMO). We discuss various benefits of the network and challenges involved in the development. The system is encoded in Lexical Markup Framework (LMF) and we propose modifications in LMF to accommodate Universal Word Dictionary and SUMO. This standardized version of lexical knowledge base of Indian Languages can now easily , be linked to similar global resources.
4 0.65514761 43 acl-2013-Align, Disambiguate and Walk: A Unified Approach for Measuring Semantic Similarity
Author: Mohammad Taher Pilehvar ; David Jurgens ; Roberto Navigli
Abstract: Semantic similarity is an essential component of many Natural Language Processing applications. However, prior methods for computing semantic similarity often operate at different levels, e.g., single words or entire documents, which requires adapting the method for each data type. We present a unified approach to semantic similarity that operates at multiple levels, all the way from comparing word senses to comparing text documents. Our method leverages a common probabilistic representation over word senses in order to compare different types of linguistic data. This unified representation shows state-ofthe-art performance on three tasks: seman- tic textual similarity, word similarity, and word sense coarsening.
5 0.63313538 258 acl-2013-Neighbors Help: Bilingual Unsupervised WSD Using Context
Author: Sudha Bhingardive ; Samiulla Shaikh ; Pushpak Bhattacharyya
Abstract: Word Sense Disambiguation (WSD) is one of the toughest problems in NLP, and in WSD, verb disambiguation has proved to be extremely difficult, because of high degree of polysemy, too fine grained senses, absence of deep verb hierarchy and low inter annotator agreement in verb sense annotation. Unsupervised WSD has received widespread attention, but has performed poorly, specially on verbs. Recently an unsupervised bilingual EM based algorithm has been proposed, which makes use only of the raw counts of the translations in comparable corpora (Marathi and Hindi). But the performance of this approach is poor on verbs with accuracy level at 25-38%. We suggest a modifica- tion to this mentioned formulation, using context and semantic relatedness of neighboring words. An improvement of 17% 35% in the accuracy of verb WSD is obtained compared to the existing EM based approach. On a general note, the work can be looked upon as contributing to the framework of unsupervised WSD through context aware expectation maximization.
6 0.62453651 366 acl-2013-Understanding Verbs based on Overlapping Verbs Senses
7 0.62162095 105 acl-2013-DKPro WSD: A Generalized UIMA-based Framework for Word Sense Disambiguation
8 0.5537712 262 acl-2013-Offspring from Reproduction Problems: What Replication Failure Teaches Us
9 0.55352902 53 acl-2013-Annotation of regular polysemy and underspecification
10 0.54927349 96 acl-2013-Creating Similarity: Lateral Thinking for Vertical Similarity Judgments
11 0.54131567 111 acl-2013-Density Maximization in Context-Sense Metric Space for All-words WSD
12 0.53384513 344 acl-2013-The Effects of Lexical Resource Quality on Preference Violation Detection
13 0.49929896 306 acl-2013-SPred: Large-scale Harvesting of Semantic Predicates
14 0.49112993 92 acl-2013-Context-Dependent Multilingual Lexical Lookup for Under-Resourced Languages
15 0.4699353 316 acl-2013-SenseSpotting: Never let your parallel data tie you to an old domain
16 0.46220437 62 acl-2013-Automatic Term Ambiguity Detection
17 0.44672993 93 acl-2013-Context Vector Disambiguation for Bilingual Lexicon Extraction from Comparable Corpora
18 0.43909958 12 acl-2013-A New Set of Norms for Semantic Relatedness Measures
19 0.43286562 116 acl-2013-Detecting Metaphor by Contextual Analogy
20 0.43226406 3 acl-2013-A Comparison of Techniques to Automatically Identify Complex Words.
topicId topicWeight
[(0, 0.076), (6, 0.023), (11, 0.049), (15, 0.019), (24, 0.044), (26, 0.07), (31, 0.285), (35, 0.069), (42, 0.047), (48, 0.052), (70, 0.031), (88, 0.03), (90, 0.016), (95, 0.098)]
simIndex simValue paperId paperTitle
1 0.86215639 313 acl-2013-Semantic Parsing with Combinatory Categorial Grammars
Author: Yoav Artzi ; Nicholas FitzGerald ; Luke Zettlemoyer
Abstract: unkown-abstract
2 0.86197937 48 acl-2013-An Open Source Toolkit for Quantitative Historical Linguistics
Author: Johann-Mattis List ; Steven Moran
Abstract: Given the increasing interest and development of computational and quantitative methods in historical linguistics, it is important that scholars have a basis for documenting, testing, evaluating, and sharing complex workflows. We present a novel open-source toolkit for quantitative tasks in historical linguistics that offers these features. This toolkit also serves as an interface between existing software packages and frequently used data formats, and it provides implementations of new and existing algorithms within a homogeneous framework. We illustrate the toolkit’s functionality with an exemplary workflow that starts with raw language data and ends with automatically calculated phonetic alignments, cognates and borrowings. We then illustrate evaluation metrics on gold standard datasets that are provided with the toolkit.
same-paper 3 0.79510695 234 acl-2013-Linking and Extending an Open Multilingual Wordnet
Author: Francis Bond ; Ryan Foster
Abstract: We create an open multilingual wordnet with large wordnets for over 26 languages and smaller ones for 57 languages. It is made by combining wordnets with open licences, data from Wiktionary and the Unicode Common Locale Data Repository. Overall there are over 2 million senses for over 100 thousand concepts, linking over 1.4 million words in hundreds of languages.
4 0.75182545 367 acl-2013-Universal Conceptual Cognitive Annotation (UCCA)
Author: Omri Abend ; Ari Rappoport
Abstract: Syntactic structures, by their nature, reflect first and foremost the formal constructions used for expressing meanings. This renders them sensitive to formal variation both within and across languages, and limits their value to semantic applications. We present UCCA, a novel multi-layered framework for semantic representation that aims to accommodate the semantic distinctions expressed through linguistic utterances. We demonstrate UCCA’s portability across domains and languages, and its relative insensitivity to meaning-preserving syntactic variation. We also show that UCCA can be effectively and quickly learned by annotators with no linguistic background, and describe the compilation of a UCCAannotated corpus.
5 0.6860866 211 acl-2013-LABR: A Large Scale Arabic Book Reviews Dataset
Author: Mohamed Aly ; Amir Atiya
Abstract: We introduce LABR, the largest sentiment analysis dataset to-date for the Arabic language. It consists of over 63,000 book reviews, each rated on a scale of 1 to 5 stars. We investigate the properties of the the dataset, and present its statistics. We explore using the dataset for two tasks: sentiment polarity classification and rating classification. We provide standard splits of the dataset into training and testing, for both polarity and rating classification, in both balanced and unbalanced settings. We run baseline experiments on the dataset to establish a benchmark.
6 0.62323624 374 acl-2013-Using Context Vectors in Improving a Machine Translation System with Bridge Language
7 0.56651568 198 acl-2013-IndoNet: A Multilingual Lexical Knowledge Network for Indian Languages
8 0.54341137 316 acl-2013-SenseSpotting: Never let your parallel data tie you to an old domain
9 0.54230028 81 acl-2013-Co-Regression for Cross-Language Review Rating Prediction
10 0.54221594 164 acl-2013-FudanNLP: A Toolkit for Chinese Natural Language Processing
11 0.54025257 131 acl-2013-Dual Training and Dual Prediction for Polarity Classification
12 0.54005587 267 acl-2013-PARMA: A Predicate Argument Aligner
13 0.5395667 174 acl-2013-Graph Propagation for Paraphrasing Out-of-Vocabulary Words in Statistical Machine Translation
14 0.53922933 70 acl-2013-Bilingually-Guided Monolingual Dependency Grammar Induction
15 0.53910351 187 acl-2013-Identifying Opinion Subgroups in Arabic Online Discussions
16 0.53893101 117 acl-2013-Detecting Turnarounds in Sentiment Analysis: Thwarting
17 0.53837293 97 acl-2013-Cross-lingual Projections between Languages from Different Families
18 0.53748769 9 acl-2013-A Lightweight and High Performance Monolingual Word Aligner
19 0.53661358 134 acl-2013-Embedding Semantic Similarity in Tree Kernels for Domain Adaptation of Relation Extraction
20 0.53578103 326 acl-2013-Social Text Normalization using Contextual Graph Random Walks