acl acl2011 acl2011-229 knowledge-graph by maker-knowledge-mining

229 acl-2011-NULEX: An Open-License Broad Coverage Lexicon


Source: pdf

Author: Clifton McFate ; Kenneth Forbus

Abstract: Broad coverage lexicons for the English language have traditionally been handmade. This approach, while accurate, requires too much human labor. Furthermore, resources contain gaps in coverage, contain specific types of information, or are incompatible with other resources. We believe that the state of open-license technology is such that a comprehensive syntactic lexicon can be automatically compiled. This paper describes the creation of such a lexicon, NU-LEX, an open-license feature-based lexicon for general purpose parsing that combines WordNet, VerbNet, and Wiktionary and contains over 100,000 words. NU-LEX was integrated into a bottom up chart parser. We ran the parser through three sets of sentences, 50 sentences total, from the Simple English Wikipedia and compared its performance to the same parser using Comlex. Both parsers performed almost equally with NU-LEX finding all lex-items for 50% of the sentences and Comlex succeeding for 52%. Furthermore, NULEX’s shortcomings primarily fell into two categories, suggesting future research directions. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: Broad coverage lexicons for the English language have traditionally been handmade. [sent-5, score-0.057]

2 Furthermore, resources contain gaps in coverage, contain specific types of information, or are incompatible with other resources. [sent-7, score-0.08]

3 We believe that the state of open-license technology is such that a comprehensive syntactic lexicon can be automatically compiled. [sent-8, score-0.129]

4 This paper describes the creation of such a lexicon, NU-LEX, an open-license feature-based lexicon for general purpose parsing that combines WordNet, VerbNet, and Wiktionary and contains over 100,000 words. [sent-9, score-0.096]

5 NU-LEX was integrated into a bottom-up chart parser. [sent-10, score-0.036]

6 We ran the parser through three sets of sentences, 50 sentences total, from the Simple English Wikipedia and compared its performance to the same parser using Comlex. [sent-11, score-0.156]

7 Both parsers performed almost equally with NU-LEX finding all lex-items for 50% of the sentences and Comlex succeeding for 52%. [sent-12, score-0.072]

8 1 Introduction: While there are many types of parsers available, all of them rely on a lexicon of words, whether syntactic like Comlex, enriched with semantics like WordNet, or derived from tagged corpora like the Penn Treebank (Macleod et al, 1994; Fellbaum, 1998; Marcus et al, 1993). [sent-14, score-0.194]

9 However, many of these resources have gaps that the others can fill in. [sent-18, score-0.08]

10 WordNet, for example, only contains open-class words, and it lacks the extensive subcategorization frame and agreement information present in Comlex (Miller et al, 1993; Macleod et al, 1994). [sent-19, score-0.152]

11 Furthermore, many of these resources do not map to one another or have restricted licenses. [sent-21, score-0.08]

12 The goal of our research was to create a syntactic lexicon, like Comlex, that unified multiple existing open-source resources including Fellbaum’s (1998) WordNet, Kipper et al’s (2000) VerbNet, and Wiktionary. [sent-22, score-0.151]

13 Furthermore, we wanted it to have direct links to frame semantic representations via the openlicense OpenCyc knowledge base. [sent-23, score-0.112]

14 The result was NU-LEX, a lexicon of over 100,000 words that has the coverage of WordNet, is enriched with tense information automatically screen-scraped from Wiktionary, and contains VerbNet subcategorization frames. [sent-24, score-0.311]

15 This lexicon was incorporated into a bottom-up chart parser, EANLU, that connects the words to Cyc representations (Tomai & Forbus 2009). [sent-25, score-0.164]

16 Each entry is represented by Cyc assertions and contains syntactic information as a set of features consistent with previous feature systems (Allen 1995; Macleod et al, 1994). [sent-26, score-0.033]

17 It represents words in feature value lists that contain lexical data such as part of speech, agreement information, and syntactic frame participation (Macleod et al, 1994). [sent-32, score-0.071]
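As a rough sketch (not code from the paper), such an entry can be modeled as a feature-value mapping in Python; the field names mirror the features visible in the example definitions later on this page (orth, root, agr, synset), while the dict container and the agreement check are our assumptions:

    # Hypothetical model of a NU-LEX-style feature-value entry.
    boat_entry = {
        "orth": "boat",        # surface form
        "pos": "noun",
        "root": "boat",        # base lemma
        "agr": "3s",           # agreement: third person singular
        "countable": True,
        "synset": ["boat%1:06:01::", "boat%1:06:00::"],  # WordNet sense keys
    }

    def agrees(np_entry, verb_entry):
        # The kind of agreement test a feature-based parser applies
        # when combining a subject NP with its verb.
        return np_entry["agr"] == verb_entry["agr"]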

18 Furthermore, Comlex has extensive mappings to, and uses representations compatible with, multiple lexical resources (Macleod et al, 1994). [sent-33, score-0.143]

19 Attempts to automatically create syntactic lexical resources from tagged corpora have also been successful. [sent-34, score-0.113]

20 These resources have been successfully incorporated into statistical parsers such as the Apple Pie parser (Sekine & Grishman, 1995). [sent-36, score-0.173]

21 Unfortunately, they still require extensive labor to do the annotations. [sent-37, score-0.031]

22 NU-LEX is different in that it is automatically compiled without relying on a hand-annotated corpus. [sent-38, score-0.042]

23 Instead, it combines crowd-sourced data, Wiktionary, with existing lexical resources. [sent-39, score-0.038]

24 This research was possible because of the existing lexical resources WordNet and VerbNet. [sent-40, score-0.118]

25 WordNet is a virtual thesaurus that groups words together by semantic similarity into synsets representing a lexical concept (Fellbaum, 1998). [sent-41, score-0.033]

26 VerbNet is an extension of Levin’s (1993) verb class research. [sent-42, score-0.093]

27 It represents verb meaning in a class hierarchy where each verb in a class has similar semantic meanings and identical syntactic usages (Kipper et al, 2000). [sent-43, score-0.219]

28 These two resources have already been mapped, which facilitated applying subcategorization frames to WordNet verbs. [sent-45, score-0.222]

29 OpenCyc is an open-source version of the ResearchCyc knowledge base that contains hierarchical definitional information but is missing much of the lower level instantiated facts and linguistic knowledge of ResearchCyc (Matuszek et al, 2006). [sent-47, score-0.139]

30 Previous research by McFate (2010) used these links and VerbNet hierarchies to create verb semantic frames which are used in EANLU, the parser NU-LEX was tested on. [sent-48, score-0.211]

31 3.1 Nouns: Noun lemmas were initially taken from Fellbaum’s (1998) WordNet index. [sent-54, score-0.066]

32 Each lemma was then queried in Wiktionary to retrieve its plural form, resulting in a triple of word, POS, and plural form: (boat Noun (("plural" "boats"))). This was used to create a definition for each form. [sent-55, score-0.126]
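A minimal sketch of that lookup, assuming the standard MediaWiki API on en.wiktionary.org and the {{en-noun}} headword template; the wikitext handling below is a deliberate simplification and an assumption about template usage, so real scraping code would need fuller template logic:

    import re
    import requests

    def plural_from_wiktionary(lemma):
        # Fetch the raw wikitext of the Wiktionary entry via the MediaWiki API.
        resp = requests.get(
            "https://en.wiktionary.org/w/api.php",
            params={"action": "query", "titles": lemma,
                    "prop": "revisions", "rvprop": "content",
                    "rvslots": "main", "format": "json"},
        )
        page = next(iter(resp.json()["query"]["pages"].values()))
        wikitext = page["revisions"][0]["slots"]["main"]["*"]
        # {{en-noun}} with no argument marks a regular plural in -s;
        # treating any explicit argument as the full plural is a simplification.
        m = re.search(r"\{\{en-noun(?:\|([^|}]+))?\}\}", wikitext)
        if m is None:
            return None
        return m.group(1) or lemma + "s"

    print(("boat", "Noun", (("plural", plural_from_wiktionary("boat")),)))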

33 (definitionInDictionary WordNet "Boat" (boat (noun (synset ("boat%1:06:01::" "boat%1:06:00::")) (orth "boat") (countable +) (root boat) (agr 3s)))) [sent-57, score-0.508]
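To make the entry format concrete, here is a small s-expression reader (written for this page, not taken from the paper) that loads a definition like the one above into nested Python lists:

    import re

    def parse_sexpr(text):
        # Tokenize into quoted strings, parentheses, and bare atoms.
        tokens = re.findall(r'"[^"]*"|[()]|[^\s()]+', text)
        pos = 0

        def read():
            nonlocal pos
            tok = tokens[pos]
            pos += 1
            if tok == "(":
                items = []
                while tokens[pos] != ")":
                    items.append(read())
                pos += 1  # consume the closing ")"
                return items
            return tok.strip('"')

        return read()

    entry = parse_sexpr('(definitionInDictionary WordNet "Boat" '
                        '(boat (noun (orth "boat") (agr 3s))))')
    print(entry[3])  # ['boat', ['noun', ['orth', 'boat'], ['agr', '3s']]]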

34 2 "Boat " Verbs Like Nouns, verb base lemmas were taken from the WordNet index. [sent-58, score-0.159]

35 The subcategorization frames for a verb were taken directly from VerbNet. [sent-60, score-0.267]

36 (definitionInDictionary WordNet "Give" (give (verb (synset ("give%2:41:10::" "give%2:34:00::")) (orth "give") (vform pres) (subcat (? [sent-62, score-0.351]
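For readers reproducing this step today, NLTK's VerbNet corpus reader exposes the same class and frame data; this is a stand-in sketch rather than the authors' pipeline (the paper predates this interface):

    # Requires: nltk.download('verbnet')
    from nltk.corpus import verbnet

    for classid in verbnet.classids(lemma="give"):      # e.g. 'give-13.1'
        vnclass = verbnet.vnclass(classid)              # VerbNet class as XML
        for frame in vnclass.findall("FRAMES/FRAME"):
            # 'primary' holds the surface subcategorization pattern.
            print(classid, frame.find("DESCRIPTION").get("primary"))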

37 3.3 Adjectives and Adverbs: Adjectives and adverbs were simply taken from WordNet. [sent-71, score-0.076]

38 No information from Wiktionary was added for this version of NU-LEX, so it does not include comparative or superlative forms. [sent-72, score-0.03]

39 This will be added in future iterations by using Wiktionary. [sent-73, score-0.03]

40 The lack of comparatives and superlatives caused no errors. [sent-74, score-0.156]

41 Each definition contains the word, POS, and synset list: (definitionInDictionary WordNet "Funny" (funny (adjective (root funny) (orth "funny") (synset ("funny%4:02:01::" "funny%4:02:00::"))))) [sent-75, score-0.355]

42 Likewise, Be-verbs had to be manually added as the Wiktionary page proved too difficult to parse. [sent-78, score-0.03]

43 Notably, proper names and cardinal numbers are missing from NU-LEX. [sent-80, score-0.424]

44 4 Experiment Setup: The sample sentences consisted of 50 sentences from the Simple English Wikipedia articles on the heart, lungs, and George Washington. [sent-83, score-0.09]

45 The heart set consisted of the first 25 sentences of the article, not counting parentheticals. [sent-84, score-0.221]

46 The lungs set consisted of the first 13 sentences of the article. [sent-85, score-0.205]

47 The George Washington set consisted of the first 12 sentences of that article. [sent-86, score-0.09]

48 There were 239 unique words in the whole set out of 599 words total. [sent-88, score-0.036]

49 EANLU is a bottom-up chart parser that uses compositional semantics to translate natural language into Cyc predicate calculus representations (Tomai & Forbus 2009). [sent-90, score-0.127]
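EANLU itself is not reproduced here, but the mechanism the experiment relies on, a bottom-up chart parser that checks lexical features such as agreement, can be illustrated with NLTK's feature chart parser; the toy grammar and its AGR values are hypothetical:

    from nltk.grammar import FeatureGrammar
    from nltk.parse import FeatureChartParser  # bottom-up left-corner by default

    grammar = FeatureGrammar.fromstring("""
    S -> NP[AGR=?a] VP[AGR=?a]
    NP[AGR=?a] -> Det N[AGR=?a]
    VP[AGR=?a] -> V[AGR=?a]
    Det -> 'the'
    N[AGR='3s'] -> 'boat'
    V[AGR='3s'] -> 'floats'
    """)

    parser = FeatureChartParser(grammar)
    for tree in parser.parse("the boat floats".split()):
        print(tree)  # succeeds only when NP and VP agreement unifies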

50 Each sentence was evaluated as correct based on whether or not it returned the proper word forms. [sent-95, score-0.114]

51 Failure occurred if any lex-item was not retrieved or if the parser was unable to parse the sentence due to system memory constraints. [sent-97, score-0.099]

52 5 Results: Can NU-LEX perform comparably to existing syntactic resources despite being automatically compiled from multiple resources? [sent-98, score-0.193]

53 In particular we wanted to uncover words that disappeared or were represented incorrectly as a result of the screen-scraping process. [sent-101, score-0.042]

54 NU-LEX got 25 of the 50 sentences (50%) correct and Comlex got 26 of 50 (52%) correct. [sent-103, score-0.15]

55 The two systems made many of the same errors, and a primary source of errors was the lack of proper nouns in either resource. [sent-104, score-0.284]

56 Proper nouns caused seven sentences to fail in both parsers, or 29% of total errors. [sent-105, score-0.285]

57 Of the NU-LEX failures not caused by proper nouns, five (20%) were caused by the lack of cardinal numbers. [sent-106, score-0.56]

58 The rest were due to missing lex-items across several categories. [sent-107, score-0.139]

59 Comlex primarily failed due to missing medical terminology in the lungs and heart test set. [sent-108, score-0.52]

60 Out of the total of 239 unique words, NU-LEX failed on 11, not counting proper nouns or cardinal numbers. [sent-109, score-0.564]

61 One additional failure was due to the missing pronoun "themselves", which was retroactively added to the hand-created pronoun section. [sent-110, score-0.226]

62 Comlex failed on 6 unique words, not counting proper nouns, giving it a failure rate of 2.5% (6 of 239 unique words; NU-LEX's 11 misses correspond to 4.6%). [sent-113, score-0.35]

63 5.1 The Heart: For the heart set, 25 sentences were run through the parser. [sent-116, score-0.134]

64 Using NU-LEX, the system correctly identified the lex-items for 17 out of 25 sentences (68%). [sent-117, score-0.038]

65 Of the sentences it did not get correct, five were incorrect only because of the lack of cardinal number representation. [sent-118, score-0.216]

66 Using Comlex, the parser correctly identified all lex-items for 16 out of 25 sentences (64%). [sent-120, score-0.097]

67 The sentences it got wrong all failed because of missing medical terms. [sent-121, score-0.403]

68 In particular, atrium and vena cava caused lexical errors. [sent-122, score-0.118]

69 5.2 The Lungs: For the lungs set, 13 sentences were run through the parser. [sent-124, score-0.038]

70 Using NU-LEX the system correctly identified all lex-items for 6 out of 13 sentences (46%). [sent-125, score-0.038]

71 Two errors were caused by the lack of cardinal number representation and one sentence failed due to memory constraints. [sent-126, score-0.481]

72 One sentence failed because of the medical-specific term parabronchi. [sent-127, score-0.17]

73 Four additional errors were due to malformed verb definitions and missing lex-items lost during screen scraping. [sent-128, score-0.327]

74 Using Comlex the parser correctly identified all lex-items for 7 out of 13 sentences (53%). [sent-129, score-0.097]

75 Five failures were caused by missing lex-items, namely medical terminology like alveoli and parabronchi. [sent-130, score-0.389]

76 5.3 George Washington: For the George Washington set, 12 sentences were run through the parser. [sent-133, score-0.038]

77 This was a set that we expected to cause problems for NU-LEX and Comlex because of the lack of proper noun representation. [sent-134, score-0.181]

78 NU-LEX got only 2 out of 12 correct, and seven of the errors were caused by proper nouns such as George Washington. [sent-135, score-0.42]

79 All but one of the Comlex errors were caused by missing proper nouns. [sent-137, score-0.408]

80 6 Discussion: NU-LEX is unique in that it is a syntactic lexicon automatically compiled from several open-source resources and a crowd-sourced website. [sent-138, score-0.287]

81 We’ve demonstrated that its performance is on par with existing state-of-the-art resources like Comlex. [sent-140, score-0.118]

82 Because it scrapes Wiktionary for tense information, NU-LEX can constantly evolve to include new forms or corrections. [sent-142, score-0.044]

83 As its coverage (over 100,000 words) is derived from Fellbaum’s (1998) WordNet, it is also significantly larger than existing similar syntactic resources. [sent-143, score-0.128]

84 The majority of errors in the experiments were caused by either missing numbers or missing proper nouns. [sent-146, score-0.578]

85 Cardinal numbers could be easily added to improve performance. [sent-147, score-0.061]

86 Furthermore, solutions to missing numbers could be created on the grammar side of the process. [sent-148, score-0.17]

87 Missing proper nouns represent both a gap and an opportunity. [sent-149, score-0.209]

88 Because the lexicon is Cyc compliant, other options could include querying the Cyc KB for people and then explicitly representing the examples as definitions. [sent-151, score-0.096]

89 With proper noun and number coverage, total failures would have been reduced by 48%. [sent-154, score-0.213]
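To spell out that figure: NU-LEX failed on 25 of the 50 sentences, and the seven proper-noun failures plus the five cardinal-number failures account for 12 of those 25, i.e. 12/25 = 48%.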

90 Thus, simple automated additions in the future can greatly enhance performance. [sent-155, score-0.042]

91 Errors caused by missing or malformed definitions were not abundant, showing up in only 12 of the 50 parses and accounting for under half of the total errors. [sent-156, score-0.315]

92 Because it is CycL-compliant, the entire lexicon can be formally represented in the Cyc knowledge base (Matuszek et al, 2006). [sent-160, score-0.154]

93 When partnered with the EANLU parser and McFate’s (2010) OpenCyc verb frames, the result is a semantic parser that uses completely open-license resources. [sent-163, score-0.211]

94 It is our hope that NU-LEX will provide a powerful tool for the natural language community both on its own and combined with existing resources. [sent-164, score-0.038]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('comlex', 0.461), ('cyc', 0.254), ('wiktionary', 0.234), ('eanlu', 0.202), ('wordnet', 0.19), ('boat', 0.178), ('macleod', 0.164), ('funny', 0.152), ('al', 0.141), ('cardinal', 0.14), ('missing', 0.139), ('forbus', 0.127), ('verbnet', 0.12), ('caused', 0.118), ('kipper', 0.117), ('lungs', 0.115), ('mcfate', 0.115), ('nulex', 0.115), ('proper', 0.114), ('failed', 0.108), ('heart', 0.096), ('lexicon', 0.096), ('nouns', 0.095), ('verb', 0.093), ('ctionary', 0.086), ('matuszek', 0.086), ('opencyc', 0.086), ('researchcyc', 0.086), ('tomai', 0.086), ('subcategorization', 0.083), ('resources', 0.08), ('fellbaum', 0.079), ('orth', 0.076), ('george', 0.072), ('failures', 0.07), ('medical', 0.062), ('levin', 0.06), ('frames', 0.059), ('parser', 0.059), ('miller', 0.058), ('ciple', 0.058), ('compliant', 0.058), ('cycl', 0.058), ('evanston', 0.058), ('felbaum', 0.058), ('finitionindi', 0.058), ('imple', 0.058), ('malformed', 0.058), ('northwestern', 0.058), ('parti', 0.058), ('failure', 0.057), ('coverage', 0.057), ('got', 0.056), ('pre', 0.056), ('consisted', 0.052), ('allen', 0.051), ('synset', 0.051), ('northwe', 0.051), ('agr', 0.047), ('plural', 0.044), ('recipient', 0.044), ('karin', 0.044), ('adverbs', 0.044), ('tense', 0.044), ('compiled', 0.042), ('additions', 0.042), ('wanted', 0.042), ('give', 0.041), ('root', 0.041), ('memory', 0.04), ('washington', 0.039), ('lack', 0.038), ('kb', 0.038), ('queried', 0.038), ('frame', 0.038), ('sentences', 0.038), ('existing', 0.038), ('errors', 0.037), ('furthermore', 0.036), ('chart', 0.036), ('unique', 0.036), ('counting', 0.035), ('lemmas', 0.034), ('parsers', 0.034), ('syntactic', 0.033), ('conjunctions', 0.033), ('synsets', 0.033), ('adjectives', 0.033), ('marcus', 0.033), ('representations', 0.032), ('taken', 0.032), ('numbers', 0.031), ('extensive', 0.031), ('enriched', 0.031), ('grishman', 0.031), ('added', 0.03), ('kenneth', 0.03), ('aaai', 0.029), ('sekine', 0.029), ('noun', 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 229 acl-2011-NULEX: An Open-License Broad Coverage Lexicon

Author: Clifton McFate ; Kenneth Forbus

Abstract: Broad coverage lexicons for the English language have traditionally been handmade. This approach, while accurate, requires too much human labor. Furthermore, resources contain gaps in coverage, contain specific types of information, or are incompatible with other resources. We believe that the state of open-license technology is such that a comprehensive syntactic lexicon can be automatically compiled. This paper describes the creation of such a lexicon, NU-LEX, an open-license feature-based lexicon for general purpose parsing that combines WordNet, VerbNet, and Wiktionary and contains over 100,000 words. NU-LEX was integrated into a bottom up chart parser. We ran the parser through three sets of sentences, 50 sentences total, from the Simple English Wikipedia and compared its performance to the same parser using Comlex. Both parsers performed almost equally with NU-LEX finding all lex-items for 50% of the sentences and Comlex succeeding for 52%. Furthermore, NULEX’s shortcomings primarily fell into two categories, suggesting future research directions. 1

2 0.099804834 167 acl-2011-Improving Dependency Parsing with Semantic Classes

Author: Eneko Agirre ; Kepa Bengoetxea ; Koldo Gojenola ; Joakim Nivre

Abstract: This paper presents the introduction of WordNet semantic classes in a dependency parser, obtaining improvements on the full Penn Treebank for the first time. We tried different combinations of some basic semantic classes and word sense disambiguation algorithms. Our experiments show that selecting the adequate combination of semantic features on development data is key for success. Given the basic nature of the semantic classes and word sense disambiguation algorithms used, we think there is ample room for future improvements. 1

3 0.075235583 162 acl-2011-Identifying the Semantic Orientation of Foreign Words

Author: Ahmed Hassan ; Amjad AbuJbara ; Rahul Jha ; Dragomir Radev

Abstract: We present a method for identifying the positive or negative semantic orientation of foreign words. Identifying the semantic orientation of words has numerous applications in the areas of text classification, analysis of product review, analysis of responses to surveys, and mining online discussions. Identifying the semantic orientation of English words has been extensively studied in literature. Most of this work assumes the existence of resources (e.g. Wordnet, seeds, etc) that do not exist in foreign languages. In this work, we describe a method based on constructing a multilingual network connecting English and foreign words. We use this network to identify the semantic orientation of foreign words based on connection between words in the same language as well as multilingual connections. The method is experimentally tested using a manually labeled set of positive and negative words and has shown very promising results.

4 0.070472308 274 acl-2011-Semi-Supervised Frame-Semantic Parsing for Unknown Predicates

Author: Dipanjan Das ; Noah A. Smith

Abstract: We describe a new approach to disambiguating semantic frames evoked by lexical predicates previously unseen in a lexicon or annotated data. Our approach makes use of large amounts of unlabeled data in a graph-based semi-supervised learning framework. We construct a large graph where vertices correspond to potential predicates and use label propagation to learn possible semantic frames for new ones. The label-propagated graph is used within a frame-semantic parser and, for unknown predicates, results in over 15% absolute improvement in frame identification accuracy and over 13% absolute improvement in full frame-semantic parsing F1 score on a blind test set, over a state-of-the-art supervised baseline.

5 0.069745608 145 acl-2011-Good Seed Makes a Good Crop: Accelerating Active Learning Using Language Modeling

Author: Dmitriy Dligach ; Martha Palmer

Abstract: Active Learning (AL) is typically initialized with a small seed of examples selected randomly. However, when the distribution of classes in the data is skewed, some classes may be missed, resulting in a slow learning progress. Our contribution is twofold: (1) we show that an unsupervised language modeling based technique is effective in selecting rare class examples, and (2) we use this technique for seeding AL and demonstrate that it leads to a higher learning rate. The evaluation is conducted in the context of word sense disambiguation.

6 0.061317004 315 acl-2011-Types of Common-Sense Knowledge Needed for Recognizing Textual Entailment

7 0.056180332 7 acl-2011-A Corpus for Modeling Morpho-Syntactic Agreement in Arabic: Gender, Number and Rationality

8 0.055292968 119 acl-2011-Evaluating the Impact of Coder Errors on Active Learning

9 0.049208481 158 acl-2011-Identification of Domain-Specific Senses in a Machine-Readable Dictionary

10 0.048142493 3 acl-2011-A Bayesian Model for Unsupervised Semantic Parsing

11 0.048089024 259 acl-2011-Rare Word Translation Extraction from Aligned Comparable Documents

12 0.047795523 269 acl-2011-Scaling up Automatic Cross-Lingual Semantic Role Annotation

13 0.047268271 165 acl-2011-Improving Classification of Medical Assertions in Clinical Notes

14 0.046537735 327 acl-2011-Using Bilingual Parallel Corpora for Cross-Lingual Textual Entailment

15 0.046444751 298 acl-2011-The ACL Anthology Searchbench

16 0.044965319 44 acl-2011-An exponential translation model for target language morphology

17 0.044701394 242 acl-2011-Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments

18 0.043915048 286 acl-2011-Social Network Extraction from Texts: A Thesis Proposal

19 0.043484423 282 acl-2011-Shift-Reduce CCG Parsing

20 0.043410044 111 acl-2011-Effects of Noun Phrase Bracketing in Dependency Parsing and Machine Translation


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.124), (1, 0.011), (2, -0.043), (3, -0.048), (4, -0.021), (5, -0.006), (6, 0.069), (7, -0.008), (8, -0.006), (9, -0.049), (10, -0.042), (11, -0.049), (12, -0.011), (13, 0.02), (14, -0.005), (15, -0.065), (16, 0.065), (17, 0.006), (18, -0.012), (19, -0.02), (20, 0.048), (21, 0.055), (22, -0.012), (23, -0.01), (24, -0.018), (25, 0.007), (26, 0.009), (27, 0.011), (28, -0.006), (29, 0.012), (30, 0.065), (31, -0.038), (32, 0.027), (33, -0.071), (34, 0.02), (35, -0.001), (36, 0.013), (37, -0.018), (38, 0.042), (39, 0.026), (40, 0.005), (41, -0.009), (42, -0.075), (43, -0.066), (44, -0.001), (45, -0.014), (46, -0.015), (47, -0.026), (48, 0.005), (49, -0.045)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93035358 229 acl-2011-NULEX: An Open-License Broad Coverage Lexicon

Author: Clifton McFate ; Kenneth Forbus

Abstract: Broad coverage lexicons for the English language have traditionally been handmade. This approach, while accurate, requires too much human labor. Furthermore, resources contain gaps in coverage, contain specific types of information, or are incompatible with other resources. We believe that the state of open-license technology is such that a comprehensive syntactic lexicon can be automatically compiled. This paper describes the creation of such a lexicon, NU-LEX, an open-license feature-based lexicon for general purpose parsing that combines WordNet, VerbNet, and Wiktionary and contains over 100,000 words. NU-LEX was integrated into a bottom up chart parser. We ran the parser through three sets of sentences, 50 sentences total, from the Simple English Wikipedia and compared its performance to the same parser using Comlex. Both parsers performed almost equally with NU-LEX finding all lex-items for 50% of the sentences and Comlex succeeding for 52%. Furthermore, NULEX’s shortcomings primarily fell into two categories, suggesting future research directions. 1

2 0.64755201 222 acl-2011-Model-Portability Experiments for Textual Temporal Analysis

Author: Oleksandr Kolomiyets ; Steven Bethard ; Marie-Francine Moens

Abstract: We explore a semi-supervised approach for improving the portability of time expression recognition to non-newswire domains: we generate additional training examples by substituting temporal expression words with potential synonyms. We explore using synonyms both from WordNet and from the Latent Words Language Model (LWLM), which predicts synonyms in context using an unsupervised approach. We evaluate a state-of-the-art time expression recognition system trained both with and without the additional training examples using data from TempEval 2010, Reuters and Wikipedia. We find that the LWLM provides substantial improvements on the Reuters corpus, and smaller improvements on the Wikipedia corpus. We find that WordNet alone never improves performance, though intersecting the examples from the LWLM and WordNet provides more stable results for Wikipedia. 1

3 0.62947106 120 acl-2011-Even the Abstract have Color: Consensus in Word-Colour Associations

Author: Saif Mohammad

Abstract: Colour is a key component in the successful dissemination of information. Since many real-world concepts are associated with colour, for example danger with red, linguistic information is often complemented with the use of appropriate colours in information visualization and product marketing. Yet, there is no comprehensive resource that captures concept–colour associations. We present a method to create a large word–colour association lexicon by crowdsourcing. A wordchoice question was used to obtain sense-level annotations and to ensure data quality. We focus especially on abstract concepts and emotions to show that even they tend to have strong colour associations. Thus, using the right colours can not only improve semantic coherence, but also inspire the desired emotional response.

4 0.61430633 315 acl-2011-Types of Common-Sense Knowledge Needed for Recognizing Textual Entailment

Author: Peter LoBue ; Alexander Yates

Abstract: Understanding language requires both linguistic knowledge and knowledge about how the world works, also known as common-sense knowledge. We attempt to characterize the kinds of common-sense knowledge most often involved in recognizing textual entailments. We identify 20 categories of common-sense knowledge that are prevalent in textual entailment, many of which have received scarce attention from researchers building collections of knowledge.

5 0.60423875 162 acl-2011-Identifying the Semantic Orientation of Foreign Words

Author: Ahmed Hassan ; Amjad AbuJbara ; Rahul Jha ; Dragomir Radev

Abstract: We present a method for identifying the positive or negative semantic orientation of foreign words. Identifying the semantic orientation of words has numerous applications in the areas of text classification, analysis of product review, analysis of responses to surveys, and mining online discussions. Identifying the semantic orientation of English words has been extensively studied in literature. Most of this work assumes the existence of resources (e.g. Wordnet, seeds, etc) that do not exist in foreign languages. In this work, we describe a method based on constructing a multilingual network connecting English and foreign words. We use this network to identify the semantic orientation of foreign words based on connection between words in the same language as well as multilingual connections. The method is experimentally tested using a manually labeled set of positive and negative words and has shown very promising results.

6 0.59437686 119 acl-2011-Evaluating the Impact of Coder Errors on Active Learning

7 0.56817979 304 acl-2011-Together We Can: Bilingual Bootstrapping for WSD

8 0.5611701 297 acl-2011-That's What She Said: Double Entendre Identification

9 0.54393542 145 acl-2011-Good Seed Makes a Good Crop: Accelerating Active Learning Using Language Modeling

10 0.54302239 249 acl-2011-Predicting Relative Prominence in Noun-Noun Compounds

11 0.53199643 288 acl-2011-Subjective Natural Language Problems: Motivations, Applications, Characterizations, and Implications

12 0.52861589 174 acl-2011-Insights from Network Structure for Text Mining

13 0.51828706 320 acl-2011-Unsupervised Discovery of Domain-Specific Knowledge from Text

14 0.51274103 167 acl-2011-Improving Dependency Parsing with Semantic Classes

15 0.50671911 42 acl-2011-An Interface for Rapid Natural Language Processing Development in UIMA

16 0.50392556 215 acl-2011-MACAON An NLP Tool Suite for Processing Word Lattices

17 0.48360083 148 acl-2011-HITS-based Seed Selection and Stop List Construction for Bootstrapping

18 0.48101947 157 acl-2011-I Thou Thee, Thou Traitor: Predicting Formal vs. Informal Address in English Literature

19 0.47427046 176 acl-2011-Integrating surprisal and uncertain-input models in online sentence comprehension: formal techniques and empirical results

20 0.46676263 207 acl-2011-Learning to Win by Reading Manuals in a Monte-Carlo Framework


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.042), (6, 0.319), (17, 0.032), (26, 0.033), (31, 0.017), (37, 0.074), (39, 0.029), (41, 0.049), (53, 0.012), (55, 0.03), (59, 0.068), (72, 0.057), (91, 0.031), (96, 0.088), (97, 0.047)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.72953248 229 acl-2011-NULEX: An Open-License Broad Coverage Lexicon

Author: Clifton McFate ; Kenneth Forbus

Abstract: Broad coverage lexicons for the English language have traditionally been handmade. This approach, while accurate, requires too much human labor. Furthermore, resources contain gaps in coverage, contain specific types of information, or are incompatible with other resources. We believe that the state of open-license technology is such that a comprehensive syntactic lexicon can be automatically compiled. This paper describes the creation of such a lexicon, NU-LEX, an open-license feature-based lexicon for general purpose parsing that combines WordNet, VerbNet, and Wiktionary and contains over 100,000 words. NU-LEX was integrated into a bottom up chart parser. We ran the parser through three sets of sentences, 50 sentences total, from the Simple English Wikipedia and compared its performance to the same parser using Comlex. Both parsers performed almost equally with NU-LEX finding all lex-items for 50% of the sentences and Comlex succeeding for 52%. Furthermore, NULEX’s shortcomings primarily fell into two categories, suggesting future research directions. 1

2 0.58430463 93 acl-2011-Dealing with Spurious Ambiguity in Learning ITG-based Word Alignment

Author: Shujian Huang ; Stephan Vogel ; Jiajun Chen

Abstract: Word alignment has an exponentially large search space, which often makes exact inference infeasible. Recent studies have shown that inversion transduction grammars are reasonable constraints for word alignment, and that the constrained space could be efficiently searched using synchronous parsing algorithms. However, spurious ambiguity may occur in synchronous parsing and cause problems in both search efficiency and accuracy. In this paper, we conduct a detailed study of the causes of spurious ambiguity and how it effects parsing and discriminative learning. We also propose a variant of the grammar which eliminates those ambiguities. Our grammar shows advantages over previous grammars in both synthetic and real-world experiments.

3 0.56055272 67 acl-2011-Clairlib: A Toolkit for Natural Language Processing, Information Retrieval, and Network Analysis

Author: Amjad Abu-Jbara ; Dragomir Radev

Abstract: In this paper we present Clairlib, an opensource toolkit for Natural Language Processing, Information Retrieval, and Network Analysis. Clairlib provides an integrated framework intended to simplify a number of generic tasks within and across those three areas. It has a command-line interface, a graphical interface, and a documented API. Clairlib is compatible with all the common platforms and operating systems. In addition to its own functionality, it provides interfaces to external software and corpora. Clairlib comes with a comprehensive documentation and a rich set of tutorials and visual demos.

4 0.44340241 164 acl-2011-Improving Arabic Dependency Parsing with Form-based and Functional Morphological Features

Author: Yuval Marton ; Nizar Habash ; Owen Rambow

Abstract: We explore the contribution of morphological features both lexical and inflectional to dependency parsing of Arabic, a morphologically rich language. Using controlled experiments, we find that definiteness, person, number, gender, and the undiacritzed lemma are most helpful for parsing on automatically tagged input. We further contrast the contribution of form-based and functional features, and show that functional gender and number (e.g., “broken plurals”) and the related rationality feature improve over form-based features. It is the first time functional morphological features are used for Arabic NLP. – –

5 0.43671116 10 acl-2011-A Discriminative Model for Joint Morphological Disambiguation and Dependency Parsing

Author: John Lee ; Jason Naradowsky ; David A. Smith

Abstract: Most previous studies of morphological disambiguation and dependency parsing have been pursued independently. Morphological taggers operate on n-grams and do not take into account syntactic relations; parsers use the “pipeline” approach, assuming that morphological information has been separately obtained. However, in morphologically-rich languages, there is often considerable interaction between morphology and syntax, such that neither can be disambiguated without the other. In this paper, we propose a discriminative model that jointly infers morphological properties and syntactic structures. In evaluations on various highly-inflected languages, this joint model outperforms both a baseline tagger in morphological disambiguation, and a pipeline parser in head selection.

6 0.43406522 167 acl-2011-Improving Dependency Parsing with Semantic Classes

7 0.43041575 324 acl-2011-Unsupervised Semantic Role Induction via Split-Merge Clustering

8 0.42943671 32 acl-2011-Algorithm Selection and Model Adaptation for ESL Correction Tasks

9 0.42847326 315 acl-2011-Types of Common-Sense Knowledge Needed for Recognizing Textual Entailment

10 0.427378 304 acl-2011-Together We Can: Bilingual Bootstrapping for WSD

11 0.42729467 86 acl-2011-Coreference for Learning to Extract Relations: Yes Virginia, Coreference Matters

12 0.42722252 170 acl-2011-In-domain Relation Discovery with Meta-constraints via Posterior Regularization

13 0.42681283 289 acl-2011-Subjectivity and Sentiment Analysis of Modern Standard Arabic

14 0.42664602 311 acl-2011-Translationese and Its Dialects

15 0.42603725 119 acl-2011-Evaluating the Impact of Coder Errors on Active Learning

16 0.4260025 269 acl-2011-Scaling up Automatic Cross-Lingual Semantic Role Annotation

17 0.42555976 190 acl-2011-Knowledge-Based Weak Supervision for Information Extraction of Overlapping Relations

18 0.42549711 126 acl-2011-Exploiting Syntactico-Semantic Structures for Relation Extraction

19 0.42493704 222 acl-2011-Model-Portability Experiments for Textual Temporal Analysis

20 0.42370471 307 acl-2011-Towards Tracking Semantic Change by Visual Analytics