49 acl-2012-Coarse Lexical Semantic Annotation with Supersenses: An Arabic Case Study


Source: pdf

Author: Nathan Schneider ; Behrang Mohit ; Kemal Oflazer ; Noah A. Smith

Abstract: “Lightweight” semantic annotation of text calls for a simple representation, ideally without requiring a semantic lexicon to achieve good coverage in the language and domain. In this paper, we repurpose WordNet’s supersense tags for annotation, developing specific guidelines for nominal expressions and applying them to Arabic Wikipedia articles in four topical domains. The resulting corpus has high coverage and was completed quickly with reasonable inter-annotator agreement.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 “Lightweight” semantic annotation of text calls for a simple representation, ideally without requiring a semantic lexicon to achieve good coverage in the language and domain. [sent-6, score-0.228]

2 In this paper, we repurpose WordNet’s supersense tags for annotation, developing specific guidelines for nominal expressions and applying them to Arabic Wikipedia articles in four topical domains. [sent-7, score-0.709]

3 1 Introduction The goal of “lightweight” semantic annotation of text, particularly in scenarios with limited resources and expertise, presents several requirements for a representation: simplicity; adaptability to new languages, topics, and genres; and coverage. [sent-9, score-0.23]

4 This paper describes coarse lexical semantic annotation of Arabic Wikipedia articles subject to these constraints. [sent-10, score-0.291]

5 Traditional lexical semantic representations are either narrow in scope, like named entities,1 or make reference to a full-fledged lexicon/ontology, which may insufficiently cover the language/domain of interest or require prohibitive expertise and effort to apply.2 [sent-11, score-0.095]

6 We therefore turn to supersense tags (SSTs), 40 coarse lexical semantic classes (25 for nouns, 15 for verbs) originating in WordNet. [sent-12, score-0.73]

7 Previously these served as groupings of English lexicon 1Some ontologies like those in Sekine et al. [sent-13, score-0.049]

8 , a WordNet (Fellbaum, 1998) sense annotation effort reported by Passonneau et al. [sent-18, score-0.191]

9 (2010) found considerable interannotator variability for some lexemes; FrameNet (Baker et al. [sent-19, score-0.074]

10 , 1998) is limited in coverage, even for English; and PropBank (Kingsbury and Palmer, 2002) does not capture semantic relationships across lexemes. [sent-20, score-0.047]

11 , 2003) has been used for fine-grained crosslingual annotation (Hovy et al. [sent-22, score-0.134]

12 @ book Guinness for-records the-standard that COMMUNICATION Ø? [sent-48, score-0.054]

13 ‘The Guinness Book of World Records considers the University of Al-Karaouine in Fez, Morocco, established in the year 859 AD, the oldest university in the world.’ [sent-95, score-0.049]

14 Figure 1: A sentence from the article “Islamic Golden Age,” with the supersense tagging from one of two annotators. [sent-96, score-0.632]

15 Part of the earliest versions of WordNet, the supersense categories (originally, “lexicographer classes”) were intended to partition all English noun and verb senses into broad groupings, or semantic fields (Miller, 1990; Fellbaum, 1990). [sent-99, score-0.615]

16 More recently, the task of automatic supersense tagging has emerged for English (Ciaramita and Johnson, 2003; Curran, 2005; Ciaramita and Altun, 2006; Paaß and Reichartz, 2009), as well as for Italian (Picca et al. [sent-100, score-0.632]

17 In principle, we believe supersenses ought to apply to nouns and verbs in any language, and need not depend on the availability of a semantic lexicon.3 [sent-105, score-0.24]

18 3 Note that work in supersense tagging used text with fine-grained sense annotations that were then coarsened to SSTs. [sent-109, score-0.689]

19 The 7 article titles (translated) in each domain, with total counts of sentences, tokens, and supersense mentions. [sent-114, score-0.568]

20 Overall, there are 2,219 sentences with 65,452 tokens and 23,239 mentions (1. [sent-115, score-0.059]

21 Counts exclude sentences marked as problematic and mentions marked ? [sent-117, score-0.155]

22 We encapsulate our interpretation of the tags in a set of brief guidelines that aims to be usable by anyone who can read and understand a text in the target language; our annotators had no prior expertise in linguistics or linguistic annotation. [sent-124, score-0.244]

23 Finally, we note that ad hoc categorization schemes not unlike SSTs have been developed for purposes ranging from question answering (Li and Roth, 2002) to animacy hierarchy representation for corpus linguistics (Zaenen et al. [sent-125, score-0.042]

24 We believe the interpretation of the SSTs adopted here can serve as a single starting point for diverse resource engineering efforts and applications, especially when fine-grained sense annotation is not feasible. [sent-127, score-0.191]

25 2 Tagging Conventions WordNet’s definitions of the supersenses are terse, and we could find little explicit discussion of the specific rationales behind each category. [sent-128, score-0.114]

26 Thus, we have crafted more specific explanations, summarized for nouns in figure 2. [sent-129, score-0.079]

27 English examples are given, but the guidelines are intended to be language-neutral. [sent-130, score-0.083]

28 5 In developing these guidelines we consulted English WordNet (Fellbaum, 1998) and SemCor (Miller et al. [sent-132, score-0.14]

29 3 Arabic Wikipedia Annotation The annotation in this work was on top of a small corpus of Arabic Wikipedia articles that had already been annotated for named entities (Mohit et al. [sent-139, score-0.24]

30 The dataset (table 1) consists of the main text of 28 articles selected from the topical domains of history, sports, science, and technology. [sent-143, score-0.058]

31 The annotation task was to identify and categorize mentions, i. [sent-144, score-0.134]

32 Working in a custom, browser-based interface, annotators were to tag each relevant token with a supersense category by selecting the token and typing a tag symbol. [sent-147, score-0.754]

33 Any token could be marked as continuing a multiword unit by typing <. [sent-148, score-0.092]

34 If the annotator was ambivalent about a token they were to mark it with the ? [sent-149, score-0.111]
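The tagging scheme described above (one tag symbol per token, with < continuing a multiword unit) can be turned into mention spans with a small decoder. The following is a minimal sketch, not the authors' tooling: the function name decode_mentions, the use of None for untagged tokens, and the choice to ignore a stray < with no open mention are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# A mention is (start index, end index exclusive, label). Hypothetical type.
Mention = Tuple[int, int, str]


def decode_mentions(tags: List[Optional[str]]) -> List[Mention]:
    """Convert per-token tags into mention spans.

    Assumptions: None means the token was left blank, "<" continues the
    mention opened by the nearest preceding labelled token, and any other
    string is a supersense label that starts a new mention.
    """
    mentions: List[Mention] = []
    start: Optional[int] = None
    label: Optional[str] = None
    for i, tag in enumerate(tags):
        if tag == "<" and start is not None:
            continue  # extend the currently open mention
        if start is not None:
            mentions.append((start, i, label))  # close the open mention
            start, label = None, None
        if tag is not None and tag != "<":
            start, label = i, tag  # open a new mention
    if start is not None:
        mentions.append((start, len(tags), label))
    return mentions


example = ["COMMUNICATION", "<", None, "LOCATION", "LOCATION", None]
spans = decode_mentions(example)
```

In this sketch two adjacent LOCATION tags without < yield two separate one-token mentions, which mirrors the kind of boundary decision the annotators had to make.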

35 Over several months, annotators alternately annotated sentences from 2 designated articles of each domain, and reviewed the annotations for consistency. [sent-154, score-0.156]

36 All tagging conventions were developed collaboratively by the author(s) and annotators during this period, informed by points of confusion and disagreement. [sent-155, score-0.269]

37 WordNet and SemCor were consulted as part of developing the guidelines, but not during annotation itself so as to avoid complicating the annotation process or overfitting to WordNet’s idiosyncrasies. [sent-156, score-0.325]

38 The training phase ended once interannotator mention F1 had reached 75%. [sent-157, score-0.074]

39 6 Suggestions came from the previous named entity annotation of PERSONs, organizations (GROUP), and LOCATIONs, as well as heuristic lookup in lexical resources—Arabic WordNet entries (Elkateb et al. [sent-158, score-0.048]

40 , 2006) mapped to English WordNet, and named entities in OntoNotes (Hovy et al. [sent-159, score-0.048]

41 A connection is a RELATION; project, support, and a configuration are tagged as COGNITION; development and collaboration are ACTs. [sent-166, score-0.067]

42 Arabic conventions Masdar constructions (verbal nouns) are treated as nouns. [sent-167, score-0.06]

43 Sports championships/tournaments are EVENTs (Information) Technology Software names, kinds, and components are tagged as COMMUNICATION (e. [sent-169, score-0.067]

44 kernel, Figure 2: Above: The complete supersense tagset for nouns; each tag is briefly described by its symbol, NAME, short description, and examples. [sent-171, score-0.568]

45 Throughout the process, annotators were encouraged to discuss points of confusion with each other, but each sentence was annotated in its entirety and never revisited. [sent-175, score-0.145]

46 To measure inter-annotator agreement, 87 sentences (2,774 tokens) distributed across 19 of the articles (not including those used in pilot rounds) were annotated independently by each annotator. [sent-182, score-0.058]

47 Interannotator mention F1 (counting agreement over entire mentions and their labels) was 70%. [sent-183, score-0.109]
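Mention-level F1 between two annotators, counting a match only when the full span and the label agree, can be sketched as follows. This is an illustrative reading of the measure, not the authors' evaluation script; the function name mention_f1 and the (start, end, label) tuple representation are assumptions.

```python
from typing import Iterable, Tuple

Mention = Tuple[int, int, str]  # (start, end exclusive, label); hypothetical


def mention_f1(ann_a: Iterable[Mention], ann_b: Iterable[Mention]) -> float:
    """Exact-match F1: a mention counts only if span and label are identical.

    Treating one annotator as reference and the other as prediction gives
    precision = matched / |B| and recall = matched / |A|, so
    F1 = 2 * matched / (|A| + |B|), which is symmetric in the two annotators.
    """
    set_a, set_b = set(ann_a), set(ann_b)
    if not set_a or not set_b:
        return 0.0
    matched = len(set_a & set_b)
    return 2.0 * matched / (len(set_a) + len(set_b))


a = [(0, 2, "COMMUNICATION"), (3, 4, "LOCATION")]
b = [(0, 2, "COMMUNICATION"), (3, 5, "LOCATION"), (6, 7, "TIME")]
score = mention_f1(a, b)
```

Here only the first mention matches exactly, so the score is 2 * 1 / (2 + 3).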

48 Excluding the 1,397 tokens left blank by both annotators, the token-level agreement rate was 71%, with Cohen’s κ = 0. [sent-184, score-0.05]

49 7 We also measured agreement on a tag-by-tag basis. [sent-186, score-0.05]

50 An examination of the confusion matrix reveals four pairs of supersense categories that tended to provoke the most disagreement: COMMUNICATION/COGNITION, ACT/COGNITION, ACT/PROCESS, and ARTIFACT/COMMUNICATION. [sent-189, score-0.615]

51 7Token-level measures consider both the supersense label and whether it begins or continues the mention. [sent-190, score-0.568]
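Token-level agreement and Cohen's κ can be computed from two parallel label sequences. The sketch below assumes each token label already combines the supersense with a begin/continue marker (written here as B-/I- prefixes, which is an illustrative encoding and not necessarily the one used for the corpus) and that tokens left blank by both annotators have been filtered out beforehand.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators labelling the same tokens.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    rate and p_e is the agreement expected by chance from each annotator's
    own label distribution.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[lab] * counts_b[lab] for lab in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


tokens_a = ["B-ACT", "I-ACT", "B-TIME", "B-ACT"]
tokens_b = ["B-ACT", "B-ACT", "B-TIME", "B-ACT"]
kappa = cohens_kappa(tokens_a, tokens_b)
```

For this toy input the observed agreement is 3/4 and the chance agreement is 7/16, giving κ = 5/9.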

52 where one annotator chose ARTIFACT (referring to the physical book) while the other chose COMMUNICATION (the content). [sent-191, score-0.067]

53 Also in that sentence, annotators disagreed on the second use of university (ARTIFACT vs. [sent-192, score-0.098]

54 As with any sense annotation effort, some disagreements due to legitimate ambiguity and different interpretations of the tags— especially the broadest ones—are unavoidable. [sent-194, score-0.191]

55 A “soft” agreement measure (counting as matches any two mentions with the same label and at least one token in common) gives an F1 of 79%, showing that boundary decisions account for a major portion of the disagreement. [sent-195, score-0.153]
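The “soft” measure relaxes exact span matching to label equality plus at least one shared token. A minimal sketch under stated assumptions: mentions are (start, end exclusive, label) tuples, and precision and recall are taken as the fraction of each annotator's mentions that have at least one soft match on the other side. The source does not spell out how many-to-one matches are counted, so that choice is an assumption.

```python
from typing import List, Tuple

Mention = Tuple[int, int, str]  # (start, end exclusive, label); hypothetical


def soft_match(m1: Mention, m2: Mention) -> bool:
    """True if the labels are equal and the spans share at least one token."""
    s1, e1, l1 = m1
    s2, e2, l2 = m2
    return l1 == l2 and s1 < e2 and s2 < e1


def soft_f1(ann_a: List[Mention], ann_b: List[Mention]) -> float:
    """F1 where a mention counts as matched if any mention from the other
    annotator soft-matches it (assumed counting rule)."""
    if not ann_a or not ann_b:
        return 0.0
    recall = sum(any(soft_match(x, y) for y in ann_b) for x in ann_a) / len(ann_a)
    precision = sum(any(soft_match(y, x) for x in ann_a) for y in ann_b) / len(ann_b)
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# One annotator tags a two-token place name as a single LOCATION,
# the other tags each token as its own LOCATION.
one_span = [(0, 2, "LOCATION")]
two_spans = [(0, 1, "LOCATION"), (1, 2, "LOCATION")]
soft = soft_f1(one_span, two_spans)
```

Under exact matching this pair of annotations would share no mentions at all, while the soft measure treats them as fully agreeing, which illustrates how boundary decisions alone can lower the strict score.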

56 , the city Fez, Morocco (figure 1) was tagged as a single LOCATION by one annotator and as two by the other. [sent-198, score-0.134]

57 Further examples include the technical term ‘thin client’, for which one annotator omitted the adjective; and ‘World Cup Football Championship’, where one annotator tagged the entire phrase as an EVENT while the other tagged ‘football’ as a separate ACT. [sent-199, score-0.268]

58 4 Conclusion We have codified supersense tags as a simple annotation scheme for coarse lexical semantics, and have shown that supersense annotation of Arabic Wikipedia can be rapid, reliable, and robust (about half the tokens in our data are covered by a nominal supersense). [sent-200, score-1.456]

59 Our tagging guidelines and corpus are available for download at http : / /www . [sent-201, score-0.147]

60 A resource and tool for super-sense tagging of Italian texts. [sent-211, score-0.064]

61 Broad-coverage sense disambiguation and information extraction with a supersense sequence tagger. [sent-233, score-0.625]

62 Interlingual annotation of parallel text corpora: a new framework for annotation and evaluation. [sent-250, score-0.268]

63 Word sense annotation of polysemous words by multiple annotators. [sent-304, score-0.191]

64 Combining contextual and structural information for supersense tagging of Chinese unknown words. [sent-325, score-0.632]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('supersense', 0.568), ('ssts', 0.199), ('arabic', 0.165), ('annotation', 0.134), ('wordnet', 0.124), ('islamic', 0.114), ('picca', 0.114), ('supersenses', 0.114), ('annotators', 0.098), ('massimiliano', 0.091), ('artifact', 0.09), ('fez', 0.085), ('guidelines', 0.083), ('morocco', 0.08), ('nouns', 0.079), ('interannotator', 0.074), ('ciaramita', 0.073), ('miller', 0.073), ('plant', 0.068), ('calzolari', 0.068), ('choukri', 0.068), ('mariani', 0.068), ('nicoletta', 0.068), ('piperidis', 0.068), ('stelios', 0.068), ('tapias', 0.068), ('annotator', 0.067), ('tagged', 0.067), ('fellbaum', 0.065), ('tagging', 0.064), ('passonneau', 0.063), ('bente', 0.063), ('elra', 0.063), ('expertise', 0.063), ('food', 0.063), ('khalid', 0.063), ('maegaard', 0.063), ('qatar', 0.06), ('conventions', 0.06), ('mentions', 0.059), ('articles', 0.058), ('sense', 0.057), ('consulted', 0.057), ('davide', 0.057), ('elkateb', 0.057), ('mosque', 0.057), ('motive', 0.057), ('paa', 0.057), ('philpot', 0.057), ('possession', 0.057), ('solaris', 0.057), ('stallman', 0.057), ('zaenen', 0.057), ('mohit', 0.054), ('book', 0.054), ('coarse', 0.052), ('christiane', 0.05), ('agreement', 0.05), ('groupings', 0.049), ('guinness', 0.049), ('ark', 0.049), ('attardi', 0.049), ('behrang', 0.049), ('canary', 0.049), ('oldest', 0.049), ('palmas', 0.049), ('resources', 0.049), ('marked', 0.048), ('named', 0.048), ('hovy', 0.047), ('cognition', 0.047), ('confusion', 0.047), ('semantic', 0.047), ('wikipedia', 0.047), ('lightweight', 0.045), ('islands', 0.045), ('kemal', 0.045), ('kingsbury', 0.045), ('odijk', 0.045), ('oflazer', 0.045), ('rosner', 0.045), ('schneider', 0.045), ('token', 0.044), ('rebecca', 0.043), ('communication', 0.043), ('animacy', 0.042), ('cup', 0.042), ('principal', 0.042), ('alfio', 0.042), ('animal', 0.042), ('gliozzo', 0.042), ('semcor', 0.042), ('object', 0.041), ('location', 0.04), ('malta', 0.04), ('bikel', 0.04), ('industrial', 0.04), ('football', 0.04), ('sports', 0.04), ('lexicography', 0.04)]
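A common way to turn per-paper word weights like these into a paper-to-paper similarity score is cosine similarity over sparse vectors. The page does not state how its simValue column is computed, so the following is only a sketch of that standard technique under that assumption; the function name cosine_similarity and the dict-based vector representation are illustrative.

```python
import math
from typing import Dict


def cosine_similarity(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as dissimilar to everything
    return dot / (norm_u * norm_v)


# Toy weights chosen for easy arithmetic, not taken from the list above.
paper_a = {"supersense": 3.0, "arabic": 4.0}
paper_b = {"arabic": 4.0, "supersense": 3.0}
paper_c = {"wordnet": 1.0}
same = cosine_similarity(paper_a, paper_b)
disjoint = cosine_similarity(paper_a, paper_c)
```

Identical weight vectors score 1.0 and vectors with no shared terms score 0.0; real tfidf vectors for different papers fall in between.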

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999994 49 acl-2012-Coarse Lexical Semantic Annotation with Supersenses: An Arabic Case Study

Author: Nathan Schneider ; Behrang Mohit ; Kemal Oflazer ; Noah A. Smith

Abstract: “Lightweight” semantic annotation of text calls for a simple representation, ideally without requiring a semantic lexicon to achieve good coverage in the language and domain. In this paper, we repurpose WordNet’s supersense tags for annotation, developing specific guidelines for nominal expressions and applying them to Arabic Wikipedia articles in four topical domains. The resulting corpus has high coverage and was completed quickly with reasonable inter-annotator agreement.

2 0.11957845 3 acl-2012-A Class-Based Agreement Model for Generating Accurately Inflected Translations

Author: Spence Green ; John DeNero

Abstract: When automatically translating from a weakly inflected source language like English to a target language with richer grammatical features such as gender and dual number, the output commonly contains morpho-syntactic agreement errors. To address this issue, we present a target-side, class-based agreement model. Agreement is promoted by scoring a sequence of fine-grained morpho-syntactic classes that are predicted during decoding for each translation hypothesis. For English-to-Arabic translation, our model yields a +1.04 BLEU average improvement over a state-of-the-art baseline. The model does not require bitext or phrase table annotations and can be easily implemented as a feature in many phrase-based decoders.

3 0.11736114 206 acl-2012-UWN: A Large Multilingual Lexical Knowledge Base

Author: Gerard de Melo ; Gerhard Weikum

Abstract: We present UWN, a large multilingual lexical knowledge base that describes the meanings and relationships of words in over 200 languages. This paper explains how link prediction, information integration and taxonomy induction methods have been used to build UWN based on WordNet and extend it with millions of named entities from Wikipedia. We additionally introduce extensions to cover lexical relationships, frame-semantic knowledge, and language data. An online interface provides human access to the data, while a software API enables applications to look up over 16 million words and names.

4 0.11640296 202 acl-2012-Transforming Standard Arabic to Colloquial Arabic

Author: Emad Mohamed ; Behrang Mohit ; Kemal Oflazer

Abstract: We present a method for generating Colloquial Egyptian Arabic (CEA) from morphologically disambiguated Modern Standard Arabic (MSA). When used in POS tagging, this process improves the accuracy from 73.24% to 86.84% on unseen CEA text, and reduces the percentage of out-of-vocabulary words from 28.98% to 16.66%. The process holds promise for any NLP task targeting the dialectal varieties of Arabic; e.g., this approach may provide a cheap way to leverage MSA data and morphological resources to create resources for colloquial Arabic to English machine translation. It can also considerably speed up the annotation of Arabic dialects.

5 0.091587268 27 acl-2012-Arabic Retrieval Revisited: Morphological Hole Filling

Author: Kareem Darwish ; Ahmed Ali

Abstract: Due to Arabic’s morphological complexity, Arabic retrieval benefits greatly from morphological analysis – particularly stemming. However, the best known stemming does not handle linguistic phenomena such as broken plurals and malformed stems. In this paper we propose a model of character-level morphological transformation that is trained using Wikipedia hypertext to page title links. The use of our model yields statistically significant improvements in Arabic retrieval over the use of the best statistical stemming technique. The technique can potentially be applied to other languages.

6 0.083805725 50 acl-2012-Collective Classification for Fine-grained Information Status

7 0.078232966 18 acl-2012-A Probabilistic Model for Canonicalizing Named Entity Mentions

8 0.072262257 150 acl-2012-Multilingual Named Entity Recognition using Parallel Data and Metadata from Wikipedia

9 0.071892835 195 acl-2012-The Creation of a Corpus of English Metalanguage

10 0.070609681 208 acl-2012-Unsupervised Relation Discovery with Sense Disambiguation

11 0.066341631 85 acl-2012-Event Linking: Grounding Event Reference in a News Archive

12 0.063718796 96 acl-2012-Fast and Robust Part-of-Speech Tagging Using Dynamic Model Selection

13 0.062499061 157 acl-2012-PDTB-style Discourse Annotation of Chinese Text

14 0.060577616 7 acl-2012-A Computational Approach to the Automation of Creative Naming

15 0.060222525 168 acl-2012-Reducing Approximation and Estimation Errors for Chinese Lexical Processing with Heterogeneous Annotations

16 0.057235062 65 acl-2012-Crowdsourcing Inference-Rule Evaluation

17 0.05689159 134 acl-2012-Learning to Find Translations and Transliterations on the Web

18 0.055423182 45 acl-2012-Capturing Paradigmatic and Syntagmatic Lexical Relations: Towards Accurate Chinese Part-of-Speech Tagging

19 0.054872088 178 acl-2012-Sentence Simplification by Monolingual Machine Translation

20 0.054055672 177 acl-2012-Sentence Dependency Tagging in Online Question Answering Forums


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.177), (1, 0.075), (2, -0.059), (3, 0.039), (4, 0.071), (5, 0.145), (6, 0.012), (7, -0.061), (8, -0.016), (9, 0.003), (10, -0.015), (11, -0.048), (12, 0.11), (13, -0.011), (14, -0.002), (15, -0.062), (16, -0.069), (17, -0.017), (18, -0.15), (19, -0.059), (20, -0.017), (21, 0.013), (22, 0.009), (23, 0.027), (24, 0.016), (25, 0.1), (26, -0.029), (27, -0.033), (28, -0.04), (29, 0.037), (30, 0.04), (31, -0.002), (32, 0.013), (33, 0.114), (34, -0.014), (35, 0.012), (36, -0.011), (37, -0.057), (38, 0.179), (39, -0.079), (40, -0.029), (41, -0.024), (42, 0.034), (43, -0.072), (44, 0.004), (45, 0.061), (46, 0.105), (47, -0.033), (48, 0.001), (49, -0.006)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93050641 49 acl-2012-Coarse Lexical Semantic Annotation with Supersenses: An Arabic Case Study

Author: Nathan Schneider ; Behrang Mohit ; Kemal Oflazer ; Noah A. Smith

Abstract: “Lightweight” semantic annotation of text calls for a simple representation, ideally without requiring a semantic lexicon to achieve good coverage in the language and domain. In this paper, we repurpose WordNet’s supersense tags for annotation, developing specific guidelines for nominal expressions and applying them to Arabic Wikipedia articles in four topical domains. The resulting corpus has high coverage and was completed quickly with reasonable inter-annotator agreement.

2 0.62969369 195 acl-2012-The Creation of a Corpus of English Metalanguage

Author: Shomir Wilson

Abstract: Metalanguage is an essential linguistic mechanism which allows us to communicate explicit information about language itself. However, it has been underexamined in research in language technologies, to the detriment of the performance of systems that could exploit it. This paper describes the creation of the first tagged and delineated corpus of English metalanguage, accompanied by an explicit definition and a rubric for identifying the phenomenon in text. This resource will provide a basis for further studies of metalanguage and enable its utilization in language technologies.

3 0.5782612 206 acl-2012-UWN: A Large Multilingual Lexical Knowledge Base

Author: Gerard de Melo ; Gerhard Weikum

Abstract: We present UWN, a large multilingual lexical knowledge base that describes the meanings and relationships of words in over 200 languages. This paper explains how link prediction, information integration and taxonomy induction methods have been used to build UWN based on WordNet and extend it with millions of named entities from Wikipedia. We additionally introduce extensions to cover lexical relationships, frame-semantic knowledge, and language data. An online interface provides human access to the data, while a software API enables applications to look up over 16 million words and names.

4 0.55290228 202 acl-2012-Transforming Standard Arabic to Colloquial Arabic

Author: Emad Mohamed ; Behrang Mohit ; Kemal Oflazer

Abstract: We present a method for generating Colloquial Egyptian Arabic (CEA) from morphologically disambiguated Modern Standard Arabic (MSA). When used in POS tagging, this process improves the accuracy from 73.24% to 86.84% on unseen CEA text, and reduces the percentage of out-of-vocabulary words from 28.98% to 16.66%. The process holds promise for any NLP task targeting the dialectal varieties of Arabic; e.g., this approach may provide a cheap way to leverage MSA data and morphological resources to create resources for colloquial Arabic to English machine translation. It can also considerably speed up the annotation of Arabic dialects.

5 0.53624809 7 acl-2012-A Computational Approach to the Automation of Creative Naming

Author: Gozde Ozbal ; Carlo Strapparava

Abstract: In this paper, we propose a computational approach to generate neologisms consisting of homophonic puns and metaphors based on the category of the service to be named and the properties to be underlined. We describe all the linguistic resources and natural language processing techniques that we have exploited for this task. Then, we analyze the performance of the system that we have developed. The empirical results show that our approach is generally effective and it constitutes a solid starting point for the automation of the naming process.

6 0.53066176 50 acl-2012-Collective Classification for Fine-grained Information Status

7 0.52478176 27 acl-2012-Arabic Retrieval Revisited: Morphological Hole Filling

8 0.44303641 43 acl-2012-Building Trainable Taggers in a Web-based, UIMA-Supported NLP Workbench

9 0.44219416 137 acl-2012-Lemmatisation as a Tagging Task

10 0.42187831 3 acl-2012-A Class-Based Agreement Model for Generating Accurately Inflected Translations

11 0.41941664 75 acl-2012-Discriminative Strategies to Integrate Multiword Expression Recognition and Parsing

12 0.41842848 152 acl-2012-Multilingual WSD with Just a Few Lines of Code: the BabelNet API

13 0.40632203 186 acl-2012-Structuring E-Commerce Inventory

14 0.39194241 189 acl-2012-Syntactic Annotations for the Google Books NGram Corpus

15 0.38961837 178 acl-2012-Sentence Simplification by Monolingual Machine Translation

16 0.38928032 215 acl-2012-WizIE: A Best Practices Guided Development Environment for Information Extraction

17 0.38247815 85 acl-2012-Event Linking: Grounding Event Reference in a News Archive

18 0.38168597 58 acl-2012-Coreference Semantics from Web Features

19 0.37930131 207 acl-2012-Unsupervised Morphology Rivals Supervised Morphology for Arabic MT

20 0.37023079 73 acl-2012-Discriminative Learning for Joint Template Filling


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(25, 0.482), (26, 0.038), (28, 0.024), (30, 0.034), (37, 0.021), (39, 0.051), (74, 0.02), (82, 0.013), (84, 0.028), (85, 0.039), (90, 0.084), (92, 0.026), (94, 0.012), (99, 0.043)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.90680975 49 acl-2012-Coarse Lexical Semantic Annotation with Supersenses: An Arabic Case Study

Author: Nathan Schneider ; Behrang Mohit ; Kemal Oflazer ; Noah A. Smith

Abstract: “Lightweight” semantic annotation of text calls for a simple representation, ideally without requiring a semantic lexicon to achieve good coverage in the language and domain. In this paper, we repurpose WordNet’s supersense tags for annotation, developing specific guidelines for nominal expressions and applying them to Arabic Wikipedia articles in four topical domains. The resulting corpus has high coverage and was completed quickly with reasonable inter-annotator agreement.

2 0.86651176 164 acl-2012-Private Access to Phrase Tables for Statistical Machine Translation

Author: Nicola Cancedda

Abstract: Some Statistical Machine Translation systems never see the light because the owner of the appropriate training data cannot release them, and the potential user of the system cannot disclose what should be translated. We propose a simple and practical encryption-based method addressing this barrier.

3 0.78506839 1 acl-2012-ACCURAT Toolkit for Multi-Level Alignment and Information Extraction from Comparable Corpora

Author: Marcis Pinnis ; Radu Ion ; Dan Stefanescu ; Fangzhong Su ; Inguna Skadina ; Andrejs Vasiljevs ; Bogdan Babych

Abstract: The lack of parallel corpora and linguistic resources for many languages and domains is one of the major obstacles for the further advancement of automated translation. A possible solution is to exploit comparable corpora (non-parallel bi- or multi-lingual text resources) which are much more widely available than parallel translation data. Our presented toolkit deals with parallel content extraction from comparable corpora. It consists of tools bundled in two workflows: (1) alignment of comparable documents and extraction of parallel sentences and (2) extraction and bilingual mapping of terms and named entities. The toolkit pairs similar bilingual comparable documents and extracts parallel sentences and bilingual terminological and named entity dictionaries from comparable corpora. This demonstration focuses on the English, Latvian, Lithuanian, and Romanian languages.

4 0.62222612 56 acl-2012-Computational Approaches to Sentence Completion

Author: Geoffrey Zweig ; John C. Platt ; Christopher Meek ; Christopher J.C. Burges ; Ainur Yessenalina ; Qiang Liu

Abstract: This paper studies the problem of sentence-level semantic coherence by answering SAT-style sentence completion questions. These questions test the ability of algorithms to distinguish sense from nonsense based on a variety of sentence-level phenomena. We tackle the problem with two approaches: methods that use local lexical information, such as the n-grams of a classical language model; and methods that evaluate global coherence, such as latent semantic analysis. We evaluate these methods on a suite of practice SAT questions, and on a recently released sentence completion task based on data taken from five Conan Doyle novels. We find that by fusing local and global information, we can exceed 50% on this task (chance baseline is 20%), and we suggest some avenues for further research.

5 0.37033245 44 acl-2012-CSNIPER - Annotation-by-query for Non-canonical Constructions in Large Corpora

Author: Richard Eckart de Castilho ; Sabine Bartsch ; Iryna Gurevych

Abstract: We present CSNIPER (Corpus Sniper), a tool that implements (i) a web-based multiuser scenario for identifying and annotating non-canonical grammatical constructions in large corpora based on linguistic queries and (ii) evaluation of annotation quality by measuring inter-rater agreement. This annotation-by-query approach efficiently harnesses expert knowledge to identify instances of linguistic phenomena that are hard to identify by means of existing automatic annotation tools.

6 0.36649695 206 acl-2012-UWN: A Large Multilingual Lexical Knowledge Base

7 0.35105821 99 acl-2012-Finding Salient Dates for Building Thematic Timelines

8 0.34804299 138 acl-2012-LetsMT!: Cloud-Based Platform for Do-It-Yourself Machine Translation

9 0.34234345 152 acl-2012-Multilingual WSD with Just a Few Lines of Code: the BabelNet API

10 0.33768329 161 acl-2012-Polarity Consistency Checking for Sentiment Dictionaries

11 0.33531302 72 acl-2012-Detecting Semantic Equivalence and Information Disparity in Cross-lingual Documents

12 0.33299321 157 acl-2012-PDTB-style Discourse Annotation of Chinese Text

13 0.33192137 219 acl-2012-langid.py: An Off-the-shelf Language Identification Tool

14 0.33169243 36 acl-2012-BIUTEE: A Modular Open-Source System for Recognizing Textual Entailment

15 0.32987395 214 acl-2012-Verb Classification using Distributional Similarity in Syntactic and Semantic Structures

16 0.32972041 52 acl-2012-Combining Coherence Models and Machine Translation Evaluation Metrics for Summarization Evaluation

17 0.32715818 202 acl-2012-Transforming Standard Arabic to Colloquial Arabic

18 0.32148489 145 acl-2012-Modeling Sentences in the Latent Space

19 0.31596941 196 acl-2012-The OpenGrm open-source finite-state grammar software libraries

20 0.31584713 50 acl-2012-Collective Classification for Fine-grained Information Status