acl acl2010 acl2010-31 knowledge-graph by maker-knowledge-mining

31 acl-2010-Annotation


Source: pdf

Author: Eduard Hovy

Abstract: unknown-abstract

Reference: text


Summary: the most important sentences generated by the tfidf model (a sketch of this scoring follows the sentence list below)

sentIndex sentText sentNum sentScore

1 As researchers seek to apply their machine learning algorithms to new problems, corpus annotation is increasingly gaining importance in the NLP community. [sent-3, score-0.778]

2 To attend, no special expertise in computation or linguistics is required. [sent-5, score-0.086]

3 This tutorial is intended to provide the attendee with an in-depth look at the procedures, issues, and problems in corpus annotation, and highlights the pitfalls that the annotation manager should avoid. [sent-7, score-0.906]

4 The tutorial first discusses why annotation is becoming increasingly relevant for NLP and how it fits into the generic NLP methodology of train-evaluate-apply. [sent-8, score-1.08]

5 It then reviews currently available resources, services, and frameworks that support someone wishing to start an annotation project easily. [sent-9, score-0.939]

6 This includes the QDAP annotation center, Amazon's Mechanical Turk, annotation facilities in GATE, and other resources such as UIMA. [sent-10, score-1.091]

7 It then discusses the seven major open issues at the heart of annotation for which there are as yet no standard and fully satisfactory answers or methods. [sent-11, score-1.155]

8 Each issue is described in detail and current practice is shown. [sent-12, score-0.043]

9 How does one decide what specific phenomena to annotate? [sent-14, score-0.101]

10 How does one adequately capture the theory behind the phenomenon/a and express it in simple annotation instructions? [sent-15, score-0.749]

11 How does one obtain a balanced corpus to annotate, and when is a corpus balanced (and representative)? [sent-17, score-0.32]

12 How does one ensure that they [the annotators] are adequately (but not over- or under-) trained? [sent-20, score-0.315]

13 How does one establish a simple, fast, and trustworthy annotation procedure? [sent-22, score-0.635]

14 How and when does one apply measures to ensure that the procedure remains on track? [sent-23, score-0.35]

15 How can one ensure that the interfaces do not influence the annotation results? [sent-27, score-0.672]

16 At which cutoff points should one redesign or re-do the annotations? [sent-31, score-0.086]

17 When, and to whom, should one release the corpus? [sent-34, score-0.064]

18 How should one report the annotation effort and results for best impact? [sent-35, score-0.465]

19 The notes include several pages of references and suggested readings. [sent-36, score-0.093]
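The sentScore values above are what a tfidf-based extractive summarizer produces: each sentence is scored by the weight of its terms relative to the paper. A minimal sketch of such scoring, assuming scikit-learn; the sentences list is a hypothetical illustration, not this page's actual pipeline:

# Minimal sketch of tfidf sentence scoring for extractive summarization.
# Assumes scikit-learn; the input sentences are hypothetical, not this page's data.
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    "Corpus annotation is increasingly gaining importance in the NLP community.",
    "No special expertise in computation or linguistics is required.",
    "The tutorial discusses seven major open issues in corpus annotation.",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(sentences)   # one sparse row per sentence

# Score each sentence by the sum of its tfidf weights and rank descending,
# which yields (sentence, score) pairs like the summary list above.
scores = tfidf.sum(axis=1).A1
for score, sent in sorted(zip(scores, sentences), reverse=True):
    print(f"{score:.3f}  {sent}")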


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('annotation', 0.465), ('seven', 0.2), ('adequately', 0.184), ('tutorial', 0.176), ('issues', 0.155), ('services', 0.142), ('ensure', 0.131), ('projects', 0.124), ('wishing', 0.124), ('closing', 0.124), ('textbook', 0.124), ('increasingly', 0.119), ('balanced', 0.117), ('maintenance', 0.113), ('trustworthy', 0.113), ('covers', 0.109), ('hovy', 0.107), ('discusses', 0.107), ('dec', 0.106), ('annotate', 0.105), ('interface', 0.102), ('someone', 0.1), ('asscolcia', 0.1), ('cgeom', 0.1), ('facilities', 0.1), ('instantiating', 0.1), ('jtuulytor', 0.1), ('nlp', 0.096), ('pra', 0.092), ('annotators', 0.09), ('remains', 0.089), ('heart', 0.089), ('gaining', 0.089), ('gate', 0.089), ('highlights', 0.089), ('satisfactory', 0.086), ('cutoff', 0.086), ('email', 0.086), ('expertise', 0.086), ('becoming', 0.083), ('sweden', 0.083), ('manager', 0.083), ('procedure', 0.083), ('southern', 0.08), ('uppsala', 0.08), ('instructions', 0.08), ('turk', 0.078), ('interfaces', 0.076), ('overview', 0.076), ('fits', 0.074), ('frameworks', 0.074), ('designing', 0.073), ('af', 0.073), ('book', 0.071), ('mechanical', 0.071), ('procedures', 0.069), ('standards', 0.069), ('amazon', 0.068), ('sc', 0.067), ('selecting', 0.065), ('release', 0.064), ('currently', 0.063), ('seek', 0.062), ('resources', 0.061), ('something', 0.061), ('project', 0.06), ('specifying', 0.06), ('store', 0.06), ('formulate', 0.059), ('eduard', 0.058), ('track', 0.058), ('accepted', 0.058), ('establish', 0.057), ('phenomena', 0.057), ('theory', 0.056), ('methodology', 0.056), ('toward', 0.055), ('published', 0.054), ('answers', 0.053), ('reviews', 0.053), ('validation', 0.052), ('active', 0.051), ('representative', 0.051), ('paradigm', 0.051), ('characteristics', 0.05), ('intended', 0.05), ('california', 0.049), ('center', 0.048), ('notes', 0.048), ('basic', 0.047), ('measures', 0.047), ('setting', 0.046), ('community', 0.045), ('suggested', 0.045), ('sciences', 0.045), ('behind', 0.044), ('fo', 0.044), ('decide', 0.044), ('detail', 0.043), ('corpus', 0.043)]
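The (wordName, wordTfidf) pairs above are the paper's strongest tfidf terms, and the simValue numbers in the list that follows are plausibly cosine similarities between such tfidf vectors. A minimal sketch, assuming scikit-learn and a hypothetical toy corpus:

# Minimal sketch of tfidf paper similarity: cosine similarity between
# tfidf vectors. Assumes scikit-learn; the papers dict is hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

papers = {
    "acl-2010-31": "corpus annotation tutorial issues annotators",
    "acl-2010-4": "annotation cost model eye-tracking annotators",
    "acl-2010-86": "discourse structure theory practice use",
}

ids = list(papers)
tfidf = TfidfVectorizer().fit_transform(papers.values())
sims = cosine_similarity(tfidf[0], tfidf).ravel()   # similarity to the first paper

# The paper matched against itself scores ~1.0; floating-point drift can even
# push it slightly above 1.0, as in the same-paper row below.
for pid, sim in sorted(zip(ids, sims), key=lambda p: -p[1]):
    print(f"{sim:.8f}  {pid}")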

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 31 acl-2010-Annotation

Author: Eduard Hovy

Abstract: unknown-abstract

2 0.26862425 4 acl-2010-A Cognitive Cost Model of Annotations Based on Eye-Tracking Data

Author: Katrin Tomanek ; Udo Hahn ; Steffen Lohmann ; Jurgen Ziegler

Abstract: We report on an experiment to track complex decision points in linguistic metadata annotation where the decision behavior of annotators is observed with an eye-tracking device. As experimental conditions we investigate different forms of textual context and linguistic complexity classes relative to syntax and semantics. Our data renders evidence that annotation performance depends on the semantic and syntactic complexity of the decision points and, more interestingly, indicates that full-scale context is mostly negligible, with the exception of semantic high-complexity cases. We then induce from this observational data a cognitively grounded cost model of linguistic meta-data annotations and compare it with existing non-cognitive models. Our data reveals that the cognitively founded model explains annotation costs (expressed in annotation time) more adequately than non-cognitive ones.

3 0.11603179 86 acl-2010-Discourse Structure: Theory, Practice and Use

Author: Bonnie Webber ; Markus Egg ; Valia Kordoni

Abstract: unknown-abstract

4 0.11416817 206 acl-2010-Semantic Parsing: The Task, the State of the Art and the Future

Author: Rohit J. Kate ; Yuk Wah Wong

Abstract: unknown-abstract

5 0.10163417 243 acl-2010-Tree-Based and Forest-Based Translation

Author: Yang Liu ; Liang Huang

Abstract: unknown-abstract

6 0.095719613 260 acl-2010-Wide-Coverage NLP with Linguistically Expressive Grammars

7 0.083166584 208 acl-2010-Sentence and Expression Level Annotation of Opinions in User-Generated Discourse

8 0.082722887 1 acl-2010-"Ask Not What Textual Entailment Can Do for You..."

9 0.081394084 57 acl-2010-Bucking the Trend: Large-Scale Cost-Focused Active Learning for Statistical Machine Translation

10 0.078260869 230 acl-2010-The Manually Annotated Sub-Corpus: A Community Resource for and by the People

11 0.077623017 226 acl-2010-The Human Language Project: Building a Universal Corpus of the World's Languages

12 0.072723798 39 acl-2010-Automatic Generation of Story Highlights

13 0.068000168 82 acl-2010-Demonstration of a Prototype for a Conversational Companion for Reminiscing about Images

14 0.066950738 171 acl-2010-Metadata-Aware Measures for Answer Summarization in Community Question Answering

15 0.063940376 259 acl-2010-WebLicht: Web-Based LRT Services for German

16 0.060809299 47 acl-2010-Beetle II: A System for Tutoring and Computational Linguistics Experimentation

17 0.060506493 139 acl-2010-Identifying Generic Noun Phrases

18 0.059348702 227 acl-2010-The Impact of Interpretation Problems on Tutorial Dialogue

19 0.057035796 190 acl-2010-P10-5005 k2opt.pdf

20 0.051432688 58 acl-2010-Classification of Feedback Expressions in Multimodal Data


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.121), (1, 0.062), (2, -0.035), (3, -0.076), (4, -0.057), (5, -0.063), (6, -0.015), (7, 0.036), (8, -0.064), (9, -0.009), (10, 0.024), (11, 0.054), (12, -0.034), (13, 0.104), (14, -0.09), (15, 0.127), (16, -0.045), (17, -0.005), (18, 0.156), (19, 0.028), (20, -0.05), (21, 0.025), (22, 0.05), (23, -0.118), (24, -0.095), (25, -0.13), (26, 0.132), (27, 0.243), (28, 0.188), (29, -0.2), (30, -0.078), (31, -0.148), (32, -0.108), (33, -0.064), (34, -0.206), (35, -0.039), (36, 0.105), (37, 0.027), (38, 0.066), (39, 0.089), (40, 0.086), (41, -0.044), (42, 0.009), (43, 0.124), (44, -0.075), (45, -0.064), (46, 0.122), (47, -0.048), (48, -0.075), (49, -0.019)]
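The 50 (topicId, topicWeight) pairs above place this paper in a latent semantic space; LSI derives such a vector by a truncated SVD of the term-document matrix, and paper similarity is then computed in that reduced space. A minimal sketch, assuming scikit-learn and hypothetical documents:

# Minimal sketch of LSI: truncated SVD of a tfidf matrix gives each paper a
# dense topic-weight vector like the 50-dimensional one above.
# Assumes scikit-learn; docs is hypothetical, not the site's corpus.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "corpus annotation tutorial issues annotators",
    "annotation cost model eye-tracking annotators",
    "web based services for language resources and tools",
]

tfidf = TfidfVectorizer().fit_transform(docs)
lsi = TruncatedSVD(n_components=2, random_state=0)  # the list above uses 50
topic_weights = lsi.fit_transform(tfidf)            # shape (n_docs, n_components)
print(topic_weights[0])  # this paper's coordinates in the latent topic space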

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.98696011 31 acl-2010-Annotation

Author: Eduard Hovy

Abstract: unknown-abstract

2 0.7714963 4 acl-2010-A Cognitive Cost Model of Annotations Based on Eye-Tracking Data

Author: Katrin Tomanek ; Udo Hahn ; Steffen Lohmann ; Jurgen Ziegler

Abstract: We report on an experiment to track complex decision points in linguistic metadata annotation where the decision behavior of annotators is observed with an eye-tracking device. As experimental conditions we investigate different forms of textual context and linguistic complexity classes relative to syntax and semantics. Our data renders evidence that annotation performance depends on the semantic and syntactic complexity of the decision points and, more interestingly, indicates that full-scale context is mostly negligible, with the exception of semantic high-complexity cases. We then induce from this observational data a cognitively grounded cost model of linguistic meta-data annotations and compare it with existing non-cognitive models. Our data reveals that the cognitively founded model explains annotation costs (expressed in annotation time) more adequately than non-cognitive ones.

3 0.69876337 230 acl-2010-The Manually Annotated Sub-Corpus: A Community Resource for and by the People

Author: Nancy Ide ; Collin Baker ; Christiane Fellbaum ; Rebecca Passonneau

Abstract: The Manually Annotated Sub-Corpus (MASC) project provides data and annotations to serve as the base for a community-wide annotation effort of a subset of the American National Corpus. The MASC infrastructure enables the incorporation of contributed annotations into a single, usable format that can then be analyzed as it is or ported to any of a variety of other formats. MASC includes data from a much wider variety of genres than existing multiply-annotated corpora of English, and the project is committed to a fully open model of distribution, without restriction, for all data and annotations produced or contributed. As such, MASC is the first large-scale, open, community-based effort to create much needed language resources for NLP. This paper describes the MASC project, its corpus and annotations, and serves as a call for contributions of data and annotations from the language processing community.

4 0.58291572 259 acl-2010-WebLicht: Web-Based LRT Services for German

Author: Erhard Hinrichs ; Marie Hinrichs ; Thomas Zastrow

Abstract: This software demonstration presents WebLicht (short for: Web-Based Linguistic Chaining Tool), a web-based service environment for the integration and use of language resources and tools (LRT). WebLicht is being developed as part of the D-SPIN project. WebLicht is implemented as a web application so that there is no need for users to install any software on their own computers or to concern themselves with the technical details involved in building tool chains. The integrated web services are part of a prototypical infrastructure that was developed to facilitate chaining of LRT services. WebLicht allows the integration and use of distributed web services with standardized APIs. The nature of these open and standardized APIs makes it possible to access the web services from nearly any programming language, shell script, or workflow engine (UIMA, GATE, etc.). Additionally, an application for integration of additional services is available, allowing anyone to contribute his own web service.

5 0.58125478 57 acl-2010-Bucking the Trend: Large-Scale Cost-Focused Active Learning for Statistical Machine Translation

Author: Michael Bloodgood ; Chris Callison-Burch

Abstract: We explore how to improve machine translation systems by adding more translation data in situations where we already have substantial resources. The main challenge is how to buck the trend of diminishing returns that is commonly encountered. We present an active learning-style data solicitation algorithm to meet this challenge. We test it, gathering annotations via Amazon Mechanical Turk, and find that we get an order of magnitude increase in performance rates of improvement.

6 0.51561612 226 acl-2010-The Human Language Project: Building a Universal Corpus of the World's Languages

7 0.41230232 190 acl-2010-P10-5005 k2opt.pdf

8 0.40858343 58 acl-2010-Classification of Feedback Expressions in Multimodal Data

9 0.36602253 206 acl-2010-Semantic Parsing: The Task, the State of the Art and the Future

10 0.36380211 208 acl-2010-Sentence and Expression Level Annotation of Opinions in User-Generated Discourse

11 0.35528305 260 acl-2010-Wide-Coverage NLP with Linguistically Expressive Grammars

12 0.32993591 1 acl-2010-"Ask Not What Textual Entailment Can Do for You..."

13 0.32791799 86 acl-2010-Discourse Structure: Theory, Practice and Use

14 0.30912822 253 acl-2010-Using Smaller Constituents Rather Than Sentences in Active Learning for Japanese Dependency Parsing

15 0.27752304 225 acl-2010-Temporal Information Processing of a New Language: Fast Porting with Minimal Resources

16 0.2648426 82 acl-2010-Demonstration of a Prototype for a Conversational Companion for Reminiscing about Images

17 0.23971555 194 acl-2010-Phrase-Based Statistical Language Generation Using Graphical Models and Active Learning

18 0.2376584 64 acl-2010-Complexity Assumptions in Ontology Verbalisation

19 0.23733129 139 acl-2010-Identifying Generic Noun Phrases

20 0.23723553 136 acl-2010-How Many Words Is a Picture Worth? Automatic Caption Generation for News Images


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.272), (25, 0.035), (39, 0.021), (42, 0.013), (44, 0.104), (59, 0.101), (72, 0.01), (73, 0.067), (83, 0.183), (84, 0.015), (98, 0.088)]
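The sparse (topicId, topicWeight) pairs above are this paper's LDA topic mixture; similarity under the lda model compares these mixtures across papers. A minimal sketch, assuming scikit-learn and hypothetical documents (the site's actual topic count and training corpus are unknown):

# Minimal sketch of LDA topic weights: each paper gets a mixture over topics,
# reported as sparse (topicId, topicWeight) pairs like those above.
# Assumes scikit-learn; docs is hypothetical, not the site's corpus.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "corpus annotation tutorial issues annotators",
    "textual entailment open source package",
    "parsing supertagger speed accuracy domains",
]

counts = CountVectorizer().fit_transform(docs)   # LDA expects raw term counts
lda = LatentDirichletAllocation(n_components=3, random_state=0)
doc_topics = lda.fit_transform(counts)           # each row sums to 1

# Print only topics with non-trivial weight, mirroring the sparse list above.
for topic_id, weight in enumerate(doc_topics[0]):
    if weight > 0.05:
        print(f"({topic_id}, {weight:.3f})")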

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.83141309 31 acl-2010-Annotation

Author: Eduard Hovy

Abstract: unknown-abstract

2 0.79394269 30 acl-2010-An Open-Source Package for Recognizing Textual Entailment

Author: Milen Kouylekov ; Matteo Negri

Abstract: This paper presents a general-purpose open source package for recognizing Textual Entailment. The system implements a collection of algorithms, providing a configurable framework to quickly set up a working environment to experiment with the RTE task. Fast prototyping of new solutions is also allowed by the possibility to extend its modular architecture. We present the tool as a useful resource to approach the Textual Entailment problem, as an instrument for didactic purposes, and as an opportunity to create a collaborative environment to promote research in the field.

3 0.65197384 114 acl-2010-Faster Parsing by Supertagger Adaptation

Author: Jonathan K. Kummerfeld ; Jessika Roesner ; Tim Dawborn ; James Haggerty ; James R. Curran ; Stephen Clark

Abstract: We propose a novel self-training method for a parser which uses a lexicalised grammar and supertagger, focusing on increasing the speed of the parser rather than its accuracy. The idea is to train the supertagger on large amounts of parser output, so that the supertagger can learn to supply the supertags that the parser will eventually choose as part of the highest-scoring derivation. Since the supertagger supplies fewer supertags overall, the parsing speed is increased. We demonstrate the effectiveness of the method using a CCG supertagger and parser, obtaining significant speed increases on newspaper text with no loss in accuracy. We also show that the method can be used to adapt the CCG parser to new domains, obtaining accuracy and speed improvements for Wikipedia and biomedical text.

4 0.61635059 165 acl-2010-Learning Script Knowledge with Web Experiments

Author: Michaela Regneri ; Alexander Koller ; Manfred Pinkal

Abstract: We describe a novel approach to unsupervised learning of the events that make up a script, along with constraints on their temporal ordering. We collect naturallanguage descriptions of script-specific event sequences from volunteers over the Internet. Then we compute a graph representation of the script’s temporal structure using a multiple sequence alignment algorithm. The evaluation of our system shows that we outperform two informed baselines.

5 0.61568254 73 acl-2010-Coreference Resolution with Reconcile

Author: Veselin Stoyanov ; Claire Cardie ; Nathan Gilbert ; Ellen Riloff ; David Buttler ; David Hysom

Abstract: Despite the existence of several noun phrase coreference resolution data sets as well as several formal evaluations on the task, it remains frustratingly difficult to compare results across different coreference resolution systems. This is due to the high cost of implementing a complete end-to-end coreference resolution system, which often forces researchers to substitute available gold-standard information in lieu of implementing a module that would compute that information. Unfortunately, this leads to inconsistent and often unrealistic evaluation scenarios. With the aim to facilitate consistent and realistic experimental evaluations in coreference resolution, we present Reconcile, an infrastructure for the development of learning-based noun phrase (NP) coreference resolution systems. Reconcile is designed to facilitate the rapid creation of coreference resolution systems, easy implementation of new feature sets and approaches to coreference resolution, and empirical evaluation of coreference resolvers across a variety of benchmark data sets and standard scoring metrics. We describe Reconcile and present experimental results showing that Reconcile can be used to create a coreference resolver that achieves performance comparable to state-of-the-art systems on six benchmark data sets.

6 0.61510688 1 acl-2010-"Ask Not What Textual Entailment Can Do for You..."

7 0.6110003 219 acl-2010-Supervised Noun Phrase Coreference Research: The First Fifteen Years

8 0.6100682 132 acl-2010-Hierarchical Joint Learning: Improving Joint Parsing and Named Entity Recognition with Non-Jointly Labeled Data

9 0.60522032 72 acl-2010-Coreference Resolution across Corpora: Languages, Coding Schemes, and Preprocessing Information

10 0.60484165 38 acl-2010-Automatic Evaluation of Linguistic Quality in Multi-Document Summarization

11 0.60139394 230 acl-2010-The Manually Annotated Sub-Corpus: A Community Resource for and by the People

12 0.60128516 4 acl-2010-A Cognitive Cost Model of Annotations Based on Eye-Tracking Data

13 0.59480435 247 acl-2010-Unsupervised Event Coreference Resolution with Rich Linguistic Features

14 0.59358549 112 acl-2010-Extracting Social Networks from Literary Fiction

15 0.59325325 256 acl-2010-Vocabulary Choice as an Indicator of Perspective

16 0.59142143 32 acl-2010-Arabic Named Entity Recognition: Using Features Extracted from Noisy Data

17 0.59087545 243 acl-2010-Tree-Based and Forest-Based Translation

18 0.58806401 60 acl-2010-Collocation Extraction beyond the Independence Assumption

19 0.58543843 101 acl-2010-Entity-Based Local Coherence Modelling Using Topological Fields

20 0.58477277 134 acl-2010-Hierarchical Sequential Learning for Extracting Opinions and Their Attributes