37 emnlp-2010-Domain Adaptation of Rule-Based Annotators for Named-Entity Recognition Tasks

Source: pdf

Author: Laura Chiticariu ; Rajasekar Krishnamurthy ; Yunyao Li ; Frederick Reiss ; Shivakumar Vaithyanathan

Abstract: Named-entity recognition (NER) is an important task required in a wide variety of applications. While rule-based systems are appealing due to their well-known “explainability,” most, if not all, state-of-the-art results for NER tasks are based on machine learning techniques. Motivated by these results, we explore the following natural question in this paper: Are rule-based systems still a viable approach to named-entity recognition? Specifically, we have designed and implemented a high-level language NERL on top of SystemT, a general-purpose algebraic information extraction system. NERL is tuned to the needs of NER tasks and simplifies the process of building, understanding, and customizing complex rule-based named-entity annotators. We show that these customized annotators match or outperform the best published results achieved with machine learning techniques. These results confirm that we can reap the benefits of rule-based extractors’ explainability without sacrificing accuracy. We conclude by discussing lessons learned while building and customizing complex rule-based annotators and outlining several research directions towards facilitating rule development.

Reference: text

Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Domain Adaptation of Rule-Based Annotators for Named-Entity Recognition Tasks Laura Chiticariu Rajasekar Krishnamurthy Yunyao Li Frederick Reiss Shivakumar Vaithyanathan IBM Research Almaden 650 Harry Road, San Jose, CA 95120, USA {chiti, rajase, yunyaoli, frreiss}@us. [sent-1, score-0.121]

2 Abstract Named-entity recognition (NER) is an important task required in a wide variety of applications. [sent-5, score-0.136]

3 While rule-based systems are appealing due to their well-known “explainability,” most, if not all, state-of-the-art results for NER tasks are based on machine learning techniques. [sent-6, score-0.112]

4 Motivated by these results, we explore the following natural question in this paper: Are rule-based systems still a viable approach to named-entity recognition? [sent-7, score-0.135]

5 NERL is tuned to the needs of NER tasks and simplifies the process of building, understanding, and customizing complex rule-based named-entity annotators. [sent-9, score-0.34]

6 We show that these customized annotators match or outperform the best published results achieved with machine learning techniques. [sent-10, score-0.184]

7 These results confirm that we can reap the benefits of rule-based extractors’ explainability without sacrificing accuracy. [sent-11, score-0.34]

8 We conclude by discussing lessons learned while building and customizing complex rule-based annotators and outlining several research directions towards facilitating rule development. [sent-12, score-0.625]

9 1 Introduction Named-entity recognition (NER) is the task of identifying mentions of rigid designators from text belonging to named-entity types such as persons, organizations and locations (Nadeau and Sekine, 2007). [sent-13, score-0.279]

10 While NER over formal text such as news articles and webpages is a well-studied problem (Bikel et al. [sent-14, score-0.091]

11 , 2005), there has been recent work on NER over informal text such as emails and blogs (Huang et al. [sent-16, score-0.123]

12 The techniques proposed in the literature fall under three categories: rule-based (Krupka and Hausman, 2001; Sekine and Nobata, 2004), machine learning-based (O. [sent-20, score-0.065]

13 1 Motivation Although there are well-established rule-based systems to perform NER tasks, most, if not all, state-of-the-art results for NER tasks are based on machine learning techniques. [sent-26, score-0.036]

14 However, the rule-based approach is still extremely appealing due to the associated transparency of the internal system state, which leads to better explainability of errors (Siniakov, 2010). [sent-27, score-0.558]

15 Ideally, one would like to benefit from the transparency and explainability of rule-based techniques, while achieving state-of-the-art accuracy. [sent-28, score-0.449]

16 A particularly challenging aspect of rule-based NER in practice is domain customization: customizing existing annotators to produce accurate results in new domains. [sent-29, score-0.715]

17 In machine learning-based systems, adapting to a new domain has traditionally involved acquiring additional labeled data and learning a new model from scratch. [sent-30, score-0.218]

18 However, recent work has proposed more sophisticated approaches that learn a domain-independent base model, which can later be adapted to specific domains (Florian et al. [Page footer: Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, MIT, Massachusetts, USA, 9-11 October 2010.] [sent-31, score-0.029]

19 [Figure text garbled in extraction: an example customization requirement concerning city, county, or state names in sports articles.] [sent-38, score-0.111]

20 Implementing a similar approach for rule-based NER typically requires a significant amount of manual effort to (a) identify the explicit semantic changes required for the new domain (e. [sent-44, score-0.265]

21 , differences in entity type definition), (b) identify the portions of the (complex) core annotator that should be modified for each difference and (c) implement the required customization rules without compromising the extraction quality of the core annotator. [sent-46, score-0.692]

22 Domain customization of rule-based NER has not received much attention in the recent literature with a few exceptions (Petasis et al. [sent-47, score-0.363]

23 2 Problem Statement In this paper, we explore the following natural question: Are rule-based systems still a viable approach to named-entity recognition? [sent-52, score-0.135]

24 Specifically, (a) Is it possible to build, maintain and customize rule-based NER annotators that match the state-of-the-art results obtained using machine-learning techniques? [sent-53, score-0.179]

25 and (b) Can this be achieved with a reasonable amount of manual effort? [sent-54, score-0.039]

26 3 Contributions In this paper, we address the challenges mentioned above by (i) defining a taxonomy of the different types of customizations that a rule developer may perform when adapting to a new domain (Sec. [sent-56, score-0.302]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ner', 0.653), ('customization', 0.28), ('explainability', 0.28), ('customizing', 0.216), ('nerl', 0.21), ('annotators', 0.147), ('almaden', 0.14), ('transparency', 0.14), ('jansche', 0.12), ('viable', 0.108), ('abney', 0.1), ('sekine', 0.093), ('florian', 0.08), ('core', 0.079), ('recognition', 0.077), ('appealing', 0.076), ('domain', 0.072), ('adapting', 0.068), ('ra', 0.061), ('sacrificing', 0.06), ('gruhl', 0.06), ('ase', 0.06), ('compromising', 0.06), ('lessons', 0.06), ('maynard', 0.06), ('minkov', 0.06), ('required', 0.059), ('developer', 0.054), ('persons', 0.054), ('bender', 0.054), ('laura', 0.054), ('rigid', 0.054), ('road', 0.054), ('li', 0.052), ('mccallum', 0.05), ('operations', 0.05), ('sio', 0.05), ('frederick', 0.05), ('webpages', 0.05), ('wc', 0.05), ('facilitating', 0.05), ('shivakumar', 0.05), ('effort', 0.048), ('identify', 0.047), ('harry', 0.047), ('implementing', 0.047), ('extractors', 0.047), ('emails', 0.047), ('exceptions', 0.047), ('simplifies', 0.047), ('rule', 0.045), ('organizations', 0.044), ('jose', 0.044), ('bikel', 0.042), ('complex', 0.041), ('articles', 0.041), ('traditionally', 0.04), ('title', 0.04), ('manual', 0.039), ('informal', 0.038), ('sports', 0.038), ('acquiring', 0.038), ('blogs', 0.038), ('zhai', 0.038), ('discussing', 0.037), ('taxonomy', 0.037), ('statement', 0.037), ('belonging', 0.037), ('published', 0.037), ('blitzer', 0.037), ('literature', 0.036), ('tasks', 0.036), ('mentions', 0.035), ('portions', 0.034), ('finkel', 0.034), ('etzioni', 0.034), ('san', 0.034), ('hybrid', 0.032), ('maintain', 0.032), ('covering', 0.032), ('locations', 0.032), ('zhu', 0.032), ('ideally', 0.032), ('internal', 0.031), ('location', 0.031), ('extremely', 0.031), ('achieving', 0.029), ('fall', 0.029), ('adapted', 0.029), ('building', 0.029), ('document', 0.028), ('solutions', 0.028), ('ibm', 0.028), ('organization', 0.028), ('annotator', 0.027), ('proce', 0.027), ('explore', 0.027), ('rules', 0.027), ('jiang', 0.026), ('defining', 0.026)]
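
The page does not say how the word weights above or the similarity values below are produced. As a rough illustration only, the sketch here assumes a standard smoothed tf-idf weighting with L2 normalisation and cosine similarity between documents; the function names (tfidf_vectors, cosine, top_terms) and the toy corpus are hypothetical and are not taken from the site's actual pipeline.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one L2-normalised {term: weight} dict per tokenised document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed idf is an assumption; other tf-idf variants exist.
        weights = {
            t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1.0)
            for t, c in tf.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity of two L2-normalised sparse vectors (a dot product)."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def top_terms(vector, n=5):
    """The n highest-weighted (term, weight) pairs, like the topN-words list."""
    return sorted(vector.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Hypothetical toy corpus standing in for paper texts.
docs = [
    "rule based ner annotators domain customization ner".split(),
    "confidence weighted models for ner sequence labeling".split(),
    "phrase based machine translation decoding".split(),
]
vecs = tfidf_vectors(docs)
sims = [cosine(vecs[0], v) for v in vecs]
```

Under these assumptions a document compared with itself scores approximately 1.0 up to floating-point rounding, which would be consistent with the "same-paper" rows in the lists that follow, and documents sharing heavily weighted terms rank above those sharing none.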

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 37 emnlp-2010-Domain Adaptation of Rule-Based Annotators for Named-Entity Recognition Tasks

Author: Laura Chiticariu ; Rajasekar Krishnamurthy ; Yunyao Li ; Frederick Reiss ; Shivakumar Vaithyanathan

2 0.16817994 30 emnlp-2010-Confidence in Structured-Prediction Using Confidence-Weighted Models

Author: Avihai Mejer ; Koby Crammer

Abstract: Confidence-Weighted linear classifiers (CW) and its successors were shown to perform well on binary and multiclass NLP problems. In this paper we extend the CW approach for sequence learning and show that it achieves state-of-the-art performance on four noun phrase chunking and named entity recognition tasks. We then derive a few algorithmic approaches to estimate the prediction’s correctness of each label in the output sequence. We show that our approach provides a reliable relative correctness information as it outperforms other alternatives in ranking label-predictions according to their error. We also show empirically that our methods output close to absolute estimation of error. Finally, we show how to use this information to improve active learning.

3 0.098148823 104 emnlp-2010-The Necessity of Combining Adaptation Methods

Author: Ming-Wei Chang ; Michael Connor ; Dan Roth

Abstract: Problems stemming from domain adaptation continue to plague the statistical natural language processing community. There has been continuing work trying to find general purpose algorithms to alleviate this problem. In this paper we argue that existing general purpose approaches usually only focus on one of two issues related to the difficulties faced by adaptation: 1) difference in base feature statistics or 2) task differences that can be detected with labeled data. We argue that it is necessary to combine these two classes of adaptation algorithms, using evidence collected through theoretical analysis and simulated and real-world data experiments. We find that the combined approach often outperforms the individual adaptation approaches. By combining simple approaches from each class of adaptation algorithm, we achieve state-of-the-art results for both Named Entity Recognition adaptation task and the Preposition Sense Disambiguation adaptation task. Second, we also show that applying an adaptation algorithm that finds shared representation between domains often impacts the choice in adaptation algorithm that makes use of target labeled data.

4 0.06935209 8 emnlp-2010-A Multi-Pass Sieve for Coreference Resolution

Author: Karthik Raghunathan ; Heeyoung Lee ; Sudarshan Rangarajan ; Nate Chambers ; Mihai Surdeanu ; Dan Jurafsky ; Christopher Manning

Abstract: Most coreference resolution models determine if two mentions are coreferent using a single function over a set of constraints or features. This approach can lead to incorrect decisions as lower precision features often overwhelm the smaller number of high precision ones. To overcome this problem, we propose a simple coreference architecture based on a sieve that applies tiers of deterministic coreference models one at a time from highest to lowest precision. Each tier builds on the previous tier’s entity cluster output. Further, our model propagates global information by sharing attributes (e.g., gender and number) across mentions in the same cluster. This cautious sieve guarantees that stronger features are given precedence over weaker ones and that each decision is made using all of the information available at the time. The framework is highly modular: new coreference modules can be plugged in without any change to the other modules. In spite of its simplicity, our approach outperforms many state-of-the-art supervised and unsupervised models on several standard corpora. This suggests that sieve-based approaches could be applied to other NLP tasks.

5 0.052641839 44 emnlp-2010-Enhancing Mention Detection Using Projection via Aligned Corpora

Author: Yassine Benajiba ; Imed Zitouni

Abstract: The research question treated in this paper is centered on the idea of exploiting rich resources of one language to enhance the performance of a mention detection system of another one. We successfully achieve this goal by projecting information from one language to another via a parallel corpus. We examine the potential improvement using various degrees of linguistic information in a statistical framework and we show that the proposed technique is effective even when the target language model has access to a significantly rich feature set. Experimental results show up to 2.4F improvement in performance when the system has access to information obtained by projecting mentions from a resource-rich language mention detection system via a parallel corpus.

6 0.051008645 119 emnlp-2010-We're Not in Kansas Anymore: Detecting Domain Changes in Streams

7 0.043299362 67 emnlp-2010-It Depends on the Translation: Unsupervised Dependency Parsing via Word Alignment

8 0.041458387 62 emnlp-2010-Improving Mention Detection Robustness to Noisy Input

9 0.03480335 39 emnlp-2010-EMNLP 044

10 0.033522997 12 emnlp-2010-A Semi-Supervised Method to Learn and Construct Taxonomies Using the Web

11 0.031556409 57 emnlp-2010-Hierarchical Phrase-Based Translation Grammars Extracted from Alignment Posterior Probabilities

12 0.02836111 70 emnlp-2010-Jointly Modeling Aspects and Opinions with a MaxEnt-LDA Hybrid

13 0.027692126 51 emnlp-2010-Function-Based Question Classification for General QA

14 0.027259959 48 emnlp-2010-Exploiting Conversation Structure in Unsupervised Topic Segmentation for Emails

15 0.027145231 20 emnlp-2010-Automatic Detection and Classification of Social Events

16 0.026557654 41 emnlp-2010-Efficient Graph-Based Semi-Supervised Learning of Structured Tagging Models

17 0.025349822 28 emnlp-2010-Collective Cross-Document Relation Extraction Without Labelled Data

18 0.024158634 114 emnlp-2010-Unsupervised Parse Selection for HPSG

19 0.023739241 33 emnlp-2010-Cross Language Text Classification by Model Translation and Semi-Supervised Learning

20 0.023688173 116 emnlp-2010-Using Universal Linguistic Knowledge to Guide Grammar Induction

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.097), (1, 0.06), (2, -0.01), (3, 0.062), (4, -0.041), (5, -0.101), (6, 0.126), (7, 0.106), (8, -0.035), (9, 0.042), (10, 0.041), (11, 0.061), (12, -0.047), (13, 0.137), (14, -0.008), (15, -0.03), (16, -0.048), (17, -0.27), (18, 0.04), (19, -0.125), (20, 0.056), (21, -0.098), (22, 0.154), (23, -0.017), (24, -0.214), (25, 0.109), (26, 0.03), (27, -0.058), (28, 0.12), (29, -0.129), (30, 0.026), (31, 0.003), (32, 0.126), (33, -0.159), (34, 0.185), (35, 0.041), (36, 0.282), (37, -0.041), (38, -0.088), (39, 0.033), (40, -0.008), (41, -0.019), (42, -0.027), (43, -0.113), (44, -0.037), (45, -0.157), (46, -0.108), (47, 0.137), (48, 0.099), (49, -0.09)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97853076 37 emnlp-2010-Domain Adaptation of Rule-Based Annotators for Named-Entity Recognition Tasks

Author: Laura Chiticariu ; Rajasekar Krishnamurthy ; Yunyao Li ; Frederick Reiss ; Shivakumar Vaithyanathan

2 0.64331132 30 emnlp-2010-Confidence in Structured-Prediction Using Confidence-Weighted Models

Author: Avihai Mejer ; Koby Crammer

3 0.34957892 119 emnlp-2010-We're Not in Kansas Anymore: Detecting Domain Changes in Streams

Author: Mark Dredze ; Tim Oates ; Christine Piatko

Abstract: Domain adaptation, the problem of adapting a natural language processing system trained in one domain to perform well in a different domain, has received significant attention. This paper addresses an important problem for deployed systems that has received little attention – detecting when such adaptation is needed by a system operating in the wild, i.e., performing classification over a stream of unlabeled examples. Our method uses A-distance, a metric for detecting shifts in data streams, combined with classification margins to detect domain shifts. We empirically show effective domain shift detection on a variety of data sets and shift conditions.

4 0.26873478 8 emnlp-2010-A Multi-Pass Sieve for Coreference Resolution

Author: Karthik Raghunathan ; Heeyoung Lee ; Sudarshan Rangarajan ; Nate Chambers ; Mihai Surdeanu ; Dan Jurafsky ; Christopher Manning

5 0.19071458 104 emnlp-2010-The Necessity of Combining Adaptation Methods

Author: Ming-Wei Chang ; Michael Connor ; Dan Roth

6 0.17102438 23 emnlp-2010-Automatic Keyphrase Extraction via Topic Decomposition

7 0.162663 12 emnlp-2010-A Semi-Supervised Method to Learn and Construct Taxonomies Using the Web

8 0.1361845 72 emnlp-2010-Learning First-Order Horn Clauses from Web Text

9 0.13192442 65 emnlp-2010-Inducing Probabilistic CCG Grammars from Logical Form with Higher-Order Unification

10 0.13113648 48 emnlp-2010-Exploiting Conversation Structure in Unsupervised Topic Segmentation for Emails

11 0.12412576 67 emnlp-2010-It Depends on the Translation: Unsupervised Dependency Parsing via Word Alignment

12 0.11964716 91 emnlp-2010-Practical Linguistic Steganography Using Contextual Synonym Substitution and Vertex Colour Coding

13 0.11850633 51 emnlp-2010-Function-Based Question Classification for General QA

14 0.11210193 32 emnlp-2010-Context Comparison of Bursty Events in Web Search and Online Media

15 0.11077584 113 emnlp-2010-Unsupervised Induction of Tree Substitution Grammars for Dependency Parsing

16 0.10534304 99 emnlp-2010-Statistical Machine Translation with a Factorized Grammar

17 0.10060779 120 emnlp-2010-What's with the Attitude? Identifying Sentences with Attitude in Online Discussions

18 0.10054317 61 emnlp-2010-Improving Gender Classification of Blog Authors

19 0.098959006 4 emnlp-2010-A Game-Theoretic Approach to Generating Spatial Descriptions

20 0.097969905 57 emnlp-2010-Hierarchical Phrase-Based Translation Grammars Extracted from Alignment Posterior Probabilities

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.036), (10, 0.039), (12, 0.034), (29, 0.048), (46, 0.443), (52, 0.051), (56, 0.03), (62, 0.018), (66, 0.096), (72, 0.042), (76, 0.012), (79, 0.015), (89, 0.017)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.69841629 37 emnlp-2010-Domain Adaptation of Rule-Based Annotators for Named-Entity Recognition Tasks

Author: Laura Chiticariu ; Rajasekar Krishnamurthy ; Yunyao Li ; Frederick Reiss ; Shivakumar Vaithyanathan

2 0.27753627 5 emnlp-2010-A Hybrid Morpheme-Word Representation for Machine Translation of Morphologically Rich Languages

Author: Minh-Thang Luong ; Preslav Nakov ; Min-Yen Kan

Abstract: We propose a language-independent approach for improving statistical machine translation for morphologically rich languages using a hybrid morpheme-word representation where the basic unit of translation is the morpheme, but word boundaries are respected at all stages of the translation process. Our model extends the classic phrase-based model by means of (1) word boundary-aware morpheme-level phrase extraction, (2) minimum error-rate training for a morpheme-level translation model using word-level BLEU, and (3) joint scoring with morpheme- and word-level language models. Further improvements are achieved by combining our model with the classic one. The evaluation on English to Finnish using Europarl (714K sentence pairs; 15.5M English words) shows statistically significant improvements over the classic model based on BLEU and human judgments.

3 0.27726668 35 emnlp-2010-Discriminative Sample Selection for Statistical Machine Translation

Author: Sankaranarayanan Ananthakrishnan ; Rohit Prasad ; David Stallard ; Prem Natarajan

Abstract: Production of parallel training corpora for the development of statistical machine translation (SMT) systems for resource-poor languages usually requires extensive manual effort. Active sample selection aims to reduce the labor, time, and expense incurred in producing such resources, attaining a given performance benchmark with the smallest possible training corpus by choosing informative, nonredundant source sentences from an available candidate pool for manual translation. We present a novel, discriminative sample selection strategy that preferentially selects batches of candidate sentences with constructs that lead to erroneous translations on a held-out development set. The proposed strategy supports a built-in diversity mechanism that reduces redundancy in the selected batches. Simulation experiments on English-to-Pashto and Spanish-to-English translation tasks demonstrate the superiority of the proposed approach to a number of competing techniques, such as random selection, dissimilarity-based selection, as well as a recently proposed semisupervised active learning strategy.

4 0.27380407 69 emnlp-2010-Joint Training and Decoding Using Virtual Nodes for Cascaded Segmentation and Tagging Tasks

Author: Xian Qian ; Qi Zhang ; Yaqian Zhou ; Xuanjing Huang ; Lide Wu

Abstract: Many sequence labeling tasks in NLP require solving a cascade of segmentation and tagging subtasks, such as Chinese POS tagging, named entity recognition, and so on. Traditional pipeline approaches usually suffer from error propagation. Joint training/decoding in the cross-product state space could cause too many parameters and high inference complexity. In this paper, we present a novel method which integrates graph structures of two subtasks into one using virtual nodes, and performs joint training and decoding in the factorized state space. Experimental evaluations on CoNLL 2000 shallow parsing data set and Fourth SIGHAN Bakeoff CTB POS tagging data set demonstrate the superiority of our method over cross-product, pipeline and candidate reranking approaches.

5 0.27367195 67 emnlp-2010-It Depends on the Translation: Unsupervised Dependency Parsing via Word Alignment

Author: Samuel Brody

Abstract: We reveal a previously unnoticed connection between dependency parsing and statistical machine translation (SMT), by formulating the dependency parsing task as a problem of word alignment. Furthermore, we show that two well known models for these respective tasks (DMV and the IBM models) share common modeling assumptions. This motivates us to develop an alignment-based framework for unsupervised dependency parsing. The framework (which will be made publicly available) is flexible, modular and easy to extend. Using this framework, we implement several algorithms based on the IBM alignment models, which prove surprisingly effective on the dependency parsing task, and demonstrate the potential of the alignment-based approach.

6 0.27311313 62 emnlp-2010-Improving Mention Detection Robustness to Noisy Input

7 0.26865053 11 emnlp-2010-A Semi-Supervised Approach to Improve Classification of Infrequent Discourse Relations Using Feature Vector Extension

8 0.26839638 104 emnlp-2010-The Necessity of Combining Adaptation Methods

9 0.26801139 51 emnlp-2010-Function-Based Question Classification for General QA

10 0.2680074 18 emnlp-2010-Assessing Phrase-Based Translation Models with Oracle Decoding

11 0.2668916 61 emnlp-2010-Improving Gender Classification of Blog Authors

12 0.26677629 119 emnlp-2010-We're Not in Kansas Anymore: Detecting Domain Changes in Streams

13 0.2667678 76 emnlp-2010-Maximum Entropy Based Phrase Reordering for Hierarchical Phrase-Based Translation

14 0.26601744 86 emnlp-2010-Non-Isomorphic Forest Pair Translation

15 0.26541483 49 emnlp-2010-Extracting Opinion Targets in a Single and Cross-Domain Setting with Conditional Random Fields

16 0.26478592 123 emnlp-2010-Word-Based Dialect Identification with Georeferenced Rules

17 0.26460347 92 emnlp-2010-Predicting the Semantic Compositionality of Prefix Verbs

18 0.26414427 29 emnlp-2010-Combining Unsupervised and Supervised Alignments for MT: An Empirical Study

19 0.26373661 36 emnlp-2010-Discriminative Word Alignment with a Function Word Reordering Model

20 0.2635695 50 emnlp-2010-Facilitating Translation Using Source Language Paraphrase Lattices