acl acl2011 acl2011-291 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Yunyao Li ; Frederick Reiss ; Laura Chiticariu
Abstract: Emerging text-intensive enterprise applications such as social analytics and semantic search pose new challenges of scalability and usability to Information Extraction (IE) systems. This paper presents SystemT, a declarative IE system that addresses these challenges and has been deployed in a wide range of enterprise applications. SystemT facilitates the development of high quality complex annotators by providing a highly expressive language and an advanced development environment. It also includes a cost-based optimizer and a high-performance, flexible runtime with minimum memory footprint. We present SystemT as a useful resource that is freely available, and as an opportunity to promote research in building scalable and usable IE systems.
Reference: text
sentIndex sentText sentNum sentScore
1 SystemT: A Declarative Information Extraction System. Yunyao Li, IBM Research - Almaden, 650 Harry Road, San Jose, CA 95120. [sent-1, score-0.056]
2 Laura Chiticariu, IBM Research - Almaden, 650 Harry Road, San Jose, CA 95120. [sent-6, score-0.031]
3 magnitude larger than classical IE corpora. [sent-8, score-0.031]
4 Emerging text-intensive enterprise applications such as social analytics and semantic search pose new challenges of scalability and usability to Information Extraction (IE) systems. [sent-9, score-0.627]
5 This paper presents SystemT, a declarative IE system that addresses these challenges and has been deployed in a wide range of enterprise applications. [sent-10, score-0.521]
6 SystemT facilitates the development of high quality complex annotators by providing a highly expressive language and an advanced development environment. [sent-11, score-0.104]
7 It also includes a cost-based optimizer and a high-performance, flexible runtime with minimum memory footprint. [sent-12, score-0.257]
8 1 Introduction Information extraction (IE) refers to the extraction of structured information from text documents. [sent-14, score-0.102]
9 In recent years, text analytics have become the driving force for many emerging enterprise applications such as compliance and data redaction. [sent-15, score-0.443]
10 In addition, the inclusion of text has also been increasingly important for many traditional enterprise applications such as business intelligence. [sent-16, score-0.321]
11 Not surprisingly, the use of information extraction has dramatically increased within the enterprise over the years. [sent-17, score-0.331]
12 While the traditional requirement of extraction quality remains critical, enterprise applications pose two challenges to IE systems: 1. [sent-18, score-0.447]
13 Scalability: Enterprise applications operate over large volumes of data, often orders of magnitude larger than classical IE corpora. An IE system should be able to operate at those scales without compromising its execution efficiency or memory consumption. [sent-19, score-0.351]
14 Therefore, the usability of an enterprise IE system in terms of ease of development and maintenance is crucial for ensuring a healthy product cycle and timely handling of customer complaints. [sent-22, score-0.497]
15 Traditionally, IE systems have been built from individual extraction components consisting of rules or machine learning models. [sent-23, score-0.119]
16 These individual components are then connected procedurally in a programming language such as C++, Perl or Java. [sent-24, score-0.061]
17 Such procedural logic towards IE cannot meet the increasing scalability and usability requirements in the enterprise (Doan et al. [sent-25, score-0.275]
18 Three decades ago, the database community faced similar scalability and expressivity challenges in accessing structured information. [sent-28, score-0.268]
19 The community addressed these problems by introducing a relational algebra formalism and an associated declarative query language SQL. [sent-29, score-0.266]
20 Borrowing ideas from the database community, several systems (Doan and others, 2008; Bohannon and others, 2008; Jain et al. [sent-30, score-0.036]
21 , 2010) have been built in recent years taking an alternative declarative approach to information extraction. [sent-33, score-0.158]
22 Instead of using procedural logic to implement the extraction task, declarative IE systems separate the description of what to extract from how to extract it, allowing the IE developer to build complex extraction programs without worrying about performance considerations. [sent-34, score-0.374]
23 Proceedings of the ACL-HLT 2011 System Demonstrations, pages 109–114, Portland, Oregon, USA, 21 June 2011. © 2011 Association for Computational Linguistics. Figure 1: Overview of SystemT. [sent-36, score-0.046]
24 In this demonstration, we showcase one such declarative IE system called SystemT, designed to address the scalability and usability challenges. [sent-37, score-0.394]
25 We illustrate how SystemT, currently deployed in a multitude of real-world applications and commercial products, can be used to develop and maintain IE annotators for enterprise applications. [sent-38, score-0.39]
26 The SystemT Development Environment supports the iterative process of constructing and refining rules for information extraction. [sent-45, score-0.035]
27 The rules are specified in a declarative language called AQL (F. [sent-46, score-0.22]
28 The Development Environment provides facilities for executing rules over a given corpus of representative documents and visualizing the results of the execution. [sent-49, score-0.059]
29 Once a developer is satisfied with the results that her rules produce on these documents, she can publish her annotator. [sent-50, score-0.108]
30 First, given an AQL annotator, there can be many possible graphs of operators, or execution plans, each of which faithfully implements the semantics of the annotator. [sent-52, score-0.186]
31 Some of the execution plans are much more efficient than others. [sent-53, score-0.254]
32 The SystemT Optimizer explores the space of the possible execution plans to choose the most efficient one. [sent-54, score-0.289]
33 This execution plan is then given to the SystemT Runtime to instantiate the corresponding physical operators. [sent-55, score-0.213]
34 Figure 2: An AQL program for a PersonPhone task. [sent-56, score-0.139]
35 Once the physical operators are instantiated, the SystemT Runtime feeds one document at a time through the graph of physical operators and outputs a stream of annotated documents. [sent-57, score-0.168]
36 The decoupling of the Development and Runtime environments is essential for the flexibility of the system. [sent-58, score-0.03]
37 It facilitates the incorporation of various sophisticated tools to enable annotator development without sacrificing runtime performance. [sent-59, score-0.352]
38 Furthermore, the separation permits the SystemT Runtime to be embedded into larger applications with minimum memory footprint. [sent-60, score-0.077]
39 Next, we discuss individual components of SystemT in more detail (Sections 3–6), and summarize our experience with the system in a variety of enterprise applications (Section 7). [sent-61, score-0.354]
40 3 The Extraction Language In SystemT, developers express an information extraction program using a language called AQL. [sent-62, score-0.104]
41 AQL is a declarative relational language similar in syntax to the database language SQL, which was chosen as a basis for our language due to its expressivity and familiarity. [sent-63, score-0.291]
42 An AQL program (or an AQL annotator) consists of a set of AQL rules. [sent-64, score-0.026]
43 In this section, we describe the AQL language and its underlying algebraic operators. [sent-65, score-0.026]
44 In Section 4, we explain how the SystemT optimizer explores a large space of possible execution plans for an AQL annotator and chooses one that is most efficient. [sent-66, score-0.472]
45 3.1 AQL Figure 2 illustrates a (very) simplistic annotator of relationships between persons and their phone numbers. [sent-68, score-0.19]
46 At a high-level, the annotator identifies person names using a simple dictionary of first names, and phone numbers using a regular expression. [sent-69, score-0.191]
47 It then identifies pairs of Person and Phone annotations, where the latter follows the former within 0 to 5 tokens, and marks the corresponding region of text as a PersonPhoneAll annotation. [sent-70, score-0.083]
48 The final output PersonPhone is constructed by removing overlapping PersonPhoneAll annotations. [sent-71, score-0.029]
49 AQL operates over a simple relational data model with three data types: span, tuple, and view. [sent-72, score-0.047]
50 In this data model, a span is a region of text within a document identified by its “begin” and “end” positions, while a tuple is a list of spans of fixed size. [sent-73, score-0.158]
51 As such, a view is the basic building block in AQL: it consists of a logical description of a set of tuples in terms of the document text, or the content of other views. [sent-76, score-0.064]
52 The input to the annotator is a special view called Document containing a single tuple with the document text. [sent-77, score-0.245]
53 The AQL annotator tags some views as output views, which specify the annotation types that are the final results of the annotator. [sent-78, score-0.108]
54 The example in Figure 2 illustrates two of the basic constructs of AQL. [sent-79, score-0.024]
55 The extract statement specifies basic character-level extraction primitives, such as regular expressions or dictionaries (i.e., [sent-80, score-0.22]
56 gazetteers), that are applied directly to the document, or a region thereof. [sent-82, score-0.058]
57 The select statement is similar to the corresponding SQL statement, but contains an additional consolidate on clause for resolving overlapping annotations, along with an extensive collection of text-specific predicates. [sent-83, score-0.132]
58 To keep rules compact, AQL also allows a shorthand pattern notation similar to the syntax of the CPSL grammar standard (Appelt and Onyshkevych, 1998). [sent-84, score-0.057]
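The PersonPhone annotator described in sentences 46 to 48 can be approximated outside AQL to make its semantics concrete. The following is a minimal Python sketch, not SystemT code and not the AQL program of Figure 2: the first-name dictionary, the phone regular expression, the tokenizer, and the consolidation policy (drop any span contained in another span) are all assumptions chosen for illustration.

```python
import re
from typing import List, NamedTuple


class Span(NamedTuple):
    """A region of text identified by its begin and end character offsets."""
    begin: int  # inclusive
    end: int    # exclusive


# Hypothetical resources; the paper does not list the actual dictionary or regex.
FIRST_NAMES = {"john", "mary"}
PHONE_RE = re.compile(r"\d{3}-\d{4}")
TOKEN_RE = re.compile(r"\w+|[^\w\s]")  # crude tokenizer (assumption)


def tokenize(text: str) -> List[Span]:
    return [Span(m.start(), m.end()) for m in TOKEN_RE.finditer(text)]


def extract_person(text: str) -> List[Span]:
    """Dictionary-based extraction of first names."""
    return [Span(m.start(), m.end())
            for m in re.finditer(r"\w+", text)
            if m.group().lower() in FIRST_NAMES]


def extract_phone(text: str) -> List[Span]:
    """Regex-based extraction of phone numbers."""
    return [Span(m.start(), m.end()) for m in PHONE_RE.finditer(text)]


def follows_tok(first: Span, second: Span, lo: int, hi: int,
                toks: List[Span]) -> bool:
    """True if `second` starts after `first` with lo..hi tokens in between."""
    if second.begin < first.end:
        return False
    gap = sum(1 for t in toks
              if t.begin >= first.end and t.end <= second.begin)
    return lo <= gap <= hi


def consolidate(spans: List[Span]) -> List[Span]:
    """Assumed policy: drop every span that is contained in a different span."""
    kept = {s for s in spans
            if not any(o != s and o.begin <= s.begin and s.end <= o.end
                       for o in spans)}
    return sorted(kept)


def person_phone(text: str) -> List[Span]:
    toks = tokenize(text)
    # PersonPhoneAll: a Person followed by a Phone within 0 to 5 tokens.
    person_phone_all = [Span(p.begin, ph.end)
                        for p in extract_person(text)
                        for ph in extract_phone(text)
                        if follows_tok(p, ph, 0, 5, toks)]
    # PersonPhone: PersonPhoneAll after resolving overlaps.
    return consolidate(person_phone_all)
```

Calling `person_phone("John Mary 555-1234")` first produces two overlapping candidate spans (one starting at each name) and keeps only the longer one after consolidation. A declarative system can reorder or fuse these steps freely, which is the point the sentences above make; this sketch simply fixes one naive evaluation order.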
wordName wordTfidf (topN-words)
[('systemt', 0.547), ('aql', 0.517), ('enterprise', 0.28), ('ie', 0.195), ('declarative', 0.158), ('execution', 0.158), ('runtime', 0.146), ('annotator', 0.108), ('harry', 0.103), ('usability', 0.102), ('plans', 0.096), ('almaden', 0.091), ('scalability', 0.079), ('optimizer', 0.075), ('doan', 0.069), ('onphoneal', 0.069), ('personphone', 0.069), ('personphoneall', 0.069), ('ract', 0.069), ('sql', 0.069), ('jose', 0.069), ('road', 0.067), ('procedural', 0.061), ('ibm', 0.059), ('operators', 0.058), ('phone', 0.058), ('region', 0.058), ('chiticariu', 0.056), ('yunyao', 0.056), ('physical', 0.055), ('statement', 0.052), ('extraction', 0.051), ('analytics', 0.05), ('expressivity', 0.05), ('developer', 0.048), ('ext', 0.048), ('emerging', 0.048), ('relational', 0.047), ('environment', 0.046), ('tuple', 0.046), ('deployed', 0.043), ('applications', 0.041), ('challenges', 0.04), ('facilitates', 0.04), ('san', 0.038), ('community', 0.037), ('memory', 0.036), ('database', 0.036), ('pose', 0.035), ('explores', 0.035), ('rules', 0.035), ('view', 0.033), ('logic', 0.033), ('components', 0.033), ('development', 0.032), ('operate', 0.031), ('com', 0.031), ('document', 0.031), ('patte', 0.03), ('timely', 0.03), ('compromising', 0.03), ('decoupling', 0.03), ('overlapping', 0.029), ('reiss', 0.028), ('healthy', 0.028), ('faithfully', 0.028), ('appelt', 0.028), ('lect', 0.028), ('showcase', 0.028), ('primitives', 0.028), ('procedurally', 0.028), ('called', 0.027), ('algebraic', 0.026), ('krishnamurthy', 0.026), ('sacrificing', 0.026), ('ago', 0.026), ('accessing', 0.026), ('perl', 0.026), ('multitude', 0.026), ('ca', 0.026), ('program', 0.026), ('maintenance', 0.025), ('pers', 0.025), ('publish', 0.025), ('identifies', 0.025), ('illustrates', 0.024), ('driving', 0.024), ('algebra', 0.024), ('feeds', 0.024), ('volumes', 0.024), ('executing', 0.024), ('span', 0.023), ('cons', 0.023), ('doecmiaotinosntr', 0.023), ('egdoin', 0.023), ('faotiron', 0.023), ('osaf', 0.023), ('papguetast', 
0.023), ('gazetteers', 0.023), ('shorthand', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 291 acl-2011-SystemT: A Declarative Information Extraction System
2 0.055970594 60 acl-2011-Better Hypothesis Testing for Statistical Machine Translation: Controlling for Optimizer Instability
Author: Jonathan H. Clark ; Chris Dyer ; Alon Lavie ; Noah A. Smith
Abstract: In statistical machine translation, a researcher seeks to determine whether some innovation (e.g., a new feature, model, or inference algorithm) improves translation quality in comparison to a baseline system. To answer this question, he runs an experiment to evaluate the behavior of the two systems on held-out data. In this paper, we consider how to make such experiments more statistically reliable. We provide a systematic analysis of the effects of optimizer instability—an extraneous variable that is seldom controlled for—on experimental outcomes, and make recommendations for reporting results more accurately.
3 0.044593666 170 acl-2011-In-domain Relation Discovery with Meta-constraints via Posterior Regularization
Author: Harr Chen ; Edward Benson ; Tahira Naseem ; Regina Barzilay
Abstract: We present a novel approach to discovering relations and their instantiations from a collection of documents in a single domain. Our approach learns relation types by exploiting meta-constraints that characterize the general qualities of a good relation in any domain. These constraints state that instances of a single relation should exhibit regularities at multiple levels of linguistic structure, including lexicography, syntax, and document-level context. We capture these regularities via the structure of our probabilistic model as well as a set of declaratively-specified constraints enforced during posterior inference. Across two domains our approach successfully recovers hidden relation structure, comparable to or outperforming previous state-of-the-art approaches. Furthermore, we find that a small set of constraints is applicable across the domains, and that using domain-specific constraints can further improve performance.
4 0.042252604 293 acl-2011-Template-Based Information Extraction without the Templates
Author: Nathanael Chambers ; Dan Jurafsky
Abstract: Standard algorithms for template-based information extraction (IE) require predefined template schemas, and often labeled data, to learn to extract their slot fillers (e.g., an embassy is the Target of a Bombing template). This paper describes an approach to template-based IE that removes this requirement and performs extraction without knowing the template structure in advance. Our algorithm instead learns the template structure automatically from raw text, inducing template schemas as sets of linked events (e.g., bombings include detonate, set off, and destroy events) associated with semantic roles. We also solve the standard IE task, using the induced syntactic patterns to extract role fillers from specific documents. We evaluate on the MUC-4 terrorism dataset and show that we induce template structure very similar to hand-created gold structure, and we extract role fillers with an F1 score of .40, approaching the performance of algorithms that require full knowledge of the templates.
5 0.034714092 190 acl-2011-Knowledge-Based Weak Supervision for Information Extraction of Overlapping Relations
Author: Raphael Hoffmann ; Congle Zhang ; Xiao Ling ; Luke Zettlemoyer ; Daniel S. Weld
Abstract: Information extraction (IE) holds the promise of generating a large-scale knowledge base from the Web’s natural language text. Knowledge-based weak supervision, using structured data to heuristically label a training corpus, works towards this goal by enabling the automated learning of a potentially unbounded number of relation extractors. Recently, researchers have developed multi-instance learning algorithms to combat the noisy training data that can come from heuristic labeling, but their models assume relations are disjoint; for example, they cannot extract the pair Founded(Jobs, Apple) and CEO-of(Jobs, Apple). This paper presents a novel approach for multi-instance learning with overlapping relations that combines a sentence-level extraction model with a simple, corpus-level component for aggregating the individual facts. We apply our model to learn extractors for NY Times text using weak supervision from Freebase. Experiments show that the approach runs quickly and yields surprising gains in accuracy, at both the aggregate and sentence level.
6 0.033209711 42 acl-2011-An Interface for Rapid Natural Language Processing Development in UIMA
7 0.031520341 191 acl-2011-Knowledge Base Population: Successful Approaches and Challenges
8 0.030234154 206 acl-2011-Learning to Transform and Select Elementary Trees for Improved Syntax-based Machine Translations
9 0.028955825 13 acl-2011-A Graph Approach to Spelling Correction in Domain-Centric Search
10 0.02700695 26 acl-2011-A Speech-based Just-in-Time Retrieval System using Semantic Search
11 0.026391558 194 acl-2011-Language Use: What can it tell us?
12 0.023949469 308 acl-2011-Towards a Framework for Abstractive Summarization of Multimodal Documents
13 0.02387085 86 acl-2011-Coreference for Learning to Extract Relations: Yes Virginia, Coreference Matters
14 0.022731729 335 acl-2011-Why Initialization Matters for IBM Model 1: Multiple Optima and Non-Strict Convexity
15 0.022671111 244 acl-2011-Peeling Back the Layers: Detecting Event Role Fillers in Secondary Contexts
16 0.022132987 11 acl-2011-A Fast and Accurate Method for Approximate String Search
17 0.021972012 79 acl-2011-Confidence Driven Unsupervised Semantic Parsing
18 0.021001749 286 acl-2011-Social Network Extraction from Texts: A Thesis Proposal
19 0.020985059 28 acl-2011-A Statistical Tree Annotator and Its Applications
20 0.020877177 226 acl-2011-Multi-Modal Annotation of Quest Games in Second Life
topicId topicWeight
[(0, 0.066), (1, 0.008), (2, -0.022), (3, 0.016), (4, -0.001), (5, 0.012), (6, -0.017), (7, -0.02), (8, -0.027), (9, -0.004), (10, -0.018), (11, 0.007), (12, -0.001), (13, 0.039), (14, -0.023), (15, -0.025), (16, 0.012), (17, -0.02), (18, -0.0), (19, -0.007), (20, 0.01), (21, 0.016), (22, 0.024), (23, -0.0), (24, -0.011), (25, -0.002), (26, 0.018), (27, -0.007), (28, 0.033), (29, -0.003), (30, 0.002), (31, 0.035), (32, 0.04), (33, 0.008), (34, -0.001), (35, 0.001), (36, -0.018), (37, 0.023), (38, 0.001), (39, 0.032), (40, 0.032), (41, -0.028), (42, 0.054), (43, 0.024), (44, 0.013), (45, 0.045), (46, -0.014), (47, -0.021), (48, 0.024), (49, 0.005)]
simIndex simValue paperId paperTitle
same-paper 1 0.88293368 291 acl-2011-SystemT: A Declarative Information Extraction System
2 0.5153777 190 acl-2011-Knowledge-Based Weak Supervision for Information Extraction of Overlapping Relations
Author: Raphael Hoffmann ; Congle Zhang ; Xiao Ling ; Luke Zettlemoyer ; Daniel S. Weld
Abstract: Information extraction (IE) holds the promise of generating a large-scale knowledge base from the Web’s natural language text. Knowledge-based weak supervision, using structured data to heuristically label a training corpus, works towards this goal by enabling the automated learning of a potentially unbounded number of relation extractors. Recently, researchers have developed multi-instance learning algorithms to combat the noisy training data that can come from heuristic labeling, but their models assume relations are disjoint; for example, they cannot extract the pair Founded(Jobs, Apple) and CEO-of(Jobs, Apple). This paper presents a novel approach for multi-instance learning with overlapping relations that combines a sentence-level extraction model with a simple, corpus-level component for aggregating the individual facts. We apply our model to learn extractors for NY Times text using weak supervision from Freebase. Experiments show that the approach runs quickly and yields surprising gains in accuracy, at both the aggregate and sentence level.
3 0.49360019 121 acl-2011-Event Discovery in Social Media Feeds
Author: Edward Benson ; Aria Haghighi ; Regina Barzilay
Abstract: We present a novel method for record extraction from social streams such as Twitter. Unlike typical extraction setups, these environments are characterized by short, one sentence messages with heavily colloquial speech. To further complicate matters, individual messages may not express the full relation to be uncovered, as is often assumed in extraction tasks. We develop a graphical model that addresses these problems by learning a latent set of records and a record-message alignment simultaneously; the output of our model is a set of canonical records, the values of which are consistent with aligned messages. We demonstrate that our approach is able to accurately induce event records from Twitter messages, evaluated against events from a local city guide. Our method achieves significant error reduction over baseline methods.
4 0.46778539 170 acl-2011-In-domain Relation Discovery with Meta-constraints via Posterior Regularization
Author: Harr Chen ; Edward Benson ; Tahira Naseem ; Regina Barzilay
Abstract: We present a novel approach to discovering relations and their instantiations from a collection of documents in a single domain. Our approach learns relation types by exploiting meta-constraints that characterize the general qualities of a good relation in any domain. These constraints state that instances of a single relation should exhibit regularities at multiple levels of linguistic structure, including lexicography, syntax, and document-level context. We capture these regularities via the structure of our probabilistic model as well as a set of declaratively-specified constraints enforced during posterior inference. Across two domains our approach successfully recovers hidden relation structure, comparable to or outperforming previous state-of-the-art approaches. Furthermore, we find that a small set of constraints is applicable across the domains, and that using domain-specific constraints can further improve performance.
5 0.46755117 114 acl-2011-End-to-End Relation Extraction Using Distant Supervision from External Semantic Repositories
Author: Truc Vien T. Nguyen ; Alessandro Moschitti
Abstract: In this paper, we extend distant supervision (DS) based on Wikipedia for Relation Extraction (RE) by considering (i) relations defined in external repositories, e.g. YAGO, and (ii) any subset of Wikipedia documents. We show that training data constituted by sentences containing pairs of named entities in target relations is enough to produce reliable supervision. Our experiments with state-of-the-art relation extraction models, trained on the above data, show a meaningful F1 of 74.29% on a manually annotated test set: this highly improves the state-of-art in RE using DS. Additionally, our end-to-end experiments demonstrated that our extractors can be applied to any general text document.
6 0.46718886 262 acl-2011-Relation Guided Bootstrapping of Semantic Lexicons
7 0.46704128 26 acl-2011-A Speech-based Just-in-Time Retrieval System using Semantic Search
8 0.46488616 125 acl-2011-Exploiting Readymades in Linguistic Creativity: A System Demonstration of the Jigsaw Bard
9 0.46411234 97 acl-2011-Discovering Sociolinguistic Associations with Structured Sparsity
10 0.45846632 86 acl-2011-Coreference for Learning to Extract Relations: Yes Virginia, Coreference Matters
11 0.44897914 36 acl-2011-An Efficient Indexer for Large N-Gram Corpora
12 0.44863802 11 acl-2011-A Fast and Accurate Method for Approximate String Search
13 0.44402894 303 acl-2011-Tier-based Strictly Local Constraints for Phonology
14 0.44387931 35 acl-2011-An ERP-based Brain-Computer Interface for text entry using Rapid Serial Visual Presentation and Language Modeling
15 0.43685845 239 acl-2011-P11-5002 k2opt.pdf
16 0.42881617 42 acl-2011-An Interface for Rapid Natural Language Processing Development in UIMA
17 0.42538604 174 acl-2011-Insights from Network Structure for Text Mining
18 0.42415711 285 acl-2011-Simple supervised document geolocation with geodesic grids
19 0.41780579 40 acl-2011-An Error Analysis of Relation Extraction in Social Media Documents
20 0.41779995 187 acl-2011-Jointly Learning to Extract and Compress
topicId topicWeight
[(1, 0.011), (5, 0.042), (13, 0.024), (16, 0.339), (17, 0.031), (26, 0.029), (37, 0.043), (39, 0.043), (41, 0.055), (55, 0.034), (59, 0.068), (72, 0.02), (91, 0.045), (96, 0.1), (97, 0.011)]
simIndex simValue paperId paperTitle
same-paper 1 0.76979256 291 acl-2011-SystemT: A Declarative Information Extraction System
2 0.65384811 9 acl-2011-A Cross-Lingual ILP Solution to Zero Anaphora Resolution
Author: Ryu Iida ; Massimo Poesio
Abstract: We present an ILP-based model of zero anaphora detection and resolution that builds on the joint determination of anaphoricity and coreference model proposed by Denis and Baldridge (2007), but revises it and extends it into a three-way ILP problem also incorporating subject detection. We show that this new model outperforms several baselines and competing models, as well as a direct translation of the Denis / Baldridge model, for both Italian and Japanese zero anaphora. We incorporate our model in complete anaphoric resolvers for both Italian and Japanese, showing that our approach leads to improved performance also when not used in isolation, provided that separate classifiers are used for zeros and for explicitly realized anaphors.
3 0.5683378 283 acl-2011-Simple English Wikipedia: A New Text Simplification Task
Author: William Coster ; David Kauchak
Abstract: In this paper we examine the task of sentence simplification which aims to reduce the reading complexity of a sentence by incorporating more accessible vocabulary and sentence structure. We introduce a new data set that pairs English Wikipedia with Simple English Wikipedia and is orders of magnitude larger than any previously examined for sentence simplification. The data contains the full range of simplification operations including rewording, reordering, insertion and deletion. We provide an analysis of this corpus as well as preliminary results using a phrase-based translation approach for simplification.
4 0.56403071 320 acl-2011-Unsupervised Discovery of Domain-Specific Knowledge from Text
Author: Dirk Hovy ; Chunliang Zhang ; Eduard Hovy ; Anselmo Penas
Abstract: Learning by Reading (LbR) aims at enabling machines to acquire knowledge from and reason about textual input. This requires knowledge about the domain structure (such as entities, classes, and actions) in order to do inference. We present a method to infer this implicit knowledge from unlabeled text. Unlike previous approaches, we use automatically extracted classes with a probability distribution over entities to allow for context-sensitive labeling. From a corpus of 1.4m sentences, we learn about 250k simple propositions about American football in the form of predicate-argument structures like “quarterbacks throw passes to receivers”. Using several statistical measures, we show that our model is able to generalize and explain the data statistically significantly better than various baseline approaches. Human subjects judged up to 96.6% of the resulting propositions to be sensible. The classes and probabilistic model can be used in textual enrichment to improve the performance of LbR end-to-end systems.
5 0.51740962 254 acl-2011-Putting it Simply: a Context-Aware Approach to Lexical Simplification
Author: Or Biran ; Samuel Brody ; Noemie Elhadad
Abstract: We present a method for lexical simplification. Simplification rules are learned from a comparable corpus, and the rules are applied in a context-aware fashion to input sentences. Our method is unsupervised. Furthermore, it does not require any alignment or correspondence among the complex and simple corpora. We evaluate the simplification according to three criteria: preservation of grammaticality, preservation of meaning, and degree of simplification. Results show that our method outperforms an established simplification baseline for both meaning preservation and simplification, while maintaining a high level of grammaticality.
6 0.41389903 324 acl-2011-Unsupervised Semantic Role Induction via Split-Merge Clustering
7 0.4121049 145 acl-2011-Good Seed Makes a Good Crop: Accelerating Active Learning Using Language Modeling
8 0.41084933 23 acl-2011-A Pronoun Anaphora Resolution System based on Factorial Hidden Markov Models
9 0.40937254 137 acl-2011-Fine-Grained Class Label Markup of Search Queries
10 0.40780157 269 acl-2011-Scaling up Automatic Cross-Lingual Semantic Role Annotation
11 0.40772808 170 acl-2011-In-domain Relation Discovery with Meta-constraints via Posterior Regularization
12 0.40733743 262 acl-2011-Relation Guided Bootstrapping of Semantic Lexicons
13 0.40537393 3 acl-2011-A Bayesian Model for Unsupervised Semantic Parsing
14 0.40461424 164 acl-2011-Improving Arabic Dependency Parsing with Form-based and Functional Morphological Features
15 0.40420693 178 acl-2011-Interactive Topic Modeling
16 0.40310538 42 acl-2011-An Interface for Rapid Natural Language Processing Development in UIMA
17 0.40232533 86 acl-2011-Coreference for Learning to Extract Relations: Yes Virginia, Coreference Matters
18 0.40199009 190 acl-2011-Knowledge-Based Weak Supervision for Information Extraction of Overlapping Relations
19 0.40140042 119 acl-2011-Evaluating the Impact of Coder Errors on Active Learning
20 0.4008714 244 acl-2011-Peeling Back the Layers: Detecting Event Role Fillers in Secondary Contexts