acl acl2010 acl2010-222 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Laura Chiticariu ; Rajasekar Krishnamurthy ; Yunyao Li ; Sriram Raghavan ; Frederick Reiss ; Shivakumar Vaithyanathan
Abstract: As information extraction (IE) becomes more central to enterprise applications, rule-based IE engines have become increasingly important. In this paper, we describe SystemT, a rule-based IE system whose basic design removes the expressivity and performance limitations of current systems based on cascading grammars. SystemT uses a declarative rule language, AQL, and an optimizer that generates high-performance algebraic execution plans for AQL rules. We compare SystemT’s approach against cascading grammars, both theoretically and with a thorough experimental evaluation. Our results show that SystemT can deliver result quality comparable to the state-of-the-art and an order of magnitude higher annotation throughput.
Reference: text
sentIndex sentText sentNum sentScore
1 As information extraction (IE) becomes more central to enterprise applications, rule-based IE engines have become increasingly important. [sent-5, score-0.106]
2 In this paper, we describe SystemT, a rule-based IE system whose basic design removes the expressivity and performance limitations of current systems based on cascading grammars. [sent-6, score-0.398]
3 SystemT uses a declarative rule language, AQL, and an optimizer that generates high-performance algebraic execution plans for AQL rules. [sent-7, score-0.427]
4 We compare SystemT’s approach against cascading grammars, both theoretically and with a thorough experimental evaluation. [sent-8, score-0.114]
5 Our results show that SystemT can deliver result quality comparable to the state-of-the-art and an order of magnitude higher annotation throughput. [sent-9, score-0.023]
6 This increase, combined with the inclusion of text into traditional applications like Business Intelligence, has dramatically increased the use of information extraction (IE) within the enterprise. [sent-11, score-0.031]
7 While the traditional requirement of extraction quality remains critical, enterprise applications also demand efficiency, transparency, customizability and maintainability. [sent-12, score-0.085]
8 , 2004) were predominantly based on the cascading grammar formalism exemplified by the Common Pattern Specification Language (CPSL) specification (Appelt and Onyshkevych, 1998). [sent-17, score-0.23]
9 In CPSL, the input text is viewed as a sequence of annotations, and extraction rules are written as pattern/action rules over the lexical features of these annotations. [sent-18, score-0.203]
10 In a single phase of the grammar, a set of rules are evaluated in a left-to-right fashion over the input annotations. [sent-19, score-0.224]
11 Multiple grammar phases are cascaded together, with the evaluation proceeding in a bottom-up fashion. [sent-20, score-0.088]
12 First, the expressivity of CPSL falls short when used for complex IE tasks over increasingly pervasive informal text (emails, blogs, discussion forums etc. [sent-23, score-0.227]
13 Second, the rigid evaluation order imposed in these systems has significant performance implications. [sent-27, score-0.066]
14 Three decades ago, the database community faced similar expressivity and efficiency challenges in accessing structured information. [sent-28, score-0.288]
15 The community addressed these problems by introducing a relational algebra formalism and an associated declarative query language SQL. [sent-29, score-0.377]
16 , 1981) demonstrated how the expressivity of SQL can be efficiently realized in practice by means of a query optimizer that translates an SQL query into an optimized query execution plan. [sent-31, score-0.482]
17 Borrowing ideas from the database community, we have developed SystemT, a declarative IE system based on an algebraic framework, to address both expressivity and performance issues. [sent-32, score-0.479]
18 In SystemT, extraction rules are expressed in a declarative language called AQL. [sent-33, score-0.205]
19 At compilation time, [page footer: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 128–137, Uppsala, Sweden, 11-16 July 2010. © 2010 Association for Computational Linguistics] [sent-34, score-0.022]
20 Gazetteers containing first names and last names P2R3 ( {Last} {Token. [sent-36, score-0.068]
21 The SystemT optimizer then picks a fast execution plan from many logically equivalent plans. [sent-40, score-0.14]
22 We formally demonstrate the superiority of AQL and SystemT in terms of both expressivity and efficiency (Section 4). [sent-42, score-0.232]
23 Specifically, we show that 1) the expressivity of AQL is a strict superset of CPSL grammars not using external functions and 2) the search space explored by the SystemT optimizer includes operator graphs corresponding to efficient finite state transducer implementations. [sent-43, score-0.437]
24 Finally, we present an extensive experimental evaluation that validates that high-quality annotators can be developed with SystemT, and that their runtime performance is an order of magnitude better when compared to annotators developed with a state-of-the-art grammar-based IE system (Section 5). [sent-44, score-0.043]
25 2 Grammar-based Systems and CPSL A cascading grammar consists of a sequence of phases, each of which consists of one or more rules. [sent-45, score-0.195]
26 Each phase applies its rules from left to right over an input sequence of annotations and generates an output sequence of annotations that the next phase consumes. [sent-46, score-0.531]
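The phase-by-phase evaluation described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not CPSL's actual machinery: `Annotation`, `run_phase`, `run_cascade`, `gazetteer_rule`, `person_rule` and the two tiny name sets are hypothetical names introduced here, rules are simplified to functions that try to match at one position, and unmatched annotations are simply passed through to the next phase.

```python
from typing import Callable, List, NamedTuple, Optional, Tuple


class Annotation(NamedTuple):
    kind: str   # e.g. "Token", "First", "Last", "Person"
    text: str


# A rule inspects the sequence at position i and returns
# (number of annotations consumed, output annotation), or None if it does not match.
Rule = Callable[[List[Annotation], int], Optional[Tuple[int, Annotation]]]


def run_phase(rules: List[Rule], seq: List[Annotation]) -> List[Annotation]:
    """Scan the input left to right; at each position the first matching rule fires."""
    out: List[Annotation] = []
    i = 0
    while i < len(seq):
        for rule in rules:
            match = rule(seq, i)
            if match is not None:
                consumed, annotation = match
                out.append(annotation)
                i += consumed
                break
        else:
            out.append(seq[i])  # simplification: pass unmatched input through
            i += 1
    return out


def run_cascade(phases: List[List[Rule]], seq: List[Annotation]) -> List[Annotation]:
    """Feed the output sequence of each phase into the next one."""
    for rules in phases:
        seq = run_phase(rules, seq)
    return seq


def gazetteer_rule(names: set, kind: str) -> Rule:
    """Build a rule that relabels a Token found in a (toy) gazetteer."""
    def rule(seq: List[Annotation], i: int):
        a = seq[i]
        if a.kind == "Token" and a.text in names:
            return 1, Annotation(kind, a.text)
        return None
    return rule


def person_rule(seq: List[Annotation], i: int):
    """First followed by Last becomes a single Person annotation."""
    if i + 1 < len(seq) and seq[i].kind == "First" and seq[i + 1].kind == "Last":
        return 2, Annotation("Person", seq[i].text + " " + seq[i + 1].text)
    return None


FIRST_NAMES = {"Howard"}   # toy gazetteers, illustrative only
LAST_NAMES = {"Smith"}

tokens = [Annotation("Token", w) for w in "Howard Smith arrived".split()]
phase1 = [gazetteer_rule(FIRST_NAMES, "First"), gazetteer_rule(LAST_NAMES, "Last")]
phase2 = [person_rule]
result = run_cascade([phase1, phase2], tokens)
```

Because each phase sees only the single sequence produced by the previous one, a decision made early in the cascade cannot be revisited later, which is the rigidity the text goes on to discuss.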
27 Most cascading grammar systems today adhere to the CPSL standard. [sent-47, score-0.193]
28 Fig. 1 shows a sample CPSL grammar that identifies person names from text in two phases. [sent-49, score-0.197]
29 The first phase, P1, operates over the results of the tok1A trial version is available at http://www. [sent-50, score-0.033]
30 The second phase, P2, identifies complete names using the results of phase P1. [sent-54, score-0.195]
31 2), one would expect that to match “Mark Scott” and “Howard Smith” as Person. [sent-56, score-0.03]
32 2(a), the grammar actually finds three Person annotations, instead of two. [sent-58, score-0.057]
33 CPSL has several limitations that lead to such discrepancies: L1. [sent-59, score-0.027]
34 In a CPSL grammar, each phase operates on a sequence of annotations from left to right. [sent-61, score-0.254]
35 If the input annotations to a phase may overlap with each other, the CPSL engine must drop some of them to create a nonoverlapping sequence. [sent-62, score-0.227]
36 Consequently, no Caps annotations are output by phase P1. [sent-66, score-0.197]
37 CPSL specifies that, for each input annotation, only one rule can actually match. [sent-69, score-0.098]
38 When multiple rules match at the same start position, the following tie-breaker conditions are applied (in order): (a) the rule matching the most annotations in the input stream; (b) the rule with highest priority; and (c) the rule declared earlier in the grammar. [sent-70, score-0.341]
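The three tie-breaker conditions can be expressed as a single lexicographic comparison. The sketch below is illustrative only: the `Candidate` record and `pick_rule` function are hypothetical names, and it assumes that a larger priority number means higher priority.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    rule_name: str
    matched: int      # number of input annotations the rule matches
    priority: int     # assumption: larger value = higher priority
    decl_order: int   # position in the grammar; smaller = declared earlier


def pick_rule(candidates: List[Candidate]) -> Candidate:
    """Apply the tie-breakers in order: (a) most annotations matched,
    (b) highest priority, (c) earliest declaration."""
    return min(candidates, key=lambda c: (-c.matched, -c.priority, c.decl_order))


candidates = [
    Candidate("A", matched=1, priority=5, decl_order=0),
    Candidate("B", matched=2, priority=1, decl_order=1),
    Candidate("C", matched=2, priority=1, decl_order=2),
]
winner = pick_rule(candidates)
```

Here rule A loses despite its higher priority because B and C match more annotations, and B beats C only because it was declared earlier.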
39 2(a), phase P1 only identifies “Scott” as a First. [sent-73, score-0.161]
40 Matching priority causes the grammar to skip the corresponding match for “Scott” as a Last. [sent-74, score-0.155]
41 Consequently, phase P2 fails to identify “Mark Scott” as one single Person. [sent-75, score-0.135]
42 It is not possible to express rules that compare annotations overlapping with each other. [sent-78, score-0.144]
43 , “Identify words that are both capitalized and present in the FirstGaz gazetteer” or “Identify Person annotations that occur within an EmailAddress”. [Figure 3: regular expression extraction operator] [sent-81, score-0.062]
44 Extensions to CPSL In order to address the above limitations, several extensions to CPSL have been proposed in JAPE, AFst and XTDL (Cunningham et al. [sent-82, score-0.046]
45 The extensions are summarized as below, where each solution Si corresponds to limitation Li. [sent-85, score-0.046]
46 Grammar rules are allowed to operate on graphs of input annotations in JAPE and AFst. [sent-87, score-0.147]
47 JAPE introduces more matching regimes besides CPSL’s matching priority and thus allows more flexibility when multiple rules match at the same starting position. [sent-90, score-0.227]
48 The rule part of a pattern has been expanded to allow more expressivity in JAPE, AFst and XTDL. [sent-92, score-0.252]
49 Fig. 2(b) illustrates how the above extensions help in identifying the correct matches ‘Mark Scott’ and ‘Howard Smith’ in JAPE. [sent-94, score-0.07]
50 Phase P1 uses a matching regime (denoted by Brill) that allows multiple rules to match at the same starting position, and phase P2 uses CPSL’s matching priority, Appelt. [sent-95, score-0.33]
51 3 SystemT SystemT is a declarative IE system based on an algebraic framework. [sent-96, score-0.241]
52 In SystemT, developers write rules in a language called AQL. [sent-97, score-0.059]
53 The system then generates a graph of operators that implement the semantics of the AQL rules. [sent-98, score-0.081]
54 This decoupling allows for greater rule expressivity, because the rule language is not constrained by the need to compile to a finite state transducer. [sent-99, score-0.114]
55 Likewise, the decoupled approach leads to greater flexibility in choosing an efficient execution strategy, because many possible operator graphs may exist for the same AQL annotator. [sent-100, score-0.236]
56 In the rest of the section, we describe the parts of SystemT, starting with the algebraic formalism behind SystemT’s operators. [sent-101, score-0.159]
57 1 Algebraic Foundation of SystemT SystemT executes IE rules using graphs of operators. [sent-103, score-0.117]
58 The formal definition of these operators takes the form of an algebra that is similar to the relational algebra, but with extensions for text processing. [sent-104, score-0.27]
59 The algebra operates over a simple relational data model with three data types: span, tuple, and relation. [sent-105, score-0.201]
60 In this data model, a span is a region of text within a document identified by its “begin” and “end” positions; a tuple is a fixed-size list of spans. [sent-106, score-0.072]
61 A relation is a multiset of tuples, where every tuple in the relation must be of the same size. [sent-107, score-0.047]
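A rough sketch of this data model may help fix the three terms. The class and method names (`Span`, `Relation`, `text`, `add`) are hypothetical, and the sketch assumes that a span's "end" is an exclusive character offset, which the text does not specify.

```python
from collections import Counter
from typing import NamedTuple, Tuple


class Span(NamedTuple):
    begin: int
    end: int   # assumption: exclusive character offset

    def text(self, doc: str) -> str:
        """Return the region of the document covered by this span."""
        return doc[self.begin:self.end]


# A tuple in the data model is a fixed-size list of spans.
SpanTuple = Tuple[Span, ...]


class Relation:
    """A multiset of tuples that all have the same size."""

    def __init__(self, arity: int) -> None:
        self.arity = arity
        self.rows: Counter = Counter()   # tuple -> multiplicity

    def add(self, row: SpanTuple) -> None:
        if len(row) != self.arity:
            raise ValueError(f"expected {self.arity} spans, got {len(row)}")
        self.rows[tuple(row)] += 1

    def __len__(self) -> int:
        return sum(self.rows.values())


doc = "Mark Scott"
first, last = Span(0, 4), Span(5, 10)
rel = Relation(arity=2)
rel.add((first, last))
rel.add((first, last))   # duplicates are kept: a relation is a multiset
```

A `Counter` is used so that duplicate tuples are counted rather than collapsed, matching the multiset wording above.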
62 Each operator in our algebra implements a single basic atomic IE operation, producing and consuming sets of tuples. [sent-108, score-0.275]
63 Fig. 3 illustrates the regular expression extraction operator in the algebra, which performs character-level regular expression matching. [sent-110, score-0.277]
64 Overall, the algebra contains 12 different operators, a full description of which can be found in (Reiss et al. [sent-111, score-0.115]
65 The following four operators are necessary to understand the examples in this paper: • The Extract operator (E) performs character-level operations such as regular expression and dictionary matching over text, creating a tuple for each match. [sent-113, score-0.313]
66 The Select operator (σ) takes as input a set of tuples and a predicate to apply to the tuples. [sent-114, score-0.278]
67 The Join operator (⊲⊳) takes as input two sets of tuples and a predicate to apply to pairs of tuples from the input sets. [sent-116, score-0.444]
68 It outputs all pairs of input tuples that satisfy the predicate. [sent-117, score-0.189]
69 The Consolidate operator (Ω) takes as input a set of tuples and the index of a particular column in those tuples. [sent-118, score-0.342]
70 It removes selected overlapping spans from the indicated column, according to the specified policy. [sent-119, score-0.053]
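The four operators listed above can be approximated in a few lines. This is only a sketch under stated assumptions, not SystemT's implementation: the function names are hypothetical, spans are plain `(begin, end)` pairs with an exclusive end, relations are Python lists, and `consolidate` hard-codes one illustrative policy (drop any span strictly contained in another), whereas the text says the policy is specified by the caller.

```python
import re
from typing import Callable, List, Tuple

Span = Tuple[int, int]    # (begin, end), end exclusive (assumption)
Row = Tuple[Span, ...]    # a tuple of spans


def extract_regex(pattern: str, doc: str) -> List[Row]:
    """Extract (E): one single-span tuple per regular expression match."""
    return [((m.start(), m.end()),) for m in re.finditer(pattern, doc)]


def select(rows: List[Row], pred: Callable[[Row], bool]) -> List[Row]:
    """Select (sigma): keep the tuples that satisfy the predicate."""
    return [r for r in rows if pred(r)]


def join(left: List[Row], right: List[Row],
         pred: Callable[[Row, Row], bool]) -> List[Row]:
    """Join: output every pair of input tuples that satisfies the predicate."""
    return [l + r for l in left for r in right if pred(l, r)]


def consolidate(rows: List[Row], col: int) -> List[Row]:
    """Consolidate (Omega) with an illustrative policy: remove tuples whose
    span in column `col` is strictly contained in another tuple's span."""
    def contained(a: Span, b: Span) -> bool:
        return a != b and b[0] <= a[0] and a[1] <= b[1]
    spans = [r[col] for r in rows]
    return [r for r in rows if not any(contained(r[col], s) for s in spans)]


doc = "Mark Scott called Howard Smith."
caps = extract_regex(r"[A-Z]\w+", doc)                      # capitalized words
long_caps = select(caps, lambda r: r[0][1] - r[0][0] > 4)   # longer than 4 chars
# Pairs of capitalized words separated by exactly one character.
pairs = join(caps, caps, lambda l, r: r[0][0] - l[0][1] == 1)
merged = consolidate([((0, 10),), ((0, 4),), ((5, 10),)], col=0)
```

Because every operator consumes and produces the same kind of relation, the operators compose freely into graphs, and that composability is what gives an optimizer room to reorder them.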
71 2 AQL Extraction rules in SystemT are written in AQL, a declarative relational language similar in syntax to the database language SQL. [sent-121, score-0.259]
72 We chose SQL as a basis for our language due to its expressivity and its familiarity. [sent-122, score-0.206]
73 The expressivity of SQL, which consists of first-order logic predicates over sets of tuples, is well-documented and well-understood (Codd, 1990). [Figure 4: Person annotator as AQL query] [sent-123, score-0.243]
74 As SQL is the primary interface to most relational database systems, the language’s syntax and semantics are common knowledge among enterprise application programmers. [sent-124, score-0.164]
75 Similar to SQL terminology, we call a collection of AQL rules an AQL query. [sent-125, score-0.059]
76 As can be seen, the basic building block of AQL is a view: A logical description of a set of tuples in terms of either the document text (denoted by a special view called Document) or the contents of other views. [sent-128, score-0.222]
77 The output view statement indicates that the tuples in a view are part of the final results of the annotator. [sent-130, score-0.283]
78 Fig. 4 also illustrates three of the basic constructs that can be used to define a view. [sent-132, score-0.045]
79 • The extract statement specifies basic character-level extraction primitives to be applied directly to a tuple. [sent-133, score-0.163]
80 • The select statement is similar to the SQL select statement but it contains an additional consolidate on clause, along with an extensive collection of text-specific predicates. [sent-134, score-0.178]
81 • The union all statement merges the outputs of one or more select or extract statements. [sent-135, score-0.09]
82 To keep rules compact, AQL also provides a shorthand sequence pattern notation similar to the syntax of CPSL. [sent-136, score-0.083]
wordName wordTfidf (topN-words)
[('systemt', 0.555), ('cpsl', 0.404), ('aql', 0.379), ('expressivity', 0.206), ('sql', 0.155), ('tuples', 0.136), ('ie', 0.136), ('phase', 0.135), ('algebraic', 0.126), ('algebra', 0.115), ('declarative', 0.115), ('cascading', 0.114), ('operator', 0.112), ('caps', 0.101), ('scott', 0.092), ('jape', 0.088), ('optimizer', 0.081), ('person', 0.08), ('howard', 0.076), ('priority', 0.068), ('statement', 0.067), ('rigid', 0.066), ('annotations', 0.062), ('execution', 0.059), ('rules', 0.059), ('grammar', 0.057), ('operators', 0.056), ('enterprise', 0.054), ('relational', 0.053), ('lookup', 0.051), ('afst', 0.05), ('capslas', 0.05), ('drozdzynski', 0.05), ('reiss', 0.05), ('cunningham', 0.049), ('tuple', 0.047), ('rule', 0.046), ('extensions', 0.046), ('gazetteer', 0.044), ('tofh', 0.044), ('consolidate', 0.044), ('matching', 0.043), ('boguraev', 0.041), ('view', 0.04), ('graphs', 0.038), ('query', 0.037), ('smith', 0.034), ('names', 0.034), ('formalism', 0.033), ('operates', 0.033), ('database', 0.032), ('extraction', 0.031), ('phases', 0.031), ('match', 0.03), ('input', 0.03), ('removes', 0.03), ('expression', 0.028), ('regular', 0.027), ('limitations', 0.027), ('flexibility', 0.027), ('implements', 0.027), ('identifies', 0.026), ('specification', 0.026), ('efficiency', 0.026), ('semantics', 0.025), ('document', 0.025), ('translates', 0.025), ('community', 0.024), ('illustrates', 0.024), ('sequence', 0.024), ('magnitude', 0.023), ('overlapping', 0.023), ('token', 0.023), ('mark', 0.023), ('outputs', 0.023), ('declared', 0.022), ('primitives', 0.022), ('tern', 0.022), ('compilation', 0.022), ('lossy', 0.022), ('sas', 0.022), ('decoupling', 0.022), ('tboin', 0.022), ('emails', 0.022), ('adhere', 0.022), ('firs', 0.022), ('transparency', 0.022), ('figu', 0.022), ('discrepancies', 0.022), ('tomorrow', 0.022), ('oor', 0.022), ('specifies', 0.022), ('increasingly', 0.021), ('ibm', 0.021), ('basic', 0.021), ('aomf', 0.02), ('executes', 0.02), ('validates', 0.02), ('regime', 
0.02), ('cfo', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 222 acl-2010-SystemT: An Algebraic Approach to Declarative Information Extraction
2 0.08664047 181 acl-2010-On Learning Subtypes of the Part-Whole Relation: Do Not Mix Your Seeds
Author: Ashwin Ittoo ; Gosse Bouma
Abstract: An important relation in information extraction is the part-whole relation. Ontological studies mention several types of this relation. In this paper, we show that the traditional practice of initializing minimally-supervised algorithms with a single set that mixes seeds of different types fails to capture the wide variety of part-whole patterns and tuples. The results obtained with mixed seeds ultimately converge to one of the part-whole relation types. We also demonstrate that all the different types of part-whole relations can still be discovered, regardless of the type characterized by the initializing seeds. We performed our experiments with a state-of-the-art information extraction algorithm.
3 0.046531931 92 acl-2010-Don't 'Have a Clue'? Unsupervised Co-Learning of Downward-Entailing Operators.
Author: Cristian Danescu-Niculescu-Mizil ; Lillian Lee
Abstract: Researchers in textual entailment have begun to consider inferences involving downward-entailing operators, an interesting and important class of lexical items that change the way inferences are made. Recent work proposed a method for learning English downward-entailing operators that requires access to a high-quality collection of negative polarity items (NPIs). However, English is one of the very few languages for which such a list exists. We propose the first approach that can be applied to the many languages for which there is no pre-existing high-precision database of NPIs. As a case study, we apply our method to Romanian and show that our method yields good results. Also, we perform a cross-linguistic analysis that suggests interesting connections to some findings in linguistic typology.
4 0.045048691 169 acl-2010-Learning to Translate with Source and Target Syntax
Author: David Chiang
Abstract: Statistical translation models that try to capture the recursive structure of language have been widely adopted over the last few years. These models make use of varying amounts of information from linguistic theory: some use none at all, some use information about the grammar of the target language, some use information about the grammar of the source language. But progress has been slower on translation models that are able to learn the relationship between the grammars of both the source and target language. We discuss the reasons why this has been a challenge, review existing attempts to meet this challenge, and show how some old and new ideas can be combined into a simple approach that uses both source and target syntax for significant improvements in translation accuracy.
5 0.043834358 66 acl-2010-Compositional Matrix-Space Models of Language
Author: Sebastian Rudolph ; Eugenie Giesbrecht
Abstract: We propose CMSMs, a novel type of generic compositional models for syntactic and semantic aspects of natural language, based on matrix multiplication. We argue for the structural and cognitive plausibility of this model and show that it is able to cover and combine various common compositional NLP approaches ranging from statistical word space models to symbolic grammar formalisms.
6 0.038916133 128 acl-2010-Grammar Prototyping and Testing with the LinGO Grammar Matrix Customization System
7 0.036413301 185 acl-2010-Open Information Extraction Using Wikipedia
8 0.036300264 198 acl-2010-Predicate Argument Structure Analysis Using Transformation Based Learning
9 0.032472886 10 acl-2010-A Latent Dirichlet Allocation Method for Selectional Preferences
10 0.031154037 239 acl-2010-Towards Relational POMDPs for Adaptive Dialogue Management
11 0.030807771 245 acl-2010-Understanding the Semantic Structure of Noun Phrase Queries
12 0.030123571 9 acl-2010-A Joint Rule Selection Model for Hierarchical Phrase-Based Translation
13 0.029552186 84 acl-2010-Detecting Errors in Automatically-Parsed Dependency Relations
14 0.029297665 178 acl-2010-Non-Cooperation in Dialogue
15 0.029169956 159 acl-2010-Learning 5000 Relational Extractors
16 0.028911794 67 acl-2010-Computing Weakest Readings
17 0.028374691 118 acl-2010-Fine-Grained Tree-to-String Translation Rule Extraction
18 0.027865453 82 acl-2010-Demonstration of a Prototype for a Conversational Companion for Reminiscing about Images
19 0.027848426 94 acl-2010-Edit Tree Distance Alignments for Semantic Role Labelling
20 0.026965139 226 acl-2010-The Human Language Project: Building a Universal Corpus of the World's Languages
topicId topicWeight
[(0, -0.085), (1, 0.012), (2, -0.002), (3, -0.021), (4, -0.019), (5, -0.03), (6, 0.038), (7, 0.022), (8, -0.016), (9, -0.05), (10, -0.0), (11, 0.022), (12, -0.026), (13, -0.048), (14, 0.02), (15, 0.017), (16, 0.033), (17, 0.058), (18, -0.011), (19, 0.039), (20, -0.025), (21, -0.025), (22, -0.004), (23, -0.004), (24, -0.012), (25, 0.003), (26, -0.035), (27, 0.03), (28, 0.048), (29, -0.018), (30, -0.008), (31, -0.02), (32, 0.021), (33, -0.012), (34, 0.014), (35, 0.051), (36, 0.026), (37, -0.052), (38, 0.08), (39, 0.065), (40, -0.054), (41, 0.037), (42, 0.129), (43, 0.01), (44, 0.193), (45, 0.048), (46, 0.097), (47, -0.066), (48, 0.108), (49, -0.063)]
simIndex simValue paperId paperTitle
same-paper 1 0.931207 222 acl-2010-SystemT: An Algebraic Approach to Declarative Information Extraction
2 0.58172458 181 acl-2010-On Learning Subtypes of the Part-Whole Relation: Do Not Mix Your Seeds
Author: Ashwin Ittoo ; Gosse Bouma
Abstract: An important relation in information extraction is the part-whole relation. Ontological studies mention several types of this relation. In this paper, we show that the traditional practice of initializing minimally-supervised algorithms with a single set that mixes seeds of different types fails to capture the wide variety of part-whole patterns and tuples. The results obtained with mixed seeds ultimately converge to one of the part-whole relation types. We also demonstrate that all the different types of part-whole relations can still be discovered, regardless of the type characterized by the initializing seeds. We performed our experiments with a state-of-the-art information extraction algorithm.
3 0.48746213 43 acl-2010-Automatically Generating Term Frequency Induced Taxonomies
Author: Karin Murthy ; Tanveer A Faruquie ; L Venkata Subramaniam ; Hima Prasad K ; Mukesh Mohania
Abstract: We propose a novel method to automatically acquire a term-frequency-based taxonomy from a corpus using an unsupervised method. A term-frequency-based taxonomy is useful for application domains where the frequency with which terms occur on their own and in combination with other terms imposes a natural term hierarchy. We highlight an application for our approach and demonstrate its effectiveness and robustness in extracting knowledge from real-world data.
4 0.47600916 67 acl-2010-Computing Weakest Readings
Author: Alexander Koller ; Stefan Thater
Abstract: We present an efficient algorithm for computing the weakest readings of semantically ambiguous sentences. A corpus-based evaluation with a large-scale grammar shows that our algorithm reduces over 80% of sentences to one or two readings, in negligible runtime, and thus makes it possible to work with semantic representations derived by deep large-scale grammars.
5 0.46547329 138 acl-2010-Hunting for the Black Swan: Risk Mining from Text
Author: Jochen Leidner ; Frank Schilder
Abstract: In the business world, analyzing and dealing with risk permeates all decisions and actions. However, to date, risk identification, the first step in the risk management cycle, has always been a manual activity with little to no intelligent software tool support. In addition, although companies are required to list risks to their business in their annual SEC filings in the USA, these descriptions are often very high-level and vague. In this paper, we introduce Risk Mining, which is the task of identifying a set of risks pertaining to a business area or entity. We argue that by combining Web mining and Information Extraction (IE) techniques, risks can be detected automatically before they materialize, thus providing valuable business intelligence. We describe a system that induces a risk taxonomy with concrete risks (e.g., interest rate changes) at its leaves and more abstract risks (e.g., financial risks) closer to its root node. The taxonomy is induced via a bootstrapping algorithm starting with a few seeds. The risk taxonomy is used by the system as input to a risk monitor that matches risk mentions in financial documents to the abstract risk types, thus bridging a lexical gap. Our system is able to automatically generate company-specific “risk maps”, which we demonstrate for a corpus of earnings report conference calls.
6 0.4593097 64 acl-2010-Complexity Assumptions in Ontology Verbalisation
7 0.41462669 92 acl-2010-Don't 'Have a Clue'? Unsupervised Co-Learning of Downward-Entailing Operators.
8 0.41449019 182 acl-2010-On the Computational Complexity of Dominance Links in Grammatical Formalisms
9 0.40241131 235 acl-2010-Tools for Multilingual Grammar-Based Translation on the Web
10 0.39431974 186 acl-2010-Optimal Rank Reduction for Linear Context-Free Rewriting Systems with Fan-Out Two
11 0.35160363 66 acl-2010-Compositional Matrix-Space Models of Language
12 0.34069464 40 acl-2010-Automatic Sanskrit Segmentizer Using Finite State Transducers
13 0.34047213 259 acl-2010-WebLicht: Web-Based LRT Services for German
14 0.33778304 160 acl-2010-Learning Arguments and Supertypes of Semantic Relations Using Recursive Patterns
15 0.33421332 109 acl-2010-Experiments in Graph-Based Semi-Supervised Learning Methods for Class-Instance Acquisition
16 0.32357797 111 acl-2010-Extracting Sequences from the Web
17 0.31787574 234 acl-2010-The Use of Formal Language Models in the Typology of the Morphology of Amerindian Languages
18 0.31777579 61 acl-2010-Combining Data and Mathematical Models of Language Change
19 0.31524751 185 acl-2010-Open Information Extraction Using Wikipedia
20 0.31064391 19 acl-2010-A Taxonomy, Dataset, and Classifier for Automatic Noun Compound Interpretation
topicId topicWeight
[(14, 0.021), (25, 0.049), (33, 0.01), (39, 0.023), (42, 0.048), (44, 0.016), (59, 0.076), (73, 0.063), (74, 0.357), (78, 0.051), (83, 0.061), (84, 0.036), (98, 0.083)]
simIndex simValue paperId paperTitle
same-paper 1 0.7631073 222 acl-2010-SystemT: An Algebraic Approach to Declarative Information Extraction
2 0.59007108 202 acl-2010-Reading between the Lines: Learning to Map High-Level Instructions to Commands
Author: S.R.K. Branavan ; Luke Zettlemoyer ; Regina Barzilay
Abstract: In this paper, we address the task of mapping high-level instructions to sequences of commands in an external environment. Processing these instructions is challenging: they posit goals to be achieved without specifying the steps required to complete them. We describe a method that fills in missing information using an automatically derived environment model that encodes states, transitions, and commands that cause these transitions to happen. We present an efficient approximate approach for learning this environment model as part of a policy-gradient reinforcement learning algorithm for text interpretation. This design enables learning for mapping high-level instructions, which previous statistical methods cannot handle.
3 0.40684482 158 acl-2010-Latent Variable Models of Selectional Preference
Author: Diarmuid O Seaghdha
Abstract: This paper describes the application of so-called topic models to selectional preference induction. Three models related to Latent Dirichlet Allocation, a proven method for modelling document-word cooccurrences, are presented and evaluated on datasets of human plausibility judgements. Compared to previously proposed techniques, these models perform very competitively, especially for infrequent predicate-argument combinations where they exceed the quality of Web-scale predictions while using relatively little data.
4 0.40158641 214 acl-2010-Sparsity in Dependency Grammar Induction
Author: Jennifer Gillenwater ; Kuzman Ganchev ; Joao Graca ; Fernando Pereira ; Ben Taskar
Abstract: A strong inductive bias is essential in unsupervised grammar induction. We explore a particular sparsity bias in dependency grammars that encourages a small number of unique dependency types. Specifically, we investigate sparsity-inducing penalties on the posterior distributions of parent-child POS tag pairs in the posterior regularization (PR) framework of Graça et al. (2007). In experiments with 12 languages, we achieve substantial gains over the standard expectation maximization (EM) baseline, with average improvement in attachment accuracy of 6.3%. Further, our method outperforms models based on a standard Bayesian sparsity-inducing prior by an average of 4.9%. On English in particular, we show that our approach improves on several other state-of-the-art techniques.
5 0.39948177 120 acl-2010-Fully Unsupervised Core-Adjunct Argument Classification
Author: Omri Abend ; Ari Rappoport
Abstract: The core-adjunct argument distinction is a basic one in the theory of argument structure. The task of distinguishing between the two has strong relations to various basic NLP tasks such as syntactic parsing, semantic role labeling and subcategorization acquisition. This paper presents a novel unsupervised algorithm for the task that uses no supervised models, utilizing instead state-of-the-art syntactic induction algorithms. This is the first work to tackle this task in a fully unsupervised scenario.
6 0.39875168 251 acl-2010-Using Anaphora Resolution to Improve Opinion Target Identification in Movie Reviews
7 0.39816809 70 acl-2010-Contextualizing Semantic Representations Using Syntactically Enriched Vector Models
8 0.39629614 172 acl-2010-Minimized Models and Grammar-Informed Initialization for Supertagging with Highly Ambiguous Lexicons
9 0.39559847 109 acl-2010-Experiments in Graph-Based Semi-Supervised Learning Methods for Class-Instance Acquisition
10 0.39511633 248 acl-2010-Unsupervised Ontology Induction from Text
11 0.39464295 160 acl-2010-Learning Arguments and Supertypes of Semantic Relations Using Recursive Patterns
12 0.39445928 130 acl-2010-Hard Constraints for Grammatical Function Labelling
13 0.3940995 65 acl-2010-Complexity Metrics in an Incremental Right-Corner Parser
14 0.39384824 55 acl-2010-Bootstrapping Semantic Analyzers from Non-Contradictory Texts
15 0.39356863 118 acl-2010-Fine-Grained Tree-to-String Translation Rule Extraction
16 0.39307061 71 acl-2010-Convolution Kernel over Packed Parse Forest
17 0.39284942 238 acl-2010-Towards Open-Domain Semantic Role Labeling
18 0.39267769 162 acl-2010-Learning Common Grammar from Multilingual Corpus
19 0.39243513 184 acl-2010-Open-Domain Semantic Role Labeling by Modeling Word Spans
20 0.39156255 121 acl-2010-Generating Entailment Rules from FrameNet