emnlp emnlp2012 emnlp2012-125 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ajay Nagesh ; Ganesh Ramakrishnan ; Laura Chiticariu ; Rajasekar Krishnamurthy ; Ankush Dharkar ; Pushpak Bhattacharyya
Abstract: Generic rule-based systems for Information Extraction (IE) have been shown to work reasonably well out-of-the-box, and achieve state-of-the-art accuracy with further domain customization. However, it is generally recognized that manually building and customizing rules is a complex and labor-intensive process. In this paper, we discuss an approach that facilitates the process of building customizable rules for Named-Entity Recognition (NER) tasks via rule induction, in the Annotation Query Language (AQL). Given a set of basic features and an annotated document collection, our goal is to generate an initial set of rules with reasonable accuracy, that are interpretable and thus can be easily refined by a human developer. We present an efficient rule induction process, modeled on a four-stage manual rule development process and present initial promising results with our system. We also propose a simple notion of extractor complexity as a first step to quantify the interpretability of an extractor, and study the effect of induction bias and customization of basic features on the accuracy and complexity of induced rules. We demonstrate through experiments that the induced rules have good accuracy and low complexity according to our complexity measure.
Reference: text
sentIndex sentText sentNum sentScore
1 In this paper, we discuss an approach that facilitates the process of building customizable rules for Named-Entity Recognition (NER) tasks via rule induction, in the Annotation Query Language (AQL). [sent-14, score-0.424]
2 Given a set of basic features and an annotated document collection, our goal is to generate an initial set of rules with reasonable accuracy, that are interpretable and thus can be easily refined by a human developer. [sent-15, score-0.284]
3 We present an efficient rule induction process, modeled on a four-stage manual rule development process and present initial promising results with our system. [sent-16, score-0.665]
4 We also propose a simple notion of extractor complexity as a first step to quantify the interpretability of an extractor, and study the effect of induction bias and customization of basic features on the accuracy and complexity of induced rules. [sent-17, score-0.895]
5 We demonstrate through experiments that the induced rules have good accuracy and low complexity according to our complexity measure. [sent-18, score-0.491]
6 Furthermore, rules are typically easier to understand by an IE developer and can be customized for a new domain without requiring additional labeled data. [sent-45, score-0.325]
7 A common approach to address this problem is to build a generic NER extractor and then customize it for specific domains. [sent-59, score-0.279]
8 In this paper, we present initial work towards facilitating the process of building a generic NER extractor using induction techniques. [sent-63, score-0.455]
9 Specifically, given as input an annotated document corpus, a set of BF rules, and a default CO rule for each entity type, our goal is to generate a set of CD and CR rules such that the resulting extractor constitutes a good starting point for further refinement by a developer. [sent-64, score-0.747]
10 Since the generic NER extractor has to be manually customized, a major challenge is to ensure that the generated rules have good accuracy, and, at the same time, that they are not too complex, and consequently interpretable. [sent-65, score-0.482]
11 An efficient system for NER rule induction, using a highly expressive rule language (AQL) as the target language. [sent-67, score-0.442]
12 The first phase of rule induction uses a combination of clustering and relative least general generalization (RLGG) techniques to learn CD rules. [sent-68, score-0.496]
13 The second phase identifies CR rules using a propositional rule learner like JRIP to learn accurate compositions of CD rules. [sent-69, score-0.605]
14 Usage of induction biases to enhance the interpretability of rules. [sent-71, score-0.397]
15 These biases capture the expertise gleaned from manual rule development and constrain the search space in our induction system. [sent-72, score-0.57]
16 Definition of an initial notion of extractor complexity to quantify the interpretability of an extractor and to guide the process of adding induction biases to favor learning less complex extractors. [sent-74, score-0.936]
17 Roadmap: We first describe preliminaries on SystemT and AQL (Section 3) and define the target language for our induction algorithm and the notion of rule complexity (Section 4). [sent-80, score-0.5]
18 We then present our approach for inducing CD and CR rules, and discuss induction biases that would favor interpretability (Section 5), and discuss the results of an empirical evaluation (Section 6). [sent-81, score-0.43]
19 (2009) and Soderland (1999) elaborate on top-down techniques for induction of IE rules, whereas (Califf and Mooney, 1997; Califf and Mooney, 1999) discuss a bottom-up IE rule induction system that uses the relative least general generalization (RLGG) of examples1 . [sent-85, score-0.573]
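As a toy illustration of the least-general-generalization idea mentioned above, the following Python sketch computes Plotkin's LGG of two ground atoms; the predicate and feature names are hypothetical, and this is only a simplified analogue of the RLGG computation performed over AQL views in such systems.

```python
def lgg_atoms(a, b, subst=None):
    """Plotkin's least general generalization of two atoms with the same
    predicate: equal arguments are kept, each differing pair of arguments
    is replaced by a shared variable."""
    if subst is None:
        subst = {}
    (pred_a, args_a), (pred_b, args_b) = a, b
    assert pred_a == pred_b and len(args_a) == len(args_b)
    gen = []
    for x, y in zip(args_a, args_b):
        if x == y:
            gen.append(x)
        else:
            gen.append(subst.setdefault((x, y), "X%d" % len(subst)))
    return (pred_a, tuple(gen))

# Two positive Person examples described by the same (hypothetical) basic-feature predicate.
e1 = ("followed_by", ("FirstNameDict", "CapsWord"))
e2 = ("followed_by", ("FirstNameDict", "LastNameDict"))
print(lgg_atoms(e1, e2))  # ('followed_by', ('FirstNameDict', 'X0'))
```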
20 As discussed in Section 3, contextual clues and higher level rule interactions such as filtering and join are very difficult, if not impossible to express in such representations without resorting to custom code. [sent-87, score-0.32]
21 Our technique for learning higher level interactions is similar to the induction of ripple down rules (Gaines and Compton, 1995), which, to the best of our knowledge, has not been previously applied to IE. [sent-89, score-0.379]
22 We present complementary techniques for inducing an initial extractor that can be automatically refined in this framework. [sent-92, score-0.282]
23 As an example, rule R1 uses the extract statement to identify matches (Caps spans) of a regular expression for capitalized words. [sent-104, score-0.368]
24 The select statement is similar to the SQL select statement but it contains an additional consolidate on clause (explained further), along with an extensive collection of text-specific predicates. [sent-105, score-0.284]
25 For each triplet of First, Last and Caps spans satisfying the two predicates, the CombineSpans built-in scalar function in the select clause constructs larger PersonFirstLast spans that begin at the begin position of the First span, and end at the end position of the Last (also Caps) span. [sent-108, score-0.348]
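The sketch below gives a minimal Python analogue of what such a candidate-definition rule computes; the Span type, the character-gap test standing in for the follows predicate, and the example offsets are assumptions for illustration, not the AQL implementation.

```python
from typing import List, NamedTuple

class Span(NamedTuple):
    begin: int
    end: int

def combine_spans(first: Span, last: Span) -> Span:
    # Analogue of AQL's CombineSpans: the new span starts where the First
    # span starts and ends where the Last span ends.
    return Span(first.begin, last.end)

def person_first_last(firsts: List[Span], lasts: List[Span], max_gap: int = 1) -> List[Span]:
    # A CD-style join: pair every First span with a Last span that follows it
    # within a small character gap, then build the combined candidate span.
    return [combine_spans(f, l)
            for f in firsts for l in lasts
            if 0 <= l.begin - f.end <= max_gap]

# "John Smith": First over [0, 4), Last over [5, 10) -> Person candidate [0, 10).
print(person_first_last([Span(0, 4)], [Span(5, 10)]))
```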
26 For example, rule R6 unions person candidates identified by rules R4 and R5. [sent-110, score-0.493]
27 For example, rule R8 defines a view PersonAll by filtering out PersonInvalid tuples from the set of PersonCandidate tuples. [sent-112, score-0.299]
28 Notice that rule R7 used to define the view PersonInvalid illustrates another join predicate of AQL called Overlaps, which returns true if its two argument spans overlap in the input text. [sent-113, score-0.44]
29 Therefore, at a high level, rule R8 removes person candidates that overlap with an Organization span. [sent-114, score-0.29]
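A rough Python rendering of this overlap-and-filter pattern is shown below; the span tuples and helper names are illustrative only and are not the actual R7/R8 rules.

```python
def overlaps(a, b):
    # AQL-style Overlaps: true if the two (begin, end) spans share any position.
    return a[0] < b[1] and b[0] < a[1]

def remove_overlapping_persons(person_cands, org_spans):
    # CR-style filtering in the spirit of rules R7/R8: drop every person
    # candidate that overlaps some Organization span.
    return [p for p in person_cands
            if not any(overlaps(p, o) for o in org_spans)]

# The first candidate overlaps an Organization span and is filtered out.
print(remove_overlapping_persons([(0, 14), (40, 50)], [(0, 20)]))  # [(40, 50)]
```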
30 The consolidate clause of a select statement removes selected overlapping spans from the indicated column of the input tuples, according to the specified policy (for instance, ‘ContainedWithin’). [sent-116, score-0.306]
31 For example, rule R9 retains PersonAll spans that are not contained in other PersonAll spans. [sent-117, score-0.33]
32 The decoupling between AQL and the operator algebra allows for greater rule expressivity because the rule language is not constrained by the need to compile to a finite state transducer, as in grammar systems based on the CPSL standard. [sent-121, score-0.473]
33 4 Induction Target Language: Our goal is to automatically generate NER extractors with good quality, and at the same time, manageable complexity, so that the extractors can be further refined and customized by the developer. [sent-127, score-0.543]
34 To this end, we focus on inducing extractors using the subset of AQL constructs described in Section 3. [sent-128, score-0.287]
35 Basic features (BF): BF views are specified using the extract statement, such as rules R1 to R3 in Figure 1. [sent-134, score-0.339]
36 Candidate definition (CD): CD views are expressed using the select statement to combine BF views with join predicates (e. [sent-136, score-0.492]
37 Candidate refinement (CR): CR views are used to discard spans output by the CD views that may be incorrect. [sent-144, score-0.443]
38 Consolidation (CO): Finally, a select statement with a fixed consolidate clause is used for each entity type to remove overlapping spans from CR views. [sent-150, score-0.349]
39 Since our goal is to generate extractors with manageable complexity, we must introduce a quantitative measure of extractor complexity, in order to (1) judge the complexity of the extractors generated by our system, and (2) reduce the search space considered by the induction system. [sent-153, score-0.887]
40 To this end, we define a simple complexity score that is a function of the number of rules, and the number of input views to each rule of the extractor. [sent-154, score-0.46]
41 In particular, we define the length of rule R, denoted as L(R), as the number of input views in the from clause(s) of the view. [sent-155, score-0.357]
42 We define the complexity of extractor E, denoted as C(E), as the sum of lengths of all rules of E. [sent-163, score-0.524]
43 For example, the complexity of the Person extractor from Figure 1 is 15, plus the length of all rules involved in defining Organization, which are omitted from the figure. [sent-164, score-0.524]
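A small Python sketch of this complexity score follows; the view names and their input views are hypothetical, so the number it prints is illustrative rather than the value of 15 quoted above.

```python
def rule_length(input_views):
    # L(R): the number of input views in the from clause(s) of the rule.
    return len(input_views)

def extractor_complexity(extractor):
    # C(E): the sum of L(R) over all rules R of extractor E.
    return sum(rule_length(views) for views in extractor.values())

# Hypothetical Person extractor: view name -> views named in its from clause
# (inputs such as Document or dictionary views need not be listed as keys).
person_extractor = {
    "Caps": ["Document"], "First": ["Document"], "Last": ["Document"],
    "PersonFirstLast": ["First", "Last", "Caps"],
    "PersonCandidate": ["PersonFirstLast", "PersonDict"],
    "PersonInvalid": ["PersonCandidate", "Organization"],
    "PersonAll": ["PersonCandidate", "PersonInvalid"],
    "Person": ["PersonAll"],
}
print(extractor_complexity(person_extractor))  # 13 for this illustrative rule set
```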
44 However, we shall show that the complexity score significantly reduces the search space of our induction techniques. [sent-175, score-0.832]
45 5 Induction of Rules Since the goal is to generate rules that can be customized by humans, the overall structure of the induced rules must be similar in spirit to what a developer following best practices would write. [sent-177, score-0.61]
46 In Table 2, we summarize the phases of our induction algorithm, along with the subset of AQL constructs that comprise the language of the rules learnt in that phase, the possible methods prescribed for inducing the rules and their correspondence with the stages in the manual rule development. [sent-180, score-0.97]
47 Our induction system generates rules for two of the four categories, namely CD and CR rules as highlighted in Figure 2. [sent-181, score-0.582]
48 Broadly speaking, this is an attribute-value table formed by all the views induced in the first phase along with the textual spans generated by them. [sent-190, score-0.391]
49 The attribute-value table is used as input to a propositional rule learner such as JRIP to learn accurate compositions of a useful (as determined by the learning algorithm) subset of the CD rules. [sent-191, score-0.338]
50 At various phases, several induction biases are introduced to enhance the interpretability of rules. [sent-194, score-0.397]
51 These biases capture the expertise gleaned from manual rule development and constrain the search space in our induction system. [sent-195, score-0.57]
52 SystemT provides a very fast rule execution engine and is crucial in our induction system as we test multiple hypotheses in the search for the more promising ones. [sent-197, score-0.424]
53 The basic views are compiled and executed in SystemT over the training document collection and the resulting spans are represented by equivalent predicates in first order logic. [sent-204, score-0.369]
54 The CD views from phase 1 along with the textual spans they generate, yield the span-view table. [sent-240, score-0.309]
55 This attribute-value table is used as input to a propositional rule learner like JRIP to learn compositions of CD views. (Figure 4: Span-View Table.) [sent-245, score-0.338]
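The following sketch shows, under assumed span offsets and view names, the shape of such a span-view (attribute-value) table; the real system feeds an equivalent table to the JRIP learner.

```python
def span_view_table(candidate_spans, cd_views, gold_labels):
    """One row per candidate span; one boolean attribute per CD view saying
    whether that view produced the span; plus the gold label column."""
    table = []
    for span in candidate_spans:
        row = {name: span in produced for name, produced in cd_views.items()}
        row["label"] = gold_labels[span]
        table.append(row)
    return table

# Hypothetical CD views and gold labels over two candidate spans.
spans = [(0, 10), (25, 38)]
cd_views = {"PerCD1": {(0, 10)}, "PerCD2": {(0, 10), (25, 38)}, "OrgCD1": {(25, 38)}}
gold = {(0, 10): "Person", (25, 38): "negative"}
for row in span_view_table(spans, cd_views, gold):
    print(row)
```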
56 Based on our study of different propositional rule learners, we decided to use RIPPER (Fürnkranz and Widmer, 1994) implemented as the JRIP classifier in weka (Witten et al. [sent-247, score-0.278]
57 Some considerations that favor JRIP are (i) absence of rule ordering, (ii) ease of conversion to AQL and (iii) amenability to add induction biases in the implementation. [sent-249, score-0.496]
58 A number of syntactic biases were introduced in JRIP to aid in the interpretability of the induced rules. [sent-250, score-0.303]
59 We observed in our manually developed rules that CR rules for a type involve interaction between CDs for the same type and negations (not-overlaps, not matches) of CDs of the other types. [sent-251, score-0.406]
60 This rule filters out wrong person annotations like “Prince William” in Prince William Sound. [sent-257, score-0.29]
61 Such an AQL rule will filter all those occurrences of Prince William from the list of person candidates (footnote: two consecutive spans where the 1st is FirstName and CapsPerson and the 2nd is LastName and CapsPerson). [sent-260, score-0.359]
62 A simple consolidation policy that we have incorporated in the system is as follows: union all the rules of a particular type, then perform a contained within consolidation, resulting in the final set of consolidated views for each named entity type. [sent-264, score-0.477]
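A minimal Python sketch of this consolidation policy, assuming (begin, end) span tuples, is given below; it is a straightforward reading of the 'ContainedWithin' policy rather than SystemT's implementation.

```python
def contained_within(inner, outer):
    # True if span `inner` lies inside span `outer` and is not identical to it.
    return outer[0] <= inner[0] and inner[1] <= outer[1] and inner != outer

def consolidate(per_rule_spans):
    """Union the spans produced by all rules of one entity type, then drop
    every span contained in another span of the union ('ContainedWithin')."""
    union = set().union(*per_rule_spans)
    return sorted(s for s in union
                  if not any(contained_within(s, o) for o in union))

# Two Person rules; (30, 36) is contained in (25, 38) and is consolidated away.
print(consolidate([{(0, 10), (25, 38)}, {(30, 36)}]))  # [(0, 10), (25, 38)]
```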
63 Table 3 shows the accuracy and complexity of rules induced with the three basic feature sets E1, E2 and E3, respectively. [sent-305, score-0.438]
64 As we increase the number of BFs, the accuracies of the induced extractors increase, at the cost of an increase in complexity. [sent-310, score-0.277]
65 We compared the induced extractors with the manually developed extractors of (Chiticariu et al. [sent-316, score-0.472]
66 Table 4 shows the accuracy and complexity of the induced rules with E2 and E3 and the manual extractors for the generic domain and, respectively, customized for the CoNLL03 domain. [sent-320, score-0.813]
67 Our technique compares reasonably with the manually constructed generic extractor for two of the three entity types; and on precision for all entity types, especially since our system generated the rules in 1 hour, whereas the development of manual rules took much longer. [sent-322, score-0.818]
68 The manual extractors also contain a larger number of rules covering many different cases, improving the accuracy, but also leading to a higher complexity score. [sent-327, score-0.548]
69 To better analyze the complexity, we also computed the average rule length for each extractor by dividing the complexity score by the number of AQL views of the extractor. [sent-328, score-0.46]
70 1 for the generic and customized extractors of (Chiticariu et al. [sent-333, score-0.378]
71 The average rule length increases from the generic extractor to the customized extractor in both cases. [sent-335, score-0.84]
72 On average, however, an individual induced rule is slightly smaller than a manually developed rule. [sent-336, score-0.303]
73 The biases added to the system are broadly of two types: (i) Partition of basic features based on types; (ii) Restriction on the type of CD views that can appear in a CR view. [sent-339, score-0.285]
74 , person CR view can contain only person CD views as positive clues and CD views of other types as negative clues. [sent-345, score-0.501]
75 Including both BFs in a CD rule leads to a larger rule that is unintuitive for a developer. [sent-350, score-0.442]
76 The latter type of bias prevents CD rules of one type from appearing as positive clues for a CR rule of a different type. [sent-352, score-0.503]
77 The inclusion of an Organization CD rule as a positive clue for a Person CR rule is unintuitive for a developer. [sent-356, score-0.442]
78 Table 4 shows the effect (for E2 and E3) on the test dataset of disabling and enabling bias during the induction of CR rules using JRIP. [sent-357, score-0.418]
79 This comes at the cost of an increase in extractor complexity and average rule length. [sent-360, score-0.542]
80 Overall, our results show that biases lead to less complex extractors with only a very minor effect on accuracy; thus, biases are important factors contributing to inducing rules that are understandable and may be refined by humans. [sent-364, score-0.66]
81 , 2010b) to lack in some of the constructs (such as minus) that AQL provides and which form a part of our target language (especially the rule refinement phase). [sent-376, score-0.342]
82 However, despite experimenting with all possible parameter configurations for each of these (in each of E1, E2 and E3 settings), the accuracies obtained were substantially (30-50%) worse and the extractor complexity was much (around 60%) higher when compared to our system (with or without bias). [sent-377, score-0.321]
83 We found that CR rules learned by JRIP consist of a strong CD rule (high precision, typically involving a dictionary) and a weak CD rule (low precision, typically involving only regular expressions). [sent-384, score-0.738]
84 The strong CD rule always corresponded to a positive clue (match) and the weak CD rule corresponded to the negative clue (overlaps or not-matches). [sent-385, score-0.475]
85 This is posited to be the way the CR rule learner operates: it tries to learn conjunctions of weak and strong clues so as to filter one from the other. [sent-388, score-0.349]
86 Therefore, setting a precision threshold too high limited the number of such weak clues and the ability of the CR rule learner to find such rules. [sent-389, score-0.32]
87 The complexity is very helpful in comparing alternative rule sets. [sent-393, score-0.324]
88 Second, rule developers use semantically meaningful view names such as those shown in Figure 1 to help them recall the semantics of a rule at a high-level, an aspect that is not captured by the complexity measure. [sent-404, score-0.63]
89 In simple terms, an extractor consisting of 5 rules of size 1 is indistinguishable from an extractor consisting of a single rule of size 5, and it is arguable which of these extractors is more interpretable. [sent-407, score-1.055]
90 When informally examining the rules induced by our system, we found that CD rules are similar in spirit to those written by rule developers. [sent-409, score-0.709]
91 On the other hand, the induced CR rules are too fine-grained. [sent-410, score-0.285]
92 In general, rule developers group CD rules with similar semantics, then write refinement rules at the higher level of the group, as opposed to the lower level of individual CD views. [sent-411, score-0.723]
93 In contrast, our induction algorithm considers CR rules consisting of combinations of CD rules directly, leading to many semantically similar CR rules, each operating over small parts of a larger semantic group (see rule in Section 6. [sent-419, score-0.803]
94 This nuance is not captured by the complexity score, which may deem an extractor consisting of many rules, where many of the rules operate at higher levels of groups of candidates, to be more complex than a smaller extractor with many fine-grained rules. [sent-422, score-0.742]
95 Indeed, as shown before, the complexity of the induced extractors is much smaller compared to that of manual extractors, although the latter follow the semantic grouping principle and are considered more interpretable. [sent-423, score-0.427]
96 7 Conclusion: We presented a system for efficiently inducing named entity annotation rules in the AQL language. [sent-424, score-0.308]
97 The design of our approach is aimed at producing accurate rules that can be understood and refined by humans, by placing special emphasis on low complexity and efficient computation of the induced rules, while mimicking a four-stage approach used for manually constructing rules. [sent-425, score-0.419]
98 The induced rules have good accuracy and low complexity according to our complexity measure. [sent-426, score-0.491]
99 While our complexity measure informs the biases in our system and leads to simpler, smaller extractors, it captures extractor interpretability only to a certain extent. [sent-427, score-0.542]
100 doc / doc /bigins ight s_aql re f_con_ aql -overview . [sent-504, score-0.433]
wordName wordTfidf (topN-words)
[('aql', 0.433), ('cd', 0.283), ('rule', 0.221), ('extractor', 0.218), ('chiticariu', 0.21), ('rules', 0.203), ('extractors', 0.195), ('induction', 0.176), ('cr', 0.17), ('systemt', 0.157), ('rlggs', 0.144), ('views', 0.136), ('jrip', 0.131), ('rlgg', 0.131), ('customized', 0.122), ('interpretability', 0.122), ('spans', 0.109), ('complexity', 0.103), ('biases', 0.099), ('bfs', 0.092), ('statement', 0.087), ('induced', 0.082), ('capsperson', 0.079), ('cds', 0.079), ('loccd', 0.079), ('ner', 0.077), ('predicates', 0.074), ('bf', 0.071), ('clause', 0.071), ('person', 0.069), ('capsorg', 0.066), ('consolidation', 0.066), ('orgcd', 0.066), ('percd', 0.066), ('rajasekar', 0.066), ('phase', 0.064), ('refinement', 0.062), ('generic', 0.061), ('regular', 0.06), ('span', 0.059), ('constructs', 0.059), ('join', 0.059), ('propositional', 0.057), ('krishnamurthy', 0.056), ('califf', 0.052), ('customization', 0.052), ('maynard', 0.052), ('reiss', 0.052), ('view', 0.051), ('basic', 0.05), ('ibm', 0.047), ('manual', 0.047), ('muggleton', 0.045), ('entity', 0.043), ('dictionaries', 0.043), ('overlaps', 0.042), ('frederick', 0.041), ('caps', 0.041), ('generalizations', 0.041), ('laura', 0.041), ('clues', 0.04), ('algebraic', 0.039), ('capsorgr', 0.039), ('consolidate', 0.039), ('cpsl', 0.039), ('itb', 0.039), ('personall', 0.039), ('sriram', 0.039), ('urnkranz', 0.039), ('yunyao', 0.039), ('organization', 0.039), ('bias', 0.039), ('shivakumar', 0.038), ('clusters', 0.038), ('ie', 0.036), ('clustering', 0.035), ('developers', 0.034), ('waugh', 0.034), ('prince', 0.034), ('compositions', 0.034), ('loc', 0.034), ('inducing', 0.033), ('weak', 0.033), ('wolf', 0.031), ('gate', 0.031), ('algebra', 0.031), ('refined', 0.031), ('minus', 0.03), ('named', 0.029), ('filter', 0.029), ('phases', 0.028), ('raghavan', 0.028), ('specification', 0.028), ('tuples', 0.027), ('expertise', 0.027), ('execution', 0.027), ('learner', 0.026), ('abiteboul', 0.026), ('aleph', 0.026), ('almaden', 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000011 125 emnlp-2012-Towards Efficient Named-Entity Rule Induction for Customizability
Author: Ajay Nagesh ; Ganesh Ramakrishnan ; Laura Chiticariu ; Rajasekar Krishnamurthy ; Ankush Dharkar ; Pushpak Bhattacharyya
Abstract: Generic rule-based systems for Information Extraction (IE) have been shown to work reasonably well out-of-the-box, and achieve state-of-the-art accuracy with further domain customization. However, it is generally recognized that manually building and customizing rules is a complex and labor-intensive process. In this paper, we discuss an approach that facilitates the process of building customizable rules for Named-Entity Recognition (NER) tasks via rule induction, in the Annotation Query Language (AQL). Given a set of basic features and an annotated document collection, our goal is to generate an initial set of rules with reasonable accuracy, that are interpretable and thus can be easily refined by a human developer. We present an efficient rule induction process, modeled on a four-stage manual rule development process and present initial promising results with our system. We also propose a simple notion of extractor complexity as a first step to quantify the interpretability of an extractor, and study the effect of induction bias and customization of basic features on the accuracy and complexity of induced rules. We demonstrate through experiments that the induced rules have good accuracy and low complexity according to our complexity measure.
2 0.081431046 1 emnlp-2012-A Bayesian Model for Learning SCFGs with Discontiguous Rules
Author: Abby Levenberg ; Chris Dyer ; Phil Blunsom
Abstract: We describe a nonparametric model and corresponding inference algorithm for learning Synchronous Context Free Grammar derivations for parallel text. The model employs a Pitman-Yor Process prior which uses a novel base distribution over synchronous grammar rules. Through both synthetic grammar induction and statistical machine translation experiments, we show that our model learns complex translational correspondences— including discontiguous, many-to-many alignments—and produces competitive translation results. Further, inference is efficient and we present results on significantly larger corpora than prior work.
3 0.077827513 24 emnlp-2012-Biased Representation Learning for Domain Adaptation
Author: Fei Huang ; Alexander Yates
Abstract: Representation learning is a promising technique for discovering features that allow supervised classifiers to generalize from a source domain dataset to arbitrary new domains. We present a novel, formal statement of the representation learning task. We argue that because the task is computationally intractable in general, it is important for a representation learner to be able to incorporate expert knowledge during its search for helpful features. Leveraging the Posterior Regularization framework, we develop an architecture for incorporating biases into representation learning. We investigate three types of biases, and experiments on two domain adaptation tasks show that our biased learners identify significantly better sets of features than unbiased learners, resulting in a relative reduction in error of more than 16% for both tasks, with respect to existing state-of-the-art representation learning techniques.
4 0.075377457 82 emnlp-2012-Left-to-Right Tree-to-String Decoding with Prediction
Author: Yang Feng ; Yang Liu ; Qun Liu ; Trevor Cohn
Abstract: Decoding algorithms for syntax based machine translation suffer from high computational complexity, a consequence of intersecting a language model with a context free grammar. Left-to-right decoding, which generates the target string in order, can improve decoding efficiency by simplifying the language model evaluation. This paper presents a novel left to right decoding algorithm for tree-to-string translation, using a bottom-up parsing strategy and dynamic future cost estimation for each partial translation. Our method outperforms previously published tree-to-string decoders, including a competing left-to-right method.
5 0.074497499 136 emnlp-2012-Weakly Supervised Training of Semantic Parsers
Author: Jayant Krishnamurthy ; Tom Mitchell
Abstract: We present a method for training a semantic parser using only a knowledge base and an unlabeled text corpus, without any individually annotated sentences. Our key observation is that multiple forms ofweak supervision can be combined to train an accurate semantic parser: semantic supervision from a knowledge base, and syntactic supervision from dependencyparsed sentences. We apply our approach to train a semantic parser that uses 77 relations from Freebase in its knowledge representation. This semantic parser extracts instances of binary relations with state-of-theart accuracy, while simultaneously recovering much richer semantic structures, such as conjunctions of multiple relations with partially shared arguments. We demonstrate recovery of this richer structure by extracting logical forms from natural language queries against Freebase. On this task, the trained semantic parser achieves 80% precision and 56% recall, despite never having seen an annotated logical form.
6 0.066484474 16 emnlp-2012-Aligning Predicates across Monolingual Comparable Texts using Graph-based Clustering
7 0.057662811 124 emnlp-2012-Three Dependency-and-Boundary Models for Grammar Induction
8 0.050303176 127 emnlp-2012-Transforming Trees to Improve Syntactic Convergence
9 0.049553156 47 emnlp-2012-Explore Person Specific Evidence in Web Person Name Disambiguation
10 0.046142373 94 emnlp-2012-Multiple Aspect Summarization Using Integer Linear Programming
11 0.045979612 80 emnlp-2012-Learning Verb Inference Rules from Linguistically-Motivated Evidence
12 0.045831691 98 emnlp-2012-No Noun Phrase Left Behind: Detecting and Typing Unlinkable Entities
13 0.042276304 71 emnlp-2012-Joint Entity and Event Coreference Resolution across Documents
14 0.041455582 93 emnlp-2012-Multi-instance Multi-label Learning for Relation Extraction
15 0.03991152 29 emnlp-2012-Concurrent Acquisition of Word Meaning and Lexical Categories
16 0.038895179 7 emnlp-2012-A Novel Discriminative Framework for Sentence-Level Discourse Analysis
17 0.038878374 21 emnlp-2012-Assessment of ESL Learners' Syntactic Competence Based on Similarity Measures
18 0.038211908 110 emnlp-2012-Reading The Web with Learned Syntactic-Semantic Inference Rules
19 0.037621457 3 emnlp-2012-A Coherence Model Based on Syntactic Patterns
20 0.037114497 70 emnlp-2012-Joint Chinese Word Segmentation, POS Tagging and Parsing
topicId topicWeight
[(0, 0.16), (1, 0.022), (2, 0.008), (3, -0.027), (4, -0.006), (5, 0.004), (6, 0.013), (7, 0.051), (8, -0.021), (9, 0.001), (10, -0.033), (11, 0.083), (12, -0.096), (13, 0.074), (14, -0.061), (15, 0.037), (16, -0.027), (17, -0.003), (18, -0.069), (19, 0.048), (20, 0.076), (21, -0.002), (22, 0.03), (23, 0.028), (24, -0.026), (25, -0.035), (26, -0.083), (27, 0.087), (28, 0.004), (29, -0.145), (30, -0.063), (31, 0.084), (32, 0.022), (33, -0.29), (34, 0.064), (35, 0.296), (36, 0.078), (37, -0.058), (38, -0.116), (39, 0.203), (40, -0.198), (41, 0.1), (42, 0.099), (43, 0.071), (44, -0.029), (45, -0.116), (46, -0.169), (47, 0.172), (48, -0.144), (49, -0.056)]
simIndex simValue paperId paperTitle
same-paper 1 0.976641 125 emnlp-2012-Towards Efficient Named-Entity Rule Induction for Customizability
Author: Ajay Nagesh ; Ganesh Ramakrishnan ; Laura Chiticariu ; Rajasekar Krishnamurthy ; Ankush Dharkar ; Pushpak Bhattacharyya
Abstract: Generic rule-based systems for Information Extraction (IE) have been shown to work reasonably well out-of-the-box, and achieve state-of-the-art accuracy with further domain customization. However, it is generally recognized that manually building and customizing rules is a complex and labor-intensive process. In this paper, we discuss an approach that facilitates the process of building customizable rules for Named-Entity Recognition (NER) tasks via rule induction, in the Annotation Query Language (AQL). Given a set of basic features and an annotated document collection, our goal is to generate an initial set of rules with reasonable accuracy, that are interpretable and thus can be easily refined by a human developer. We present an efficient rule induction process, modeled on a four-stage manual rule development process and present initial promising results with our system. We also propose a simple notion of extractor complexity as a first step to quantify the interpretability of an extractor, and study the effect of induction bias and customization of basic features on the accuracy and complexity of induced rules. We demonstrate through experiments that the induced rules have good accuracy and low complexity according to our complexity measure.
2 0.47961116 1 emnlp-2012-A Bayesian Model for Learning SCFGs with Discontiguous Rules
Author: Abby Levenberg ; Chris Dyer ; Phil Blunsom
Abstract: We describe a nonparametric model and corresponding inference algorithm for learning Synchronous Context Free Grammar derivations for parallel text. The model employs a Pitman-Yor Process prior which uses a novel base distribution over synchronous grammar rules. Through both synthetic grammar induction and statistical machine translation experiments, we show that our model learns complex translational correspondences— including discontiguous, many-to-many alignments—and produces competitive translation results. Further, inference is efficient and we present results on significantly larger corpora than prior work.
3 0.47367033 82 emnlp-2012-Left-to-Right Tree-to-String Decoding with Prediction
Author: Yang Feng ; Yang Liu ; Qun Liu ; Trevor Cohn
Abstract: Decoding algorithms for syntax based machine translation suffer from high computational complexity, a consequence of intersecting a language model with a context free grammar. Left-to-right decoding, which generates the target string in order, can improve decoding efficiency by simplifying the language model evaluation. This paper presents a novel left to right decoding algorithm for tree-to-string translation, using a bottom-up parsing strategy and dynamic future cost estimation for each partial translation. Our method outperforms previously published tree-to-string decoders, including a competing left-to-right method.
4 0.33750194 26 emnlp-2012-Building a Lightweight Semantic Model for Unsupervised Information Extraction on Short Listings
Author: Doo Soon Kim ; Kunal Verma ; Peter Yeh
Abstract: Short listings such as classified ads or product listings abound on the web. If a computer can reliably extract information from them, it will greatly benefit a variety of applications. Short listings are, however, challenging to process due to their informal styles. In this paper, we present an unsupervised information extraction system for short listings. Given a corpus of listings, the system builds a semantic model that represents typical objects and their attributes in the domain of the corpus, and then uses the model to extract information. Two key features in the system are a semantic parser that extracts objects and their attributes and a listing-focused clustering module that helps group together extracted tokens of same type. Our evaluation shows that the , semantic model learned by these two modules is effective across multiple domains.
5 0.31802097 24 emnlp-2012-Biased Representation Learning for Domain Adaptation
Author: Fei Huang ; Alexander Yates
Abstract: Representation learning is a promising technique for discovering features that allow supervised classifiers to generalize from a source domain dataset to arbitrary new domains. We present a novel, formal statement of the representation learning task. We argue that because the task is computationally intractable in general, it is important for a representation learner to be able to incorporate expert knowledge during its search for helpful features. Leveraging the Posterior Regularization framework, we develop an architecture for incorporating biases into representation learning. We investigate three types of biases, and experiments on two domain adaptation tasks show that our biased learners identify significantly better sets of features than unbiased learners, resulting in a relative reduction in error of more than 16% for both tasks, with respect to existing state-of-the-art representation learning techniques.
6 0.28442964 16 emnlp-2012-Aligning Predicates across Monolingual Comparable Texts using Graph-based Clustering
7 0.24749635 110 emnlp-2012-Reading The Web with Learned Syntactic-Semantic Inference Rules
8 0.24555118 130 emnlp-2012-Unambiguity Regularization for Unsupervised Learning of Probabilistic Grammars
9 0.23802674 124 emnlp-2012-Three Dependency-and-Boundary Models for Grammar Induction
10 0.23213281 21 emnlp-2012-Assessment of ESL Learners' Syntactic Competence Based on Similarity Measures
11 0.20731434 80 emnlp-2012-Learning Verb Inference Rules from Linguistically-Motivated Evidence
12 0.20565186 103 emnlp-2012-PATTY: A Taxonomy of Relational Patterns with Semantic Types
13 0.1962457 109 emnlp-2012-Re-training Monolingual Parser Bilingually for Syntactic SMT
14 0.19605376 79 emnlp-2012-Learning Syntactic Categories Using Paradigmatic Representations of Word Context
15 0.18021108 44 emnlp-2012-Excitatory or Inhibitory: A New Semantic Orientation Extracts Contradiction and Causality from the Web
16 0.17809251 94 emnlp-2012-Multiple Aspect Summarization Using Integer Linear Programming
17 0.17642336 54 emnlp-2012-Forced Derivation Tree based Model Training to Statistical Machine Translation
18 0.17551391 107 emnlp-2012-Polarity Inducing Latent Semantic Analysis
19 0.17521292 47 emnlp-2012-Explore Person Specific Evidence in Web Person Name Disambiguation
20 0.1745393 136 emnlp-2012-Weakly Supervised Training of Semantic Parsers
topicId topicWeight
[(2, 0.018), (16, 0.029), (25, 0.018), (34, 0.052), (45, 0.015), (60, 0.064), (63, 0.043), (64, 0.019), (65, 0.018), (70, 0.011), (73, 0.463), (74, 0.089), (76, 0.036), (80, 0.021), (86, 0.018), (95, 0.011)]
simIndex simValue paperId paperTitle
same-paper 1 0.85144973 125 emnlp-2012-Towards Efficient Named-Entity Rule Induction for Customizability
Author: Ajay Nagesh ; Ganesh Ramakrishnan ; Laura Chiticariu ; Rajasekar Krishnamurthy ; Ankush Dharkar ; Pushpak Bhattacharyya
Abstract: Generic rule-based systems for Information Extraction (IE) have been shown to work reasonably well out-of-the-box, and achieve state-of-the-art accuracy with further domain customization. However, it is generally recognized that manually building and customizing rules is a complex and labor-intensive process. In this paper, we discuss an approach that facilitates the process of building customizable rules for Named-Entity Recognition (NER) tasks via rule induction, in the Annotation Query Language (AQL). Given a set of basic features and an annotated document collection, our goal is to generate an initial set of rules with reasonable accuracy, that are interpretable and thus can be easily refined by a human developer. We present an efficient rule induction process, modeled on a four-stage manual rule development process and present initial promising results with our system. We also propose a simple notion of extractor complexity as a first step to quantify the interpretability of an extractor, and study the effect of induction bias and customization of basic features on the accuracy and complexity of induced rules. We demonstrate through experiments that the induced rules have good accuracy and low complexity according to our complexity measure.
2 0.79472679 112 emnlp-2012-Resolving Complex Cases of Definite Pronouns: The Winograd Schema Challenge
Author: Altaf Rahman ; Vincent Ng
Abstract: We examine the task of resolving complex cases of definite pronouns, specifically those for which traditional linguistic constraints on coreference (e.g., Binding Constraints, gender and number agreement) as well as commonly-used resolution heuristics (e.g., string-matching facilities, syntactic salience) are not useful. Being able to solve this task has broader implications in artificial intelligence: a restricted version of it, sometimes referred to as the Winograd Schema Challenge, has been suggested as a conceptually and practically appealing alternative to the Turing Test. We employ a knowledge-rich approach to this task, which yields a pronoun resolver that outperforms state-of-the-art resolvers by nearly 18 points in accuracy on our dataset.
3 0.70699674 115 emnlp-2012-SSHLDA: A Semi-Supervised Hierarchical Topic Model
Author: Xian-Ling Mao ; Zhao-Yan Ming ; Tat-Seng Chua ; Si Li ; Hongfei Yan ; Xiaoming Li
Abstract: Supervised hierarchical topic modeling and unsupervised hierarchical topic modeling are usually used to obtain hierarchical topics, such as hLLDA and hLDA. Supervised hierarchical topic modeling makes heavy use of the information from observed hierarchical labels, but cannot explore new topics; while unsupervised hierarchical topic modeling is able to detect automatically new topics in the data space, but does not make use of any information from hierarchical labels. In this paper, we propose a semi-supervised hierarchical topic model which aims to explore new topics automatically in the data space while incorporating the information from observed hierarchical labels into the modeling process, called SemiSupervised Hierarchical Latent Dirichlet Allocation (SSHLDA). We also prove that hLDA and hLLDA are special cases of SSHLDA. We . conduct experiments on Yahoo! Answers and ODP datasets, and assess the performance in terms of perplexity and clustering. The experimental results show that predictive ability of SSHLDA is better than that of baselines, and SSHLDA can also achieve significant improvement over baselines for clustering on the FScore measure.
4 0.3467232 124 emnlp-2012-Three Dependency-and-Boundary Models for Grammar Induction
Author: Valentin I. Spitkovsky ; Hiyan Alshawi ; Daniel Jurafsky
Abstract: We present a new family of models for unsupervised parsing, Dependency and Boundary models, that use cues at constituent boundaries to inform head-outward dependency tree generation. We build on three intuitions that are explicit in phrase-structure grammars but only implicit in standard dependency formulations: (i) Distributions of words that occur at sentence boundaries such as English determiners resemble constituent edges. (ii) Punctuation at sentence boundaries further helps distinguish full sentences from fragments like headlines and titles, allowing us to model grammatical differences between complete and incomplete sentences. (iii) Sentence-internal punctuation boundaries help with longer-distance dependencies, since punctuation correlates with constituent edges. Our models induce state-of-the-art dependency grammars for many languages without — — special knowledge of optimal input sentence lengths or biased, manually-tuned initializers.
5 0.33372167 71 emnlp-2012-Joint Entity and Event Coreference Resolution across Documents
Author: Heeyoung Lee ; Marta Recasens ; Angel Chang ; Mihai Surdeanu ; Dan Jurafsky
Abstract: We introduce a novel coreference resolution system that models entities and events jointly. Our iterative method cautiously constructs clusters of entity and event mentions using linear regression to model cluster merge operations. As clusters are built, information flows between entity and event clusters through features that model semantic role dependencies. Our system handles nominal and verbal events as well as entities, and our joint formulation allows information from event coreference to help entity coreference, and vice versa. In a cross-document domain with comparable documents, joint coreference resolution performs significantly better (over 3 CoNLL F1 points) than two strong baselines that resolve entities and events separately.
6 0.33156744 23 emnlp-2012-Besting the Quiz Master: Crowdsourcing Incremental Classification Games
7 0.32369217 8 emnlp-2012-A Phrase-Discovering Topic Model Using Hierarchical Pitman-Yor Processes
8 0.32181233 27 emnlp-2012-Characterizing Stylistic Elements in Syntactic Structure
9 0.31661621 51 emnlp-2012-Extracting Opinion Expressions with semi-Markov Conditional Random Fields
10 0.31522053 89 emnlp-2012-Mixed Membership Markov Models for Unsupervised Conversation Modeling
11 0.31316844 130 emnlp-2012-Unambiguity Regularization for Unsupervised Learning of Probabilistic Grammars
12 0.31303641 77 emnlp-2012-Learning Constraints for Consistent Timeline Extraction
13 0.31252226 20 emnlp-2012-Answering Opinion Questions on Products by Exploiting Hierarchical Organization of Consumer Reviews
14 0.31105545 93 emnlp-2012-Multi-instance Multi-label Learning for Relation Extraction
15 0.30626652 133 emnlp-2012-Unsupervised PCFG Induction for Grounded Language Learning with Highly Ambiguous Supervision
16 0.30398762 9 emnlp-2012-A Sequence Labelling Approach to Quote Attribution
17 0.30372256 64 emnlp-2012-Improved Parsing and POS Tagging Using Inter-Sentence Consistency Constraints
18 0.30263895 14 emnlp-2012-A Weakly Supervised Model for Sentence-Level Semantic Orientation Analysis with Multiple Experts
19 0.30007777 10 emnlp-2012-A Statistical Relational Learning Approach to Identifying Evidence Based Medicine Categories
20 0.29810074 95 emnlp-2012-N-gram-based Tense Models for Statistical Machine Translation