acl acl2013 acl2013-365 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Vidhya Govindaraju ; Ce Zhang ; Christopher Re
Abstract: Tabular information in text documents contains a wealth of information, and so tables are a natural candidate for information extraction. There are many cues buried in both a table and its surrounding text that allow us to understand the meaning of the data in a table. We study how natural-language tools, such as part-of-speech tagging, dependency paths, and named-entity recognition, can be used to improve the quality of relation extraction from tables. In three domains we show that (1) a model that performs joint probabilistic inference across tabular and natural language features achieves an F1 score that is twice as high as either a pure-table or pure-text system, and (2) using only shallower features or non-joint inference results in lower quality.
Reference: text
sentIndex sentText sentNum sentScore
1 There are many cues buried in both a table and its surrounding text that allow us to understand the meaning of the data in a table. [sent-2, score-0.289]
2 We study how natural-language tools, such as part-of-speech tagging, dependency paths, and named-entity recognition, can be used to improve the quality of relation extraction from tables. [sent-3, score-0.256]
3 Recent studies found billions of high-quality relations on the web in HTML (Cafarella et al. [sent-6, score-0.17]
4 In financial applications, a huge amount of data is buried in the tables of corporate filings and earnings reports; in science, millions of journal articles contain billions of scientific facts in tables. [sent-8, score-0.558]
5 Although tables describe precise, structured relations, tables are rarely written in a way that is self-describing, e.g., [sent-9, score-0.644]
6 tables may contain abbreviations or only informal schema information; in turn, the contents of tables are often ambiguously specified, which makes extracting the relations implicit in tabular data difficult. [sent-11, score-1.194]
7 The text surrounding a table in a journal article [sent-13, score-0.135]
8 explains its contents to its intended audience, a human reader. [sent-15, score-0.077]
9 For example, in a simple study, we demonstrate that humans can achieve more than 60% higher recall by jointly reading the text and tables in a journal article than by only looking at the tables. [sent-16, score-0.534]
10 The conclusion of this experiment is not surprising, but it raises a question: How should a system combine tabular and natural-language features to understand tables in text? [sent-17, score-0.898]
11 Most previous approaches use textual or tabular features separately, e.g., [sent-19, score-0.603]
12 tabular approaches that do not use text features (Dalvi et al. [sent-21, score-0.592]
13 , 2003) or textual approaches that do not use tabular features (Mintz et al. [sent-23, score-0.603]
14 (2007) proposed to learn the target relation independently from both table and surface textual features, and then combine the result using a linear combination of the predictions. [sent-26, score-0.16]
15 In a similar spirit, we propose to use both types of features in our approach to relation extraction. [sent-27, score-0.147]
16 Our proposed approach differs from prior approaches in two ways: (1) We use deeper–but standard–NLP features than prior approaches for table extraction. [sent-28, score-0.176]
17 In contrast to the shallow, lexical features that prior approaches have used, we use standard NLP features, such as dependency paths, parts of speech, etc. [sent-29, score-0.129]
18 Our hypothesis is that a deeper understanding of the text in which a table is embedded will lead to higher quality table extraction. [sent-30, score-0.279]
19 (2) Our probabilistic model jointly uses both tabular and textual features. [sent-31, score-0.51]
20 One advantage of a joint approach is that one can predict portions of the complicated predicate that is buried in a table. [sent-32, score-0.283]
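The excerpt above does not reproduce the model equations, so the following is only an illustrative sketch of what jointly using tabular and textual features can look like, not the authors' exact formulation: a candidate relation tuple gets one boolean variable whose log-linear score pools feature groups from both sources. All feature names and weights below are made up.

    # Illustrative sketch only; feature names and weights are hypothetical.
    import math

    def joint_probability(tab_feats, text_feats, weights):
        """tab_feats / text_feats: dicts mapping feature name -> value; weights: learned weights."""
        score = sum(weights.get(f, 0.0) * v for f, v in tab_feats.items())
        score += sum(weights.get(f, 0.0) * v for f, v in text_feats.items())
        return 1.0 / (1.0 + math.exp(-score))  # logistic link on the pooled score

    p = joint_probability(
        {"tab:col_header=TOC": 1.0, "tab:cell_is_number": 1.0},
        {"txt:dep_path=nsubj->measured->dobj": 1.0, "txt:ner=FORMATION": 1.0},
        {"tab:col_header=TOC": 1.2, "txt:dep_path=nsubj->measured->dobj": 0.8},
    )

Because both feature groups score the same variable, text evidence can supply arguments (for example, the rock formation or the geological time interval) that the table alone leaves unspecified, which is the advantage described above.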
21 For example, in a geology journal article, we may read a measure- [remainder of the sentence interrupted by the proceedings page footer] [sent-33, score-0.358]
22-24 [sentences 22-24 are garbled by two-column PDF extraction; the recoverable fragments mention where a rock was unearthed, the geological time interval in which it appeared, and building a system for each domain to extract relations] [sent-131 to sent-141]
25 2 Motivating Human Study. We describe a simple human study that motivated our approach to jointly combine both tabular features and natural language features to extract relations from tables. [sent-143, score-0.811]
26-30 [sentences 26-30 are garbled by two-column PDF extraction; the description of the human study is interleaved with geological text about adakites and cannot be reliably reconstructed] [sent-180 to sent-189]
31 [start of sentence garbled] ... produced three variants: (1) the original document; (2) table-only, which is the set of tables in the document (without the text); (3) text-only, which is the text of the document with the tables removed. [sent-193, score-0.768]
32 Each geoscientist was asked to read and extract the relations from one of the three variants. [sent-194, score-0.158]
33 We then judged the precision and recall of their extraction, as shown in Figure 2. [sent-195, score-0.09]
34 As shown in Figure 3, human readers not surprisingly achieve perfect precision on each of the variants, but lower recall on both the table-only and text-only variants. [sent-196, score-0.09]
35 However, summing the recall of table-only (60%) and text-only (20%) variants together would achieve only 80% recall; this implies that in the best case more than 20% of the extractions require that the human reader read the table and its surrounding text jointly. [sent-197, score-0.395]
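Written out, the recall argument above is simple arithmetic; the only assumption added here is that the full-document reading defines the 100%-recall reference (i.e., the gold set).

    R_{\text{table-only}} + R_{\text{text-only}} \approx 60\% + 20\% = 80\%
    \quad\Rightarrow\quad
    R_{\text{full}} - \bigl(R_{\text{table-only}} + R_{\text{text-only}}\bigr) \geq 100\% - 80\% = 20\%

So even if the two partial views recovered disjoint sets of facts, at least roughly 20% of the gold extractions can only be found by reading the table and its surrounding text together.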
36 This motivates our approach, which uses a joint inference system to model features from a table and its surrounding text. [sent-199, score-0.457]
37 We also propose to use deep linguistic features instead of shallower features to get as close as possible to the ability of human readers in understanding the surrounding text of a table. [sent-200, score-0.419]
38 1 Experimental Setup. We consider the task of constructing a geology knowledge base. [sent-206, score-0.316]
39 Specifically, our goal is to extract a Rock-TotalOrganicCarbon relation that maps rock formations (e. [sent-207, score-0.281]
40 2 We asked three geoscientists to annotate these journal articles manually to extract the Rock-TotalOrganicCarbon relation (1. [sent-215, score-0.12]
41 We then extracted features following state-of-the-art practices (see Figure 4). [sent-220, score-0.082]
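The actual feature templates are those of the paper's Figure 4, which is not reproduced in this excerpt. Purely as an illustration of the kinds of features involved (POS tags, NER labels, and a dependency path between two candidate mentions), here is a small sketch using spaCy; the paper itself cites the Stanford toolchain (Toutanova, Finkel, de Marneffe), and the example sentence and token indices below are hypothetical.

    # Illustrative feature extraction; not the paper's Figure 4 templates.
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("The Barnett Shale has a total organic carbon of 4.5 wt%.")

    pos_feats = {f"pos:{tok.text}={tok.pos_}" for tok in doc}
    ner_feats = {f"ner:{ent.text}={ent.label_}" for ent in doc.ents}

    def dep_path(tok_a, tok_b):
        """Dependency labels from tok_a up to the lowest common ancestor, then down to tok_b."""
        anc_a = [tok_a] + list(tok_a.ancestors)
        anc_b = [tok_b] + list(tok_b.ancestors)
        ids_b = {t.i for t in anc_b}
        common = next(t for t in anc_a if t.i in ids_b)
        i_a = [t.i for t in anc_a].index(common.i)
        i_b = [t.i for t in anc_b].index(common.i)
        up = [t.dep_ for t in anc_a[:i_a]]
        down = [t.dep_ for t in anc_b[:i_b]]
        return "->".join(up) + "^" + "<-".join(reversed(down))

    path_feat = "dep:" + dep_path(doc[1], doc[9])  # path between "Barnett" and "4.5"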
42 To validate our hypothesis, we implement four systems, each of which has access to different types of data: (1) Table. [sent-222, score-0.071]
43 This approach only has access to the text in a document and contains all the features mentioned in Wu and Weld (2010) and Mintz et al. [sent-227, score-0.166]
44 The features used in (1) and (2) are shown in Figure 4. [sent-229, score-0.082]
45 Using Table and Text, we extract all facts and their associated probabilities. [sent-233, score-0.107]
46 Merge is a baseline approach that uses information from both tables and text. [sent-235, score-0.322]
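The excerpt does not spell out Merge's combination rule. As an assumption (not confirmed by the paper), one common baseline of this kind simply unions the two systems' scored extractions and keeps the larger probability when both fire on the same tuple:

    # Assumed combination rule for a Merge-style baseline; not the paper's definition.
    def merge(table_preds, text_preds):
        """Each argument maps a relation tuple to its extraction probability."""
        out = dict(table_preds)
        for tup, p in text_preds.items():
            out[tup] = max(out.get(tup, 0.0), p)
        return out

Unlike Joint, a rule like this can only rescore tuples that one of the two systems already produced in full; it cannot combine partial evidence from the table with partial evidence from the text.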
47 We build a joint approach that uses information from both tables and text. [sent-237, score-0.486]
48 Recall that a key advantage of a joint approach is that we do not need to predict all arguments of the relation (if such a prediction is unwarranted from the data) . [sent-240, score-0.229]
49 The inference is done by Gibbs sampling using our inference engine Elementary (Zhang and Ré, 2013). [sent-241, score-0.24]
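Elementary's actual interface is not shown in this excerpt; the sketch below is only a generic reminder of what Gibbs sampling over boolean extraction variables in a weighted factor graph looks like, with made-up data structures.

    # Generic Gibbs sampler over boolean variables; not Elementary's API.
    import math, random

    def gibbs(assignment, factors, neighbors, iters=1000):
        """assignment: dict var -> bool; factors: list of (weight, fn(assignment) -> 0/1);
        neighbors: dict var -> indices of factors that mention it."""
        counts = {v: 0 for v in assignment}
        for _ in range(iters):
            for v in assignment:
                scores = []
                for val in (False, True):
                    assignment[v] = val
                    scores.append(sum(w * fn(assignment)
                                      for w, fn in (factors[i] for i in neighbors[v])))
                p_true = 1.0 / (1.0 + math.exp(scores[0] - scores[1]))
                assignment[v] = random.random() < p_true
                counts[v] += assignment[v]
        return {v: c / iters for v, c in counts.items()}  # marginal P(v = True)

The per-variable marginals are the extraction probabilities that the thresholds and P/R curves discussed below operate on.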
50 2 End-to-End Quality. We were able to validate that Joint achieves higher quality than the other three approaches we considered. [sent-244, score-0.189]
51 Figure 5 shows the P/R curve of different approaches on three domains. [sent-245, score-0.083]
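As a small how-to note (standard methodology, not specific to this paper): a P/R curve over probabilistic extractions is traced by sweeping a confidence threshold against a gold set; the variable names below are assumed.

    # Sketch: precision/recall points from probabilistic extractions.
    def pr_curve(scored, gold, thresholds=(0.1, 0.3, 0.5, 0.7, 0.9)):
        """scored: dict tuple -> probability; gold: set of true tuples."""
        points = []
        for t in thresholds:
            predicted = {x for x, p in scored.items() if p >= t}
            tp = len(predicted & gold)
            precision = tp / len(predicted) if predicted else 1.0
            recall = tp / len(gold) if gold else 0.0
            points.append((t, precision, recall))
        return points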
52 At a recall of 10%, Joint achieves 3x higher precision than all other approaches. [sent-248, score-0.09]
53 In our error analysis, we saw that tables in geology articles often contain ambiguous words; for example, [sent-249, score-0.638]
54 Figure 5: End-to-end extraction quality on Petrology, Finance, and GeoDeepDive; panels: (a) Geology Domain, (b) Petrology Domain, (c) Finance Domain; x-axes: Recall. [sent-261, score-0.155]
55 Recall is limited by the quality of state-of-the-art table recognition software on PDFs. [sent-262, score-0.161]
56 the word “Barnett” in a table may refer to either a location or a rock formation. [sent-263, score-0.161]
57 By using features extracted from text, Joint achieves higher precision. [sent-264, score-0.082]
58 For recall in the range of 0–10%, Merge outperforms both Text and Table, with 3%–90% improvement in precision. [sent-265, score-0.09]
59 In Geology, Merge has precision that is similar to Text and Table for the higher recall range (>10%) . [sent-266, score-0.09]
60 In this domain, we found that relations that appeared in the text often repeated relations described in the table. [sent-267, score-0.166]
61 In other domains, such as Petrology, where the relations in text and tables have lower degrees of overlap, Merge significantly improves over Text and Table (Figure 5(b)) . [sent-268, score-0.427]
62 We conducted a statistical significance test to check whether the improvement of Joint over the three other approaches is statistically significant. [sent-269, score-0.089]
63 Figure 6 shows the results of the statistical significance test in which the null hypothesis is that the F1 scores of two approaches are the same. [sent-277, score-0.135]
64 With p = 0.01, Joint has a statistically significant improvement in F1 score over all three other approaches at each probability threshold. [sent-279, score-0.047]
65 Linguistic Features. We validate the hypothesis that using linguistic features, e.g., [sent-282, score-0.117]
66 , 2006) , helps improve the quality of our approach, called Joint. [sent-286, score-0.071]
67 Figure 6: Approximate randomization test from Chinchor (1992) of F1 score with p = 0.01 [sent-290, score-0.053]
68 on the impact of joint inference compared with pure-table or pure-text approaches for different probability thresholds. [sent-291, score-0.331]
69 A + sign indicates that the F1 score of the joint approach increased significantly. [sent-292, score-0.164]
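A common way to implement Chinchor's (1992) approximate randomization test for an F1 difference (sketched here with assumed per-document count tuples; the paper's exact stratification is not given in this excerpt) is to randomly swap the two systems' per-document outputs and count how often the shuffled F1 gap reaches the observed one.

    # Approximate randomization test on F1; illustrative implementation.
    import random

    def f1(counts):
        tp = sum(c[0] for c in counts); fp = sum(c[1] for c in counts); fn = sum(c[2] for c in counts)
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        return 2 * p * r / (p + r) if p + r else 0.0

    def approx_randomization(docs_a, docs_b, trials=10000):
        """docs_a, docs_b: per-document (tp, fp, fn) tuples for the two systems."""
        observed = abs(f1(docs_a) - f1(docs_b))
        hits = 0
        for _ in range(trials):
            sa, sb = [], []
            for a, b in zip(docs_a, docs_b):
                if random.random() < 0.5:
                    a, b = b, a
                sa.append(a); sb.append(b)
            if abs(f1(sa) - f1(sb)) >= observed:
                hits += 1
        return (hits + 1) / (trials + 1)  # estimated p-value

A "+" entry in Figures 6 and 9 then corresponds to an estimated p-value below the chosen 0.01 level.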
70 state-of-the-art approaches from the literature (see Figure 7) . [sent-297, score-0.047]
71 Joint(-parse) removes features generated by the dependency parser and syntax parser. [sent-299, score-0.143]
72 Similarly, Joint(-ner) (Joint(-pos)) removes all features related to NER (resp. POS). [sent-300, score-0.143]
73 Joint(-pos) also removes NER and parser features because the latter two are dependent on POS features. [sent-302, score-0.143]
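The group dependency described in that lesion design (removing POS also disables NER and parse features, because both are computed on top of POS tags) is easy to make explicit; the group names below are illustrative.

    # Sketch: closure of disabled feature groups for the lesion variants.
    downstream_of = {"pos": {"ner", "parse"}, "ner": set(), "parse": set()}

    def disabled_groups(removed):
        out, frontier = set(removed), list(removed)
        while frontier:
            g = frontier.pop()
            for d in downstream_of.get(g, ()):
                if d not in out:
                    out.add(d)
                    frontier.append(d)
        return out

    # disabled_groups({"pos"})   -> {"pos", "ner", "parse"}   (Joint(-pos))
    # disabled_groups({"parse"}) -> {"parse"}                 (Joint(-parse))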
74 Figure 8 shows the P/R curve for all these variants on Geology, and Figure 9 shows the results of the statistical significance test. [sent-303, score-0.153]
75 Figure 8: Lesion study of different features for Geology (x-axis: Recall). [sent-310, score-0.118]
76 Figure 9: Approximate randomization test of F1 score with p = 0.01. [sent-314, score-0.053]
77 and Joint(-ner) is not significant because there are "easy-to-extract" facts in the high-probability range. [sent-318, score-0.087]
78 4 Related Work. The intuition that context features might help table-related tasks has existed for decades. [sent-321, score-0.082]
79 For example, Hurst and Nasukawa (2000) mentioned (as future work) that context features could be used to further improve their relation extraction approaches from tables. [sent-322, score-0.278]
80 (2010) use bag-of-words features and hyperlinks to recommend new columns for web tables. [sent-324, score-0.126]
81 (2007) extract features, including font size and title, from PDF documents in which a table appears, to help the table ranking task. [sent-326, score-0.055]
82 They find that these features only contribute less than 2% to precision. [sent-327, score-0.082]
83 In contrast, in our approach linguistic features are quite useful. [sent-328, score-0.082]
84 The above approaches use context features that can be extracted without POS tagging or linguistic parsing. [sent-329, score-0.129]
85 One aspect of our work is to demonstrate that traditional NLP tools can enhance the quality of table extraction. [sent-330, score-0.071]
86 Extracting information from tables has been discussed by different communities in the last decade, including NLP (Wu and Lee, 2006; Tengli et al. [sent-331, score-0.322]
87 This body of work considers only features derived from tables and does not examine richer NLP features as we do. [sent-339, score-0.486]
88 While joint inference is popular, it is not clear when a joint inference system outperforms a more traditional NLP pipeline. [sent-340, score-0.568]
89 Recent studies have reached a variety of conclusions: in some, joint inference helps extraction quality (McCallum, 2009; Poon and Domingos, 2007; Singh et al., 2009); [sent-341, score-0.439]
90 and in some, joint inference hurts extraction quality (Poon and Domingos, 2007; Eisner, 2009). [sent-342, score-0.439]
91 Our intuition is that joint inference is helpful in this application because our joint inference approach combines non-redundant signals (textual versus tabular) . [sent-343, score-0.568]
92 5 Conclusion. To improve the quality of extractions of tabular data, we use standard NLP techniques to more deeply understand the text in which a table is embedded. [sent-344, score-0.622]
93 We validate that deeper NLP features combined with a joint probabilistic model have a statistically significant impact on quality, i.e., [sent-345, score-0.385]
94 WebTables: Exploring the power of tables on the web. [sent-359, score-0.322]
95 WebSets: Extracting sets of entities from the web using unsupervised information extraction. [sent-371, score-0.044]
96 Incorporating non-local information into information extraction systems by Gibbs sampling. [sent-393, score-0.084]
97 Layout and language: Integrating spatial and linguistic knowledge for layout understanding tasks. [sent-397, score-0.108]
98 TableSeer: Automatic table metadata extraction and searching in digital libraries. [sent-414, score-0.12]
99 Bi-directional joint inference for entity resolution and segmentation using imperatively-defined factor graphs. [sent-439, score-0.284]
100 A grammatical approach to understanding textual tables using two-dimensional SCFGs. [sent-456, score-0.427]
wordName wordTfidf (topN-words)
[('tabular', 0.419), ('tables', 0.322), ('geology', 0.316), ('joint', 0.164), ('rock', 0.161), ('dalvi', 0.158), ('pinto', 0.121), ('inference', 0.12), ('buried', 0.119), ('petrology', 0.119), ('vidhya', 0.119), ('poon', 0.1), ('surrounding', 0.091), ('recall', 0.09), ('extraction', 0.084), ('wu', 0.082), ('features', 0.082), ('barnett', 0.079), ('govindaraju', 0.079), ('iptice', 0.079), ('tengli', 0.079), ('domingos', 0.077), ('variants', 0.075), ('mintz', 0.075), ('quality', 0.071), ('cafarella', 0.071), ('validate', 0.071), ('shallower', 0.07), ('hurst', 0.07), ('deeper', 0.068), ('merge', 0.066), ('relation', 0.065), ('prasenjit', 0.065), ('sre', 0.065), ('billions', 0.065), ('removes', 0.061), ('relations', 0.061), ('mitra', 0.058), ('layout', 0.058), ('textual', 0.055), ('extract', 0.055), ('randomization', 0.053), ('extractions', 0.053), ('ad', 0.053), ('facts', 0.052), ('marneffe', 0.052), ('ner', 0.052), ('afrl', 0.052), ('understanding', 0.05), ('fang', 0.05), ('ce', 0.05), ('singh', 0.049), ('finance', 0.047), ('xing', 0.047), ('weld', 0.047), ('approaches', 0.047), ('christopher', 0.047), ('hypothesis', 0.046), ('toutanova', 0.046), ('sigmod', 0.046), ('gibbs', 0.045), ('bruce', 0.045), ('cm', 0.045), ('web', 0.044), ('text', 0.044), ('andrew', 0.044), ('finkel', 0.043), ('read', 0.042), ('nlp', 0.042), ('wei', 0.042), ('article', 0.042), ('significance', 0.042), ('mccallum', 0.042), ('award', 0.041), ('combine', 0.04), ('shallow', 0.04), ('document', 0.04), ('zhang', 0.039), ('lee', 0.037), ('study', 0.036), ('jointly', 0.036), ('aaai', 0.036), ('digital', 0.036), ('curve', 0.036), ('paths', 0.036), ('contents', 0.035), ('conference', 0.035), ('understand', 0.035), ('anced', 0.035), ('wep', 0.035), ('cindy', 0.035), ('onl', 0.035), ('feig', 0.035), ('ambiguously', 0.035), ('etso', 0.035), ('ogf', 0.035), ('sourceforge', 0.035), ('jude', 0.035), ('shavlik', 0.035), ('highprobability', 0.035)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 365 acl-2013-Understanding Tables in Context Using Standard NLP Toolkits
Author: Vidhya Govindaraju ; Ce Zhang ; Christopher Re
Abstract: Tabular information in text documents contains a wealth of information, and so tables are a natural candidate for information extraction. There are many cues buried in both a table and its surrounding text that allow us to understand the meaning of the data in a table. We study how natural-language tools, such as part-of-speech tagging, dependency paths, and named-entity recognition, can be used to improve the quality of relation extraction from tables. In three domains we show that (1) a model that performs joint probabilistic inference across tabular and natural language features achieves an F1 score that is twice as high as either a puretable or pure-text system, and (2) using only shallower features or non-joint inference results in lower quality.
2 0.10127402 159 acl-2013-Filling Knowledge Base Gaps for Distant Supervision of Relation Extraction
Author: Wei Xu ; Raphael Hoffmann ; Le Zhao ; Ralph Grishman
Abstract: Distant supervision has attracted recent interest for training information extraction systems because it does not require any human annotation but rather employs existing knowledge bases to heuristically label a training corpus. However, previous work has failed to address the problem of false negative training examples mislabeled due to the incompleteness of knowledge bases. To tackle this problem, we propose a simple yet novel framework that combines a passage retrieval model using coarse features into a state-of-the-art relation extractor using multi-instance learning with fine features. We adapt the information retrieval technique of pseudo- relevance feedback to expand knowledge bases, assuming entity pairs in top-ranked passages are more likely to express a relation. Our proposed technique significantly improves the quality of distantly supervised relation extraction, boosting recall from 47.7% to 61.2% with a consistently high level of precision of around 93% in the experiments.
3 0.10038092 352 acl-2013-Towards Accurate Distant Supervision for Relational Facts Extraction
Author: Xingxing Zhang ; Jianwen Zhang ; Junyu Zeng ; Jun Yan ; Zheng Chen ; Zhifang Sui
Abstract: Distant supervision (DS) is an appealing learning method which learns from existing relational facts to extract more from a text corpus. However, the accuracy is still not satisfying. In this paper, we point out and analyze some critical factors in DS which have great impact on accuracy, including valid entity type detection, negative training examples construction and ensembles. We propose an approach to handle these factors. By experimenting on Wikipedia articles to extract the facts in Freebase (the top 92 relations), we show the impact of these three factors on the accuracy of DS and the remarkable improvement led by the proposed approach.
4 0.073924184 181 acl-2013-Hierarchical Phrase Table Combination for Machine Translation
Author: Conghui Zhu ; Taro Watanabe ; Eiichiro Sumita ; Tiejun Zhao
Abstract: Typical statistical machine translation systems are batch trained with a given training data and their performances are largely influenced by the amount of data. With the growth of the available data across different domains, it is computationally demanding to perform batch training every time when new data comes. In face of the problem, we propose an efficient phrase table combination method. In particular, we train a Bayesian phrasal inversion transduction grammars for each domain separately. The learned phrase tables are hierarchically combined as if they are drawn from a hierarchical Pitman-Yor process. The performance measured by BLEU is at least as comparable to the traditional batch training method. Furthermore, each phrase table is trained separately in each domain, and while computational overhead is significantly reduced by training them in parallel.
5 0.073720306 169 acl-2013-Generating Synthetic Comparable Questions for News Articles
Author: Oleg Rokhlenko ; Idan Szpektor
Abstract: We introduce the novel task of automatically generating questions that are relevant to a text but do not appear in it. One motivating example of its application is for increasing user engagement around news articles by suggesting relevant comparable questions, such as “is Beyonce a better singer than Madonna?”, for the user to answer. We present the first algorithm for the task, which consists of: (a) offline construction of a comparable question template database; (b) ranking of relevant templates to a given article; and (c) instantiation of templates only with entities in the article whose comparison under the template’s relation makes sense. We tested the suggestions generated by our algorithm via a Mechanical Turk experiment, which showed a significant improvement over the strongest baseline of more than 45% in all metrics.
6 0.066785187 237 acl-2013-Margin-based Decomposed Amortized Inference
7 0.066508487 208 acl-2013-Joint Inference for Heterogeneous Dependency Parsing
8 0.066477701 269 acl-2013-PLIS: a Probabilistic Lexical Inference System
9 0.065120824 224 acl-2013-Learning to Extract International Relations from Political Context
10 0.063920923 80 acl-2013-Chinese Parsing Exploiting Characters
11 0.063454211 160 acl-2013-Fine-grained Semantic Typing of Emerging Entities
12 0.06323947 291 acl-2013-Question Answering Using Enhanced Lexical Semantic Models
13 0.062433701 164 acl-2013-FudanNLP: A Toolkit for Chinese Natural Language Processing
14 0.062073048 210 acl-2013-Joint Word Alignment and Bilingual Named Entity Recognition Using Dual Decomposition
15 0.061581753 173 acl-2013-Graph-based Semi-Supervised Model for Joint Chinese Word Segmentation and Part-of-Speech Tagging
16 0.059704918 46 acl-2013-An Infinite Hierarchical Bayesian Model of Phrasal Translation
17 0.058146451 245 acl-2013-Modeling Human Inference Process for Textual Entailment Recognition
18 0.057538103 179 acl-2013-HYENA-live: Fine-Grained Online Entity Type Classification from Natural-language Text
19 0.056562565 242 acl-2013-Mining Equivalent Relations from Linked Data
20 0.05563283 60 acl-2013-Automatic Coupling of Answer Extraction and Information Retrieval
topicId topicWeight
[(0, 0.188), (1, 0.022), (2, -0.028), (3, -0.042), (4, 0.034), (5, 0.063), (6, -0.012), (7, -0.037), (8, 0.002), (9, -0.001), (10, 0.001), (11, -0.021), (12, -0.034), (13, 0.015), (14, -0.002), (15, -0.004), (16, -0.008), (17, 0.02), (18, -0.019), (19, -0.038), (20, -0.002), (21, 0.036), (22, -0.018), (23, 0.072), (24, 0.04), (25, 0.051), (26, -0.009), (27, -0.053), (28, -0.002), (29, -0.022), (30, 0.021), (31, 0.06), (32, 0.013), (33, 0.027), (34, -0.018), (35, 0.047), (36, 0.042), (37, -0.062), (38, -0.026), (39, 0.043), (40, -0.012), (41, -0.018), (42, -0.096), (43, -0.013), (44, 0.036), (45, -0.005), (46, -0.044), (47, 0.034), (48, -0.017), (49, 0.065)]
simIndex simValue paperId paperTitle
same-paper 1 0.94376975 365 acl-2013-Understanding Tables in Context Using Standard NLP Toolkits
Author: Vidhya Govindaraju ; Ce Zhang ; Christopher Re
Abstract: Tabular information in text documents contains a wealth of information, and so tables are a natural candidate for information extraction. There are many cues buried in both a table and its surrounding text that allow us to understand the meaning of the data in a table. We study how natural-language tools, such as part-of-speech tagging, dependency paths, and named-entity recognition, can be used to improve the quality of relation extraction from tables. In three domains we show that (1) a model that performs joint probabilistic inference across tabular and natural language features achieves an F1 score that is twice as high as either a puretable or pure-text system, and (2) using only shallower features or non-joint inference results in lower quality.
2 0.80786103 159 acl-2013-Filling Knowledge Base Gaps for Distant Supervision of Relation Extraction
Author: Wei Xu ; Raphael Hoffmann ; Le Zhao ; Ralph Grishman
Abstract: Distant supervision has attracted recent interest for training information extraction systems because it does not require any human annotation but rather employs existing knowledge bases to heuristically label a training corpus. However, previous work has failed to address the problem of false negative training examples mislabeled due to the incompleteness of knowledge bases. To tackle this problem, we propose a simple yet novel framework that combines a passage retrieval model using coarse features into a state-of-the-art relation extractor using multi-instance learning with fine features. We adapt the information retrieval technique of pseudo- relevance feedback to expand knowledge bases, assuming entity pairs in top-ranked passages are more likely to express a relation. Our proposed technique significantly improves the quality of distantly supervised relation extraction, boosting recall from 47.7% to 61.2% with a consistently high level of precision of around 93% in the experiments.
3 0.71304476 352 acl-2013-Towards Accurate Distant Supervision for Relational Facts Extraction
Author: Xingxing Zhang ; Jianwen Zhang ; Junyu Zeng ; Jun Yan ; Zheng Chen ; Zhifang Sui
Abstract: Distant supervision (DS) is an appealing learning method which learns from existing relational facts to extract more from a text corpus. However, the accuracy is still not satisfying. In this paper, we point out and analyze some critical factors in DS which have great impact on accuracy, including valid entity type detection, negative training examples construction and ensembles. We propose an approach to handle these factors. By experimenting on Wikipedia articles to extract the facts in Freebase (the top 92 relations), we show the impact of these three factors on the accuracy of DS and the remarkable improvement led by the proposed approach.
4 0.69917274 215 acl-2013-Large-scale Semantic Parsing via Schema Matching and Lexicon Extension
Author: Qingqing Cai ; Alexander Yates
Abstract: Supervised training procedures for semantic parsers produce high-quality semantic parsers, but they have difficulty scaling to large databases because of the sheer number of logical constants for which they must see labeled training data. We present a technique for developing semantic parsers for large databases based on a reduction to standard supervised training algorithms, schema matching, and pattern learning. Leveraging techniques from each of these areas, we develop a semantic parser for Freebase that is capable of parsing questions with an F1 that improves by 0.42 over a purely-supervised learning algorithm.
5 0.66672254 356 acl-2013-Transfer Learning Based Cross-lingual Knowledge Extraction for Wikipedia
Author: Zhigang Wang ; Zhixing Li ; Juanzi Li ; Jie Tang ; Jeff Z. Pan
Abstract: Wikipedia infoboxes are a valuable source of structured knowledge for global knowledge sharing. However, infobox information is very incomplete and imbalanced among the Wikipedias in different languages. It is a promising but challenging problem to utilize the rich structured knowledge from a source language Wikipedia to help complete the missing infoboxes for a target language. In this paper, we formulate the problem of cross-lingual knowledge extraction from multilingual Wikipedia sources, and present a novel framework, called WikiCiKE, to solve this problem. An instancebased transfer learning method is utilized to overcome the problems of topic drift and translation errors. Our experimental results demonstrate that WikiCiKE outperforms the monolingual knowledge extraction method and the translation-based method.
6 0.66439426 160 acl-2013-Fine-grained Semantic Typing of Emerging Entities
7 0.65306908 269 acl-2013-PLIS: a Probabilistic Lexical Inference System
8 0.6478579 178 acl-2013-HEADY: News headline abstraction through event pattern clustering
9 0.64664865 179 acl-2013-HYENA-live: Fine-Grained Online Entity Type Classification from Natural-language Text
10 0.63353372 61 acl-2013-Automatic Interpretation of the English Possessive
11 0.632851 228 acl-2013-Leveraging Domain-Independent Information in Semantic Parsing
12 0.62881202 340 acl-2013-Text-Driven Toponym Resolution using Indirect Supervision
13 0.61278123 169 acl-2013-Generating Synthetic Comparable Questions for News Articles
14 0.61159539 205 acl-2013-Joint Apposition Extraction with Syntactic and Semantic Constraints
15 0.61084479 134 acl-2013-Embedding Semantic Similarity in Tree Kernels for Domain Adaptation of Relation Extraction
16 0.60847384 237 acl-2013-Margin-based Decomposed Amortized Inference
17 0.60273647 346 acl-2013-The Impact of Topic Bias on Quality Flaw Prediction in Wikipedia
18 0.58438766 14 acl-2013-A Novel Classifier Based on Quantum Computation
19 0.58431804 298 acl-2013-Recognizing Rare Social Phenomena in Conversation: Empowerment Detection in Support Group Chatrooms
20 0.58174109 373 acl-2013-Using Conceptual Class Attributes to Characterize Social Media Users
topicId topicWeight
[(0, 0.06), (6, 0.024), (11, 0.069), (15, 0.018), (21, 0.017), (23, 0.263), (24, 0.046), (26, 0.061), (28, 0.011), (35, 0.079), (42, 0.067), (48, 0.041), (64, 0.01), (70, 0.056), (88, 0.023), (90, 0.025), (95, 0.07)]
simIndex simValue paperId paperTitle
1 0.81092232 209 acl-2013-Joint Modeling of News Readerâ•Žs and Comment Writerâ•Žs Emotions
Author: Huanhuan Liu ; Shoushan Li ; Guodong Zhou ; Chu-ren Huang ; Peifeng Li
Abstract: Emotion classification can be generally done from both the writer’s and reader’s perspectives. In this study, we find that two foundational tasks in emotion classification, i.e., reader’s emotion classification on the news and writer’s emotion classification on the comments, are strongly related to each other in terms of coarse-grained emotion categories, i.e., negative and positive. On the basis, we propose a respective way to jointly model these two tasks. In particular, a cotraining algorithm is proposed to improve semi-supervised learning of the two tasks. Experimental evaluation shows the effectiveness of our joint modeling approach. . 1
same-paper 2 0.75927418 365 acl-2013-Understanding Tables in Context Using Standard NLP Toolkits
Author: Vidhya Govindaraju ; Ce Zhang ; Christopher Re
Abstract: Tabular information in text documents contains a wealth of information, and so tables are a natural candidate for information extraction. There are many cues buried in both a table and its surrounding text that allow us to understand the meaning of the data in a table. We study how natural-language tools, such as part-of-speech tagging, dependency paths, and named-entity recognition, can be used to improve the quality of relation extraction from tables. In three domains we show that (1) a model that performs joint probabilistic inference across tabular and natural language features achieves an F1 score that is twice as high as either a puretable or pure-text system, and (2) using only shallower features or non-joint inference results in lower quality.
3 0.73828858 328 acl-2013-Stacking for Statistical Machine Translation
Author: Majid Razmara ; Anoop Sarkar
Abstract: We propose the use of stacking, an ensemble learning technique, to the statistical machine translation (SMT) models. A diverse ensemble of weak learners is created using the same SMT engine (a hierarchical phrase-based system) by manipulating the training data and a strong model is created by combining the weak models on-the-fly. Experimental results on two language pairs and three different sizes of training data show significant improvements of up to 4 BLEU points over a conventionally trained SMT model.
4 0.70716208 333 acl-2013-Summarization Through Submodularity and Dispersion
Author: Anirban Dasgupta ; Ravi Kumar ; Sujith Ravi
Abstract: We propose a new optimization framework for summarization by generalizing the submodular framework of (Lin and Bilmes, 2011). In our framework the summarization desideratum is expressed as a sum of a submodular function and a nonsubmodular function, which we call dispersion; the latter uses inter-sentence dissimilarities in different ways in order to ensure non-redundancy of the summary. We consider three natural dispersion functions and show that a greedy algorithm can obtain an approximately optimal summary in all three cases. We conduct experiments on two corpora—DUC 2004 and user comments on news articles—and show that the performance of our algorithm outperforms those that rely only on submodularity.
5 0.57498389 276 acl-2013-Part-of-Speech Induction in Dependency Trees for Statistical Machine Translation
Author: Akihiro Tamura ; Taro Watanabe ; Eiichiro Sumita ; Hiroya Takamura ; Manabu Okumura
Abstract: This paper proposes a nonparametric Bayesian method for inducing Part-ofSpeech (POS) tags in dependency trees to improve the performance of statistical machine translation (SMT). In particular, we extend the monolingual infinite tree model (Finkel et al., 2007) to a bilingual scenario: each hidden state (POS tag) of a source-side dependency tree emits a source word together with its aligned target word, either jointly (joint model), or independently (independent model). Evaluations of Japanese-to-English translation on the NTCIR-9 data show that our induced Japanese POS tags for dependency trees improve the performance of a forest- to-string SMT system. Our independent model gains over 1 point in BLEU by resolving the sparseness problem introduced in the joint model.
6 0.56929469 132 acl-2013-Easy-First POS Tagging and Dependency Parsing with Beam Search
7 0.5685786 343 acl-2013-The Effect of Higher-Order Dependency Features in Discriminative Phrase-Structure Parsing
8 0.5670656 159 acl-2013-Filling Knowledge Base Gaps for Distant Supervision of Relation Extraction
9 0.56540811 318 acl-2013-Sentiment Relevance
10 0.56523418 155 acl-2013-Fast and Accurate Shift-Reduce Constituent Parsing
11 0.56482512 174 acl-2013-Graph Propagation for Paraphrasing Out-of-Vocabulary Words in Statistical Machine Translation
12 0.56373274 46 acl-2013-An Infinite Hierarchical Bayesian Model of Phrasal Translation
13 0.5635637 82 acl-2013-Co-regularizing character-based and word-based models for semi-supervised Chinese word segmentation
14 0.56324959 164 acl-2013-FudanNLP: A Toolkit for Chinese Natural Language Processing
15 0.56286693 134 acl-2013-Embedding Semantic Similarity in Tree Kernels for Domain Adaptation of Relation Extraction
16 0.56245375 80 acl-2013-Chinese Parsing Exploiting Characters
17 0.56220794 275 acl-2013-Parsing with Compositional Vector Grammars
18 0.56163591 83 acl-2013-Collective Annotation of Linguistic Resources: Basic Principles and a Formal Model
19 0.56144762 18 acl-2013-A Sentence Compression Based Framework to Query-Focused Multi-Document Summarization
20 0.56111073 225 acl-2013-Learning to Order Natural Language Texts