acl acl2011 acl2011-117 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Kugatsu Sadamitsu ; Kuniko Saito ; Kenji Imamura ; Genichiro Kikui
Abstract: This paper proposes three modules based on latent topics of documents for alleviating “semantic drift” in bootstrapping entity set expansion. These new modules are added to a discriminative bootstrapping algorithm to realize topic feature generation, negative example selection and entity candidate pruning. In this study, we model latent topics with LDA (Latent Dirichlet Allocation) in an unsupervised way. Experiments show that the accuracy of the extracted entities is improved by 6.7 to 28.2% depending on the domain.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract This paper proposes three modules based on latent topics of documents for alleviating “semantic drift” in bootstrapping entity set expansion. [sent-9, score-0.699]
2 These new modules are added to a discriminative bootstrapping algorithm to realize topic feature generation, negative example selection and entity candidate pruning. [sent-10, score-1.336]
3 In this study, we model latent topics with LDA (Latent Dirichlet Allocation) in an unsupervised way. [sent-11, score-0.189]
4 Experiments show that the accuracy of the extracted entities is improved by 6.7 to 28.2% depending on the domain. [sent-12, score-0.171]
5 1 Introduction The task of this paper is entity set expansion in which the lexicons are expanded from just a few seed entities (Pantel et al. [sent-15, score-0.492]
6 Many set expansion algorithms are based on bootstrapping algorithms, which iteratively acquire new entities. [sent-18, score-0.23]
7 Semantic drift moves the extraction criteria away from the initial criteria demanded by the user and so reduces the accuracy of extraction. [sent-20, score-0.151]
8 Pantel and Pennacchiotti (2006) proposed Espresso, a relation extraction method based on the co-training bootstrapping algorithm with entities and attributes. [sent-21, score-0.313]
9 To achieve this goal, we use a discriminative method instead of a scoring function and incorporate topic information into it. [sent-29, score-0.591]
10 Topic information means the genre of each document as estimated by statistical topic models. [sent-30, score-0.506]
11 In this paper, we effectively utilize topic information in three modules: the first generates the features of the discriminative models; the second selects negative examples; the third prunes incorrect examples from candidate examples for new entities. [sent-31, score-1.14]
12 In Section 2, we illustrate discriminative bootstrapping algorithms and describe their problems. [sent-34, score-0.262]
13 2 Problems of the previous Discriminative Bootstrapping method Some previous works introduced discriminative methods based on the logistic sigmoid classifier, which can utilize arbitrary features for the relation extraction task instead of a scoring function such as Espresso (Bellare et al. [sent-38, score-0.261]
14 reported that the discriminative approach achieves better accuracy than Espresso when the number of extracted pairs is increased. [sent-42, score-0.154]
15 The discriminative approach is useful because it can exploit arbitrary features; however, the authors did not identify which feature or features are effective for their methods. [sent-47, score-0.153]
16 Although the context features and attributes partly reduce entity word sense ambiguity, some ambiguous entities remain. [sent-48, score-0.444]
17 For example, consider the domain broadcast program (PRG) and assume that PRG’s attribute is advertisement. [sent-49, score-0.135]
18 The entity Android belongs to the cell-phone domain, not PRG, but appears with positive attributes or contexts because many cell-phones, like broadcast programs, are introduced in advertisements. [sent-52, score-0.51]
19 By considering the genre of the document, we can distinguish “Android” from PRG and remove such false examples even if the false entity appears with positive context strings or attributes. [sent-55, score-0.548]
20 Second, they did not solve the problem of negative example selection. [sent-56, score-0.223]
21 Because negative examples are necessary for discriminative training, they used all remaining examples, other than positive examples, as negative examples. [sent-57, score-0.777]
22 Although this is the simplest technique, it is impossible to use all of the examples provided by a large-scale corpus for discriminative training. [sent-58, score-0.217]
23 We solve these three problems by using topic information. [sent-61, score-0.506]
24 1 Basic bootstrapping methods In this section, we describe the basic method adopted from Bellare (Bellare et al. [sent-63, score-0.178]
25 After Ns positive seed entities are manually given, every noun co-occurring with the seed entities is ranked by its PMI score, and Na positive attributes are then selected manually. [sent-68, score-0.806]
26 The entity-attribute pairs are obtained by taking the cross product of seed entity lists and attribute lists. [sent-71, score-0.35]
27 The pairs are used as queries for retrieving the positive documents, which include positive pairs. [sent-72, score-0.356]
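The seed-and-attribute construction above can be sketched as follows. This is a minimal illustration with hypothetical helper names; in particular, the PMI here is estimated from document-level co-occurrence counts, which is an assumption about the exact formulation used in the paper.

```python
import math
from collections import Counter
from itertools import product

def pmi_rank_attributes(docs, seed_entities, top_k):
    """Rank nouns co-occurring with seed entities by a document-level
    PMI estimate (a sketch of the attribute-selection step)."""
    n_docs = len(docs)
    word_df = Counter()   # document frequency of each word
    co_df = Counter()     # document frequency of (word, any-seed) co-occurrence
    seed_df = 0           # number of documents containing a seed entity
    for tokens in docs:
        tokset = set(tokens)
        has_seed = bool(tokset & seed_entities)
        seed_df += has_seed
        for w in tokset:
            word_df[w] += 1
            if has_seed and w not in seed_entities:
                co_df[w] += 1

    def pmi(w):
        # PMI(w, seeds) = log( P(w, seed) / (P(w) * P(seed)) )
        return math.log((co_df[w] / n_docs) /
                        ((word_df[w] / n_docs) * (seed_df / n_docs)))

    return sorted(co_df, key=pmi, reverse=True)[:top_k]

def positive_pairs(seed_entities, attributes):
    """Cross product of seed entity list and attribute list."""
    return list(product(sorted(seed_entities), attributes))
```

The returned pairs would then serve as queries for retrieving positive documents, as the text describes.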
28 Once positive examples are constructed, discriminative models can be trained by randomly selecting negative examples. [sent-75, score-0.621]
29 Candidate entities are restricted to only the Named Entities that lie in close proximity to the positive attributes. [sent-76, score-0.312]
30 Documents containing a candidate Named Entity and positive attribute pair are regarded as examples in the same way as the training data. [sent-77, score-0.36]
31 The discriminative models are used to calculate the discriminative positive score, s(e, a), of each candidate pair, {e, a}. [sent-78, score-0.5]
32 Our system extracts Nn types of new entities with high scores at each iteration, as defined by the summation of s(e, a) over all positive attributes (AP): ∑a∈AP s(e, a). [sent-79, score-0.27]
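The final ranking step, score(e) = ∑a∈AP s(e, a), can be sketched as follows, assuming the per-pair scores s(e, a) have already been produced by the trained discriminative model:

```python
from collections import defaultdict

def rank_candidate_entities(scores, positive_attributes, n_new):
    """Rank candidate entities by the summed discriminative score over
    all positive attributes: score(e) = sum_{a in AP} s(e, a).

    scores: dict mapping (entity, attribute) -> classifier score s(e, a).
    Returns the n_new top-scoring entities.
    """
    totals = defaultdict(float)
    for (e, a), s in scores.items():
        if a in positive_attributes:
            totals[e] += s
    return sorted(totals, key=totals.get, reverse=True)[:n_new]
```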
33 Note that we do not iteratively extract new attributes because our purpose is entity set expansion. [sent-80, score-0.274]
34 2 Topic features and Topic models In previous studies, context information is only used as the features of discriminative models as we described in Section 2. [sent-82, score-0.189]
35 Our method utilizes not only context features but also topic features. [sent-83, score-0.51]
36 By utilizing topic information, our method can disambiguate the entity word sense and alleviate semantic drift. [sent-84, score-0.776]
37 In order to derive the topic information, we utilize statistical topic models, which represent the relation between documents and words through hidden topics. [sent-85, score-1.041]
38 The topic models can calculate the posterior probability p(z|d) of topic z in document d. [sent-86, score-1.025]
39 For example, the topic models give high probability to topic z = ”cell-phone” in the above example sentences 1. [sent-87, score-0.948]
40 The topic feature value φt(z, e, a) is calculated as follows. [sent-89, score-0.474]
41 In this paper, we use Latent Dirichlet Allocation (LDA) as the topic models (Blei et al. [sent-91, score-0.474]
42 LDA represents the latent topics of the documents and the co-occurrence between each topic. [sent-93, score-0.248]
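A rough illustration of how a posterior p(z|d) could be obtained for an unseen document from an already-trained topic model. This is a unigram-mixture approximation for the sketch only, not the paper's actual LDA inference procedure, and the smoothing constant is an assumption:

```python
import math

def doc_topic_posterior(tokens, topic_word_probs, topic_prior):
    """Approximate p(z|d) for document d given a trained model:
    p(z|d) proportional to p(z) * prod_w p(w|z), computed in log space
    and normalized over topics.

    topic_word_probs: one dict per topic, mapping word -> p(w|z).
    topic_prior: list of prior topic probabilities p(z).
    """
    log_post = []
    for z, word_probs in enumerate(topic_word_probs):
        lp = math.log(topic_prior[z])
        for w in tokens:
            lp += math.log(word_probs.get(w, 1e-10))  # smooth unseen words
        log_post.append(lp)
    m = max(log_post)                       # log-sum-exp for stability
    unnorm = [math.exp(lp - m) for lp in log_post]
    s = sum(unnorm)
    return [u / s for u in unnorm]
```

A document mentioning "Android" would then put most of its posterior mass on a cell-phone-like topic, which is exactly the signal the topic features exploit.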
43 In Figure 1, the shaded parts and the arrows with broken lines indicate our proposed method and its use of topic information, as described in the following sections. [sent-94, score-0.558]
44 3 Negative example selection If we choose negative examples randomly, such examples are harmful for discrimination because some examples include the same contexts or topics as the positive examples. [sent-96, score-0.895]
45 By contrast, negative examples belonging to broad genres are needed to alleviate semantic drift. [sent-97, score-0.397]
46 We use topic information to efficiently select such negative examples. [sent-98, score-0.665]
47 In our method, the negative examples are chosen far from the positive examples according to the measure of topic similarity. [sent-99, score-1.043]
48 For calculating topic similarity, we use a ranking score called the “positive topic score”, PT(z), defined as follows: PT(z) = ∑d∈DP p(z|d), where DP indicates the set of positive documents and p(z|d) is the topic posterior probability for a given positive document. [sent-100, score-1.171]
49 The bottom 50% of the topics, sorted in decreasing order of positive topic score, are used as the negative topics. [sent-101, score-1.046]
50 Our system picks as many negative documents as there are positive documents, with each selected negative topic equally represented. [sent-102, score-1.152]
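The negative-topic selection and balanced document sampling described above can be sketched as follows. Function names are hypothetical, and the round-robin spreading over topics is a simplification of whatever sampling procedure the paper actually uses:

```python
def select_negative_topics(doc_topic, positive_doc_ids):
    """Compute PT(z) = sum_{d in DP} p(z|d) over the positive documents
    and return the bottom 50% of topics (lowest PT) as negative topics.

    doc_topic: dict doc_id -> list of topic posteriors p(z|d).
    """
    n_topics = len(next(iter(doc_topic.values())))
    pt = [0.0] * n_topics
    for d in positive_doc_ids:
        for z, p in enumerate(doc_topic[d]):
            pt[z] += p
    order = sorted(range(n_topics), key=lambda z: pt[z], reverse=True)
    return set(order[n_topics // 2:])  # lower half of the ranking

def sample_negative_documents(doc_topic, negative_topics, n_needed):
    """Pick documents whose dominant topic is negative, cycling through
    the negative topics so each is (roughly) equally represented."""
    by_topic = {z: [] for z in negative_topics}
    for d, dist in doc_topic.items():
        top = max(range(len(dist)), key=lambda z: dist[z])
        if top in negative_topics:
            by_topic[top].append(d)
    picked = []
    while len(picked) < n_needed:
        progressed = False
        for z in sorted(by_topic):
            if by_topic[z] and len(picked) < n_needed:
                picked.append(by_topic[z].pop(0))
                progressed = True
        if not progressed:  # ran out of candidate documents
            break
    return picked
```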
51 The candidate set is restricted by positive attributes; however, this is not enough, as described in Section 2. [sent-106, score-0.266]
52 Our candidate pruning module, described below, uses the measure of topic similarity to remove obviously incorrect documents. [sent-107, score-0.642]
53 This pruning module is similar to negative example selection described in the previous section. [sent-108, score-0.384]
54 The positive topic score, PT, is used as a candidate constraint. [sent-109, score-0.74]
55 Taking all positive examples, we select the positive topics, PZ, which include all topics z satisfying the condition PT(z) > th. [sent-110, score-0.487]
56 At least one topic with the largest score is chosen as a positive topic when PT(z) ≤ th for all topics. [sent-111, score-1.126]
57 After selecting the positive topics, the documents [sent-112, score-0.213]
58 including entity candidates are removed if the posterior probability satisfies p(z|d) ≤ th for all positive topics z. [sent-113, score-0.401]
59 This constraint means that the topic of the document matches that of the positive entities and can be regarded as a hard constraint for topic features. [sent-117, score-1.354]
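The pruning rule above can be sketched as follows. The single threshold th shared between the PT(z) test and the p(z|d) test follows the text; whether the paper in fact uses one threshold or two is not fully specified here:

```python
def prune_candidates(candidate_docs, doc_topic, pt_scores, th):
    """Prune candidate documents using the positive topics (a sketch).

    Positive topics PZ are all z with PT(z) > th; if none qualify, the
    single topic with the largest PT is used instead. A candidate
    document is kept only if p(z|d) > th for at least one z in PZ
    (equivalently, removed if p(z|d) <= th for all positive topics).
    """
    n_topics = len(pt_scores)
    pz = [z for z in range(n_topics) if pt_scores[z] > th]
    if not pz:  # fall back to the single best-scoring topic
        pz = [max(range(n_topics), key=lambda z: pt_scores[z])]
    return [d for d in candidate_docs
            if any(doc_topic[d][z] > th for z in pz)]
```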
60 The context features were defined using the template “(head) entity (mid. [sent-122, score-0.218]
61 The features have to appear in both the positive and negative training data at least 5 times. [sent-126, score-0.405]
62 In the experiments, we used three domains, car (“CAR”), broadcast program (“PRG”) and sports organization (“SPT”). [sent-127, score-0.229]
63 After running 10 iterations, we obtained 1000 entities in total. [sent-129, score-0.134]
, 2011), was used for training topic models with 100 mixtures and for inference. [sent-132, score-0.474]
65 The training corpus for the topic models consisted of the gathered content. [sent-133, score-0.474]
66 Bold font indicates that the difference in accuracy between the method in the row and that in the previous row is significant (P < 0.05). [sent-148, score-0.204]
67 Second is the first method with the addition of topic features. [sent-157, score-0.474]
68 Third is the second method with the addition of a negative example selection module. [sent-158, score-0.238]
69 Fourth is the third method with the addition of a candidate pruning module (equals the entire shaded part in Figure 1). [sent-159, score-0.276]
70 Each extracted entity was labeled as correct or incorrect by two evaluators based on the results of a commercial search engine. [sent-160, score-0.27]
71 A third evaluator checked the two evaluations, and examples judged as correct by either one of the two evaluators were counted as correct. [sent-163, score-0.288]
72 Using topic features significantly improves accuracy in the CAR and SPT domains. [sent-166, score-0.547]
73 The negative example selection module improves accuracy in the CAR and PRG domains. [sent-167, score-0.341]
74 Also, the candidate pruning method is effective for the CAR and PRG domains. [sent-169, score-0.168]
75 This is because similar entities such as motorcycles are extracted; they have not only the same context but also the same topic as the CAR domain. [sent-171, score-0.608]
76 In the SPT domain, the method with topic features offers significant improvements in accuracy, and no further improvement was achieved by the other two modules. [sent-172, score-0.579]
77 To confirm that our modules work properly, we show in Table 2 some characteristic words belonging to topics that are similar and not similar to the target domain. [sent-173, score-0.639]
78 Table 2 shows characteristic words for one positive topic zh and two negative topics zl and ze, defined as follows. [sent-174, score-1.179]
79 • zh (the second row) is the topic that maximizes PT(z), which is used as a positive topic. [sent-175, score-0.652]
80 • zl (the fourth row) is the topic that minimizes PT(z), which is used as a negative topic. [sent-176, score-0.744]
81 • ze (the fifth row) is a topic that, we consider, effectively eliminates “drifted entities” extracted by the baseline method. [sent-177, score-0.709]
82 ze is eventually included in the lower half of topic list sorted by PT(z). [sent-178, score-0.661]
83 For candidate pruning to be effective, topics near the top of the ranking, including zh, must be similar to the domain. [sent-181, score-0.347]
84 By contrast, for negative example selection to work, the lower half of the topics (zl, ze and the other negative topics) must be far from the domain. [sent-182, score-0.578]
85 As shown in “CAR” in Table 2, the nearest topic includes “shaken” (automobile inspection) and the farthest topic includes “naika” (internal medicine) which satisfies our expectation. [sent-184, score-0.948]
86 Furthermore, the effective negative topic is similar to the topic of drifted entity sets (digital device). [sent-185, score-1.466]
87 This indicates that our method successfully eliminated drifted entities. [sent-186, score-0.145]
88 Our approach can avoid this problem by using topic models, which identify the topic of positive entity-attribute seed pairs. [sent-191, score-1.217]
89 ze is an effective negative topic for eliminating “drifted entities” extracted by the baseline system. [sent-192, score-0.812]
90 Although their approach is similar to ours, our approach is discriminative and so can treat arbitrary features; it is applicable to bootstrapping methods. [sent-196, score-0.262]
91 The accurate selection of negative examples is a major problem for positive and unlabeled learning methods or general bootstrapping methods and some previous works have attempted to reach a solution (Liu et al. [sent-197, score-0.701]
92 However, their methods are hard to apply to bootstrapping algorithms because the positive seed set is too small to accurately select negative examples. [sent-200, score-0.46]
93 Our method uses topic information to efficiently solve both the problem of extracting global information and the problem of selecting negative examples. [sent-201, score-0.732]
94 6 Conclusion We proposed an approach to set expansion that uses topic information in three modules and showed that it can improve expansion accuracy. [sent-202, score-0.736]
95 The remaining problem is that the granularity of the topic models is not always the same as that of the target domain. [sent-203, score-0.474]
96 , 2008) incorporated with the topic information using PHITS (Cohn and Chang, 2000), to further enhance entity extraction accuracy. [sent-207, score-0.69]
97 Graph-based analysis of semantic drift in Espresso-like bootstrapping algorithms. [sent-242, score-0.259]
98 PLDA+: Parallel latent dirichlet allocation with data placement and pipeline processing. [sent-257, score-0.196]
99 Weaklysupervised acquisition of open-domain classes and class attributes from web documents and query logs. [sent-268, score-0.151]
100 A bootstrapping method for learning semantic lexicons using extraction pattern contexts. [sent-292, score-0.213]
wordName wordTfidf (topN-words)
[('topic', 0.474), ('prg', 0.224), ('negative', 0.191), ('entity', 0.182), ('positive', 0.178), ('car', 0.171), ('ze', 0.147), ('drifted', 0.145), ('bootstrapping', 0.145), ('entities', 0.134), ('bellare', 0.131), ('espresso', 0.131), ('topics', 0.131), ('discriminative', 0.117), ('lda', 0.11), ('android', 0.109), ('spt', 0.109), ('pt', 0.107), ('examples', 0.1), ('attributes', 0.092), ('modules', 0.092), ('seed', 0.091), ('zl', 0.088), ('evaluators', 0.088), ('candidate', 0.088), ('expansion', 0.085), ('pantel', 0.08), ('pruning', 0.08), ('drift', 0.08), ('zh', 0.079), ('attribute', 0.077), ('fuchi', 0.073), ('allocation', 0.071), ('dirichlet', 0.067), ('module', 0.066), ('thelen', 0.064), ('regarded', 0.062), ('komachi', 0.059), ('sarmento', 0.059), ('documents', 0.059), ('broadcast', 0.058), ('latent', 0.058), ('row', 0.056), ('imamura', 0.055), ('font', 0.055), ('adjustment', 0.053), ('mintz', 0.053), ('ntt', 0.05), ('ns', 0.049), ('utilizing', 0.049), ('harmful', 0.048), ('selection', 0.047), ('posterior', 0.045), ('japanese', 0.045), ('ghahramani', 0.045), ('ritter', 0.045), ('pas', 0.045), ('false', 0.044), ('candidates', 0.043), ('discriminate', 0.043), ('shaded', 0.042), ('arrows', 0.042), ('liu', 0.041), ('works', 0.04), ('selectional', 0.04), ('sorted', 0.04), ('na', 0.04), ('characteristic', 0.038), ('pmi', 0.038), ('su', 0.038), ('dp', 0.037), ('alleviate', 0.037), ('blog', 0.037), ('accuracy', 0.037), ('features', 0.036), ('proposal', 0.035), ('suzuki', 0.035), ('belonging', 0.035), ('selecting', 0.035), ('blei', 0.034), ('utilize', 0.034), ('extraction', 0.034), ('semantic', 0.034), ('cohn', 0.034), ('basic', 0.033), ('ap', 0.033), ('solve', 0.032), ('document', 0.032), ('bing', 0.032), ('bro', 0.032), ('okayama', 0.032), ('mpi', 0.032), ('kuniko', 0.032), ('provement', 0.032), ('alleviating', 0.032), ('cta', 0.032), ('cyber', 0.032), ('ez', 0.032), ('giridhar', 0.032), ('mamoru', 0.032)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999946 117 acl-2011-Entity Set Expansion using Topic information
Author: Kugatsu Sadamitsu ; Kuniko Saito ; Kenji Imamura ; Genichiro Kikui
Abstract: This paper proposes three modules based on latent topics of documents for alleviating “semantic drift” in bootstrapping entity set expansion. These new modules are added to a discriminative bootstrapping algorithm to realize topic feature generation, negative example selection and entity candidate pruning. In this study, we model latent topics with LDA (Latent Dirichlet Allocation) in an unsupervised way. Experiments show that the accuracy of the extracted entities is improved by 6.7 to 28.2% depending on the domain.
2 0.32273474 52 acl-2011-Automatic Labelling of Topic Models
Author: Jey Han Lau ; Karl Grieser ; David Newman ; Timothy Baldwin
Abstract: We propose a method for automatically labelling topics learned via LDA topic models. We generate our label candidate set from the top-ranking topic terms, titles of Wikipedia articles containing the top-ranking topic terms, and sub-phrases extracted from the Wikipedia article titles. We rank the label candidates using a combination of association measures and lexical features, optionally fed into a supervised ranking model. Our method is shown to perform strongly over four independent sets of topics, significantly better than a benchmark method.
3 0.26239508 161 acl-2011-Identifying Word Translations from Comparable Corpora Using Latent Topic Models
Author: Ivan Vulic ; Wim De Smet ; Marie-Francine Moens
Abstract: A topic model outputs a set of multinomial distributions over words for each topic. In this paper, we investigate the value of bilingual topic models, i.e., a bilingual Latent Dirichlet Allocation model for finding translations of terms in comparable corpora without using any linguistic resources. Experiments on a document-aligned English-Italian Wikipedia corpus confirm that the developed methods which only use knowledge from word-topic distributions outperform methods based on similarity measures in the original word-document space. The best results, obtained by combining knowledge from wordtopic distributions with similarity measures in the original space, are also reported.
4 0.24962522 178 acl-2011-Interactive Topic Modeling
Author: Yuening Hu ; Jordan Boyd-Graber ; Brianna Satinoff
Abstract: Topic models have been used extensively as a tool for corpus exploration, and a cottage industry has developed to tweak topic models to better encode human intuitions or to better model data. However, creating such extensions requires expertise in machine learning unavailable to potential end-users of topic modeling software. In this work, we develop a framework for allowing users to iteratively refine the topics discovered by models such as latent Dirichlet allocation (LDA) by adding constraints that enforce that sets of words must appear together in the same topic. We incorporate these constraints interactively by selectively removing elements in the state of a Markov Chain used for inference; we investigate a variety of methods for incorporating this information and demonstrate that these interactively added constraints improve topic usefulness for simulated and actual user sessions.
5 0.23066998 18 acl-2011-A Latent Topic Extracting Method based on Events in a Document and its Application
Author: Risa Kitajima ; Ichiro Kobayashi
Abstract: Recently, several latent topic analysis methods such as LSI, pLSI, and LDA have been widely used for text analysis. However, those methods basically assign topics to words, but do not account for the events in a document. With this background, in this paper, we propose a latent topic extracting method which assigns topics to events. We also show that our proposed method is useful to generate a document summary based on a latent topic.
6 0.20298098 287 acl-2011-Structural Topic Model for Latent Topical Structure Analysis
7 0.19708465 148 acl-2011-HITS-based Seed Selection and Stop List Construction for Bootstrapping
8 0.17970976 14 acl-2011-A Hierarchical Model of Web Summaries
9 0.17183052 98 acl-2011-Discovery of Topically Coherent Sentences for Extractive Summarization
10 0.14404052 19 acl-2011-A Mobile Touchable Application for Online Topic Graph Extraction and Exploration of Web Content
11 0.14051737 12 acl-2011-A Generative Entity-Mention Model for Linking Entities with Knowledge Base
12 0.12259737 54 acl-2011-Automatically Extracting Polarity-Bearing Topics for Cross-Domain Sentiment Classification
13 0.11962568 128 acl-2011-Exploring Entity Relations for Named Entity Disambiguation
14 0.11952467 21 acl-2011-A Pilot Study of Opinion Summarization in Conversations
15 0.11777646 109 acl-2011-Effective Measures of Domain Similarity for Parsing
16 0.11411578 305 acl-2011-Topical Keyphrase Extraction from Twitter
17 0.1108273 82 acl-2011-Content Models with Attitude
18 0.11075458 204 acl-2011-Learning Word Vectors for Sentiment Analysis
19 0.10886085 262 acl-2011-Relation Guided Bootstrapping of Semantic Lexicons
20 0.10446454 196 acl-2011-Large-Scale Cross-Document Coreference Using Distributed Inference and Hierarchical Models
topicId topicWeight
[(0, 0.261), (1, 0.181), (2, -0.11), (3, 0.14), (4, 0.031), (5, -0.114), (6, -0.115), (7, 0.191), (8, -0.095), (9, 0.065), (10, -0.063), (11, 0.04), (12, 0.141), (13, -0.065), (14, 0.257), (15, 0.031), (16, 0.019), (17, -0.088), (18, -0.055), (19, 0.029), (20, -0.052), (21, 0.101), (22, -0.039), (23, 0.054), (24, 0.01), (25, 0.037), (26, 0.057), (27, 0.148), (28, -0.071), (29, 0.058), (30, 0.064), (31, -0.009), (32, 0.02), (33, 0.052), (34, 0.044), (35, 0.085), (36, 0.082), (37, -0.031), (38, -0.016), (39, 0.018), (40, 0.02), (41, -0.01), (42, -0.109), (43, -0.005), (44, -0.034), (45, 0.051), (46, 0.104), (47, 0.002), (48, -0.044), (49, -0.005)]
simIndex simValue paperId paperTitle
same-paper 1 0.97964382 117 acl-2011-Entity Set Expansion using Topic information
Author: Kugatsu Sadamitsu ; Kuniko Saito ; Kenji Imamura ; Genichiro Kikui
Abstract: This paper proposes three modules based on latent topics of documents for alleviating “semantic drift” in bootstrapping entity set expansion. These new modules are added to a discriminative bootstrapping algorithm to realize topic feature generation, negative example selection and entity candidate pruning. In this study, we model latent topics with LDA (Latent Dirichlet Allocation) in an unsupervised way. Experiments show that the accuracy of the extracted entities is improved by 6.7 to 28.2% depending on the domain.
2 0.87436771 52 acl-2011-Automatic Labelling of Topic Models
Author: Jey Han Lau ; Karl Grieser ; David Newman ; Timothy Baldwin
Abstract: We propose a method for automatically labelling topics learned via LDA topic models. We generate our label candidate set from the top-ranking topic terms, titles of Wikipedia articles containing the top-ranking topic terms, and sub-phrases extracted from the Wikipedia article titles. We rank the label candidates using a combination of association measures and lexical features, optionally fed into a supervised ranking model. Our method is shown to perform strongly over four independent sets of topics, significantly better than a benchmark method.
3 0.85424006 178 acl-2011-Interactive Topic Modeling
Author: Yuening Hu ; Jordan Boyd-Graber ; Brianna Satinoff
Abstract: Topic models have been used extensively as a tool for corpus exploration, and a cottage industry has developed to tweak topic models to better encode human intuitions or to better model data. However, creating such extensions requires expertise in machine learning unavailable to potential end-users of topic modeling software. In this work, we develop a framework for allowing users to iteratively refine the topics discovered by models such as latent Dirichlet allocation (LDA) by adding constraints that enforce that sets of words must appear together in the same topic. We incorporate these constraints interactively by selectively removing elements in the state of a Markov Chain used for inference; we investigate a variety of methods for incorporating this information and demonstrate that these interactively added constraints improve topic usefulness for simulated and actual user sessions.
4 0.83990806 287 acl-2011-Structural Topic Model for Latent Topical Structure Analysis
Author: Hongning Wang ; Duo Zhang ; ChengXiang Zhai
Abstract: Topic models have been successfully applied to many document analysis tasks to discover topics embedded in text. However, existing topic models generally cannot capture the latent topical structures in documents. Since languages are intrinsically cohesive and coherent, modeling and discovering latent topical transition structures within documents would be beneficial for many text analysis tasks. In this work, we propose a new topic model, Structural Topic Model, which simultaneously discovers topics and reveals the latent topical structures in text through explicitly modeling topical transitions with a latent first-order Markov chain. Experiment results show that the proposed Structural Topic Model can effectively discover topical structures in text, and the identified structures significantly improve the performance of tasks such as sentence annotation and sentence ordering. ,
5 0.77271539 161 acl-2011-Identifying Word Translations from Comparable Corpora Using Latent Topic Models
Author: Ivan Vulic ; Wim De Smet ; Marie-Francine Moens
Abstract: A topic model outputs a set of multinomial distributions over words for each topic. In this paper, we investigate the value of bilingual topic models, i.e., a bilingual Latent Dirichlet Allocation model for finding translations of terms in comparable corpora without using any linguistic resources. Experiments on a document-aligned English-Italian Wikipedia corpus confirm that the developed methods which only use knowledge from word-topic distributions outperform methods based on similarity measures in the original word-document space. The best results, obtained by combining knowledge from wordtopic distributions with similarity measures in the original space, are also reported.
6 0.75579077 14 acl-2011-A Hierarchical Model of Web Summaries
7 0.73376155 305 acl-2011-Topical Keyphrase Extraction from Twitter
8 0.72013366 18 acl-2011-A Latent Topic Extracting Method based on Events in a Document and its Application
9 0.6764431 98 acl-2011-Discovery of Topically Coherent Sentences for Extractive Summarization
10 0.65416342 19 acl-2011-A Mobile Touchable Application for Online Topic Graph Extraction and Exploration of Web Content
11 0.56569439 251 acl-2011-Probabilistic Document Modeling for Syntax Removal in Text Summarization
12 0.54660833 148 acl-2011-HITS-based Seed Selection and Stop List Construction for Bootstrapping
13 0.51782489 84 acl-2011-Contrasting Opposing Views of News Articles on Contentious Issues
14 0.47813976 54 acl-2011-Automatically Extracting Polarity-Bearing Topics for Cross-Domain Sentiment Classification
15 0.47232309 109 acl-2011-Effective Measures of Domain Similarity for Parsing
16 0.44610059 222 acl-2011-Model-Portability Experiments for Textual Temporal Analysis
17 0.43739644 101 acl-2011-Disentangling Chat with Local Coherence Models
18 0.4190062 82 acl-2011-Content Models with Attitude
19 0.41882202 21 acl-2011-A Pilot Study of Opinion Summarization in Conversations
20 0.41346765 231 acl-2011-Nonlinear Evidence Fusion and Propagation for Hyponymy Relation Mining
topicId topicWeight
[(5, 0.029), (17, 0.057), (26, 0.025), (37, 0.102), (39, 0.087), (41, 0.074), (52, 0.161), (53, 0.014), (55, 0.023), (59, 0.043), (72, 0.033), (91, 0.089), (96, 0.173), (97, 0.015)]
simIndex simValue paperId paperTitle
Author: Miao Chen ; Klaus Zechner
Abstract: This paper focuses on identifying, extracting and evaluating features related to syntactic complexity of spontaneous spoken responses as part of an effort to expand the current feature set of an automated speech scoring system in order to cover additional aspects considered important in the construct of communicative competence. Our goal is to find effective features, selected from a large set of features proposed previously and some new features designed in analogous ways from a syntactic complexity perspective that correlate well with human ratings of the same spoken responses, and to build automatic scoring models based on the most promising features by using machine learning methods. On human transcriptions with manually annotated clause and sentence boundaries, our best scoring model achieves an overall Pearson correlation with human rater scores of r=0.49 on an unseen test set, whereas correlations of models using sentence or clause boundaries from automated classifiers are around r=0.2. 1
2 0.86801457 170 acl-2011-In-domain Relation Discovery with Meta-constraints via Posterior Regularization
Author: Harr Chen ; Edward Benson ; Tahira Naseem ; Regina Barzilay
Abstract: We present a novel approach to discovering relations and their instantiations from a collection of documents in a single domain. Our approach learns relation types by exploiting meta-constraints that characterize the general qualities of a good relation in any domain. These constraints state that instances of a single relation should exhibit regularities at multiple levels of linguistic structure, including lexicography, syntax, and document-level context. We capture these regularities via the structure of our probabilistic model as well as a set of declaratively-specified constraints enforced during posterior inference. Across two domains our approach successfully recovers hidden relation structure, comparable to or outperforming previous state-of-the-art approaches. Furthermore, we find that a small , set of constraints is applicable across the domains, and that using domain-specific constraints can further improve performance. 1
same-paper 3 0.86575341 117 acl-2011-Entity Set Expansion using Topic information
Author: Kugatsu Sadamitsu ; Kuniko Saito ; Kenji Imamura ; Genichiro Kikui
Abstract: This paper proposes three modules based on latent topics of documents for alleviating “semantic drift” in bootstrapping entity set expansion. These new modules are added to a discriminative bootstrapping algorithm to realize topic feature generation, negative example selection and entity candidate pruning. In this study, we model latent topics with LDA (Latent Dirichlet Allocation) in an unsupervised way. Experiments show that the accuracy of the extracted entities is improved by 6.7 to 28.2% depending on the domain.
4 0.83138013 241 acl-2011-Parsing the Internal Structure of Words: A New Paradigm for Chinese Word Segmentation
Author: Zhongguo Li
Abstract: Lots of Chinese characters are very productive in that they can form many structured words either as prefixes or as suffixes. Previous research in Chinese word segmentation mainly focused on identifying only the word boundaries without considering the rich internal structures of many words. In this paper we argue that this is unsatisfying in many ways, both practically and theoretically. Instead, we propose that word structures should be recovered in morphological analysis. An elegant approach for doing this is given and the result is shown to be promising enough for encouraging further effort in this direction. Our probability model is trained with the Penn Chinese Treebank and actually is able to parse both word and phrase structures in a unified way. 1 Why Parse Word Structures? Research in Chinese word segmentation has progressed tremendously in recent years, with the state of the art performing at around 97% in precision and recall (Xue, 2003; Gao et al., 2005; Zhang and Clark, 2007; Li and Sun, 2009). However, virtually all these systems focus exclusively on recognizing the word boundaries, giving no consideration to the internal structures of many words. Though it has been the standard practice for many years, we argue that this paradigm is inadequate both in theory and in practice, for at least the following four reasons. The first reason is that if we confine our definition of word segmentation to the identification of word boundaries, then people tend to have divergent opinions as to whether a linguistic unit is a word or not (Sproat et al., 1996). This has led to many different annotation standards for Chinese word segmentation. Even worse, this could cause inconsistency in the same corpus. For instance, 䉂 擌 奒 ‘vice president’ is considered to be one word in the Penn Chinese Treebank (Xue et al., 2005), but is split into two words by the Peking University corpus in the SIGHAN Bakeoffs (Sproat and Emerson, 2003).
Meanwhile, 䉂 䀓 惼 ‘vice director’ and 䉂 䚲䡮 ‘deputy manager’ are both segmented into two words in the same Penn Chinese Treebank. In fact, all these words are composed of the prefix 䉂 ‘vice’ and a root word. Thus the structure of 䉂擌奒 ‘vice president’ can be represented with the tree in Figure 1 (Figure 1: Example of a word with internal structure). Without a doubt, there is complete agreement on the correctness of this structure among native Chinese speakers. So if instead of annotating only word boundaries, we annotate the structures of every word, then the annotation tends to be more consistent and there could be less duplication of efforts in developing the expensive annotated corpus. (A note on terminology: since there is no universally accepted definition of the “word” concept in linguistics, and especially in Chinese, the term “word” here may mean a linguistic unit such as 䉂 擌奒 ‘vice president’, whose structure is shown as the tree in Figure 1, or a smaller unit such as 擌奒 ‘president’, which is a substructure of that tree; hopefully the context will always make it clear which is being referred to.) The second reason is that applications have different requirements for the granularity of words. Take the personal name 撱 嗤吼 ‘Zhou Shuren’ as an example. It is considered to be one word in the Penn Chinese Treebank, but is segmented into a surname and a given name in the Peking University corpus. For some applications such as information extraction, the former segmentation is adequate, while for others like machine translation, the latter, finer-grained output is preferable. If the analyzer can produce a structure as shown in Figure 4(a), then every application can extract what it needs from this tree.
A solution with tree output like this is more elegant than approaches which try to meet the needs of different applications in post-processing (Gao et al., 2004). The third reason is that traditional word segmentation has problems in handling many phenomena in Chinese. For example, the telescopic compound 㦌 撥 怂惆 ‘universities, middle schools and primary schools’ is in fact composed of three coordinating elements 㦌惆 ‘university’, 撥惆 ‘middle school’ and 怂惆 ‘primary school’. Regarding it as one flat word loses this important information. Another example is separable words like 扩扙 ‘swim’. With a linear segmentation, the meaning of ‘swimming’ as in 扩 堑 扙 ‘after swimming’ cannot be properly represented, since 扩扙 ‘swim’ will be segmented into discontinuous units. These language usages lie at the boundary between syntax and morphology, and are not uncommon in Chinese. They can be adequately represented with trees (Figure 2: Example of telescopic compound (a) and separable word (b)). The last reason why we should care about word structures is related to head-driven statistical parsers (Collins, 2003). To illustrate this, note that in the Penn Chinese Treebank, the word 戽 䊂䠽 吼 ‘English People’ does not occur at all. Hence constituents headed by such words could cause some difficulty for head-driven models, in which out-of-vocabulary words need to be treated specially both when they are generated and when they are conditioned upon. But this word is in turn headed by its suffix 吼 ‘people’, and there are 2,233 such words in the Penn Chinese Treebank. If we annotate the structure of every compound containing this suffix (e.g. Figure 3), such data sparsity simply goes away.
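The granularity argument above ("every application can extract what it needs from this tree") can be illustrated with a tiny sketch. This is not the paper's parser; it just represents a word's internal structure as a nested tuple, using the romanized 'Zhou Shuren' surname-plus-given-name example, with invented labels:

```python
def leaves(tree):
    """Fine-grained segmentation: all leaf morphemes, left to right."""
    if isinstance(tree, str):
        return [tree]
    out = []
    for child in tree[1:]:          # tree = (label, child1, child2, ...)
        out.extend(leaves(child))
    return out

def surface(tree):
    """Coarse-grained segmentation: the whole unit as one word."""
    return "".join(leaves(tree))

# 'Zhou Shuren' = surname + given name; labels NR/NF/NG are hypothetical.
name = ("NR", ("NF", "Zhou"), ("NG", "Shuren"))
print(surface(name))   # one word, as information extraction might want
print(leaves(name))    # two units, as machine translation might prefer
```

One annotated tree thus serves both the Penn Chinese Treebank convention (one word) and the Peking University convention (surname + given name) without post-processing.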
5 0.81802976 58 acl-2011-Beam-Width Prediction for Efficient Context-Free Parsing
Author: Nathan Bodenstab ; Aaron Dunlop ; Keith Hall ; Brian Roark
Abstract: Efficient decoding for syntactic parsing has become a necessary research area as statistical grammars grow in accuracy and size and as more NLP applications leverage syntactic analyses. We review prior methods for pruning and then present a new framework that unifies their strengths into a single approach. Using a log linear model, we learn the optimal beam-search pruning parameters for each CYK chart cell, effectively predicting the most promising areas of the model space to explore. We demonstrate that our method is faster than coarse-to-fine pruning, exemplified in both the Charniak and Berkeley parsers, by empirically comparing our parser to the Berkeley parser using the same grammar and under identical operating conditions.
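The core idea in this abstract, predicting a per-cell beam width from a learned model rather than using one global beam, can be sketched in a few lines. This is a hedged toy version, not the paper's parser: the cell features, weights, and constituent scores below are all invented:

```python
import math

def predict_beam_width(features, weights, max_beam=8):
    """Map chart-cell features to an integer beam width in [1, max_beam]."""
    z = sum(w * f for w, f in zip(weights, features))
    p = 1.0 / (1.0 + math.exp(-z))          # logistic squashing
    return max(1, round(p * max_beam))

def prune_cell(entries, beam_width):
    """Keep only the beam_width highest-scoring (label, score) entries."""
    return sorted(entries, key=lambda e: e[1], reverse=True)[:beam_width]

# A hypothetical cell with a wide relative span gets a narrow beam.
features = [1.0, 0.9]                       # e.g. bias, relative span width
weights = [-1.0, -2.0]
k = predict_beam_width(features, weights)
cell = [("NP", -2.1), ("VP", -5.3), ("S", -1.4), ("PP", -7.9)]
print(k, prune_cell(cell, k))
```

The payoff the abstract claims comes from exactly this asymmetry: cells the model judges unpromising keep very few entries, while promising regions of the chart keep a wide beam.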
6 0.81609154 324 acl-2011-Unsupervised Semantic Role Induction via Split-Merge Clustering
7 0.81573677 137 acl-2011-Fine-Grained Class Label Markup of Search Queries
8 0.81513637 145 acl-2011-Good Seed Makes a Good Crop: Accelerating Active Learning Using Language Modeling
9 0.81357563 126 acl-2011-Exploiting Syntactico-Semantic Structures for Relation Extraction
10 0.81221849 119 acl-2011-Evaluating the Impact of Coder Errors on Active Learning
11 0.81139719 28 acl-2011-A Statistical Tree Annotator and Its Applications
12 0.8106972 209 acl-2011-Lexically-Triggered Hidden Markov Models for Clinical Document Coding
13 0.80850852 202 acl-2011-Learning Hierarchical Translation Structure with Linguistic Annotations
14 0.8077476 108 acl-2011-EdIt: A Broad-Coverage Grammar Checker Using Pattern Grammar
15 0.80705047 300 acl-2011-The Surprising Variance in Shortest-Derivation Parsing
17 0.80446589 284 acl-2011-Simple Unsupervised Grammar Induction from Raw Text with Cascaded Finite State Models
18 0.80416727 190 acl-2011-Knowledge-Based Weak Supervision for Information Extraction of Overlapping Relations
19 0.80340457 86 acl-2011-Coreference for Learning to Extract Relations: Yes Virginia, Coreference Matters
20 0.80306792 128 acl-2011-Exploring Entity Relations for Named Entity Disambiguation