emnlp emnlp2010 emnlp2010-112 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Tara McIntosh
Abstract: Multi-category bootstrapping algorithms were developed to reduce semantic drift. By extracting multiple semantic lexicons simultaneously, a category’s search space may be restricted. The best results have been achieved through reliance on manually crafted negative categories. Unfortunately, identifying these categories is non-trivial, and their use shifts the unsupervised bootstrapping paradigm towards a supervised framework. We present NEG-FINDER, the first approach for discovering negative categories automatically. NEG-FINDER exploits unsupervised term clustering to generate multiple negative categories during bootstrapping. Our algorithm effectively removes the necessity of manual intervention and formulation of negative categories, with performance closely approaching that obtained using negative categories defined by a domain expert.
Reference: text
sentIndex sentText sentNum sentScore
1 Unsupervised discovery of negative categories in lexicon bootstrapping Tara McIntosh NICTA Victoria Research Lab Dept of Computer Science and Software Engineering University of Melbourne nlp@taramcintosh. [sent-1, score-1.257]
2 The best results have been achieved through reliance on manually crafted negative categories. [sent-4, score-0.644]
3 We present NEG-FINDER, the first approach for discovering negative categories automatically. [sent-6, score-0.858]
4 NEG-FINDER exploits unsupervised term clustering to generate multiple negative categories during bootstrapping. [sent-7, score-0.996]
5 Our algorithm effectively removes the necessity of manual intervention and formulation of negative categories, with performance closely approaching that obtained using negative categories defined by a domain expert. [sent-8, score-1.42]
6 Unfortunately, semantic drift often occurs when ambiguous or erroneous terms and/or patterns are introduced into the iterative process (Curran et al. [sent-17, score-0.672]
7 In multi-category bootstrapping, semantic drift is often reduced when the target categories compete with each other for terms and/or patterns (Yangarber et al. [sent-19, score-1.133]
8 To ensure this, manually crafted negative categories are introduced (Lin et al. [sent-22, score-1.05]
9 The design of negative categories is a very time-consuming task. [sent-26, score-0.837]
10 It typically requires a domain expert to identify the semantic drift and its cause, followed by a significant amount of trial and error in order to select the most suitable combination of negative categories. [sent-27, score-1.152]
11 We show that although excellent performance is achieved using negative categories, it varies greatly depending on the negative categories selected. [sent-29, score-1.3]
12 This highlights the difficulty of crafting negative categories and thus the necessity for tools that can automatically identify them. [sent-30, score-0.935]
13 negative categories automatically. [sent-33, score-0.837]
14 During bootstrapping, efficient clustering techniques are applied to sets of drifted candidate terms to generate new negative categories. [sent-34, score-0.905]
15 Once a negative category is identified, it is incorporated into the subsequent iterations, whereby it provides the necessary semantic boundaries for the target categories. [sent-35, score-0.863]
16 NEG-FINDER significantly outperforms bootstrapping prior to the introduction of the domain expert’s negative categories. [sent-37, score-0.717]
17 Our methods effectively remove the necessity of manual intervention and formulation of negative categories in semantic lexicon bootstrapping. [sent-39, score-1.128]
18 This often causes semantic drift when a lexicon’s intended meaning shifts into another category during bootstrapping (Curran et al. [sent-45, score-0.876]
19 , 2002), and WMEB (McIntosh and Curran, 2008), aim to reduce semantic drift by extracting multiple semantic categories simultaneously. [sent-49, score-1.019]
20 1 Weighted MEB In Weighted Mutual Exclusion Bootstrapping (WMEB, McIntosh and Curran, 2008), multiple semantic categories iterate simultaneously between the term and pattern extraction phases, competing with each other for terms and patterns. [sent-54, score-0.689]
21 Semantic drift is reduced by forcing the categories to be mutually exclusive. [sent-55, score-0.769]
22 To ensure mutual exclusion between the categories, candidate patterns that are identified by multiple categories in an iteration are excluded. [sent-60, score-0.67]
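The exclusion rule just described admits a compact illustration. Below is a minimal Python sketch, not the paper's implementation: the category names and toy patterns are invented, and each category is assumed to have already proposed its candidate patterns for the current iteration.

```python
from collections import Counter

def mutually_exclusive_patterns(candidates_by_category):
    """Drop any candidate pattern proposed by more than one category
    in the current iteration, enforcing mutual exclusion."""
    counts = Counter(p for pats in candidates_by_category.values()
                     for p in set(pats))
    return {cat: {p for p in pats if counts[p] == 1}
            for cat, pats in candidates_by_category.items()}

candidates = {
    "CELL": {"cells such as X", "X cell line"},
    "DISEASE": {"patients with X", "cells such as X"},  # contested pattern
}
print(mutually_exclusive_patterns(candidates))
# "cells such as X" is excluded from both CELL and DISEASE
```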
23 2 Detecting semantic drift in WMEB In McIntosh and Curran (2009), we showed that multi-category bootstrappers are still prone to semantic drift in the later iterations. [sent-72, score-0.9]
24 We proposed a drift detection metric based on our hypothesis that semantic drift occurs when a candidate term is more similar to the recently added terms than to the seed and high precision terms extracted in the earlier iterations. [sent-73, score-1.297]
25 The drift metric is defined as the ratio of the average distributional similarity of the candidate term to the first n terms extracted into the lexicon L, and to the last m terms extracted in the previous iterations: drift(term, n, m) = avgsim(term, L_{1..n}) / avgsim(term, L_{(N−m+1)..N}). [sent-75, score-0.892]
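Restated in code, the metric is simple to compute. The sketch below assumes each term is represented as a sparse dictionary of context features and uses cosine similarity as a stand-in for the paper's distributional similarity; a low score indicates the candidate is closer to the recently added terms than to the early high-precision ones, i.e., likely drift.

```python
import math

def cosine(u, v):
    # u, v: dicts mapping context feature -> weight
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def avg_sim(vec, vecs):
    return sum(cosine(vec, v) for v in vecs) / len(vecs)

def drift(term_vec, lexicon_vecs, n, m):
    head = lexicon_vecs[:n]    # seeds and early high-precision terms
    tail = lexicon_vecs[-m:]   # terms extracted in recent iterations
    denom = avg_sim(term_vec, tail)
    return avg_sim(term_vec, head) / denom if denom else float("inf")
```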
26 3 Negative categories In multi-category bootstrapping, improvements in precision arise when semantic boundaries between multiple target categories are established. [sent-82, score-0.908]
27 Unfortunately, it is difficult to predict if a target category will suffer from semantic drift and/or whether it will naturally compete with the other target categories. [sent-84, score-0.755]
28 Once a domain expert establishes semantic drift and its possible cause, a set of negative/stop categories that may be of no direct interest is manually crafted to prevent semantic drift. [sent-85, score-1.306]
29 These additional categories are then exploited during another round of bootstrapping to provide further competition for the target categories (Lin et al. [sent-86, score-1.035]
30 (2003) improved NOMEN’s performance for extracting diseases and locations from the ProMED corpus by incorporating negative categories into the bootstrapping process. [sent-90, score-1.075]
31 This single negative category resulted in substantial improvements in precision. [sent-92, score-0.626]
32 In their final experiment, six negative categories that were notable sources of semantic drift were identified, and the inclusion of these led to further performance improvements (∼20%). [sent-93, score-1.342]
33 (2007) and McIntosh (2010) manually crafted negative categories that were necessary to prevent semantic drift. [sent-96, score-1.128]
34 In particular, in McIntosh (2010), a biomedical expert spent considerable time (∼15 days) and effort identifying potential negative categories and subsequently optimising their associated seeds in trial and error bootstrapping runs. [Figure 1: NEG-FINDER: Local negative discovery] [sent-97, score-1.843]
35 By introducing manually crafted negative categories, a significant amount of expert domain knowledge is introduced. [sent-98, score-0.69]
36 To discover negative categories during bootstrapping, NEG-FINDER must identify a representative cluster of the drifted terms. [sent-105, score-1.156]
37 In this section, we present the two types of clustering used (maximum and outlier), and our three different levels of negative discovery (local, global and mixture). [sent-106, score-0.697]
38 1 Discovering negative categories We have observed that semantic drift begins to dominate when clusters of incorrect terms with similar meanings are extracted. [sent-108, score-1.451]
39 In NEG-FINDER, these drifted terms are cached as they may provide adequate seed terms for new negative categories. [sent-111, score-0.956]
40 However, the drifted terms can also include scattered polysemous or correct terms that share little similarity with the other drifted terms. [sent-112, score-0.646]
41 Therefore, simply using the first set of drifted terms to establish a negative category is likely to introduce noise rather than a cohesive competing category. [sent-113, score-0.975]
42 To discover negative categories, we exploit hierarchical clustering to group similar terms within the cache of drifted terms. [sent-114, score-1.022]
43 To ensure adequate coverage of the possible drifting topics, negative discovery and hence clustering is only performed when the drift cache consists of at least 20 terms. [sent-120, score-1.295]
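A minimal sketch of this discovery step follows, assuming dense term vectors and using SciPy's average-link hierarchical clustering. Interpreting "maximum" clustering as taking the largest cluster is an assumption, as is the 0.6 distance threshold, while the 20-term trigger and 5-seed minimum match the thresholds stated in the experimental setup below.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

MIN_CACHE = 20  # cluster only once the drift cache holds at least 20 terms
MIN_SEEDS = 5   # seed terms required to instantiate a negative category

def discover_negative_seeds(cache_terms, cache_vectors, threshold=0.6):
    """Cluster the drift cache and propose the largest cluster as the
    seed set for a new negative category."""
    if len(cache_terms) < MIN_CACHE:
        return None
    dists = pdist(np.asarray(cache_vectors, dtype=float), metric="cosine")
    labels = fcluster(linkage(dists, method="average"),
                      t=threshold, criterion="distance")
    biggest = max(set(labels), key=lambda l: int(np.sum(labels == l)))
    seeds = [t for t, l in zip(cache_terms, labels) if l == biggest]
    return seeds if len(seeds) >= MIN_SEEDS else None
```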
44 In our next clustering method, we aim to form a negative category with as little similarity to the target seeds as possible. [sent-125, score-0.794]
45 We use an outlier clustering strategy, in which the drifted term t with the least average distributional similarity to the first n terms in the lexicon must be contained in the cluster of seeds. [sent-126, score-0.694]
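A sketch of this outlier variant, under the same assumptions as the previous block (dense vectors, and cluster labels produced by a prior hierarchical-clustering step over the drift cache):

```python
import numpy as np

def avg_cosine(v, mat):
    v = np.asarray(v, dtype=float)
    mat = np.asarray(mat, dtype=float)
    sims = mat @ v / (np.linalg.norm(mat, axis=1) * np.linalg.norm(v) + 1e-12)
    return float(sims.mean())

def outlier_cluster(cache_terms, cache_vectors, labels, lexicon_head_vecs):
    """Return the cluster containing the drifted term with the least
    average similarity to the first n lexicon terms (lexicon_head_vecs)."""
    outlier = min(range(len(cache_terms)),
                  key=lambda i: avg_cosine(cache_vectors[i], lexicon_head_vecs))
    chosen = labels[outlier]
    return [t for t, l in zip(cache_terms, labels) if l == chosen]
```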
46 3 Incorporating the negative category After a cluster of negative seed terms is established, the drift cache is cleared, and a new negative category is created and introduced into the iterative bootstrapping process in the next iteration. [sent-130, score-2.692]
47 The negative categories can compete with all other categories, including any previously introduced negative categories; however, the negative categories do not contribute to the drift caches. [sent-132, score-2.569]
48 For this, the complete set of extraction patterns matching any of the negative seeds is considered and ranked with respect to the seeds. [sent-134, score-0.645]
49 The top scoring patterns are considered sequentially until m patterns are assigned to the new negative category. [sent-135, score-0.627]
50 To ensure mutual exclusion between the new category and the target categories, a candidate pattern that has previously been selected by a target category cannot be used to extract terms for either category in the subsequent iterations. [sent-136, score-0.875]
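The seeding-and-exclusion step can be sketched as follows. The helper names here (pattern_index, score_pattern, claimed_patterns) are illustrative assumptions: score_pattern stands in for whatever pattern-reliability score the bootstrapper uses, and claimed_patterns holds every pattern already selected by a target category, so contested patterns are never assigned to the new negative category.

```python
def seed_negative_patterns(neg_seeds, pattern_index, claimed_patterns,
                           score_pattern, m=5):
    """pattern_index: term -> set of patterns that extract it.
    Rank all patterns matching any negative seed, then take the top m
    that no target category has already claimed."""
    candidates = set()
    for term in neg_seeds:
        candidates |= pattern_index.get(term, set())
    ranked = sorted(candidates,
                    key=lambda p: score_pattern(p, neg_seeds), reverse=True)
    chosen = []
    for p in ranked:                 # consider top patterns sequentially
        if p in claimed_patterns:    # mutual exclusion with target categories
            continue
        chosen.append(p)
        if len(chosen) == m:
            break
    return chosen
```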
51 4 Levels of negative discovery Negative category discovery can be performed at a local or global level, or as a mixture of both. [sent-138, score-1.004]
52 In local discovery, each target category has its own drifted term cache and can generate negative categories irrespective of the other target categories. [sent-139, score-1.55]
53 The drifted terms (shaded) are extracted away from the lexicon into the local drift cache, which is then clustered. [sent-141, score-0.872]
54 Target categories can also generate multiple negative categories across different iterations. [sent-143, score-0.86]
55 In global discovery, all drifted terms are pooled into a global cache, from which a single negative category can be identified in an iteration. [sent-144, score-1.059]
56 may be drifting into similar semantic categories, and enables these otherwise missed negative categories to be established. [sent-147, score-1.021]
57 In the mixture discovery method, both global and local negative categories can be formed. [sent-148, score-1.11]
58 Once a local negative category is formed, the terms within the local cache are cleared and also removed from the global cache. [sent-151, score-0.998]
59 This prevents multiple negative categories being instantiated with overlapping seed terms. [sent-152, score-0.959]
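This cache bookkeeping is easy to make concrete. Below is a minimal sketch assuming one drift cache per target category plus a shared global pool; the class and method names are invented for illustration.

```python
class DriftCaches:
    """One drift cache per target category plus a shared global pool."""
    def __init__(self, target_categories):
        self.local = {c: [] for c in target_categories}
        self.global_pool = []

    def add(self, category, term):
        self.local[category].append(term)
        self.global_pool.append(term)

    def clear_local(self, category):
        # A local negative category was formed: empty that cache and
        # remove its terms from the global pool so no later category
        # is seeded from overlapping terms.
        used = set(self.local[category])
        self.local[category] = []
        self.global_pool = [t for t in self.global_pool if t not in used]

caches = DriftCaches(["CELL", "DISEASE"])
caches.add("CELL", "liver")
caches.add("DISEASE", "mouse")
caches.clear_local("CELL")
print(caches.global_pool)  # ['mouse']
```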
60 4 Experimental setup To compare the effectiveness of our negative discovery approaches we consider the task of extracting biomedical semantic lexicons from raw text. [sent-153, score-0.81]
61 2 Semantic categories The semantic categories we extract from MEDLINE were inspired by the TREC Genomics entities (Hersh et al. [sent-166, score-0.858]
62 3 Negative categories In our experiments, we use two different sets of negative categories. [sent-171, score-0.837]
63 The first set corresponds to those used in McIntosh and Curran (2008), and were identified by a domain expert as common sources of semantic drift in preliminary experiments with MEB and WMEB. [sent-173, score-0.681]
64 The ANIMAL and BODY PART categories were formed with the intention of preventing drift in the CELL, DISE and SIGN categories. [sent-175, score-0.809]
65 The ORGANISM category was then created to reduce the new drift forming in the DISE category after the first set of negative categories were introduced. [sent-176, score-1.558]
66 The second set of negative categories was identified by an independent domain expert with limited knowledge of NLP and bootstrapping. [Table residue: CATEGORY / SEED TERMS rows, de-interleaved as 1 AMINO ACID: arginine, cysteine, glycine, glutamate, histamine; ANIMAL: insect, male, mouse, rats] [sent-177, score-1.013]
67 Unless otherwise stated, no hand-picked negative categories are used. [sent-202, score-0.837]
68 To ensure infrequent terms are not used to seed negative categories, drifted terms must occur at least 50 times to be retained in the drift cache. [sent-208, score-1.383]
69 Negative category discovery is only initiated when the drifted cache contains at least 20 terms, and a minimum of 5 terms are used to seed a negative category. [sent-209, score-1.304]
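Gathered in one place, these thresholds read as a small configuration sketch. The values are those stated above; the dataclass packaging itself is illustrative.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NegFinderThresholds:
    min_term_freq: int = 50   # corpus frequency to retain a drifted term
    min_cache_size: int = 20  # cache size before discovery is attempted
    min_seed_terms: int = 5   # seed terms needed per negative category

def retain_in_cache(term_freq, t=NegFinderThresholds()):
    return term_freq >= t.min_term_freq
```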
70 1 Influence of negative categories In our first experiments, we investigate the performance variations and improvements gained using negative categories selected by two independent domain experts. [sent-217, score-1.742]
71 Table 4 shows WMEB-DRIFT’s average precision over the 10 target categories with and without the two negative category sets. [sent-218, score-1.05]
72 This demonstrates the difficulty of selecting appropriate negative categories and seeds for the task, and in turn the necessity for tools to discover them automatically. [sent-220, score-1.008]
73 3 The first discovery approach corresponds to the naïve NEG-FINDER system that generates local negative categories from the first five drifted terms. [sent-225, score-1.201]
74 Compared to local discovery, global discovery is capable of detecting new negative categories earlier, and the categories it detects are more [footnote 3: Statistical significance was tested using computationally intensive randomisation tests (Cohen, 1995).] [sent-230, score-1.396]
75 The NEG-FINDER mixture approach, which benefits from both local and global discovery, identifies the most useful negative categories. [sent-232, score-0.631]
76 Table 6 shows the seven discovered categories: two local negative categories from CELL and TUMOUR, and five global categories were formed. [sent-233, score-1.754]
77 These results demonstrate that suitable negative categories can be identified and exploited during bootstrapping. [sent-239, score-0.906]
78 3 Boosting hand-picked negative categories In our next set of experiments, we investigate whether NEG-FINDER can improve state-of-the-art performance by identifying new negative categories in addition to the manually selected negative categories. [Table 8 residue: WMEB +negative and NEG-FINDER +local/+global/+mixture performance figures, garbled in extraction]
79 Both NEG-FINDER and WMEB-DRIFT are initialised with the 10 target categories and the first set of negative categories. [sent-245, score-0.931]
80 4 Restarting with new negative categories The performance improvements so far using NEG-FINDER have been limited by the time at which new negative categories are discovered and incorporated into the bootstrapping process. [sent-250, score-2.008]
81 That is, system improvements can only be gained from the negative categories after they are generated. [sent-251, score-0.859]
82 For example, in Local NEG-FINDER, five negative categories are discovered in iterations 83, 85, 126, 130 and 150. [sent-252, score-0.926]
83 On the other hand, in the WMEB-DRIFT +negative experiments (Table 8 row 2), the hand-picked negative categories can start competing with the target categories in the very first iteration of bootstrapping. [sent-253, score-1.303]
84 Table 7 shows the average precision of WMEB-DRIFT over the 10 target categories when it is restarted with the new negative categories discovered from our three approaches (using maximum clustering). [sent-255, score-1.35]
85 Over the first 200 terms, significant improvements are gained using the new negative categories (+6%). [sent-256, score-0.859]
86 However, the manually selected categories are far superior in preventing drift (+11%). [sent-257, score-0.798]
87 This may be attributed to the target categories not strongly drifting into the new negative categories until the later stages, whereas the hand-picked categories were selected on the basis of observed drift in the early stages (over the first 500 terms). [sent-258, score-2.104]
88 Table 7 shows that each of the discovered negative sets can significantly outperform the negative categories selected by a domain expert (negative set 2) (+0. [sent-266, score-1.525]
89 The discovered negative categories are more effective than the manually crafted sets in reducing semantic drift in the ANTIBODY, CELL and DISEASE lexicons. [sent-279, score-1.612]
90 NEG-FINDER also significantly boosts the performance of the original negative categories by identifying additional negative categories (row 5). [sent-328, score-1.674]
91 Our final experiment, where WMEB-DRIFT is re-initialised with the negative categories discovered by NEG-FINDER, further demonstrates the utility of our method. [sent-329, score-0.926]
92 On average, the discovered negative categories significantly outperform the manually crafted negative categories. [sent-330, score-1.57]
93 6 Conclusion In this paper, we have proposed the first completely unsupervised approach to identifying the negative categories that are necessary for bootstrapping large yet precise semantic lexicons. [sent-331, score-1.177]
94 Prior to this work, negative categories were manually crafted by a domain expert, undermining the advantages of an unsupervised bootstrapping paradigm. [sent-332, score-1.294]
95 We intend to use sophisticated clustering methods, such as CBC (Pantel, 2003), to identify multiple negative categories across the target categories in a single iteration. [sent-334, score-1.37]
96 Our initial analysis demonstrated that although excellent performance is achieved using negative categories, large performance variations occur when using categories crafted by different domain experts. [sent-336, score-1.035]
97 8 [Table 10: Random seed results] negative categories during bootstrapping. [sent-349, score-0.938]
98 NEG-FINDER identifies cohesive negative categories, and many of these are semantically similar to those identified by domain experts. [sent-350, score-0.923]
99 NEG-FINDER significantly outperforms the state-of-the-art algorithm WMEB-DRIFT, before negative categories are crafted, by up to 5. [sent-351, score-0.837]
100 The newly discovered categories can also be fully exploited in bootstrapping, where they successfully outperform a domain expert’s negative categories and approach that of another expert. [sent-354, score-1.375]
wordName wordTfidf (topN-words)
[('negative', 0.463), ('drift', 0.395), ('categories', 0.374), ('drifted', 0.222), ('mcintosh', 0.222), ('bootstrapping', 0.208), ('category', 0.163), ('crafted', 0.152), ('cache', 0.14), ('curran', 0.129), ('semantic', 0.11), ('lexicon', 0.107), ('discovery', 0.105), ('seed', 0.101), ('expert', 0.09), ('discovered', 0.089), ('wmeb', 0.089), ('mixture', 0.088), ('clustering', 0.086), ('terms', 0.085), ('patterns', 0.082), ('restart', 0.076), ('drifting', 0.074), ('tara', 0.074), ('seeds', 0.07), ('exclusion', 0.063), ('outlier', 0.063), ('yangarber', 0.059), ('lexicons', 0.057), ('riloff', 0.054), ('term', 0.051), ('target', 0.05), ('necessity', 0.049), ('candidate', 0.049), ('cluster', 0.048), ('domain', 0.046), ('biomedical', 0.045), ('initialised', 0.044), ('global', 0.043), ('medline', 0.042), ('competing', 0.042), ('formed', 0.04), ('cell', 0.04), ('identified', 0.04), ('dise', 0.038), ('australian', 0.038), ('local', 0.037), ('compete', 0.037), ('incorporated', 0.037), ('distributional', 0.036), ('reliability', 0.034), ('similarity', 0.032), ('ensure', 0.032), ('pool', 0.031), ('extracting', 0.03), ('mutual', 0.03), ('bootstrapper', 0.03), ('cleared', 0.03), ('genomics', 0.03), ('grover', 0.03), ('meb', 0.03), ('negfinder', 0.03), ('nicta', 0.03), ('nomen', 0.03), ('organism', 0.03), ('ravichandran', 0.03), ('tumour', 0.03), ('tumr', 0.03), ('winston', 0.03), ('manually', 0.029), ('exploited', 0.029), ('carlson', 0.028), ('jones', 0.028), ('pattern', 0.027), ('ellen', 0.027), ('discover', 0.026), ('tools', 0.026), ('extracted', 0.026), ('thelen', 0.025), ('roman', 0.025), ('trial', 0.025), ('hersh', 0.025), ('intervention', 0.025), ('initiated', 0.025), ('pantel', 0.025), ('clusters', 0.024), ('female', 0.023), ('randomised', 0.023), ('gories', 0.023), ('identify', 0.023), ('unsupervised', 0.022), ('lin', 0.022), ('sign', 0.022), ('phase', 0.022), ('gained', 0.022), ('instantiated', 0.021), ('animal', 0.021), ('xml', 0.021), ('minimally', 0.021), ('discovering', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999893 112 emnlp-2010-Unsupervised Discovery of Negative Categories in Lexicon Bootstrapping
Author: Tara McIntosh
Abstract: Multi-category bootstrapping algorithms were developed to reduce semantic drift. By extracting multiple semantic lexicons simultaneously, a category’s search space may be restricted. The best results have been achieved through reliance on manually crafted negative categories. Unfortunately, identifying these categories is non-trivial, and their use shifts the unsupervised bootstrapping paradigm towards a supervised framework. We present NEG-FINDER, the first approach for discovering negative categories automatically. NEG-FINDER exploits unsupervised term clustering to generate multiple negative categories during bootstrapping. Our algorithm effectively removes the necessity of manual intervention and formulation of negative categories, with performance closely approaching that obtained using negative categories defined by a domain expert.
2 0.19865093 85 emnlp-2010-Negative Training Data Can be Harmful to Text Classification
Author: Xiao-Li Li ; Bing Liu ; See-Kiong Ng
Abstract: This paper studies the effects of training data on binary text classification and postulates that negative training data is not needed and may even be harmful for the task. Traditional binary classification involves building a classifier using labeled positive and negative training examples. The classifier is then applied to classify test instances into positive and negative classes. A fundamental assumption is that the training and test data are identically distributed. However, this assumption may not hold in practice. In this paper, we study a particular problem where the positive data is identically distributed but the negative data may or may not be so. Many practical text classification and retrieval applications fit this model. We argue that in this setting negative training data should not be used, and that PU learning can be employed to solve the problem. Empirical evaluation has been conducted to support our claim. This result is important as it may fundamentally change the current binary classification paradigm.
3 0.12192332 27 emnlp-2010-Clustering-Based Stratified Seed Sampling for Semi-Supervised Relation Classification
Author: Longhua Qian ; Guodong Zhou
Abstract: Seed sampling is critical in semi-supervised learning. This paper proposes a clusteringbased stratified seed sampling approach to semi-supervised learning. First, various clustering algorithms are explored to partition the unlabeled instances into different strata with each stratum represented by a center. Then, diversity-motivated intra-stratum sampling is adopted to choose the center and additional instances from each stratum to form the unlabeled seed set for an oracle to annotate. Finally, the labeled seed set is fed into a bootstrapping procedure as the initial labeled data. We systematically evaluate our stratified bootstrapping approach in the semantic relation classification subtask of the ACE RDC (Relation Detection and Classification) task. In particular, we compare various clustering algorithms on the stratified bootstrapping performance. Experimental results on the ACE RDC 2004 corpus show that our clusteringbased stratified bootstrapping approach achieves the best F1-score of 75.9 on the subtask of semantic relation classification, approaching the one with golden clustering.
4 0.088859737 24 emnlp-2010-Automatically Producing Plot Unit Representations for Narrative Text
Author: Amit Goyal ; Ellen Riloff ; Hal Daume III
Abstract: In the 1980s, plot units were proposed as a conceptual knowledge structure for representing and summarizing narrative stories. Our research explores whether current NLP technology can be used to automatically produce plot unit representations for narrative text. We create a system called AESOP that exploits a variety of existing resources to identify affect states and applies “projection rules” to map the affect states onto the characters in a story. We also use corpus-based techniques to generate a new type of affect knowledge base: verbs that impart positive or negative states onto their patients (e.g., being eaten is an undesirable state, but being fed is a desirable state). We harvest these “patient polarity verbs” from a Web corpus using two techniques: co-occurrence with Evil/Kind Agent patterns, and bootstrapping over conjunctions of verbs. We evaluate the plot unit representations produced by our system on a small collection of Aesop’s fables.
5 0.088810518 119 emnlp-2010-We're Not in Kansas Anymore: Detecting Domain Changes in Streams
Author: Mark Dredze ; Tim Oates ; Christine Piatko
Abstract: Domain adaptation, the problem of adapting a natural language processing system trained in one domain to perform well in a different domain, has received significant attention. This paper addresses an important problem for deployed systems that has received little attention – detecting when such adaptation is needed by a system operating in the wild, i.e., performing classification over a stream of unlabeled examples. Our method uses A-distance, a metric for detecting shifts in data streams, combined with classification margins to detect domain shifts. We empirically show effective domain shift detection on a variety of data sets and shift conditions.
6 0.080534719 120 emnlp-2010-What's with the Attitude? Identifying Sentences with Attitude in Online Discussions
7 0.075840183 31 emnlp-2010-Constraints Based Taxonomic Relation Classification
8 0.070212498 121 emnlp-2010-What a Parser Can Learn from a Semantic Role Labeler and Vice Versa
9 0.068593882 65 emnlp-2010-Inducing Probabilistic CCG Grammars from Logical Form with Higher-Order Unification
10 0.065598428 98 emnlp-2010-Soft Syntactic Constraints for Hierarchical Phrase-Based Translation Using Latent Syntactic Distributions
11 0.063857757 7 emnlp-2010-A Mixture Model with Sharing for Lexical Semantics
12 0.058810163 84 emnlp-2010-NLP on Spoken Documents Without ASR
13 0.057312321 35 emnlp-2010-Discriminative Sample Selection for Statistical Machine Translation
14 0.052084107 12 emnlp-2010-A Semi-Supervised Method to Learn and Construct Taxonomies Using the Web
15 0.05206608 72 emnlp-2010-Learning First-Order Horn Clauses from Web Text
16 0.05087484 40 emnlp-2010-Effects of Empty Categories on Machine Translation
17 0.049821362 43 emnlp-2010-Enhancing Domain Portability of Chinese Segmentation Model Using Chi-Square Statistics and Bootstrapping
18 0.049812127 33 emnlp-2010-Cross Language Text Classification by Model Translation and Semi-Supervised Learning
19 0.045529943 66 emnlp-2010-Inducing Word Senses to Improve Web Search Result Clustering
20 0.043867093 17 emnlp-2010-An Efficient Algorithm for Unsupervised Word Segmentation with Branching Entropy and MDL
topicId topicWeight
[(0, 0.166), (1, 0.113), (2, -0.063), (3, 0.142), (4, 0.046), (5, -0.0), (6, 0.041), (7, 0.022), (8, 0.042), (9, 0.129), (10, 0.173), (11, -0.084), (12, -0.146), (13, -0.18), (14, 0.028), (15, 0.048), (16, 0.05), (17, 0.052), (18, -0.217), (19, -0.091), (20, -0.304), (21, -0.07), (22, -0.167), (23, -0.077), (24, 0.085), (25, 0.181), (26, -0.041), (27, -0.045), (28, 0.081), (29, -0.099), (30, 0.133), (31, -0.066), (32, -0.064), (33, -0.098), (34, -0.09), (35, -0.12), (36, -0.116), (37, -0.089), (38, -0.08), (39, -0.08), (40, 0.121), (41, 0.045), (42, -0.003), (43, 0.052), (44, 0.032), (45, -0.12), (46, 0.008), (47, -0.031), (48, -0.032), (49, -0.072)]
simIndex simValue paperId paperTitle
same-paper 1 0.98992026 112 emnlp-2010-Unsupervised Discovery of Negative Categories in Lexicon Bootstrapping
Author: Tara McIntosh
Abstract: Multi-category bootstrapping algorithms were developed to reduce semantic drift. By extracting multiple semantic lexicons simultaneously, a category’s search space may be restricted. The best results have been achieved through reliance on manually crafted negative categories. Unfortunately, identifying these categories is non-trivial, and their use shifts the unsupervised bootstrapping paradigm towards a supervised framework. We present NEG-FINDER, the first approach for discovering negative categories automatically. NEG-FINDER exploits unsupervised term clustering to generate multiple negative categories during bootstrapping. Our algorithm effectively removes the necessity of manual intervention and formulation of negative categories, with performance closely approaching that obtained using negative categories defined by a domain expert.
2 0.70955157 85 emnlp-2010-Negative Training Data Can be Harmful to Text Classification
Author: Xiao-Li Li ; Bing Liu ; See-Kiong Ng
Abstract: This paper studies the effects of training data on binary text classification and postulates that negative training data is not needed and may even be harmful for the task. Traditional binary classification involves building a classifier using labeled positive and negative training examples. The classifier is then applied to classify test instances into positive and negative classes. A fundamental assumption is that the training and test data are identically distributed. However, this assumption may not hold in practice. In this paper, we study a particular problem where the positive data is identically distributed but the negative data may or may not be so. Many practical text classification and retrieval applications fit this model. We argue that in this setting negative training data should not be used, and that PU learning can be employed to solve the problem. Empirical evaluation has been conducted to support our claim. This result is important as it may fundamentally change the current binary classification paradigm.
3 0.44605413 24 emnlp-2010-Automatically Producing Plot Unit Representations for Narrative Text
Author: Amit Goyal ; Ellen Riloff ; Hal Daume III
Abstract: In the 1980s, plot units were proposed as a conceptual knowledge structure for representing and summarizing narrative stories. Our research explores whether current NLP technology can be used to automatically produce plot unit representations for narrative text. We create a system called AESOP that exploits a variety of existing resources to identify affect states and applies “projection rules” to map the affect states onto the characters in a story. We also use corpus-based techniques to generate a new type of affect knowledge base: verbs that impart positive or negative states onto their patients (e.g., being eaten is an undesirable state, but being fed is a desirable state). We harvest these “patient polarity verbs” from a Web corpus using two techniques: co-occurrence with Evil/Kind Agent patterns, and bootstrapping over conjunctions of verbs. We evaluate the plot unit representations produced by our system on a small collection of Aesop’s fables.
4 0.40599895 120 emnlp-2010-What's with the Attitude? Identifying Sentences with Attitude in Online Discussions
Author: Ahmed Hassan ; Vahed Qazvinian ; Dragomir Radev
Abstract: Mining sentiment from user generated content is a very important task in Natural Language Processing. An example of such content is threaded discussions which act as a very important tool for communication and collaboration in the Web. Threaded discussions include e-mails, e-mail lists, bulletin boards, newsgroups, and Internet forums. Most of the work on sentiment analysis has been centered around finding the sentiment toward products or topics. In this work, we present a method to identify the attitude of participants in an online discussion toward one another. This would enable us to build a signed network representation of participant interaction where every edge has a sign that indicates whether the interaction is positive or negative. This is different from most of the research on social networks that has focused almost exclusively on positive links. The method is experimentally tested using a manually labeled set of discussion posts. The results show that the proposed method is capable of identifying attitudinal sentences, and their signs, with high accuracy and that it outperforms several other baselines.
5 0.36181995 27 emnlp-2010-Clustering-Based Stratified Seed Sampling for Semi-Supervised Relation Classification
Author: Longhua Qian ; Guodong Zhou
Abstract: Seed sampling is critical in semi-supervised learning. This paper proposes a clusteringbased stratified seed sampling approach to semi-supervised learning. First, various clustering algorithms are explored to partition the unlabeled instances into different strata with each stratum represented by a center. Then, diversity-motivated intra-stratum sampling is adopted to choose the center and additional instances from each stratum to form the unlabeled seed set for an oracle to annotate. Finally, the labeled seed set is fed into a bootstrapping procedure as the initial labeled data. We systematically evaluate our stratified bootstrapping approach in the semantic relation classification subtask of the ACE RDC (Relation Detection and Classification) task. In particular, we compare various clustering algorithms on the stratified bootstrapping performance. Experimental results on the ACE RDC 2004 corpus show that our clusteringbased stratified bootstrapping approach achieves the best F1-score of 75.9 on the subtask of semantic relation classification, approaching the one with golden clustering.
6 0.31650761 119 emnlp-2010-We're Not in Kansas Anymore: Detecting Domain Changes in Streams
7 0.31613559 31 emnlp-2010-Constraints Based Taxonomic Relation Classification
8 0.30631542 65 emnlp-2010-Inducing Probabilistic CCG Grammars from Logical Form with Higher-Order Unification
9 0.29076004 84 emnlp-2010-NLP on Spoken Documents Without ASR
10 0.26411057 40 emnlp-2010-Effects of Empty Categories on Machine Translation
11 0.25175193 7 emnlp-2010-A Mixture Model with Sharing for Lexical Semantics
12 0.2326715 72 emnlp-2010-Learning First-Order Horn Clauses from Web Text
13 0.22866111 12 emnlp-2010-A Semi-Supervised Method to Learn and Construct Taxonomies Using the Web
14 0.22338971 121 emnlp-2010-What a Parser Can Learn from a Semantic Role Labeler and Vice Versa
15 0.21665782 114 emnlp-2010-Unsupervised Parse Selection for HPSG
16 0.20209146 35 emnlp-2010-Discriminative Sample Selection for Statistical Machine Translation
17 0.20126225 17 emnlp-2010-An Efficient Algorithm for Unsupervised Word Segmentation with Branching Entropy and MDL
18 0.19034314 45 emnlp-2010-Evaluating Models of Latent Document Semantics in the Presence of OCR Errors
19 0.18565938 91 emnlp-2010-Practical Linguistic Steganography Using Contextual Synonym Substitution and Vertex Colour Coding
topicId topicWeight
[(3, 0.014), (10, 0.015), (12, 0.028), (29, 0.071), (30, 0.477), (32, 0.012), (52, 0.014), (56, 0.04), (62, 0.014), (66, 0.089), (72, 0.055), (76, 0.024), (82, 0.03), (87, 0.011), (89, 0.016)]
simIndex simValue paperId paperTitle
same-paper 1 0.92162871 112 emnlp-2010-Unsupervised Discovery of Negative Categories in Lexicon Bootstrapping
Author: Tara McIntosh
Abstract: Multi-category bootstrapping algorithms were developed to reduce semantic drift. By extracting multiple semantic lexicons simultaneously, a category’s search space may be restricted. The best results have been achieved through reliance on manually crafted negative categories. Unfortunately, identifying these categories is non-trivial, and their use shifts the unsupervised bootstrapping paradigm towards a supervised framework. We present NEG-FINDER, the first approach for discovering negative categories automatically. NEG-FINDER exploits unsupervised term clustering to generate multiple negative categories during bootstrapping. Our algorithm effectively removes the necessity of manual intervention and formulation of negative categories, with performance closely approaching that obtained using negative categories defined by a domain expert.
2 0.754067 47 emnlp-2010-Example-Based Paraphrasing for Improved Phrase-Based Statistical Machine Translation
Author: Aurelien Max
Abstract: In this article, an original view on how to improve phrase translation estimates is proposed. This proposal is grounded on two main ideas: first, that appropriate examples of a given phrase should participate more in building its translation distribution; second, that paraphrases can be used to better estimate this distribution. Initial experiments provide evidence of the potential of our approach and its implementation for effectively improving translation performance.
Author: Amr Ahmed ; Eric Xing
Abstract: With the proliferation of user-generated articles over the web, it becomes imperative to develop automated methods that are aware of the ideological bias implicit in a document collection. While there exist methods that can classify the ideological bias of a given document, little has been done toward understanding the nature of this bias on a topical level. In this paper we address the problem of modeling ideological perspective on a topical level using a factored topic model. We develop efficient inference algorithms using Collapsed Gibbs sampling for posterior inference, and give various evaluations and illustrations of the utility of our model on various document collections with promising results. Finally we give a Metropolis-Hastings inference algorithm for a semi-supervised extension with decent results.
4 0.51433182 31 emnlp-2010-Constraints Based Taxonomic Relation Classification
Author: Quang Do ; Dan Roth
Abstract: Determining whether two terms in text have an ancestor relation (e.g. Toyota and car) or a sibling relation (e.g. Toyota and Honda) is an essential component of textual inference in NLP applications such as Question Answering, Summarization, and Recognizing Textual Entailment. Significant work has been done on developing stationary knowledge sources that could potentially support these tasks, but these resources often suffer from low coverage, noise, and are inflexible when needed to support terms that are not identical to those placed in them, making their use as general purpose background knowledge resources difficult. In this paper, rather than building a stationary hierarchical structure of terms and relations, we describe a system that, given two terms, determines the taxonomic relation between them using a machine learning-based approach that makes use of existing resources. Moreover, we develop a global constraint optimization inference process and use it to leverage an existing knowledge base also to enforce relational constraints among terms and thus improve the classifier predictions. Our experimental evaluation shows that our approach significantly outperforms other systems built upon existing well-known knowledge sources.
5 0.44162217 6 emnlp-2010-A Latent Variable Model for Geographic Lexical Variation
Author: Jacob Eisenstein ; Brendan O'Connor ; Noah A. Smith ; Eric P. Xing
Abstract: The rapid growth of geotagged social media raises new computational possibilities for investigating geographic linguistic variation. In this paper, we present a multi-level generative model that reasons jointly about latent topics and geographical regions. High-level topics such as “sports” or “entertainment” are rendered differently in each geographic region, revealing topic-specific regional distinctions. Applied to a new dataset of geotagged microblogs, our model recovers coherent topics and their regional variants, while identifying geographic areas of linguistic consistency. The model also enables prediction of an author’s geographic location from raw text, outperforming both text regression and supervised topic models.
6 0.43950653 89 emnlp-2010-PEM: A Paraphrase Evaluation Metric Exploiting Parallel Texts
7 0.43477622 32 emnlp-2010-Context Comparison of Bursty Events in Web Search and Online Media
8 0.43039566 63 emnlp-2010-Improving Translation via Targeted Paraphrasing
9 0.41373435 85 emnlp-2010-Negative Training Data Can be Harmful to Text Classification
10 0.40785533 58 emnlp-2010-Holistic Sentiment Analysis Across Languages: Multilingual Supervised Latent Dirichlet Allocation
11 0.39971724 50 emnlp-2010-Facilitating Translation Using Source Language Paraphrase Lattices
12 0.39806378 12 emnlp-2010-A Semi-Supervised Method to Learn and Construct Taxonomies Using the Web
13 0.38581195 78 emnlp-2010-Minimum Error Rate Training by Sampling the Translation Lattice
14 0.38226882 120 emnlp-2010-What's with the Attitude? Identifying Sentences with Attitude in Online Discussions
15 0.3807469 51 emnlp-2010-Function-Based Question Classification for General QA
16 0.37487242 55 emnlp-2010-Handling Noisy Queries in Cross Language FAQ Retrieval
17 0.37186962 72 emnlp-2010-Learning First-Order Horn Clauses from Web Text
18 0.37166589 105 emnlp-2010-Title Generation with Quasi-Synchronous Grammar
19 0.37089732 23 emnlp-2010-Automatic Keyphrase Extraction via Topic Decomposition
20 0.37037686 45 emnlp-2010-Evaluating Models of Latent Document Semantics in the Presence of OCR Errors