
112 emnlp-2010-Unsupervised Discovery of Negative Categories in Lexicon Bootstrapping


Source: pdf

Author: Tara McIntosh

Abstract: Multi-category bootstrapping algorithms were developed to reduce semantic drift. By extracting multiple semantic lexicons simultaneously, a category’s search space may be restricted. The best results have been achieved through reliance on manually crafted negative categories. Unfortunately, identifying these categories is non-trivial, and their use shifts the unsupervised bootstrapping paradigm towards a supervised framework. We present NEG-FINDER, the first approach for discovering negative categories automatically. NEG-FINDER exploits unsupervised term clustering to generate multiple negative categories during bootstrapping. Our algorithm effectively removes the necessity of manual intervention and formulation of negative categories, with performance closely approaching that obtained using negative categories defined by a domain expert.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Unsupervised discovery of negative categories in lexicon bootstrapping Tara McIntosh NICTA Victoria Research Lab Dept of Computer Science and Software Engineering University of Melbourne nlp@taramcintosh. [sent-1, score-1.257]

2 The best results have been achieved through reliance on manually crafted negative categories. [sent-4, score-0.644]

3 We present NEG-FINDER, the first approach for discovering negative categories automatically. [sent-6, score-0.858]

4 NEG-FINDER exploits unsupervised term clustering to generate multiple negative categories during bootstrapping. [sent-7, score-0.996]

5 Our algorithm effectively removes the necessity of manual intervention and formulation of negative categories, with performance closely approaching that obtained using negative categories defined by a domain expert. [sent-8, score-1.42]

6 Unfortunately, semantic drift often occurs when ambiguous or erroneous terms and/or patterns are introduced into the iterative process (Curran et al. [sent-17, score-0.672]

7 In multi-category bootstrapping, semantic drift is often reduced when the target categories compete with each other for terms and/or patterns (Yangarber et al. [sent-19, score-1.133]

8 To ensure this, manually crafted negative categories are introduced (Lin et al. [sent-22, score-1.05]

9 The design of negative categories is a very time-consuming task. [sent-26, score-0.837]

10 It typically requires a domain expert to identify the semantic drift and its cause, followed by a significant amount of trial and error in order to select the most suitable combination of negative categories. [sent-27, score-1.152]

11 We show that although excellent performance is achieved using negative categories, it varies greatly depending on the negative categories selected. [sent-29, score-1.3]

12 This highlights the difficulty of crafting negative categories and thus the necessity for tools that can automatically identify them. [sent-30, score-0.935]

13 … negative categories automatically. [sent-33, score-0.837]

14 During bootstrapping, efficient clustering techniques are applied to sets of drifted candidate terms to generate new negative categories. [sent-34, score-0.905]

15 Once a negative category is identified, it is incorporated into the subsequent iterations, where it provides the necessary semantic boundaries for the target categories. [sent-35, score-0.863]

16 NEG-FINDER significantly outperforms bootstrapping prior to the introduction of the domain expert’s negative categories. [sent-37, score-0.717]

17 Our methods effectively remove the necessity of manual intervention and formulation of negative categories in semantic lexicon bootstrapping. [sent-39, score-1.128]

18 This often causes semantic drift when a lexicon’s intended meaning shifts into another category during bootstrapping (Curran et al. [sent-45, score-0.876]

19 …, 2002), and WMEB (McIntosh and Curran, 2008), aim to reduce semantic drift by extracting multiple semantic categories simultaneously. [sent-49, score-1.019]

20 In Weighted Mutual Exclusion Bootstrapping (WMEB, McIntosh and Curran, 2008), multiple semantic categories iterate simultaneously between the term and pattern extraction phases, competing with each other for terms and patterns. [sent-54, score-0.689]

21 Semantic drift is reduced by forcing the categories to be mutually exclusive. [sent-55, score-0.769]

22 To ensure mutual exclusion between the categories, candidate patterns that are identified by multiple categories in an iteration are excluded. [sent-60, score-0.67]
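
A minimal sketch of this exclusion step, assuming one candidate-pattern list per category; the function name and example patterns below are illustrative, not WMEB's actual implementation.

    from collections import defaultdict

    def exclude_contested_patterns(candidates_by_category):
        """Drop candidate patterns nominated by more than one category in
        the same iteration, enforcing mutual exclusion (a sketch of the
        WMEB constraint, not the original code)."""
        nominations = defaultdict(set)
        for category, patterns in candidates_by_category.items():
            for pattern in patterns:
                nominations[pattern].add(category)
        return {category: [p for p in patterns if len(nominations[p]) == 1]
                for category, patterns in candidates_by_category.items()}

    # Hypothetical example: the contested pattern is withheld from both.
    print(exclude_contested_patterns({
        "CELL": ["X is expressed in Y", "Y cells"],
        "DISEASE": ["X is expressed in Y", "patients with Y"],
    }))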

23 In McIntosh and Curran (2009), we showed that multi-category bootstrappers are still prone to semantic drift in the later iterations. [sent-72, score-0.9]

24 We proposed a drift detection metric based on our hypothesis that semantic drift occurs when a candidate term is more similar to the recently added terms than to the seed and high precision terms extracted in the earlier iterations. [sent-73, score-1.297]

25 The drift metric is defined as the ratio of the average distributional similarity of the candidate term to the first n terms extracted into the lexicon L, and to the last m terms extracted in the previous iterations: drift(term, n, m) = avgsim(L_{1..n}, term) / avgsim(L_{(N-m+1)..N}, term). [sent-75, score-0.892]
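
A hedged sketch of the metric as reconstructed above, assuming a sim(term_a, term_b) function that returns distributional similarity; the function names are assumptions, not the paper's implementation.

    def drift(term, lexicon, n, m, sim):
        """Ratio of the candidate term's average distributional similarity
        to the first n lexicon terms (seeds and early, high-precision
        terms) over its average similarity to the last m terms extracted.
        Values below 1 suggest the term resembles the recent (possibly
        drifted) terms more than the trusted core. Assumes m <= len(lexicon)
        and a nonzero denominator."""
        head = lexicon[:n]            # L_{1..n}
        tail = lexicon[-m:]           # L_{(N-m+1)..N}
        avg_head = sum(sim(term, t) for t in head) / len(head)
        avg_tail = sum(sim(term, t) for t in tail) / len(tail)
        return avg_head / avg_tail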

26 In multi-category bootstrapping, improvements in precision arise when semantic boundaries between multiple target categories are established. [sent-82, score-0.908]

27 Unfortunately, it is difficult to predict if a target category will suffer from semantic drift and/or whether it will naturally compete with the other target categories. [sent-84, score-0.755]

28 Once a domain expert establishes semantic drift and its possible cause, a set of negative/stop categories that may be of no direct interest are manually crafted to prevent semantic drift. [sent-85, score-1.306]

29 These additional categories are then exploited during another round of bootstrapping to provide further competition for the target categories (Lin et al. [sent-86, score-1.035]

30 Lin et al. (2003) improved NOMEN’s performance for extracting diseases and locations from the ProMED corpus by incorporating negative categories into the bootstrapping process. [sent-90, score-1.075]

31 This single negative category resulted in substantial improvements in precision. [sent-92, score-0.626]

32 In their final experiment, six negative categories that were notable sources of semantic drift were identified, and the inclusion of these led to further performance improvements (∼20%). [sent-93, score-1.342]

33 … (2007) and McIntosh (2010) manually crafted negative categories that were necessary to prevent semantic drift. [sent-96, score-1.128]

34 In particular, in McIntosh (2010), a biomedical expert spent considerable time (∼15 days) and effort identifying potential negative categories and subsequently optimising their associated seeds in trial and error bootstrapping runs. (Figure 1: NEG-FINDER: Local negative discovery.) [sent-97, score-1.843]

35 By introducing manually crafted negative categories, a significant amount of expert domain knowledge is introduced. [sent-98, score-0.69]

36 To discover negative categories during bootstrapping, NEG-FINDER must identify a representative cluster of the drifted terms. [sent-105, score-1.156]

37 In this section, we present the two types of clustering used (maximum and outlier), and our three different levels of negative discovery (local, global and mixture). [sent-106, score-0.697]

38 We have observed that semantic drift begins to dominate when clusters of incorrect terms with similar meanings are extracted. [sent-108, score-1.451]

39 In NEG-FINDER, these drifted terms are cached as they may provide adequate seed terms for new negative categories. [sent-111, score-0.956]

40 However, the drifted terms can also include scattered polysemous or correct terms that share little similarity with the other drifted terms. [sent-112, score-0.646]

41 Therefore, simply using the first set of drifted terms to establish a negative category is likely to introduce noise rather than a cohesive competing category. [sent-113, score-0.975]

42 To discover negative categories, we exploit hierarchical clustering to group similar terms within the cache of drifted terms. [sent-114, score-1.022]

43 To ensure adequate coverage of the possible drifting topics, negative discovery and hence clustering is only performed when the drift cache consists of at least 20 terms. [sent-120, score-1.295]

44 In our next clustering method, we aim to form a negative category with as little similarity to the target seeds as possible. [sent-125, score-0.794]

45 We use an outlier clustering strategy, in which the drifted term t with the least average distributional similarity to the first n terms in the lexicon must be contained in the cluster of seeds. [sent-126, score-0.694]
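
Sentences 42-45 describe the two clustering strategies over the drift cache. Below is an illustrative sketch using SciPy's agglomerative clustering; the cosine metric, the distance cut-off, and the reading of "maximum" as the largest cluster are our assumptions, not settings from the paper.

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage

    def cluster_drift_cache(terms, vectors, cut=0.5):
        """Group cached drifted terms by average-link hierarchical
        clustering of their distributional vectors. The cosine metric and
        the distance cut-off are illustrative choices."""
        tree = linkage(np.asarray(vectors), method="average", metric="cosine")
        labels = fcluster(tree, t=cut, criterion="distance")
        clusters = {}
        for term, label in zip(terms, labels):
            clusters.setdefault(label, []).append(term)
        return list(clusters.values())

    def maximum_cluster(clusters):
        """'Maximum' strategy, read here as the largest cohesive cluster
        of drifted terms (an assumption about the paper's criterion)."""
        return max(clusters, key=len)

    def outlier_cluster(clusters, terms, sim_to_lexicon_head):
        """'Outlier' strategy: the chosen cluster must contain the drifted
        term least similar, on average, to the first n lexicon terms."""
        outlier = min(terms, key=sim_to_lexicon_head)
        return next(c for c in clusters if outlier in c)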

46 After a cluster of negative seed terms is established, the drift cache is cleared, and a new negative category is created and introduced into the iterative bootstrapping process in the next iteration. [sent-130, score-0.863]

47 The negative categories can compete with all other categories, including any previously introduced negative categories; however, the negative categories do not contribute to the drift caches. [sent-132, score-2.569]

48 For this, the complete set of extracting patterns matching any of the negative seeds is considered and ranked with respect to the seeds. [sent-134, score-0.645]

49 The top scoring patterns are considered sequentially until m patterns are assigned to the new negative category. [sent-135, score-0.627]

50 To ensure mutual exclusion between the new category and the target categories, a candidate pattern that has previously been selected by a target category cannot be used to extract terms for either category in the subsequent iterations. [sent-136, score-0.875]
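
A sketch of this seeding step under the stated constraints; pattern_matches, the scoring callback, and m=5 are illustrative assumptions (the paper does not publish this code).

    def seed_negative_category(neg_seeds, pattern_matches, score, taken, m=5):
        """Rank the extraction patterns matching any negative seed and keep
        the top m, skipping patterns already claimed by a target category so
        mutual exclusion holds. `pattern_matches` maps pattern -> extracted
        terms; `score` and m=5 are illustrative."""
        candidates = [p for p, terms in pattern_matches.items()
                      if any(s in terms for s in neg_seeds)]
        ranked = sorted(candidates, key=lambda p: score(p, neg_seeds),
                        reverse=True)
        chosen = []
        for pattern in ranked:
            if pattern in taken:
                continue  # previously selected by a target category
            chosen.append(pattern)
            if len(chosen) == m:
                break
        return chosen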

51 Negative category discovery can be performed at a local or global level, or as a mixture of both. [sent-138, score-1.004]

52 In local discovery, each target category has its own drifted term cache and can generate negative categories irrespective of the other target categories. [sent-139, score-1.55]

53 The drifted terms (shaded) are extracted away from the lexicon into the local drift cache, which is then clustered. [sent-141, score-0.872]

54 Target categories can also generate multiple negative categories across different iterations. [sent-143, score-0.86]

55 In global discovery, all drifted terms are pooled into a global cache, from which a single negative category can be identified in an iteration. [sent-144, score-1.059]

56 … may be drifting into similar semantic categories, and enables these otherwise missed negative categories to be established. [sent-147, score-1.021]

57 In the mixture discovery method, both global and local negative categories can be formed. [sent-148, score-1.11]

58 Once a local negative category is formed, the terms within the local cache are cleared and also removed from the global cache. [sent-151, score-0.998]

59 This prevents multiple negative categories from being instantiated with overlapping seed terms. [sent-152, score-0.959]
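
Sentences 52-59 describe the cache bookkeeping for the three discovery levels. The class below sketches the mixture-level behaviour: local caches per target category, a global pool, and clearing on local discovery so that no two negative categories share seed terms. The class and method names are hypothetical.

    class DriftCaches:
        """Bookkeeping for mixture-level discovery (names hypothetical).
        Each target category keeps a local cache of drifted terms, which
        are also pooled globally; forming a local negative category clears
        that local cache and removes its terms from the global pool."""

        def __init__(self, categories):
            self.local = {c: [] for c in categories}
            self.global_pool = []

        def add_drifted(self, category, term):
            self.local[category].append(term)
            self.global_pool.append(term)

        def form_local_negative(self, category):
            seeds = list(self.local[category])
            self.local[category].clear()
            self.global_pool = [t for t in self.global_pool if t not in seeds]
            return seeds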

60 To compare the effectiveness of our negative discovery approaches we consider the task of extracting biomedical semantic lexicons from raw text. [sent-153, score-0.81]

61 The semantic categories we extract from MEDLINE were inspired by the TREC Genomics entities (Hersh et al. [sent-166, score-0.858]

62 In our experiments, we use two different sets of negative categories. [sent-171, score-0.837]

63 The first set corresponds to those used in McIntosh and Curran (2008), and were identified by a domain expert as common sources of semantic drift in preliminary experiments with MEB and WMEB. [sent-173, score-0.681]

64 The ANIMAL and BODY PART categories were formed with the intention of preventing drift in the CELL, DISE and SIGN categories. [sent-175, score-0.809]

65 The ORGANISM category was then created to reduce the new drift forming in the DISE category after the first set of negative categories were introduced. [sent-176, score-1.558]

66 The second set of negative categories was identified by an independent domain expert with limited knowledge of NLP and bootstrapping. (Table: CATEGORY and SEED TERMS for the hand-picked negative sets, including AMINO ACID and ANIMAL.) [sent-177, score-1.013]

67 Unless otherwise stated, no hand-picked negative categories are used. [sent-202, score-0.837]

68 To ensure infrequent terms are not used to seed negative categories, drifted terms must occur at least 50 times to be retained in the drift cache. [sent-208, score-1.383]

69 Negative category discovery is only initiated when the drift cache contains at least 20 terms, and a minimum of 5 terms are used to seed a negative category. [sent-209, score-1.304]
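
The thresholds in sentences 68-69 can be read as a simple gate before clustering runs. A sketch, with the constants taken from the text and everything else (function names, the discover callback) assumed:

    MIN_TERM_FREQ = 50    # drifted terms must occur at least 50 times
    MIN_CACHE_SIZE = 20   # discovery triggers once the cache holds 20 terms
    MIN_SEEDS = 5         # at least 5 terms must seed a negative category

    def maybe_discover(cache, term, freq, discover):
        """Apply the stated thresholds: infrequent terms never enter the
        cache, clustering only runs on a full enough cache, and a category
        is only created from a sufficiently large seed cluster. `discover`
        (returning a seed cluster or None) is a hypothetical callback."""
        if freq >= MIN_TERM_FREQ:
            cache.append(term)
        if len(cache) >= MIN_CACHE_SIZE:
            seeds = discover(cache)
            if seeds is not None and len(seeds) >= MIN_SEEDS:
                return seeds
        return None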

70 In our first experiments, we investigate the performance variations and improvements gained using negative categories selected by two independent domain experts. [sent-217, score-1.742]

71 Table 4 shows WMEB-DRIFT’s average precision over the 10 target categories with and without the two negative category sets. [sent-218, score-1.05]

72 This demonstrates the difficulty of selecting appropriate negative categories and seeds for the task, and in turn the necessity for tools to discover them automatically. [sent-220, score-1.008]

73 The first discovery approach corresponds to the naïve NEG-FINDER system that generates local negative categories from the first five drifted terms. [sent-225, score-1.201]

74 Compared to local discovery, global discovery is capable of detecting new negative categories earlier, and the categories it detects are more … (Footnote 3: Statistical significance was tested using computationally-intensive randomisation tests (Cohen, 1995).) [sent-230, score-1.396]
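
The significance testing mentioned in the footnote can be sketched as a paired randomisation test in the spirit of Cohen (1995); the per-category score pairing and trial count here are assumptions.

    import random

    def randomisation_test(scores_a, scores_b, trials=10000, seed=0):
        """Paired randomisation test: swap the paired per-category scores
        of two systems at random and count how often a difference at least
        as large as the observed one arises by chance."""
        rng = random.Random(seed)
        observed = abs(sum(scores_a) - sum(scores_b))
        hits = 0
        for _ in range(trials):
            diff = 0.0
            for a, b in zip(scores_a, scores_b):
                if rng.random() < 0.5:
                    a, b = b, a
                diff += a - b
            if abs(diff) >= observed:
                hits += 1
        return (hits + 1) / (trials + 1)  # smoothed empirical p-value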

75 The NEG-FINDER mixture approach, which benefits from both local and global discovery, identifies the most useful negative categories. [sent-232, score-0.631]

76 Table 6 shows the seven discovered categories: two local negative categories, from CELL and TUMOUR, and five global categories. [sent-233, score-1.754]

77 These results demonstrate that suitable negative categories can be identified and exploited during bootstrapping. [sent-239, score-0.906]

78 In our next set of experiments, we investigate whether NEG-FINDER can improve state-of-the-art performance by identifying new negative categories in addition to the manually selected negative categories. (Table 8: WMEB-DRIFT +NEG-FINDER performance.) [sent-241, score-2.166]

79 Both NEG-FINDER and WMEB-DRIFT are initialised with the 10 target categories and the first set of negative categories. [sent-245, score-0.931]

80 The performance improvements so far using NEG-FINDER have been limited by the time at which new negative categories are discovered and incorporated into the bootstrapping process. [sent-250, score-2.008]

81 That is, system improvements can only be gained from the negative categories after they are generated. [sent-251, score-0.859]

82 For example, in Local NEG-FINDER, five negative categories are discovered in iterations 83, 85, 126, 130 and 150. [sent-252, score-0.926]

83 On the other hand, in the WMEB-DRIFT +negative experiments (Table 8 row 2), the hand-picked negative categories can start competing with the target categories in the very first iteration of bootstrapping. [sent-253, score-1.303]

84 Table 7 shows the average precision of WMEB-DRIFT over the 10 target categories when it is restarted with the new negative categories discovered from our three approaches (using maximum clustering). [sent-255, score-1.35]

85 Over the first 200 terms, significant improvements are gained using the new negative categories (+6%). [sent-256, score-0.859]

86 However, the manually selected categories are far superior in preventing drift (+11%). [sent-257, score-0.798]

87 This may be attributed to the target categories not strongly drifting into the new negative categories until the later stages, whereas the hand-picked categories were selected on the basis of observed drift in the early stages (over the first 500 terms). [sent-258, score-2.104]

88 Table 7 shows that each of the discovered negative sets can significantly outperform the negative categories selected by a domain expert (negative set 2) (+0. [sent-266, score-1.525]

89 The discovered negative categories are more effective than the manually crafted sets in reducing semantic drift in the ANTIBODY, CELL and DISEASE lexicons. [sent-279, score-1.612]

90 NEG-FINDER also significantly boosts the performance of the original negative categories by identifying additional negative categories (row 5). [sent-328, score-1.674]

91 Our final experiment, where WMEB-DRIFT is re-initialised with the negative categories discovered by NEG-FINDER, further demonstrates the utility of our method. [sent-329, score-0.926]

92 On average, the discovered negative categories significantly outperform the manually crafted negative categories. [sent-330, score-1.57]

93 In this paper, we have proposed the first completely unsupervised approach to identifying the negative categories that are necessary for bootstrapping large yet precise semantic lexicons. [sent-331, score-1.177]

94 Prior to this work, negative categories were manually crafted by a domain expert, undermining the advantages of an unsupervised bootstrapping paradigm. [sent-332, score-1.294]

95 We intend to use sophisticated clustering methods, such as CBC (Pantel, 2003), to identify multiple negative categories across the target categories in a single iteration. [sent-334, score-1.37]

96 Our initial analysis demonstrated that although excellent performance is achieved using negative categories, large performance variations occur when using categories crafted by different domain experts. [sent-336, score-1.035]

97 (Table 10: Random seed results.) … negative categories during bootstrapping. [sent-349, score-0.938]

98 NEGFINDER identifies cohesive negative categories and many of these are semantically similar to those identified by domain experts. [sent-350, score-0.923]

99 NEG-FINDER significantly outperforms the state-of-the-art algorithm WMEB-DRIFT, before negative categories are crafted, by up to 5. [sent-351, score-0.837]

100 The new discovered categories can also be fully exploited in bootstrapping, where they successfully outperform a domain expert’s negative categories and approach that of another expert. [sent-354, score-1.375]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('negative', 0.463), ('drift', 0.395), ('categories', 0.374), ('drifted', 0.222), ('mcintosh', 0.222), ('bootstrapping', 0.208), ('category', 0.163), ('crafted', 0.152), ('cache', 0.14), ('curran', 0.129), ('semantic', 0.11), ('lexicon', 0.107), ('discovery', 0.105), ('seed', 0.101), ('expert', 0.09), ('discovered', 0.089), ('wmeb', 0.089), ('mixture', 0.088), ('clustering', 0.086), ('terms', 0.085), ('patterns', 0.082), ('restart', 0.076), ('drifting', 0.074), ('tara', 0.074), ('seeds', 0.07), ('exclusion', 0.063), ('outlier', 0.063), ('yangarber', 0.059), ('lexicons', 0.057), ('riloff', 0.054), ('term', 0.051), ('target', 0.05), ('necessity', 0.049), ('candidate', 0.049), ('cluster', 0.048), ('domain', 0.046), ('biomedical', 0.045), ('initialised', 0.044), ('global', 0.043), ('medline', 0.042), ('competing', 0.042), ('formed', 0.04), ('cell', 0.04), ('identified', 0.04), ('dise', 0.038), ('australian', 0.038), ('local', 0.037), ('compete', 0.037), ('incorporated', 0.037), ('distributional', 0.036), ('reliability', 0.034), ('similarity', 0.032), ('ensure', 0.032), ('pool', 0.031), ('extracting', 0.03), ('mutual', 0.03), ('bootstrapper', 0.03), ('cleared', 0.03), ('genomics', 0.03), ('grover', 0.03), ('meb', 0.03), ('negfinder', 0.03), ('nicta', 0.03), ('nomen', 0.03), ('organism', 0.03), ('ravichandran', 0.03), ('tumour', 0.03), ('tumr', 0.03), ('winston', 0.03), ('manually', 0.029), ('exploited', 0.029), ('carlson', 0.028), ('jones', 0.028), ('pattern', 0.027), ('ellen', 0.027), ('discover', 0.026), ('tools', 0.026), ('extracted', 0.026), ('thelen', 0.025), ('roman', 0.025), ('trial', 0.025), ('hersh', 0.025), ('intervention', 0.025), ('initiated', 0.025), ('pantel', 0.025), ('clusters', 0.024), ('female', 0.023), ('randomised', 0.023), ('gories', 0.023), ('identify', 0.023), ('unsupervised', 0.022), ('lin', 0.022), ('sign', 0.022), ('phase', 0.022), ('gained', 0.022), ('instantiated', 0.021), ('animal', 0.021), ('xml', 0.021), ('minimally', 0.021), ('discovering', 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999893 112 emnlp-2010-Unsupervised Discovery of Negative Categories in Lexicon Bootstrapping

Author: Tara McIntosh

Abstract: Multi-category bootstrapping algorithms were developed to reduce semantic drift. By extracting multiple semantic lexicons simultaneously, a category’s search space may be restricted. The best results have been achieved through reliance on manually crafted negative categories. Unfortunately, identifying these categories is non-trivial, and their use shifts the unsupervised bootstrapping paradigm towards a supervised framework. We present NEG-FINDER, the first approach for discovering negative categories automatically. NEG-FINDER exploits unsupervised term clustering to generate multiple negative categories during bootstrapping. Our algorithm effectively removes the necessity of manual intervention and formulation of negative categories, with performance closely approaching that obtained using negative categories defined by a domain expert.

2 0.19865093 85 emnlp-2010-Negative Training Data Can be Harmful to Text Classification

Author: Xiao-Li Li ; Bing Liu ; See-Kiong Ng

Abstract: This paper studies the effects of training data on binary text classification and postulates that negative training data is not needed and may even be harmful for the task. Traditional binary classification involves building a classifier using labeled positive and negative training examples. The classifier is then applied to classify test instances into positive and negative classes. A fundamental assumption is that the training and test data are identically distributed. However, this assumption may not hold in practice. In this paper, we study a particular problem where the positive data is identically distributed but the negative data may or may not be so. Many practical text classification and retrieval applications fit this model. We argue that in this setting negative training data should not be used, and that PU learning can be employed to solve the problem. Empirical evaluation has been con- ducted to support our claim. This result is important as it may fundamentally change the current binary classification paradigm.

3 0.12192332 27 emnlp-2010-Clustering-Based Stratified Seed Sampling for Semi-Supervised Relation Classification

Author: Longhua Qian ; Guodong Zhou

Abstract: Seed sampling is critical in semi-supervised learning. This paper proposes a clusteringbased stratified seed sampling approach to semi-supervised learning. First, various clustering algorithms are explored to partition the unlabeled instances into different strata with each stratum represented by a center. Then, diversity-motivated intra-stratum sampling is adopted to choose the center and additional instances from each stratum to form the unlabeled seed set for an oracle to annotate. Finally, the labeled seed set is fed into a bootstrapping procedure as the initial labeled data. We systematically evaluate our stratified bootstrapping approach in the semantic relation classification subtask of the ACE RDC (Relation Detection and Classification) task. In particular, we compare various clustering algorithms on the stratified bootstrapping performance. Experimental results on the ACE RDC 2004 corpus show that our clusteringbased stratified bootstrapping approach achieves the best F1-score of 75.9 on the subtask of semantic relation classification, approaching the one with golden clustering.

4 0.088859737 24 emnlp-2010-Automatically Producing Plot Unit Representations for Narrative Text

Author: Amit Goyal ; Ellen Riloff ; Hal Daume III

Abstract: In the 1980s, plot units were proposed as a conceptual knowledge structure for representing and summarizing narrative stories. Our research explores whether current NLP technology can be used to automatically produce plot unit representations for narrative text. We create a system called AESOP that exploits a variety of existing resources to identify affect states and applies “projection rules” to map the affect states onto the characters in a story. We also use corpus-based techniques to generate a new type of affect knowledge base: verbs that impart positive or negative states onto their patients (e.g., being eaten is an undesirable state, but being fed is a desirable state). We harvest these “patient polarity verbs” from a Web corpus using two techniques: co-occurrence with Evil/Kind Agent patterns, and bootstrapping over conjunctions of verbs. We evaluate the plot unit representations produced by our system on a small collection of Aesop’s fables.

5 0.088810518 119 emnlp-2010-We're Not in Kansas Anymore: Detecting Domain Changes in Streams

Author: Mark Dredze ; Tim Oates ; Christine Piatko

Abstract: Domain adaptation, the problem of adapting a natural language processing system trained in one domain to perform well in a different domain, has received significant attention. This paper addresses an important problem for deployed systems that has received little attention – detecting when such adaptation is needed by a system operating in the wild, i.e., performing classification over a stream of unlabeled examples. Our method uses Aodifst uannlaceb,e a dm eextraicm fpolre detecting tshhoifdts u sine sd Aatastreams, combined with classification margins to detect domain shifts. We empirically show effective domain shift detection on a variety of data sets and shift conditions.

6 0.080534719 120 emnlp-2010-What's with the Attitude? Identifying Sentences with Attitude in Online Discussions

7 0.075840183 31 emnlp-2010-Constraints Based Taxonomic Relation Classification

8 0.070212498 121 emnlp-2010-What a Parser Can Learn from a Semantic Role Labeler and Vice Versa

9 0.068593882 65 emnlp-2010-Inducing Probabilistic CCG Grammars from Logical Form with Higher-Order Unification

10 0.065598428 98 emnlp-2010-Soft Syntactic Constraints for Hierarchical Phrase-Based Translation Using Latent Syntactic Distributions

11 0.063857757 7 emnlp-2010-A Mixture Model with Sharing for Lexical Semantics

12 0.058810163 84 emnlp-2010-NLP on Spoken Documents Without ASR

13 0.057312321 35 emnlp-2010-Discriminative Sample Selection for Statistical Machine Translation

14 0.052084107 12 emnlp-2010-A Semi-Supervised Method to Learn and Construct Taxonomies Using the Web

15 0.05206608 72 emnlp-2010-Learning First-Order Horn Clauses from Web Text

16 0.05087484 40 emnlp-2010-Effects of Empty Categories on Machine Translation

17 0.049821362 43 emnlp-2010-Enhancing Domain Portability of Chinese Segmentation Model Using Chi-Square Statistics and Bootstrapping

18 0.049812127 33 emnlp-2010-Cross Language Text Classification by Model Translation and Semi-Supervised Learning

19 0.045529943 66 emnlp-2010-Inducing Word Senses to Improve Web Search Result Clustering

20 0.043867093 17 emnlp-2010-An Efficient Algorithm for Unsupervised Word Segmentation with Branching Entropy and MDL


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.166), (1, 0.113), (2, -0.063), (3, 0.142), (4, 0.046), (5, -0.0), (6, 0.041), (7, 0.022), (8, 0.042), (9, 0.129), (10, 0.173), (11, -0.084), (12, -0.146), (13, -0.18), (14, 0.028), (15, 0.048), (16, 0.05), (17, 0.052), (18, -0.217), (19, -0.091), (20, -0.304), (21, -0.07), (22, -0.167), (23, -0.077), (24, 0.085), (25, 0.181), (26, -0.041), (27, -0.045), (28, 0.081), (29, -0.099), (30, 0.133), (31, -0.066), (32, -0.064), (33, -0.098), (34, -0.09), (35, -0.12), (36, -0.116), (37, -0.089), (38, -0.08), (39, -0.08), (40, 0.121), (41, 0.045), (42, -0.003), (43, 0.052), (44, 0.032), (45, -0.12), (46, 0.008), (47, -0.031), (48, -0.032), (49, -0.072)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.98992026 112 emnlp-2010-Unsupervised Discovery of Negative Categories in Lexicon Bootstrapping

Author: Tara McIntosh

Abstract: Multi-category bootstrapping algorithms were developed to reduce semantic drift. By extracting multiple semantic lexicons simultaneously, a category’s search space may be restricted. The best results have been achieved through reliance on manually crafted negative categories. Unfortunately, identifying these categories is non-trivial, and their use shifts the unsupervised bootstrapping paradigm towards a supervised framework. We present NEG-FINDER, the first approach for discovering negative categories automatically. NEG-FINDER exploits unsupervised term clustering to generate multiple negative categories during bootstrapping. Our algorithm effectively removes the necessity of manual intervention and formulation of negative categories, with performance closely approaching that obtained using negative categories defined by a domain expert.

2 0.70955157 85 emnlp-2010-Negative Training Data Can be Harmful to Text Classification

Author: Xiao-Li Li ; Bing Liu ; See-Kiong Ng

Abstract: This paper studies the effects of training data on binary text classification and postulates that negative training data is not needed and may even be harmful for the task. Traditional binary classification involves building a classifier using labeled positive and negative training examples. The classifier is then applied to classify test instances into positive and negative classes. A fundamental assumption is that the training and test data are identically distributed. However, this assumption may not hold in practice. In this paper, we study a particular problem where the positive data is identically distributed but the negative data may or may not be so. Many practical text classification and retrieval applications fit this model. We argue that in this setting negative training data should not be used, and that PU learning can be employed to solve the problem. Empirical evaluation has been con- ducted to support our claim. This result is important as it may fundamentally change the current binary classification paradigm.

3 0.44605413 24 emnlp-2010-Automatically Producing Plot Unit Representations for Narrative Text

Author: Amit Goyal ; Ellen Riloff ; Hal Daume III

Abstract: In the 1980s, plot units were proposed as a conceptual knowledge structure for representing and summarizing narrative stories. Our research explores whether current NLP technology can be used to automatically produce plot unit representations for narrative text. We create a system called AESOP that exploits a variety of existing resources to identify affect states and applies “projection rules” to map the affect states onto the characters in a story. We also use corpus-based techniques to generate a new type of affect knowledge base: verbs that impart positive or negative states onto their patients (e.g., being eaten is an undesirable state, but being fed is a desirable state). We harvest these “patient polarity verbs” from a Web corpus using two techniques: co-occurrence with Evil/Kind Agent patterns, and bootstrapping over conjunctions of verbs. We evaluate the plot unit representations produced by our system on a small collection of Aesop’s fables.

4 0.40599895 120 emnlp-2010-What's with the Attitude? Identifying Sentences with Attitude in Online Discussions

Author: Ahmed Hassan ; Vahed Qazvinian ; Dragomir Radev

Abstract: Mining sentiment from user generated content is a very important task in Natural Language Processing. An example of such content is threaded discussions which act as a very important tool for communication and collaboration in the Web. Threaded discussions include e-mails, e-mail lists, bulletin boards, newsgroups, and Internet forums. Most of the work on sentiment analysis has been centered around finding the sentiment toward products or topics. In this work, we present a method to identify the attitude of participants in an online discussion toward one another. This would enable us to build a signed network representation of participant interaction where every edge has a sign that indicates whether the interaction is positive or negative. This is different from most of the research on social networks that has focused almost exclusively on positive links. The method is exper- imentally tested using a manually labeled set of discussion posts. The results show that the proposed method is capable of identifying attitudinal sentences, and their signs, with high accuracy and that it outperforms several other baselines.

5 0.36181995 27 emnlp-2010-Clustering-Based Stratified Seed Sampling for Semi-Supervised Relation Classification

Author: Longhua Qian ; Guodong Zhou

Abstract: Seed sampling is critical in semi-supervised learning. This paper proposes a clusteringbased stratified seed sampling approach to semi-supervised learning. First, various clustering algorithms are explored to partition the unlabeled instances into different strata with each stratum represented by a center. Then, diversity-motivated intra-stratum sampling is adopted to choose the center and additional instances from each stratum to form the unlabeled seed set for an oracle to annotate. Finally, the labeled seed set is fed into a bootstrapping procedure as the initial labeled data. We systematically evaluate our stratified bootstrapping approach in the semantic relation classification subtask of the ACE RDC (Relation Detection and Classification) task. In particular, we compare various clustering algorithms on the stratified bootstrapping performance. Experimental results on the ACE RDC 2004 corpus show that our clusteringbased stratified bootstrapping approach achieves the best F1-score of 75.9 on the subtask of semantic relation classification, approaching the one with golden clustering.

6 0.31650761 119 emnlp-2010-We're Not in Kansas Anymore: Detecting Domain Changes in Streams

7 0.31613559 31 emnlp-2010-Constraints Based Taxonomic Relation Classification

8 0.30631542 65 emnlp-2010-Inducing Probabilistic CCG Grammars from Logical Form with Higher-Order Unification

9 0.29076004 84 emnlp-2010-NLP on Spoken Documents Without ASR

10 0.26411057 40 emnlp-2010-Effects of Empty Categories on Machine Translation

11 0.25175193 7 emnlp-2010-A Mixture Model with Sharing for Lexical Semantics

12 0.2326715 72 emnlp-2010-Learning First-Order Horn Clauses from Web Text

13 0.22866111 12 emnlp-2010-A Semi-Supervised Method to Learn and Construct Taxonomies Using the Web

14 0.22338971 121 emnlp-2010-What a Parser Can Learn from a Semantic Role Labeler and Vice Versa

15 0.21665782 114 emnlp-2010-Unsupervised Parse Selection for HPSG

16 0.20209146 35 emnlp-2010-Discriminative Sample Selection for Statistical Machine Translation

17 0.20126225 17 emnlp-2010-An Efficient Algorithm for Unsupervised Word Segmentation with Branching Entropy and MDL

18 0.19034314 45 emnlp-2010-Evaluating Models of Latent Document Semantics in the Presence of OCR Errors

19 0.18565938 91 emnlp-2010-Practical Linguistic Steganography Using Contextual Synonym Substitution and Vertex Colour Coding

20 0.171764 16 emnlp-2010-An Approach of Generating Personalized Views from Normalized Electronic Dictionaries : A Practical Experiment on Arabic Language


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.014), (10, 0.015), (12, 0.028), (29, 0.071), (30, 0.477), (32, 0.012), (52, 0.014), (56, 0.04), (62, 0.014), (66, 0.089), (72, 0.055), (76, 0.024), (82, 0.03), (87, 0.011), (89, 0.016)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92162871 112 emnlp-2010-Unsupervised Discovery of Negative Categories in Lexicon Bootstrapping

Author: Tara McIntosh

Abstract: Multi-category bootstrapping algorithms were developed to reduce semantic drift. By extracting multiple semantic lexicons simultaneously, a category’s search space may be restricted. The best results have been achieved through reliance on manually crafted negative categories. Unfortunately, identifying these categories is non-trivial, and their use shifts the unsupervised bootstrapping paradigm towards a supervised framework. We present NEG-FINDER, the first approach for discovering negative categories automatically. NEG-FINDER exploits unsupervised term clustering to generate multiple negative categories during bootstrapping. Our algorithm effectively removes the necessity of manual intervention and formulation of negative categories, with performance closely approaching that obtained using negative categories defined by a domain expert.

2 0.754067 47 emnlp-2010-Example-Based Paraphrasing for Improved Phrase-Based Statistical Machine Translation

Author: Aurelien Max

Abstract: In this article, an original view on how to improve phrase translation estimates is proposed. This proposal is grounded on two main ideas: first, that appropriate examples of a given phrase should participate more in building its translation distribution; second, that paraphrases can be used to better estimate this distribution. Initial experiments provide evidence of the potential of our approach and its implementation for effectively improving translation performance.

3 0.7456668 100 emnlp-2010-Staying Informed: Supervised and Semi-Supervised Multi-View Topical Analysis of Ideological Perspective

Author: Amr Ahmed ; Eric Xing

Abstract: With the proliferation of user-generated articles over the web, it becomes imperative to develop automated methods that are aware of the ideological-bias implicit in a document collection. While there exist methods that can classify the ideological bias of a given document, little has been done toward understanding the nature of this bias on a topical-level. In this paper we address the problem ofmodeling ideological perspective on a topical level using a factored topic model. We develop efficient inference algorithms using Collapsed Gibbs sampling for posterior inference, and give various evaluations and illustrations of the utility of our model on various document collections with promising results. Finally we give a Metropolis-Hasting inference algorithm for a semi-supervised extension with decent results.

4 0.51433182 31 emnlp-2010-Constraints Based Taxonomic Relation Classification

Author: Quang Do ; Dan Roth

Abstract: Determining whether two terms in text have an ancestor relation (e.g. Toyota and car) or a sibling relation (e.g. Toyota and Honda) is an essential component of textual inference in NLP applications such as Question Answering, Summarization, and Recognizing Textual Entailment. Significant work has been done on developing stationary knowledge sources that could potentially support these tasks, but these resources often suffer from low coverage, noise, and are inflexible when needed to support terms that are not identical to those placed in them, making their use as general purpose background knowledge resources difficult. In this paper, rather than building a stationary hierarchical structure of terms and relations, we describe a system that, given two terms, determines the taxonomic relation between them using a machine learning-based approach that makes use of existing resources. Moreover, we develop a global constraint opti- mization inference process and use it to leverage an existing knowledge base also to enforce relational constraints among terms and thus improve the classifier predictions. Our experimental evaluation shows that our approach significantly outperforms other systems built upon existing well-known knowledge sources.

5 0.44162217 6 emnlp-2010-A Latent Variable Model for Geographic Lexical Variation

Author: Jacob Eisenstein ; Brendan O'Connor ; Noah A. Smith ; Eric P. Xing

Abstract: The rapid growth of geotagged social media raises new computational possibilities for investigating geographic linguistic variation. In this paper, we present a multi-level generative model that reasons jointly about latent topics and geographical regions. High-level topics such as “sports” or “entertainment” are rendered differently in each geographic region, revealing topic-specific regional distinctions. Applied to a new dataset of geotagged microblogs, our model recovers coherent topics and their regional variants, while identifying geographic areas of linguistic consistency. The model also enables prediction of an author’s geographic location from raw text, outperforming both text regression and supervised topic models.

6 0.43950653 89 emnlp-2010-PEM: A Paraphrase Evaluation Metric Exploiting Parallel Texts

7 0.43477622 32 emnlp-2010-Context Comparison of Bursty Events in Web Search and Online Media

8 0.43039566 63 emnlp-2010-Improving Translation via Targeted Paraphrasing

9 0.41373435 85 emnlp-2010-Negative Training Data Can be Harmful to Text Classification

10 0.40785533 58 emnlp-2010-Holistic Sentiment Analysis Across Languages: Multilingual Supervised Latent Dirichlet Allocation

11 0.39971724 50 emnlp-2010-Facilitating Translation Using Source Language Paraphrase Lattices

12 0.39806378 12 emnlp-2010-A Semi-Supervised Method to Learn and Construct Taxonomies Using the Web

13 0.38581195 78 emnlp-2010-Minimum Error Rate Training by Sampling the Translation Lattice

14 0.38226882 120 emnlp-2010-What's with the Attitude? Identifying Sentences with Attitude in Online Discussions

15 0.3807469 51 emnlp-2010-Function-Based Question Classification for General QA

16 0.37487242 55 emnlp-2010-Handling Noisy Queries in Cross Language FAQ Retrieval

17 0.37186962 72 emnlp-2010-Learning First-Order Horn Clauses from Web Text

18 0.37166589 105 emnlp-2010-Title Generation with Quasi-Synchronous Grammar

19 0.37089732 23 emnlp-2010-Automatic Keyphrase Extraction via Topic Decomposition

20 0.37037686 45 emnlp-2010-Evaluating Models of Latent Document Semantics in the Presence of OCR Errors