emnlp emnlp2010 emnlp2010-85 knowledge-graph by maker-knowledge-mining

85 emnlp-2010-Negative Training Data Can be Harmful to Text Classification


Source: pdf

Author: Xiao-Li Li ; Bing Liu ; See-Kiong Ng

Abstract: This paper studies the effects of training data on binary text classification and postulates that negative training data is not needed and may even be harmful for the task. Traditional binary classification involves building a classifier using labeled positive and negative training examples. The classifier is then applied to classify test instances into positive and negative classes. A fundamental assumption is that the training and test data are identically distributed. However, this assumption may not hold in practice. In this paper, we study a particular problem where the positive data is identically distributed but the negative data may or may not be so. Many practical text classification and retrieval applications fit this model. We argue that in this setting negative training data should not be used, and that PU learning can be employed to solve the problem. Empirical evaluation has been con- ducted to support our claim. This result is important as it may fundamentally change the current binary classification paradigm.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: This paper studies the effects of training data on binary text classification and postulates that negative training data is not needed and may even be harmful for the task. [sent-4, score-0.528]

2 Traditional binary classification involves building a classifier using labeled positive and negative training examples. [sent-5, score-0.826]

3 The classifier is then applied to classify test instances into positive and negative classes. [sent-6, score-0.859]

4 In this paper, we study a particular problem where the positive data is identically distributed but the negative data may or may not be so. [sent-9, score-0.585]

5 We argue that in this setting negative training data should not be used, and that PU learning can be employed to solve the problem. [sent-11, score-0.435]

6 A learning algorithm (e.g., Support Vector Machines (SVM) or the naïve Bayesian classifier (NB)) is applied to the training examples to build a classifier that is subsequently employed to assign class labels to the instances in the test set. [sent-18, score-0.51]

7 In this paper, we study another special case of the problem in which the positive training and test samples have identical distributions, but the negative training and test samples may have different distributions. [sent-40, score-0.793]

8 As the focus in many applications is on identifying positive instances correctly, it is important that the positive training and the positive test data have the same distribution. [sent-42, score-0.746]

9 The distributions of the negative training and negative test data can be different. [sent-43, score-0.831]

10 The positive and negative training instances are governed by different unknown distributions, p(x|λ) and p(x|δ) respectively. [sent-49, score-0.715]

11 The element yi of the vector y = (y1, y2, …, yk) is the class label for training instance xi (yi ∈ {+1, -1}, where +1 and -1 denote the positive and negative classes respectively) and is drawn based on an unknown target concept p(y|x). [sent-52, score-0.739]

12 The (hidden) positive test instances in XT are also governed by the unknown distribution p(x|λ), but the (hidden) negative test instances in XT are governed by an unknown distribution p(x|θ), where θ may or may not be the same as δ. [sent-54, score-0.763]
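
In symbols, the setting can be restated as follows (a paraphrase of the assumptions above using the paper's own λ, δ, θ notation, not new material):

    % Training data: positives and negatives come from different unknown densities
    x \sim p(x \mid \lambda) \quad \text{(positive training instances)}
    x \sim p(x \mid \delta)  \quad \text{(negative training instances)}
    % Test data: positives keep the training distribution, negatives may not
    x \sim p(x \mid \lambda) \quad \text{(positive test instances)}
    x \sim p(x \mid \theta)  \quad \text{(negative test instances, possibly } \theta \neq \delta \text{)}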

13 One can consider labeling the negative data in each environment individually so that only the negative instances relevant to the testing environment are used to train the classifier. [sent-69, score-0.76]

14 It is clearly impractical to label all the negative data. [sent-73, score-0.395]

15 In this paper, we show that our special case of the sample selection bias problem can be solved in a much simpler and somewhat radical manner—by simply discarding the negative training data altogether. [sent-77, score-0.592]

16 We can use the positive training data and the unlabeled test data to build the classifier using the PU learning model (Liu et al. [sent-78, score-0.604]

17 PU learning was originally proposed to solve the learning problem where no labeled negative training data exist. [sent-80, score-0.479]

18 Several algorithms have been developed in the past few years that can learn from a set of labeled positive examples augmented with a set of unlabeled examples. [sent-81, score-0.391]

19 Our experimental evaluation shows that when the distributions of the negative training and test samples are different, PU learning is much more accurate than traditional supervised learning from the positive and negative training samples. [sent-86, score-1.257]

20 This means that the negative training data actually harms classification in this case. [sent-87, score-0.46]

21 In addition, when the distributions of the negative training and test samples are identical, PU learning is shown to perform as well as supervised learning, which means that the negative training data is not needed. [sent-88, score-0.959]

22 Third, it experimentally demonstrates the effectiveness of the proposed method and shows that negative training data is not needed and can even be harmful. [sent-92, score-0.391]

23 In this paper, we adopt an entirely different approach by dropping the negative training data altogether in learning. [sent-111, score-0.391]

24 Without the negative training data, we use PU learning to solve the problem (Liu et al. [sent-112, score-0.435]

25 3 PU Learning Techniques. In traditional supervised learning, there is ideally a large number of labeled positive and negative examples for learning. [sent-132, score-0.676]

26 In practice, the negative examples can often be limited or unavailable. [sent-133, score-0.398]

27 This has motivated the development of the model of learning from positive and unlabeled examples, or PU learning, where P denotes a set of positive examples, and U a set of unlabeled examples (which contains both hidden positive and hidden negative instances). [sent-134, score-1.438]

28 The PU learning problem is to build a classifier using P and U in the absence of negative examples, to classify the data in U or future test data T. [sent-135, score-0.686]

29 The first step uses a spy technique to identify some reliable negatives (RN) from the unlabeled set U and the second step uses the EM algorithm to learn a Bayesian classifier from P, RN and U–RN. [sent-159, score-0.693]

30 Step 1: Extracting reliable negatives RN from U using a spy technique. The spy technique in S-EM works as follows (Figure 1): First, a small set of positive examples (denoted by SP) called “spies” is randomly sampled from P (line 2). [sent-160, score-0.679]

31 Then, an NB classifier is built using P–SP as the positive set and U ∪ SP as the negative set (lines 3-5). [sent-162, score-0.731]

32 Since the spy examples were from P and were put into U as negatives in building the NB classifier, they should behave similarly to the hidden positive instances in U. [sent-167, score-0.571]

33 We thus can use them to find the reliable negative set RN from U. [sent-168, score-0.452]
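
As an illustration, here is a minimal Python sketch of Step 1, assuming documents are rows of a dense bag-of-words count matrix and using scikit-learn's multinomial NB; the spy fraction and the minimum-spy-probability threshold rule are illustrative choices, not values prescribed by S-EM (which uses a noise-level parameter instead).

    import numpy as np
    from sklearn.naive_bayes import MultinomialNB

    def spy_reliable_negatives(P, U, spy_frac=0.1, seed=0):
        """Step 1 of S-EM (sketch): extract reliable negatives RN from U."""
        rng = np.random.default_rng(seed)
        spy_idx = rng.choice(len(P), size=max(1, int(spy_frac * len(P))),
                             replace=False)
        SP = P[spy_idx]                                  # the "spies" (line 2)
        P_minus_SP = np.delete(P, spy_idx, axis=0)

        # Build NB with P-SP as positives and U plus SP as noisy negatives.
        X = np.vstack([P_minus_SP, U, SP])
        y = np.array([1] * len(P_minus_SP) + [0] * (len(U) + len(SP)))
        nb = MultinomialNB().fit(X, y)

        # Spies behave like the hidden positives in U, so documents in U
        # scoring below every spy are very likely true negatives.
        threshold = nb.predict_proba(SP)[:, 1].min()     # illustrative rule
        return nb.predict_proba(U)[:, 1] < threshold     # mask of RN rows in U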

34 Figure 1. Spy technique for extracting RN from U. Step 2: Learning using the EM algorithm. Given the positive set P, the reliable negative set RN, and the remaining unlabeled set U–RN, we run EM using NB as the base learning algorithm. [sent-185, score-0.854]

35 Given a set of training documents D, each document di ∈ D is an ordered list of words. [sent-190, score-0.4]

36 If p(c1|di) > 0.5, then output di as a positive document; else output di as a negative document (Figure 2). [sent-205, score-0.957]

37 We also have a set of classes C = {c1, c2} representing positive and negative classes. [sent-207, score-0.609]
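
Continuing the sketch, the EM step can be realized with NB sample weights: P and RN keep hard labels, while each document in Q = U–RN appears under both classes with soft weights that the E-step updates. The weight-duplication trick and the fixed iteration count are implementation choices, not part of the paper's specification.

    import numpy as np
    from sklearn.naive_bayes import MultinomialNB

    def em_nb(P, RN, Q, n_iter=10):
        """Step 2 of S-EM (sketch): EM with NB over P, RN and Q = U - RN."""
        X = np.vstack([P, RN, Q, Q])              # Q listed once per class
        y = np.array([1] * len(P) + [0] * len(RN)
                     + [1] * len(Q) + [0] * len(Q))
        w = np.concatenate([np.ones(len(P) + len(RN)),
                            np.full(2 * len(Q), 0.5)])   # soft start for Q
        q0 = len(P) + len(RN)                     # offset of Q's weights
        nb = MultinomialNB()
        for _ in range(n_iter):
            nb.fit(X, y, sample_weight=w)         # M-step
            p_pos = nb.predict_proba(Q)[:, 1]     # E-step: p(c1 | di)
            w[q0:q0 + len(Q)] = p_pos
            w[q0 + len(Q):] = 1.0 - p_pos
        return nb             # classify di as positive iff p(c1|di) > 0.5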

38 In our setting, however, the negative class often has documents of mixed topics, e. [sent-229, score-0.616]

39 In particular, this method treats the entire unlabeled set U as negative documents and then uses the positive set P and the unlabeled set U as the training data to build a Rocchio classifier. [sent-237, score-1.064]

40 Those documents that are classified as negative are then considered as reliable negative examples RN. [sent-239, score-1.057]

41 A positive representative vector (pr) is built by summing up the document vectors in P and normalizing the result (lines 3-5). [sent-263, score-0.397]

42 We want to filter away as many hidden positive documents from U as possible so that we can obtain a very pure negative set. [sent-266, score-0.792]

43 (Figure caption: Identifying RN using the Rocchio classifier.) The cosine similarity of a document dj in P to pr could be near 0, or smaller than that of most (or even all) negative documents. [sent-277, score-0.61]

44 It would therefore be prudent to ignore a small percentage l of the documents in P that are most dissimilar to the representative positive vector (pr), treating them as noise or outliers. [sent-278, score-0.426]

45 Then, for each document di in U, if its cosine similarity cos(pr, di) < ω, we regard it as a potential negative and store it in PN (lines 10-12). [sent-282, score-0.588]
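
A Python sketch of this sub-step, assuming each document is an L2-normalized row vector so that dot products equal cosine similarities; the default value of the noise percentage l is illustrative.

    import numpy as np

    def extract_pn(P, U, l=0.05):
        """Sub-step 1 (sketch): potential negatives PN from U."""
        pr = P.sum(axis=0)
        pr = pr / np.linalg.norm(pr)   # positive representative vector (lines 3-5)
        sims = P @ pr                  # cos(pr, dj) for each dj in P
        # Drop the l fraction of P least similar to pr as noise/outliers;
        # omega is the smallest similarity among the documents kept.
        omega = np.sort(sims)[int(l * len(sims))]
        return U[(U @ pr) < omega]     # PN (lines 10-12)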

46 Sub-step 2 (extracting the final reliable negative set RN from U using Rocchio with PN): At this point, we have a positive set P and a potential negative set PN where PN is a purer negative set than U. [sent-285, score-1.453]

47 Those documents in U that are classified as negatives by RC will then be regarded as reliable negatives, and stored in set RN. [sent-287, score-0.471]

48 Following the Rocchio formula, positive and negative prototype vectors p and n are built (lines 3 and 4), which are used to classify the documents in U (lines 5-7). [sent-289, score-0.83]

49 α and β are parameters for adjusting the relative impact of the positive and negative examples. [sent-290, score-0.585]
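
A sketch of the Rocchio classifier RC built from P and PN follows; α = 16 and β = 4 are conventional Rocchio defaults, used here only as placeholders.

    import numpy as np

    def rocchio_rn(P, PN, U, alpha=16.0, beta=4.0):
        """Sub-step 2 (sketch): reliable negatives RN from U via Rocchio."""
        def centroid(M):
            c = M.mean(axis=0)
            return c / np.linalg.norm(c)
        p = alpha * centroid(P) - beta * centroid(PN)   # positive prototype
        n = alpha * centroid(PN) - beta * centroid(P)   # negative prototype
        # A document in U is a reliable negative if it is closer (by cosine)
        # to the negative prototype than to the positive one.
        cos_p = (U @ p) / np.linalg.norm(p)
        cos_n = (U @ n) / np.linalg.norm(n)
        return U[cos_n > cos_p]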

50 Step 2: Learning by running SVM iteratively. This step is similar to that in Roc-SVM, building the final classifier by running SVM iteratively with the sets P, RN and the remaining unlabeled set Q (Q = U – RN). [sent-293, score-0.388]

51 We run SVM classifiers Si (line 3) iteratively to extract more and more negative documents from Q. [sent-295, score-0.566]

52 The iteration stops when no more negative documents can be extracted from Q (line 5). [sent-296, score-0.542]

53 It is possible that during some iteration, SVM is misled by noisy data to extract many positive documents from Q and put them in the negative set RN. [sent-298, score-0.762]

54 To detect this, we use the final SVM classifier obtained at convergence (called Slast, line 9) to classify the positive set P and see if many positive documents in P are classified as negatives. [sent-301, score-0.885]

55 If more than 5% of the positive documents (5%·|P|) in P are classified as negative, it indicates that SVM has gone wrong and we should use the first SVM classifier (S1). [sent-303, score-0.573]
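
A sketch of the whole step, with scikit-learn's LinearSVC standing in for the SVM and the 5% catch-up check implemented as described above; initial RN is assumed non-empty.

    import numpy as np
    from sklearn.svm import LinearSVC

    def iterative_svm(P, RN, Q):
        """Step 2 (sketch): grow RN by iterating SVM; fall back to the first
        classifier S1 if the last one misclassifies too many positives."""
        s1 = None
        while True:
            X = np.vstack([P, RN])
            y = np.array([1] * len(P) + [-1] * len(RN))
            clf = LinearSVC().fit(X, y)
            if s1 is None:
                s1 = clf                  # remember the first classifier S1
            if len(Q) == 0:
                break
            pred = clf.predict(Q)
            W = Q[pred == -1]             # documents in Q classified negative
            if len(W) == 0:               # no more negatives extracted: stop
                break
            RN = np.vstack([RN, W])       # move W from Q to RN
            Q = Q[pred == 1]
        s_last = clf
        # Catch-up check: if more than 5% of P is classified as negative,
        # SVM has gone wrong and we use the first classifier instead.
        if (s_last.predict(P) == -1).mean() > 0.05:
            return s1
        return s_last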

56 Since PN is clearly a purer negative set than U, the use of PN by CR-SVM helps extract a better-quality reliable negative set RN, which subsequently allows the final classifier of CR-SVM to give better results than Roc-SVM. [sent-307, score-1.039]

57 CR-EM actually works quite well, as it is also able to exploit the more accurate reliable negative set RN extracted using cosine similarity and Rocchio. [sent-312, score-0.501]

58 4 Empirical Evaluation. We now present the experimental results to support our claim that negative training data is not needed and can even harm text classification. [sent-313, score-0.391]

59 The following methods are compared: (1) the traditional supervised learning methods SVM and NB, which use both positive and negative training data; (2) PU learning methods, including the two existing methods S-EM and Roc-SVM and the two new methods CR-SVM and CR-EM; and (3) one-class SVM (Schölkopf et al. [sent-315, score-0.757]

60 , 1999) where only positive training data is used in learning (the unlabeled set is not used at all). [sent-316, score-0.428]
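
For reference, the one-class baseline can be sketched with scikit-learn's OneClassSVM, fit on the positive training data alone; the kernel and nu values are illustrative, not the paper's settings.

    from sklearn.svm import OneClassSVM

    def one_class_baseline(P, X_test, nu=0.1):
        """One-class SVM (sketch): trained on positives only; neither the
        negative nor the unlabeled data is used. Returns +1/-1 per document."""
        return OneClassSVM(kernel="linear", nu=nu).fit(P).predict(X_test)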

61 This set of experiments simulates the scenario in which the negative training and test samples have different distributions. [sent-329, score-0.455]

62 We select positive, negative and other topic documents for Reuters and 20 Newsgroup, and produce various data sets. [sent-330, score-0.624]

63 Let the set of documents in Q that are classified as negative be W. [sent-351, score-0.572]

64 If more than 5% of the positives are classified as negative … [sent-357, score-0.395]

65 (Figure caption: Constructing the final classifier using SVM.) PU learning can perform better than traditional learning that uses both positive and negative training data. [sent-360, score-0.883]

66 We randomly select one or two of the remaining categories as the negative class (denoted by Neg 1 or Neg 2), and then we randomly choose some documents from the rest of the categories as other topic documents. [sent-362, score-0.698]

67 These other topic documents are regarded as negatives and added to the test set but not to the negative training data. [sent-363, score-0.857]

68 They thus introduce a different distribution to the negative test data. [sent-364, score-0.395]

69 We are able to simulate two scenarios: (1) the other topic documents are similar to the negative class documents (similar case), and (2) the other topic documents are quite different from the negative class documents (different case). [sent-367, score-1.773]

70 This allows us to investigate whether the classification results will be affected when the other topic documents are somewhat similar to, or vastly different from, the negative training set. [sent-368, score-0.719]

71 To create the training and test data for our experiments, we randomly select one sub-category from a main category (cat 1) as the positive class, and one (or two) sub-categories from another category (cat 2) as the negative class (again denoted by Neg 1 or Neg 2). [sent-369, score-0.715]

72 The training and test sets are then constructed as follows: we partition the positive (and similarly for the negative) class documents into two standard subsets: 70% for training and 30% for testing. [sent-379, score-0.553]

73 In order to create different experimental settings, we vary the number of the other topic documents that are added to the test set as negatives, controlled by a parameter α, which is a percentage of |TN|, where |TN| is the size of the negative test set without the other topic documents. [sent-380, score-0.766]
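
A sketch of how such a shifted negative test set can be assembled; the document matrices and the pool of other-topic documents are assumed inputs, and α matches the parameter above.

    import numpy as np

    def shift_negative_test(TN, other_topic_pool, alpha, seed=0):
        """Add alpha * |TN| other-topic documents to the negative test set,
        shifting its distribution away from the negative training data."""
        rng = np.random.default_rng(seed)
        n_extra = int(alpha * len(TN))
        idx = rng.choice(len(other_topic_pool), size=n_extra, replace=False)
        return np.vstack([TN, other_topic_pool[idx]])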

74 Here we want to show that PU learning can do equally well without using the negative training data even in the traditional setting. [sent-385, score-0.517]

75 the distributions of the negative training and test data are different (caused by the inclusion of other topic documents in the test set, or the addition of other topic documents to complement existing negatives in the test set). [sent-389, score-1.221]

76 4.1 Results on the Reuters data. Figure 6 shows the comparison results when the negative class contains only one category of documents (Neg 1), while Figure 7 shows the results when the negative class contains documents from two categories (Neg 2) in the Reuters collection. [sent-393, score-1.232]

77 When the size of the other topic documents (x-axis) in the test set increases, the F-scores of the two traditional learning methods, SVM and NB, decrease much more dramatically than those of the PU learning techniques. [sent-396, score-0.435]

78 The EM-based methods (CR-EM and S-EM) performed well when there was only one negative class (Figure 6). [sent-400, score-0.439]

79 However, they did not do well when there were two negative classes (Figure 7), due to the underlying mixture-model assumption of the naïve Bayesian classifier. [sent-401, score-0.422]

80 Similar case: Here, the other topic documents are similar to the negative class documents, as they belong to the same main category. [sent-406, score-0.698]

81 Again, the F-scores of the traditional supervised learning methods (SVM and NB) deteriorated when more other topic documents were added to the test set, while CR-EM, S-EM and CR-SVM remained unaffected, maintaining roughly constant F-scores. [sent-410, score-0.391]

82 When the negative class contained documents from two categories (Neg 2), the F-scores of the traditional learning methods dropped even more rapidly. [sent-411, score-0.718]

83 Different case: In this case, the other topic documents are quite different from the negative class documents, since they originate from different main categories. [sent-413, score-0.721]

84 As the other topic documents have very different distributions from the negatives in the training set in this case, they greatly confused the traditional classifiers. [sent-416, score-0.565]

85 In summary, the results showed that learning with negative training data based on the traditional paradigm actually harms classification when the identical distribution assumption does not hold. [sent-418, score-0.595]

86 Note that for PU learning, the negative training data were not used. [sent-426, score-0.391]

87 The traditional supervised learning techniques (SVM and NB), which made full use of the positive and negative training data, performed only about 1-2% better than the PU learning method CR-SVM (a difference that is not statistically significant under a paired t-test). [sent-427, score-0.757]

88 This suggests that we can do away with negative training data, since PU learning can perform equally well without them. [sent-428, score-0.459]

89 This has practical importance, since full coverage of the negative training data is hard to find and label in many applications. [sent-429, score-0.421]

90 From the results in Figures 6–11 and Table 1, we can conclude that PU learning can be used for binary text classification without the negative training data (which can be harmful for the task). [sent-430, score-0.546]

91 5 Conclusions. This paper studied a special case of the sample selection bias problem in which the positive training and test distributions are the same, but the negative training and test distributions may be different. [sent-437, score-0.988]

92 We showed that in this case, the negative training data should not be used in learning, and PU learning can be applied to this setting. [sent-438, score-0.435]

93 Our experiments showed that the traditional classification methods suffered greatly when the distributions of the negative training and test data are different, but PU learning did not. [sent-440, score-0.637]

94 As such, it can be advantageous to discard the potentially harmful negative training data and use PU learning for classification. [sent-442, score-0.477]

95 In our future work, we plan to do more comprehensive experiments to compare the classic supervised learning and PU learning techniques with different kinds of settings, for example, by varying the ratio between positive and negative examples, as well as their sizes. [sent-443, score-0.673]

96 Finally, we would like to point out that it is conceivable that negative training data could still be useful in many cases. [sent-445, score-0.391]

97 An interesting direction to explore is to somehow combine the reliable negative data extracted from the unlabeled set with the existing negative training data to further enhance learning algorithms. [sent-446, score-1.025]

98 Text classification and co-training from positive and unlabeled examples. [sent-507, score-0.427]

99 Learning to classify texts using positive and unlabeled data, IJCAI. [sent-568, score-0.426]

100 Text classification from labeled and unlabeled documents using EM. [sent-611, score-0.384]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('pu', 0.486), ('negative', 0.365), ('rn', 0.257), ('positive', 0.22), ('nb', 0.198), ('documents', 0.177), ('negatives', 0.177), ('di', 0.148), ('classifier', 0.146), ('unlabeled', 0.138), ('newsgroup', 0.135), ('svm', 0.13), ('bickel', 0.122), ('pr', 0.121), ('pn', 0.12), ('neg', 0.106), ('rocchio', 0.104), ('dj', 0.099), ('cos', 0.096), ('covariate', 0.095), ('liu', 0.091), ('bias', 0.089), ('sp', 0.088), ('reliable', 0.087), ('topic', 0.082), ('sugiyama', 0.081), ('spy', 0.081), ('class', 0.074), ('classification', 0.069), ('classify', 0.068), ('cj', 0.066), ('traditional', 0.058), ('em', 0.057), ('elkan', 0.054), ('noto', 0.054), ('shimodaira', 0.054), ('denis', 0.054), ('li', 0.053), ('transfer', 0.052), ('spam', 0.052), ('document', 0.049), ('distributions', 0.045), ('sample', 0.044), ('learning', 0.044), ('nigam', 0.042), ('harmful', 0.042), ('ddjj', 0.041), ('heckman', 0.041), ('slast', 0.041), ('zadrozny', 0.041), ('selection', 0.04), ('lines', 0.04), ('yu', 0.038), ('cat', 0.036), ('salton', 0.036), ('sentiment', 0.036), ('samples', 0.034), ('reuters', 0.034), ('density', 0.034), ('assumption', 0.033), ('examples', 0.033), ('step', 0.032), ('shift', 0.032), ('dempster', 0.031), ('test', 0.03), ('instances', 0.03), ('label', 0.03), ('papers', 0.03), ('classified', 0.03), ('hidden', 0.03), ('noise', 0.029), ('governed', 0.029), ('special', 0.028), ('else', 0.027), ('bollmann', 0.027), ('cherniavsky', 0.027), ('connexis', 0.027), ('ddii', 0.027), ('dudik', 0.027), ('kashima', 0.027), ('lkop', 0.027), ('muller', 0.027), ('osvm', 0.027), ('pac', 0.027), ('purer', 0.027), ('rocsvm', 0.027), ('spies', 0.027), ('emails', 0.027), ('training', 0.026), ('cosine', 0.026), ('ample', 0.025), ('equations', 0.025), ('subsequently', 0.025), ('equally', 0.024), ('iteratively', 0.024), ('kdd', 0.024), ('classes', 0.024), ('final', 0.024), ('quite', 0.023), ('tsuboi', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999905 85 emnlp-2010-Negative Training Data Can be Harmful to Text Classification

Author: Xiao-Li Li ; Bing Liu ; See-Kiong Ng

Abstract: This paper studies the effects of training data on binary text classification and postulates that negative training data is not needed and may even be harmful for the task. Traditional binary classification involves building a classifier using labeled positive and negative training examples. The classifier is then applied to classify test instances into positive and negative classes. A fundamental assumption is that the training and test data are identically distributed. However, this assumption may not hold in practice. In this paper, we study a particular problem where the positive data is identically distributed but the negative data may or may not be so. Many practical text classification and retrieval applications fit this model. We argue that in this setting negative training data should not be used, and that PU learning can be employed to solve the problem. Empirical evaluation has been con- ducted to support our claim. This result is important as it may fundamentally change the current binary classification paradigm.

2 0.19865093 112 emnlp-2010-Unsupervised Discovery of Negative Categories in Lexicon Bootstrapping

Author: Tara McIntosh

Abstract: Multi-category bootstrapping algorithms were developed to reduce semantic drift. By extracting multiple semantic lexicons simultaneously, a category’s search space may be restricted. The best results have been achieved through reliance on manually crafted negative categories. Unfortunately, identifying these categories is non-trivial, and their use shifts the unsupervised bootstrapping paradigm towards a supervised framework. We present NEG-FINDER, the first approach for discovering negative categories automatically. NEG-FINDER exploits unsupervised term clustering to generate multiple negative categories during bootstrapping. Our algorithm effectively removes the necessity of manual intervention and formulation of negative categories, with performance closely approaching that obtained using negative categories defined by a domain expert.

3 0.16350991 33 emnlp-2010-Cross Language Text Classification by Model Translation and Semi-Supervised Learning

Author: Lei Shi ; Rada Mihalcea ; Mingjun Tian

Abstract: In this paper, we introduce a method that automatically builds text classifiers in a new language by training on already labeled data in another language. Our method transfers the classification knowledge across languages by translating the model features and by using an Expectation Maximization (EM) algorithm that naturally takes into account the ambiguity associated with the translation of a word. We further exploit the readily available unlabeled data in the target language via semisupervised learning, and adapt the translated model to better fit the data distribution of the target language.

4 0.10909205 119 emnlp-2010-We're Not in Kansas Anymore: Detecting Domain Changes in Streams

Author: Mark Dredze ; Tim Oates ; Christine Piatko

Abstract: Domain adaptation, the problem of adapting a natural language processing system trained in one domain to perform well in a different domain, has received significant attention. This paper addresses an important problem for deployed systems that has received little attention – detecting when such adaptation is needed by a system operating in the wild, i.e., performing classification over a stream of unlabeled examples. Our method uses A-distance, a metric for detecting shifts in data streams, combined with classification margins to detect domain shifts. We empirically show effective domain shift detection on a variety of data sets and shift conditions.

5 0.098230034 100 emnlp-2010-Staying Informed: Supervised and Semi-Supervised Multi-View Topical Analysis of Ideological Perspective

Author: Amr Ahmed ; Eric Xing

Abstract: With the proliferation of user-generated articles over the web, it becomes imperative to develop automated methods that are aware of the ideological-bias implicit in a document collection. While there exist methods that can classify the ideological bias of a given document, little has been done toward understanding the nature of this bias on a topical-level. In this paper we address the problem ofmodeling ideological perspective on a topical level using a factored topic model. We develop efficient inference algorithms using Collapsed Gibbs sampling for posterior inference, and give various evaluations and illustrations of the utility of our model on various document collections with promising results. Finally we give a Metropolis-Hasting inference algorithm for a semi-supervised extension with decent results.

6 0.09778183 64 emnlp-2010-Incorporating Content Structure into Text Analysis Applications

7 0.091080323 40 emnlp-2010-Effects of Empty Categories on Machine Translation

8 0.088539168 109 emnlp-2010-Translingual Document Representations from Discriminative Projections

9 0.087087579 120 emnlp-2010-What's with the Attitude? Identifying Sentences with Attitude in Online Discussions

10 0.084817946 58 emnlp-2010-Holistic Sentiment Analysis Across Languages: Multilingual Supervised Latent Dirichlet Allocation

11 0.08179839 104 emnlp-2010-The Necessity of Combining Adaptation Methods

12 0.07789357 61 emnlp-2010-Improving Gender Classification of Blog Authors

13 0.075494468 24 emnlp-2010-Automatically Producing Plot Unit Representations for Narrative Text

14 0.073329233 11 emnlp-2010-A Semi-Supervised Approach to Improve Classification of Infrequent Discourse Relations Using Feature Vector Extension

15 0.071163446 41 emnlp-2010-Efficient Graph-Based Semi-Supervised Learning of Structured Tagging Models

16 0.067259625 83 emnlp-2010-Multi-Level Structured Models for Document-Level Sentiment Classification

17 0.056825899 39 emnlp-2010-EMNLP 044

18 0.052486658 84 emnlp-2010-NLP on Spoken Documents Without ASR

19 0.05094919 45 emnlp-2010-Evaluating Models of Latent Document Semantics in the Presence of OCR Errors

20 0.049928941 27 emnlp-2010-Clustering-Based Stratified Seed Sampling for Semi-Supervised Relation Classification


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.199), (1, 0.135), (2, -0.175), (3, 0.025), (4, 0.073), (5, 0.024), (6, 0.143), (7, 0.062), (8, -0.037), (9, 0.12), (10, 0.17), (11, 0.025), (12, -0.207), (13, -0.033), (14, 0.027), (15, 0.17), (16, 0.064), (17, 0.111), (18, -0.221), (19, -0.029), (20, -0.238), (21, -0.039), (22, -0.192), (23, -0.142), (24, 0.068), (25, 0.062), (26, -0.026), (27, 0.044), (28, 0.008), (29, 0.093), (30, 0.084), (31, -0.121), (32, -0.068), (33, -0.014), (34, 0.084), (35, -0.162), (36, -0.064), (37, -0.025), (38, 0.054), (39, -0.104), (40, -0.016), (41, 0.011), (42, 0.057), (43, -0.026), (44, -0.022), (45, -0.08), (46, 0.033), (47, 0.05), (48, 0.003), (49, 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97645932 85 emnlp-2010-Negative Training Data Can be Harmful to Text Classification

Author: Xiao-Li Li ; Bing Liu ; See-Kiong Ng

Abstract: This paper studies the effects of training data on binary text classification and postulates that negative training data is not needed and may even be harmful for the task. Traditional binary classification involves building a classifier using labeled positive and negative training examples. The classifier is then applied to classify test instances into positive and negative classes. A fundamental assumption is that the training and test data are identically distributed. However, this assumption may not hold in practice. In this paper, we study a particular problem where the positive data is identically distributed but the negative data may or may not be so. Many practical text classification and retrieval applications fit this model. We argue that in this setting negative training data should not be used, and that PU learning can be employed to solve the problem. Empirical evaluation has been con- ducted to support our claim. This result is important as it may fundamentally change the current binary classification paradigm.

2 0.74281615 112 emnlp-2010-Unsupervised Discovery of Negative Categories in Lexicon Bootstrapping

Author: Tara McIntosh

Abstract: Multi-category bootstrapping algorithms were developed to reduce semantic drift. By extracting multiple semantic lexicons simultaneously, a category’s search space may be restricted. The best results have been achieved through reliance on manually crafted negative categories. Unfortunately, identifying these categories is non-trivial, and their use shifts the unsupervised bootstrapping paradigm towards a supervised framework. We present NEG-FINDER, the first approach for discovering negative categories automatically. NEG-FINDER exploits unsupervised term clustering to generate multiple negative categories during bootstrapping. Our algorithm effectively removes the necessity of manual intervention and formulation of negative categories, with performance closely approaching that obtained using negative categories defined by a domain expert.

3 0.45001292 33 emnlp-2010-Cross Language Text Classification by Model Translation and Semi-Supervised Learning

Author: Lei Shi ; Rada Mihalcea ; Mingjun Tian

Abstract: In this paper, we introduce a method that automatically builds text classifiers in a new language by training on already labeled data in another language. Our method transfers the classification knowledge across languages by translating the model features and by using an Expectation Maximization (EM) algorithm that naturally takes into account the ambiguity associated with the translation of a word. We further exploit the readily available unlabeled data in the target language via semisupervised learning, and adapt the translated model to better fit the data distribution of the target language.

4 0.42197859 40 emnlp-2010-Effects of Empty Categories on Machine Translation

Author: Tagyoung Chung ; Daniel Gildea

Abstract: We examine effects that empty categories have on machine translation. Empty categories are elements in parse trees that lack corresponding overt surface forms (words) such as dropped pronouns and markers for control constructions. We start by training machine translation systems with manually inserted empty elements. We find that inclusion of some empty categories in training data improves the translation result. We expand the experiment by automatically inserting these elements into a larger data set using various methods and training on the modified corpus. We show that even when automatic prediction of null elements is not highly accurate, it nevertheless improves the end translation result.

5 0.4162938 109 emnlp-2010-Translingual Document Representations from Discriminative Projections

Author: John Platt ; Kristina Toutanova ; Wen-tau Yih

Abstract: Representing documents by vectors that are independent of language enhances machine translation and multilingual text categorization. We use discriminative training to create a projection of documents from multiple languages into a single translingual vector space. We explore two variants to create these projections: Oriented Principal Component Analysis (OPCA) and Coupled Probabilistic Latent Semantic Analysis (CPLSA). Both of these variants start with a basic model of documents (PCA and PLSA). Each model is then made discriminative by encouraging comparable document pairs to have similar vector representations. We evaluate these algorithms on two tasks: parallel document retrieval for Wikipedia and Europarl documents, and cross-lingual text classification on Reuters. The two discriminative variants, OPCA and CPLSA, significantly outperform their corre- sponding baselines. The largest differences in performance are observed on the task of retrieval when the documents are only comparable and not parallel. The OPCA method is shown to perform best.

6 0.41616479 119 emnlp-2010-We're Not in Kansas Anymore: Detecting Domain Changes in Streams

7 0.40763098 120 emnlp-2010-What's with the Attitude? Identifying Sentences with Attitude in Online Discussions

8 0.37893215 24 emnlp-2010-Automatically Producing Plot Unit Representations for Narrative Text

9 0.36936352 61 emnlp-2010-Improving Gender Classification of Blog Authors

10 0.32061514 45 emnlp-2010-Evaluating Models of Latent Document Semantics in the Presence of OCR Errors

11 0.31524965 100 emnlp-2010-Staying Informed: Supervised and Semi-Supervised Multi-View Topical Analysis of Ideological Perspective

12 0.29417685 41 emnlp-2010-Efficient Graph-Based Semi-Supervised Learning of Structured Tagging Models

13 0.27685574 23 emnlp-2010-Automatic Keyphrase Extraction via Topic Decomposition

14 0.2682361 104 emnlp-2010-The Necessity of Combining Adaptation Methods

15 0.2668612 90 emnlp-2010-Positional Language Models for Clinical Information Retrieval

16 0.26441976 11 emnlp-2010-A Semi-Supervised Approach to Improve Classification of Infrequent Discourse Relations Using Feature Vector Extension

17 0.26052567 58 emnlp-2010-Holistic Sentiment Analysis Across Languages: Multilingual Supervised Latent Dirichlet Allocation

18 0.25432131 84 emnlp-2010-NLP on Spoken Documents Without ASR

19 0.24744435 91 emnlp-2010-Practical Linguistic Steganography Using Contextual Synonym Substitution and Vertex Colour Coding

20 0.24681169 64 emnlp-2010-Incorporating Content Structure into Text Analysis Applications


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.021), (10, 0.021), (12, 0.044), (29, 0.087), (30, 0.074), (32, 0.015), (52, 0.018), (56, 0.099), (61, 0.271), (62, 0.012), (66, 0.157), (72, 0.043), (76, 0.018), (82, 0.011), (87, 0.018), (89, 0.012)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.76170373 85 emnlp-2010-Negative Training Data Can be Harmful to Text Classification

Author: Xiao-Li Li ; Bing Liu ; See-Kiong Ng

Abstract: This paper studies the effects of training data on binary text classification and postulates that negative training data is not needed and may even be harmful for the task. Traditional binary classification involves building a classifier using labeled positive and negative training examples. The classifier is then applied to classify test instances into positive and negative classes. A fundamental assumption is that the training and test data are identically distributed. However, this assumption may not hold in practice. In this paper, we study a particular problem where the positive data is identically distributed but the negative data may or may not be so. Many practical text classification and retrieval applications fit this model. We argue that in this setting negative training data should not be used, and that PU learning can be employed to solve the problem. Empirical evaluation has been con- ducted to support our claim. This result is important as it may fundamentally change the current binary classification paradigm.

2 0.61545569 120 emnlp-2010-What's with the Attitude? Identifying Sentences with Attitude in Online Discussions

Author: Ahmed Hassan ; Vahed Qazvinian ; Dragomir Radev

Abstract: Mining sentiment from user generated content is a very important task in Natural Language Processing. An example of such content is threaded discussions which act as a very important tool for communication and collaboration in the Web. Threaded discussions include e-mails, e-mail lists, bulletin boards, newsgroups, and Internet forums. Most of the work on sentiment analysis has been centered around finding the sentiment toward products or topics. In this work, we present a method to identify the attitude of participants in an online discussion toward one another. This would enable us to build a signed network representation of participant interaction where every edge has a sign that indicates whether the interaction is positive or negative. This is different from most of the research on social networks that has focused almost exclusively on positive links. The method is exper- imentally tested using a manually labeled set of discussion posts. The results show that the proposed method is capable of identifying attitudinal sentences, and their signs, with high accuracy and that it outperforms several other baselines.

3 0.61521327 100 emnlp-2010-Staying Informed: Supervised and Semi-Supervised Multi-View Topical Analysis of Ideological Perspective

Author: Amr Ahmed ; Eric Xing

Abstract: With the proliferation of user-generated articles over the web, it becomes imperative to develop automated methods that are aware of the ideological-bias implicit in a document collection. While there exist methods that can classify the ideological bias of a given document, little has been done toward understanding the nature of this bias on a topical-level. In this paper we address the problem ofmodeling ideological perspective on a topical level using a factored topic model. We develop efficient inference algorithms using Collapsed Gibbs sampling for posterior inference, and give various evaluations and illustrations of the utility of our model on various document collections with promising results. Finally we give a Metropolis-Hasting inference algorithm for a semi-supervised extension with decent results.

4 0.61397171 58 emnlp-2010-Holistic Sentiment Analysis Across Languages: Multilingual Supervised Latent Dirichlet Allocation

Author: Jordan Boyd-Graber ; Philip Resnik

Abstract: In this paper, we develop multilingual supervised latent Dirichlet allocation (MLSLDA), a probabilistic generative model that allows insights gleaned from one language's data to inform how the model captures properties of other languages. MLSLDA accomplishes this by jointly modeling two aspects of text: how multilingual concepts are clustered into thematically coherent topics and how topics associated with text connect to an observed regression variable (such as ratings on a sentiment scale). Concepts are represented in a general hierarchical framework that is flexible enough to express semantic ontologies, dictionaries, clustering constraints, and, as a special, degenerate case, conventional topic models. Both the topics and the regression are discovered via posterior inference from corpora. We show MLSLDA can build topics that are consistent across languages, discover sensible bilingual lexical correspondences, and leverage multilingual corpora to better predict sentiment.

5 0.60880286 31 emnlp-2010-Constraints Based Taxonomic Relation Classification

Author: Quang Do ; Dan Roth

Abstract: Determining whether two terms in text have an ancestor relation (e.g. Toyota and car) or a sibling relation (e.g. Toyota and Honda) is an essential component of textual inference in NLP applications such as Question Answering, Summarization, and Recognizing Textual Entailment. Significant work has been done on developing stationary knowledge sources that could potentially support these tasks, but these resources often suffer from low coverage, noise, and are inflexible when needed to support terms that are not identical to those placed in them, making their use as general purpose background knowledge resources difficult. In this paper, rather than building a stationary hierarchical structure of terms and relations, we describe a system that, given two terms, determines the taxonomic relation between them using a machine learning-based approach that makes use of existing resources. Moreover, we develop a global constraint opti- mization inference process and use it to leverage an existing knowledge base also to enforce relational constraints among terms and thus improve the classifier predictions. Our experimental evaluation shows that our approach significantly outperforms other systems built upon existing well-known knowledge sources.

6 0.60792011 49 emnlp-2010-Extracting Opinion Targets in a Single and Cross-Domain Setting with Conditional Random Fields

7 0.60583949 35 emnlp-2010-Discriminative Sample Selection for Statistical Machine Translation

8 0.60540622 63 emnlp-2010-Improving Translation via Targeted Paraphrasing

9 0.60071427 105 emnlp-2010-Title Generation with Quasi-Synchronous Grammar

10 0.5992853 45 emnlp-2010-Evaluating Models of Latent Document Semantics in the Presence of OCR Errors

11 0.59793347 47 emnlp-2010-Example-Based Paraphrasing for Improved Phrase-Based Statistical Machine Translation

12 0.59781319 69 emnlp-2010-Joint Training and Decoding Using Virtual Nodes for Cascaded Segmentation and Tagging Tasks

13 0.59724462 82 emnlp-2010-Multi-Document Summarization Using A* Search and Discriminative Learning

14 0.59654558 67 emnlp-2010-It Depends on the Translation: Unsupervised Dependency Parsing via Word Alignment

15 0.59262508 104 emnlp-2010-The Necessity of Combining Adaptation Methods

16 0.59212613 86 emnlp-2010-Non-Isomorphic Forest Pair Translation

17 0.59159786 107 emnlp-2010-Towards Conversation Entailment: An Empirical Investigation

18 0.59151137 43 emnlp-2010-Enhancing Domain Portability of Chinese Segmentation Model Using Chi-Square Statistics and Bootstrapping

19 0.59087294 84 emnlp-2010-NLP on Spoken Documents Without ASR

20 0.59034503 78 emnlp-2010-Minimum Error Rate Training by Sampling the Translation Lattice