emnlp emnlp2012 emnlp2012-15 knowledge-graph by maker-knowledge-mining

15 emnlp-2012-Active Learning for Imbalanced Sentiment Classification


Source: pdf

Author: Shoushan Li ; Shengfeng Ju ; Guodong Zhou ; Xiaojun Li

Abstract: Active learning is a promising way for sentiment classification to reduce the annotation cost. In this paper, we focus on the imbalanced class distribution scenario for sentiment classification, wherein the number of positive samples is quite different from that of negative samples. This scenario poses new challenges to active learning. To address these challenges, we propose a novel active learning approach, named co-selecting, by taking both the imbalanced class distribution issue and uncertainty into account. Specifically, our co-selecting approach employs two feature subspace classifiers to collectively select the most informative minority-class samples for manual annotation by leveraging a certainty measurement and an uncertainty measurement, and meanwhile automatically labels the most informative majority-class samples, to reduce human-annotation efforts. Extensive experiments across four domains demonstrate the great potential and effectiveness of our proposed co-selecting approach to active learning for imbalanced sentiment classification.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Active learning is a promising way for sentiment classification to reduce the annotation cost. [sent-6, score-0.433]

2 In this paper, we focus on the imbalanced class distribution scenario for sentiment classification, wherein the number of positive samples is quite different from that of negative samples. [sent-7, score-1.323]

3 To address these challenges, we propose a novel active learning approach, named co-selecting, by taking both the imbalanced class distribution issue and uncertainty into account. [sent-9, score-1.038]

4 Extensive experiments across four domains demonstrate great potential and effectiveness of our proposed co-selecting approach to active learning for imbalanced sentiment classification. [sent-11, score-1.135]

5 1 Introduction Sentiment classification is the task of identifying the sentiment polarity (e. [sent-12, score-0.378]

6 Most previous studies in sentiment classification focus on learning models from a large amount of labeled data. [sent-22, score-0.439]

7 In these situations, active learning approaches could be helpful by actively selecting most informative samples for manual annotation. [sent-24, score-1.054]

8 Compared to traditional active learning for sentiment classification, active learning for imbalanced sentiment classification faces some unique challenges. [sent-25, score-1.861]

9 Traditionally, uncertainty has been popularly used as a basic measurement in active learning (Lewis and Gale, 2004). [sent-29, score-0.632]

10 Therefore, how to select most informative MI samples for manual annotation without violating the basic [sent-30, score-0.741]

11 uncertainty requirement in active learning is challenging in imbalanced sentiment classification. [sent-32, score-1.265]

12 In this paper, we address the above challenges in active learning for imbalanced sentiment classification. [sent-33, score-1.114]

13 We call our novel active learning approach co-selecting because it collectively selects informative samples through two disjoint feature subspace classifiers. [sent-37, score-1.364]

14 To further reduce the annotation efforts, we only manually annotate those most informative MI samples while those most informative MA samples are automatically labeled using the predicted labels provided by the first classifier. [sent-38, score-1.271]

15 In principle, our active learning approach differs from existing ones in two main aspects. [sent-39, score-0.369]

16 First, a certainty measurement and an uncertainty measurement are employed in two complementary subspace classifiers respectively to collectively select most informative MI samples for manual annotation. [sent-40, score-1.652]

17 Second, most informative MA samples are automatically labeled to further reduce the annotation cost. [sent-41, score-0.668]

18 Evaluation across four domains shows that our active learning approach is effective for imbalanced sentiment classification and significantly outperforms the state-of-the-art active learning alternatives, such as uncertainty sampling (Lewis and Gale, 2004) and co-testing (Muslea et al. [sent-42, score-1.795]

19 Section 2 overviews the related work on sentiment classification and active learning. [sent-45, score-0.747]

20 Section 3 proposes our active learning approach for imbalanced sentiment classification. [sent-46, score-1.114]

21 2 Related Work In this section, we give a brief overview of sentiment classification and active learning. [sent-49, score-0.747]

22 However, imbalanced sentiment classification is relatively new and there are only a few studies in the literature. [sent-56, score-0.863]

23 (2011a) pioneer the research in imbalanced sentiment classification and propose a co-training algorithm to perform semi-supervised learning for imbalanced sentiment classification with the help of a great amount of unlabeled samples. [sent-58, score-1.745]

24 However, their semi-supervised approach to imbalanced sentiment classification suffers from the problem that their balanced selection strategy in co-training would generate many errors in late iterations, due to the imbalanced nature of the unlabeled data. [sent-59, score-1.449]

25 In comparison, our proposed active learning approach can effectively avoid this problem. [sent-60, score-0.369]

26 It is worth noting that the experiments therein show the superiority of under-sampling over other alternatives, such as cost-sensitive learning and one-class classification, for imbalanced sentiment classification. [sent-61, score-0.867]

27 (2011b) focus on supervised learning for imbalanced sentiment classification and propose a clustering-based approach to improve traditional under-sampling approaches. [sent-63, score-0.84]

28 Unlike all the studies mentioned above, our study pioneers active learning on imbalanced sentiment classification. [sent-65, score-1.137]

29 However, most previous studies focus on the scenario of balanced class distribution and only a few recent studies address the active learning issue on imbalanced classification problems including Yang and Ma (2010), Zhu and Hovy (2007), Ertekin et al. [sent-74, score-1.084]

30 Unfortunately, they directly adopt uncertainty sampling as the active selection strategy to address active learning in imbalanced classification, which completely ignores the class imbalance problem in the selected samples. [sent-77, score-1.659]

31 Attenberg and Provost (2010) highlight the importance of selecting samples by considering the proportion of the classes. [sent-78, score-0.53]

32 Their simulation experiment on text categorization confirms that selecting class-balanced samples is more important than traditional active selection strategies like uncertainty. [sent-79, score-0.929]

33 They first select a set of uncertainty samples and then randomly select balanced samples from the uncertainty-sample set. [sent-83, score-1.244]

34 Different from their study, our approach possesses two merits: First, two feature subspace classifiers are trained to finely integrate the certainty and uncertainty measurements. [sent-85, score-0.715]

35 Second, the MA samples are automatically annotated. Ertekin et al. [sent-86, score-0.471]

36 (2007b) select samples closest to the hyperplane provided by the SVM classifier (within the margin). [sent-88, score-0.582]

37 3 Active Learning for Imbalanced Sentiment Classification Generally, active learning can be either stream-based or pool-based (Sassano, 2002). [sent-91, score-0.369]

38 The main difference between the two is that the former scans through the data sequentially and selects informative samples individually, whereas the latter evaluates and ranks the entire collection before selecting the most informative samples in batch. [sent-92, score-1.186]

39 As a large collection of samples can easily be gathered at once in sentiment classification, pool-based active learning is adopted in this study. [sent-93, score-1.1]

40 Figure 1 illustrates a standard pool-based active learning approach, where the most important issue is the sampling strategy, which evaluates the informativeness of one sample. [sent-94, score-0.414]

41 Use the current classifier to label all unlabeled samples. [sent-97, score-0.562]

42 Use the sampling strategy to select the n most informative samples for manual annotation. [sent-98, score-0.824]

43 Move the newly-labeled samples from U to L (Figure 1: Pool-based active learning). [sent-99, score-0.817]
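The pool-based loop of Figure 1 can be sketched as follows (a minimal illustration with stand-in `train` and `informativeness` functions, not the paper's Mallet-based implementation):

```python
def pool_based_active_learning(labeled, unlabeled, train, informativeness,
                               n_per_round, rounds):
    """Generic pool-based loop: each round, (1) train on L, (2) score
    every sample in the pool U, (3) pick the n most informative
    samples, (4) move them from U to L. Oracle annotation is
    abstracted away in this sketch."""
    for _ in range(rounds):
        clf = train(labeled)                                        # step (1)
        scored = [(informativeness(clf, x), x) for x in unlabeled]  # step (2)
        scored.sort(key=lambda pair: pair[0], reverse=True)
        picked = [x for _, x in scored[:n_per_round]]               # step (3)
        for x in picked:                                            # step (4)
            unlabeled.remove(x)
            labeled.append(x)
    return labeled

# Toy run: samples stand in for posterior probabilities, and
# informativeness peaks at 0.5 (an uncertainty-style criterion).
L = [0.0, 1.0]
U = [0.1, 0.45, 0.9, 0.55]
result = pool_based_active_learning(
    L, U,
    train=lambda data: None,                        # stand-in for a real learner
    informativeness=lambda clf, x: 1.0 - abs(x - 0.5),
    n_per_round=1, rounds=2)
```

The loop deliberately re-ranks the whole pool each round, which is what distinguishes pool-based from stream-based selection.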

44 Certainty As one of the most popular selection strategies in active learning, uncertainty sampling depends on an uncertainty measurement to select informative samples. [sent-101, score-1.045]

45 In imbalanced sentiment classification, MI samples are much sparser yet precious for learning and are thus believed to be more valuable for manual annotation. [sent-103, score-1.292]

46 The key in active learning for imbalanced sentiment classification is to guarantee both the quality and quantity of newly-added MI samples. [sent-104, score-1.237]

47 To guarantee the selection of MI samples, a certainty measurement is necessary. [sent-105, score-0.402]

48 In this study, the certainty measurement is defined as follows: Cer(d) = max_{y ∈ {pos, neg}} P(y | d). Meanwhile, in order to balance the samples in the two classes, once an informative MI sample is manually annotated, an informative MA sample is automatically labeled. [sent-106, score-1.163]
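The certainty measurement, and the uncertainty measurement it must be balanced against, can be illustrated as follows (the `1 - Cer(d)` form of uncertainty is one common choice used here for illustration, not necessarily the paper's exact formula):

```python
def certainty(posteriors):
    """Cer(d) = max over y in {pos, neg} of P(y | d): close to 1 when
    the classifier is confident about one class."""
    return max(posteriors.values())

def uncertainty(posteriors):
    """A simple uncertainty measurement: 1 - Cer(d), so posteriors
    near 0.5 score high (entropy is another standard choice)."""
    return 1.0 - certainty(posteriors)

confident = {"pos": 0.95, "neg": 0.05}   # hypothetical posteriors
ambiguous = {"pos": 0.55, "neg": 0.45}
```

The two measures are inversely related, which is exactly the contradiction the co-selecting design has to resolve: certainty sampling favors `confident`, uncertainty sampling favors `ambiguous`.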

49 However, the two sampling strategies discussed above are apparently contradictory: while the uncertainty measurement is prone to selecting the samples whose posterior probabilities are nearest to 0.5, [sent-108, score-0.856]

50 the certainty measurement is prone to selecting the samples whose posterior probabilities are nearest to 1. [sent-109, score-0.847]

51 Therefore, it is essential to find a solution that balances uncertainty sampling and certainty sampling in imbalanced sentiment classification. [sent-110, score-1.24]

52 3.2 Co-selecting Classifiers with Feature Subspace In sentiment classification, a document is represented as a feature vector generated from the feature set F = {f1, ..., fm}. [sent-111, score-0.333]

53 In this study, we call a classifier trained with a feature subspace a feature subspace classifier. [sent-118, score-0.645]

54 Our basic idea of balancing both the uncertainty measurement and the certainty measurement is to train two subspace classifiers to adopt them respectively. [sent-119, score-0.935]

55 In our implementation, we randomly select two disjoint feature subspaces, each of which is used to train a subspace classifier. [sent-120, score-0.413]
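The random split into two disjoint subspaces might look like this (a sketch; the subspace size r and the toy feature set are illustrative, not the paper's settings):

```python
import random

def split_feature_subspaces(features, r, seed=0):
    """Randomly draw a feature subspace FS of size r from the feature
    set F and return it with its complement F \\ FS, so the two
    subspace classifiers are trained on disjoint views of each
    document."""
    rng = random.Random(seed)
    fs = set(rng.sample(sorted(features), r))
    complement = set(features) - fs
    return fs, complement

F = {"good", "bad", "great", "poor", "plot", "battery"}   # toy feature set
fs, rest = split_feature_subspaces(F, r=2)
```

Disjointness matters: because the two classifiers share no features, their certainty and uncertainty judgments about the same document are (approximately) independent views.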

56 On one side, one subspace classifier is employed to select some certain samples; on the other side, the other classifier is employed to select the most uncertain sample from those certain samples for manual annotation. [sent-121, score-1.142]

57 In this way, the selected samples are certain in terms of one feature subspace, which favors selecting more possible MI samples. [sent-125, score-0.871]

58 Meanwhile, the selected sample remains uncertain in terms of the other feature subspace, introducing uncertain knowledge into the current learning model. [sent-126, score-0.575]

59 We name this approach co-selecting because it collectively selects informative samples by two separate classifiers. [sent-127, score-0.627]
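A single co-selecting selection step can be sketched as follows (the certainty threshold and the toy posteriors are hypothetical, introduced only for illustration):

```python
def co_select(pool, p_view1, p_view2, certainty_threshold=0.8):
    """One selection step of co-selecting (a sketch): the first
    subspace classifier keeps samples it is certain about (to favour
    MI samples), then the second subspace classifier picks, among
    those, the sample it is most uncertain about, i.e. whose posterior
    is closest to 0.5. p_view1 / p_view2 map a sample to P(pos | d)
    under each disjoint feature subspace."""
    certain = [d for d in pool
               if max(p_view1(d), 1.0 - p_view1(d)) >= certainty_threshold]
    if not certain:
        return None
    return min(certain, key=lambda d: abs(p_view2(d) - 0.5))

# Toy posteriors per document id (hypothetical values).
view1 = {"d1": 0.95, "d2": 0.90, "d3": 0.55}
view2 = {"d1": 0.90, "d2": 0.52, "d3": 0.50}
picked = co_select(["d1", "d2", "d3"], view1.get, view2.get)
```

Here "d3" is filtered out because the first view is not certain about it, and "d2" wins over "d1" because the second view finds it more uncertain: certainty and uncertainty are applied by different classifiers instead of contradicting each other in one.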

60 In our algorithm, we strictly constrain the balance of the samples between the two classes. [sent-129, score-0.493]

61 Therefore, once two samples are annotated with the same class label, they will not be added to the labeled data, as shown in step (7) in Figure 2. [sent-132, score-0.564]

62 Input: Labeled data L with balanced samples over the two classes; Unlabeled pool U. Output: New labeled data L. Procedure: Loop for N iterations: (1). [sent-133, score-0.53]

63 size r (with the proportion r/m) from F. Generate a feature subspace from FS and train a corresponding feature subspace classifier CCer with L. Generate another feature subspace from the complement set of FS, i.e., F − FS. [sent-140, score-0.966]

64 3.3 Co-selecting with Selected MA Samples Automatically Labeled Input: Labeled data L with balanced samples over the two classes; Unlabeled pool U; MA and MI labels (positive or negative). Output: New labeled data L. Procedure: Loop for N iterations: (1). [sent-149, score-0.53]

65 Generate a feature subspace from FS and train a corresponding subspace classifier CCer with L (3). [sent-151, score-0.62]

66 In our co-selecting approach, automatically labeling those selected MA samples is easy and straightforward: the subspace classifier for monitoring the certainty measurement provides an ideal solution to annotate the samples that have been predicted as the majority class. [sent-159, score-1.655]

67 Figure 3 shows the co-selecting algorithm with those selected MA samples automatically labeled. [sent-160, score-0.537]
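The automatic MA-labeling step can be sketched as follows (a simplified illustration; the function name and probability interface are assumptions, not the paper's implementation):

```python
def auto_label_ma(pool, p_majority, majority_label="neg"):
    """Sketch of the automatic MA-labeling step in co-selecting-plus:
    among unlabeled samples that the certainty-side subspace
    classifier predicts as the majority class, take the most
    confidently predicted one and attach its predicted label, with no
    human annotation. p_majority maps a sample to
    P(majority class | d) under that classifier."""
    candidates = [d for d in pool if p_majority(d) > 0.5]
    if not candidates:
        return None
    best = max(candidates, key=p_majority)
    return best, majority_label

probs = {"d1": 0.30, "d2": 0.85, "d3": 0.97}   # hypothetical P(neg | d)
auto_labeled = auto_label_ma(["d1", "d2", "d3"], probs.get)
```

Because only highly confident majority-class predictions are used, the predicted labels can be attached with low risk, which is what makes the halved annotation cost of co-selecting-plus possible.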

68 4 Experimentation In this section, we will systematically evaluate our active learning approach for imbalanced sentiment classification and compare it with the state-of-the-art active learning alternatives. [sent-166, score-1.578]

69 For each domain, we randomly select an initial balanced labeled data with 50 negative samples and 50 positive samples. [sent-172, score-0.697]

70 For the unlabeled data, we randomly select 2000 negative samples, and 14580/12160/7140/7560 positive samples from the four domains respectively, keeping the same imbalanced ratio as the whole data. [sent-173, score-1.151]

71 For the test data in each domain, we randomly extract 800 negative samples and 800 positive samples. [sent-174, score-0.543]

72 Classification algorithm The Maximum Entropy (ME) classifier implemented with the Mallet tool is mainly adopted, except that in the margin-based active learning approach (Ertekin et al. [sent-175, score-0.418]

73 One sample is selected in each iteration;  Uncertainty: iteratively select samples using the uncertainty measurement according to the output of ME classifier. [sent-189, score-0.905]

74 One sample is selected in each iteration;  Certainty: iteratively select class-balanced samples using the certainty measurement according to the output of ME classifier. [sent-190, score-0.963]

75 One positive and one negative sample (the positive and negative labels are provided by the ME classifier) are selected in each iteration;  Co-testing: first get contention samples (i. [sent-191, score-0.738]

76 Note that the samples selected by these approaches are imbalanced. [sent-201, score-0.514]

77 To address the problem of classification on imbalanced data, we adopt the under-sampling strategy, which has been shown effective for supervised imbalanced sentiment classification (Li et al. [sent-202, score-1.456]
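Such an under-sampling step might look like this (a generic random under-sampling sketch, not necessarily the exact variant used in the paper):

```python
import random

def under_sample(samples, labels, seed=0):
    """Random under-sampling: keep every minority-class sample and
    draw an equal-sized random subset of each larger class, yielding a
    class-balanced training set."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    n = min(len(xs) for xs in by_class.values())   # minority-class size
    balanced = []
    for y, xs in by_class.items():
        keep = xs if len(xs) == n else rng.sample(xs, n)
        balanced.extend((x, y) for x in keep)
    return balanced

data = ["a", "b", "c", "d", "e", "f"]
labels = ["neg", "neg", "neg", "neg", "pos", "pos"]
balanced = under_sample(data, labels)
```

This is applied only to the baselines whose selected sets are imbalanced; co-selecting needs no such correction because it enforces class balance during selection.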

78 Our active learning approach includes two versions: the co-selecting algorithm as described in Section 3.2 [sent-204, score-0.369]

79 and the co-selecting with selected MA samples automatically labeled as described in Section 3.3. [sent-205, score-0.575]

80 [Figure residue: axis ticks and legend text from plots of performance versus the number of manually annotated samples (Random, Co-testing, Co-selecting-plus) on the Electronic domain.] [sent-213, score-0.995]

81 This verifies the effectiveness of automatically labeling those selected MA samples in imbalanced sentiment classification. [sent-224, score-1.304]

82 , certainty) performs worst, which reflects that only considering the sample-balance factor in imbalanced sentiment classification is not helpful. [sent-227, score-0.932]

83 Figure 5 compares our approach to other active learning approaches by varying the number of the selected samples for manual annotation. [sent-228, score-0.928]

84 For clarity, we only include random selection and co-testing in the comparison and do not show the performance of the other active learning approaches due to their similar behavior to random selection. [sent-229, score-0.451]

85 From this figure, we can see that co-testing is effective on Book and Electronic when fewer than 1500 samples are selected for manual annotation, but it fails to outperform random selection in the other two domains. [sent-230, score-0.725]

86 05) when less than 4800 samples are selected for manual annotation. [sent-232, score-0.588]

87 Figure 6 shows the performance of co-selecting-plus with varying sizes of the feature subspaces for the first subspace classifier CCer . [sent-234, score-0.435]

88 This result also shows that the size of the feature subspace for selecting certain samples should be much less than that for selecting uncertain samples, which indicates the more important role of the uncertainty measurement in active learning. [sent-236, score-1.578]

89 This once again verifies the importance of the uncertainty strategy in active learning. [sent-246, score-0.58]

90 Number of MI samples selected for manual annotation In Table 1, we investigate the number of the MI samples selected for manual annotation using different active learning approaches when a total of 600 unlabeled samples are selected for annotation. [sent-247, score-2.234]

91 From this table, we can see that almost all the existing active learning approaches can only select a small number of MI samples, with imbalanced ratios similar to that of the whole unlabeled data. [sent-248, score-0.956]

92 Although the certainty approach could select many MI samples for annotation, this approach performs worst due to its totally ignoring the uncertainty factor. [sent-249, score-0.868]

93 When our approach is applied, especially co-selecting-plus, more MI samples are selected for manual annotation and finally included to learn the models. [sent-250, score-0.643]

94 This greatly improves the effectiveness of our active learning approach. [sent-251, score-0.369]

95 Table 1: The number of MI samples selected for manual annotation when 600 samples are annotated on the whole. [sent-252, score-1.113]

96 Precision of automatically labeled MA samples In co-selecting-plus, all the added MA samples are automatically labeled by the first subspace classifier. [sent-253, score-1.291]

97 5% of automatically labeled MA samples are correctly annotated in Book, DVD, Electronic, and Kitchen respectively. [sent-257, score-0.531]

98 This suggests that the subspace classifiers are able to predict the MA samples with a high precision. [sent-258, score-0.778]

99 5 Conclusion In this paper, we propose a novel active learning approach, named co-selecting, to reduce the annotation cost for imbalanced sentiment classification. [sent-260, score-1.169]

100 It first trains two complementary 146 classifiers with two disjoint feature subspaces and then uses them to collectively select most informative MI samples for manual annotation, leaving most informative MA samples for automatic annotation. [sent-261, score-1.473]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('imbalanced', 0.462), ('samples', 0.448), ('active', 0.369), ('sentiment', 0.283), ('subspace', 0.273), ('certainty', 0.209), ('uncertainty', 0.151), ('ertekin', 0.129), ('measurement', 0.112), ('mi', 0.109), ('informative', 0.104), ('classification', 0.095), ('ma', 0.091), ('ccer', 0.086), ('uncertain', 0.082), ('manual', 0.074), ('subspaces', 0.067), ('selected', 0.066), ('unlabeled', 0.065), ('select', 0.06), ('fs', 0.06), ('selecting', 0.059), ('classifiers', 0.057), ('cuncer', 0.057), ('muslea', 0.057), ('balanced', 0.056), ('class', 0.056), ('dvd', 0.056), ('annotation', 0.055), ('selection', 0.053), ('collectively', 0.052), ('classifier', 0.049), ('sample', 0.047), ('balance', 0.045), ('sampling', 0.045), ('committee', 0.044), ('doyle', 0.043), ('kitchen', 0.041), ('negative', 0.039), ('li', 0.039), ('lewis', 0.039), ('labeled', 0.038), ('strategy', 0.038), ('book', 0.035), ('positive', 0.035), ('disjoint', 0.034), ('member', 0.032), ('electronic', 0.031), ('neg', 0.031), ('minority', 0.031), ('opinion', 0.03), ('gale', 0.029), ('imbalance', 0.029), ('annoated', 0.029), ('attenberg', 0.029), ('contention', 0.029), ('cotesting', 0.029), ('kubat', 0.029), ('sensitiveness', 0.029), ('shengfeng', 0.029), ('tnrate', 0.029), ('tprate', 0.029), ('zhejiang', 0.029), ('guarantee', 0.028), ('meanwhile', 0.028), ('alternatives', 0.027), ('annotate', 0.027), ('pool', 0.026), ('pang', 0.026), ('feature', 0.025), ('hyperplane', 0.025), ('lloret', 0.025), ('nubmer', 0.025), ('precious', 0.025), ('shoushan', 0.025), ('manually', 0.024), ('proportion', 0.023), ('automatically', 0.023), ('selects', 0.023), ('proceeding', 0.023), ('studies', 0.023), ('annotated', 0.022), ('apparently', 0.022), ('cui', 0.022), ('verifies', 0.022), ('randomly', 0.021), ('loop', 0.021), ('iteratively', 0.021), ('varying', 0.021), ('adopt', 0.021), ('domains', 0.021), ('turney', 0.021), ('freund', 0.021), ('zhou', 0.02), ('blitzer', 0.019), ('wan', 0.019), ('prone', 0.019), ('thumbs', 0.019), 
('bottou', 0.019), ('settles', 0.019)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 15 emnlp-2012-Active Learning for Imbalanced Sentiment Classification

Author: Shoushan Li ; Shengfeng Ju ; Guodong Zhou ; Xiaojun Li

Abstract: Active learning is a promising way for sentiment classification to reduce the annotation cost. In this paper, we focus on the imbalanced class distribution scenario for sentiment classification, wherein the number of positive samples is quite different from that of negative samples. This scenario poses new challenges to active learning. To address these challenges, we propose a novel active learning approach, named co-selecting, by taking both the imbalanced class distribution issue and uncertainty into account. Specifically, our co-selecting approach employs two feature subspace classifiers to collectively select the most informative minority-class samples for manual annotation by leveraging a certainty measurement and an uncertainty measurement, and meanwhile automatically labels the most informative majority-class samples, to reduce human-annotation efforts. Extensive experiments across four domains demonstrate the great potential and effectiveness of our proposed co-selecting approach to active learning for imbalanced sentiment classification.

2 0.15045419 34 emnlp-2012-Do Neighbours Help? An Exploration of Graph-based Algorithms for Cross-domain Sentiment Classification

Author: Natalia Ponomareva ; Mike Thelwall

Abstract: This paper presents a comparative study of graph-based approaches for cross-domain sentiment classification. In particular, the paper analyses two existing methods: an optimisation problem and a ranking algorithm. We compare these graph-based methods with each other and with the other state-of-the-art approaches and conclude that graph domain representations offer a competitive solution to the domain adaptation problem. Analysis of the best parameters for graph-based algorithms reveals that there are no optimal values valid for all domain pairs and that these values are dependent on the characteristics of corresponding domains.

3 0.10544816 137 emnlp-2012-Why Question Answering using Sentiment Analysis and Word Classes

Author: Jong-Hoon Oh ; Kentaro Torisawa ; Chikara Hashimoto ; Takuya Kawada ; Stijn De Saeger ; Jun'ichi Kazama ; Yiou Wang

Abstract: In this paper we explore the utility of sentiment analysis and semantic word classes for improving why-question answering on a large-scale web corpus. Our work is motivated by the observation that a why-question and its answer often follow the pattern that if something undesirable happens, the reason is also often something undesirable, and if something desirable happens, the reason is also often something desirable. To the best of our knowledge, this is the first work that introduces sentiment analysis to non-factoid question answering. We combine this simple idea with semantic word classes for ranking answers to why-questions and show that on a set of 850 why-questions our method gains 15.2% improvement in precision at the top-1 answer over a baseline state-of-the-art QA system that achieved the best performance in a shared task of Japanese non-factoid QA in NTCIR-6.

4 0.090056829 76 emnlp-2012-Learning-based Multi-Sieve Co-reference Resolution with Knowledge

Author: Lev Ratinov ; Dan Roth

Abstract: We explore the interplay of knowledge and structure in co-reference resolution. To inject knowledge, we use a state-of-the-art system which cross-links (or “grounds”) expressions in free text to Wikipedia. We explore ways of using the resulting grounding to boost the performance of a state-of-the-art co-reference resolution system. To maximize the utility of the injected knowledge, we deploy a learningbased multi-sieve approach and develop novel entity-based features. Our end system outperforms the state-of-the-art baseline by 2 B3 F1 points on non-transcript portion of the ACE 2004 dataset.

5 0.081130743 91 emnlp-2012-Monte Carlo MCMC: Efficient Inference by Approximate Sampling

Author: Sameer Singh ; Michael Wick ; Andrew McCallum

Abstract: Conditional random fields and other graphical models have achieved state of the art results in a variety of tasks such as coreference, relation extraction, data integration, and parsing. Increasingly, practitioners are using models with more complex structure—higher treewidth, larger fan-out, more features, and more data—rendering even approximate inference methods such as MCMC inefficient. In this paper we propose an alternative MCMC sampling scheme in which transition probabilities are approximated by sampling from the set of relevant factors. We demonstrate that our method converges more quickly than a traditional MCMC sampler for both marginal and MAP inference. In an author coreference task with over 5 million mentions, we achieve a 13 times speedup over regular MCMC inference.

6 0.079240501 116 emnlp-2012-Semantic Compositionality through Recursive Matrix-Vector Spaces

7 0.071019016 20 emnlp-2012-Answering Opinion Questions on Products by Exploiting Hierarchical Organization of Consumer Reviews

8 0.060140036 43 emnlp-2012-Exact Sampling and Decoding in High-Order Hidden Markov Models

9 0.059041649 28 emnlp-2012-Collocation Polarity Disambiguation Using Web-based Pseudo Contexts

10 0.053747129 14 emnlp-2012-A Weakly Supervised Model for Sentence-Level Semantic Orientation Analysis with Multiple Experts

11 0.049791146 32 emnlp-2012-Detecting Subgroups in Online Discussions by Modeling Positive and Negative Relations among Participants

12 0.049474113 139 emnlp-2012-Word Salad: Relating Food Prices and Descriptions

13 0.048763402 68 emnlp-2012-Iterative Annotation Transformation with Predict-Self Reestimation for Chinese Word Segmentation

14 0.04782439 24 emnlp-2012-Biased Representation Learning for Domain Adaptation

15 0.045265581 107 emnlp-2012-Polarity Inducing Latent Semantic Analysis

16 0.044248544 92 emnlp-2012-Multi-Domain Learning: When Do Domains Matter?

17 0.04421844 36 emnlp-2012-Domain Adaptation for Coreference Resolution: An Adaptive Ensemble Approach

18 0.043068092 75 emnlp-2012-Large Scale Decipherment for Out-of-Domain Machine Translation

19 0.041389722 101 emnlp-2012-Opinion Target Extraction Using Word-Based Translation Model

20 0.0377701 44 emnlp-2012-Excitatory or Inhibitory: A New Semantic Orientation Extracts Contradiction and Causality from the Web


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.132), (1, 0.087), (2, 0.028), (3, 0.133), (4, 0.063), (5, -0.073), (6, -0.077), (7, -0.04), (8, 0.134), (9, -0.043), (10, 0.005), (11, 0.128), (12, -0.005), (13, -0.092), (14, 0.029), (15, 0.093), (16, -0.002), (17, 0.171), (18, -0.058), (19, 0.14), (20, -0.106), (21, -0.042), (22, 0.035), (23, 0.099), (24, 0.085), (25, -0.051), (26, 0.104), (27, -0.032), (28, -0.073), (29, -0.049), (30, -0.235), (31, 0.023), (32, -0.197), (33, 0.017), (34, 0.089), (35, -0.055), (36, -0.248), (37, -0.046), (38, -0.02), (39, 0.036), (40, -0.13), (41, -0.003), (42, -0.064), (43, 0.194), (44, 0.094), (45, -0.057), (46, 0.036), (47, -0.038), (48, 0.041), (49, -0.177)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.98834223 15 emnlp-2012-Active Learning for Imbalanced Sentiment Classification

Author: Shoushan Li ; Shengfeng Ju ; Guodong Zhou ; Xiaojun Li

Abstract: Active learning is a promising way for sentiment classification to reduce the annotation cost. In this paper, we focus on the imbalanced class distribution scenario for sentiment classification, wherein the number of positive samples is quite different from that of negative samples. This scenario posits new challenges to active learning. To address these challenges, we propose a novel active learning approach, named co-selecting, by taking both the imbalanced class distribution issue and uncertainty into account. Specifically, our co-selecting approach employs two feature subspace classifiers to collectively select most informative minority-class samples for manual annotation by leveraging a certainty measurement and an uncertainty measurement, and in the meanwhile, automatically label most informative majority-class samples, to reduce humanannotation efforts. Extensive experiments across four domains demonstrate great potential and effectiveness of our proposed co-selecting approach to active learning for imbalanced sentiment classification. 1

2 0.59750056 34 emnlp-2012-Do Neighbours Help? An Exploration of Graph-based Algorithms for Cross-domain Sentiment Classification

Author: Natalia Ponomareva ; Mike Thelwall

Abstract: This paper presents a comparative study of graph-based approaches for cross-domain sentiment classification. In particular, the paper analyses two existing methods: an optimisation problem and a ranking algorithm. We compare these graph-based methods with each other and with the other state-ofthe-art approaches and conclude that graph domain representations offer a competitive solution to the domain adaptation problem. Analysis of the best parameters for graphbased algorithms reveals that there are no optimal values valid for all domain pairs and that these values are dependent on the characteristics of corresponding domains.

3 0.46184835 75 emnlp-2012-Large Scale Decipherment for Out-of-Domain Machine Translation

Author: Qing Dou ; Kevin Knight

Abstract: We apply slice sampling to Bayesian decipherment and use our new decipherment framework to improve out-of-domain machine translation. Compared with the state of the art algorithm, our approach is highly scalable and produces better results, which allows us to decipher ciphertext with billions of tokens and hundreds of thousands of word types with high accuracy. We decipher a large amount ofmonolingual data to improve out-of-domain translation and achieve significant gains of up to 3.8 BLEU points.

4 0.34492317 43 emnlp-2012-Exact Sampling and Decoding in High-Order Hidden Markov Models

Author: Simon Carter ; Marc Dymetman ; Guillaume Bouchard

Abstract: We present a method for exact optimization and sampling from high order Hidden Markov Models (HMMs), which are generally handled by approximation techniques. Motivated by adaptive rejection sampling and heuristic search, we propose a strategy based on sequentially refining a lower-order language model that is an upper bound on the true model we wish to decode and sample from. This allows us to build tractable variable-order HMMs. The ARPA format for language models is extended to enable an efficient use of the max-backoff quantities required to compute the upper bound. We evaluate our approach on two problems: a SMS-retrieval task and a POS tagging experiment using 5-gram models. Results show that the same approach can be used for exact optimization and sampling, while explicitly constructing only a fraction of the total implicit state-space.

5 0.34445402 91 emnlp-2012-Monte Carlo MCMC: Efficient Inference by Approximate Sampling

Author: Sameer Singh ; Michael Wick ; Andrew McCallum

Abstract: Conditional random fields and other graphical models have achieved state of the art results in a variety of tasks such as coreference, relation extraction, data integration, and parsing. Increasingly, practitioners are using models with more complex structure—higher treewidth, larger fan-out, more features, and more data—rendering even approximate inference methods such as MCMC inefficient. In this paper we propose an alternative MCMC sampling scheme in which transition probabilities are approximated by sampling from the set of relevant factors. We demonstrate that our method converges more quickly than a traditional MCMC sampler for both marginal and MAP inference. In an author coreference task with over 5 million mentions, we achieve a 13 times speedup over regular MCMC inference.

6 0.33379802 137 emnlp-2012-Why Question Answering using Sentiment Analysis and Word Classes

7 0.32413748 44 emnlp-2012-Excitatory or Inhibitory: A New Semantic Orientation Extracts Contradiction and Causality from the Web

8 0.2743319 76 emnlp-2012-Learning-based Multi-Sieve Co-reference Resolution with Knowledge

9 0.27270767 139 emnlp-2012-Word Salad: Relating Food Prices and Descriptions

10 0.2504839 116 emnlp-2012-Semantic Compositionality through Recursive Matrix-Vector Spaces

11 0.23822217 83 emnlp-2012-Lexical Differences in Autobiographical Narratives from Schizophrenic Patients and Healthy Controls

12 0.22007295 92 emnlp-2012-Multi-Domain Learning: When Do Domains Matter?

13 0.20729908 20 emnlp-2012-Answering Opinion Questions on Products by Exploiting Hierarchical Organization of Consumer Reviews

14 0.20332572 14 emnlp-2012-A Weakly Supervised Model for Sentence-Level Semantic Orientation Analysis with Multiple Experts

15 0.20184094 120 emnlp-2012-Streaming Analysis of Discourse Participants

16 0.17841536 36 emnlp-2012-Domain Adaptation for Coreference Resolution: An Adaptive Ensemble Approach

17 0.17202514 128 emnlp-2012-Translation Model Based Cross-Lingual Language Model Adaptation: from Word Models to Phrase Models

18 0.16700011 32 emnlp-2012-Detecting Subgroups in Online Discussions by Modeling Positive and Negative Relations among Participants

19 0.16235229 68 emnlp-2012-Iterative Annotation Transformation with Predict-Self Reestimation for Chinese Word Segmentation

20 0.15632257 134 emnlp-2012-User Demographics and Language in an Implicit Social Network


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.012), (16, 0.025), (18, 0.361), (34, 0.066), (60, 0.103), (63, 0.059), (64, 0.038), (65, 0.028), (70, 0.018), (73, 0.011), (74, 0.036), (76, 0.07), (80, 0.015), (86, 0.02), (95, 0.031)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.75943494 15 emnlp-2012-Active Learning for Imbalanced Sentiment Classification

Author: Shoushan Li ; Shengfeng Ju ; Guodong Zhou ; Xiaojun Li

Abstract: Active learning is a promising way for sentiment classification to reduce the annotation cost. In this paper, we focus on the imbalanced class distribution scenario for sentiment classification, wherein the number of positive samples is quite different from that of negative samples. This scenario poses new challenges to active learning. To address these challenges, we propose a novel active learning approach, named co-selecting, by taking both the imbalanced class distribution issue and uncertainty into account. Specifically, our co-selecting approach employs two feature subspace classifiers to collectively select the most informative minority-class samples for manual annotation by leveraging a certainty measurement and an uncertainty measurement, and meanwhile automatically label the most informative majority-class samples, to reduce human-annotation efforts. Extensive experiments across four domains demonstrate great potential and effectiveness of our proposed co-selecting approach to active learning for imbalanced sentiment classification. 1

2 0.62930053 93 emnlp-2012-Multi-instance Multi-label Learning for Relation Extraction

Author: Mihai Surdeanu ; Julie Tibshirani ; Ramesh Nallapati ; Christopher D. Manning

Abstract: Distant supervision for relation extraction (RE) – gathering training data by aligning a database of facts with text – is an efficient approach to scale RE to thousands of different relations. However, this introduces a challenging learning scenario where the relation expressed by a pair of entities found in a sentence is unknown. For example, a sentence containing Balzac and France may express BornIn or Died, an unknown relation, or no relation at all. Because of this, traditional supervised learning, which assumes that each example is explicitly mapped to a label, is not appropriate. We propose a novel approach to multi-instance multi-label learning for RE, which jointly models all the instances of a pair of entities in text and all their labels using a graphical model with latent variables. Our model performs competitively on two difficult domains.

3 0.41638991 77 emnlp-2012-Learning Constraints for Consistent Timeline Extraction

Author: David McClosky ; Christopher D. Manning

Abstract: We present a distantly supervised system for extracting the temporal bounds of fluents (relations which only hold during certain times, such as attends school). Unlike previous pipelined approaches, our model does not assume independence between each fluent or even between named entities with known connections (parent, spouse, employer, etc.). Instead, we model what makes timelines of fluents consistent by learning cross-fluent constraints, potentially spanning entities as well. For example, our model learns that someone is unlikely to start a job at age two or to marry someone who hasn’t been born yet. Our system achieves a 36% error reduction over a pipelined baseline.

4 0.40411445 14 emnlp-2012-A Weakly Supervised Model for Sentence-Level Semantic Orientation Analysis with Multiple Experts

Author: Lizhen Qu ; Rainer Gemulla ; Gerhard Weikum

Abstract: We propose the weakly supervised Multi-Experts Model (MEM) for analyzing the semantic orientation of opinions expressed in natural language reviews. In contrast to most prior work, MEM predicts both opinion polarity and opinion strength at the level of individual sentences; such fine-grained analysis helps to understand better why users like or dislike the entity under review. A key challenge in this setting is that it is hard to obtain sentence-level training data for both polarity and strength. For this reason, MEM is weakly supervised: It starts with potentially noisy indicators obtained from coarse-grained training data (i.e., document-level ratings), a small set of diverse base predictors, and, if available, small amounts of fine-grained training data. We integrate these noisy indicators into a unified probabilistic framework using ideas from ensemble learning and graph-based semi-supervised learning. Our experiments indicate that MEM outperforms state-of-the-art methods by a significant margin.

5 0.40366328 136 emnlp-2012-Weakly Supervised Training of Semantic Parsers

Author: Jayant Krishnamurthy ; Tom Mitchell

Abstract: We present a method for training a semantic parser using only a knowledge base and an unlabeled text corpus, without any individually annotated sentences. Our key observation is that multiple forms of weak supervision can be combined to train an accurate semantic parser: semantic supervision from a knowledge base, and syntactic supervision from dependency-parsed sentences. We apply our approach to train a semantic parser that uses 77 relations from Freebase in its knowledge representation. This semantic parser extracts instances of binary relations with state-of-the-art accuracy, while simultaneously recovering much richer semantic structures, such as conjunctions of multiple relations with partially shared arguments. We demonstrate recovery of this richer structure by extracting logical forms from natural language queries against Freebase. On this task, the trained semantic parser achieves 80% precision and 56% recall, despite never having seen an annotated logical form.

6 0.40043193 20 emnlp-2012-Answering Opinion Questions on Products by Exploiting Hierarchical Organization of Consumer Reviews

7 0.39698768 71 emnlp-2012-Joint Entity and Event Coreference Resolution across Documents

8 0.39687625 92 emnlp-2012-Multi-Domain Learning: When Do Domains Matter?

9 0.39637536 129 emnlp-2012-Type-Supervised Hidden Markov Models for Part-of-Speech Tagging with Incomplete Tag Dictionaries

10 0.39300594 107 emnlp-2012-Polarity Inducing Latent Semantic Analysis

11 0.39289352 110 emnlp-2012-Reading The Web with Learned Syntactic-Semantic Inference Rules

12 0.39224666 109 emnlp-2012-Re-training Monolingual Parser Bilingually for Syntactic SMT

13 0.39165768 23 emnlp-2012-Besting the Quiz Master: Crowdsourcing Incremental Classification Games

14 0.3904137 123 emnlp-2012-Syntactic Transfer Using a Bilingual Lexicon

15 0.39001825 3 emnlp-2012-A Coherence Model Based on Syntactic Patterns

16 0.3885791 52 emnlp-2012-Fast Large-Scale Approximate Graph Construction for NLP

17 0.38849697 42 emnlp-2012-Entropy-based Pruning for Phrase-based Machine Translation

18 0.38783368 124 emnlp-2012-Three Dependency-and-Boundary Models for Grammar Induction

19 0.38738394 47 emnlp-2012-Explore Person Specific Evidence in Web Person Name Disambiguation

20 0.38693121 18 emnlp-2012-An Empirical Investigation of Statistical Significance in NLP