emnlp emnlp2013 emnlp2013-77 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Zhiyuan Chen ; Arjun Mukherjee ; Bing Liu ; Meichun Hsu ; Malu Castellanos ; Riddhiman Ghosh
Abstract: Aspect extraction is one of the key tasks in sentiment analysis. In recent years, statistical models have been used for the task. However, such models without any domain knowledge often produce aspects that are not interpretable in applications. To tackle the issue, some knowledge-based topic models have been proposed, which allow the user to input some prior domain knowledge to generate coherent aspects. However, existing knowledge-based topic models have several major shortcomings, e.g., little work has been done to incorporate the cannot-link type of knowledge or to automatically adjust the number of topics based on domain knowledge. This paper proposes a more advanced topic model, called MC-LDA (LDA with m-set and c-set), to address these problems, which is based on an Extended generalized Pólya urn (E-GPU) model (which is also proposed in this paper). Experiments on real-life product reviews from a variety of domains show that MC-LDA outperforms the existing state-of-the-art models markedly.
Reference: text
sentIndex sentText sentNum sentScore
1 However, such models without any domain knowledge often produce aspects that are not interpretable in applications. [sent-6, score-0.27]
2 To tackle the issue, some knowledge-based topic models have been proposed, which allow the user to input some prior domain knowledge to generate coherent aspects. [sent-7, score-0.465]
3 However, existing knowledge-based topic models have several major shortcomings, e.g., [sent-8, score-0.251]
4 little work has been done to incorporate the cannot-link type of knowledge or to automatically adjust the number of topics based on domain knowledge. [sent-10, score-0.352]
5 This paper proposes a more advanced topic model, called MC-LDA (LDA with m-set and c-set), to address these problems, which is based on an Extended generalized Pólya urn (E-GPU) model (which is also proposed in this paper). [sent-11, score-0.77]
6 1 Introduction In sentiment analysis and opinion mining, aspect extraction aims to extract entity aspects or features on which opinions have been expressed (Hu and Liu, 2004; Liu, 2012). [sent-13, score-0.544]
7 Aspect extraction consists of two sub-tasks: (1) extracting all aspect terms (e.g., [sent-15, score-0.187]
8 (e.g., cluster “picture” and “photo” into one aspect category as they mean the same in the domain “Camera”). [sent-19, score-0.252]
9 We adopt the topic modeling approach as it can perform both sub-tasks simultaneously (see § 2). [sent-24, score-0.251]
10 Topic models, such as LDA (Blei et al., 2003), provide an unsupervised framework for extracting latent topics in text documents. [sent-26, score-0.192]
11 However, in recent years, researchers have found that fully unsupervised topic models may not produce topics that are very coherent for a particular application. [sent-28, score-0.464]
12 This is because the objective functions of topic models do not always correlate well with human judgments and needs (Chang et al.). [sent-29, score-0.251]
13 To address the issue, several knowledge-based topic models have been proposed. [sent-31, score-0.251]
14 A must-link states that two words (or terms) should belong to the same topic whereas a cannot-link indicates that two words should not be in the same topic. [sent-34, score-0.251]
15 Furthermore, none of the existing models, including DF-LDA, is able to automatically adjust the number of topics based on domain knowledge. [sent-46, score-0.293]
16 For example, in the domain “Computer”, a topic model may generate two topics Battery and Screen that represent two different aspects. [sent-51, score-0.509]
17 A cannot-link {battery, screen} given as domain knowledge is thus consistent with the corpus. [sent-52, score-0.159]
18 However, words Amazon and Price may appear in the same topic due to their high co-occurrences in Amazon.com reviews. [sent-53, score-0.251]
19 In this case, the number of topics needs to be increased by 1 since the mixed topic has to be separated into two individual topics Amazon and Price. [sent-56, score-0.567]
20 Apart from the above shortcoming, earlier knowledge-based topic models also have some major shortcomings: Incapability of handling multiple senses: A word typically has multiple meanings or senses. [sent-57, score-0.294]
21 Although one existing model allows multiple senses, it requires that each topic has at most one set of seed words (seed set), which is restrictive as the amount of knowledge should not be limited. [sent-65, score-0.345]
22 This can harm the final topics because the attenuation of the frequent (often domain-important) words can result in some irrelevant words being ranked higher (with higher probabilities). [sent-68, score-0.258]
23 To address the above shortcomings, we define m-set (for must-set) as a set of words that should belong to the same topic and c-set (cannot-set) as a set of words that should not be in the same topic. [sent-69, score-0.251]
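To make the two knowledge types concrete, here is a minimal sketch (not from the paper) of how m-sets and c-sets could be represented; the word lists and helper names below are illustrative assumptions.

```python
# Illustrative sketch only: m-sets group words that should share a topic,
# c-sets group words that should not. All names here are hypothetical.
m_sets = [
    {"picture", "photo", "image"},   # one aspect in the "Camera" domain
    {"battery", "power"},
]
c_sets = [
    {"battery", "screen"},           # should land in different topics
    {"amazon", "price"},
]

def m_sets_of(word):
    """Return all m-sets that contain the given word."""
    return [s for s in m_sets if word in s]

def violates_c_set(topic_words):
    """True if some c-set is fully contained in one topic's word set."""
    return any(c <= topic_words for c in c_sets)

print(violates_c_set({"battery", "screen", "life"}))  # True
```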
24 We then propose a new topic model, called MC-LDA (LDA with m-set and c-set), which is not only able to deal with c-sets and automatically adjust the number of topics, but also able to handle the multiple-senses and adverse-effect-of-knowledge problems at the same time. [sent-87, score-0.583]
25 Then, we employ the generalized Pólya urn (GPU) model (Mahmoud, 2008) to address the issue of adverse effect of knowledge (§ 4). [sent-90, score-0.666]
26 Deviating from the standard topic modeling approaches, we propose the Extended generalized Pólya urn (E-GPU) model (§ 5). [sent-91, score-0.77]
27 It proposed a new knowledge-based topic model called MC-LDA, which is able to use both m-sets and c-sets, as well as automatically adjust the number of topics based on domain knowledge. [sent-98, score-0.544]
28 It proposed the E-GPU model to enable multi-urn interactions, which allows c-sets to be naturally integrated into a topic model. [sent-102, score-0.251]
29 According to (Liu, 2012), there are three main approaches to aspect extraction: 1) Using word frequency and syntactic dependency of aspects and sentiment words for extraction (e.g., [sent-109, score-0.41]
30 In this work, we focus on topic models owing to their advantage of performing both aspect extraction and clustering simultaneously. [sent-143, score-0.438]
31 We also notice that some aspect extraction models in sentiment analysis separately discover aspect words and aspect-specific sentiment words (e.g., [sent-151, score-0.715]
32 Our proposed model does not separate them as most sentiment words also imply aspects and most adjectives modify specific attributes of objects. [sent-155, score-0.223]
33 For example, sentiment words expensive and beautiful imply aspects price and appearance respectively. [sent-156, score-0.309]
34 In our recent work (Chen et al., 2013b), we proposed a framework (called GK-LDA) to explicitly deal with the wrong knowledge when exploring the lexical semantic relations as the general (domain-independent) knowledge in topic models. [sent-164, score-0.414]
35 The generalized Pólya urn (GPU) model (Mahmoud, 2008) was first introduced in LDA by Mimno et al. (2011). [sent-166, score-0.519]
36 Our results in § 7 show that using domain knowledge can significantly improve aspect extraction. [sent-170, score-0.311]
37 The GPU model was also employed in topic models in our work (Chen et al.). [sent-171, score-0.251]
38 1 Generalized Pólya urn (GPU) Model The Pólya urn model involves an urn containing balls of different colors. [sent-370, score-1.544]
39 At discrete time intervals, balls are added or removed from the urn according to their color distributions. [sent-371, score-0.798]
40 In the simple Pólya urn (SPU) model, a ball is first drawn randomly from the urn and its color is recorded, then that ball is put back along with a new ball of the same color. [sent-372, score-1.886]
41 This selection process is repeated and the contents of the urn change over time, with a self-reinforcing property sometimes expressed as “the rich get richer.” [sent-373, score-0.461]
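A minimal sketch of the SPU draw-and-replace step described above, assuming an urn represented as a Python list of color labels:

```python
import random

def spu_draw(urn):
    """Simple Pólya urn step: draw a ball uniformly at random, record its
    color, and put it back along with one new ball of the same color."""
    ball = random.choice(urn)
    urn.append(ball)  # the drawn ball stays; one more of its color is added
    return ball

urn = ["red", "blue"]
for _ in range(1000):
    spu_draw(urn)
# The final proportions are often far from 0.5: "the rich get richer".
print(urn.count("red") / len(urn))
```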
42 The generalized Pólya urn (GPU) model differs from the SPU model in the replacement scheme during sampling. [sent-375, score-0.552]
43 Specifically, when a ball is randomly drawn, certain numbers of additional balls of each color are returned to the urn, rather than just two balls of the same color as in SPU. [sent-376, score-0.923]
44 2 Promoting M-sets using GPU To deal with the issue of sensitivity to the adverse effect of knowledge, MDK-LDA(b) is extended to MDK-LDA which employs the generalized Pólya urn (GPU) sampling scheme. [sent-378, score-0.761]
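The following sketch illustrates the GPU replacement scheme in the m-set-promotion role described above: drawing a ball for word w under topic t also returns extra "balls" (fractional counts) for the other words in w's m-sets. The weight sigma and the count layout are assumptions for illustration, not the paper's exact parameterization.

```python
from collections import defaultdict

SIGMA = 0.3                      # assumed promotion weight (hyperparameter)
topic_word = defaultdict(float)  # (topic, word) -> pseudo-count

def gpu_increment(topic, word, m_sets, sigma=SIGMA):
    """GPU-style update: besides the drawn ball itself, return additional
    balls for the other words in the drawn word's m-sets."""
    topic_word[(topic, word)] += 1.0
    for m in m_sets:
        if word in m:
            for other in m - {word}:
                topic_word[(topic, other)] += sigma

gpu_increment(2, "picture", [{"picture", "photo"}])
print(topic_word[(2, "photo")])  # 0.3: "photo" is promoted with "picture"
```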
45 1 Extended Generalized Pólya urn Model To handle the complex situation resulting from incorporating c-sets, we propose an Extended generalized Pólya urn (E-GPU) model. [sent-413, score-0.98]
46 Instead of involving only one urn as in SPU and GPU, the E-GPU model considers a set of urns in the sampling process. [sent-414, score-0.656]
47 The E-GPU model allows a ball to be transferred from one urn to another, enabling multi-urn interactions. [sent-415, score-0.744]
48 Thus, during sampling, the populations of several urns will evolve even if only one ball is drawn from one urn. [sent-416, score-0.407]
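As a rough sketch of this multi-urn behavior, suppose each topic urn is a word-to-count dictionary; a single draw can then move mass between urns (the representation is an assumption for illustration):

```python
def transfer_ball(urns, src, dst, word):
    """Move one ball of the given color (word) from urn src to urn dst,
    so two urn populations change from a single operation."""
    if urns[src].get(word, 0) > 0:
        urns[src][word] -= 1
        urns[dst][word] = urns[dst].get(word, 0) + 1

urns = [{"battery": 5, "screen": 4}, {"screen": 7}]
transfer_ball(urns, src=0, dst=1, word="screen")
print(urns)  # [{'battery': 5, 'screen': 3}, {'screen': 8}]
```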
49 We define three sets of urns, which will be used in the new sampling scheme in the proposed MC-LDA model. [sent-418, score-0.228]
50 colors (topics) and each ball inside has a color ? [sent-424, score-0.425]
51 To tackle the issue, we utilize the proposed E-GPU model and incorporate c-set handling inside the E-GPU sampling scheme, which is also designed to enable automated adjustment of the number of topics based on domain knowledge. [sent-454, score-0.379]
52 The E-GPU model thus decreases the probabilities of those cannot-words under this topic while increasing their corresponding probabilities under some other topics. [sent-456, score-0.251]
53 In order to correctly transfer a ball that represents word ? [sent-457, score-0.293]
54 , it should be transferred to an urn which has a higher proportion of ? [sent-458, score-0.526]
55 That is, we randomly sample an urn that has a higher proportion of any m-set of ? [sent-463, score-0.492]
56 For example, aspects price and amazon may be mixed under one topic (say ? [sent-467, score-0.493]
57 In this case, according to LDA, word price has no topic with a higher proportion of it (and its related words) than topic ? [sent-470, score-0.619]
58 To transfer it, we need to increment the number of topics by 1 and then transfer the word to this new topic urn (step 3c below). [sent-472, score-0.992]
59 A set of urns {?′} is formed such that each urn in it satisfies the following conditions: i) ? [sent-568, score-0.499]
60 We perform hierarchical sampling consisting of the following three steps (the detailed algorithms are given in Figures 2 and 3): Step 1 (Lines 1-11 in Figure 2): We jointly sample a topic ? [sent-751, score-0.329]
61 from the corpus with topic and m-set assignments being ? [sent-834, score-0.251]
62 Step 3 (Lines 6-12 in Figure 3): For each drawn ball ? [sent-955, score-0.29]
63 Here, ?(·) is an indicator function, which restricts the ball to be transferred only to an urn that contains a higher proportion of its m-set. [sent-1055, score-0.775]
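Since Figures 2 and 3 are not reproduced here, the following is only a schematic reading of the transfer rule in the surrounding text: a drawn cannot-word is moved to an urn with a strictly higher proportion of its m-set, and a new topic urn is created when no such urn exists. The data layouts are assumptions.

```python
import random

def mset_proportion(urn, m_set):
    """Proportion of balls in the urn whose colors (words) lie in m_set."""
    total = sum(urn.values())
    return sum(urn.get(w, 0) for w in m_set) / total if total else 0.0

def transfer_cannot_word(urns, t, word, m_set):
    """Move a cannot-word out of topic urn t, per the indicator condition;
    increment the number of topics if no qualifying urn exists."""
    base = mset_proportion(urns[t], m_set)
    candidates = [k for k, u in enumerate(urns)
                  if k != t and mset_proportion(u, m_set) > base]
    if not candidates:                 # e.g., price vs. amazon mixed in t
        urns.append({})                # increment the number of topics by 1
        candidates = [len(urns) - 1]
    dst = random.choice(candidates)    # randomly sample a qualifying urn
    urns[t][word] = max(urns[t].get(word, 0) - 1, 0)
    urns[dst][word] = urns[dst].get(word, 0) + 1
    return dst
```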
64 can be successfully sampled and the current sweep (iteration) of Gibbs sampling has the same number of topics (? [sent-1057, score-0.358]
65 Two unsupervised baseline models that we compare with are: • LDA: LDA is the basic unsupervised topic model (Blei et al., 2003). [sent-1066, score-0.251]
66 An example is word camera in the domain “Camera”, since it co-occurs with most words in the corpus, leading to high similarity among topics/aspects. [sent-1098, score-0.159]
67 Sentences as documents: As noted in (Titov and McDonald, 2008), when standard topic models are applied to reviews as documents, they tend to produce topics that correspond to global properties of products (e.g., [sent-1099, score-0.469]
68 brand name), which makes topics overlap with each other. [sent-1101, score-0.158]
69 The reason is that all reviews of the same type of products discuss the same aspects of these products. [sent-1102, score-0.171]
70 Domain knowledge: User knowledge about a domain can vary a great deal. [sent-1137, score-0.159]
71 However, the perplexity metric does not reflect the semantic coherence of individual topics learned by a topic model (Newman et al.). [sent-1158, score-0.564]
72 Also, perplexity does not really reflect our goal of finding coherent aspects with accurate semantic clustering. [sent-1162, score-0.211]
73 Topic coherence (Mimno et al., 2011) (also called the “UMass” measure (Stevens and Buttler, 2012)) was proposed as a better alternative for assessing topic quality. [sent-1165, score-0.251]
74 It was shown by Mimno et al. that topic coherence is highly consistent with human expert labeling. [sent-1167, score-0.403]
75 A higher topic coherence score indicates higher quality of topics, i.e., more coherent topics. [sent-1169, score-0.361]
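For reference, a compact sketch of the UMass coherence score of Mimno et al. over a topic's top words; the input format (doc_sets mapping each word to the set of documents containing it) is an assumption.

```python
import math

def umass_coherence(top_words, doc_sets):
    """UMass coherence: sum of log((D(w_i, w_j) + 1) / D(w_j)) over
    ordered pairs of the topic's top words, where D counts documents."""
    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            co = len(doc_sets[top_words[i]] & doc_sets[top_words[j]])
            score += math.log((co + 1) / len(doc_sets[top_words[j]]))
    return score  # higher (less negative) indicates a more coherent topic
```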
76 This shows that the guidance using domain knowledge is more effective than using co-document frequency. [sent-1197, score-0.159]
77 We only say that for the task of aspect extraction and leveraging domain knowledge, these models do not generate as coherent aspects as ours because of their shortcomings discussed in § 1. [sent-1204, score-0.535]
78 When the number of topics was larger than 15, aspects found by each model became more and more overlapping, with several aspects expressing the same features of products. [sent-1207, score-0.222]
79 3 Human Evaluation Since our aim is to make topics more interpretable and conformable to human judgments, we worked with two judges who are familiar with Amazon products and reviews to evaluate the models subjectively. [sent-1221, score-0.281]
80 Since topics from topic models are rankings based on word probability and we do not know the number of correct topical words, a natural way to evaluate these rankings is to use Precision@n (or p@n), which was also used in (Mukherjee and Liu, 2012; Zhao et al.). [sent-1222, score-0.409]
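Precision@n over a ranked topic-word list reduces to the fraction of the top n words judged correct; a minimal sketch with made-up inputs:

```python
def precision_at_n(ranked_words, correct_words, n):
    """Fraction of the top-n ranked words labeled correct by the judges."""
    return sum(1 for w in ranked_words[:n] if w in correct_words) / n

ranked = ["battery", "life", "charge", "screen", "hour"]
correct = {"battery", "life", "charge", "hour"}  # hypothetical judgments
print(precision_at_n(ranked, correct, 5))  # 0.8
```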
81 There are two steps in human evaluation: topic labeling and word labeling. [sent-1225, score-0.293]
82 Following (Mimno et al., 2011), we asked the judges to label each topic as good or bad. [sent-1227, score-0.314]
83 Each topic was presented as a list of 10 most probable words in descending order of their probabilities under that topic. [sent-1228, score-0.251]
84 The models which generated the topics for labeling were not disclosed to the judges. [sent-1229, score-0.2]
85 In general, each topic was annotated as good if more than half of its words were coherently related to each other, together representing a semantic concept; otherwise bad. [sent-1230, score-0.285]
86 Agreement of human judges on topic labeling using Cohen’s Kappa yielded a score of 0. [sent-1231, score-0.356]
87 This is reasonable as topic labeling is an easy task and semantic coherence can be judged well by humans. [sent-1233, score-0.403]
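Agreement of this kind is typically computed with Cohen's Kappa; a small sketch using scikit-learn with placeholder labels:

```python
from sklearn.metrics import cohen_kappa_score

judge1 = ["good", "good", "bad", "good", "bad"]   # placeholder labels
judge2 = ["good", "bad", "bad", "good", "bad"]
print(cohen_kappa_score(judge1, judge2))  # about 0.62 on these toy labels
```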
88 Word Labeling: After topic labeling, we chose the topics labeled as good by both judges as good topics. [sent-1234, score-0.251]
89 Since judges already had the conception of each topic in mind when they were labeling topics, labeling each word was not difficult, which explains the high Kappa score for this labeling task (score = 0. [sent-1237, score-0.44]
90 We also found that when the domain knowledge is simple with one word usually expressing only one meaning/sense (e.g., [sent-1248, score-0.159]
91 The results from LDA-GPU and DF-LDA were inferior, and it was hard for the human judges to match them with aspects found by the other models for qualitative comparison. [sent-1258, score-0.174]
92 Table 4 shows three aspects Amazon, Price, Battery generated by each model in the domain “Computer”. [sent-1259, score-0.211]
93 8 Conclusion This paper proposed a new model to exploit domain knowledge in the form of m-sets and c-sets to generate coherent aspects (topics) from online reviews. [sent-1271, score-0.359]
94 A comprehensive evaluation using real-life online reviews from multiple domains shows that MC-LDA outperforms the state-of-the-art models significantly and discovers aspects with high semantic coherence. [sent-1274, score-0.205]
95 Incorporating domain knowledge into topic modeling via Dirichlet Forest priors. [sent-1289, score-0.41]
96 A framework for incorporating general domain knowledge into latent Dirichlet allocation using first-order logic. [sent-1293, score-0.193]
97 ILDA: interdependent LDA model for learning latent aspects and their ratings from online product reviews. [sent-1486, score-0.214]
98 Labeled LDA: a supervised topic model for credit attribution in multi-labeled corpora. [sent-1520, score-0.251]
100 Constrained LDA for grouping product features in opinion mining. [sent-1612, score-0.169]
wordName wordTfidf (topN-words)
[('urn', 0.461), ('topic', 0.251), ('ball', 0.249), ('lya', 0.185), ('color', 0.176), ('gpu', 0.175), ('lda', 0.163), ('balls', 0.161), ('topics', 0.158), ('aspect', 0.152), ('andrzejewski', 0.15), ('opinion', 0.134), ('mclda', 0.118), ('urns', 0.117), ('sentiment', 0.112), ('aspects', 0.111), ('coherence', 0.11), ('domain', 0.1), ('mukherjee', 0.092), ('mimno', 0.089), ('gibbs', 0.088), ('adverse', 0.088), ('price', 0.086), ('shortcomings', 0.082), ('sampling', 0.078), ('lu', 0.078), ('liu', 0.073), ('pages', 0.068), ('malu', 0.067), ('meichun', 0.067), ('riddhiman', 0.067), ('judges', 0.063), ('chengxiang', 0.062), ('senses', 0.06), ('reviews', 0.06), ('camera', 0.059), ('bing', 0.059), ('sauper', 0.059), ('spu', 0.059), ('knowledge', 0.059), ('generalized', 0.058), ('chen', 0.057), ('coherent', 0.055), ('proceedings', 0.054), ('battery', 0.054), ('titov', 0.053), ('burns', 0.051), ('castellanos', 0.051), ('gibbssampling', 0.051), ('zhiyuan', 0.051), ('jagarlamudi', 0.05), ('arjun', 0.05), ('yue', 0.049), ('jo', 0.047), ('zhai', 0.046), ('sampler', 0.046), ('amazon', 0.045), ('deal', 0.045), ('perplexity', 0.045), ('transfer', 0.044), ('handling', 0.043), ('labeling', 0.042), ('blei', 0.042), ('drawn', 0.041), ('petterson', 0.04), ('hongning', 0.04), ('hsu', 0.04), ('satisfies', 0.038), ('wiebe', 0.038), ('oh', 0.036), ('kdd', 0.036), ('shares', 0.036), ('summarization', 0.036), ('extraction', 0.035), ('seed', 0.035), ('hu', 0.035), ('product', 0.035), ('adjust', 0.035), ('wang', 0.034), ('latent', 0.034), ('online', 0.034), ('alice', 0.034), ('coherently', 0.034), ('dcomputercare', 0.034), ('edmund', 0.034), ('egpu', 0.034), ('increment', 0.034), ('ishwaran', 0.034), ('mahmoud', 0.034), ('moghaddam', 0.034), ('talley', 0.034), ('transferred', 0.034), ('scheme', 0.033), ('li', 0.032), ('cikm', 0.032), ('extended', 0.031), ('proportion', 0.031), ('emnlp', 0.03), ('xia', 0.03), ('sweep', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999911 77 emnlp-2013-Exploiting Domain Knowledge in Aspect Extraction
Author: Zhiyuan Chen ; Arjun Mukherjee ; Bing Liu ; Meichun Hsu ; Malu Castellanos ; Riddhiman Ghosh
Abstract: Aspect extraction is one of the key tasks in sentiment analysis. In recent years, statistical models have been used for the task. However, such models without any domain knowledge often produce aspects that are not interpretable in applications. To tackle the issue, some knowledge-based topic models have been proposed, which allow the user to input some prior domain knowledge to generate coherent aspects. However, existing knowledge-based topic models have several major shortcomings, e.g., little work has been done to incorporate the cannot-link type of knowledge or to automatically adjust the number of topics based on domain knowledge. This paper proposes a more advanced topic model, called MC-LDA (LDA with m-set and c-set), to address these problems, which is based on an Extended generalized Pólya urn (E-GPU) model (which is also proposed in this paper). Experiments on real-life product reviews from a variety of domains show that MC-LDA outperforms the existing state-of-the-art models markedly.
2 0.16528402 47 emnlp-2013-Collective Opinion Target Extraction in Chinese Microblogs
Author: Xinjie Zhou ; Xiaojun Wan ; Jianguo Xiao
Abstract: Microblog messages pose severe challenges for current sentiment analysis techniques due to some inherent characteristics such as the length limit and informal writing style. In this paper, we study the problem of extracting opinion targets of Chinese microblog messages. Such fine-grained word-level task has not been well investigated in microblogs yet. We propose an unsupervised label propagation algorithm to address the problem. The opinion targets of all messages in a topic are collectively extracted based on the assumption that similar messages may focus on similar opinion targets. Topics in microblogs are identified by hashtags or using clustering algorithms. Experimental results on Chinese microblogs show the effectiveness of our framework and algorithms.
3 0.1540902 99 emnlp-2013-Implicit Feature Detection via a Constrained Topic Model and SVM
Author: Wei Wang ; Hua Xu ; Xiaoqiu Huang
Abstract: Implicit feature detection, also known as implicit feature identification, is an essential aspect of feature-specific opinion mining but previous works have often ignored it. We think, based on the explicit sentences, several Support Vector Machine (SVM) classifiers can be established to do this task. Nevertheless, we believe it is possible to do better by using a constrained topic model instead of traditional attribute selection methods. Experiments show that this method outperforms the traditional attribute selection methods by a large margin and the detection task can be completed better.
4 0.15288669 100 emnlp-2013-Improvements to the Bayesian Topic N-Gram Models
Author: Hiroshi Noji ; Daichi Mochihashi ; Yusuke Miyao
Abstract: One of the language phenomena that n-gram language model fails to capture is the topic information of a given situation. We advance the previous study of the Bayesian topic language model by Wallach (2006) in two directions: one, investigating new priors to alleviate the sparseness problem caused by dividing all ngrams into exclusive topics, and two, developing a novel Gibbs sampler that enables moving multiple n-grams across different documents to another topic. Our blocked sampler can efficiently search for higher probability space even with higher order n-grams. In terms of modeling assumption, we found it is effective to assign a topic to only some parts of a document.
5 0.14661793 121 emnlp-2013-Learning Topics and Positions from Debatepedia
Author: Swapna Gottipati ; Minghui Qiu ; Yanchuan Sim ; Jing Jiang ; Noah A. Smith
Abstract: We explore Debatepedia, a communityauthored encyclopedia of sociopolitical debates, as evidence for inferring a lowdimensional, human-interpretable representation in the domain of issues and positions. We introduce a generative model positing latent topics and cross-cutting positions that gives special treatment to person mentions and opinion words. We evaluate the resulting representation’s usefulness in attaching opinionated documents to arguments and its consistency with human judgments about positions.
6 0.13160105 147 emnlp-2013-Optimized Event Storyline Generation based on Mixture-Event-Aspect Model
7 0.13057148 143 emnlp-2013-Open Domain Targeted Sentiment
8 0.12005487 133 emnlp-2013-Modeling Scientific Impact with Topical Influence Regression
9 0.1112202 148 emnlp-2013-Orthonormal Explicit Topic Analysis for Cross-Lingual Document Matching
10 0.10974138 94 emnlp-2013-Identifying Manipulated Offerings on Review Portals
11 0.1050544 63 emnlp-2013-Discourse Level Explanatory Relation Extraction from Product Reviews Using First-Order Logic
12 0.10493808 169 emnlp-2013-Semi-Supervised Representation Learning for Cross-Lingual Text Classification
13 0.10404337 16 emnlp-2013-A Unified Model for Topics, Events and Users on Twitter
14 0.10212244 194 emnlp-2013-Unsupervised Relation Extraction with General Domain Knowledge
15 0.10084693 120 emnlp-2013-Learning Latent Word Representations for Domain Adaptation using Supervised Word Clustering
16 0.099346735 109 emnlp-2013-Is Twitter A Better Corpus for Measuring Sentiment Similarity?
17 0.084231175 10 emnlp-2013-A Multi-Teraflop Constituency Parser using GPUs
18 0.078574032 29 emnlp-2013-Automatic Domain Partitioning for Multi-Domain Learning
19 0.073617242 6 emnlp-2013-A Generative Joint, Additive, Sequential Model of Topics and Speech Acts in Patient-Doctor Communication
20 0.072464004 138 emnlp-2013-Naive Bayes Word Sense Induction
topicId topicWeight
[(0, -0.233), (1, 0.099), (2, -0.183), (3, -0.052), (4, 0.089), (5, -0.069), (6, 0.073), (7, -0.0), (8, 0.036), (9, -0.043), (10, -0.103), (11, -0.297), (12, -0.11), (13, 0.085), (14, 0.016), (15, 0.132), (16, 0.147), (17, 0.014), (18, -0.02), (19, -0.144), (20, 0.009), (21, -0.011), (22, -0.137), (23, 0.113), (24, -0.001), (25, 0.008), (26, -0.017), (27, 0.113), (28, -0.023), (29, 0.029), (30, -0.03), (31, -0.046), (32, -0.021), (33, -0.012), (34, 0.035), (35, 0.035), (36, -0.079), (37, -0.064), (38, 0.036), (39, -0.037), (40, -0.026), (41, -0.015), (42, 0.016), (43, 0.035), (44, 0.016), (45, -0.086), (46, 0.028), (47, -0.001), (48, -0.025), (49, 0.043)]
simIndex simValue paperId paperTitle
same-paper 1 0.96257967 77 emnlp-2013-Exploiting Domain Knowledge in Aspect Extraction
Author: Zhiyuan Chen ; Arjun Mukherjee ; Bing Liu ; Meichun Hsu ; Malu Castellanos ; Riddhiman Ghosh
Abstract: Aspect extraction is one of the key tasks in sentiment analysis. In recent years, statistical models have been used for the task. However, such models without any domain knowledge often produce aspects that are not interpretable in applications. To tackle the issue, some knowledge-based topic models have been proposed, which allow the user to input some prior domain knowledge to generate coherent aspects. However, existing knowledge-based topic models have several major shortcomings, e.g., little work has been done to incorporate the cannot-link type of knowledge or to automatically adjust the number of topics based on domain knowledge. This paper proposes a more advanced topic model, called MC-LDA (LDA with m-set and c-set), to address these problems, which is based on an Extended generalized Pólya urn (E-GPU) model (which is also proposed in this paper). Experiments on real-life product reviews from a variety of domains show that MCLDA outperforms the existing state-of-the-art models markedly.
2 0.79437256 99 emnlp-2013-Implicit Feature Detection via a Constrained Topic Model and SVM
Author: Wei Wang ; Hua Xu ; Xiaoqiu Huang
Abstract: Implicit feature detection, also known as implicit feature identification, is an essential aspect of feature-specific opinion mining but previous works have often ignored it. We think, based on the explicit sentences, several Support Vector Machine (SVM) classifiers can be established to do this task. Nevertheless, we believe it is possible to do better by using a constrained topic model instead of traditional attribute selection methods. Experiments show that this method outperforms the traditional attribute selection methods by a large margin and the detection task can be completed better.
3 0.73757547 100 emnlp-2013-Improvements to the Bayesian Topic N-Gram Models
Author: Hiroshi Noji ; Daichi Mochihashi ; Yusuke Miyao
Abstract: One of the language phenomena that n-gram language model fails to capture is the topic information of a given situation. We advance the previous study of the Bayesian topic language model by Wallach (2006) in two directions: one, investigating new priors to alleviate the sparseness problem caused by dividing all ngrams into exclusive topics, and two, developing a novel Gibbs sampler that enables moving multiple n-grams across different documents to another topic. Our blocked sampler can efficiently search for higher probability space even with higher order n-grams. In terms of modeling assumption, we found it is effective to assign a topic to only some parts of a document.
4 0.7351644 121 emnlp-2013-Learning Topics and Positions from Debatepedia
Author: Swapna Gottipati ; Minghui Qiu ; Yanchuan Sim ; Jing Jiang ; Noah A. Smith
Abstract: We explore Debatepedia, a communityauthored encyclopedia of sociopolitical debates, as evidence for inferring a lowdimensional, human-interpretable representation in the domain of issues and positions. We introduce a generative model positing latent topics and cross-cutting positions that gives special treatment to person mentions and opinion words. We evaluate the resulting representation’s usefulness in attaching opinionated documents to arguments and its consistency with human judgments about positions.
5 0.68530971 47 emnlp-2013-Collective Opinion Target Extraction in Chinese Microblogs
Author: Xinjie Zhou ; Xiaojun Wan ; Jianguo Xiao
Abstract: Microblog messages pose severe challenges for current sentiment analysis techniques due to some inherent characteristics such as the length limit and informal writing style. In this paper, we study the problem of extracting opinion targets of Chinese microblog messages. Such fine-grained word-level task has not been well investigated in microblogs yet. We propose an unsupervised label propagation algorithm to address the problem. The opinion targets of all messages in a topic are collectively extracted based on the assumption that similar messages may focus on similar opinion targets. Topics in microblogs are identified by hashtags or using clustering algorithms. Experimental results on Chinese microblogs show the effectiveness of our framework and algorithms.
6 0.58639199 94 emnlp-2013-Identifying Manipulated Offerings on Review Portals
7 0.56251311 148 emnlp-2013-Orthonormal Explicit Topic Analysis for Cross-Lingual Document Matching
8 0.54070503 199 emnlp-2013-Using Topic Modeling to Improve Prediction of Neuroticism and Depression in College Students
9 0.50602323 133 emnlp-2013-Modeling Scientific Impact with Topical Influence Regression
10 0.48910081 63 emnlp-2013-Discourse Level Explanatory Relation Extraction from Product Reviews Using First-Order Logic
11 0.45554495 147 emnlp-2013-Optimized Event Storyline Generation based on Mixture-Event-Aspect Model
12 0.45400426 6 emnlp-2013-A Generative Joint, Additive, Sequential Model of Topics and Speech Acts in Patient-Doctor Communication
13 0.45213956 144 emnlp-2013-Opinion Mining in Newspaper Articles by Entropy-Based Word Connections
14 0.43139675 11 emnlp-2013-A Multimodal LDA Model integrating Textual, Cognitive and Visual Modalities
15 0.41981792 138 emnlp-2013-Naive Bayes Word Sense Induction
16 0.3981151 202 emnlp-2013-Where Not to Eat? Improving Public Policy by Predicting Hygiene Inspections Using Online Reviews
17 0.38558215 16 emnlp-2013-A Unified Model for Topics, Events and Users on Twitter
18 0.37953183 109 emnlp-2013-Is Twitter A Better Corpus for Measuring Sentiment Similarity?
20 0.37295359 194 emnlp-2013-Unsupervised Relation Extraction with General Domain Knowledge
topicId topicWeight
[(3, 0.023), (9, 0.025), (11, 0.01), (18, 0.041), (22, 0.159), (30, 0.057), (31, 0.013), (50, 0.013), (51, 0.161), (66, 0.046), (71, 0.056), (74, 0.148), (75, 0.057), (77, 0.022), (96, 0.035), (97, 0.013)]
simIndex simValue paperId paperTitle
same-paper 1 0.87437433 77 emnlp-2013-Exploiting Domain Knowledge in Aspect Extraction
Author: Zhiyuan Chen ; Arjun Mukherjee ; Bing Liu ; Meichun Hsu ; Malu Castellanos ; Riddhiman Ghosh
Abstract: Aspect extraction is one of the key tasks in sentiment analysis. In recent years, statistical models have been used for the task. However, such models without any domain knowledge often produce aspects that are not interpretable in applications. To tackle the issue, some knowledge-based topic models have been proposed, which allow the user to input some prior domain knowledge to generate coherent aspects. However, existing knowledge-based topic models have several major shortcomings, e.g., little work has been done to incorporate the cannot-link type of knowledge or to automatically adjust the number of topics based on domain knowledge. This paper proposes a more advanced topic model, called MC-LDA (LDA with m-set and c-set), to address these problems, which is based on an Extended generalized Pólya urn (E-GPU) model (which is also proposed in this paper). Experiments on real-life product reviews from a variety of domains show that MC-LDA outperforms the existing state-of-the-art models markedly.
2 0.84118015 170 emnlp-2013-Sentiment Analysis: How to Derive Prior Polarities from SentiWordNet
Author: Marco Guerini ; Lorenzo Gatti ; Marco Turchi
Abstract: Assigning a positive or negative score to a word out of context (i.e. a word’s prior polarity) is a challenging task for sentiment analysis. In the literature, various approaches based on SentiWordNet have been proposed. In this paper, we compare the most often used techniques together with newly proposed ones and incorporate all of them in a learning framework to see whether blending them can further improve the estimation of prior polarity scores. Using two different versions of SentiWordNet and testing regression and classification models across tasks and datasets, our learning approach consistently outperforms the single metrics, providing a new state-ofthe-art approach in computing words’ prior polarity for sentiment analysis. We conclude our investigation showing interesting biases in calculated prior polarity scores when word Part of Speech and annotator gender are considered.
3 0.82444251 41 emnlp-2013-Building Event Threads out of Multiple News Articles
Author: Xavier Tannier ; Veronique Moriceau
Abstract: We present an approach for building multi-document event threads from a large corpus of newswire articles. An event thread is basically a succession of events belonging to the same story. It helps the reader to contextualize the information contained in a single article, by navigating backward or forward in the thread from this article. A specific effort is also made on the detection of reactions to a particular event. In order to build these event threads, we use a cascade of classifiers and other modules, taking advantage of the redundancy of information in the newswire corpus. We also share interesting comments concerning our manual annotation procedure for building a training and testing set.
4 0.81796956 168 emnlp-2013-Semi-Supervised Feature Transformation for Dependency Parsing
Author: Wenliang Chen ; Min Zhang ; Yue Zhang
Abstract: In current dependency parsing models, conventional features (i.e. base features) defined over surface words and part-of-speech tags in a relatively high-dimensional feature space may suffer from the data sparseness problem and thus exhibit less discriminative power on unseen data. In this paper, we propose a novel semi-supervised approach to addressing the problem by transforming the base features into high-level features (i.e. meta features) with the help of a large amount of automatically parsed data. The meta features are used together with base features in our final parser. Our studies indicate that our proposed approach is very effective in processing unseen data and features. Experiments on Chinese and English data sets show that the final parser achieves the best-reported accuracy on the Chinese data and comparable accuracy with the best known parsers on the English data.
5 0.81082165 136 emnlp-2013-Multi-Domain Adaptation for SMT Using Multi-Task Learning
Author: Lei Cui ; Xilun Chen ; Dongdong Zhang ; Shujie Liu ; Mu Li ; Ming Zhou
Abstract: Domain adaptation for SMT usually adapts models to an individual specific domain. However, it often lacks some correlation among different domains where common knowledge could be shared to improve the overall translation quality. In this paper, we propose a novel multi-domain adaptation approach for SMT using Multi-Task Learning (MTL), with in-domain models tailored for each specific domain and a general-domain model shared by different domains. The parameters of these models are tuned jointly via MTL so that they can learn general knowledge more accurately and exploit domain knowledge better. Our experiments on a largescale English-to-Chinese translation task validate that the MTL-based adaptation approach significantly and consistently improves the translation quality compared to a non-adapted baseline. Furthermore, it also outperforms the individual adaptation of each specific domain.
6 0.80223644 25 emnlp-2013-Appropriately Incorporating Statistical Significance in PMI
7 0.80181342 74 emnlp-2013-Event-Based Time Label Propagation for Automatic Dating of News Articles
8 0.78130072 179 emnlp-2013-Summarizing Complex Events: a Cross-Modal Solution of Storylines Extraction and Reconstruction
9 0.77530837 76 emnlp-2013-Exploiting Discourse Analysis for Article-Wide Temporal Classification
10 0.77095461 48 emnlp-2013-Collective Personal Profile Summarization with Social Networks
11 0.76243138 47 emnlp-2013-Collective Opinion Target Extraction in Chinese Microblogs
12 0.75851679 118 emnlp-2013-Learning Biological Processes with Global Constraints
13 0.75699353 152 emnlp-2013-Predicting the Presence of Discourse Connectives
14 0.75491107 21 emnlp-2013-An Empirical Study Of Semi-Supervised Chinese Word Segmentation Using Co-Training
15 0.74851024 120 emnlp-2013-Learning Latent Word Representations for Domain Adaptation using Supervised Word Clustering
16 0.74835527 99 emnlp-2013-Implicit Feature Detection via a Constrained Topic Model and SVM
17 0.7474485 114 emnlp-2013-Joint Learning and Inference for Grammatical Error Correction
18 0.74711376 51 emnlp-2013-Connecting Language and Knowledge Bases with Embedding Models for Relation Extraction
19 0.74291122 65 emnlp-2013-Document Summarization via Guided Sentence Compression
20 0.74186808 56 emnlp-2013-Deep Learning for Chinese Word Segmentation and POS Tagging