acl acl2011 acl2011-150 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Xipeng Qiu ; Xuanjing Huang ; Zhao Liu ; Jinlong Zhou
Abstract: Recently, hierarchical text classification has become an active research topic. The essential idea is that the descendant classes can share the information of the ancestor classes in a predefined taxonomy. In this paper, we claim that each class has several latent concepts and its subclasses share information with these different concepts respectively. Then, we propose a variant Passive-Aggressive (PA) algorithm for hierarchical text classification with latent concepts. Experimental results show that the performance of our algorithm is competitive with the recently proposed hierarchical classification algorithms.
Reference: text
sentIndex sentText sentNum sentScore
1 Hierarchical Text Classification with Latent Concepts Xipeng Qiu, Xuanjing Huang, Zhao Liu and Jinlong Zhou, School of Computer Science, Fudan University, {xpqiu,xjhuang}@fudan.edu.cn [sent-1, score-0.074]
2 Abstract Recently, hierarchical text classification has become an active research topic. [sent-3, score-0.541]
3 The essential idea is that the descendant classes can share the information of the ancestor classes in a predefined taxonomy. [sent-4, score-0.341]
4 In this paper, we claim that each class has several latent concepts and its subclasses share information with these different concepts respectively. [sent-5, score-1.009]
5 Then, we propose a variant Passive-Aggressive (PA) algorithm for hierarchical text classification with latent concepts. [sent-6, score-0.814]
6 Experimental results show that the performance of our algorithm is competitive with the recently proposed hierarchical classification algorithms. [sent-7, score-0.531]
7 1 Introduction Text classification is a crucial and well-proven method for organizing large-scale document collections. [sent-8, score-0.198]
8 The predefined categories are formed by different criteria, e. [sent-9, score-0.103]
9 “Entertainment”, “Sports” and “Education” in news classification, “Junk Email” and “Ordinary Email” in email classification. [sent-11, score-0.097]
10 Empirical evaluations have shown that most of these methods are quite effective in traditional text classification applications. [sent-13, score-0.23]
11 In the past several years, hierarchical text classification has become an active research topic in the database area (Koller and Sahami, 1997; Weigend et al. [sent-14, score-0.541]
12 Different from traditional classification, the document collections are organized [sent-17, score-0.069]
13 as a hierarchical class structure in many application fields: web taxonomies (i. [sent-19, score-0.473]
14 The approaches to hierarchical text classification can be divided into three categories: flat, local and global approaches. [sent-26, score-0.58]
15 The flat approach is traditional multi-class classification in a flat fashion without hierarchical class information, using only the classes at the leaf nodes of the taxonomy (Yang and Liu, 1999; Yang and Pedersen, 1997; Qiu et al. [sent-27, score-0.917]
16 The local approach proceeds in a top-down fashion: it first picks the most relevant categories at the top level and then recursively makes the choice among the lower-level categories (Sun and Lim, 2001; Liu et al. [sent-29, score-0.068]
17 The global approach builds only one classifier to discriminate all categories in a hierarchy (Cai and Hofmann, 2004; Rousu et al. [sent-31, score-0.157]
18 The essential idea of the global approach is that closely related classes share some common underlying factors. [sent-34, score-0.129]
19 In particular, the descendant classes can share the characteristics of the ancestor classes, which is similar to multi-task learning (Caruana, 1997; Xue et al. [sent-35, score-0.25]
20 Because global hierarchical categorization can avoid the drawback of irrecoverable errors at high levels, it is more popular in the machine learning domain. [sent-37, score-0.454]
21 However, the taxonomy is defined manually and is usually very difficult to organize at large scale. [sent-38, score-0.189]
22 (Proceedings of the 2011 Association for Computational Linguistics: shortpapers, pages 598–602.) (a) (b) Figure 1: Example of latent nodes in a taxonomy. [sent-41, score-0.375]
23 For example, the “Sports” node in a taxonomy has six subclasses (Fig. [sent-42, score-0.345]
24 1a), but these subclasses can be grouped into three unobservable concepts (Fig. [sent-43, score-0.346]
25 1b). These concepts can show the underlying factors more clearly. [sent-45, score-0.272]
26 In this paper, we claim that each class may have several latent concepts and its subclasses share information with these different concepts respectively. [sent-46, score-1.009]
27 Then we propose a variant Passive-Aggressive (PA) algorithm that maximizes the margins between latent paths. [sent-47, score-0.273]
28 Section 2 describes the basic model of hierarchical classification. [sent-49, score-0.344]
29 2 Hierarchical Text Classification In text classification, documents are often represented with the vector space model (VSM) (Salton et al. [sent-53, score-0.049]
30 Following (Cai and Hofmann, 2007), we incorporate the hierarchical information in feature representation. [sent-55, score-0.344]
31 The basic idea is that the notion of class attributes will allow generalization to take place across (similar) categories and not just across training examples belonging to the same category. [sent-56, score-0.133]
32 Assume that the set of categories is Ω = [ω1, · · · , ωm], where m is the number of categories, organized in a hierarchical structure such as a tree or DAG. [sent-57, score-0.448]
33 Given a sample x with its class path y in the taxonomy, we define the feature representation as Φ(x, y) = Λ(y) ⊗ x, (1) where Λ(y) = (λ1(y), · · · , λm(y))T ∈ Rm and ⊗ is the Kronecker product. [sent-58, score-0.34]
34 We can define λi(y) = ti if ωi ∈ y, and 0 otherwise, (2) where ti ≥ 0 is the attribute value for node ωi. [sent-59, score-0.071]
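The feature map of Eqs. (1)-(2) can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes a single shared attribute value `t` for every node on the path (the paper allows per-node values ti), and the function and variable names are hypothetical.

```python
import numpy as np

def hierarchical_features(x, path, m, t=1.0):
    """Phi(x, y) = Lambda(y) (Kronecker product) x.

    x    : document feature vector, shape (d,)
    path : indices of the classes on the label path y
    m    : total number of classes in the taxonomy
    t    : attribute value t_i (a single shared value here,
           an illustrative simplification)
    """
    lam = np.zeros(m)
    lam[list(path)] = t          # lambda_i(y) = t_i if omega_i in y, else 0
    return np.kron(lam, x)       # shape (m * d,)

# A document vector and a root-to-leaf path {0, 2, 5} in a 6-class taxonomy:
x = np.array([0.5, 1.0, 0.0])
phi = hierarchical_features(x, [0, 2, 5], m=6)
```

The Kronecker product simply copies x into the blocks of the m·d-dimensional vector that correspond to classes on the path, so an ancestor and its descendants share weight blocks across training examples.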
35 3 Hierarchical Text Classification with Latent Concepts In this section, we first extend the Passive-Aggressive (PA) algorithm to hierarchical classification (HPA), then modify it to incorporate latent concepts (LHPA). [sent-62, score-0.961]
36 3.1 Hierarchical Passive-Aggressive Algorithm The PA algorithm is an online learning algorithm which aims to find the new weight vector wt+1 as the solution to the following constrained optimization problem in round t. [sent-64, score-0.039]
37 (4) where ℓ(w; (xt, yt)) is the hinge-loss function and ξ is a slack variable. [sent-68, score-0.032]
38 Hierarchical text classification is loss-sensitive with respect to the hierarchical structure. [sent-69, score-0.885]
39 We need to discriminate misclassifications that are “nearly correct” from those that are “clearly incorrect”. [sent-70, score-0.087]
40 Here we use the tree induced error ∆(y, y′), which is the length of the shortest path connecting the leaf nodes yleaf and y′leaf. [sent-71, score-0.302]
41 Given an example (x, y), we look for the w that maximizes the separation margin γ(w; (x, y)) between the score of the correct path y and that of the closest error path ŷ. [sent-73, score-0.356]
42 Unlike the standard PA algorithm, which achieves a margin of at least 1 as often as possible, we want the margin to be related to the tree induced error ∆(y, ŷ). [sent-75, score-0.216]
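The tree induced error ∆(y, y′) described above can be computed from parent pointers. A sketch assuming the taxonomy is a tree; the `parent` dictionary, the toy taxonomy, and the function name are illustrative, not from the paper.

```python
def tree_induced_error(parent, u, v):
    """Shortest-path length between classes u and v in a tree given
    as a dict of parent pointers (the root maps to None)."""
    def ancestors(n):
        chain = []
        while n is not None:
            chain.append(n)
            n = parent[n]
        return chain
    au, av = ancestors(u), ancestors(v)
    sv = set(av)
    # Walk up from u until we hit the lowest common ancestor; the
    # distance is (steps up from u) + (steps up from v).
    for depth_u, a in enumerate(au):
        if a in sv:
            return depth_u + av.index(a)
    raise ValueError("nodes are not in the same tree")

# Toy taxonomy: root -> {A, B}, A -> {a1, a2}, B -> {b1}
parent = {"root": None, "A": "root", "B": "root",
          "a1": "A", "a2": "A", "b1": "B"}
```

With this toy tree, confusing two siblings under `A` costs 2, while confusing leaves under different top-level classes costs 4, which is exactly the graded penalty the margin constraint uses.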
43 If ℓ = 0 then wt itself satisfies the constraint in Eq. [sent-77, score-0.269]
44 w − wt − α(Φ(x, y) − Φ(x, ŷ)) = 0, (9) and then we get w = wt + α(Φ(x, y) − Φ(x, ŷ)). (10) [sent-89, score-0.538]
45 (7), we get L(α) = −(1/2) α² ||Φ(x, y) − Φ(x, ŷ)||² + α wtT(Φ(x, y) − Φ(x, ŷ)) − α∆(y, ŷ). (11) Differentiate Eq. [sent-93, score-0.036]
46 (11) with respect to α and set the derivative to zero, which gives α∗ = (∆(y, ŷ) − wtT(Φ(x, y) − Φ(x, ŷ))) / ||Φ(x, y) − Φ(x, ŷ)||². (12) From α + β = C, we know that α < C, so α∗ = min(C, (∆(y, ŷ) − wtT(Φ(x, y) − Φ(x, ŷ))) / ||Φ(x, y) − Φ(x, ŷ)||²). (13) [sent-94, score-0.036]
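The closed-form step of Eqs. (12)-(13) translates directly into code. A sketch with hypothetical names; it assumes the two feature vectors Φ(x, y) and Φ(x, ŷ) have already been computed as dense arrays.

```python
import numpy as np

def hpa_update(w, phi_y, phi_yhat, delta, C):
    """One hierarchical PA step:
    alpha* = min(C, (delta - w.(phi_y - phi_yhat)) / ||phi_y - phi_yhat||^2),
    then w <- w + alpha* (phi_y - phi_yhat)."""
    diff = phi_y - phi_yhat
    loss = delta - w.dot(diff)       # margin shortfall w.r.t. the tree error
    if loss <= 0.0:
        return w                     # constraint already satisfied (ell = 0)
    alpha = min(C, loss / diff.dot(diff))
    return w + alpha * diff

# A toy round: zero weights, Delta(y, yhat) = 2, C = 1.
w0 = np.zeros(4)
phi_y = np.array([1.0, 0.0, 1.0, 0.0])
phi_yhat = np.array([0.0, 1.0, 0.0, 1.0])
w1 = hpa_update(w0, phi_y, phi_yhat, delta=2.0, C=1.0)
```

After one update the score gap w·(Φ(x, y) − Φ(x, ŷ)) reaches the required ∆(y, ŷ), so a second call with the same example leaves w unchanged.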
47 3.2 Hierarchical Passive-Aggressive Algorithm with Latent Concepts For the hierarchical taxonomy Ω = (ω1, · · · , ωc), we define that each class ωi has a set Hωi = {h1ωi, · · · , hmωi} of m latent concepts, which are unobservable. [sent-96, score-0.74]
48 Given a label path y, it has a set of several latent paths Hy. [sent-97, score-0.328]
49 For a latent path z ∈ Hy, the function Proj(z) is the projection from the latent path z to its corresponding label path y. [sent-98, score-0.824]
50 Then we can define the predicted latent path ĥ and [sent-99, score-0.328]
51 the most correct latent path h∗: ĥ = argmax_{Proj(z)≠y} wTΦ(x, z), (14) h∗ = argmax_{Proj(z)=y} wTΦ(x, z). [sent-101, score-0.192]
52 (15) Similar to the above analysis of HPA, we re-define the margin as γ(w; (x, y)) = wTΦ(x, h∗) − wTΦ(x, ĥ), (16) and then we get the optimal update step α∗L = min(C, ℓ(wt; (x, y)) / ||Φ(x, h∗) − Φ(x, ĥ)||²). [sent-102, score-0.12]
53 (17) Finally, we get the update strategy w = wt + α∗L (Φ(x, h∗) − Φ(x, ĥ)). [sent-103, score-0.305]
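The latent-path selection of Eqs. (14)-(15) and the update of Eqs. (17)-(18) can be sketched as below. Everything here is illustrative: `latent_paths`, `proj`, and `feat` stand in for the paper's Hy, Proj(·), and Φ(x, z), and the toy example enumerates latent paths explicitly instead of searching the taxonomy.

```python
import numpy as np

def lhpa_step(w, x, latent_paths, proj, y, feat, C, delta):
    """One LHPA step, a sketch with hypothetical helpers:
    latent_paths enumerates candidate latent paths z, proj[z] is the
    label path Proj(z), and feat(x, z) plays the role of Phi(x, z)."""
    # h* : highest-scoring latent path that projects to the true path y
    h_star = max((z for z in latent_paths if proj[z] == y),
                 key=lambda z: w.dot(feat(x, z)))
    # h^ : highest-scoring latent path projecting to any other label path
    h_hat = max((z for z in latent_paths if proj[z] != y),
                key=lambda z: w.dot(feat(x, z)))
    diff = feat(x, h_star) - feat(x, h_hat)   # Phi(x, h*) - Phi(x, h^)
    loss = delta - w.dot(diff)                # margin shortfall
    if loss <= 0.0:
        return w                              # margin already met
    alpha = min(C, loss / diff.dot(diff))     # Eq. (17)
    return w + alpha * diff                   # Eq. (18)

# Toy example: two latent concepts for label path "y", one for "z".
latent_paths = ["y.h1", "y.h2", "z.h1"]
proj = {"y.h1": "y", "y.h2": "y", "z.h1": "z"}
embed = {"y.h1": np.array([1.0, 0.0, 0.0]),
         "y.h2": np.array([0.0, 1.0, 0.0]),
         "z.h1": np.array([0.0, 0.0, 1.0])}
feat = lambda x, z: embed[z] * x
w_new = lhpa_step(np.zeros(3), 1.0, latent_paths, proj, "y", feat,
                  C=1.0, delta=1.0)
```

The structure is the same as the HPA step; the only change is that the two feature vectors come from the best-scoring latent paths on each side of the projection constraint.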
54 (18) Our hierarchical passive-aggressive algorithm with latent concepts (LHPA) is shown in Algorithm 1. [sent-104, score-0.813]
55 In this paper, we use two latent concepts for each class. [sent-105, score-0.43]
56 4.1 Datasets We evaluate our proposed algorithm on two datasets with hierarchical category structures. [sent-107, score-0.383]
57 WIPO-alpha dataset This dataset1 consists of 1372 training and 358 testing documents comprising the D section of the hierarchy. [sent-108, score-0.059]
58 The number of nodes in the hierarchy was 188, with a maximum depth of 3. [sent-109, score-0.098]
59 The dataset was processed into a bag-of-words representation with TF·IDF weighting. 1World Intellectual Property Organization, wipo . [sent-110, score-0.059]
60 LSHTC dataset This dataset2 was constructed by crawling web pages found in the Open Directory Project (ODP), translating them into feature vectors (content vectors), and splitting the set of Web pages into a training, a validation and a test set per ODP category. [sent-116, score-0.059]
61 4.2 Performance Measurement Macro Precision, Macro Recall and Macro F1 are the most widely used performance measures for text classification problems. [sent-119, score-0.197]
62 The macro strategy computes macro precision and recall by averaging the precision/recall of each category; it is preferred because the categories are usually unbalanced and pose more challenges to classifiers. [sent-120, score-0.584]
63 MacroF1 = (2 × P × R) / (P + R), (19) 2Large Scale Hierarchical Text Classification Pascal Challenge, http : / / l c . [sent-122, score-0.148]
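Eq. (19) combines the macro-averaged P and R into Macro F1. A minimal sketch (hypothetical function name) of that convention: per-class precision and recall are averaged uniformly before being combined.

```python
from collections import Counter

def macro_scores(gold, pred):
    """Macro-averaged precision, recall and F1: per-class P/R are
    averaged uniformly, then MacroF1 = 2PR / (P + R), as in Eq. (19)."""
    classes = sorted(set(gold) | set(pred))
    tp = Counter(g for g, p in zip(gold, pred) if g == p)
    gold_n, pred_n = Counter(gold), Counter(pred)
    precisions = [tp[c] / pred_n[c] if pred_n[c] else 0.0 for c in classes]
    recalls = [tp[c] / gold_n[c] if gold_n[c] else 0.0 for c in classes]
    P = sum(precisions) / len(classes)
    R = sum(recalls) / len(classes)
    F1 = 2 * P * R / (P + R) if P + R else 0.0
    return P, R, F1

P, R, F1 = macro_scores(gold=["a", "a", "b", "c"],
                        pred=["a", "b", "b", "c"])
```

Because every class contributes equally to the average regardless of its frequency, rare classes weigh as much as common ones, which is why this measure suits the unbalanced hierarchical datasets used here.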
64 We also use tree induced error (TIE) in the experiments. [sent-139, score-0.048]
65 3 Results We implement three algorithms3: PA(Flat PA), HPA(Hierarchical PA) and LHPA(Hierarchical PA with latent concepts). [sent-141, score-0.192]
66 From Table 2, we can see that it is not always useful to incorporate the hierarchical information. [sent-146, score-0.344]
67 Though the subclasses can share information with their parent class, the shared information may be different for each subclass. [sent-147, score-0.233]
68 So we should decompose the underlying factors into different latent concepts. [sent-148, score-0.226]
69 5 Conclusion In this paper, we propose a variant Passive-Aggressive algorithm for hierarchical text classification with latent concepts. [sent-149, score-0.814]
70 Support vector machines classification with a very large-scale taxonomy. [sent-191, score-0.148]
71 In Large Scale Hierarchical Text classification (LSHTC) Pascal Challenge. [sent-196, score-0.148]
72 Hierarchical multi-class text categorization with global margin maximization. [sent-199, score-0.204]
73 A comparative study on feature selection in text categorization. [sent-254, score-0.049]
wordName wordTfidf (topN-words)
[('hierarchical', 0.344), ('wt', 0.269), ('macro', 0.258), ('concepts', 0.238), ('lhpa', 0.211), ('rousu', 0.211), ('qiu', 0.193), ('latent', 0.192), ('subclasses', 0.172), ('xipeng', 0.169), ('classification', 0.148), ('taxonomy', 0.139), ('path', 0.136), ('cai', 0.128), ('lshtc', 0.127), ('xuanjing', 0.112), ('odp', 0.112), ('pa', 0.099), ('email', 0.097), ('directory', 0.092), ('yang', 0.087), ('argpromj', 0.085), ('hpa', 0.085), ('jinlong', 0.085), ('weigend', 0.085), ('ywt', 0.085), ('margin', 0.084), ('hofmann', 0.077), ('flat', 0.075), ('descendant', 0.074), ('miao', 0.074), ('fudan', 0.074), ('yleaf', 0.074), ('categorization', 0.071), ('azx', 0.069), ('categories', 0.068), ('class', 0.065), ('taxonomies', 0.064), ('passiveaggressive', 0.061), ('share', 0.061), ('dataset', 0.059), ('salton', 0.059), ('ancestor', 0.059), ('classes', 0.056), ('hierarchy', 0.054), ('koller', 0.05), ('discriminate', 0.05), ('xt', 0.05), ('pascal', 0.05), ('scale', 0.05), ('text', 0.049), ('pedersen', 0.049), ('yt', 0.049), ('liu', 0.048), ('induced', 0.048), ('nodes', 0.044), ('xue', 0.043), ('leaf', 0.043), ('claim', 0.043), ('variant', 0.042), ('gradient', 0.04), ('algorithm', 0.039), ('global', 0.039), ('sandor', 0.037), ('kronecker', 0.037), ('misclassification', 0.037), ('argmaxz', 0.037), ('argmyaxf', 0.037), ('juho', 0.037), ('junk', 0.037), ('eyrw', 0.037), ('proj', 0.037), ('riw', 0.037), ('subclass', 0.037), ('unobservable', 0.037), ('yse', 0.037), ('get', 0.036), ('organized', 0.036), ('icml', 0.036), ('predefined', 0.035), ('acm', 0.034), ('node', 0.034), ('grouped', 0.034), ('fashion', 0.034), ('saunders', 0.034), ('folders', 0.034), ('sahami', 0.034), ('caruana', 0.034), ('abc', 0.034), ('intellectual', 0.034), ('knn', 0.034), ('vsm', 0.034), ('underlying', 0.034), ('traditional', 0.033), ('irs', 0.032), ('slack', 0.032), ('hy', 0.032), ('liao', 0.032), ('multilabel', 0.032), ('tent', 0.032)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 150 acl-2011-Hierarchical Text Classification with Latent Concepts
Author: Xipeng Qiu ; Xuanjing Huang ; Zhao Liu ; Jinlong Zhou
2 0.15928322 14 acl-2011-A Hierarchical Model of Web Summaries
Author: Yves Petinot ; Kathleen McKeown ; Kapil Thadani
Abstract: We investigate the relevance of hierarchical topic models to represent the content of Web gists. We focus our attention on DMOZ, a popular Web directory, and propose two algorithms to infer such a model from its manually-curated hierarchy of categories. Our first approach, based on information-theoretic grounds, uses an algorithm similar to recursive feature selection. Our second approach is fully Bayesian and derived from the more general model, hierarchical LDA. We evaluate the performance of both models against a flat 1-gram baseline and show improvements in terms of perplexity over held-out data.
3 0.10764544 78 acl-2011-Confidence-Weighted Learning of Factored Discriminative Language Models
Author: Viet Ha Thuc ; Nicola Cancedda
Abstract: Language models based on word surface forms only are unable to benefit from available linguistic knowledge, and tend to suffer from poor estimates for rare features. We propose an approach to overcome these two limitations. We use factored features that can flexibly capture linguistic regularities, and we adopt confidence-weighted learning, a form of discriminative online learning that can better take advantage of a heavy tail of rare features. Finally, we extend the confidence-weighted learning to deal with label noise in training data, a common case with discriminative lan- guage modeling.
4 0.092622899 202 acl-2011-Learning Hierarchical Translation Structure with Linguistic Annotations
Author: Markos Mylonakis ; Khalil Sima'an
Abstract: While it is generally accepted that many translation phenomena are correlated with linguistic structures, employing linguistic syntax for translation has proven a highly non-trivial task. The key assumption behind many approaches is that translation is guided by the source and/or target language parse, employing rules extracted from the parse tree or performing tree transformations. These approaches enforce strict constraints and might overlook important translation phenomena that cross linguistic constituents. We propose a novel flexible modelling approach to introduce linguistic information of varying granularity from the source side. Our method induces joint probability synchronous grammars and estimates their parameters, by select- ing and weighing together linguistically motivated rules according to an objective function directly targeting generalisation over future data. We obtain statistically significant improvements across 4 different language pairs with English as source, mounting up to +1.92 BLEU for Chinese as target.
5 0.092528462 76 acl-2011-Comparative News Summarization Using Linear Programming
Author: Xiaojiang Huang ; Xiaojun Wan ; Jianguo Xiao
Abstract: Comparative News Summarization aims to highlight the commonalities and differences between two comparable news topics. In this study, we propose a novel approach to generating comparative news summaries. We formulate the task as an optimization problem of selecting proper sentences to maximize the comparativeness within the summary and the representativeness to both news topics. We consider semantic-related cross-topic concept pairs as comparative evidences, and consider topic-related concepts as representative evidences. The optimization problem is addressed by using a linear programming model. The experimental results demonstrate the effectiveness of our proposed model.
6 0.088093363 98 acl-2011-Discovery of Topically Coherent Sentences for Extractive Summarization
7 0.077641256 123 acl-2011-Exact Decoding of Syntactic Translation Models through Lagrangian Relaxation
8 0.071826927 18 acl-2011-A Latent Topic Extracting Method based on Events in a Document and its Application
9 0.070878863 204 acl-2011-Learning Word Vectors for Sentiment Analysis
10 0.070647016 3 acl-2011-A Bayesian Model for Unsupervised Semantic Parsing
11 0.069679402 103 acl-2011-Domain Adaptation by Constraining Inter-Domain Variability of Latent Feature Representation
12 0.068574846 29 acl-2011-A Word-Class Approach to Labeling PSCFG Rules for Machine Translation
13 0.066196106 334 acl-2011-Which Noun Phrases Denote Which Concepts?
14 0.065714955 295 acl-2011-Temporal Restricted Boltzmann Machines for Dependency Parsing
15 0.065562204 47 acl-2011-Automatic Assessment of Coverage Quality in Intelligence Reports
16 0.061882589 196 acl-2011-Large-Scale Cross-Document Coreference Using Distributed Inference and Hierarchical Models
17 0.059287243 73 acl-2011-Collective Classification of Congressional Floor-Debate Transcripts
18 0.059265181 277 acl-2011-Semi-supervised Relation Extraction with Large-scale Word Clustering
19 0.057643272 15 acl-2011-A Hierarchical Pitman-Yor Process HMM for Unsupervised Part of Speech Induction
20 0.05671351 279 acl-2011-Semi-supervised latent variable models for sentence-level sentiment analysis
topicId topicWeight
[(0, 0.151), (1, 0.034), (2, -0.022), (3, 0.015), (4, -0.008), (5, -0.04), (6, -0.066), (7, 0.058), (8, -0.032), (9, 0.019), (10, -0.018), (11, -0.023), (12, 0.031), (13, 0.045), (14, 0.024), (15, -0.021), (16, -0.053), (17, -0.044), (18, 0.007), (19, 0.034), (20, 0.016), (21, -0.016), (22, 0.028), (23, -0.002), (24, -0.066), (25, -0.042), (26, 0.012), (27, -0.045), (28, -0.005), (29, 0.04), (30, -0.052), (31, 0.023), (32, -0.047), (33, 0.04), (34, 0.008), (35, 0.036), (36, 0.044), (37, -0.022), (38, 0.025), (39, 0.059), (40, 0.097), (41, -0.048), (42, -0.011), (43, -0.083), (44, -0.067), (45, -0.094), (46, -0.105), (47, 0.02), (48, 0.05), (49, -0.063)]
simIndex simValue paperId paperTitle
same-paper 1 0.95629096 150 acl-2011-Hierarchical Text Classification with Latent Concepts
Author: Xipeng Qiu ; Xuanjing Huang ; Zhao Liu ; Jinlong Zhou
2 0.6143012 14 acl-2011-A Hierarchical Model of Web Summaries
Author: Yves Petinot ; Kathleen McKeown ; Kapil Thadani
Abstract: We investigate the relevance of hierarchical topic models to represent the content of Web gists. We focus our attention on DMOZ, a popular Web directory, and propose two algorithms to infer such a model from its manually-curated hierarchy of categories. Our first approach, based on information-theoretic grounds, uses an algorithm similar to recursive feature selection. Our second approach is fully Bayesian and derived from the more general model, hierarchical LDA. We evaluate the performance of both models against a flat 1-gram baseline and show improvements in terms of perplexity over held-out data.
3 0.57843614 76 acl-2011-Comparative News Summarization Using Linear Programming
Author: Xiaojiang Huang ; Xiaojun Wan ; Jianguo Xiao
Abstract: Comparative News Summarization aims to highlight the commonalities and differences between two comparable news topics. In this study, we propose a novel approach to generating comparative news summaries. We formulate the task as an optimization problem of selecting proper sentences to maximize the comparativeness within the summary and the representativeness to both news topics. We consider semantic-related cross-topic concept pairs as comparative evidences, and consider topic-related concepts as representative evidences. The optimization problem is addressed by using a linear programming model. The experimental results demonstrate the effectiveness of our proposed model.
4 0.57495975 295 acl-2011-Temporal Restricted Boltzmann Machines for Dependency Parsing
Author: Nikhil Garg ; James Henderson
Abstract: We propose a generative model based on Temporal Restricted Boltzmann Machines for transition based dependency parsing. The parse tree is built incrementally using a shiftreduce parse and an RBM is used to model each decision step. The RBM at the current time step induces latent features with the help of temporal connections to the relevant previous steps which provide context information. Our parser achieves labeled and unlabeled attachment scores of 88.72% and 91.65% respectively, which compare well with similar previous models and the state-of-the-art.
5 0.54676509 130 acl-2011-Extracting Comparative Entities and Predicates from Texts Using Comparative Type Classification
Author: Seon Yang ; Youngjoong Ko
Abstract: The automatic extraction of comparative information is an important text mining problem and an area of increasing interest. In this paper, we study how to build a Korean comparison mining system. Our work is composed of two consecutive tasks: 1) classifying comparative sentences into different types and 2) mining comparative entities and predicates. We perform various experiments to find relevant features and learning techniques. As a result, we achieve outstanding performance enough for practical use. 1
6 0.54591477 15 acl-2011-A Hierarchical Pitman-Yor Process HMM for Unsupervised Part of Speech Induction
7 0.54530483 342 acl-2011-full-for-print
8 0.53226149 47 acl-2011-Automatic Assessment of Coverage Quality in Intelligence Reports
9 0.53178853 103 acl-2011-Domain Adaptation by Constraining Inter-Domain Variability of Latent Feature Representation
10 0.5276829 120 acl-2011-Even the Abstract have Color: Consensus in Word-Colour Associations
11 0.52100742 78 acl-2011-Confidence-Weighted Learning of Factored Discriminative Language Models
12 0.50420409 97 acl-2011-Discovering Sociolinguistic Associations with Structured Sparsity
13 0.50309587 187 acl-2011-Jointly Learning to Extract and Compress
14 0.50078094 98 acl-2011-Discovery of Topically Coherent Sentences for Extractive Summarization
16 0.49564794 190 acl-2011-Knowledge-Based Weak Supervision for Information Extraction of Overlapping Relations
17 0.49024376 320 acl-2011-Unsupervised Discovery of Domain-Specific Knowledge from Text
18 0.48733422 123 acl-2011-Exact Decoding of Syntactic Translation Models through Lagrangian Relaxation
19 0.48342291 79 acl-2011-Confidence Driven Unsupervised Semantic Parsing
20 0.48071495 24 acl-2011-A Scalable Probabilistic Classifier for Language Modeling
topicId topicWeight
[(5, 0.016), (17, 0.043), (26, 0.039), (37, 0.141), (39, 0.053), (41, 0.038), (44, 0.301), (55, 0.09), (59, 0.018), (72, 0.041), (91, 0.029), (96, 0.112)]
simIndex simValue paperId paperTitle
same-paper 1 0.76037145 150 acl-2011-Hierarchical Text Classification with Latent Concepts
Author: Xipeng Qiu ; Xuanjing Huang ; Zhao Liu ; Jinlong Zhou
2 0.73030043 95 acl-2011-Detection of Agreement and Disagreement in Broadcast Conversations
Author: Wen Wang ; Sibel Yaman ; Kristin Precoda ; Colleen Richey ; Geoffrey Raymond
Abstract: We present Conditional Random Fields based approaches for detecting agreement/disagreement between speakers in English broadcast conversation shows. We develop annotation approaches for a variety of linguistic phenomena. Various lexical, structural, durational, and prosodic features are explored. We compare the performance when using features extracted from automatically generated annotations against that when using human annotations. We investigate the efficacy of adding prosodic features on top of lexical, structural, and durational features. Since the training data is highly imbalanced, we explore two sampling approaches, random downsampling and ensemble downsampling. Overall, our approach achieves 79.2% (precision), 50.5% (recall), 61.7% (F1) for agreement detection and 69.2% (precision), 46.9% (recall), and 55.9% (F1) for disagreement detection, on the English broadcast conversation data. 1 ?yIntroduction In ?ythis work, we present models for detecting agre?yement/disagreement (denoted (dis)agreement) betwy?een speakers in English broadcast conversation show?ys. The Broadcast Conversation (BC) genre differs from the Broadcast News (BN) genre in that it is?y more interactive and spontaneous, referring to freey? speech in news-style TV and radio programs and consisting of talk shows, interviews, call-in prog?yrams, live reports, and round-tables. Previous y? y?This work was performed while the author was at ICSI. syaman@us . ibm .com, graymond@ s oc .uc sb . edu work on detecting (dis)agreements has been focused on meeting data. (Hillard et al., 2003), (Galley et al., 2004), (Hahn et al., 2006) used spurt-level agreement annotations from the ICSI meeting corpus (Janin et al., 2003). (Hillard et al., 2003) explored unsupervised machine learning approaches and on manual transcripts, they achieved an overall 3-way agreement/disagreement classification ac- curacy as 82% with keyword features. 
(Galley et al., 2004) explored Bayesian Networks for the detection of (dis)agreements. They used adjacency pair information to determine the structure of their conditional Markov model and outperformed the results of (Hillard et al., 2003) by improving the 3way classification accuracy into 86.9%. (Hahn et al., 2006) explored semi-supervised learning algorithms and reached a competitive performance of 86.7% 3-way classification accuracy on manual transcriptions with only lexical features. (Germesin and Wilson, 2009) investigated supervised machine learning techniques and yields competitive results on the annotated data from the AMI meeting corpus (McCowan et al., 2005). Our work differs from these previous studies in two major categories. One is that a different definition of (dis)agreement was used. In the current work, a (dis)agreement occurs when a responding speaker agrees with, accepts, or disagrees with or rejects, a statement or proposition by a first speaker. Second, we explored (dis)agreement detection in broadcast conversation. Due to the difference in publicity and intimacy/collegiality between speakers in broadcast conversations vs. meet- ings, (dis)agreement may have different character374 Proceedings ofP thoer t4l9atnhd A, Onrnuegaoln M,e Jeuntineg 19 o-f2 t4h,e 2 A0s1s1o.c?i ac t2io0n11 fo Ar Cssoocmiaptuiotanti foonra Clo Lminpguutiast i ocns:aslh Loirntpgaupisetrics , pages 374–378, istics. Different from the unsupervised approaches in (Hillard et al., 2003) and semi-supervised approaches in (Hahn et al., 2006), we conducted supervised training. Also, different from (Hillard et al., 2003) and (Galley et al., 2004), our classification was carried out on the utterance level, instead of on the spurt-level. Galley et al. 
extended Hillard et al.’s work by adding features from previous spurts and features from the general dialog context to infer the class of the current spurt, on top of features from the current spurt (local features) used by Hillard et al. Galley et al. used adjacency pairs to describe the interaction between speakers and the relations between consecutive spurts. In this preliminary study on broadcast conversation, we directly modeled (dis)agreement detection without using adjacency pairs. Still, within the conditional random fields (CRF) framework, we explored features from preceding and following utterances to consider context in the discourse structure. We explored a wide variety of features, including lexical, structural, du- rational, and prosodic features. To our knowledge, this is the first work to systematically investigate detection of agreement/disagreement for broadcast conversation data. The remainder of the paper is organized as follows. Section 2 presents our data and automatic annotation modules. Section 3 describes various features and the CRF model we explored. Experimental results and discussion appear in Section 4, as well as conclusions and future directions. 2 Data and Automatic Annotation In this work, we selected English broadcast conversation data from the DARPA GALE program collected data (GALE Phase 1 Release 4, LDC2006E91; GALE Phase 4 Release 2, LDC2009E15). Human transcriptions and manual speaker turn labels are used in this study. Also, since the (dis)agreement detection output will be used to analyze social roles and relations of an interacting group, we first manually marked soundbites and then excluded soundbites during annotation and modeling. We recruited annotators to provide manual annotations of speaker roles and (dis)agreement to use for the supervised training of models. We de- fined a set of speaker roles as follows. Host/chair is a person associated with running the discussions 375 or calling the meeting. 
Reporting participant is a person reporting from the field, from a subcommittee, etc. Commentator participant/Topic participant is a person providing commentary on some subject, or a person who is the subject of the conversation and plays a role, e.g., as a newsmaker. Audience participant is an ordinary person who may call in, ask questions at a microphone at, e.g., a large presentation, or be interviewed because of their presence at a news event. Other is any speaker who does not fit one of the above categories, such as a voice talent, an announcer doing show openings or commercial breaks, or a translator. Agreements and disagreements are composed of different combinations of initiating utterances and responses. We reformulated the (dis)agreement detection task as sequence tagging with 11 (dis)agreement-related labels, identifying whether a given utterance in the show initiates a (dis)agreement opportunity, is a (dis)agreement response to such an opportunity, or is neither. For example, a negative tag question followed by a negation response forms an agreement, that is, A: [Negative tag] This is not black and white, is it? B: [Agreeing Response] No, it isn't. The data sparsity problem is serious: among all 27,071 utterances, only 2,589 (about 10%) are involved in (dis)agreement as initiating or response utterances, while 24,482 are not. These annotators also labeled shows with a variety of linguistic phenomena (denoted language use constituents, LUC), including discourse markers, disfluencies, person addresses and person mentions, prefaces, extreme case formulations, and dialog act tags (DAT). We categorized dialog acts into statement, question, backchannel, and incomplete. We classified disfluencies (DF) into filled pauses (e.g., uh, um), repetitions, corrections, and false starts. Person address (PA) terms are terms that a speaker uses to address another person. 
Person mentions (PM) are references to non-participants in the conversation. Discourse markers (DM) are words or phrases related to the structure of the discourse that express a relation between two utterances, for example, I mean, you know. Prefaces (PR) are sentence-initial lexical tokens serving functions close to discourse markers (e.g., Well, I think that...). Extreme case formulations (ECF) are lexical patterns emphasizing extremeness (e.g., This is the best book I have ever read). In the end, we manually annotated 49 English shows. We preprocessed English manual transcripts by removing transcriber annotation markers and noise, removing punctuation and case information, and conducting text normalization. We also built automatic rule-based and statistical annotation tools for these LUCs. 3 Features and Model We explored lexical, structural, durational, and prosodic features for (dis)agreement detection. We included a set of "lexical" features, including n-grams extracted from all of a speaker's utterances, denoted ngram features. Other lexical features include the presence of negation and acquiescence, yes/no equivalents, positive and negative tag questions, and other features distinguishing different types of initiating utterances and responses. We also included various lexical features extracted from LUC annotations, denoted LUC features. These additional features include features related to the presence of prefaces; the counts of types and tokens of discourse markers, extreme case formulations, disfluencies, person addressing events, and person mentions; and the values of these counts normalized by sentence length. We also included a set of features related to the DAT of the current utterance and of the preceding and following utterances. We developed a set of "structural" and "durational" features, inspired by conversation analysis, to quantitatively represent the different participation and interaction patterns of speakers in a show. 
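As a concrete illustration of how such lexical features might be assembled, here is a minimal sketch of n-gram indicator features plus LUC counts normalized by utterance length. The function names and the `(type, span)` format for LUC annotations are hypothetical conveniences, not the paper's actual feature extraction code.

```python
from collections import Counter

def ngram_features(tokens, n_max=2):
    """Bag of n-gram indicator features (n = 1..n_max) for one utterance."""
    feats = {}
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            feats["ng=" + "_".join(tokens[i:i + n])] = 1
    return feats

def luc_count_features(tokens, luc_spans):
    """Raw counts of LUC events (discourse markers, disfluencies, ...),
    plus the same counts normalized by utterance length."""
    counts = Counter(luc_type for luc_type, _ in luc_spans)
    feats = {}
    for luc_type, count in counts.items():
        feats["luc_" + luc_type] = count
        feats["luc_" + luc_type + "_norm"] = count / max(len(tokens), 1)
    return feats

utt = "well i mean this is not black and white is it".split()
lucs = [("preface", (0, 1)), ("discourse_marker", (1, 3))]  # hypothetical spans
feats = {**ngram_features(utt), **luc_count_features(utt, lucs)}
```

In a real system, one dictionary like `feats` per utterance would be fed to the CRF trainer described in the next section.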
We extracted features related to pausing and overlaps between consecutive turns, the absolute and relative duration of consecutive turns, and so on. We used a set of prosodic features including pause, duration, and the speech rate of a speaker, as well as the pitch and energy of the voice. Prosodic features were computed on words and on the phonetic alignment of manual transcripts; features are computed for the beginning and ending words of an utterance. For the duration features, we used the average and maximum vowel duration from forced alignment, both unnormalized and normalized for vowel identity and phone context. For pitch and energy, we calculated the minimum, maximum, range, mean, standard deviation, skewness, and kurtosis values. A decision tree model was used to compute posteriors from prosodic features, and we used cumulative binning of posteriors as final features, similar to (Liu et al., 2006). As illustrated in Section 2, we reformulated the (dis)agreement detection task as a sequence tagging problem. We used the Mallet package (McCallum, 2002) to implement the linear-chain CRF model for sequence tagging. A CRF is an undirected graphical model that defines a global log-linear distribution of the state (or label) sequence E conditioned on an observation sequence, in our case the sequence of sentences S and the corresponding sequence of features F for this sequence of sentences. The model is optimized globally over the entire sequence: the CRF model is trained to maximize the conditional log-likelihood P(E|S, F) over a given training set, and during testing, the most likely sequence E is found using the Viterbi algorithm. One of the motivations for choosing conditional random fields was to avoid the label-bias problem found in hidden Markov models. 
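The Viterbi decoding step used at test time can be sketched as a generic dynamic program over log-scores. This is not the Mallet internals; the emission and transition matrices below are hypothetical toy values for a two-label problem.

```python
def viterbi(emission, transition):
    """Most likely label sequence under a linear-chain model.

    emission[t][y]  : log-score for label y at position t
    transition[p][y]: log-score for moving from label p to label y
    """
    n_labels = len(emission[0])
    score = list(emission[0])   # best log-score of any path ending in y at t=0
    backptr = []
    for t in range(1, len(emission)):
        new_score, ptr = [], []
        for y in range(n_labels):
            best = max(range(n_labels), key=lambda p: score[p] + transition[p][y])
            ptr.append(best)
            new_score.append(score[best] + transition[best][y] + emission[t][y])
        backptr.append(ptr)
        score = new_score
    # backtrace from the best final label
    y = max(range(n_labels), key=lambda yy: score[yy])
    path = [y]
    for ptr in reversed(backptr):
        y = ptr[y]
        path.append(y)
    return path[::-1]

# toy example: "sticky" transitions, emissions favoring labels 0, 1, 0 in turn
emission = [[2.0, 0.0], [0.0, 3.0], [2.0, 0.0]]
transition = [[1.0, 0.0], [0.0, 1.0]]
best_path = viterbi(emission, transition)  # [0, 1, 0]
```

The same recurrence applies regardless of label set size, so a real system would simply use larger matrices for the 11 (dis)agreement labels.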
Compared to maximum entropy modeling, the CRF model is optimized globally over the entire sequence, whereas the ME model makes a decision at each point individually, without considering the context event information. 4 Experiments All (dis)agreement detection results are based on n-fold cross-validation. In this procedure, we held out one show as the test set, randomly held out another show as the dev set, trained models on the rest of the data, and tested the model on the held-out show. We iterated through all shows and computed the overall accuracy. Table 1 shows the results of (dis)agreement detection using all features except prosodic features. We compared two conditions: (1) features extracted completely from the automatic LUC annotations and automatically detected speaker roles, and (2) features from manual speaker role labels and manual LUC annotations when manual annotations are available. Table 1 shows that running a fully automatic system to generate automatic annotations and automatic speaker roles produced performance comparable to the system using features from manual annotations whenever available. Table 1: Precision (%), recall (%), and F1 (%) of (dis)agreement detection using features extracted from manual speaker role labels and manual LUC annotations when available, denoted Manual Annotation, and automatic LUC annotations and automatically detected speaker roles, denoted Automatic Annotation. [Rows: Manual Annotation, Automatic Annotation; columns: P, R, F1 for agreement and for disagreement; numeric cells garbled in extraction.] We then focused on the condition of using features from manual annotations when available and added prosodic features as described in Section 3. The results are shown in Table 2. Adding prosodic features produced a 0.7% absolute gain in F1 for agreement detection, and a 1.5% absolute gain in F1 for disagreement detection. 
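The cross-validation protocol and the F1 metric used throughout can be sketched as follows. The show identifiers are hypothetical placeholders; the sanity check at the end uses the final agreement and disagreement scores reported later in this paper.

```python
import random

def leave_one_show_out(shows, seed=0):
    """Each show serves as the test set once; another randomly chosen
    show is held out as the dev set; the rest form the training set."""
    rng = random.Random(seed)
    for test in shows:
        rest = [s for s in shows if s != test]
        dev = rng.choice(rest)
        train = [s for s in rest if s != dev]
        yield train, dev, test

def f1(precision, recall):
    """Harmonic mean of precision and recall (both in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

splits = list(leave_one_show_out(["show%d" % i for i in range(5)]))
# sanity check against the paper's final scores:
# agreement P=79.2, R=50.5 -> F1 = 61.7; disagreement P=69.2, R=46.9 -> F1 = 55.9
agreement_f1 = round(f1(79.2, 50.5), 1)
disagreement_f1 = round(f1(69.2, 46.9), 1)
```

The high-precision/low-recall pattern discussed below is visible directly in these numbers: the harmonic mean is pulled toward the much lower recall.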
Table 2: Precision (%), recall (%), and F1 (%) of (dis)agreement detection using manual annotations without and with prosodic features. [Rows: without prosodic, with prosodic; columns: P, R, F1 for agreement and for disagreement; numeric cells garbled in extraction.] Note that only about 10% of utterances among all data are involved in (dis)agreement. This makes for a highly imbalanced data set, with one class much more heavily represented than the others. We suspected that this high imbalance played a major role in the high-precision, low-recall results we obtained so far. Various approaches have been studied to handle imbalanced data for classification, trying to balance the class distribution in the training set by either oversampling the minority class or downsampling the majority class. In this preliminary study of sampling approaches for handling imbalanced data in CRF training, we investigated two approaches: random downsampling and ensemble downsampling. Random downsampling randomly downsamples the majority class to equate the number of minority and majority class samples. Ensemble downsampling is a refinement of random downsampling which does not discard any majority class samples. Instead, we partitioned the majority class samples into N subspaces, with each subspace containing the same number of samples as the minority class. We then train N CRF models, each based on the minority class samples and one disjoint partition from the N subspaces. During testing, the posterior probability for one utterance is averaged over the N CRF models. The results from these two sampling approaches, as well as the baseline, are shown in Table 3. Both sampling approaches achieved significant improvement over the baseline, i.e., training on the original data set, and ensemble downsampling produced better performance than random downsampling. 
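The ensemble downsampling scheme just described (partition the majority class into N minority-sized subsets, train N models, average posteriors at test time) can be sketched as follows. The data are hypothetical placeholders, and this simple version drops leftover majority samples when the sizes do not divide evenly, which the paper's setup may handle differently.

```python
import random

def ensemble_downsample(majority, minority, seed=0):
    """Split the shuffled majority class into N disjoint partitions, each
    the size of the minority class; pair each partition with the full
    minority set to form the training data for one of N models.
    Leftover majority samples are dropped in this simplified sketch."""
    rng = random.Random(seed)
    maj = list(majority)
    rng.shuffle(maj)
    k = len(minority)
    n_models = len(maj) // k
    return [maj[i * k:(i + 1) * k] + list(minority) for i in range(n_models)]

def averaged_posterior(per_model_posteriors):
    """At test time, average the class posteriors from the N models
    for a single utterance."""
    n = len(per_model_posteriors)
    return [sum(col) / n for col in zip(*per_model_posteriors)]

# 9 majority samples and 3 minority samples -> N = 3 models of 6 samples each
train_sets = ensemble_downsample(list(range(9)), ["a", "b", "c"])
avg = averaged_posterior([[0.2, 0.8], [0.4, 0.6], [0.6, 0.4]])
```

Unlike random downsampling, every majority sample appears in exactly one of the N training sets, so no training data is discarded overall.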
We noticed that both sampling approaches degraded slightly in precision but improved significantly in recall, resulting in a 4.5% absolute gain in F1 for agreement detection and a 4.7% absolute gain in F1 for disagreement detection. Table 3: Precision (%), recall (%), and F1 (%) of (dis)agreement detection without sampling, with random downsampling, and with ensemble downsampling. Manual annotations and prosodic features are used. [Rows: baseline, random downsampling, ensemble downsampling; columns: P, R, F1 for agreement and for disagreement; numeric cells garbled in extraction.] In conclusion, this paper presents our work on detection of agreements and disagreements in English broadcast conversation data. We explored a variety of features, including lexical, structural, durational, and prosodic features. We tested these features using a linear-chain conditional random fields model and conducted supervised training. We observed significant improvement from adding prosodic features and from employing two sampling approaches, random downsampling and ensemble downsampling. Overall, we achieved 79.2% precision, 50.5% recall, and 61.7% F1 for agreement detection, and 69.2% precision, 46.9% recall, and 55.9% F1 for disagreement detection, on English broadcast conversation data. In future work, we plan to continue adding and refining features, explore dependencies between features and contextual cues with respect to agreements and disagreements, and investigate the efficacy of other machine learning approaches such as Bayesian networks and support vector machines. Acknowledgments The authors thank Gokhan Tur and Dilek Hakkani-Tür for valuable insights and suggestions. This work has been supported by the Intelligence Advanced Research Projects Activity (IARPA) via Army Research Laboratory (ARL) contract number W911NF-09-C-0089. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. 
The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, ARL, or the U.S. Government. References M. Galley, K. McKeown, J. Hirschberg, and E. Shriberg. 2004. Identifying agreement and disagreement in conversational speech: Use of Bayesian networks to model pragmatic dependencies. In Proceedings of ACL. S. Germesin and T. Wilson. 2009. Agreement detection in multiparty conversation. In Proceedings of International Conference on Multimodal Interfaces. S. Hahn, R. Ladner, and M. Ostendorf. 2006. Agreement/disagreement classification: Exploiting unlabeled data using constraint classifiers. In Proceedings of HLT/NAACL. D. Hillard, M. Ostendorf, and E. Shriberg. 2003. Detection of agreement vs. disagreement in meetings: Training with unlabeled data. In Proceedings of HLT/NAACL. A. Janin, D. Baron, J. Edwards, D. Ellis, D. Gelbart, N. Morgan, B. Peskin, T. Pfau, E. Shriberg, A. Stolcke, and C. Wooters. 2003. The ICSI Meeting Corpus. In Proceedings of ICASSP, Hong Kong, April. Yang Liu, Elizabeth Shriberg, Andreas Stolcke, Dustin Hillard, Mari Ostendorf, and Mary Harper. 2006. Enriching speech recognition with automatic detection of sentence boundaries and disfluencies. IEEE Transactions on Audio, Speech, and Language Processing, 14(5):1526–1540, September. Special Issue on Progress in Rich Transcription. Andrew McCallum. 2002. MALLET: A machine learning for language toolkit. http://mallet.cs.umass.edu. I. McCowan, J. Carletta, W. Kraaij, S. Ashby, S. Bourban, M. Flynn, M. Guillemot, T. Hain, J. Kadlec, V. Karaiskos, M. Kronenthal, G. Lathoud, M. Lincoln, A. Lisowska, W. Post, D. Reidsma, and P. Wellner. 2005. The AMI meeting corpus. In Proceedings of Measuring Behavior 2005, the 5th International Conference on Methods and Techniques in Behavioral Research.
3 0.70025373 135 acl-2011-Faster and Smaller N-Gram Language Models
Author: Adam Pauls ; Dan Klein
Abstract: N-gram language models are a major resource bottleneck in machine translation. In this paper, we present several language model implementations that are both highly compact and fast to query. Our fastest implementation is as fast as the widely used SRILM while requiring only 25% of the storage. Our most compact representation can store all 4 billion n-grams and associated counts for the Google n-gram corpus in 23 bits per n-gram, the most compact lossless representation to date, and even more compact than recent lossy compression techniques. We also discuss techniques for improving query speed during decoding, including a simple but novel language model caching technique that improves the query speed of our language models (and SRILM) by up to 300%.
4 0.67847687 295 acl-2011-Temporal Restricted Boltzmann Machines for Dependency Parsing
Author: Nikhil Garg ; James Henderson
Abstract: We propose a generative model based on Temporal Restricted Boltzmann Machines for transition based dependency parsing. The parse tree is built incrementally using a shiftreduce parse and an RBM is used to model each decision step. The RBM at the current time step induces latent features with the help of temporal connections to the relevant previous steps which provide context information. Our parser achieves labeled and unlabeled attachment scores of 88.72% and 91.65% respectively, which compare well with similar previous models and the state-of-the-art.
5 0.64012384 12 acl-2011-A Generative Entity-Mention Model for Linking Entities with Knowledge Base
Author: Xianpei Han ; Le Sun
Abstract: Linking entities with knowledge base (entity linking) is a key issue in bridging the textual data with the structural knowledge base. Due to the name variation problem and the name ambiguity problem, the entity linking decisions are critically depending on the heterogenous knowledge of entities. In this paper, we propose a generative probabilistic model, called entitymention model, which can leverage heterogenous entity knowledge (including popularity knowledge, name knowledge and context knowledge) for the entity linking task. In our model, each name mention to be linked is modeled as a sample generated through a three-step generative story, and the entity knowledge is encoded in the distribution of entities in document P(e), the distribution of possible names of a specific entity P(s|e), and the distribution of possible contexts of a specific entity P(c|e). To find the referent entity of a name mention, our method combines the evidences from all the three distributions P(e), P(s|e) and P(c|e). Experimental results show that our method can significantly outperform the traditional methods.
6 0.59467387 265 acl-2011-Reordering Modeling using Weighted Alignment Matrices
7 0.56744832 237 acl-2011-Ordering Prenominal Modifiers with a Reranking Approach
8 0.56020409 85 acl-2011-Coreference Resolution with World Knowledge
9 0.55290055 92 acl-2011-Data point selection for cross-language adaptation of dependency parsers
10 0.55282748 186 acl-2011-Joint Training of Dependency Parsing Filters through Latent Support Vector Machines
11 0.5489099 275 acl-2011-Semi-Supervised Modeling for Prenominal Modifier Ordering
12 0.54860628 256 acl-2011-Query Weighting for Ranking Model Adaptation
13 0.54809976 78 acl-2011-Confidence-Weighted Learning of Factored Discriminative Language Models
14 0.54692304 331 acl-2011-Using Large Monolingual and Bilingual Corpora to Improve Coordination Disambiguation
15 0.5468359 183 acl-2011-Joint Bilingual Sentiment Classification with Unlabeled Parallel Corpora
17 0.54235661 246 acl-2011-Piggyback: Using Search Engines for Robust Cross-Domain Named Entity Recognition
18 0.54223967 24 acl-2011-A Scalable Probabilistic Classifier for Language Modeling
19 0.54213369 292 acl-2011-Target-dependent Twitter Sentiment Classification
20 0.54163361 289 acl-2011-Subjectivity and Sentiment Analysis of Modern Standard Arabic