acl acl2011 acl2011-76 knowledge-graph by maker-knowledge-mining

76 acl-2011-Comparative News Summarization Using Linear Programming


Source: pdf

Author: Xiaojiang Huang ; Xiaojun Wan ; Jianguo Xiao

Abstract: Comparative News Summarization aims to highlight the commonalities and differences between two comparable news topics. In this study, we propose a novel approach to generating comparative news summaries. We formulate the task as an optimization problem of selecting proper sentences to maximize the comparativeness within the summary and the representativeness to both news topics. We consider semantic-related cross-topic concept pairs as comparative evidences, and consider topic-related concepts as representative evidences. The optimization problem is addressed by using a linear programming model. The experimental results demonstrate the effectiveness of our proposed model.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 In this study, we propose a novel approach to generating comparative news summaries. [sent-2, score-0.645]

2 We formulate the task as an optimization problem of selecting proper sentences to maximize the comparativeness within the summary and the representativeness to both news topics. [sent-3, score-0.795]

3 We consider semantic-related cross-topic concept pairs as comparative evidences, and consider topic-related concepts as representative evidences. [sent-4, score-0.91]

4 The optimization problem is addressed by using a linear programming model. [sent-5, score-0.103]

5 1 Introduction Comparative News Summarization aims to highlight the commonalities and differences between two comparable news topics. [sent-7, score-0.389]

6 For example, by comparing the information about mining accidents in Chile and China, we can discover what leads to the different endings and how to avoid those tragedies. [sent-9, score-0.056]

7 The proposed works differ in the domain of corpus, the source of comparison and the representation of results. [sent-11, score-0.053]

8 , 2005; Jindal and Liu, 2006a; [sent-13, score-0.034]

9 A reason is that the aspects in reviews are easy to extract and the comparisons have simple patterns, e. [sent-17, score-0.197]

10 A few other works have also tried to compare facts and views in news articles (Zhai et al. [sent-21, score-0.143]

11 The comparative information can be extracted from explicit comparative sentences (Jindal and Liu, 2006a; Jindal and Liu, 2006b; Huang et al. [sent-24, score-1.047]

12 , 2008), or mined implicitly by matching up features of objects in the same aspects (Zhai et al. [sent-25, score-0.142]

13 The comparisons can be represented by charts (Liu et al. [sent-29, score-0.095]

14 , 2006), and summaries which consist of pairs of sentences or text sections (Kim and Zhai, 2009; Lerman and McDonald, 2009; Wang et al. [sent-32, score-0.113]

15 Among these forms, the comparative summary conveys rich information with good readability, so it keeps attracting interest in the research community. [sent-34, score-0.788]

16 In general, document summarization can be performed by extraction or abstraction (Mani, 2001). [sent-35, score-0.234]

17 Due to the difficulty of natural sentence generation, most automatic summarization systems are extraction-based. [sent-36, score-0.195]

18 They select salient sentences to maximize the objective functions of generated summaries (Carbonell and Goldstein, 1998; McDonald, 2007; Lerman and McDonald, 2009; Kim and Zhai, 2009; Gillick et al. [sent-37, score-0.192]

19 The major difference between the traditional summarization task and the comparative summarization task is that the traditional summarization task places equal emphasis on all kinds of information in [sent-39, score-1.087]

20 the source, while the comparative summarization task only focuses on the comparisons between objects. [sent-41, score-0.792]

21 However, it is more difficult to extract comparisons in news articles than in reviews. [sent-43, score-0.275]

22 These aspects can be expressed explicitly or implicitly in many ways. [sent-46, score-0.142]

23 All these issues raise great challenges to comparative summarization in the news domain. [sent-48, score-0.84]

24 In this study, we propose a novel approach for comparative news summarization. [sent-49, score-0.645]

25 We consider comparativeness and representativeness as well as redundancy in an objective function, and solve the optimization problem by using linear programming to extract proper comparable sentences. [sent-50, score-0.445]

26 More specifically, we consider a pair of sentences comparative if they share comparative concepts; we also consider a sentence representative if it contains important concepts about the topic. [sent-51, score-1.304]

27 Thus a good comparative summary contains important comparative pairs, as well as important concepts about individual topics. [sent-52, score-1.455]

28 Experimental results demonstrate the effectiveness of our model, which outperforms the baseline systems in quality of comparison identification and summarization. [sent-53, score-0.053]

29 1 Comparison A comparison identifies the commonalities or differences among objects. [sent-55, score-0.222]

30 It basically consists of four components: the comparee (i.e. the entity being compared), [sent-56, score-0.126]

31 the scale on which the comparee and standard are measured), and the result (i.e. [sent-62, score-0.126]

32 the predicate that describes the positions of the comparee and standard). [sent-64, score-0.126]

33 For example, “Chile is richer than Haiti” is a typical comparison, where the comparee is “Chile”; the standard is “Haiti”; the comparative aspect is wealth, which is implied by “richer”; and the result is that Chile is superior to Haiti. [sent-66, score-0.67]

34 A comparison can be expressed explicitly in a comparative sentence, or be described implicitly in a section of text which describes the individual characteristics of each object point-by-point. [sent-67, score-0.595]

35 2 Comparative News Summarization The task of comparative news summarization is to briefly sum up the commonalities and differences between two comparable news topics by using human readable sentences. [sent-72, score-1.285]

36 The summarization system is given two collections of news articles, each of which is related to a topic. [sent-73, score-0.338]

37 The system should find latent comparative aspects, and generate descriptions of those aspects in a pairwise way, i.e., [sent-74, score-0.694]

38 including descriptions of two topics simultaneously in each aspect. [sent-76, score-0.146]

39 For example, when comparing the earthquake in Haiti with the one in Chile, the summary should contain the intensity of each temblor, the damages in each disaster area, the reactions of each government, etc. [sent-77, score-0.241]

40 Formally, let t1 and t2 be two comparable news topics, and D1 and D2 be two collections of articles about each topic respectively. [sent-78, score-0.284]

41 The task of comparative summarization is to generate a short abstract which conveys the important comparisons {< t1, t2, r1i, r2i >}, where r1i and r2i are descriptions about topics t1 and t2 in the same latent aspect ai respectively. [sent-79, score-1.025]

42 The summary can be considered as a combination of two components, each of which is related to a news topic. [sent-80, score-0.384]

43 The coverage of comparisons should be as wide as possible, which means the aspects should not be redundant because of the length limit. [sent-85, score-0.197]

44 3 Proposed Approach It is natural to select the explicit comparative sentences as the comparative summary, because they express comparisons explicitly and with good quality. [sent-86, score-1.1]

45 However, they do not appear frequently in regular news articles, so their coverage is limited. [sent-87, score-0.18]

46 Instead, it is more feasible to extract individual descriptions of each topic over the same aspects and then generate comparisons. [sent-88, score-0.219]

47 To discover latent comparative aspects, we consider a sentence as a bag of concepts, each of which has an atomic meaning. [sent-89, score-0.531]

48 If two sentences have the same concepts in common, they are likely to discuss the same aspect and thus they may be comparable with each other. [sent-90, score-0.343]
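
As a minimal illustration of this idea, the check below treats two sentences as candidate comparables when their concept bags intersect; the function name and the set-based representation are our own sketch, not code from the paper.

```python
def comparable(concepts_a, concepts_b):
    """Two sentences are candidate comparables if they share at least
    one concept, and thus likely discuss the same aspect."""
    return bool(set(concepts_a) & set(concepts_b))

# e.g. both sentences contain the "FIFA World Player of the Year" concept
print(comparable({"fifa world player", "kaka"},
                 {"fifa world player", "messi"}))  # True
```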

49 The two sentences compare on the “FIFA World Player of the Year”, which is contained in both sentences. [sent-93, score-0.043]

50 For example, “snow” and “sunny” can indicate a comparison on “weather”; “alive” and “death” can imply a comparison on “rescue result”. [sent-95, score-0.106]

51 Thus the pairs of semantically related concepts can be considered as evidences of comparisons. [sent-96, score-0.269]

52 A comparative summary should contain as many comparative evidences as possible. [sent-97, score-1.304]

53 Since we model the text with a collection of concept units, the summary should contain as many important concepts as possible. [sent-99, score-0.602]

54 An important concept is likely to be mentioned frequently in the documents, and thus we use the frequency as a measure of a concept’s importance. [sent-100, score-0.151]

55 Obviously, the more accurate the extracted concepts are, the better we can represent the meaning of a text. [sent-101, score-0.21]

56 However, it is not easy to extract semantic concepts accurately. [sent-102, score-0.21]

57 In this study, we use words, named entities and bigrams to simply represent concepts, and leave the more complex concept extraction for future work. [sent-103, score-0.151]
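
To make the concept representation concrete, here is a minimal Python sketch that builds the bag of concepts for a sentence from unigrams and bigrams; named-entity concepts are passed in from an external tagger, since the excerpt does not specify the paper's NER component.

```python
from collections import Counter

def concept_bag(tokens, named_entities=()):
    """Bag-of-concepts representation of a sentence: words, bigrams,
    and (if supplied by an external tagger) named entities."""
    bigrams = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    return Counter(list(tokens) + bigrams + list(named_entities))

print(concept_bag("the mine rescue in chile ended".split(),
                  named_entities=["chile"]))
```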

58 Based on the above ideas, we can formulate the summarization task as an optimization problem. [sent-104, score-0.294]

59 Formally, let Ci = {cij} be the set of concepts in the document set Di (i = 1, 2). [sent-105, score-0.249]

60 Each concept cij has a weight wij ∈ R. ocij ∈ {0, 1} is a binary variable indicating whether [sent-107, score-0.156]

61 the concept cij is presented in the summary. [sent-108, score-0.375]

62 A cross-topic concept pair < c1j , c2k > has a weight ujk ∈ R that indicates whether it implies an important comparison. [sent-109, score-0.245]

63 opjk is a binary variable indicating whether the pair is presented in the summary. [sent-110, score-0.248]

64 The weights of concepts are calculated as follows: wij = tfij · idfij (2) where tfij is the term frequency of the concept cij in the document set Di, and idfij is the inverse document frequency calculated over a background corpus. [sent-116, score-0.937]
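
A small sketch of Eq. (2); the additive smoothing inside the idf term is our own choice, since the extracted text does not give the exact idf formula.

```python
import math
from collections import Counter

def concept_weights(topic_concepts: Counter, background_df: dict, n_docs: int):
    """w_ij = tf_ij * idf_ij (Eq. 2). tf is counted over one topic's
    document set; idf comes from a background corpus of n_docs documents."""
    return {c: tf * math.log((n_docs + 1) / (background_df.get(c, 0) + 1))
            for c, tf in topic_concepts.items()}
```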

65 The weights of concept pairs are calculated as follows: ujk = (w1j + w2k)/2 if rel(c1j, c2k) > τ, and 0 otherwise (3), where rel(c1j, c2k) is the semantic relevance between two concepts, and it is calculated using algorithms based on WordNet (Pedersen et al. [sent-117, score-0.209]

66 If the relevance exceeds the threshold τ (set to 0.2 in this study), then the concept pair is considered as evidence of comparison. [sent-120, score-0.151]
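
The pair weighting of Eq. (3) might look as follows; Wu-Palmer similarity from NLTK's WordNet interface stands in for the Pedersen et al. relatedness measures, whose exact variant the excerpt does not name.

```python
from nltk.corpus import wordnet as wn

TAU = 0.2  # relevance threshold used in the paper

def relevance(c1, c2):
    """WordNet-based semantic relevance between two concepts."""
    syns1, syns2 = wn.synsets(c1), wn.synsets(c2)
    if not syns1 or not syns2:
        return 0.0
    return max((s1.wup_similarity(s2) or 0.0) for s1 in syns1 for s2 in syns2)

def pair_weight(w1j, w2k, c1j, c2k):
    """u_jk = (w_1j + w_2k) / 2 if rel(c_1j, c_2k) > tau, else 0 (Eq. 3)."""
    return (w1j + w2k) / 2 if relevance(c1j, c2k) > TAU else 0.0
```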

67 Note that a concept pair will not be presented in the summary unless both the concepts are presented, i.e., [sent-121, score-0.602]

68 opjk ≤ oc1j (4) opjk ≤ oc2k (5) In order to avoid bias towards the concepts which have more related concepts, we only count the most important relation of each concept, i.e., [sent-123, score-0.586]

69 ∑k opjk ≤ 1, ∀j (6) ∑j opjk ≤ 1, ∀k (7) The algorithm selects proper sentences to maximize the objective function. [sent-125, score-0.122]

70 Formally, let Si = {sik} be the set of sentences in Di, ocsijk be a binary variable indicating whether concept cij occurs in sentence sik, and osik be a binary variable indicating whether sik is presented in the summary. [sent-126, score-0.739]

71 If sik is selected in the summary, then all the concepts in it are presented in the summary, i.e., [sent-127, score-0.348]

72 ocij ≥ ocsijk · osik, ∀1 ≤ j ≤ |Ci| (8) Meanwhile, a concept will not be present in the summary unless it is contained in some selected sentences, i.e., [sent-129, score-0.612]

73 ocij ≤ ∑k=1..|Si| ocsijk · osik (9) Finally, the summary should satisfy a length constraint: ∑i=1..2 ∑k=1..|Si| lik · osik ≤ L (10) where lik is the length of sentence sik, and L is the maximal summary length. [sent-131, score-0.663]

74 The optimization of the defined objective function under the above constraints is an integer linear programming (ILP) problem. [sent-132, score-0.141]
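
A compact sketch of the whole ILP in PuLP follows (the paper solves it with IBM ILOG CPLEX; CBC, PuLP's default solver, is a free stand-in). The exact objective (Eq. 1) is not reproduced in this excerpt, so we assume the natural reading, a weighted sum of selected concept weights and selected pair weights; variable names mirror the notation above.

```python
from pulp import LpProblem, LpMaximize, LpVariable, LpBinary, lpSum

def comparative_summary_ilp(w, u, occ, lengths, L):
    """w[i][j]: concept weights per topic i in (0, 1); u[(j, k)]: pair weights;
    occ[i][k]: set of concept indices j occurring in sentence s_ik;
    lengths[i][k]: sentence lengths; L: summary length limit."""
    prob = LpProblem("comparative_summary", LpMaximize)
    os_ = {(i, k): LpVariable(f"os_{i}_{k}", cat=LpBinary)
           for i in (0, 1) for k in range(len(occ[i]))}
    oc = {(i, j): LpVariable(f"oc_{i}_{j}", cat=LpBinary)
          for i in (0, 1) for j in range(len(w[i]))}
    op = {jk: LpVariable(f"op_{jk[0]}_{jk[1]}", cat=LpBinary) for jk in u}

    # Assumed objective (Eq. 1): representativeness + comparativeness
    prob += lpSum(w[i][j] * oc[i, j] for (i, j) in oc) \
          + lpSum(u[jk] * op[jk] for jk in u)

    for (j, k), pair in op.items():
        prob += pair <= oc[0, j]                        # (4)
        prob += pair <= oc[1, k]                        # (5)
    for j in range(len(w[0])):                          # (6)
        prob += lpSum(op[jk] for jk in op if jk[0] == j) <= 1
    for k in range(len(w[1])):                          # (7)
        prob += lpSum(op[jk] for jk in op if jk[1] == k) <= 1
    for (i, k), sent in os_.items():
        for j in occ[i][k]:
            prob += oc[i, j] >= sent                    # (8)
    for (i, j) in oc:                                   # (9)
        prob += oc[i, j] <= lpSum(os_[i, k] for k in range(len(occ[i]))
                                  if j in occ[i][k])
    prob += lpSum(lengths[i][k] * os_[i, k] for (i, k) in os_) <= L  # (10)

    prob.solve()
    return [(i, k) for (i, k), sent in os_.items() if sent.value() > 0.5]
```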

75 1 Dataset Because of the novelty of the comparative news summarization task, there is no existing data set for evaluation. [sent-135, score-0.84]

76 We first choose five pairs of comparable topics, then retrieve ten related news articles for each topic using the Google News search engine. [sent-137, score-0.284]

77 Finally we write the comparative summary for each topic pair manually. [sent-138, score-0.799]

78 2 Evaluation Metrics We evaluate the models with the following measures: Comparison Precision / Recall / F-measure: let aa and am be the numbers of all aspects (footnote 1: we use the IBM ILOG CPLEX optimizer to solve the problem) [sent-141, score-0.13]

79 Table 1: Comparable topic pairs in the dataset. [sent-144, score-0.056]

80 involved in the automatically generated summary and the manually written summary respectively; ca be the number of human-agreed comparative aspects in the automatically generated summary. [sent-145, score-1.086]

81 The comparison precision (CP), comparison recall (CR) and comparison F-measure (CF) are defined as follows: CP = ca/aa; CR = ca/am; CF = 2 · CP · CR / (CP + CR) ROUGE: ROUGE is a widely used metric in summarization evaluation. [sent-146, score-0.354]
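
The aspect-level metrics translate directly into code; a minimal sketch with zero-division guards (our addition):

```python
def comparison_metrics(ca, aa, am):
    """ca: human-agreed comparative aspects in the system summary;
    aa / am: aspect counts in the system / manual summary."""
    cp = ca / aa if aa else 0.0   # comparison precision
    cr = ca / am if am else 0.0   # comparison recall
    cf = 2 * cp * cr / (cp + cr) if cp + cr else 0.0  # comparison F-measure
    return cp, cr, cf

print(comparison_metrics(ca=3, aa=5, am=4))  # (0.6, 0.75, 0.666...)
```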

82 It measures summary quality by counting overlapping units between the candidate summary and the reference summary (Lin and Hovy, 2003). [sent-147, score-0.723]

83 To evaluate whether the summary is related to both topics, we also split each comparative summary into two topic-related parts, evaluate them respectively, and report the mean of the two ROUGE values (denoted as MROUGE). [sent-149, score-0.984]
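
MROUGE as described reduces to splitting and averaging; `rouge_n` below is a placeholder for any ROUGE scorer with the signature shown (the paper uses the standard ROUGE metric of Lin and Hovy).

```python
def mrouge(sys_part1, sys_part2, ref_part1, ref_part2, rouge_n):
    """Split the comparative summary into its two topic-related parts,
    score each against the corresponding reference, and average."""
    return (rouge_n(sys_part1, ref_part1) + rouge_n(sys_part2, ref_part2)) / 2
```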

84 3 Baseline Systems Non-Comparative Model (NCM): The non-comparative model treats the task as a traditional summarization problem and selects the important sentences from each document collection. [sent-151, score-0.277]

85 Co-Ranking Model (CRM): The co-ranking model makes use of the relations within each topic and relations across the topics to reinforce scores of the comparison related sentences. [sent-153, score-0.194]

86 The SS, WW and SW relationships are replaced by relationships between two sentences within each topic and relationships between two sentences from different topics. [sent-156, score-0.232]

87 4 Experiment Results We apply all the systems to generate comparative summaries with a length limit of 200 words. [sent-158, score-0.572]

88 Compared with the baseline models, our linear programming based comparative model (denoted as LPCM) achieves the best scores over all metrics. [sent-160, score-0.545]

89 The CRM model utilizes the similarity between two topics to enhance the score of comparison related sentences. [sent-162, score-0.138]

90 However, it does not guarantee choosing pairwise sentences to form comparisons. [sent-163, score-0.043]

91 The LPCM model focuses on both comparativeness and representativeness at the same time, and thus it achieves good performance on both comparison extraction and summarization. [sent-164, score-0.281]

92 Figure 1 shows an example of a comparative summary generated by using the LPCM model. [sent-165, score-0.743]

93 The summary describes several comparisons between two FIFA World Cups in 2006 and 2010. [sent-166, score-0.336]

94 5 Conclusion In this study, we propose a novel approach to summing up the commonalities and differences between two news topics. [sent-168, score-0.312]

95 We formulate the task as an optimization problem of selecting sentences to maximize the score of comparative and representative evidences. [sent-169, score-0.732]

96 The experimental results show that our model is effective in comparison extraction and summarization. [sent-170, score-0.053]

97 In future work, we will utilize more semantic information such as localized latent topics to help capture comparative aspects, and use machine learning technologies to tune weights of concepts. [sent-171, score-0.616]

98 Table 2: Evaluation results of systems. Figure 1: A sample comparative summary generated by using the LPCM model. [sent-200, score-0.743]

99 Towards an iterative reinforcement approach for simultaneous document summarization and keyword extraction. [sent-270, score-0.234]

100 A cross-collection mixture model for comparative text mining. [sent-280, score-0.502]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('comparative', 0.502), ('summary', 0.241), ('concepts', 0.21), ('summarization', 0.195), ('opjk', 0.188), ('chile', 0.168), ('jindal', 0.153), ('concept', 0.151), ('news', 0.143), ('commonalities', 0.138), ('sik', 0.138), ('cij', 0.127), ('comparativeness', 0.126), ('comparee', 0.126), ('fifa', 0.126), ('ocij', 0.126), ('osik', 0.126), ('zhai', 0.116), ('lerman', 0.111), ('aspects', 0.102), ('representativeness', 0.102), ('comparisons', 0.095), ('haiti', 0.094), ('lpcm', 0.094), ('ocsijk', 0.094), ('ujk', 0.094), ('topics', 0.085), ('wan', 0.072), ('summaries', 0.07), ('crm', 0.063), ('idfij', 0.063), ('tfij', 0.063), ('descriptions', 0.061), ('cp', 0.061), ('wij', 0.061), ('kim', 0.061), ('optimization', 0.06), ('evidences', 0.059), ('rouge', 0.057), ('player', 0.057), ('liu', 0.057), ('topic', 0.056), ('jianguo', 0.055), ('lik', 0.055), ('ncm', 0.055), ('comparison', 0.053), ('weather', 0.048), ('comparable', 0.048), ('representative', 0.047), ('year', 0.046), ('conveys', 0.045), ('peking', 0.044), ('gillick', 0.044), ('sentences', 0.043), ('programming', 0.043), ('mcdonald', 0.042), ('xiaojun', 0.042), ('aspect', 0.042), ('bing', 0.042), ('si', 0.041), ('maximize', 0.041), ('implicitly', 0.04), ('cf', 0.039), ('document', 0.039), ('formulate', 0.039), ('acm', 0.038), ('di', 0.038), ('ilp', 0.038), ('objective', 0.038), ('articles', 0.037), ('carbonell', 0.037), ('cr', 0.036), ('pedersen', 0.036), ('nitin', 0.036), ('ci', 0.036), ('sun', 0.035), ('xiao', 0.034), ('richer', 0.032), ('chengxiang', 0.032), ('differences', 0.031), ('indicating', 0.03), ('relationships', 0.03), ('sw', 0.03), ('proceeding', 0.03), ('china', 0.03), ('calculated', 0.029), ('highlight', 0.029), ('latent', 0.029), ('trends', 0.028), ('formally', 0.028), ('huang', 0.028), ('solve', 0.028), ('mining', 0.028), ('wang', 0.028), ('study', 0.028), ('tbhlee', 0.028), ('tche', 0.028), ('jianwu', 0.028), ('korbinian', 0.028), ('accidents', 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 76 acl-2011-Comparative News Summarization Using Linear Programming

Author: Xiaojiang Huang ; Xiaojun Wan ; Jianguo Xiao

Abstract: Comparative News Summarization aims to highlight the commonalities and differences between two comparable news topics. In this study, we propose a novel approach to generating comparative news summaries. We formulate the task as an optimization problem of selecting proper sentences to maximize the comparativeness within the summary and the representativeness to both news topics. We consider semantic-related cross-topic concept pairs as comparative evidences, and consider topic-related concepts as representative evidences. The optimization problem is addressed by using a linear programming model. The experimental results demonstrate the effectiveness of our proposed model.

2 0.36816117 130 acl-2011-Extracting Comparative Entities and Predicates from Texts Using Comparative Type Classification

Author: Seon Yang ; Youngjoong Ko

Abstract: The automatic extraction of comparative information is an important text mining problem and an area of increasing interest. In this paper, we study how to build a Korean comparison mining system. Our work is composed of two consecutive tasks: 1) classifying comparative sentences into different types and 2) mining comparative entities and predicates. We perform various experiments to find relevant features and learning techniques. As a result, we achieve outstanding performance enough for practical use. 1

3 0.17157143 98 acl-2011-Discovery of Topically Coherent Sentences for Extractive Summarization

Author: Asli Celikyilmaz ; Dilek Hakkani-Tur

Abstract: Extractive methods for multi-document summarization are mainly governed by information overlap, coherence, and content constraints. We present an unsupervised probabilistic approach to model the hidden abstract concepts across documents as well as the correlation between these concepts, to generate topically coherent and non-redundant summaries. Based on human evaluations our models generate summaries with higher linguistic quality in terms of coherence, readability, and redundancy compared to benchmark systems. Although our system is unsupervised and optimized for topical coherence, we achieve a 44.1 ROUGE on the DUC-07 test set, roughly in the range of state-of-the-art supervised models.

4 0.17093836 21 acl-2011-A Pilot Study of Opinion Summarization in Conversations

Author: Dong Wang ; Yang Liu

Abstract: This paper presents a pilot study of opinion summarization on conversations. We create a corpus containing extractive and abstractive summaries of speaker’s opinion towards a given topic using 88 telephone conversations. We adopt two methods to perform extractive summarization. The first one is a sentence-ranking method that linearly combines scores measured from different aspects including topic relevance, subjectivity, and sentence importance. The second one is a graph-based method, which incorporates topic and sentiment information, as well as additional information about sentence-to-sentence relations extracted based on dialogue structure. Our evaluation results show that both methods significantly outperform the baseline approach that extracts the longest utterances. In particular, we find that incorporating dialogue structure in the graph-based method contributes to the improved system performance.

5 0.15144193 326 acl-2011-Using Bilingual Information for Cross-Language Document Summarization

Author: Xiaojun Wan

Abstract: Cross-language document summarization is defined as the task of producing a summary in a target language (e.g. Chinese) for a set of documents in a source language (e.g. English). Existing methods for addressing this task make use of either the information from the original documents in the source language or the information from the translated documents in the target language. In this study, we propose to use the bilingual information from both the source and translated documents for this task. Two summarization methods (SimFusion and CoRank) are proposed to leverage the bilingual information in the graph-based ranking framework for cross-language summary extraction. Experimental results on the DUC2001 dataset with manually translated reference Chinese summaries show the effectiveness of the proposed methods. 1

6 0.14218895 270 acl-2011-SciSumm: A Multi-Document Summarization System for Scientific Articles

7 0.14018549 18 acl-2011-A Latent Topic Extracting Method based on Events in a Document and its Application

8 0.13762736 47 acl-2011-Automatic Assessment of Coverage Quality in Intelligence Reports

9 0.12890415 308 acl-2011-Towards a Framework for Abstractive Summarization of Multimodal Documents

10 0.1259902 187 acl-2011-Jointly Learning to Extract and Compress

11 0.11139359 45 acl-2011-Aspect Ranking: Identifying Important Product Aspects from Online Consumer Reviews

12 0.10614394 71 acl-2011-Coherent Citation-Based Summarization of Scientific Papers

13 0.10199249 251 acl-2011-Probabilistic Document Modeling for Syntax Removal in Text Summarization

14 0.10071108 280 acl-2011-Sentence Ordering Driven by Local and Global Coherence for Summary Generation

15 0.099790245 2 acl-2011-AM-FM: A Semantic Framework for Translation Quality Assessment

16 0.096468359 255 acl-2011-Query Snowball: A Co-occurrence-based Approach to Multi-document Summarization for Question Answering

17 0.095433734 4 acl-2011-A Class of Submodular Functions for Document Summarization

18 0.092528462 150 acl-2011-Hierarchical Text Classification with Latent Concepts

19 0.08826264 52 acl-2011-Automatic Labelling of Topic Models

20 0.085887082 161 acl-2011-Identifying Word Translations from Comparable Corpora Using Latent Topic Models


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.179), (1, 0.107), (2, -0.037), (3, 0.128), (4, -0.05), (5, -0.043), (6, -0.105), (7, 0.219), (8, 0.009), (9, -0.058), (10, -0.103), (11, 0.003), (12, -0.135), (13, -0.066), (14, -0.126), (15, -0.029), (16, 0.071), (17, 0.018), (18, 0.054), (19, 0.073), (20, -0.027), (21, -0.038), (22, 0.068), (23, 0.015), (24, -0.028), (25, -0.052), (26, 0.09), (27, -0.127), (28, 0.024), (29, 0.046), (30, 0.017), (31, -0.046), (32, 0.011), (33, -0.007), (34, -0.032), (35, 0.002), (36, 0.025), (37, 0.012), (38, -0.036), (39, 0.015), (40, 0.083), (41, 0.015), (42, -0.059), (43, -0.099), (44, -0.057), (45, -0.149), (46, -0.064), (47, 0.13), (48, 0.094), (49, -0.071)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96806544 76 acl-2011-Comparative News Summarization Using Linear Programming

Author: Xiaojiang Huang ; Xiaojun Wan ; Jianguo Xiao

Abstract: Comparative News Summarization aims to highlight the commonalities and differences between two comparable news topics. In this study, we propose a novel approach to generating comparative news summaries. We formulate the task as an optimization problem of selecting proper sentences to maximize the comparativeness within the summary and the representativeness to both news topics. We consider semantic-related cross-topic concept pairs as comparative evidences, and consider topic-related concepts as representative evidences. The optimization problem is addressed by using a linear programming model. The experimental results demonstrate the effectiveness of our proposed model.

2 0.77423167 130 acl-2011-Extracting Comparative Entities and Predicates from Texts Using Comparative Type Classification

Author: Seon Yang ; Youngjoong Ko

Abstract: The automatic extraction of comparative information is an important text mining problem and an area of increasing interest. In this paper, we study how to build a Korean comparison mining system. Our work is composed of two consecutive tasks: 1) classifying comparative sentences into different types and 2) mining comparative entities and predicates. We perform various experiments to find relevant features and learning techniques. As a result, we achieve outstanding performance enough for practical use. 1

3 0.73195064 47 acl-2011-Automatic Assessment of Coverage Quality in Intelligence Reports

Author: Samuel Brody ; Paul Kantor

Abstract: Common approaches to assessing document quality look at shallow aspects, such as grammar and vocabulary. For many real-world applications, deeper notions of quality are needed. This work represents a first step in a project aimed at developing computational methods for deep assessment of quality in the domain of intelligence reports. We present an automated system for ranking intelligence reports with regard to coverage of relevant material. The system employs methodologies from the field of automatic summarization, and achieves performance on a par with human judges, even in the absence of the underlying information sources.

4 0.71581161 187 acl-2011-Jointly Learning to Extract and Compress

Author: Taylor Berg-Kirkpatrick ; Dan Gillick ; Dan Klein

Abstract: We learn a joint model of sentence extraction and compression for multi-document summarization. Our model scores candidate summaries according to a combined linear model whose features factor over (1) the n-gram types in the summary and (2) the compressions used. We train the model using a marginbased objective whose loss captures end summary quality. Because of the exponentially large set of candidate summaries, we use a cutting-plane algorithm to incrementally detect and add active constraints efficiently. Inference in our model can be cast as an ILP and thereby solved in reasonable time; we also present a fast approximation scheme which achieves similar performance. Our jointly extracted and compressed summaries outperform both unlearned baselines and our learned extraction-only system on both ROUGE and Pyramid, without a drop in judged linguistic quality. We achieve the highest published ROUGE results to date on the TAC 2008 data set.

5 0.70421582 308 acl-2011-Towards a Framework for Abstractive Summarization of Multimodal Documents

Author: Charles Greenbacker

Abstract: We propose a framework for generating an abstractive summary from a semantic model of a multimodal document. We discuss the type of model required, the means by which it can be constructed, how the content of the model is rated and selected, and the method of realizing novel sentences for the summary. To this end, we introduce a metric called information density used for gauging the importance of content obtained from text and graphical sources.

6 0.67250884 21 acl-2011-A Pilot Study of Opinion Summarization in Conversations

7 0.65860701 326 acl-2011-Using Bilingual Information for Cross-Language Document Summarization

8 0.65368444 98 acl-2011-Discovery of Topically Coherent Sentences for Extractive Summarization

9 0.63913929 270 acl-2011-SciSumm: A Multi-Document Summarization System for Scientific Articles

10 0.62762874 251 acl-2011-Probabilistic Document Modeling for Syntax Removal in Text Summarization

11 0.62242889 201 acl-2011-Learning From Collective Human Behavior to Introduce Diversity in Lexical Choice

12 0.58475757 255 acl-2011-Query Snowball: A Co-occurrence-based Approach to Multi-document Summarization for Question Answering

13 0.5577724 150 acl-2011-Hierarchical Text Classification with Latent Concepts

14 0.50543153 18 acl-2011-A Latent Topic Extracting Method based on Events in a Document and its Application

15 0.49205881 4 acl-2011-A Class of Submodular Functions for Document Summarization

16 0.48309138 71 acl-2011-Coherent Citation-Based Summarization of Scientific Papers

17 0.47449046 80 acl-2011-ConsentCanvas: Automatic Texturing for Improved Readability in End-User License Agreements

18 0.42346215 14 acl-2011-A Hierarchical Model of Web Summaries

19 0.42203206 280 acl-2011-Sentence Ordering Driven by Local and Global Coherence for Summary Generation

20 0.42087549 99 acl-2011-Discrete vs. Continuous Rating Scales for Language Evaluation in NLP


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.019), (17, 0.039), (22, 0.185), (26, 0.031), (37, 0.104), (39, 0.043), (41, 0.052), (44, 0.03), (53, 0.012), (55, 0.023), (59, 0.042), (72, 0.071), (91, 0.035), (96, 0.201)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.86467159 76 acl-2011-Comparative News Summarization Using Linear Programming

Author: Xiaojiang Huang ; Xiaojun Wan ; Jianguo Xiao

Abstract: Comparative News Summarization aims to highlight the commonalities and differences between two comparable news topics. In this study, we propose a novel approach to generating comparative news summaries. We formulate the task as an optimization problem of selecting proper sentences to maximize the comparativeness within the summary and the representativeness to both news topics. We consider semantic-related cross-topic concept pairs as comparative evidences, and consider topic-related concepts as representative evidences. The optimization problem is addressed by using a linear programming model. The experimental results demonstrate the effectiveness of our proposed model.

2 0.80064726 318 acl-2011-Unsupervised Bilingual Morpheme Segmentation and Alignment with Context-rich Hidden Semi-Markov Models

Author: Jason Naradowsky ; Kristina Toutanova

Abstract: This paper describes an unsupervised dynamic graphical model for morphological segmentation and bilingual morpheme alignment for statistical machine translation. The model extends Hidden Semi-Markov chain models by using factored output nodes and special structures for its conditional probability distributions. It relies on morpho-syntactic and lexical source-side information (part-of-speech, morphological segmentation) while learning a morpheme segmentation over the target language. Our model outperforms a competitive word alignment system in alignment quality. Used in a monolingual morphological segmentation setting it substantially improves accuracy over previous state-of-the-art models on three Arabic and Hebrew datasets.

3 0.79796094 95 acl-2011-Detection of Agreement and Disagreement in Broadcast Conversations

Author: Wen Wang ; Sibel Yaman ; Kristin Precoda ; Colleen Richey ; Geoffrey Raymond

Abstract: We present Conditional Random Fields based approaches for detecting agreement/disagreement between speakers in English broadcast conversation shows. We develop annotation approaches for a variety of linguistic phenomena. Various lexical, structural, durational, and prosodic features are explored. We compare the performance when using features extracted from automatically generated annotations against that when using human annotations. We investigate the efficacy of adding prosodic features on top of lexical, structural, and durational features. Since the training data is highly imbalanced, we explore two sampling approaches, random downsampling and ensemble downsampling. Overall, our approach achieves 79.2% (precision), 50.5% (recall), 61.7% (F1) for agreement detection and 69.2% (precision), 46.9% (recall), and 55.9% (F1) for disagreement detection, on the English broadcast conversation data.

4 0.79390681 64 acl-2011-C-Feel-It: A Sentiment Analyzer for Micro-blogs

Author: Aditya Joshi ; Balamurali AR ; Pushpak Bhattacharyya ; Rajat Mohanty

Abstract: Social networking and micro-blogging sites are stores of opinion-bearing content created by human users. We describe C-Feel-It, a system which can tap opinion content in posts (called tweets) from the micro-blogging website, Twitter. This web-based system categorizes tweets pertaining to a search string as positive, negative or objective and gives an aggregate sentiment score that represents a sentiment snapshot for a search string. We present a qualitative evaluation of this system based on a human-annotated tweet corpus.

5 0.7938388 246 acl-2011-Piggyback: Using Search Engines for Robust Cross-Domain Named Entity Recognition

Author: Stefan Rud ; Massimiliano Ciaramita ; Jens Muller ; Hinrich Schutze

Abstract: We use search engine results to address a particularly difficult cross-domain language processing task, the adaptation of named entity recognition (NER) from news text to web queries. The key novelty of the method is that we submit a token with context to a search engine and use similar contexts in the search results as additional information for correctly classifying the token. We achieve strong gains in NER performance on news, in-domain and out-of-domain, and on web queries.

6 0.79313642 233 acl-2011-On-line Language Model Biasing for Statistical Machine Translation

7 0.79255211 147 acl-2011-Grammatical Error Correction with Alternating Structure Optimization

8 0.7906307 190 acl-2011-Knowledge-Based Weak Supervision for Information Extraction of Overlapping Relations

9 0.78813875 137 acl-2011-Fine-Grained Class Label Markup of Search Queries

10 0.78765017 240 acl-2011-ParaSense or How to Use Parallel Corpora for Word Sense Disambiguation

11 0.78740501 34 acl-2011-An Algorithm for Unsupervised Transliteration Mining with an Application to Word Alignment

12 0.78711128 32 acl-2011-Algorithm Selection and Model Adaptation for ESL Correction Tasks

13 0.78704405 327 acl-2011-Using Bilingual Parallel Corpora for Cross-Lingual Textual Entailment

14 0.7866472 90 acl-2011-Crowdsourcing Translation: Professional Quality from Non-Professionals

15 0.78657216 339 acl-2011-Word Alignment Combination over Multiple Word Segmentation

16 0.78649557 48 acl-2011-Automatic Detection and Correction of Errors in Dependency Treebanks

17 0.78629851 221 acl-2011-Model-Based Aligner Combination Using Dual Decomposition

18 0.7856288 20 acl-2011-A New Dataset and Method for Automatically Grading ESOL Texts

19 0.78556514 320 acl-2011-Unsupervised Discovery of Domain-Specific Knowledge from Text

20 0.78544891 104 acl-2011-Domain Adaptation for Machine Translation by Mining Unseen Words