
326 acl-2011-Using Bilingual Information for Cross-Language Document Summarization

Source: pdf

Author: Xiaojun Wan

Abstract: Cross-language document summarization is defined as the task of producing a summary in a target language (e.g. Chinese) for a set of documents in a source language (e.g. English). Existing methods for addressing this task make use of either the information from the original documents in the source language or the information from the translated documents in the target language. In this study, we propose to use the bilingual information from both the source and translated documents for this task. Two summarization methods (SimFusion and CoRank) are proposed to leverage the bilingual information in the graph-based ranking framework for cross-language summary extraction. Experimental results on the DUC2001 dataset with manually translated reference Chinese summaries show the effectiveness of the proposed methods.

Reference: text

Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: Cross-language document summarization is defined as the task of producing a summary in a target language (e. [sent-4, score-0.632]

2 Existing methods for addressing this task make use of either the information from the original documents in the source language or the information from the translated documents in the target language. [sent-9, score-0.21]

3 In this study, we propose to use the bilingual information from both the source and translated documents for this task. [sent-10, score-0.218]

4 Two summarization methods (SimFusion and CoRank) are proposed to leverage the bilingual information in the graph-based ranking framework for cross-language summary extraction. [sent-11, score-0.523]

5 Experimental results on the DUC2001 dataset with manually translated reference Chinese summaries show the effectiveness of the proposed methods. [sent-12, score-0.265]

6 1 Introduction Cross-language document summarization is defined as the task of producing a summary in a different target language for a set of documents in a source language (Wan et al. [sent-13, score-0.601]

7 In this study, we focus on English-to-Chinese cross-language summarization, which aims to produce Chinese summaries for English document sets. [sent-15, score-0.294]

8 For example, it is beneficial for most Chinese readers to quickly browse and understand English news documents or document sets by reading the corresponding Chinese summaries. [sent-17, score-0.22]

9 In particular, for the task of English-to-Chinese cross-language summarization, one method is to directly extract English summary sentences based on English features extracted from the English documents, and then automatically translate the English summary sentences into Chinese summary sentences. [sent-19, score-0.775]

10 The other method is to automatically translate the English sentences into Chinese sentences, and then directly extract Chinese summary sentences based on Chinese features. [sent-20, score-0.417]

11 However, it is not very reliable to use only the information in one language, because the machine translation quality is far from satisfactory, and thus the translated Chinese sentences usually contain some errors and noises. [sent-22, score-0.277]

12 ” is automatically translated into the Chinese sentence “许许多破坏电源线被认为是保险的，因为是连根拔起的树木和灌木，在广泛的领域。 ” by using Google Translate, but the Chinese sentence contains a few translation errors. [sent-24, score-0.254]

13 On the other side, if we rely only on the Chinese-side information to extract Chinese summary sentences, we cannot guarantee that the selected sentences are really salient because the features for sentence ranking based on the incorrectly translated sentences are not very reliable, either. [sent-32, score-0.694]

14 In this study, we propose to leverage both the information in the source language and the information in the target language for cross-language document summarization. [sent-33, score-0.16]

15 In particular, we propose two graph-based summarization methods (SimFusion and CoRank) for using both English-side and Chinese-side information in the task of English-to-Chinese cross-document summarization. [sent-34, score-0.22]

16 The SimFusion method linearly fuses the English-side similarity and the Chinese-side similarity for measuring Chinese sentence similarity. [sent-35, score-0.217]

17 The CoRank method adopts a co-ranking algorithm to simultaneously rank both English sentences and Chinese sentences by incorporating mutual influences between them. [sent-36, score-0.306]

18 We use the DUC2001 dataset with manually translated reference Chinese summaries for evaluation. [sent-37, score-0.265]

19 Related Work: General Document Summarization. Document summarization methods can be extraction-based, abstraction-based or hybrid methods. [sent-47, score-0.22]

20 We focus on extraction-based methods in this study, and the methods directly extract summary sentences from a document or document set by ranking the sentences in the document or document set. [sent-48, score-0.985]

21 In the task of single document summarization, various features have been investigated for ranking sentences in a document, including term frequency, sentence position, cue words, stigma words, and topic signature (Luhn 1969; Lin and Hovy, 2000). [sent-49, score-0.385]

22 (2010) present a language-independent approach for extractive summarization based on the linear optimization of several sentence ranking measures using a genetic algorithm. [sent-53, score-0.389]

23 , 2004) ranks the sentences in a document set based on such features as cluster centroids, position and TFIDF. [sent-58, score-0.233]

24 Nenkova and Louis (2008) investigate the influences of input difficulty on summarization performance. [sent-61, score-0.252]

25 Celikyilmaz and Hakkani-Tur (2010) formulate extractive summarization as a two-step learning problem by building a generative model for pattern discovery and a regression model for inference. [sent-64, score-0.264]

26 (2010) propose an A* search algorithm to find the best extractive summary up to a given length, and they propose a discriminative training algorithm for directly maximizing the quality of the best summary. [sent-66, score-0.223]

27 Graph-based methods have also been used to rank sentences for multi-document summarization (Mihalcea and Tarau, 2005; Wan and Yang, 2008). [sent-67, score-0.32]

28 2 Cross-Lingual Document Summarization Several pilot studies have investigated the task of cross-language document summarization. [sent-69, score-0.16]

29 Two typical translation schemes are document translation or summary translation. [sent-71, score-0.418]

30 The document translation scheme first translates the source documents into the corresponding documents in the target language, and then extracts summary sentences based only on the information on the target side. [sent-72, score-0.576]

31 The summary translation scheme first extracts summary sentences from the source documents based only on the information on the source side, and then translates the summary sentences into the corresponding summary sentences in the target language. [sent-73, score-1.165]

32 (2004) propose to generate a Japanese summary by using Korean summarizer. [sent-77, score-0.179]

33 Orasan and Chiorean (2008) propose to produce summaries with the MMR method from Romanian news articles and then automatically translate the summaries into English. [sent-80, score-0.284]

34 Cross language query based summarization has been investigated in (Pingali et al. [sent-81, score-0.247]

35 (2010) adopt the summary translation scheme for the task of English-to-Chinese cross-language summarization. [sent-84, score-0.232]

36 They first extract English summary sentences by using English-side features and the machine translation quality factor, and then automatically translate the English summary into Chinese summary. [sent-85, score-0.549]

37 Other related work includes multilingual summarization (Lin et al. [sent-86, score-0.257]

38 , 2005; Siddharthan and McKeown, 2005), which aims to create summaries from multiple sources in multiple languages. [sent-87, score-0.161]

39 In other words, when we compute the similarity value between two Chinese sentences, the similarity value between the corresponding two English sentences is used by linear fusion. [sent-93, score-0.322]

40 Since sentence similarity evaluation plays a very important role in the graph-based ranking algorithm, this method can leverage bothside information through similarity fusion. [sent-94, score-0.291]

41 Formally, given the Chinese document set Dcn translated from an English document set, let Gcn=(Vcn, Ecn) be an undirected graph to reflect the relationships between the sentences in the Chinese document set. [sent-95, score-0.736]

42 Vcn is the set of vertices and each vertex scni in Vcn represents a Chinese sentence. [sent-96, score-0.294]

43 Each edge ecnij in Ecn is associated with an affinity weight f(scni, scnj) between sentences scni and scnj (i≠j). [sent-98, score-0.689]

44 The weight is computed by linearly combining the similarity value simcosine(scni, scnj) between the Chinese sentences and the similarity value simcosine(seni, senj) between the corresponding English sentences. [sent-99, score-0.322]

45 f(scni, scnj) = λ ⋅ simcosine(scni, scnj) + (1 − λ) ⋅ simcosine(seni, senj), where seni and senj are the source English sentences for scni and scnj. [sent-100, score-0.833]

46 We use an affinity matrix Mcn to describe Gcn with each entry corresponding to the weight of an edge in the graph. [sent-106, score-0.226]
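The fused edge weight and the affinity matrix described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the author's implementation: the function names `fuse_affinity` and `rank_sentences`, the damping factor 0.85, the row normalization, and the fixed iteration count are all assumptions, since the extracted text only states that a graph-based ranking algorithm is applied to the affinity matrix.

```python
def fuse_affinity(sim_cn, sim_en, lam=0.5):
    """Fuse Chinese-side and English-side similarities (hypothetical helper).

    sim_cn[i][j] and sim_en[i][j] are cosine similarities between Chinese
    sentences i, j and between their source English sentences, respectively.
    """
    n = len(sim_cn)
    return [
        # Edges are only defined for i != j, so the diagonal is set to zero.
        [0.0 if i == j else lam * sim_cn[i][j] + (1.0 - lam) * sim_en[i][j]
         for j in range(n)]
        for i in range(n)
    ]


def rank_sentences(affinity, damping=0.85, iters=100):
    """PageRank-style power iteration (assumed ranking step, not from the source)."""
    n = len(affinity)
    # Row-normalize so that each sentence distributes a total weight of 1.
    norm = []
    for row in affinity:
        total = sum(row)
        norm.append([w / total if total > 0 else 0.0 for w in row])
    scores = [1.0 / n] * n
    for _ in range(iters):
        scores = [
            damping * sum(norm[j][i] * scores[j] for j in range(n))
            + (1.0 - damping) / n
            for i in range(n)
        ]
    return scores
```

Setting `lam=1.0` reduces the weight to the Chinese-side similarity only, and `lam=0.0` to the English-side similarity only, which matches the two single-language settings the fusion is meant to interpolate between.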

47 (2006) to penalize the sentences highly overlapping with other highly scored sentences, and finally the salient and novel Chinese sentences are directly selected as summary sentences. [sent-116, score-0.47]

48 The source English sentences and the translated Chinese sentences are simultaneously ranked in a unified graph-based algorithm. [sent-119, score-0.356]

49 The saliency of each English sentence relies not only on the English sentences linked with it, but also on the Chinese sentences linked with it. [sent-120, score-0.539]

50 Similarly, the saliency of each Chinese sentence relies not only on the Chinese sentences linked with it, but also on the English sentences linked with it. [sent-121, score-0.539]

51 More specifically, the proposed method is based on the following assumptions: Assumption 1: A Chinese sentence would be salient if it is heavily linked with other salient Chinese sentences; and an English sentence would be salient if it is heavily linked with other salient English sentences. [sent-122, score-0.658]

52 Assumption 2: A Chinese sentence would be salient if it is heavily linked with salient English sentences; and an English sentence would be salient if it is heavily linked with salient Chinese sentences. [sent-123, score-0.658]

53 The first assumption is similar to PageRank which makes use of mutual “recommendations” between the sentences in the same language to rank sentences. [sent-124, score-0.137]

54 The second assumption is similar to HITS if the English sentences and the Chinese sentences are considered as authorities and hubs, respectively. [sent-125, score-0.2]

55 The mutual influences between the Chinese sentences and the English sentences are incorporated in the method. [sent-127, score-0.269]

56 Three kinds of relationships are exploited: the CN-CN relationships between Chinese sentences, the EN-EN relationships between English sentences, and the EN-CN relationships between English sentences and Chinese sentences. [sent-129, score-0.505]

57 Formally, given an English document set Den and the translated Chinese document set Dcn, let G=(Ven, Vcn, Een, Ecn, Eencn) be an undirected graph to reflect all the three kinds of relationships between the sentences in the two document sets. [sent-130, score-0.765]

58 scni is the corresponding Chinese sentence translated from seni. [sent-133, score-0.444]

59 Een is the edge set to reflect the relationships between the English sentences. [sent-135, score-0.172]

60 Ecn is the edge set to reflect the relationships between the Chinese sentences. [sent-136, score-0.172]

61 Eencn is the edge set to reflect the relationships between the English sentences and the Chinese sentences. [sent-137, score-0.272]

62 Based on the graph representation, we compute the following three affinity matrices to reflect the three kinds of sentence relationships (Figure 1: The three kinds of sentence relationships). [sent-138, score-0.328]

63 1) Mcn=(Mcnij)n×n: This affinity matrix aims to reflect the relationships between the Chinese sentences. [sent-139, score-0.511]

64 Each entry in the matrix corresponds to the cosine similarity between the two Chinese sentences. [sent-140, score-0.218]

65 2) Men=(Menij)n×n: This affinity matrix aims to reflect the relationships between the English sentences. [sent-142, score-0.337]

66 Each entry in the matrix corresponds to the cosine similarity between the two English sentences. [sent-143, score-0.218]

67 3) Mencn=(Mencnij)n×n: This affinity matrix aims to reflect the relationships between the English sentences and the Chinese sentences. [sent-145, score-0.437]

68 Each entry Mencnij in the matrix corresponds to the similarity between the English sentence seni and the Chinese sentence scnj. [sent-146, score-0.41]

69 It is hard to directly compute the similarity between the sentences in different languages. [sent-147, score-0.183]
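The two within-language matrices Mcn and Men described above are built from cosine similarities between sentences. A minimal sketch follows, assuming each sentence is already tokenized (segmented, for Chinese) and represented by raw term counts. The term weighting is an assumption, since the extracted text does not say which weights are used, and `cosine` and `affinity_matrix` are hypothetical helper names.

```python
import math
from collections import Counter


def cosine(tokens_a, tokens_b):
    """Cosine similarity between two token lists using raw term counts."""
    va, vb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def affinity_matrix(sentences):
    """Pairwise cosine similarities with a zero diagonal (no self-links)."""
    n = len(sentences)
    return [
        [0.0 if i == j else cosine(sentences[i], sentences[j]) for j in range(n)]
        for i in range(n)
    ]
```

The same function can be applied to the English sentences and to the translated Chinese sentences separately; it does not address the cross-language matrix Mencn, which cannot be computed by direct term overlap.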

70 We use two column vectors u=[u(scni)]n×1 and v =[v(senj)]n×1 to denote the saliency scores of the Chinese sentences and the English sentences, respectively. [sent-152, score-0.22]

71 Finally, a few highly ranked sentences are selected as summary sentences. [sent-158, score-0.279]
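The co-ranking idea behind the two assumptions can be illustrated with a simple mutual-reinforcement iteration. The update rule below is an assumption made for illustration: the extracted text gives the three affinity matrices and the score vectors u and v, but not the exact update equations, weights, or normalization. The names `co_rank` and `top_k` and the parameters `alpha` and `beta` are hypothetical.

```python
def _normalize(vec):
    """Scale a non-negative vector so that its entries sum to 1."""
    total = sum(vec)
    return [x / total for x in vec] if total > 0 else list(vec)


def co_rank(m_cn, m_en, m_encn, alpha=0.5, beta=0.5, iters=50):
    """Jointly score Chinese sentences (u) and English sentences (v).

    m_encn[j][i] links English sentence j to Chinese sentence i.
    alpha weights same-language links, beta weights cross-language links.
    """
    n = len(m_cn)
    u = [1.0 / n] * n  # saliency of Chinese sentences
    v = [1.0 / n] * n  # saliency of English sentences
    for _ in range(iters):
        new_u = [
            alpha * sum(m_cn[j][i] * u[j] for j in range(n))
            + beta * sum(m_encn[j][i] * v[j] for j in range(n))
            for i in range(n)
        ]
        new_v = [
            alpha * sum(m_en[i][j] * v[i] for i in range(n))
            + beta * sum(m_encn[j][i] * u[i] for i in range(n))
            for j in range(n)
        ]
        u, v = _normalize(new_u), _normalize(new_v)
    return u, v


def top_k(scores, k):
    """Indices of the k highest-scoring sentences, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
```

In this sketch the Chinese summary would be formed from `top_k(u, k)`; any redundancy removal applied before the final selection is outside the scope of the example.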

72 Experimental Evaluation: Evaluation Setup. There is no benchmark dataset for English-to-Chinese cross-language document summarization, so we built our evaluation dataset based on the DUC2001 dataset by manually translating the reference summaries. [sent-160, score-0.176]

73 DUC2001 provided 30 English document sets for generic multi-document summarization. [sent-161, score-0.133]

74 The average document number per document set was 10. [sent-162, score-0.266]

75 The sentences in each article have been separated and the sentence information has been stored into files. [sent-163, score-0.151]

76 Three or two generic reference English summaries were provided by NIST annotators for each document set. [sent-164, score-0.299]

77 Three graduate students were employed to manually translate the reference English summaries into reference Chinese summaries. [sent-165, score-0.247]

78 Each student manually translated one third of the reference summaries. [sent-166, score-0.142]

79 It was much easier and more reliable to provide the reference Chinese summaries by manual translation than by manual summarization. [sent-167, score-0.244]

80 All the English sentences in the document set were automatically translated into Chinese sentences by using Google Translate, and the Stanford Chinese Word Segmenter was used for segmenting the Chinese documents and summaries into words. [sent-172, score-0.597]

81 For comparative study, the summary length was limited to five sentences, i. [sent-173, score-0.179]

82 5 (Lin and Hovy, 2003) toolkit for evaluation, which has been widely adopted by DUC and TAC for automatic summarization evaluation. [sent-178, score-0.22]

83 It measured summary quality by counting overlapping units such as the n-gram, word sequences and word pairs between the candidate summary and the reference summary. [sent-179, score-0.401]
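The n-gram overlap counting mentioned above can be illustrated with a simplified recall computation. This is a sketch of the general idea only, not the ROUGE toolkit itself: it handles a single reference, uses clipped n-gram counts, and omits stemming, stopword handling, and multi-reference aggregation. The function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=1):
    """Fraction of reference n-grams that also appear in the candidate."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Clip each match by the candidate count so repeats are not over-credited.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total
```

For Chinese summaries the token lists would come from a word segmenter or from individual characters, depending on the evaluation setting.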

84 Baseline(EN): This baseline adopts the summary translation scheme, and it relies on the English-side information for English sentence ranking. [sent-184, score-0.384]

85 The extracted English summary is finally automatically translated into the corresponding Chinese summary. [sent-185, score-0.278]

86 The same sentence ranking algorithm with the SimFusion method is adopted, and the affinity weight is computed based only on the cosine similarity between English sentences. [sent-186, score-0.359]

87 Baseline(CN): This baseline adopts the document translation scheme, and it relies on the Chinese-side information for Chinese sentence ranking. [sent-187, score-0.338]

88 The Chinese summary sentences are directly extracted from the translated Chinese documents. [sent-188, score-0.378]

89 The same sentence ranking algorithm with the SimFusion method is adopted, and the affinity [sent-189, score-0.229]

90 The results demonstrate that the Chinese-side information is more beneficial than the English-side information for cross-document summarization, because the summary sentences are finally selected from the Chinese side. [sent-199, score-0.324]

91 The results demonstrate the effectiveness of using bilingual information for cross-language document summarization. [sent-201, score-0.183]

92 The results show that the CoRank method is more suitable for the task by incorporating the bilingual information into a unified ranking framework. [sent-204, score-0.154]

93 The results demonstrate that CoRank relies on both the information from the same language side and the information from the other language side for sentence ranking. [sent-220, score-0.155]

94 Therefore, both the Chinese-side information and the English-side information can complement each other, and they are beneficial to the final summarization performance. [sent-221, score-0.265]

95 The bilingual information can be better incorporated in the unified ranking framework of the CoRank method. [sent-224, score-0.154]

96 Though our attempt to use GIZA++ for evaluating the similarity between Chinese sentences and English sentences failed, we will exploit more advanced measures based on statistical alignment model for cross-language similarity computation. [sent-231, score-0.366]

97 Crosslingual summarization with thematic extraction, syntactic sentence simplification, and bilingual generation. [sent-288, score-0.321]

98 A new approach to improving multilingual summarization using a genetic algorithm. [sent-369, score-0.257]

99 The Pyramid method: incorporating human content selection variation in summarization evaluation. [sent-401, score-0.22]

100 Cross-language document summarization based on machine translation quality prediction. [sent-445, score-0.406]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('corank', 0.392), ('simfusion', 0.333), ('chinese', 0.329), ('scni', 0.294), ('summarization', 0.22), ('summary', 0.179), ('scnj', 0.157), ('simcosine', 0.157), ('vcn', 0.157), ('seni', 0.137), ('document', 0.133), ('wan', 0.124), ('summaries', 0.123), ('saliency', 0.12), ('mcn', 0.118), ('affinity', 0.104), ('sicn', 0.104), ('cn', 0.1), ('sentences', 0.1), ('translated', 0.099), ('ecn', 0.098), ('mencn', 0.098), ('senj', 0.098), ('relationships', 0.094), ('salient', 0.091), ('similarity', 0.083), ('corankbaseline', 0.078), ('ranking', 0.074), ('english', 0.073), ('linked', 0.066), ('rouge', 0.059), ('mcnij', 0.059), ('mencnij', 0.059), ('sjen', 0.059), ('matrix', 0.057), ('translation', 0.053), ('englishside', 0.052), ('en', 0.052), ('sentence', 0.051), ('bilingual', 0.05), ('cosine', 0.047), ('beneficial', 0.045), ('figures', 0.045), ('pagerank', 0.045), ('reflect', 0.044), ('extractive', 0.044), ('reference', 0.043), ('documents', 0.042), ('multidocument', 0.042), ('reinforcement', 0.041), ('chalendar', 0.039), ('dcn', 0.039), ('eencn', 0.039), ('encn', 0.039), ('gcn', 0.039), ('infoscore', 0.039), ('scin', 0.039), ('sejn', 0.039), ('sien', 0.039), ('sjcn', 0.039), ('translate', 0.038), ('aims', 0.038), ('multilingual', 0.037), ('mutual', 0.037), ('adopts', 0.037), ('mihalcea', 0.036), ('relies', 0.036), ('orasan', 0.035), ('aker', 0.035), ('fusing', 0.035), ('amini', 0.035), ('noises', 0.035), ('litvak', 0.035), ('side', 0.034), ('edge', 0.034), ('kupiec', 0.032), ('tarau', 0.032), ('influences', 0.032), ('entry', 0.031), ('nenkova', 0.031), ('radev', 0.03), ('pingali', 0.03), ('louis', 0.03), ('leuski', 0.03), ('ven', 0.03), ('heavily', 0.03), ('lin', 0.03), ('unified', 0.03), ('kinds', 0.029), ('normalized', 0.029), ('siddharthan', 0.028), ('baseline', 0.028), ('value', 0.028), ('source', 0.027), ('investigated', 0.027), ('ctio', 0.027), ('peking', 0.027), ('celikyilmaz', 0.026), ('erkan', 0.026), ('reliable', 0.025)]
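A word list of this kind can be produced by weighting each term in a paper by its frequency in that paper and its rarity across the collection. The sketch below shows one common tf-idf variant. The exact weighting and normalization used to produce the numbers above are not stated on this page, so the formula, the function name `top_tfidf_words`, and the tie-breaking rule are assumptions.

```python
import math
from collections import Counter


def top_tfidf_words(docs, doc_index, top_n=5):
    """Return the top_n (word, weight) pairs for one tokenized document.

    Uses tf = count / document length and idf = log(N / document frequency).
    """
    doc = docs[doc_index]
    if not doc:
        return []
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))
    n_docs = len(docs)
    counts = Counter(doc)
    weights = {
        word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
        for word, count in counts.items()
    }
    # Sort by descending weight, breaking ties alphabetically.
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]
```

Words that occur in every document receive a weight of zero under this variant, which is why very common terms do not appear near the top of such lists.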

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999934 326 acl-2011-Using Bilingual Information for Cross-Language Document Summarization

Author: Xiaojun Wan

2 0.18188195 49 acl-2011-Automatic Evaluation of Chinese Translation Output: Word-Level or Character-Level?

Author: Maoxi Li ; Chengqing Zong ; Hwee Tou Ng

Abstract: Word is usually adopted as the smallest unit in most tasks of Chinese language processing. However, for automatic evaluation of the quality of Chinese translation output when translating from other languages, either a word-level approach or a character-level approach is possible. So far, there has been no detailed study to compare the correlations of these two approaches with human assessment. In this paper, we compare word-level metrics with character-level metrics on the submitted output of English-to-Chinese translation systems in the IWSLT’08 CT-EC and NIST’08 EC tasks. Our experimental results reveal that character-level metrics correlate with human assessment better than word-level metrics. Our analysis suggests several key reasons behind this finding.

3 0.17869842 21 acl-2011-A Pilot Study of Opinion Summarization in Conversations

Author: Dong Wang ; Yang Liu

Abstract: This paper presents a pilot study of opinion summarization on conversations. We create a corpus containing extractive and abstractive summaries of speaker’s opinion towards a given topic using 88 telephone conversations. We adopt two methods to perform extractive summarization. The first one is a sentence-ranking method that linearly combines scores measured from different aspects including topic relevance, subjectivity, and sentence importance. The second one is a graph-based method, which incorporates topic and sentiment information, as well as additional information about sentence-to-sentence relations extracted based on dialogue structure. Our evaluation results show that both methods significantly outperform the baseline approach that extracts the longest utterances. In particular, we find that incorporating dialogue structure in the graph-based method contributes to the improved system performance.

4 0.15144193 76 acl-2011-Comparative News Summarization Using Linear Programming

Author: Xiaojiang Huang ; Xiaojun Wan ; Jianguo Xiao

Abstract: Comparative News Summarization aims to highlight the commonalities and differences between two comparable news topics. In this study, we propose a novel approach to generating comparative news summaries. We formulate the task as an optimization problem of selecting proper sentences to maximize the comparativeness within the summary and the representativeness to both news topics. We consider semantic-related cross-topic concept pairs as comparative evidences, and consider topic-related concepts as representative evidences. The optimization problem is addressed by using a linear programming model. The experimental results demonstrate the effectiveness of our proposed model.

5 0.14741349 98 acl-2011-Discovery of Topically Coherent Sentences for Extractive Summarization

Author: Asli Celikyilmaz ; Dilek Hakkani-Tur

Abstract: Extractive methods for multi-document summarization are mainly governed by information overlap, coherence, and content constraints. We present an unsupervised probabilistic approach to model the hidden abstract concepts across documents as well as the correlation between these concepts, to generate topically coherent and non-redundant summaries. Based on human evaluations our models generate summaries with higher linguistic quality in terms of coherence, readability, and redundancy compared to benchmark systems. Although our system is unsupervised and optimized for topical coherence, we achieve a 44.1 ROUGE on the DUC-07 test set, roughly in the range of state-of-the-art supervised models.

6 0.14390846 270 acl-2011-SciSumm: A Multi-Document Summarization System for Scientific Articles

7 0.13956283 187 acl-2011-Jointly Learning to Extract and Compress

8 0.13478602 183 acl-2011-Joint Bilingual Sentiment Classification with Unlabeled Parallel Corpora

9 0.13306893 336 acl-2011-Why Press Backspace? Understanding User Input Behaviors in Chinese Pinyin Input Method

10 0.12415897 18 acl-2011-A Latent Topic Extracting Method based on Events in a Document and its Application

11 0.12329997 308 acl-2011-Towards a Framework for Abstractive Summarization of Multimodal Documents

12 0.12056258 71 acl-2011-Coherent Citation-Based Summarization of Scientific Papers

13 0.11682552 251 acl-2011-Probabilistic Document Modeling for Syntax Removal in Text Summarization

14 0.11539107 241 acl-2011-Parsing the Internal Structure of Words: A New Paradigm for Chinese Word Segmentation

15 0.10511693 4 acl-2011-A Class of Submodular Functions for Document Summarization

16 0.099154197 255 acl-2011-Query Snowball: A Co-occurrence-based Approach to Multi-document Summarization for Question Answering

17 0.09290617 201 acl-2011-Learning From Collective Human Behavior to Introduce Diversity in Lexical Choice

18 0.092132524 280 acl-2011-Sentence Ordering Driven by Local and Global Coherence for Summary Generation

19 0.091098897 47 acl-2011-Automatic Assessment of Coverage Quality in Intelligence Reports

20 0.089804374 66 acl-2011-Chinese sentence segmentation as comma classification

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.179), (1, 0.038), (2, -0.007), (3, 0.134), (4, -0.063), (5, -0.05), (6, -0.072), (7, 0.175), (8, 0.08), (9, -0.023), (10, -0.095), (11, -0.024), (12, -0.161), (13, -0.11), (14, -0.15), (15, -0.018), (16, 0.074), (17, -0.0), (18, 0.204), (19, 0.149), (20, -0.027), (21, -0.067), (22, 0.034), (23, 0.058), (24, 0.002), (25, -0.018), (26, -0.046), (27, -0.034), (28, 0.186), (29, -0.032), (30, 0.009), (31, 0.003), (32, 0.049), (33, -0.057), (34, -0.014), (35, -0.052), (36, 0.048), (37, -0.039), (38, 0.007), (39, 0.003), (40, 0.012), (41, 0.069), (42, 0.047), (43, 0.015), (44, -0.04), (45, -0.012), (46, -0.065), (47, 0.043), (48, 0.065), (49, 0.116)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96193641 326 acl-2011-Using Bilingual Information for Cross-Language Document Summarization

Author: Xiaojun Wan

2 0.69701225 308 acl-2011-Towards a Framework for Abstractive Summarization of Multimodal Documents

Author: Charles Greenbacker

Abstract: We propose a framework for generating an abstractive summary from a semantic model of a multimodal document. We discuss the type of model required, the means by which it can be constructed, how the content of the model is rated and selected, and the method of realizing novel sentences for the summary. To this end, we introduce a metric called information density used for gauging the importance of content obtained from text and graphical sources.

3 0.69077319 76 acl-2011-Comparative News Summarization Using Linear Programming

Author: Xiaojiang Huang ; Xiaojun Wan ; Jianguo Xiao

4 0.66159731 187 acl-2011-Jointly Learning to Extract and Compress

Author: Taylor Berg-Kirkpatrick ; Dan Gillick ; Dan Klein

Abstract: We learn a joint model of sentence extraction and compression for multi-document summarization. Our model scores candidate summaries according to a combined linear model whose features factor over (1) the n-gram types in the summary and (2) the compressions used. We train the model using a margin-based objective whose loss captures end summary quality. Because of the exponentially large set of candidate summaries, we use a cutting-plane algorithm to incrementally detect and add active constraints efficiently. Inference in our model can be cast as an ILP and thereby solved in reasonable time; we also present a fast approximation scheme which achieves similar performance. Our jointly extracted and compressed summaries outperform both unlearned baselines and our learned extraction-only system on both ROUGE and Pyramid, without a drop in judged linguistic quality. We achieve the highest published ROUGE results to date on the TAC 2008 data set.

5 0.65136427 270 acl-2011-SciSumm: A Multi-Document Summarization System for Scientific Articles

Author: Nitin Agarwal ; Ravi Shankar Reddy ; Kiran GVR ; Carolyn Penstein Rose

Abstract: In this demo, we present SciSumm, an interactive multi-document summarization system for scientific articles. The document collection to be summarized is a list of papers cited together within the same source article, otherwise known as a co-citation. At the heart of the approach is a topic based clustering of fragments extracted from each article based on queries generated from the context surrounding the co-cited list of papers. This analysis enables the generation of an overview of common themes from the co-cited papers that relate to the context in which the co-citation was found. SciSumm is currently built over the 2008 ACL Anthology, however the generalizable nature of the summarization techniques and the extensible architecture makes it possible to use the system with other corpora where a citation network is available. Evaluation results on the same corpus demonstrate that our system performs better than an existing widely used multi-document summarization system (MEAD).

6 0.62135965 66 acl-2011-Chinese sentence segmentation as comma classification

7 0.60916668 201 acl-2011-Learning From Collective Human Behavior to Introduce Diversity in Lexical Choice

8 0.60377949 336 acl-2011-Why Press Backspace? Understanding User Input Behaviors in Chinese Pinyin Input Method

9 0.59077764 47 acl-2011-Automatic Assessment of Coverage Quality in Intelligence Reports

10 0.57813859 255 acl-2011-Query Snowball: A Co-occurrence-based Approach to Multi-document Summarization for Question Answering

11 0.56843984 21 acl-2011-A Pilot Study of Opinion Summarization in Conversations

12 0.56246668 80 acl-2011-ConsentCanvas: Automatic Texturing for Improved Readability in End-User License Agreements

13 0.55913013 251 acl-2011-Probabilistic Document Modeling for Syntax Removal in Text Summarization

14 0.5506286 49 acl-2011-Automatic Evaluation of Chinese Translation Output: Word-Level or Character-Level?

15 0.53948283 98 acl-2011-Discovery of Topically Coherent Sentences for Extractive Summarization

16 0.52775013 4 acl-2011-A Class of Submodular Functions for Document Summarization

17 0.50373316 51 acl-2011-Automatic Headline Generation using Character Cross-Correlation

18 0.47803226 241 acl-2011-Parsing the Internal Structure of Words: A New Paradigm for Chinese Word Segmentation

19 0.47522283 71 acl-2011-Coherent Citation-Based Summarization of Scientific Papers

20 0.44954333 280 acl-2011-Sentence Ordering Driven by Local and Global Coherence for Summary Generation

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.012), (17, 0.041), (26, 0.026), (37, 0.091), (39, 0.035), (41, 0.058), (55, 0.022), (57, 0.02), (59, 0.029), (60, 0.283), (72, 0.029), (91, 0.037), (96, 0.217)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.87671101 120 acl-2011-Even the Abstract have Color: Consensus in Word-Colour Associations

Author: Saif Mohammad

Abstract: Colour is a key component in the successful dissemination of information. Since many real-world concepts are associated with colour, for example danger with red, linguistic information is often complemented with the use of appropriate colours in information visualization and product marketing. Yet, there is no comprehensive resource that captures concept–colour associations. We present a method to create a large word–colour association lexicon by crowdsourcing. A word-choice question was used to obtain sense-level annotations and to ensure data quality. We focus especially on abstract concepts and emotions to show that even they tend to have strong colour associations. Thus, using the right colours can not only improve semantic coherence, but also inspire the desired emotional response.

same-paper 2 0.78108072 326 acl-2011-Using Bilingual Information for Cross-Language Document Summarization

Author: Xiaojun Wan

3 0.70926297 57 acl-2011-Bayesian Word Alignment for Statistical Machine Translation

Author: Coskun Mermer ; Murat Saraclar

Abstract: In this work, we compare the translation performance of word alignments obtained via Bayesian inference to those obtained via expectation-maximization (EM). We propose a Gibbs sampler for fully Bayesian inference in IBM Model 1, integrating over all possible parameter values in finding the alignment distribution. We show that Bayesian inference outperforms EM in all of the tested language pairs, domains and data set sizes, by up to 2.99 BLEU points. We also show that the proposed method effectively addresses the well-known rare word problem in EM-estimated models; and at the same time induces a much smaller dictionary of bilingual word-pairs.

4 0.68820685 292 acl-2011-Target-dependent Twitter Sentiment Classification

Author: Long Jiang ; Mo Yu ; Ming Zhou ; Xiaohua Liu ; Tiejun Zhao

Abstract: Sentiment analysis on Twitter data has attracted much attention recently. In this paper, we focus on target-dependent Twitter sentiment classification; namely, given a query, we classify the sentiments of the tweets as positive, negative or neutral according to whether they contain positive, negative or neutral sentiments about that query. Here the query serves as the target of the sentiments. The state-of-the-art approaches for solving this problem always adopt the target-independent strategy, which may assign irrelevant sentiments to the given target. Moreover, the state-of-the-art approaches only take the tweet to be classified into consideration when classifying the sentiment; they ignore its context (i.e., related tweets). However, because tweets are usually short and more ambiguous, sometimes it is not enough to consider only the current tweet for sentiment classification. In this paper, we propose to improve target-dependent Twitter sentiment classification by 1) incorporating target-dependent features; and 2) taking related tweets into consideration. According to the experimental results, our approach greatly improves the performance of target-dependent sentiment classification.

5 0.66403401 155 acl-2011-Hypothesis Mixture Decoding for Statistical Machine Translation

Author: Nan Duan ; Mu Li ; Ming Zhou

Abstract: This paper presents hypothesis mixture decoding (HM decoding), a new decoding scheme that performs translation reconstruction using hypotheses generated by multiple translation systems. HM decoding involves two decoding stages: first, each component system decodes independently, with the explored search space kept for use in the next step; second, a new search space is constructed by composing existing hypotheses produced by all component systems using a set of rules provided by the HM decoder itself, and a new set of model independent features are used to seek the final best translation from this new search space. Few assumptions are made by our approach about the underlying component systems, enabling us to leverage SMT models based on arbitrary paradigms. We compare our approach with several related techniques, and demonstrate significant BLEU improvements in large-scale Chinese-to-English translation tasks.

6 0.663499 318 acl-2011-Unsupervised Bilingual Morpheme Segmentation and Alignment with Context-rich Hidden Semi-Markov Models

7 0.66226619 117 acl-2011-Entity Set Expansion using Topic information

8 0.66173875 104 acl-2011-Domain Adaptation for Machine Translation by Mining Unseen Words

9 0.66143352 220 acl-2011-Minimum Bayes-risk System Combination

10 0.66126758 18 acl-2011-A Latent Topic Extracting Method based on Events in a Document and its Application

11 0.66125357 240 acl-2011-ParaSense or How to Use Parallel Corpora for Word Sense Disambiguation

12 0.66122484 81 acl-2011-Consistent Translation using Discriminative Learning - A Translation Memory-inspired Approach

13 0.66033041 171 acl-2011-Incremental Syntactic Language Models for Phrase-based Translation

14 0.65995681 233 acl-2011-On-line Language Model Biasing for Statistical Machine Translation

15 0.65963662 137 acl-2011-Fine-Grained Class Label Markup of Search Queries

16 0.65912163 101 acl-2011-Disentangling Chat with Local Coherence Models

17 0.65899336 110 acl-2011-Effective Use of Function Words for Rule Generalization in Forest-Based Translation

18 0.65897113 28 acl-2011-A Statistical Tree Annotator and Its Applications

19 0.65892977 177 acl-2011-Interactive Group Suggesting for Twitter

20 0.65874851 76 acl-2011-Comparative News Summarization Using Linear Programming