acl acl2011 acl2011-98 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Asli Celikyilmaz ; Dilek Hakkani-Tur
Abstract: Extractive methods for multi-document summarization are mainly governed by information overlap, coherence, and content constraints. We present an unsupervised probabilistic approach to model the hidden abstract concepts across documents as well as the correlation between these concepts, to generate topically coherent and non-redundant summaries. Based on human evaluations our models generate summaries with higher linguistic quality in terms of coherence, readability, and redundancy compared to benchmark systems. Although our system is unsupervised and optimized for topical coherence, we achieve a 44.1 ROUGE on the DUC-07 test set, roughly in the range of state-of-the-art supervised models.
Reference: text
sentIndex sentText sentNum sentScore
1 We present an unsupervised probabilistic approach to model the hidden abstract concepts across documents as well as the correlation between these concepts, to generate topically coherent and non-redundant summaries. [sent-3, score-0.403]
2 Based on human evaluations our models generate summaries with higher linguistic quality in terms of coherence, readability, and redundancy compared to benchmark systems. [sent-4, score-0.278]
3 An ideal generated summary text should contain the shared relevant content among a set of documents only once, plus other unique information from individual documents that is directly related to the user’s query, addressing different levels of detail. [sent-8, score-0.48]
4 Recent approaches to the summarization task have focused somewhat on the redundancy and coherence issues. [sent-9, score-0.259]
5 In this paper, we introduce a series of new generative models for multiple documents, based on the discovery of hierarchical topics and their correlations, to extract topically coherent sentences. [sent-10, score-0.642]
6 , sections, paragraphs, sentences) for different levels of concepts in a hierarchy, most recent summarization work has focused on structured probabilistic models to represent the corpus concepts (Barzilay et al. [sent-16, score-0.416]
7 In particular, (Haghighi and Vanderwende, 2009; Celikyilmaz and Hakkani-Tur, 2010) build hierarchical topic models to identify salient sentences that contain abstract concepts rather than specific concepts. [sent-21, score-0.647]
8 In this paper, we present a novel, fully generative Bayesian model of a document corpus, which can discover topically coherent sentences that contain key shared information with as little detail and redundancy as possible. [sent-25, score-0.533]
9 Our model can discover hierarchical latent structure of multi-documents, in which some words are governed by low-level topics (T) and others by high-level topics (H). [sent-26, score-0.62]
10 Human evaluations of generated summaries confirm that our model can generate non-redundant and topically coherent summaries. [sent-32, score-0.383]
11 , 2006); topic signatures based on user queries (Lin and Hovy, 2002; Conroy et al. [sent-35, score-0.249]
12 Recent research focusing on the extraction of latent concepts from document clusters is close in spirit to our work (Barzilay and Lee, 2004; Daumé III and Marcu, 2006; Eisenstein and Barzilay, 2008; Tang et al. [sent-38, score-0.278]
13 Some of these works (Haghighi and Vanderwende, 2009; Celikyilmaz and Hakkani-Tur, 2010) focus on the discovery of hierarchical concepts from documents (from abstract to specific) using extensions of hierarchical topic models (Blei et al. [sent-41, score-0.61]
14 Hierarchical concept learning models help to discover, for instance, that ”baseball” and ”football” are both contained in a general class ”sports”, so that the summaries reference terms related to more abstract concepts like ”sports”. [sent-43, score-0.35]
15 We need a model that can identify salient sentences referring to the general concepts of documents, with minimum correlation between them. [sent-45, score-0.402]
16 We define a tiered-topic clustering in which the upper nodes in the DAG are higher-level topics H, representing common co-occurrence patterns (correlations) between lower-level topics T in documents. [sent-49, score-0.438]
17 Mainly, our model can discover correlated topics to eliminate redundant sentences in summary text. [sent-51, score-0.581]
18 , 2004), in which words are generated by first selecting an author uniformly from an observed author list and then selecting a topic from a distribution over topics that is specific to that author. [sent-56, score-0.562]
19 In our model, words are generated from different topics of documents by first selecting a sentence containing the word and then topics that are specific to that sentence. [sent-57, score-0.684]
20 This way we can directly extract from documents the summary related sentences that contain high-level topics. [sent-58, score-0.314]
21 In addition, in (Celikyilmaz and Hakkani-Tur, 2010), sentences can only share topics if they are represented on the same path of the captured topic hierarchy, restricting topic sharing across sentences on different paths. [sent-59, score-0.835]
22 Our DAG identifies tiered topics distributed over document clusters that can be shared by each sentence. [sent-60, score-0.449]
23 3 Topic Coherence for Summarization In this section we discuss the main contribution, our two hierarchical mixture models, which improve summary generation performance through the use of tiered topic models. [sent-61, score-0.569]
24 Our models can identify lower-level topics T (concepts), defined as distributions over words, or higher-level topics H, which represent correlations between these lower-level topics given sentences. [sent-62, score-0.857]
25 We present our synthetic experiment for model development to evaluate extracted summaries on a redundancy measure. [sent-63, score-0.308]
26 For model development we use the DUC 2005 dataset, which consists of 45 document clusters, each of which includes 1-4 sets of human-generated summaries (10-15 sentences each). [sent-66, score-0.417]
27 Each document cluster consists of ∼25 documents (25-30 sentences/document) retrieved based on a user query. [sent-67, score-0.273]
28 For the synthetic experiments, we include the provided human generated summaries of each corpus as additional documents. [sent-69, score-0.279]
29 The sentences in human summaries include general concepts mentioned in the corpus, the salient sentences of documents. [sent-70, score-0.512]
30 Contrary to usual qualitative evaluations of summarization tasks, our aim during development is to measure the percentage of sentences in a human summary that our model can identify as salient among all other document cluster sentences. [sent-71, score-0.688]
31 Because human-produced summaries generally contain non-redundant sentences, we use the total number of top-ranked human summary sentences as a qualitative redundancy measure in our synthetic experiments. [sent-72, score-0.554]
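To make this development measure concrete, here is a minimal sketch of how such a redundancy check could be computed; the function name, the sentence-id scheme, and the cutoff k are illustrative assumptions, not part of the paper.

```python
def redundancy_eval(ranked_sentence_ids, human_summary_ids, k):
    """Illustrative development measure: among the top-k sentences ranked by
    the model, the fraction that belong to the (non-redundant) human summaries
    added to the document cluster. Higher is better."""
    top_k = set(ranked_sentence_ids[:k])
    return len(top_k & set(human_summary_ids)) / float(k)
```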
32 In each model, a document d is a vector of Nd words wd, where each wid is chosen from a vocabulary of size V , and a vector of sentences S, representing all sentences in a corpus of size SD. [sent-73, score-0.647]
33 We identify sentences as meta-variables of document clusters, so that the generative process models both sentences and documents using tiered topics. [sent-74, score-0.483]
34 A sentence’s relatedness to summary text is tied to the document cluster’s user query. [sent-75, score-0.312]
35 4 Two-Tiered Topic Model - TTM Our base model, the two-tiered topic model (TTM), is inspired by the hierarchical topic model, PAM, proposed by Li and McCallum (2006). [sent-77, score-0.545]
36 PAM structures documents to represent and learn arbitrary, nested, and possibly sparse topic correlations using a directed acyclic graph (DAG). [sent-78, score-0.36]
37 Figure 1: Graphical model depiction of the two-tiered topic model (TTM) described in section §4. [sent-81, score-0.314]
38 High-level topics (Hk1=1...K1), representing topic correlations, are modeled as distributions over low-level topics (Tk2=1...K2). [sent-88, score-0.281]
39 Our goals are not so different: we aim to discover concepts from documents that would account for the general topics related to a user query; however, we want to relate this information to sentences. [sent-95, score-0.558]
40 We represent sentences S by the discovery of general (more abstract) to specific topics (Fig. [sent-96, score-0.399]
41 Similarly, we represent summary-unrelated (document-specific) sentences as corpus-specific distributions θ over background words wB (functional words like prepositions, etc. [sent-98, score-0.38]
42 Our two-tiered topic model for salient sentence discovery generates each word in the document (Algorithm 1) as follows: for a word wid in document d, a random variable xid is drawn, which determines if wid is query related, i. [sent-100, score-0.48]
43 e., wid either exists in the query or is related to the query. [sent-102, score-0.496]
44 Then sentence si is chosen uniformly at random (ysi∼ Uniform(si)) from sentences in the document containing wid (deterministic if there is only one sentence containing wid). [sent-104, score-0.817]
45 If a word is query/summary related, first a sentence s ∈ S and then a high-level (H) and a low-level (T) topic are sampled. [sent-110, score-0.286]
46 Every time an si is sampled for a query-related wid, we increment its count, a degree of sentence saliency. [sent-115, score-0.367]
47 Given that wid is related to a query, it is associated with two-tiered multinomial distributions: high-level H topics and low-level T topics. [sent-116, score-0.613]
48 A high-level topic Hki, itself a distribution over low-level topics T, is chosen first from a distribution specific to that si; then one low-level topic Tkj, a distribution over words, is chosen from Hki, and wid is generated from the sampled low-level topic. [sent-117, score-1.4]
49 If wid is not query-related, it is generated as a background word wB. [sent-118, score-0.479]
50 A sentence sampled for a query-related word is associated with a distribution over K1 high-level topics Hki, each of which is also associated with K2 low-level topics Tkj, each a multinomial over the lexical words of the corpus. [sent-121, score-0.72]
51 If wid exists in or is related to the query, then x = 1 deterministically; otherwise it is stochastically assigned x ∼ Bin(Ψ). [sent-149, score-0.529]
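A minimal sketch of the per-word draws just described, assuming numpy and placeholder parameter structures (psi, sent_topic_H, H_over_T); it illustrates the order of the draws rather than reproducing the authors' implementation or inference code.

```python
import numpy as np

def sample_latents_ttm(doc_sents, wid, query_words, psi, sent_topic_H, H_over_T, rng):
    """Illustrative TTM per-word step: draw x, then a sentence y, then a
    high-level topic H and a low-level topic T for the observed word wid."""
    # x = 1 is deterministic if wid occurs in / relates to the query,
    # otherwise x is drawn from the sentence-specific binomial psi
    x = 1 if wid in query_words else rng.binomial(1, psi)
    if x == 0:
        return {"x": 0}  # wid is treated as a background word wB

    # a sentence containing wid is chosen uniformly at random
    candidates = [i for i, sent in enumerate(doc_sents) if wid in sent]
    y = candidates[rng.integers(len(candidates))]

    # high-level topic from the sentence-specific distribution over K1 topics,
    # then a low-level topic from that high-level topic's distribution over K2 topics
    h = rng.choice(len(sent_topic_H[y]), p=sent_topic_H[y])
    t = rng.choice(H_over_T.shape[1], p=H_over_T[h])
    return {"x": 1, "sentence": y, "H": h, "T": t}
```

Every draw with x = 1 also increments the chosen sentence's saliency count, which is what the sentence scoring step below relies on.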
52 , the topic ”acquisition” is found to be more correlated with ”retail” than the ”network” topic given H1. [sent-155, score-0.461]
53 For each word, xid is sampled from a sentence-specific binomial ψ, which in turn has a smoothing prior η, to determine if the sampled word wid is (query) summary-related or document-specific. [sent-165, score-0.936]
54 Depending on xid, we either sample a sentence along with a high/low-level topic pair or just sample background words wB. [sent-166, score-0.361]
55 The probability distribution over sentence assignments, P(ysi = s|S), s ∈ S, is assumed to be uniform over the elements of S, and is deterministic if there is only one sentence in the document containing the corresponding word. [sent-167, score-0.298]
56 For each word we sample a high-level Hki and a low-level Tkj topic if the word is query related (xid = 1). [sent-169, score-0.356]
57 Note that the number of tiered topics in the model is fixed to K1 and K2, which is optimized with validation experiments. [sent-171, score-0.325]
58 For each sentence sj, j = 1...SD: scoreTTM(sj) ∝ #[wid ∈ sj, xid = 1] / nwj (1), where wid indicates a word in a document d that exists in sj and is sampled as summary-related based on the random indicator variable xid. [sent-182, score-0.99]
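A short sketch of how Eq. (1) could be computed once the indicator variables have been sampled; the data structures are assumptions made for illustration.

```python
def score_ttm(sentences, sampled_x):
    """Rank sentences by the fraction of their words sampled as
    summary-related (x = 1), following Eq. (1); illustrative only.

    sentences: list of lists of word ids (one list per sentence sj)
    sampled_x: dict mapping (sentence_index, word_id) -> 0 or 1
    """
    scores = {}
    for j, sent in enumerate(sentences):
        n_related = sum(sampled_x.get((j, w), 0) for w in sent)
        scores[j] = n_related / max(len(sent), 1)  # n_wj = number of words in sj
    # higher score = more summary-related words = more salient sentence
    return sorted(scores.items(), key=lambda kv: -kv[1])
```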
59 We compare TTM results on synthetic experiments against PAM (Li and McCallum, 2006), a similar topic model that clusters topics in a hierarchical structure, where super-topics are distributions over sub-topics. [sent-187, score-0.733]
60 We obtain sentence scores for PAM models by calculating the sub-topic significance (TS) based on super-topic correlations, and discover topic correlations over the entire document space (corpus wide). [sent-188, score-0.535]
61 So, sentences including such topics will have higher saliency scores, which we quantify by imposing the topic’s significance on the vocabulary: scorePAM(si) = (1/K2) Σk=1...K2 Πw∈si p(w|zk^sub) ∗ TS(zk) (3) Fig. [sent-196, score-0.332]
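For comparison, a sketch of the PAM-style baseline score in Eq. (3), computed in log space for numerical stability; the array layouts are assumptions.

```python
import numpy as np

def score_pam(sentence, sub_topic_word_probs, topic_significance):
    """Illustrative Eq. (3): average over K2 sub-topics of the product of
    word probabilities under each sub-topic, weighted by its significance TS.

    sentence: list of word ids
    sub_topic_word_probs: K2 x V array of p(w | z_k^sub)
    topic_significance: length-K2 array of TS(z_k)
    """
    K2 = len(sub_topic_word_probs)
    total = 0.0
    for k in range(K2):
        log_prod = sum(np.log(sub_topic_word_probs[k][w] + 1e-12) for w in sentence)
        total += np.exp(log_prod) * topic_significance[k]
    return total / K2
```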
62 The higher the human summary sentences are ranked, the better the model is in selecting the salient sentences. [sent-201, score-0.327]
63 5 Enriched Two-Tiered Topic Model Our model can discover words that are related to summary text using posteriors Pˆ(θH) and Pˆ(θT). (Figure 3: Graphical model depiction of the sentence-level enriched two-tiered model (ETTM), described in section §5.) [sent-206, score-0.451]
64 Each Hk1 is also represented as a distribution over general words WH, and indicates the degree of correlation between low-level topics, denoted by the boldness of the arrows. [sent-212, score-0.321]
65 TTM can discover topic correlations, but cannot differentiate if a word in a sentence is more general or specific given a query. [sent-215, score-0.398]
66 Sentences with general words would be more suitable to include in summary text compared to sentences containing specific words. [sent-216, score-0.345]
67 Sentences containing words that are sampled from high-level topics would be better candidates for summary text. [sent-222, score-0.555]
68 3), which samples words not only from low-level topics but also from high-level topics as well. [sent-224, score-0.438]
69 ETTM discovers three separate distributions over words: (i) high-level topics H as distributions over corpus-general words WH, (ii) low-level topics T as distributions over corpus-specific words WL, and (iii) document-specific distributions θ over background words wB. [sent-225, score-0.722]
70 Similar to TTM’s generative process, if wid is related to a given query, then x = 1 is deterministic; otherwise x ∈ {0, 1} is stochastically determined, i.e., whether wid should be sampled as a background word (wB) or through the hierarchical path. [sent-236, score-1.116]
71 We first sample a sentence si for wid uniformly at random from the sentences containing the word (ysi ∼ Uniform(si)). [sent-239, score-0.68]
72 At this stage we sample a level Lwid ∈ {1, 2} for wid to determine if it is a high-level word, e. [sent-240, score-0.46]
73 Each path through the DAG, defined by an H-T pair (total of K1K2 pairs), has a binomial ζK1K2 over which % of sentences are added to the generated summary text. [sent-243, score-0.394]
74 If the word is of a specific type (x = 0), then it is sampled from the background word distribution θ, a document-specific multinomial. [sent-248, score-0.384]
75 If the word is related to the query (x = 1), we sample a high- and low-level topic pair H−T, and an additional level L is sampled to determine which level of topics the word should be sampled from. [sent-252, score-0.889]
76 If L = 1 the word is sampled from the high-level topic; otherwise (L = 2) the word is corpus-specific and sampled from the low-level topic. [sent-255, score-0.336]
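The extra level variable is the main change ETTM makes to the per-word draws; below is a sketch under the same assumptions as the TTM sketch above, with zeta assumed to be a per-(H,T)-path binomial parameter.

```python
import numpy as np

def sample_latents_ettm(doc_sents, wid, query_words, psi, sent_topic_H,
                        H_over_T, zeta, rng):
    """Illustrative ETTM per-word step: like TTM, but after choosing an H-T
    pair a level L in {1, 2} decides whether the word comes from the
    high-level (corpus-general) or the low-level (corpus-specific) topic."""
    x = 1 if wid in query_words else rng.binomial(1, psi)
    if x == 0:
        return {"x": 0}  # background word wB

    candidates = [i for i, sent in enumerate(doc_sents) if wid in sent]
    y = candidates[rng.integers(len(candidates))]
    h = rng.choice(len(sent_topic_H[y]), p=sent_topic_H[y])
    t = rng.choice(H_over_T.shape[1], p=H_over_T[h])

    # L = 1 -> word drawn from the high-level topic (general word WH)
    # L = 2 -> word drawn from the low-level topic (specific word WL)
    L = 1 if rng.random() < zeta[h, t] else 2
    return {"x": 1, "sentence": y, "H": h, "T": t, "L": L}
```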
77 2 Summary Generation with ETTM For ETTM models, we extend the TTM sentence score to include the effect of the general words in sentences (as word sequences in language models), using the probabilities of the K1 high-level topic distributions φHk, k=1...K1. [sent-259, score-0.355]
78 The added term is (1/K1) Σk=1...K1 Πw∈si p(w|Hk), where p(w|Hk) is the probability of a word in si being generated from high-level topic Hk. [sent-263, score-0.353]
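A sketch of the extended score follows; how the high-level-topic term is combined with the TTM score is written here as a simple product, which is an assumption based on the description above rather than the paper's exact formula.

```python
import numpy as np

def score_ettm(sentence, ttm_score, high_topic_word_probs):
    """Illustrative ETTM sentence score: the TTM score combined with the
    average (over K1 high-level topics) of the product of word probabilities
    p(w | Hk) for the words in the sentence.

    sentence: list of word ids
    high_topic_word_probs: K1 x V array of p(w | Hk)
    """
    K1 = len(high_topic_word_probs)
    total = 0.0
    for k in range(K1):
        log_prod = sum(np.log(high_topic_word_probs[k][w] + 1e-12) for w in sentence)
        total += np.exp(log_prod)
    return ttm_score * total / K1
```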
79 , super topics and subtopics, where super-topics are distributions over abstract words. [sent-268, score-0.286]
80 Thus, ETTM is capable of capturing focused sentences with general words related to the main concepts of the documents, and much less redundant sentences containing concepts specific to the user query. [sent-271, score-0.614]
81 6 Final Experiments In this section, we qualitatively compare our models against state-of-the-art models and later apply an intrinsic evaluation of generated summaries on topical coherence and informativeness. [sent-272, score-0.396]
82 ROUGE Evaluations: We train each document cluster as a separate corpus to find the optimum parameters of each model and evaluate on test document clusters. [sent-280, score-0.335]
83 ROUGE is a commonly used measure, a standard DUC evaluation metric, which computes recall over various n-gram statistics from a model-generated summary against a set of human-generated summaries. [sent-281, score-0.302]
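As a reminder of what the metric computes, a simplified ROUGE-n recall sketch; the official ROUGE toolkit additionally supports stemming, stopword removal, and jackknifing over references, none of which are shown here.

```python
from collections import Counter

def rouge_n_recall(candidate, references, n=2):
    """Simplified ROUGE-n recall: clipped n-gram matches between a candidate
    summary and a set of human reference summaries, divided by the total
    number of reference n-grams."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand = ngrams(candidate)
    matched = total = 0
    for ref in references:
        ref_counts = ngrams(ref)
        matched += sum(min(count, cand[g]) for g, count in ref_counts.items())
        total += sum(ref_counts.values())
    return matched / total if total else 0.0
```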
84 , 2007): Utilizes human generated summaries to train a sentence ranking system using a classifier model; (ii) HIERSUM (Haghighi and Vanderwende, 2009): Based on hierarchical topic models. [sent-288, score-0.541]
85 Using an approximation for inference, sentences are greedily added to a summary so long as they decrease KL-divergence of the generated summary concept distributions from document word-frequency distributions. [sent-289, score-0.632]
86 , 2007): Two hierarchical topic models to discover high- and low-level concepts from documents, baselines for the synthetic experiments in §4 & §5. [sent-293, score-0.643]
87 Because HybHSum uses the human-generated summaries as supervision during model development and our systems do not, our performance is quite promising considering the generation is completely unsupervised, without seeing any human-generated summaries during training. [sent-297, score-0.466]
88 For topic models, bigrams tend to degenerate due to generating inconsistent bags of bi-grams (Wallach, 2006). [sent-300, score-0.255]
89 We compare our best model, ETTM, to the results of PAM, our benchmark model in the synthetic experiments, as well as the hybrid hierarchical summarization model, hLDA (Celikyilmaz and Hakkani-Tur, 2010). [sent-304, score-0.349]
90 Human annotators are given two sets of summary text for each document set, generated from either one of the two approaches: best ETTM and PAM or best ETTM and HybHSum models. [sent-305, score-0.331]
91 The annotators are asked to mark the better summary according to five criteria: non-redundancy (which summary is less redundant), coherence (which summary is more coherent), focus and readability (content and no unnecessary details), responsiveness and overall performance. [sent-306, score-0.571]
92 We asked 3 annotators to rate DUC2007 predicted summaries (45 summary pairs per annotator). [sent-307, score-0.314]
93 The participants rated ETTM-generated summaries as more coherent and focused compared to PAM, and the results are statistically significant (based on a t-test at the 95% confidence level), indicating that ETTM summaries are rated significantly better. [sent-312, score-0.487]
94 7 Conclusion We introduce two new models for extracting topically coherent sentences from documents, an important property in extractive multi-document summarization systems. [sent-315, score-0.438]
95 Our models combine approaches from the hierarchical topic models. [sent-316, score-0.342]
96 They emphasize capturing correlated semantic concepts in documents as well as characterizing general and specific words, in order to identify topically coherent sentences in documents. [sent-319, score-0.559]
97 We showed empirically that a fully unsupervised model for extracting general sentences performs well at the summarization task, using datasets that were originally used in automatic summarization system challenges. [sent-320, score-0.389]
98 The success of our model can be traced to its capability of directly capturing coherent topics in documents, which makes it able to identify salient sentences. [sent-321, score-0.387]
99 Hierarchical topic models and the nested Chinese restaurant process. [sent-353, score-0.255]
100 Query-focused summarization by combining topic model and affinity propagation. [sent-369, score-0.371]
wordName wordTfidf (topN-words)
[('wid', 0.394), ('ettm', 0.372), ('ttm', 0.312), ('pam', 0.223), ('topics', 0.219), ('topic', 0.214), ('summary', 0.164), ('summaries', 0.15), ('sampled', 0.144), ('summarization', 0.127), ('concepts', 0.124), ('document', 0.113), ('starbucks', 0.112), ('xid', 0.112), ('query', 0.102), ('celikyilmaz', 0.099), ('topically', 0.094), ('tkj', 0.093), ('vanderwende', 0.093), ('hierarchical', 0.087), ('coffee', 0.085), ('si', 0.085), ('documents', 0.08), ('coherence', 0.079), ('nenkova', 0.078), ('tiered', 0.076), ('synthetic', 0.075), ('coherent', 0.075), ('hki', 0.074), ('hybhsum', 0.074), ('zskub', 0.074), ('wb', 0.071), ('sentences', 0.07), ('distributions', 0.067), ('duc', 0.066), ('correlations', 0.066), ('discover', 0.065), ('salient', 0.063), ('sj', 0.063), ('ysi', 0.06), ('binomial', 0.058), ('dag', 0.057), ('rouge', 0.056), ('pttm', 0.056), ('ventures', 0.056), ('generated', 0.054), ('tang', 0.054), ('blei', 0.053), ('redundancy', 0.053), ('pachinko', 0.049), ('specific', 0.048), ('path', 0.048), ('barzilay', 0.045), ('cluster', 0.045), ('saliency', 0.043), ('qualitative', 0.042), ('clusters', 0.041), ('models', 0.041), ('depiction', 0.04), ('mimno', 0.04), ('sample', 0.04), ('haghighi', 0.04), ('li', 0.04), ('hierarchal', 0.037), ('lowlevel', 0.037), ('phthy', 0.037), ('zsd', 0.037), ('microsoft', 0.037), ('sentence', 0.036), ('general', 0.035), ('user', 0.035), ('evaluations', 0.034), ('optimum', 0.034), ('correlated', 0.033), ('generative', 0.033), ('eisenstein', 0.033), ('conroy', 0.033), ('sip', 0.033), ('stochastically', 0.033), ('mccallum', 0.031), ('extractive', 0.031), ('topical', 0.031), ('background', 0.031), ('ts', 0.031), ('harabagiu', 0.03), ('enriched', 0.03), ('model', 0.03), ('rated', 0.029), ('generation', 0.028), ('diversify', 0.028), ('highlevel', 0.028), ('containing', 0.028), ('discovery', 0.027), ('uniformly', 0.027), ('ds', 0.027), ('daum', 0.026), ('zk', 0.026), ('labs', 0.026), ('level', 0.026), ('acyclic', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999958 98 acl-2011-Discovery of Topically Coherent Sentences for Extractive Summarization
Author: Asli Celikyilmaz ; Dilek Hakkani-Tur
Abstract: Extractive methods for multi-document summarization are mainly governed by information overlap, coherence, and content constraints. We present an unsupervised probabilistic approach to model the hidden abstract concepts across documents as well as the correlation between these concepts, to generate topically coherent and non-redundant summaries. Based on human evaluations our models generate summaries with higher linguistic quality in terms of coherence, readability, and redundancy compared to benchmark systems. Although our system is unsupervised and optimized for topical coherence, we achieve a 44.1 ROUGE on the DUC-07 test set, roughly in the range of state-of-the-art supervised models.
2 0.22003561 18 acl-2011-A Latent Topic Extracting Method based on Events in a Document and its Application
Author: Risa Kitajima ; Ichiro Kobayashi
Abstract: Recently, several latent topic analysis methods such as LSI, pLSI, and LDA have been widely used for text analysis. However, those methods basically assign topics to words, but do not account for the events in a document. With this background, in this paper, we propose a latent topic extracting method which assigns topics to events. We also show that our proposed method is useful to generate a document summary based on a latent topic.
3 0.21363746 161 acl-2011-Identifying Word Translations from Comparable Corpora Using Latent Topic Models
Author: Ivan Vulic ; Wim De Smet ; Marie-Francine Moens
Abstract: A topic model outputs a set of multinomial distributions over words for each topic. In this paper, we investigate the value of bilingual topic models, i.e., a bilingual Latent Dirichlet Allocation model for finding translations of terms in comparable corpora without using any linguistic resources. Experiments on a document-aligned English-Italian Wikipedia corpus confirm that the developed methods which only use knowledge from word-topic distributions outperform methods based on similarity measures in the original word-document space. The best results, obtained by combining knowledge from wordtopic distributions with similarity measures in the original space, are also reported.
4 0.19921656 52 acl-2011-Automatic Labelling of Topic Models
Author: Jey Han Lau ; Karl Grieser ; David Newman ; Timothy Baldwin
Abstract: We propose a method for automatically labelling topics learned via LDA topic models. We generate our label candidate set from the top-ranking topic terms, titles of Wikipedia articles containing the top-ranking topic terms, and sub-phrases extracted from the Wikipedia article titles. We rank the label candidates using a combination of association measures and lexical features, optionally fed into a supervised ranking model. Our method is shown to perform strongly over four independent sets of topics, significantly better than a benchmark method.
5 0.178811 14 acl-2011-A Hierarchical Model of Web Summaries
Author: Yves Petinot ; Kathleen McKeown ; Kapil Thadani
Abstract: We investigate the relevance of hierarchical topic models to represent the content of Web gists. We focus our attention on DMOZ, a popular Web directory, and propose two algorithms to infer such a model from its manually-curated hierarchy of categories. Our first approach, based on information-theoretic grounds, uses an algorithm similar to recursive feature selection. Our second approach is fully Bayesian and derived from the more general model, hierarchical LDA. We evaluate the performance of both models against a flat 1-gram baseline and show improvements in terms of perplexity over held-out data.
6 0.17386283 251 acl-2011-Probabilistic Document Modeling for Syntax Removal in Text Summarization
7 0.17183052 117 acl-2011-Entity Set Expansion using Topic information
8 0.17157143 76 acl-2011-Comparative News Summarization Using Linear Programming
9 0.1687566 21 acl-2011-A Pilot Study of Opinion Summarization in Conversations
10 0.16484419 287 acl-2011-Structural Topic Model for Latent Topical Structure Analysis
11 0.15950547 178 acl-2011-Interactive Topic Modeling
12 0.14741349 326 acl-2011-Using Bilingual Information for Cross-Language Document Summarization
13 0.14425167 270 acl-2011-SciSumm: A Multi-Document Summarization System for Scientific Articles
14 0.13617344 280 acl-2011-Sentence Ordering Driven by Local and Global Coherence for Summary Generation
15 0.13001595 187 acl-2011-Jointly Learning to Extract and Compress
16 0.12822582 308 acl-2011-Towards a Framework for Abstractive Summarization of Multimodal Documents
17 0.11496936 71 acl-2011-Coherent Citation-Based Summarization of Scientific Papers
18 0.11201226 47 acl-2011-Automatic Assessment of Coverage Quality in Intelligence Reports
19 0.10352671 137 acl-2011-Fine-Grained Class Label Markup of Search Queries
20 0.10148533 4 acl-2011-A Class of Submodular Functions for Document Summarization
topicId topicWeight
[(0, 0.204), (1, 0.14), (2, -0.102), (3, 0.2), (4, -0.074), (5, -0.145), (6, -0.182), (7, 0.291), (8, 0.019), (9, 0.033), (10, -0.097), (11, 0.05), (12, -0.035), (13, -0.026), (14, 0.024), (15, -0.019), (16, -0.026), (17, 0.027), (18, -0.025), (19, 0.093), (20, -0.075), (21, 0.05), (22, 0.018), (23, 0.013), (24, -0.015), (25, 0.017), (26, 0.018), (27, -0.033), (28, -0.028), (29, -0.019), (30, -0.053), (31, -0.031), (32, 0.039), (33, -0.01), (34, -0.035), (35, -0.003), (36, 0.082), (37, -0.004), (38, -0.032), (39, 0.026), (40, 0.02), (41, 0.043), (42, 0.003), (43, -0.032), (44, -0.054), (45, 0.007), (46, -0.007), (47, 0.003), (48, 0.027), (49, 0.003)]
simIndex simValue paperId paperTitle
same-paper 1 0.96862453 98 acl-2011-Discovery of Topically Coherent Sentences for Extractive Summarization
Author: Asli Celikyilmaz ; Dilek Hakkani-Tur
Abstract: Extractive methods for multi-document summarization are mainly governed by information overlap, coherence, and content constraints. We present an unsupervised probabilistic approach to model the hidden abstract concepts across documents as well as the correlation between these concepts, to generate topically coherent and non-redundant summaries. Based on human evaluations our models generate summaries with higher linguistic quality in terms of coherence, readability, and redundancy compared to benchmark systems. Although our system is unsupervised and optimized for topical coherence, we achieve a 44.1 ROUGE on the DUC-07 test set, roughly in the range of state-of-the-art supervised models.
2 0.8765744 251 acl-2011-Probabilistic Document Modeling for Syntax Removal in Text Summarization
Author: William M. Darling ; Fei Song
Abstract: Statistical approaches to automatic text summarization based on term frequency continue to perform on par with more complex summarization methods. To compute useful frequency statistics, however, the semantically important words must be separated from the low-content function words. The standard approach of using an a priori stopword list tends to result in both undercoverage, where syntactical words are seen as semantically relevant, and overcoverage, where words related to content are ignored. We present a generative probabilistic modeling approach to building content distributions for use with statistical multi-document summarization where the syntax words are learned directly from the data with a Hidden Markov Model and are thereby deemphasized in the term frequency statistics. This approach is compared to both a stopword-list and POS-tagging approach and our method demonstrates improved coverage on the DUC 2006 and TAC 2010 datasets using the ROUGE metric.
3 0.83903378 18 acl-2011-A Latent Topic Extracting Method based on Events in a Document and its Application
Author: Risa Kitajima ; Ichiro Kobayashi
Abstract: Recently, several latent topic analysis methods such as LSI, pLSI, and LDA have been widely used for text analysis. However, those methods basically assign topics to words, but do not account for the events in a document. With this background, in this paper, we propose a latent topic extracting method which assigns topics to events. We also show that our proposed method is useful to generate a document summary based on a latent topic.
4 0.82676858 287 acl-2011-Structural Topic Model for Latent Topical Structure Analysis
Author: Hongning Wang ; Duo Zhang ; ChengXiang Zhai
Abstract: Topic models have been successfully applied to many document analysis tasks to discover topics embedded in text. However, existing topic models generally cannot capture the latent topical structures in documents. Since languages are intrinsically cohesive and coherent, modeling and discovering latent topical transition structures within documents would be beneficial for many text analysis tasks. In this work, we propose a new topic model, Structural Topic Model, which simultaneously discovers topics and reveals the latent topical structures in text through explicitly modeling topical transitions with a latent first-order Markov chain. Experiment results show that the proposed Structural Topic Model can effectively discover topical structures in text, and the identified structures significantly improve the performance of tasks such as sentence annotation and sentence ordering. ,
5 0.79030079 14 acl-2011-A Hierarchical Model of Web Summaries
Author: Yves Petinot ; Kathleen McKeown ; Kapil Thadani
Abstract: We investigate the relevance of hierarchical topic models to represent the content of Web gists. We focus our attention on DMOZ, a popular Web directory, and propose two algorithms to infer such a model from its manually-curated hierarchy of categories. Our first approach, based on information-theoretic grounds, uses an algorithm similar to recursive feature selection. Our second approach is fully Bayesian and derived from the more general model, hierarchical LDA. We evaluate the performance of both models against a flat 1-gram baseline and show improvements in terms of perplexity over held-out data.
6 0.73251486 178 acl-2011-Interactive Topic Modeling
7 0.72943002 76 acl-2011-Comparative News Summarization Using Linear Programming
8 0.72476995 52 acl-2011-Automatic Labelling of Topic Models
9 0.67776829 270 acl-2011-SciSumm: A Multi-Document Summarization System for Scientific Articles
10 0.67773288 161 acl-2011-Identifying Word Translations from Comparable Corpora Using Latent Topic Models
11 0.67297536 305 acl-2011-Topical Keyphrase Extraction from Twitter
12 0.65725607 21 acl-2011-A Pilot Study of Opinion Summarization in Conversations
13 0.65407366 117 acl-2011-Entity Set Expansion using Topic information
14 0.6490466 308 acl-2011-Towards a Framework for Abstractive Summarization of Multimodal Documents
15 0.62414211 47 acl-2011-Automatic Assessment of Coverage Quality in Intelligence Reports
16 0.59202361 187 acl-2011-Jointly Learning to Extract and Compress
17 0.58592373 201 acl-2011-Learning From Collective Human Behavior to Introduce Diversity in Lexical Choice
18 0.57047796 255 acl-2011-Query Snowball: A Co-occurrence-based Approach to Multi-document Summarization for Question Answering
19 0.56721312 326 acl-2011-Using Bilingual Information for Cross-Language Document Summarization
20 0.53134823 80 acl-2011-ConsentCanvas: Automatic Texturing for Improved Readability in End-User License Agreements
topicId topicWeight
[(5, 0.018), (17, 0.045), (26, 0.019), (37, 0.096), (39, 0.067), (41, 0.05), (45, 0.256), (55, 0.026), (59, 0.051), (72, 0.029), (91, 0.043), (96, 0.185), (97, 0.014), (98, 0.015)]
simIndex simValue paperId paperTitle
Author: Roger Levy
Abstract: A system making optimal use of available information in incremental language comprehension might be expected to use linguistic knowledge together with current input to revise beliefs about previous input. Under some circumstances, such an error-correction capability might induce comprehenders to adopt grammatical analyses that are inconsistent with the true input. Here we present a formal model of how such input-unfaithful garden paths may be adopted and the difficulty incurred by their subsequent disconfirmation, combining a rational noisy-channel model of syntactic comprehension under uncertain input with the surprisal theory of incremental processing difficulty. We also present a behavioral experiment confirming the key empirical predictions of the theory.
same-paper 2 0.79224491 98 acl-2011-Discovery of Topically Coherent Sentences for Extractive Summarization
Author: Asli Celikyilmaz ; Dilek Hakkani-Tur
Abstract: Extractive methods for multi-document summarization are mainly governed by information overlap, coherence, and content constraints. We present an unsupervised probabilistic approach to model the hidden abstract concepts across documents as well as the correlation between these concepts, to generate topically coherent and non-redundant summaries. Based on human evaluations our models generate summaries with higher linguistic quality in terms of coherence, readability, and redundancy compared to benchmark systems. Although our system is unsupervised and optimized for topical coherence, we achieve a 44.1 ROUGE on the DUC-07 test set, roughly in the range of state-of-the-art supervised models.
3 0.77019078 163 acl-2011-Improved Modeling of Out-Of-Vocabulary Words Using Morphological Classes
Author: Thomas Mueller ; Hinrich Schuetze
Abstract: We present a class-based language model that clusters rare words of similar morphology together. The model improves the prediction of words after histories containing outof-vocabulary words. The morphological features used are obtained without the use of labeled data. The perplexity improvement compared to a state of the art Kneser-Ney model is 4% overall and 81% on unknown histories.
4 0.75501192 90 acl-2011-Crowdsourcing Translation: Professional Quality from Non-Professionals
Author: Omar F. Zaidan ; Chris Callison-Burch
Abstract: Naively collecting translations by crowdsourcing the task to non-professional translators yields disfluent, low-quality results if no quality control is exercised. We demonstrate a variety of mechanisms that increase the translation quality to near professional levels. Specifically, we solicit redundant translations and edits to them, and automatically select the best output among them. We propose a set of features that model both the translations and the translators, such as country of residence, LM perplexity of the translation, edit rate from the other translations, and (optionally) calibration against professional translators. Using these features to score the collected translations, we are able to discriminate between acceptable and unacceptable translations. We recreate the NIST 2009 Urdu-toEnglish evaluation set with Mechanical Turk, and quantitatively show that our models are able to select translations within the range of quality that we expect from professional trans- lators. The total cost is more than an order of magnitude lower than professional translation.
5 0.67981195 241 acl-2011-Parsing the Internal Structure of Words: A New Paradigm for Chinese Word Segmentation
Author: Zhongguo Li
Abstract: Lots of Chinese characters are very productive in that they can form many structured words either as prefixes or as suffixes. Previous research in Chinese word segmentation mainly focused on identifying only the word boundaries without considering the rich internal structures of many words. In this paper we argue that this is unsatisfying in many ways, both practically and theoretically. Instead, we propose that word structures should be recovered in morphological analysis. An elegant approach for doing this is given and the result is shown to be promising enough for encouraging further effort in this direction. Our probability model is trained with the Penn Chinese Treebank and actually is able to parse both word and phrase structures in a unified way.
6 0.67835492 137 acl-2011-Fine-Grained Class Label Markup of Search Queries
7 0.67818809 324 acl-2011-Unsupervised Semantic Role Induction via Split-Merge Clustering
8 0.67612994 28 acl-2011-A Statistical Tree Annotator and Its Applications
9 0.67483354 3 acl-2011-A Bayesian Model for Unsupervised Semantic Parsing
10 0.67472261 117 acl-2011-Entity Set Expansion using Topic information
11 0.67469382 318 acl-2011-Unsupervised Bilingual Morpheme Segmentation and Alignment with Context-rich Hidden Semi-Markov Models
12 0.67318159 327 acl-2011-Using Bilingual Parallel Corpora for Cross-Lingual Textual Entailment
13 0.67228055 170 acl-2011-In-domain Relation Discovery with Meta-constraints via Posterior Regularization
14 0.67212427 190 acl-2011-Knowledge-Based Weak Supervision for Information Extraction of Overlapping Relations
15 0.67181265 202 acl-2011-Learning Hierarchical Translation Structure with Linguistic Annotations
16 0.67170674 187 acl-2011-Jointly Learning to Extract and Compress
17 0.67109662 5 acl-2011-A Comparison of Loopy Belief Propagation and Dual Decomposition for Integrated CCG Supertagging and Parsing
18 0.67091322 161 acl-2011-Identifying Word Translations from Comparable Corpora Using Latent Topic Models
19 0.67032945 178 acl-2011-Interactive Topic Modeling
20 0.66970891 207 acl-2011-Learning to Win by Reading Manuals in a Monte-Carlo Framework