acl acl2011 acl2011-71 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Amjad Abu-Jbara ; Dragomir Radev
Abstract: In citation-based summarization, text written by several researchers is leveraged to identify the important aspects of a target paper. Previous work on this problem focused almost exclusively on its extraction aspect (i.e. selecting a representative set of citation sentences that highlight the contribution of the target paper). Meanwhile, the fluency of the produced summaries has been mostly ignored. For example, diversity, readability, cohesion, and ordering of the sentences included in the summary have not been thoroughly considered. This resulted in noisy and confusing summaries. In this work, we present an approach for producing readable and cohesive citation-based summaries. Our experiments show that the proposed approach outperforms several baselines in terms of both extraction quality and fluency.
Reference: text
sentIndex sentText sentNum sentScore
1 selecting a representative set of citation sentences that highlight the contribution of the target paper). [sent-5, score-0.943]
2 For example, diversity, readability, cohesion, and ordering of the sentences included in the summary have not been thoroughly considered. [sent-7, score-0.328]
3 When a reference appears in a scientific paper, it is often accompanied by a span of text describing the work being cited. [sent-14, score-0.291]
4 We name the sentence that contains an explicit reference to another paper a citation sentence. [sent-15, score-0.995]
5 Citation sentences usually highlight the most important aspects of the cited paper such as the research problem it addresses, the method it proposes, the good results it reports, and even its drawbacks and limitations. [sent-16, score-0.241]
6 By aggregating all the citation sentences that cite a paper, we have a rich source of information about the cited paper. [sent-17, score-0.991]
7 One way to make use of these sentences is creating a summary of the target paper. [sent-20, score-0.363]
8 This summary is different from the abstract or a summary generated from the paper itself. [sent-21, score-0.266]
9 While the abstract represents the author’s point of view, the citation summary is the summation of multiple scholars’ viewpoints. [sent-22, score-0.817]
10 The task of summarizing a scientific paper using its set of citation sentences is called citation-based summarization. [sent-23, score-1.004]
11 analyzing the collection of citation sentences and selecting a representative subset that covers the main aspects of the paper. [sent-30, score-0.954]
12 The cohesion and the readability of the produced summaries have been mostly ignored. [sent-31, score-0.366]
13 In this work, we focus on the coherence and readability aspects of the problem. [sent-33, score-0.252]
14 Our experiments show that our approach produces better summaries than several baseline summarization systems. [sent-35, score-0.268]
15 Association for Computational Linguistics, pages 500–509, 2011. 2 Related Work The idea of analyzing and utilizing citation information is far from new. [sent-43, score-0.684]
16 Nanba and Okumura (2000) analyzed citation sentences and automatically categorized citations into three groups using 160 pre-defined phrasebased rules. [sent-47, score-0.902]
17 They also used citation categorization to support a system for writing surveys (Nanba and Okumura, 1999). [sent-48, score-0.684]
18 Newman (2001) analyzed the structure of the citation networks. [sent-49, score-0.684]
19 Siddharthan and Teufel (2007) proposed a method for determining the scientific attribution of an article by analyzing citation sentences. [sent-52, score-0.786]
20 (2008) performed a study on citation summaries and their importance. [sent-56, score-0.883]
21 They concluded that citation summaries are more focused and contain more information than abstracts. [sent-57, score-0.883]
22 (2009) suggested using citation information to generate surveys of scientific paradigms. [sent-59, score-0.786]
23 (2010) proposed a citation-based summarization method that first extracts a number of important keyphrases from the set of citation sentences, and then finds the best subset of sentences that covers as many keyphrases as possible. [sent-63, score-1.035]
24 3 Motivation The coherence and readability of citation-based summaries are impeded by several factors. [sent-65, score-0.408]
25 First, many citation sentences cite multiple papers besides the target. [sent-66, score-1.043]
26 For example, the following is a citation sentence that appeared in the NLP literature and talked about Resnik’s (1999) work. [sent-67, score-0.777]
27 Including the irrelevant fragments in the summary causes several problems. [sent-71, score-0.253]
28 First, these fragments take space in the summary while being irrelevant and less important. [sent-73, score-0.253]
29 Second, including these fragments in the summary breaks the context and, hence, degrades the readability and confuses the reader. [sent-74, score-0.338]
30 Third, the existence of irrelevant fragments in a sentence makes the ranking algorithm assign a low weight to it although the relevant fragment may cover an aspect of the paper that no other sentence covers. [sent-75, score-0.465]
31 For example, the following are two other citation sentences for Resnik (1999). [sent-77, score-0.85]
32 If these two sentences are to be included in the summary, the reasonable ordering would be to put the second sentence first. [sent-80, score-0.288]
33 Thirdly, in some instances of citation sentences, the reference is not a syntactic constituent in the sentence. [sent-81, score-0.873]
34 For example, in sentence (2) above, the reference could be safely removed from the sentence without hurting its grammaticality. [sent-83, score-0.404]
35 In other cases (e.g., sentence (3) above), the reference is a syntactic constituent of the sentence and removing it makes the sentence ungrammatical. [sent-86, score-0.468]
36 However, in certain cases, the reference could be replaced with a suitable pronoun (i. [sent-87, score-0.306]
37 Finally, a significant number of citation sentences are not suitable for summarization (Teufel et al. [sent-91, score-0.959]
38 Teufel (2007) reported that a significant number of citation sentences (67% of the sentences in her dataset) were of this type. [sent-100, score-1.016]
39 This sentence alone does not provide any valuable information about Eisner’s paper and should not be added to the summary unless its context is extracted and included in the summary as well. [sent-104, score-0.359]
40 4 Approach In this section we describe a system that takes a scientific paper and a set of citation sentences that cite it as input, and outputs a citation summary of the paper. [sent-106, score-1.875]
41 In the first stage, the citation sentences are preprocessed to rule out the unsuitable sentences and the irrelevant fragments of sentences. [sent-108, score-1.255]
42 In the second stage, a number of citation sentences that cover the various aspects of the paper are selected. [sent-109, score-0.935]
43 In the last stage, the selected sentences are post-processed to enhance the readability of the summary. [sent-110, score-0.301]
44 1 Preprocessing The aim of this stage is to determine which pieces of text (sentences or fragments of sentences) should be considered for selection in the next stage and which ones should be excluded. [sent-113, score-0.263]
45 This stage involves three tasks: reference tagging, reference scope identification, and sentence filtering. [sent-114, score-0.668]
46 1 Reference Tagging A citation sentence contains one or more references. [sent-117, score-0.777]
47 The reference to the target is given a different tag than the references to other papers. [sent-121, score-0.253]
48 The following example shows a citation sentence with all the references tagged and the target reference given a different tag. [sent-122, score-1.062]
49 the fragment of the citation sentence that corresponds to the target paper. [sent-128, score-0.906]
50 We define the scope of a reference as the shortest fragment of the citation sentence that contains the reference and could form a grammatical sentence if the rest of the sentence was removed. [sent-129, score-1.522]
51 Since the parser is not trained on citation sentences, we replace the references with placeholders before passing the sentence to the parser. [sent-132, score-0.777]
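The placeholder substitution can be sketched as follows; the token format (`REF0`, `REF1`, ...) and the helper names are illustrative assumptions, not the authors' implementation:

```python
# Each tagged reference string is replaced with an opaque placeholder token
# before parsing, and the mapping is kept so the original references can be
# restored in the parser's output.
def mask_references(sentence, references):
    mapping = {}
    for i, ref in enumerate(references):
        placeholder = f"REF{i}"
        sentence = sentence.replace(ref, placeholder, 1)
        mapping[placeholder] = ref
    return sentence, mapping

def unmask(text, mapping):
    for placeholder, ref in mapping.items():
        text = text.replace(placeholder, ref, 1)
    return text
```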
52 Figure 1: An example showing the scope of a target reference. We extract the scope of the reference from the parse tree as follows. [sent-134, score-0.674]
53 We find the smallest subtree that is rooted at an S node (sentence clause node) and contains the target reference node. [sent-135, score-0.353]
54 For example, the parse tree shown in Figure 1 suggests that the scope of the reference is: Resnik (1999) describes a method for mining the web for bilingual texts. [sent-138, score-0.34]
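The smallest-S-subtree rule can be sketched over a minimal `(label, children)` tuple representation of the parse tree. `TREF` below is an assumed placeholder for the target reference; real parser output (e.g. an NLTK tree) would be used in practice.

```python
def leaves(tree):
    label, children = tree
    if not children:
        return [label]
    return [leaf for child in children for leaf in leaves(child)]

def smallest_s_scope(tree, target="TREF"):
    """Return the leaves of the smallest subtree rooted at an S node that
    contains the target-reference leaf, or None if no such node exists."""
    label, children = tree
    if target not in leaves(tree):
        return None
    # Prefer the deepest qualifying S clause below this node.
    for child in children:
        scope = smallest_s_scope(child, target)
        if scope is not None:
            return scope
    return leaves(tree) if label == "S" else None
```

On a parse where the target reference sits inside an embedded clause, the function returns the words of the inner S node only, matching the scope definition above.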
55 Formally, we classify the citation sentences into two classes: suitable and unsuitable sentences. [sent-147, score-1.041]
56 2 Extraction In the first stage, the sentences and sentence fragments that are not useful for our summarization task are ruled out. [sent-154, score-0.398]
57 The input to this stage is a set of citation sentences that are believed to be suitable for the summary. [sent-155, score-0.971]
58 The sentences are selected based on these three main properties: First, they should cover diverse aspects of the paper. [sent-157, score-0.251]
59 Second, the sentences that cover the same aspect should not contain redundant information. [sent-158, score-0.26]
60 For example, if two sentences talk about the drawbacks of the target paper, one sentence can mention the computational inefficiency, while the other criticizes the assumptions the paper makes. [sent-159, score-0.355]
61 Third, the sentences should cover as many important facts about the target paper as possible using minimal text. [sent-160, score-0.272]
62 In this stage, the summary sentences are selected in three steps. [sent-161, score-0.299]
63 In the second step, we cluster the sentences within each category into clusters of similar sentences. [sent-163, score-0.346]
64 The summary sentences are selected based on the classification, the clustering, and the LexRank values. [sent-165, score-0.299]
65 1 Functional Category Classification We classify the citation sentences into the five categories mentioned above using a machine learning technique. [sent-168, score-0.919]
66 A classification model is trained on a number of features (Table 2) extracted from a labeled set of citation sentences. [sent-169, score-0.714]
67 2 Sentence Clustering In the previous step we determined the category of each citation sentence. [sent-173, score-0.807]
68 It is very likely that sentences from the same category contain similar or overlapping information. [sent-174, score-0.258]
69 For example, Sentences (6), (7), and (8) below appear in the set of citation sentences that cite Goldwater and Griffiths (2007). [sent-175, score-0.956]
70 Clustering divides the sentences of each category into groups of similar sentences. [sent-181, score-0.258]
71 Clusters within each category are ordered by the number of sentences in them whereas the sentences of each cluster are ordered by their LexRank values. [sent-198, score-0.549]
If the desired length of the summary is 3 sentences, the selected sentences will be in order S1, S12, then S18. [sent-202, score-0.299]
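A hedged sketch of this selection order: clusters are visited largest-first within each category, sentences within a cluster are ranked by LexRank, and one sentence is taken per cluster per round until the length budget is met. The data layout (nested lists of (sentence, score) pairs) is an assumption for illustration, not the authors' implementation.

```python
def select_summary(categories, budget):
    """categories: list (one per functional category) of clusters; each
    cluster is a list of (sentence, lexrank_score) pairs."""
    # Order clusters by size within each category, sentences by LexRank.
    ordered = []
    for clusters in categories:
        for cluster in sorted(clusters, key=len, reverse=True):
            ranked = sorted(cluster, key=lambda pair: pair[1], reverse=True)
            ordered.append([sentence for sentence, _ in ranked])
    # Take the best sentence of each cluster in turn, then the second-best,
    # and so on, until the length budget is met.
    summary, round_idx = [], 0
    while len(summary) < budget and any(round_idx < len(c) for c in ordered):
        for cluster in ordered:
            if round_idx < len(cluster) and len(summary) < budget:
                summary.append(cluster[round_idx])
        round_idx += 1
    return summary
```

With two clusters in one category and one in another, a budget of 3 reproduces the S1, S12, S18 ordering described above.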
73 Each citation sentence will have the target reference (the authors' names and the publication year) mentioned at least once. [sent-206, score-1.074]
74 The reference could be either syntactically and semantically part of the sentence (e. [sent-207, score-0.282]
75 In the following sentences, we either replace the reference with a suitable personal pronoun or remove it. [sent-214, score-0.31]
76 The reference is replaced with a pronoun if it is part of the sentence and this replacement does not make the sentence ungrammatical. [sent-215, score-0.491]
77 To determine whether a reference is part of the sentence or not, we again use a machine learning approach. [sent-218, score-0.282]
78 If a reference is to be replaced, and the paper has one author, we use "he/she" (we do not know if the author is male or female). [sent-222, score-0.25]
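The replacement rule can be sketched as below; the use of "they" for multiple authors and the whitespace cleanup are assumptions beyond what the text states, and the constituency decision is taken as an input (the paper determines it with a classifier).

```python
def rewrite_reference(sentence, ref, is_constituent, n_authors):
    if not is_constituent:
        # The reference can be dropped without hurting grammaticality.
        return sentence.replace(ref, "").replace("  ", " ").strip()
    # Otherwise substitute a pronoun; "he/she" for a single author follows
    # the text above, "they" for multiple authors is an assumption.
    pronoun = "he/she" if n_authors == 1 else "they"
    return sentence.replace(ref, pronoun, 1)
```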
79 Then we evaluate the summaries that our system generates in terms of extraction quality. [sent-226, score-0.241]
80 AAN provides all citation information from within the network including the citation network, the citation sentences, and the citation context for each paper. [sent-232, score-2.794]
81 The papers have a variable number of citation sentences, ranging from 15 to 348. [sent-234, score-0.771]
82 The total number of citation sentences in the dataset is 4,335. [sent-235, score-0.85]
83 The agreement among the three annotators on distinguishing the unsuitable sentences from the other five categories is 0. [sent-243, score-0.322]
84 The agreement on classifying the sentences into the five functional categories is 0. [sent-246, score-0.245]
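The agreement figures reported here are kappa-style statistics. As a reference point, a minimal two-annotator Cohen's kappa is sketched below; the three-annotator setting in the paper would use Fleiss' kappa or averaged pairwise values, so this is an illustration rather than the exact statistic used.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the same items: observed
    agreement corrected for the agreement expected from the marginals."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    expected = sum((labels_a.count(c) / n) * (labels_b.count(c) / n)
                   for c in categories)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)
```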
85 We asked humans with a good background in NLP (the papers' topic) to generate a readable, coherent summary for each paper in the set using its citation sentences as the source text. [sent-250, score-1.177]
86 We asked them to fix the length of the summaries to 5 sentences. [sent-251, score-0.234]
87 2 Component Evaluation Reference Tagging and Reference Scope Identification Evaluation: We ran our reference tagging and scope identification components on the 2,284 sentences in dataset1. [sent-254, score-0.576]
88 Then, we went through the tagged sentences and the extracted scopes, and counted the number of correctly/incorrectly tagged (extracted)/missed references (scopes). [sent-255, score-0.23]
89 The reference to the target paper was tagged correctly in all the sentences. [sent-265, score-0.285]
90 Our scope identification component extracted the scope of target references with good precision (86. [sent-266, score-0.371]
91 In fact, extracting a useful scope for a reference requires more than just finding a grammatical substring. [sent-269, score-0.305]
92 We use our system to generate summaries for each of the 30 papers in dataset2. [sent-285, score-0.286]
93 We also generate summaries for the papers using a number of baseline systems (described in Section 5. [sent-286, score-0.286]
94 In the first baseline, the sentences are selected randomly from the set of citation sentences and added to the summary. [sent-294, score-1.016]
95 The third baseline is LexRank (Erkan and Radev, 2004) run on the entire set of citation sentences of the target paper. [sent-297, score-0.914]
96 The fourth baseline is Qazvinian and Radev's (2008) citation-based summarizer (QR08), in which the citation sentences are first clustered and the sentences within each cluster are then ranked using LexRank. [sent-298, score-1.107]
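LexRank (Erkan and Radev, 2004) ranks sentences by a PageRank-style walk over a thresholded cosine-similarity graph. A compact sketch follows; it uses raw term counts where the original uses tf-idf weighting, and the threshold and damping values are illustrative defaults.

```python
import math

def lexrank(sentences, threshold=0.1, damping=0.85, iters=50):
    # Bag-of-words vectors over the shared vocabulary (raw counts for
    # brevity; Erkan and Radev weight terms by tf-idf).
    bags = [s.lower().split() for s in sentences]
    vocab = sorted({w for bag in bags for w in bag})
    vecs = [[bag.count(w) for w in vocab] for bag in bags]

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv) if nu and nv else 0.0

    # Threshold the similarity graph; self-loops keep every row non-empty.
    n = len(sentences)
    adj = [[1.0 if cosine(vecs[i], vecs[j]) > threshold else 0.0
            for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in adj]

    # Power iteration on the damped, row-normalized adjacency matrix.
    scores = [1.0 / n] * n
    for _ in range(iters):
        scores = [(1 - damping) / n +
                  damping * sum(adj[j][i] / deg[j] * scores[j]
                                for j in range(n))
                  for i in range(n)]
    return scores
```

A sentence connected to more neighbors ends up with a higher score, which is what both LexRank baselines and the cluster-internal ranking rely on.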
97 In another variation (FL-2), we remove the sentence classification component, so all the sentences are assumed to come from one category in the subsequent components. [sent-301, score-0.322]
98 To make the comparison of the extraction quality to those baselines fair, we remove the author name replacement component from our system and all its variations. [sent-303, score-0.282]
99 2 Results Table 6 shows the average ROUGE-L scores (with 95% confidence interval) for the summaries of the 30 papers in dataset2 generated using our system and the different baselines. [sent-306, score-0.286]
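ROUGE-L scores a candidate summary against a reference summary via their longest common subsequence of words; a minimal sketch of the balanced F-measure variant (the evaluation here also averages over papers and reports a confidence interval, which this sketch omits):

```python
def lcs_len(a, b):
    # Standard dynamic program for longest-common-subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = (dp[i - 1][j - 1] + 1 if x == y
                        else max(dp[i - 1][j], dp[i][j - 1]))
    return dp[len(a)][len(b)]

def rouge_l(candidate, reference):
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)
```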
100 4 Coherence and Readability Evaluation We asked human judges (not including the authors) to rate the coherence and readability of a number of summaries for each of dataset2 papers. [sent-317, score-0.443]
wordName wordTfidf (topN-words)
[('citation', 0.684), ('summaries', 0.199), ('reference', 0.189), ('sentences', 0.166), ('radev', 0.161), ('resnik', 0.135), ('readability', 0.135), ('summary', 0.133), ('unsuitable', 0.119), ('lexrank', 0.117), ('scope', 0.116), ('qazvinian', 0.108), ('cite', 0.106), ('scientific', 0.102), ('teufel', 0.094), ('sentence', 0.093), ('category', 0.092), ('papers', 0.087), ('stage', 0.081), ('eisner', 0.076), ('coherence', 0.074), ('fragments', 0.07), ('aan', 0.069), ('summarization', 0.069), ('goldwater', 0.067), ('fragment', 0.065), ('target', 0.064), ('griffiths', 0.061), ('author', 0.061), ('nanba', 0.06), ('network', 0.058), ('cluster', 0.055), ('citationbased', 0.052), ('elkiss', 0.052), ('erkan', 0.052), ('citations', 0.052), ('aspect', 0.052), ('irrelevant', 0.05), ('pronoun', 0.045), ('publication', 0.044), ('background', 0.044), ('aspects', 0.043), ('functional', 0.042), ('clauset', 0.042), ('keyphrases', 0.042), ('variation', 0.042), ('extraction', 0.042), ('cover', 0.042), ('mohammad', 0.04), ('suitable', 0.04), ('filtering', 0.039), ('subtree', 0.039), ('replacement', 0.039), ('component', 0.038), ('year', 0.038), ('kappa', 0.038), ('scopes', 0.038), ('tagging', 0.037), ('identification', 0.037), ('baselines', 0.037), ('categories', 0.037), ('summarizer', 0.036), ('remove', 0.036), ('describes', 0.035), ('okumura', 0.035), ('eecs', 0.035), ('umi', 0.035), ('asked', 0.035), ('ordered', 0.035), ('clustering', 0.034), ('confusing', 0.034), ('repeating', 0.034), ('rooted', 0.034), ('fl', 0.034), ('readable', 0.034), ('stages', 0.033), ('clusters', 0.033), ('drawbacks', 0.032), ('cohesion', 0.032), ('replaced', 0.032), ('covers', 0.032), ('tagged', 0.032), ('classify', 0.032), ('step', 0.031), ('components', 0.031), ('aim', 0.031), ('svm', 0.031), ('classification', 0.03), ('representative', 0.029), ('ordering', 0.029), ('removed', 0.029), ('name', 0.029), ('michigan', 0.029), ('tences', 0.029), ('humans', 0.028), ('kernel', 0.028), ('henceforth', 0.027), 
('nguyen', 0.027), ('motivation', 0.027), ('smallest', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999958 71 acl-2011-Coherent Citation-Based Summarization of Scientific Papers
Author: Amjad Abu-Jbara ; Dragomir Radev
Abstract: In citation-based summarization, text written by several researchers is leveraged to identify the important aspects of a target paper. Previous work on this problem focused almost exclusively on its extraction aspect (i.e. selecting a representative set of citation sentences that highlight the contribution of the target paper). Meanwhile, the fluency of the produced summaries has been mostly ignored. For example, diversity, readability, cohesion, and ordering of the sentences included in the summary have not been thoroughly considered. This resulted in noisy and confusing summaries. In this work, we present an approach for producing readable and cohesive citation-based summaries. Our experiments show that the proposed approach outperforms several baselines in terms of both extraction quality and fluency.
2 0.41787544 281 acl-2011-Sentiment Analysis of Citations using Sentence Structure-Based Features
Author: Awais Athar
Abstract: Sentiment analysis of citations in scientific papers and articles is a new and interesting problem due to the many linguistic differences between scientific texts and other genres. In this paper, we focus on the problem of automatic identification of positive and negative sentiment polarity in citations to scientific papers. Using a newly constructed annotated citation sentiment corpus, we explore the effectiveness of existing and novel features, including n-grams, specialised science-specific lexical features, dependency relations, sentence splitting and negation features. Our results show that 3-grams and dependencies perform best in this task; they outperform the sentence splitting, science lexicon and negation based features.
3 0.36261809 73 acl-2011-Collective Classification of Congressional Floor-Debate Transcripts
Author: Clinton Burfoot ; Steven Bird ; Timothy Baldwin
Abstract: This paper explores approaches to sentiment classification of U.S. Congressional floordebate transcripts. Collective classification techniques are used to take advantage of the informal citation structure present in the debates. We use a range of methods based on local and global formulations and introduce novel approaches for incorporating the outputs of machine learners into collective classification algorithms. Our experimental evaluation shows that the mean-field algorithm obtains the best results for the task, significantly outperforming the benchmark technique.
4 0.25450569 270 acl-2011-SciSumm: A Multi-Document Summarization System for Scientific Articles
Author: Nitin Agarwal ; Ravi Shankar Reddy ; Kiran GVR ; Carolyn Penstein Rose
Abstract: In this demo, we present SciSumm, an interactive multi-document summarization system for scientific articles. The document collection to be summarized is a list of papers cited together within the same source article, otherwise known as a co-citation. At the heart of the approach is a topic based clustering of fragments extracted from each article based on queries generated from the context surrounding the co-cited list of papers. This analysis enables the generation of an overview of common themes from the co-cited papers that relate to the context in which the co-citation was found. SciSumm is currently built over the 2008 ACL Anthology, however the generalizable nature of the summarization techniques and the extensible architecture makes it possible to use the system with other corpora where a citation network is available. Evaluation results on the same corpus demonstrate that our system performs better than an existing widely used multi-document summarization system (MEAD).
5 0.20315997 201 acl-2011-Learning From Collective Human Behavior to Introduce Diversity in Lexical Choice
Author: Vahed Qazvinian ; Dragomir R. Radev
Abstract: We analyze collective discourse, a collective human behavior in content generation, and show that it exhibits diversity, a property of general collective systems. Using extensive analysis, we propose a novel paradigm for designing summary generation systems that reflect the diversity of perspectives seen in reallife collective summarization. We analyze 50 sets of summaries written by human about the same story or artifact and investigate the diversity of perspectives across these summaries. We show how different summaries use various phrasal information units (i.e., nuggets) to express the same atomic semantic units, called factoids. Finally, we present a ranker that employs distributional similarities to build a network of words, and captures the diversity of perspectives by detecting communities in this network. Our experiments show how our system outperforms a wide range of other document ranking systems that leverage diversity.
6 0.14948085 280 acl-2011-Sentence Ordering Driven by Local and Global Coherence for Summary Generation
7 0.12056258 326 acl-2011-Using Bilingual Information for Cross-Language Document Summarization
8 0.11824746 21 acl-2011-A Pilot Study of Opinion Summarization in Conversations
9 0.11496936 98 acl-2011-Discovery of Topically Coherent Sentences for Extractive Summarization
10 0.11029626 187 acl-2011-Jointly Learning to Extract and Compress
11 0.10896607 298 acl-2011-The ACL Anthology Searchbench
12 0.10614394 76 acl-2011-Comparative News Summarization Using Linear Programming
13 0.10201069 308 acl-2011-Towards a Framework for Abstractive Summarization of Multimodal Documents
14 0.096836127 50 acl-2011-Automatic Extraction of Lexico-Syntactic Patterns for Detection of Negation and Speculation Scopes
15 0.085359298 8 acl-2011-A Corpus of Scope-disambiguated English Text
16 0.079110831 188 acl-2011-Judging Grammaticality with Tree Substitution Grammar Derivations
17 0.076795429 174 acl-2011-Insights from Network Structure for Text Mining
18 0.075790502 47 acl-2011-Automatic Assessment of Coverage Quality in Intelligence Reports
19 0.073104478 67 acl-2011-Clairlib: A Toolkit for Natural Language Processing, Information Retrieval, and Network Analysis
20 0.066831157 18 acl-2011-A Latent Topic Extracting Method based on Events in a Document and its Application
topicId topicWeight
[(0, 0.2), (1, 0.104), (2, -0.012), (3, 0.034), (4, -0.036), (5, 0.008), (6, -0.064), (7, 0.164), (8, -0.011), (9, -0.102), (10, -0.108), (11, -0.085), (12, -0.237), (13, 0.015), (14, -0.277), (15, -0.089), (16, -0.005), (17, 0.01), (18, 0.132), (19, -0.077), (20, -0.128), (21, -0.133), (22, 0.16), (23, -0.087), (24, 0.076), (25, 0.142), (26, -0.132), (27, 0.271), (28, -0.186), (29, 0.005), (30, -0.177), (31, 0.029), (32, -0.018), (33, 0.046), (34, 0.058), (35, -0.065), (36, 0.056), (37, -0.018), (38, -0.047), (39, 0.076), (40, -0.033), (41, 0.004), (42, -0.0), (43, 0.05), (44, -0.027), (45, -0.07), (46, 0.023), (47, -0.041), (48, -0.01), (49, -0.131)]
simIndex simValue paperId paperTitle
same-paper 1 0.95040995 71 acl-2011-Coherent Citation-Based Summarization of Scientific Papers
Author: Amjad Abu-Jbara ; Dragomir Radev
Abstract: In citation-based summarization, text written by several researchers is leveraged to identify the important aspects of a target paper. Previous work on this problem focused almost exclusively on its extraction aspect (i.e. selecting a representative set of citation sentences that highlight the contribution of the target paper). Meanwhile, the fluency of the produced summaries has been mostly ignored. For example, diversity, readability, cohesion, and ordering of the sentences included in the summary have not been thoroughly considered. This resulted in noisy and confusing summaries. In this work, we present an approach for producing readable and cohesive citation-based summaries. Our experiments show that the proposed approach outperforms several baselines in terms of both extraction quality and fluency.
2 0.7957859 73 acl-2011-Collective Classification of Congressional Floor-Debate Transcripts
Author: Clinton Burfoot ; Steven Bird ; Timothy Baldwin
Abstract: This paper explores approaches to sentiment classification of U.S. Congressional floordebate transcripts. Collective classification techniques are used to take advantage of the informal citation structure present in the debates. We use a range of methods based on local and global formulations and introduce novel approaches for incorporating the outputs of machine learners into collective classification algorithms. Our experimental evaluation shows that the mean-field algorithm obtains the best results for the task, significantly outperforming the benchmark technique.
3 0.72525996 201 acl-2011-Learning From Collective Human Behavior to Introduce Diversity in Lexical Choice
Author: Vahed Qazvinian ; Dragomir R. Radev
Abstract: We analyze collective discourse, a collective human behavior in content generation, and show that it exhibits diversity, a property of general collective systems. Using extensive analysis, we propose a novel paradigm for designing summary generation systems that reflect the diversity of perspectives seen in reallife collective summarization. We analyze 50 sets of summaries written by human about the same story or artifact and investigate the diversity of perspectives across these summaries. We show how different summaries use various phrasal information units (i.e., nuggets) to express the same atomic semantic units, called factoids. Finally, we present a ranker that employs distributional similarities to build a network of words, and captures the diversity of perspectives by detecting communities in this network. Our experiments show how our system outperforms a wide range of other document ranking systems that leverage diversity.
4 0.68103731 270 acl-2011-SciSumm: A Multi-Document Summarization System for Scientific Articles
Author: Nitin Agarwal ; Ravi Shankar Reddy ; Kiran GVR ; Carolyn Penstein Rose
Abstract: In this demo, we present SciSumm, an interactive multi-document summarization system for scientific articles. The document collection to be summarized is a list of papers cited together within the same source article, otherwise known as a co-citation. At the heart of the approach is a topic based clustering of fragments extracted from each article based on queries generated from the context surrounding the co-cited list of papers. This analysis enables the generation of an overview of common themes from the co-cited papers that relate to the context in which the co-citation was found. SciSumm is currently built over the 2008 ACL Anthology, however the generalizable nature of the summarization techniques and the extensible architecture makes it possible to use the system with other corpora where a citation network is available. Evaluation results on the same corpus demonstrate that our system performs better than an existing widely used multi-document summarization system (MEAD).
5 0.61038697 67 acl-2011-Clairlib: A Toolkit for Natural Language Processing, Information Retrieval, and Network Analysis
Author: Amjad Abu-Jbara ; Dragomir Radev
Abstract: In this paper we present Clairlib, an opensource toolkit for Natural Language Processing, Information Retrieval, and Network Analysis. Clairlib provides an integrated framework intended to simplify a number of generic tasks within and across those three areas. It has a command-line interface, a graphical interface, and a documented API. Clairlib is compatible with all the common platforms and operating systems. In addition to its own functionality, it provides interfaces to external software and corpora. Clairlib comes with a comprehensive documentation and a rich set of tutorials and visual demos.
6 0.60071206 281 acl-2011-Sentiment Analysis of Citations using Sentence Structure-Based Features
7 0.50212908 298 acl-2011-The ACL Anthology Searchbench
8 0.43950999 308 acl-2011-Towards a Framework for Abstractive Summarization of Multimodal Documents
9 0.43071947 280 acl-2011-Sentence Ordering Driven by Local and Global Coherence for Summary Generation
10 0.41347453 8 acl-2011-A Corpus of Scope-disambiguated English Text
11 0.40440175 187 acl-2011-Jointly Learning to Extract and Compress
12 0.39367589 50 acl-2011-Automatic Extraction of Lexico-Syntactic Patterns for Detection of Negation and Speculation Scopes
13 0.39072573 76 acl-2011-Comparative News Summarization Using Linear Programming
14 0.38379118 326 acl-2011-Using Bilingual Information for Cross-Language Document Summarization
15 0.35631528 84 acl-2011-Contrasting Opposing Views of News Articles on Contentious Issues
16 0.33457169 255 acl-2011-Query Snowball: A Co-occurrence-based Approach to Multi-document Summarization for Question Answering
17 0.33334762 273 acl-2011-Semantic Representation of Negation Using Focus Detection
18 0.33117181 174 acl-2011-Insights from Network Structure for Text Mining
19 0.32769984 98 acl-2011-Discovery of Topically Coherent Sentences for Extractive Summarization
20 0.32572648 338 acl-2011-Wikulu: An Extensible Architecture for Integrating Natural Language Processing Techniques with Wikis
topicId topicWeight
[(5, 0.052), (11, 0.166), (17, 0.044), (18, 0.011), (26, 0.021), (37, 0.098), (39, 0.04), (41, 0.049), (55, 0.024), (59, 0.034), (72, 0.052), (91, 0.06), (96, 0.24), (97, 0.011)]
simIndex simValue paperId paperTitle
1 0.95911682 136 acl-2011-Finding Deceptive Opinion Spam by Any Stretch of the Imagination
Author: Myle Ott ; Yejin Choi ; Claire Cardie ; Jeffrey T. Hancock
Abstract: Consumers increasingly rate, review and research products online (Jansen, 2010; Litvin et al., 2008). Consequently, websites containing consumer reviews are becoming targets of opinion spam. While recent work has focused primarily on manually identifiable instances of opinion spam, in this work we study deceptive opinion spam—fictitious opinions that have been deliberately written to sound authentic. Integrating work from psychology and computational linguistics, we develop and compare three approaches to detecting deceptive opinion spam, and ultimately develop a classifier that is nearly 90% accurate on our gold-standard opinion spam dataset. Based on feature analysis of our learned models, we additionally make several theoretical contributions, including revealing a relationship between deceptive opinions and imaginative writing.
2 0.92804664 48 acl-2011-Automatic Detection and Correction of Errors in Dependency Treebanks
Author: Alexander Volokh ; Günter Neumann
Abstract: Annotated corpora are essential for almost all NLP applications. Although they are expected to be of very high quality because of their importance for follow-up developments, they still contain a considerable number of errors. With this work we want to draw attention to this fact. Additionally, we estimate the amount of errors and propose a method for their automatic correction. While our approach finds only a portion of the errors that we suppose are contained in almost any annotated corpus, due to the nature of the corpus-creation process, it has very high precision and is therefore beneficial to the quality of any corpus it is applied to. Finally, we compare it to a different method for error detection in treebanks and find that the errors each approach detects are mostly different, and that the two approaches are complementary.
same-paper 3 0.9077782 71 acl-2011-Coherent Citation-Based Summarization of Scientific Papers
Author: Amjad Abu-Jbara ; Dragomir Radev
Abstract: In citation-based summarization, text written by several researchers is leveraged to identify the important aspects of a target paper. Previous work on this problem focused almost exclusively on its extraction aspect (i.e. selecting a representative set of citation sentences that highlight the contribution of the target paper). Meanwhile, the fluency of the produced summaries has been mostly ignored. For example, diversity, readability, cohesion, and ordering of the sentences included in the summary have not been thoroughly considered. This resulted in noisy and confusing summaries. In this work, we present an approach for producing readable and cohesive citation-based summaries. Our experiments show that the proposed approach outperforms several baselines in terms of both extraction quality and fluency.
4 0.88602185 318 acl-2011-Unsupervised Bilingual Morpheme Segmentation and Alignment with Context-rich Hidden Semi-Markov Models
Author: Jason Naradowsky ; Kristina Toutanova
Abstract: This paper describes an unsupervised dynamic graphical model for morphological segmentation and bilingual morpheme alignment for statistical machine translation. The model extends Hidden Semi-Markov chain models by using factored output nodes and special structures for its conditional probability distributions. It relies on morpho-syntactic and lexical source-side information (part-of-speech, morphological segmentation) while learning a morpheme segmentation over the target language. Our model outperforms a competitive word alignment system in alignment quality. Used in a monolingual morphological segmentation setting it substantially improves accuracy over previous state-of-the-art models on three Arabic and Hebrew datasets.
5 0.85182303 330 acl-2011-Using Derivation Trees for Treebank Error Detection
Author: Seth Kulick ; Ann Bies ; Justin Mott
Abstract: This work introduces a new approach to checking treebank consistency. Derivation trees based on a variant of Tree Adjoining Grammar are used to compare the annotation of word sequences based on their structural similarity. This overcomes the problems of earlier approaches based on using strings of words rather than tree structure to identify the appropriate contexts for comparison. We report on the result of applying this approach to the Penn Arabic Treebank and how this approach leads to high precision of error detection.
6 0.84917462 81 acl-2011-Consistent Translation using Discriminative Learning - A Translation Memory-inspired Approach
7 0.84665579 20 acl-2011-A New Dataset and Method for Automatically Grading ESOL Texts
8 0.84596014 207 acl-2011-Learning to Win by Reading Manuals in a Monte-Carlo Framework
9 0.84483165 117 acl-2011-Entity Set Expansion using Topic information
10 0.84457254 47 acl-2011-Automatic Assessment of Coverage Quality in Intelligence Reports
11 0.84337997 77 acl-2011-Computing and Evaluating Syntactic Complexity Features for Automated Scoring of Spontaneous Non-Native Speech
12 0.84287339 240 acl-2011-ParaSense or How to Use Parallel Corpora for Word Sense Disambiguation
13 0.84282905 46 acl-2011-Automated Whole Sentence Grammar Correction Using a Noisy Channel Model
14 0.84281635 53 acl-2011-Automatically Evaluating Text Coherence Using Discourse Relations
15 0.8427062 177 acl-2011-Interactive Group Suggesting for Twitter
16 0.84249949 76 acl-2011-Comparative News Summarization Using Linear Programming
17 0.84242284 133 acl-2011-Extracting Social Power Relationships from Natural Language
18 0.8422513 233 acl-2011-On-line Language Model Biasing for Statistical Machine Translation
19 0.8420862 187 acl-2011-Jointly Learning to Extract and Compress
20 0.84163082 104 acl-2011-Domain Adaptation for Machine Translation by Mining Unseen Words
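The simValue column above ranks papers by textual similarity to the target. The source does not specify how these scores are computed; a common choice for this kind of ranking is cosine similarity between term-frequency vectors of the abstracts, sketched below as an illustration only, not the site's actual method:

```python
import math
from collections import Counter

def cosine_sim(text_a: str, text_b: str) -> float:
    """Cosine similarity between term-frequency vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values())) *
            math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Identical texts score 1.0; disjoint texts score 0.0.
print(cosine_sim("citation based summarization of papers",
                 "summarization of scientific papers"))
```

Ranking every candidate paper by this score against the target, descending, would reproduce a listing of the same shape as the one above.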