acl acl2010 acl2010-140 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Vahed Qazvinian ; Dragomir R. Radev
Abstract: Identifying background (context) information in scientific articles can help scholars understand major contributions in their research area more easily. In this paper, we propose a general framework based on probabilistic inference to extract such context information from scientific papers. We model the sentences in an article and their lexical similarities as a Markov Random Field tuned to detect the patterns that context data create, and employ a Belief Propagation mechanism to detect likely context sentences. We also address the problem of generating surveys of scientific papers. Our experiments show greater pyramid scores for surveys generated using such context information rather than citation sentences alone.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Identifying background (context) information in scientific articles can help scholars understand major contributions in their research area more easily. [sent-2, score-0.205]
2 In this paper, we propose a general framework based on probabilistic inference to extract such context information from scientific papers. [sent-3, score-0.259]
3 We model the sentences in an article and their lexical similarities as a Markov Random Field tuned to detect the patterns that context data create, and employ a Belief Propagation mechanism to detect likely context sentences. [sent-4, score-0.31]
4 We also address the problem of generating surveys of scientific papers. [sent-5, score-0.337]
5 Our experiments show greater pyramid scores for surveys generated using such context information rather than citation sentences alone. [sent-6, score-0.987]
6 1 Introduction In scientific literature, scholars use citations to refer to external sources. [sent-7, score-0.542]
7 Previous work has shown the importance of citations in scientific domains and indicated that citations include survey-worthy information (Siddharthan and Teufel, 2007; Elkiss et al. [sent-9, score-0.852]
8 A citation to a paper in a scientific article may contain explicit information about the cited research. [sent-12, score-1.083]
9 We look at the patterns that such sentences create and observe that context sentences occur within a small neighborhood of explicit citations. [sent-17, score-0.396]
10 We also discuss the problem of extracting context sentences for a source-reference article pair. [sent-18, score-0.229]
11 Finally we give evidence on how such sentences can help us produce better surveys of research areas. [sent-21, score-0.259]
12 2 Prior Work Analyzing the structure of scientific articles and their relations has received a lot of attention recently. [sent-29, score-0.178]
13 The structure of citation and collaboration networks has been studied in (Teufel et al. [sent-30, score-0.609]
14 , 2006; Newman, 2001), and summarization of scientific documents is discussed in (Teufel and Moens, 2002). [sent-31, score-0.214]
15 In addition, there is some previous work on the importance of citation sentences. [sent-32, score-0.609]
16 , 2008) perform a large-scale study on citations in the free PubMed Central (PMC) and show that they contain information that may not be present in abstracts. [sent-34, score-0.337]
17 , 2004a) analyze citation sentences and automatically categorize them in order to build a tool for survey generation. [sent-37, score-0.753]
18 The text of scientific citations has been used in previous research. [sent-38, score-0.515]
19 Bradshaw (Bradshaw, 2002; Bradshaw, 2003) uses citations to determine the content of articles. [sent-39, score-0.337]
20 Similarly, the text of citation sentences has been directly used to produce summaries of scientific papers in (Qazvinian and Radev, 2008; Mei and Zhai, 2008; Mohammad et al. [sent-40, score-0.98]
21 Determining the scientific attribution of an article has also been studied before. [sent-42, score-0.226]
22 Little work has been done on automatic citation extraction from research papers. [sent-44, score-0.609]
23 This work uses a machine learning method for extracting citations from research papers and evaluates the results using 4 annotated articles. [sent-47, score-0.43]
24 3 Data The ACL Anthology Network (AAN)2 is a collection of papers from the ACL Anthology3 published in the Computational Linguistics journal and in proceedings from ACL conferences and workshops, and includes more than 14,000 papers over a period of four decades (Radev et al. [sent-53, score-0.186]
25 AAN includes the citation network of the papers in the ACL Anthology. [sent-55, score-0.748]
26 The annotator was asked to look at each explicit citation sentence, read up to 15 sentences before and after it, and then mark the context sentences around that sentence with 1s. [sent-83, score-1.059]
27 To calculate κ, we ignored all explicit citations (since they were provided to the external annotator) and used the binary categories (i. [sent-86, score-0.452]
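The κ computation described above can be sketched generically; this is standard Cohen's κ over two annotators' binary label sequences, not the authors' evaluation code (function and variable names are illustrative):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa: chance-corrected agreement between two annotators."""
    n = len(a)
    observed = sum(1 for x, y in zip(a, b) if x == y) / n
    # Chance agreement from each annotator's label marginals.
    ca, cb = Counter(a), Counter(b)
    expected = sum((ca[k] / n) * (cb[k] / n) for k in set(a) | set(b))
    return (observed - expected) / (1 - expected)
```

With the binary categories of the annotation task (context vs. not), perfect agreement gives κ = 1 and chance-level agreement gives κ = 0.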
28 First, we look at the number of explicit citations each reference has received in a paper. [sent-98, score-0.49]
29 It indicates that the majority of references get cited in only 1 sentence in a scientific article, while the maximum is 9 in our collected dataset, with only 1 instance (i. [sent-100, score-0.365]
30 e., there is only 1 reference that gets cited 9 times in a paper). [sent-102, score-0.171]
31 This highly skewed distribution indicates that the majority of references get cited only once in a citing paper. [sent-105, score-0.222]
32 210 a b Figure 1: (a) Histogram of the number of different citations to each reference in a paper. [sent-109, score-0.375]
33 (b) The distribution observed for the number of different citations on a log-log scale. [sent-110, score-0.337]
34 Next, we investigate the distance between context sentences and the closest citations. [sent-111, score-0.181]
35 For each context sentence, we find its distance to the closest context sentence or explicit citation. [sent-112, score-0.331]
36 Formally, we define the gap to be the number of sentences between a context sentence (marked with 1) and the closest context sentence or explicit citation (marked with either C or 1) to it. [sent-113, score-1.094]
37 For example, the second column of Table 2 shows that there is a gap of size 1 in the 9th sentence in the set of context and citation sentences about Shinyama et al. [sent-114, score-0.844]
38 This observation suggests that the majority of context sentences directly occur after or before a citation or another context sentence. [sent-117, score-0.871]
39 However, it shows that gaps between sentences describing a cited paper actually exist, and a proposed method should have the capability to capture them. [sent-118, score-0.233]
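The gap statistic defined above can be sketched over a per-sentence label sequence; the 'C'/'1'/'0' encoding mirrors the annotation scheme, and the function itself is an illustrative reconstruction rather than the authors' code:

```python
def gaps(labels):
    """For each context sentence ('1'), count the plain sentences ('0')
    between it and the nearest other context sentence or explicit
    citation ('C'). Assumes at least one 'C' is present."""
    marked = [i for i, l in enumerate(labels) if l in ('C', '1')]
    out = []
    for i in marked:
        if labels[i] != '1':
            continue
        nearest = min(abs(i - j) for j in marked if j != i)
        out.append(nearest - 1)  # sentences strictly between the two
    return out
```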
40 4 Proposed Method In this section we propose our methodology that enables us to identify the context information of a cited paper. [sent-119, score-0.254]
41 Particularly, the task is to assign a binary label XC to each sentence Si from a paper S, where XC = 1 shows a context sentence related to a given cited paper, C. [sent-120, score-0.322]
42 Each hidden node xu, corresponding to an observed node yu, represents the true state underlying the observed value. [sent-126, score-0.214]
43 The state of a hidden node is related to the value of its corresponding observed node as well as the states of its neighboring hidden nodes. [sent-127, score-0.25]
44 Thus, the state of a node is assumed to statistically depend only upon its hidden node and each of its neighbors, and to be independent of any other node in the graph given its neighbors. [sent-129, score-0.214]
45 The potential function, φi(xc, yc), shows the statistical dependency between xc and yc at each node i, as assumed by the MRF model. [sent-133, score-0.185]
46 Elements that make up the message from a node i to another node j: messages from i's neighbors, the local evidence at i, and the propagation function between i and j, summed over all possible states of node i. [sent-140, score-0.463]
47 The message passed from i to j is proportional to the propagation function between i and j, the local evidence at i, and all messages sent to i from its neighbors except j: mij(xj) ← Σ_{xi} φi(xi) ψij(xi, xj) Π_{k ∈ ne(i)\j} mki(xi). Figure 2 illustrates the message update rule. [sent-142, score-0.33]
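The update rule can be sketched for binary states on a chain; this is a generic sum-product loop under assumed potentials φ and compatibilities ψ, not the paper's implementation:

```python
def bp_marginals(phi, psi, iters=20):
    """Sum-product belief propagation on a chain MRF with binary states.
    phi: list of [p0, p1] node potentials; psi: 2x2 compatibility matrix.
    Returns normalised per-node beliefs."""
    n = len(phi)
    nbrs = lambda i: [j for j in (i - 1, i + 1) if 0 <= j < n]
    msgs = {(i, j): [1.0, 1.0] for i in range(n) for j in nbrs(i)}
    for _ in range(iters):
        new = {}
        for (i, j) in msgs:
            # local evidence at i times messages from i's other neighbours
            v = [phi[i][s] for s in (0, 1)]
            for k in nbrs(i):
                if k != j:
                    v = [v[s] * msgs[(k, i)][s] for s in (0, 1)]
            # sum over the states of i, weighted by the compatibility psi
            m = [sum(psi[s][t] * v[s] for s in (0, 1)) for t in (0, 1)]
            z = m[0] + m[1]
            new[(i, j)] = [m[0] / z, m[1] / z]
        msgs = new
    beliefs = []
    for i in range(n):
        b = [phi[i][s] for s in (0, 1)]
        for k in nbrs(i):
            b = [b[s] * msgs[(k, i)][s] for s in (0, 1)]
        z = b[0] + b[1]
        beliefs.append([b[0] / z, b[1] / z])
    return beliefs
```

A smoothing-style ψ (larger on the diagonal) makes strong evidence at one sentence pull the beliefs of its neighbours toward the same class, which is the propagation pattern the model relies on.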
48 4.1 MRF construction To find the sentences from a paper that form the context information of a given cited paper, we build an MRF in which a hidden node xi and an observed node yi correspond to each sentence. [sent-150, score-0.582]
49 This assumption indicates that the generation of a sentence (in the form of its words) depends only on its neighboring sentences. Figure 3: The structure of the MRF constructed based on the independence of non-adjacent sentences; (a) left, each sentence is independent of all other sentences given its immediate neighbors. [sent-152, score-0.208]
50 This local dependence assumption can result in a number of different MRFs, each built assuming a dependency between a sentence and all sentences within a particular distance. [sent-156, score-0.18]
51 Generally, we use BPi to denote an MRF in which each sentence is connected to i sentences before and after. [sent-160, score-0.18]
52 Table 5: The compatibility function ψ between any two nodes in the MRFs built from the sentences in scientific papers. [sent-163, score-0.456]
4.2 Compatibility Function The compatibility function of an MRF represents the association between the hidden node classes. [sent-164, score-0.21]
54 The belief of a node i about its neighbor j being in either class is assumed to be 0.5. [sent-166, score-0.171]
55 In other words, if a node is not part of the context itself, we assume it has no effect on its neighbors' classes. [sent-168, score-0.17]
56 4.3 Potential Function The node potential function of an MRF can incorporate some other features observable from data. [sent-177, score-0.16]
57 Here, the goal is to find all sentences that are about a specific cited paper, without having explicit citations. [sent-178, score-0.348]
58 To build the node potential function of the observed nodes, we use some sentence level features. [sent-179, score-0.214]
59 First, we use the explicit citation as an important feature of a sentence. [sent-180, score-0.724]
60 This feature can affect the belief of the corresponding hidden node, which can in turn affect its neighbors’ beliefs. [sent-181, score-0.172]
61 For a given paper-reference pair, we flag (with a 1) each sentence that has an explicit citation to the reference. [sent-182, score-0.812]
62 Intuitively, if a sentence has higher similarity with the reference paper, it should have a higher potential of being in class 1 or C. [sent-188, score-0.164]
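The node potentials built from these features can be sketched from a per-sentence flag f_i; here the flag is simplified to the explicit-citation feature plus a bag-of-words cosine similarity to the reference text, a stand-in for the paper's full sentence-level feature set (all names are illustrative):

```python
import math
from collections import Counter

def cosine(a, b):
    """Bag-of-words cosine similarity between two sentences."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def node_potentials(sentences, cites, reference_text):
    """phi[i] = [1 - f_i, f_i]: an explicit citation forces f_i = 1,
    otherwise f_i is the similarity to the reference."""
    phi = []
    for i, s in enumerate(sentences):
        f = 1.0 if i in cites else cosine(s, reference_text)
        phi.append([1.0 - f, f])
    return phi
```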
63 Table 6: The node potential function φ for each node in the MRFs built from the sentences in scientific papers: φi(xc, yc) = fi if xc = 1 and 1 − fi if xc = 0, where the flags fi are computed using sentence-level features. [sent-193, score-0.727]
64 Our methodology finds the sentences that cite a reference implicitly. [sent-195, score-0.208]
65 Therefore the output of the inference method is a vector, υ, of 1's and 0's, whereby a 1 at element i means that sentence i in the source document is a context sentence about the reference, while a 0 means an explicit citation or neither. [sent-196, score-0.951]
66 This baseline, B1, takes explicit citations as an input and uses them to find context sentences. [sent-202, score-0.533]
67 each sentence that is within a particular distance (4 in our experiments) of an explicit citation and matches one of the two patterns mentioned in Section 4. [sent-207, score-0.778]
68 After marking all such sentences, B2 also marks all sentences between them and the closest explicit citation, which is no farther than 4 sentences away. [sent-209, score-0.315]
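Baseline B2 as described can be sketched as follows; the pattern set and window size are parameters, and the gap-filling logic is one plausible reading of the description rather than the authors' implementation:

```python
import re

def b2_baseline(sentences, cites, patterns, window=4):
    """Mark non-citation sentences within `window` of an explicit citation
    that match a cue pattern, then also mark every sentence between a
    match and its closest citation."""
    marked = set()
    for i, s in enumerate(sentences):
        if i in cites:
            continue
        near = [c for c in cites if abs(i - c) <= window]
        if near and any(re.search(p, s) for p in patterns):
            c = min(near, key=lambda c: abs(i - c))
            lo, hi = sorted((i, c))
            marked.update(range(lo + 1, hi))  # fill the gap up to the citation
            marked.add(i)
    return marked - cites
```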
69 , similarity to reference, explicit citation, matching certain regular expressions) and a network-level feature: distance to the closest explicit citation. [sent-215, score-0.305]
70 In BP4 locality is more relaxed and each sentence is connected to 4 sentences on each side. [sent-221, score-0.18]
71 The first feature used to build the potential function is explicit citations. [sent-226, score-0.186]
72 This feature does not directly affect context sentences (i. [sent-227, score-0.208]
73 e., it affects the marginal probability of context sentences through the MRF network connections). [sent-229, score-0.227]
74 Here we show how context sentences add important survey-worthy information to explicit citations. [sent-236, score-0.296]
75 Previous work that generates surveys of scientific topics uses the text of citation sentences alone (Mohammad et al. [sent-237, score-1.046]
76 Here, we show how the surveys generated using citations and their context sentences are better than those generated using citation sentences alone. [sent-239, score-1.386]
77 that contains two sets of cited papers and corresponding citing sentences, one on Question Answering (QA) with 10 papers and the other on Dependency Parsing (DP) with 16 papers. [sent-287, score-0.408]
78 The QA set contains two different sets of nuggets extracted by experts respectively from paper abstracts and citation sentences. [sent-288, score-0.811]
79 The DP set includes nuggets extracted only from citation sentences. [sent-289, score-0.778]
80 For each citation sentence, BP4 is used on the citing paper to extract the proper context. [sent-292, score-0.698]
81 That is, we attach to a citing sentence any of its 4 preceding and following sentences if BP4 marks them as context. [Table 9 residue: rows "citation survey" / "context survey", QA CT nuggets, value 0.379] [sent-294, score-1.19]
82 Table 9: Pyramid Fβ=3 scores of automatic surveys of QA and DP data. [sent-300, score-0.159]
83 The QA surveys are evaluated using nuggets drawn from citation texts (CT), or abstracts (AB), and DP surveys are evaluated using nuggets from citation texts (CT). [sent-301, score-1.907]
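The pyramid evaluation behind these scores weights nugget recall heavily; a generic sketch of F with β = 3 over weighted nuggets (matching nuggets against a survey is assumed done elsewhere, and the names are illustrative):

```python
def pyramid_recall(covered, nugget_weights):
    """Weight of the nuggets a survey covers, relative to the total weight."""
    total = sum(nugget_weights.values())
    return sum(w for n, w in nugget_weights.items() if n in covered) / total

def f_beta(precision, recall, beta=3.0):
    """F_beta; beta = 3 counts recall roughly three times as much as precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```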
84 Therefore, we build a new corpus in which each explicit citation sentence is replaced with the same sentence attached to at most 4 sentences on each side. [sent-303, score-0.886]
85 After building the context corpus, we use LexRank (Erkan and Radev, 2004) to generate 2 QA and 2 DP surveys: one of each using the citation sentences only, and one of each using the new context corpus explained above. [sent-304, score-1.03]
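LexRank scores sentences by a damped random walk over their similarity graph; a minimal power-iteration sketch of the continuous variant, with the similarity matrix assumed precomputed (e.g. from cosine similarities):

```python
def lexrank(sim, d=0.85, iters=50):
    """PageRank-style scores over a sentence-similarity matrix."""
    n = len(sim)
    rows = []
    for row in sim:  # row-normalise similarities into transition probabilities
        z = sum(row)
        rows.append([s / z for s in row] if z else [1.0 / n] * n)
    r = [1.0 / n] * n
    for _ in range(iters):
        r = [(1 - d) / n + d * sum(r[j] * rows[j][i] for j in range(n))
             for i in range(n)]
    return r
```

Erkan and Radev's original formulation also offers a thresholded (binary) similarity graph; the continuous version above avoids choosing a threshold.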
86 This example shows how context sentences add meaningful and survey-worthy information along with citation sentences. [sent-309, score-0.79]
87 The QA surveys are evaluated using nuggets drawn from citation texts (CT), or abstracts (AB), and DP surveys are evaluated using nuggets from citation texts (CT). [sent-311, score-1.907]
88 In all evaluation instances the surveys generated with the context corpora excel at covering nuggets drawn from abstracts or citation sentences. [sent-312, score-1.051]
89 7 Conclusion In this paper we proposed a framework based on probabilistic inference to extract sentences that appear in the scientific literature, and which are about a secondary source, but which do not contain explicit citations to that secondary source. [sent-313, score-0.806]
90 Our methodology is based on inference in an MRF built using the similarity of sentences and their lexical features. [sent-314, score-0.195]
91 We show, by numerical experiments, that an MRF in which each sentence is connected to only a few adjacent sentences properly fits this problem. [sent-315, score-0.18]
92 We also investigate the usefulness of such sentences in generating surveys of scientific literature. [sent-316, score-0.437]
93 Our experiments on generating surveys for Question Answering and Dependency Parsing show how surveys generated using such context information along with citation sentences have higher quality than those built using citations alone. [sent-317, score-1.471]
94 Generating fluent scientific surveys is difficult in the absence of sufficient background information. [sent-318, score-0.337]
95 Our future goal is to combine summarization and bibliometric techniques towards building automatic surveys that employ context information as an important part of the generated surveys. [sent-319, score-0.276]
96 Reference directed indexing: Redeeming relevance for subject search in citation indexes. [sent-331, score-0.609]
97 Blind men and elephants: What do citation summaries tell us about a research article? [sent-336, score-0.609]
98 Automatic extraction of citation contexts for re- search paper summarization: A coreference-chain based approach. [sent-345, score-0.609]
99 Classification of research papers using citation links and citation types: Towards automatic review article generation. [sent-388, score-1.359]
100 Summarizing scientific articles: experiments with relevance and rhetorical status. [sent-418, score-0.178]
wordName wordTfidf (topN-words)
[('citation', 0.609), ('citations', 0.337), ('mrf', 0.28), ('scientific', 0.178), ('nuggets', 0.169), ('surveys', 0.159), ('cited', 0.133), ('nanba', 0.118), ('explicit', 0.115), ('teufel', 0.101), ('qazvinian', 0.101), ('sentences', 0.1), ('qa', 0.1), ('xc', 0.096), ('papers', 0.093), ('node', 0.089), ('citing', 0.089), ('vahed', 0.084), ('belief', 0.082), ('context', 0.081), ('radev', 0.078), ('neighbors', 0.075), ('mohammad', 0.074), ('mrfs', 0.074), ('dp', 0.071), ('elkiss', 0.067), ('siddharthan', 0.063), ('dragomir', 0.059), ('message', 0.059), ('aan', 0.057), ('compatibility', 0.057), ('sentence', 0.054), ('xi', 0.054), ('metzler', 0.053), ('digital', 0.052), ('bradshaw', 0.051), ('hidetsugu', 0.051), ('romanello', 0.051), ('article', 0.048), ('lexrank', 0.048), ('network', 0.046), ('propagation', 0.044), ('survey', 0.044), ('mij', 0.044), ('marked', 0.044), ('potential', 0.043), ('simone', 0.041), ('manabu', 0.041), ('methodology', 0.04), ('secondary', 0.038), ('pyramid', 0.038), ('reference', 0.038), ('ct', 0.037), ('hidden', 0.036), ('summarization', 0.036), ('michigan', 0.036), ('flag', 0.034), ('bp', 0.034), ('xj', 0.034), ('anthology', 0.034), ('mcglohon', 0.034), ('okumura', 0.034), ('pradeep', 0.034), ('riloffand', 0.034), ('unes', 0.034), ('xv', 0.034), ('yedidia', 0.034), ('messages', 0.033), ('abstracts', 0.033), ('mei', 0.033), ('erkan', 0.033), ('kaplan', 0.033), ('ito', 0.032), ('libraries', 0.031), ('advaith', 0.03), ('cite', 0.03), ('eecs', 0.03), ('histogram', 0.03), ('shannon', 0.03), ('similarity', 0.029), ('eisner', 0.029), ('annotation', 0.028), ('function', 0.028), ('iis', 0.027), ('affect', 0.027), ('scholarly', 0.027), ('hirschman', 0.027), ('flags', 0.027), ('scholars', 0.027), ('uv', 0.027), ('thelen', 0.027), ('sigir', 0.027), ('donald', 0.027), ('dimensionality', 0.027), ('connected', 0.026), ('built', 0.026), ('annotator', 0.026), ('markov', 0.025), ('shinyama', 0.025), ('sigmoid', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000007 140 acl-2010-Identifying Non-Explicit Citing Sentences for Citation-Based Summarization.
Author: Vahed Qazvinian ; Dragomir R. Radev
Abstract: Identifying background (context) information in scientific articles can help scholars understand major contributions in their research area more easily. In this paper, we propose a general framework based on probabilistic inference to extract such context information from scientific papers. We model the sentences in an article and their lexical similarities as a Markov Random Field tuned to detect the patterns that context data create, and employ a Belief Propagation mechanism to detect likely context sentences. We also address the problem of generating surveys of scientific papers. Our experiments show greater pyramid scores for surveys generated using such context information rather than citation sentences alone.
2 0.080259144 14 acl-2010-A Risk Minimization Framework for Extractive Speech Summarization
Author: Shih-Hsiang Lin ; Berlin Chen
Abstract: In this paper, we formulate extractive summarization as a risk minimization problem and propose a unified probabilistic framework that naturally combines supervised and unsupervised summarization models to inherit their individual merits as well as to overcome their inherent limitations. In addition, the introduction of various loss functions also provides the summarization framework with a flexible but systematic way to render the redundancy and coherence relationships among sentences and between sentences and the whole document, respectively. Experiments on speech summarization show that the methods deduced from our framework are very competitive with existing summarization approaches. 1
3 0.074004531 174 acl-2010-Modeling Semantic Relevance for Question-Answer Pairs in Web Social Communities
Author: Baoxun Wang ; Xiaolong Wang ; Chengjie Sun ; Bingquan Liu ; Lin Sun
Abstract: Quantifying the semantic relevance between questions and their candidate answers is essential to answer detection in social media corpora. In this paper, a deep belief network is proposed to model the semantic relevance for question-answer pairs. Observing the textual similarity between the community-driven questionanswering (cQA) dataset and the forum dataset, we present a novel learning strategy to promote the performance of our method on the social community datasets without hand-annotating work. The experimental results show that our method outperforms the traditional approaches on both the cQA and the forum corpora.
4 0.073396929 77 acl-2010-Cross-Language Document Summarization Based on Machine Translation Quality Prediction
Author: Xiaojun Wan ; Huiying Li ; Jianguo Xiao
Abstract: Cross-language document summarization is a task of producing a summary in one language for a document set in a different language. Existing methods simply use machine translation for document translation or summary translation. However, current machine translation services are far from satisfactory, which results in that the quality of the cross-language summary is usually very poor, both in readability and content. In this paper, we propose to consider the translation quality of each sentence in the English-to-Chinese cross-language summarization process. First, the translation quality of each English sentence in the document set is predicted with the SVM regression method, and then the quality score of each sentence is incorporated into the summarization process. Finally, the English sentences with high translation quality and high informativeness are selected and translated to form the Chinese summary. Experimental results demonstrate the effectiveness and usefulness of the proposed approach. 1
5 0.058138572 51 acl-2010-Bilingual Sense Similarity for Statistical Machine Translation
Author: Boxing Chen ; George Foster ; Roland Kuhn
Abstract: This paper proposes new algorithms to compute the sense similarity between two units (words, phrases, rules, etc.) from parallel corpora. The sense similarity scores are computed by using the vector space model. We then apply the algorithms to statistical machine translation by computing the sense similarity between the source and target side of translation rule pairs. Similarity scores are used as additional features of the translation model to improve translation performance. Significant improvements are obtained over a state-of-the-art hierarchical phrase-based machine translation system. 1
6 0.055901747 38 acl-2010-Automatic Evaluation of Linguistic Quality in Multi-Document Summarization
7 0.05569857 106 acl-2010-Event-Based Hyperspace Analogue to Language for Query Expansion
8 0.054520395 11 acl-2010-A New Approach to Improving Multilingual Summarization Using a Genetic Algorithm
9 0.054054178 189 acl-2010-Optimizing Question Answering Accuracy by Maximizing Log-Likelihood
10 0.053813465 264 acl-2010-Wrapping up a Summary: From Representation to Generation
11 0.052850701 125 acl-2010-Generating Templates of Entity Summaries with an Entity-Aspect Model and Pattern Mining
12 0.051372185 113 acl-2010-Extraction and Approximation of Numerical Attributes from the Web
13 0.051334456 8 acl-2010-A Hybrid Hierarchical Model for Multi-Document Summarization
14 0.051299848 171 acl-2010-Metadata-Aware Measures for Answer Summarization in Community Question Answering
15 0.05061635 55 acl-2010-Bootstrapping Semantic Analyzers from Non-Contradictory Texts
16 0.050417665 141 acl-2010-Identifying Text Polarity Using Random Walks
17 0.050151017 226 acl-2010-The Human Language Project: Building a Universal Corpus of the World's Languages
18 0.049515259 39 acl-2010-Automatic Generation of Story Highlights
19 0.046844624 4 acl-2010-A Cognitive Cost Model of Annotations Based on Eye-Tracking Data
20 0.04577158 115 acl-2010-Filtering Syntactic Constraints for Statistical Machine Translation
topicId topicWeight
[(0, -0.151), (1, 0.043), (2, -0.054), (3, 0.01), (4, -0.01), (5, -0.006), (6, 0.02), (7, -0.069), (8, -0.036), (9, -0.01), (10, -0.042), (11, 0.001), (12, -0.024), (13, -0.012), (14, 0.023), (15, 0.048), (16, -0.033), (17, -0.002), (18, 0.018), (19, 0.042), (20, 0.033), (21, -0.027), (22, 0.045), (23, -0.024), (24, 0.031), (25, 0.014), (26, -0.029), (27, -0.021), (28, 0.029), (29, -0.043), (30, -0.028), (31, 0.014), (32, 0.003), (33, 0.042), (34, 0.012), (35, 0.067), (36, 0.008), (37, 0.018), (38, -0.057), (39, -0.041), (40, 0.035), (41, -0.067), (42, -0.019), (43, -0.039), (44, 0.069), (45, -0.02), (46, 0.02), (47, 0.069), (48, -0.055), (49, -0.025)]
simIndex simValue paperId paperTitle
same-paper 1 0.87280715 140 acl-2010-Identifying Non-Explicit Citing Sentences for Citation-Based Summarization.
Author: Vahed Qazvinian ; Dragomir R. Radev
Abstract: Identifying background (context) information in scientific articles can help scholars understand major contributions in their research area more easily. In this paper, we propose a general framework based on probabilistic inference to extract such context information from scientific papers. We model the sentences in an article and their lexical similarities as a Markov Random Field tuned to detect the patterns that context data create, and employ a Belief Propagation mechanism to detect likely context sentences. We also address the problem of generating surveys of scientific papers. Our experiments show greater pyramid scores for surveys generated using such context information rather than citation sentences alone.
2 0.70482445 171 acl-2010-Metadata-Aware Measures for Answer Summarization in Community Question Answering
Author: Mattia Tomasoni ; Minlie Huang
Abstract: This paper presents a framework for automatically processing information coming from community Question Answering (cQA) portals with the purpose of generating a trustful, complete, relevant and succinct summary in response to a question. We exploit the metadata intrinsically present in User Generated Content (UGC) to bias automatic multi-document summarization techniques toward high quality information. We adopt a representation of concepts alternative to n-grams and propose two concept-scoring functions based on semantic overlap. Experimental re- sults on data drawn from Yahoo! Answers demonstrate the effectiveness of our method in terms of ROUGE scores. We show that the information contained in the best answers voted by users of cQA portals can be successfully complemented by our method.
3 0.63350761 174 acl-2010-Modeling Semantic Relevance for Question-Answer Pairs in Web Social Communities
Author: Baoxun Wang ; Xiaolong Wang ; Chengjie Sun ; Bingquan Liu ; Lin Sun
Abstract: Quantifying the semantic relevance between questions and their candidate answers is essential to answer detection in social media corpora. In this paper, a deep belief network is proposed to model the semantic relevance for question-answer pairs. Observing the textual similarity between the community-driven questionanswering (cQA) dataset and the forum dataset, we present a novel learning strategy to promote the performance of our method on the social community datasets without hand-annotating work. The experimental results show that our method outperforms the traditional approaches on both the cQA and the forum corpora.
4 0.62366992 11 acl-2010-A New Approach to Improving Multilingual Summarization Using a Genetic Algorithm
Author: Marina Litvak ; Mark Last ; Menahem Friedman
Abstract: Automated summarization methods can be defined as “language-independent,” if they are not based on any language-specific knowledge. Such methods can be used for multilingual summarization defined by Mani (2001) as “processing several languages, with summary in the same language as input.” In this paper, we introduce MUSE, a language-independent approach for extractive summarization based on the linear optimization of several sentence ranking measures using a genetic algorithm. We tested our methodology on two languages—English and Hebrew—and evaluated its performance with ROUGE-1 Recall vs. state-of-the-art extractive summarization approaches. Our results show that MUSE performs better than the best known multilingual approach (TextRank1) in both languages. Moreover, our experimental results on a bilingual (English and Hebrew) document collection suggest that MUSE does not need to be retrained on each language and the same model can be used across at least two different languages.
5 0.61195534 39 acl-2010-Automatic Generation of Story Highlights
Author: Kristian Woodsend ; Mirella Lapata
Abstract: In this paper we present a joint content selection and compression model for single-document summarization. The model operates over a phrase-based representation of the source document which we obtain by merging information from PCFG parse trees and dependency graphs. Using an integer linear programming formulation, the model learns to select and combine phrases subject to length, coverage and grammar constraints. We evaluate the approach on the task of generating “story highlights”—a small number of brief, self-contained sentences that allow readers to quickly gather information on news stories. Experimental results show that the model’s output is comparable to human-written highlights in terms of both grammaticality and content.
6 0.58301932 14 acl-2010-A Risk Minimization Framework for Extractive Speech Summarization
7 0.57904142 189 acl-2010-Optimizing Question Answering Accuracy by Maximizing Log-Likelihood
8 0.55019128 196 acl-2010-Plot Induction and Evolutionary Search for Story Generation
9 0.52056831 264 acl-2010-Wrapping up a Summary: From Representation to Generation
10 0.51921523 248 acl-2010-Unsupervised Ontology Induction from Text
11 0.51548767 225 acl-2010-Temporal Information Processing of a New Language: Fast Porting with Minimal Resources
12 0.51366442 111 acl-2010-Extracting Sequences from the Web
13 0.50869423 186 acl-2010-Optimal Rank Reduction for Linear Context-Free Rewriting Systems with Fan-Out Two
14 0.50772715 139 acl-2010-Identifying Generic Noun Phrases
15 0.48975176 38 acl-2010-Automatic Evaluation of Linguistic Quality in Multi-Document Summarization
16 0.48971829 109 acl-2010-Experiments in Graph-Based Semi-Supervised Learning Methods for Class-Instance Acquisition
17 0.4895345 7 acl-2010-A Generalized-Zero-Preserving Method for Compact Encoding of Concept Lattices
18 0.48934892 165 acl-2010-Learning Script Knowledge with Web Experiments
19 0.482721 113 acl-2010-Extraction and Approximation of Numerical Attributes from the Web
20 0.47850034 77 acl-2010-Cross-Language Document Summarization Based on Machine Translation Quality Prediction
topicId topicWeight
[(11, 0.281), (14, 0.028), (25, 0.043), (33, 0.017), (42, 0.038), (44, 0.023), (59, 0.07), (71, 0.011), (72, 0.02), (73, 0.062), (78, 0.057), (80, 0.011), (83, 0.087), (84, 0.025), (98, 0.133)]
simIndex simValue paperId paperTitle
same-paper 1 0.74003267 140 acl-2010-Identifying Non-Explicit Citing Sentences for Citation-Based Summarization.
Author: Vahed Qazvinian ; Dragomir R. Radev
Abstract: Identifying background (context) information in scientific articles can help scholars understand major contributions in their research area more easily. In this paper, we propose a general framework based on probabilistic inference to extract such context information from scientific papers. We model the sentences in an article and their lexical similarities as a Markov Random Field tuned to detect the patterns that context data create, and employ a Belief Propagation mechanism to detect likely context sentences. We also address the problem of generating surveys of scientific papers. Our experiments show greater pyramid scores for surveys generated using such context information rather than citation sentences alone.
2 0.73846555 36 acl-2010-Automatic Collocation Suggestion in Academic Writing
Author: Jian-Cheng Wu ; Yu-Chia Chang ; Teruko Mitamura ; Jason S. Chang
Abstract: In recent years, collocation has been widely acknowledged as an essential characteristic to distinguish native speakers from non-native speakers. Research on academic writing has also shown that collocations are not only common but serve a particularly important discourse function within the academic community. In our study, we propose a machine learning approach to implementing an online collocation writing assistant. We use a data-driven classifier to provide collocation suggestions to improve word choices, based on the result of classification. The system generates and ranks suggestions to assist learners' collocation usages in their academic writing with satisfactory results.
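The paper above trains a data-driven classifier to rank collocation suggestions. As a rough, hedged illustration of the candidate-ranking step only, the sketch below substitutes a simple PMI association score for the paper's trained classifier; the `rank_collocations` name and the (verb, noun) pair representation are assumptions for this example.

```python
import math
from collections import Counter

def rank_collocations(corpus_pairs, verb, candidates):
    """Toy association-based ranker for collocation suggestion.
    corpus_pairs: list of (verb, noun) pairs extracted from a corpus.
    Returns `candidates` sorted by pointwise mutual information with `verb`
    (a stand-in for the paper's classifier-based ranking)."""
    pair_counts = Counter(corpus_pairs)
    verb_counts = Counter(v for v, _ in corpus_pairs)
    noun_counts = Counter(n for _, n in corpus_pairs)
    total = len(corpus_pairs)

    def pmi(v, n):
        joint = pair_counts[(v, n)]
        if joint == 0:
            return float("-inf")   # unseen pair: rank last
        return math.log2(joint * total / (verb_counts[v] * noun_counts[n]))

    return sorted(candidates, key=lambda n: pmi(verb, n), reverse=True)
```

On a toy corpus where "make decision" is frequent and "make homework" unattested, "decision" is ranked above "homework" as the suggestion for "make".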
3 0.56457448 109 acl-2010-Experiments in Graph-Based Semi-Supervised Learning Methods for Class-Instance Acquisition
Author: Partha Pratim Talukdar ; Fernando Pereira
Abstract: Graph-based semi-supervised learning (SSL) algorithms have been successfully used to extract class-instance pairs from large unstructured and structured text collections. However, a careful comparison of different graph-based SSL algorithms on that task has been lacking. We compare three graph-based SSL algorithms for class-instance acquisition on a variety of graphs constructed from different domains. We find that the recently proposed MAD algorithm is the most effective. We also show that class-instance extraction can be significantly improved by adding semantic information in the form of instance-attribute edges derived from an independently developed knowledge base. All of our code and data will be made publicly available to encourage reproducible research in this area.
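The abstract above compares graph-based SSL algorithms for class-instance acquisition. The sketch below shows the general family with plain iterative label propagation on a weighted graph — a simpler relative of the MAD algorithm the paper finds most effective, not MAD itself; the graph encoding and function name are assumptions.

```python
def label_propagation(edges, seeds, iters=20):
    """Minimal label propagation for class-instance acquisition.
    edges: {(u, v): weight} undirected; seeds: {node: class_label}.
    Each unlabeled node repeatedly absorbs the weighted average of its
    neighbours' label distributions; seed nodes stay clamped."""
    nodes, adj = set(), {}
    for (u, v), w in edges.items():
        nodes |= {u, v}
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
    labels = sorted(set(seeds.values()))
    dist = {n: {l: 0.0 for l in labels} for n in nodes}
    for n, l in seeds.items():
        dist[n][l] = 1.0                      # one-hot seed distributions
    for _ in range(iters):
        new = {}
        for n in nodes:
            if n in seeds:
                new[n] = dist[n]              # clamp seeds
                continue
            agg = {l: 0.0 for l in labels}
            total = 0.0
            for m, w in adj.get(n, []):
                for l in labels:
                    agg[l] += w * dist[m][l]
                total += w
            new[n] = {l: agg[l] / total for l in labels} if total else dist[n]
        dist = new
    return dist
```

Seeding the class nodes "city" and "metal" lets instance nodes connected to them (e.g. "Paris", "gold") inherit the right class distribution.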
4 0.56046093 158 acl-2010-Latent Variable Models of Selectional Preference
Author: Diarmuid O Seaghdha
Abstract: This paper describes the application of so-called topic models to selectional preference induction. Three models related to Latent Dirichlet Allocation, a proven method for modelling document-word cooccurrences, are presented and evaluated on datasets of human plausibility judgements. Compared to previously proposed techniques, these models perform very competitively, especially for infrequent predicate-argument combinations where they exceed the quality of Web-scale predictions while using relatively little data.
5 0.56006509 71 acl-2010-Convolution Kernel over Packed Parse Forest
Author: Min Zhang ; Hui Zhang ; Haizhou Li
Abstract: This paper proposes a convolution forest kernel to effectively explore rich structured features embedded in a packed parse forest. As opposed to the convolution tree kernel, the proposed forest kernel does not have to commit to a single best parse tree, and is thus able to explore very large object spaces and much more structured features embedded in a forest. This makes the proposed kernel more robust against parsing errors and data sparseness issues than the convolution tree kernel. The paper presents the formal definition of the convolution forest kernel and also illustrates an algorithm to compute the proposed kernel efficiently. Experimental results on two NLP applications, relation extraction and semantic role labeling, show that the proposed forest kernel significantly outperforms the baseline of the convolution tree kernel.
6 0.55815196 251 acl-2010-Using Anaphora Resolution to Improve Opinion Target Identification in Movie Reviews
7 0.55813193 70 acl-2010-Contextualizing Semantic Representations Using Syntactically Enriched Vector Models
8 0.55707455 153 acl-2010-Joint Syntactic and Semantic Parsing of Chinese
9 0.55702168 146 acl-2010-Improving Chinese Semantic Role Labeling with Rich Syntactic Features
10 0.55669075 184 acl-2010-Open-Domain Semantic Role Labeling by Modeling Word Spans
11 0.55667287 214 acl-2010-Sparsity in Dependency Grammar Induction
12 0.55507171 55 acl-2010-Bootstrapping Semantic Analyzers from Non-Contradictory Texts
13 0.55480397 198 acl-2010-Predicate Argument Structure Analysis Using Transformation Based Learning
14 0.55361217 120 acl-2010-Fully Unsupervised Core-Adjunct Argument Classification
15 0.55321455 93 acl-2010-Dynamic Programming for Linear-Time Incremental Parsing
16 0.55250508 116 acl-2010-Finding Cognate Groups Using Phylogenies
17 0.55247545 218 acl-2010-Structural Semantic Relatedness: A Knowledge-Based Method to Named Entity Disambiguation
18 0.5524351 39 acl-2010-Automatic Generation of Story Highlights
19 0.55197591 245 acl-2010-Understanding the Semantic Structure of Noun Phrase Queries
20 0.55162799 185 acl-2010-Open Information Extraction Using Wikipedia