emnlp emnlp2013 emnlp2013-85 knowledge-graph by maker-knowledge-mining

85 emnlp-2013-Fast Joint Compression and Summarization via Graph Cuts


Source: pdf

Author: Xian Qian ; Yang Liu

Abstract: Extractive summarization typically uses sentences as summarization units. In contrast, joint compression and summarization can use smaller units such as words and phrases, resulting in summaries containing more information. The goal of compressive summarization is to find a subset of words that maximizes the total score of concepts and cut dependency arcs under the grammar constraints and the summary length constraint. We propose an efficient decoding algorithm for fast compressive summarization using graph cuts. Our approach first relaxes the length constraint using Lagrangian relaxation. Then we propose to bound the relaxed objective function by a supermodular binary quadratic programming problem, which can be solved efficiently using graph max-flow/min-cut. Since finding the tightest lower bound suffers from local optimality, we use convex relaxation for initialization. Experimental results on the TAC2008 dataset demonstrate that our method achieves a competitive ROUGE score and good readability, while being much faster than the integer linear programming (ILP) method.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract Extractive summarization typically uses sentences as summarization units. [sent-5, score-0.653]

2 In contrast, joint compression and summarization can use smaller units such as words and phrases, resulting in summaries containing more information. [sent-6, score-0.526]

3 The goal of compressive summarization is to find a subset of words that maximize the total score of concepts and cutting dependency arcs under the grammar constraints and summary length constraint. [sent-7, score-1.117]

4 We propose an efficient decoding algorithm for fast compressive summarization using graph cuts. [sent-8, score-0.784]

5 Then we propose to bound the relaxed objective function by the supermodular binary quadratic programming problem, which can be solved efficiently using graph max-flow/min-cut. [sent-10, score-0.627]

6 Since finding the tightest lower bound suffers from local optimality, we use convex relaxation for initialization. [sent-11, score-0.267]

7 1 Introduction Automatic multi-document summarization helps readers get the most important information from large amounts of text. [sent-13, score-0.311]

8 Extractive summarization casts the summarization task as a sentence selection problem: identifying important summary sentences from one or multiple documents. [sent-15, score-0.802]

9 Many methods have been developed in the past decades, including supervised approaches that use classifiers to predict summary sentences, graph based approaches to rank the sentences, and recent global optimization methods such as integer linear programming (Gillick et al. [sent-16, score-0.277]

10 Though extractive summarization is popular because of its simplicity and high readability, it has limitations in that it selects each sentence as a whole, and thus may miss informative partial sentences. [sent-18, score-0.547]

11 To improve the informativeness, joint compression and summarization was proposed (Berg-Kirkpatrick et al. [sent-19, score-0.443]

12 , 2011), which uses words as summarization units, unlike extractive summarization where each sentence is a basic indecomposable unit. [sent-20, score-0.858]

13 To achieve better readability, manually defined grammar constraints or automatically learned models based on syntax trees are added during the summarization process. [sent-21, score-0.416]

14 Up to now, the state-of-the-art compressive systems are based on integer linear programming (ILP). [sent-22, score-0.425]

15 Because ILP suffers from exponential complexity, word-based joint compression and summarization is an order of magnitude slower than sentence-based extraction. [sent-23, score-0.478]

16 In this paper, we propose an efficient decoding algorithm for fast ILP based compressive summarization using graph cuts. [sent-33, score-0.784]

17 Our idea is to approximate the ILP as a binary quadratic programming problem where coefficients of all quadratic terms are nonnegative. [sent-37, score-0.362]

18 It is well known that such a binary quadratic function is supermodular, and its maximum can be found efficiently using graph max-flow/min-cut. [sent-38, score-0.270]

19 Hence the key is to find the coefficients of the supermodular binary quadratic function (SBQF) so that its maximum is close to the optimal ILP objective value. [sent-39, score-0.255]

20 First, we show that the subtree deletion model and grammar constraints can be eliminated by adding SBQFs to the objective function. [sent-41, score-0.403]

21 Second, we relax the summary length constraint using Lagrangian relaxation. [sent-42, score-0.274]

22 Since finding the tightest lower bound suffers from local optimality, we choose to use convex relaxation for initialization. [sent-44, score-0.301]

23 1 Extractive Summarization As our method is an approximation of the ILP based method, we first briefly review ILP based extractive summarization and compressive summarization. [sent-50, score-0.820]

24 This approach selects sentences so that the total score of language concepts appearing in the summary is maximized. [sent-55, score-0.329]

25 The association between the language concepts and sentences serves as the constraints, in addition to the summary length constraint. [sent-56, score-0.305]

26 Each concept cj is associated with a given score wj and a binary variable vj indicating whether cj is selected in the summary. [sent-65, score-0.762]

27 Let njk denote the index of the sentence containing the kth occurrence of concept cj, and ln denote the length of sentence sn. [sent-66, score-0.299]

28 The ILP based extractive summarization system can be formulated as below: max_{y,v} Σj wj · vj, s.t. vj = ∪k y_{njk} ∀j, Σn ln · yn ≤ L, y, v are binary (1). [sent-67, score-0.503]
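
As a concrete illustration of Problem (1), the following is a minimal sketch of the concept-based extractive ILP using the PuLP modeling library; the data layout (dictionaries of concept scores, occurrences and sentence lengths) is an assumption made for illustration, not the authors' implementation.

```python
# A hedged sketch of the concept-based extractive ILP (Problem (1)).
# Assumed data layout: w[j] is the score of concept j, occurrences[j] lists the
# sentences containing concept j, lengths[n] is the length of sentence n.
import pulp

def extractive_ilp(w, occurrences, lengths, L):
    prob = pulp.LpProblem("extractive_summarization", pulp.LpMaximize)
    y = {n: pulp.LpVariable(f"y_{n}", cat="Binary") for n in lengths}
    v = {j: pulp.LpVariable(f"v_{j}", cat="Binary") for j in w}
    prob += pulp.lpSum(w[j] * v[j] for j in w)                      # objective
    for j, sents in occurrences.items():
        for n in sents:
            prob += v[j] >= y[n]            # selecting a sentence selects its concepts
        prob += v[j] <= pulp.lpSum(y[n] for n in sents)             # ... and only then
    prob += pulp.lpSum(lengths[n] * y[n] for n in lengths) <= L     # summary length
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [n for n in y if y[n].value() > 0.5]
```

Replacing the binary variables with continuous ones in [0, 1] gives the relaxed LP that is used later for initialization.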

29 2 Compressive Summarization The quality of sentence-based extractive summarization is limited by the informativeness of the original sentences and the summary length constraint. [sent-80, score-0.74]

30 To remove the unimportant part from a long sentence, sentence compression is proposed to generate more informative summaries (Liu and Liu, 2009; Li et al. [sent-81, score-0.259]

31 Recent studies show that joint sentence compression and extraction, namely compressive summarization, outperforms pipeline systems that run extractive summarization on the compressed sentences or compress selected summary sentences (Martins and Smith, 2009; Berg-Kirkpatrick et al. [sent-83, score-1.221]

32 (2011), compressive summarization integrates the concept model for extractive summarization (Gillick and Favre, 2009) and the subtree deletion model for sentence compression. [sent-86, score-1.221]

33 The score of a compressive summary consists of two parts, scores of selected concepts, and scores of the broken arcs in the dependency parse trees. [sent-87, score-0.532]

34 The selected words must satisfy the length constraint and grammar constraints that include the subtree constraint and some manually defined hard constraints. [sent-88, score-0.455]

35 A compressive summary can be represented by a binary vector z, where zi indicates whether word xi is selected in the summary. [sent-99, score-0.636]

36 Let ahm denote the arc xh → xm in the dependency parse tree of the corresponding sentence containing words xh and xm, and A = {ahm} denote the set of dependency arcs. [sent-100, score-0.42]

37 The compressive summarization model can be formulated as an integer programming problem: max_{z,v} Σj wj · vj + Σ_{ahm∈A} wahm · zh(1 − zm), s.t. [sent-107, score-0.766]

38 vj = ∪k ∏_{i∈ojk} zi ∀j, Σi zi ≤ L, zh ≥ zm ∀ahm ∈ A, zh = zm ∀ahm ∈ B, z, v are binary (2). According to the subtree deletion model, the score of arc ahm is included if zh = 1 and zm = 0, which can be formulated as wahm · zh(1 − zm). [sent-109, score-2.266]
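
The sketch below mirrors Problem (2) in the same PuLP style (again an illustration under assumed data structures, not the authors' ILP). It uses the fact that under the subtree constraint zh ≥ zm the product zh(1 − zm) equals zh − zm, so the arc scores stay linear, and it linearizes vj = ∪k ∏_{i∈ojk} zi with one auxiliary binary variable per occurrence. The constraints are explained in the sentences that follow.

```python
# A hedged sketch of the compressive ILP (Problem (2)); names are assumptions.
# occ_words[j] lists occurrences of concept j, each a tuple of word indices
# (o_jk); arcs_A / arcs_B are lists of (h, m) word-index pairs.
import pulp

def compressive_ilp(w, occ_words, w_arc, arcs_A, arcs_B, L, n_words):
    prob = pulp.LpProblem("compressive_summarization", pulp.LpMaximize)
    z = {i: pulp.LpVariable(f"z_{i}", cat="Binary") for i in range(n_words)}
    v = {j: pulp.LpVariable(f"v_{j}", cat="Binary") for j in w}
    # Under z_h >= z_m, z_h * (1 - z_m) == z_h - z_m, so the objective is linear.
    prob += (pulp.lpSum(w[j] * v[j] for j in w)
             + pulp.lpSum(w_arc[(h, m)] * (z[h] - z[m]) for h, m in arcs_A))
    for j, occs in occ_words.items():
        u = [pulp.LpVariable(f"u_{j}_{k}", cat="Binary") for k in range(len(occs))]
        for k, o in enumerate(occs):
            for i in o:
                prob += u[k] <= z[i]        # an occurrence survives only if all its words do
            prob += u[k] >= pulp.lpSum(z[i] for i in o) - len(o) + 1
            prob += v[j] >= u[k]
        prob += v[j] <= pulp.lpSum(u)       # v_j = OR over the occurrences
    prob += pulp.lpSum(z.values()) <= L     # length constraint (counted in words)
    for h, m in arcs_A:
        prob += z[h] >= z[m]                # subtree constraints
    for h, m in arcs_B:
        prob += z[h] == z[m]                # hard grammar constraints
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [i for i in z if z[i].value() > 0.5]
```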

39 The first constraint is similar to that in extractive summarization, that is, a concept is selected if and only if any of its occurrences is selected. [sent-110, score-0.226]

40 The third and fourth constraints are the subtree constraints and manually defined grammar constraints respectively. [sent-111, score-0.345]

41 Unlike extractive summarization where the scale of the problem (the number of sentences and concepts) is small, the number of variables in compressive summarization is linear in the number of words, which is usually thousands on the TAC datasets. [sent-114, score-1.192]

42 The basic idea of our method is to approximate the above optimization problem (2) by the supermodular binary quadratic programming (SBQP) problem: max_z Σi βi · zi + Σ_{i,j} αij · zi · zj, s.t. z is binary, where all αij ≥ 0 (3). [sent-121, score-0.428]
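
To make the min-cut reduction concrete, here is a small self-contained sketch (not the authors' code) that maximizes a supermodular binary quadratic function by solving one s-t minimum cut with networkx; a variable is set to 1 exactly when its node ends up on the source side of the cut.

```python
# Maximize f(z) = sum_i beta[i]*z_i + sum_{(i,j)} alpha[(i,j)]*z_i*z_j with all
# alpha >= 0 (a supermodular function) via s-t minimum cut. A hedged sketch;
# variable ids must not collide with the reserved terminal names "s" and "t".
import networkx as nx

def maximize_sbqf(beta, alpha):
    # Minimize E(z) = -f(z). Rewrite -a*z_i*z_j = -a*z_i + a*z_i*(1 - z_j), so
    # E(z) = sum_i c_i*z_i + sum a_ij*z_i*(1 - z_j) with a_ij >= 0: a cut function.
    c = {i: -b for i, b in beta.items()}
    G = nx.DiGraph()
    G.add_nodes_from(["s", "t"])
    for (i, j), a in alpha.items():
        assert a >= 0, "supermodularity needs nonnegative quadratic coefficients"
        c[i] = c.get(i, 0.0) - a
        c.setdefault(j, 0.0)
        if a > 0:
            G.add_edge(i, j, capacity=a)        # paid iff z_i = 1 and z_j = 0
    for i, ci in c.items():
        if ci > 0:
            G.add_edge(i, "t", capacity=ci)     # paid iff z_i = 1
        elif ci < 0:
            G.add_edge("s", i, capacity=-ci)    # paid iff z_i = 0
    _, (source_side, _) = nx.minimum_cut(G, "s", "t")
    return {i: int(i in source_side) for i in c}

# toy check: f(z) = 2*z1 - z2 + 3*z1*z2 is maximized at z1 = z2 = 1
print(maximize_sbqf({1: 2.0, 2: -1.0}, {(1, 2): 3.0}))
```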

43 1 Formulate Grammar Constraints and Subtree Deletion Model by SBQF We show that the subtree deletion model can be formulated equivalently using SBQF. [sent-128, score-0.249]

44 First, we can eliminate the constraint zh ≥ zm by adding a penalty term to the objective function. [sent-129, score-0.642]

45 The problem max f(z) s.t. zh ≥ zm, z is binary, is equivalent to max f(z) − ∞ · (1 − zh)zm s.t. z is binary. [sent-132, score-0.592]

46 We can see that the penalty term −∞ · (1 − zh)zm excludes zh = 0, zm = 1 from the feasible set, and for zh ≥ zm both problems have the same objective function value. [sent-134, score-0.739]
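
The equivalence is easy to verify exhaustively for a single pair of variables; the tiny check below replaces the "infinite" weight by a large constant M (an assumption made for the sketch).

```python
# Brute-force check that max f(z) s.t. z_h >= z_m has the same maximizer as the
# unconstrained maximum of f(z) - M*(1 - z_h)*z_m for a sufficiently large M.
from itertools import product

def argmax_constrained(f):
    feasible = [(zh, zm) for zh, zm in product([0, 1], repeat=2) if zh >= zm]
    return max(feasible, key=lambda z: f(*z))

def argmax_penalized(f, M=1e6):
    return max(product([0, 1], repeat=2), key=lambda z: f(*z) - M * (1 - z[0]) * z[1])

f = lambda zh, zm: 2 * zh - 3 * zm          # an arbitrary example objective
assert argmax_constrained(f) == argmax_penalized(f)
```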

47 Now we eliminate the third constraint in problem (2) using the penalized objective function described above. [sent-137, score-0.211]

48 Problem (2) then becomes: max_{z,v} Σj wj · vj + Σ_{ahm∈A} wahm · zh(1 − zm) − ∞ · Σ_{ahm∈A} (1 − zh)zm, s.t. vj = ∪k ∏_{i∈ojk} zi ∀j, Σi zi ≤ L, z, v are binary (4). We can see that for each arc ahm, there must be a positive quadratic term +∞ · zh·zm in the objective function, which guarantees the supermodularity of the objective function, no matter what wahm is. [sent-140, score-0.671]

49 Relaxing the length constraint with a multiplier λ ≥ 0 turns Problem (4) into: max_{z,v} Σj wj · vj + Σ_{ahm∈A} wahm · zh(1 − zm) − ∞ · Σ_{ahm∈A} (1 − zh)zm + λ(L − Σi zi), s.t. vj = ∪k ∏_{i∈ojk} zi ∀j, λ ≥ 0, z, v are binary (5). We solve the relaxed problem iteratively. [sent-148, score-0.392]

50 Otherwise, if Σi zi > L, then the current λ is too small, so we set λmin = λ; otherwise, if λ > 0 and Σi zi < L, we set λmax = λ. [sent-160, score-0.286]

51 First, we relax the first constraint of Problem (5) by bounding the objective function with a family of supermodular pseudo-boolean functions (Boros and Hammer, 2002). [sent-167, score-0.344]
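
One member of such a family can be written down explicitly under the paper's assumptions (bigram concepts, wj ≥ 0): replace wj · vj = wj · max_k ∏_{i∈ojk} zi by the lower bound wj · Σk pjk · ∏_{i∈ojk} zi with Σk pjk = 1, which for bigram occurrences yields only nonnegative quadratic terms. The sketch below (assumed names and data layout; the grammar constraints in B are omitted for brevity) builds the (β, α) coefficients consumed by maximize_sbqf from the earlier sketch; it is not the authors' code.

```python
# A hedged sketch of building one supermodular lower bound of the relaxed
# objective (Problem (5)). BIG stands in for the "infinite" penalty weight.
BIG = 1e6

def build_sbqf(w, occ_words, p, w_arc, arcs_A, lam, n_words):
    beta = {i: -lam for i in range(n_words)}        # Lagrangian term -lambda * sum_i z_i
    alpha = {}
    def add_quad(i, j, a):
        key = (i, j) if i < j else (j, i)
        alpha[key] = alpha.get(key, 0.0) + a
    for j, occs in occ_words.items():               # concept lower bound, bigram occurrences
        for k, (i1, i2) in enumerate(occs):
            add_quad(i1, i2, w[j] * p[j][k])        # nonnegative because w_j >= 0
    for h, m in arcs_A:
        # w_hm*z_h*(1 - z_m) - BIG*(1 - z_h)*z_m
        #   = w_hm*z_h - BIG*z_m + (BIG - w_hm)*z_h*z_m
        beta[h] += w_arc[(h, m)]
        beta[m] -= BIG
        add_quad(h, m, BIG - w_arc[(h, m)])         # positive term keeps supermodularity
    return beta, alpha

# z = maximize_sbqf(*build_sbqf(w, occ_words, p, w_arc, arcs_A, lam, n_words))
```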

52 Though one can use techniques such as branch-and-bound for exact inference (Qian and Liu, 2013; Gormley and Eisner, 2013), here for fast decoding, we use convex relaxation to choose a good seed p(0). [sent-189, score-0.241]

53 To estimate such likelihood, we replace the binary constraint in extractive summarization (Problem (1)) by 0 ≤ y, v ≤ 1, since solving a relaxed LP is much faster than ILP. [sent-192, score-0.617]

54 Suppose y* is the optimal solution for such a relaxed LP problem; we initialize p by pjk = y*_{njk} / Σk y*_{njk} (9). If for all k, y*_{njk} = 0, then we initialize pjk with the uniform distribution pjk = 1/|oj|, where |oj| is the frequency of cj. [sent-193, score-0.257]
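
A small sketch of this initialization step (Eq. (9)); the variable names are assumptions: y_star holds the relaxed-LP values of the sentence variables and njk[j] lists the sentence index of each occurrence of concept cj.

```python
def init_p(y_star, njk):
    # p[j][k] = y*_{njk} / sum_k y*_{njk}; uniform 1/|o_j| if all occurrences get zero mass.
    p = {}
    for j, sents in njk.items():
        mass = sum(y_star[n] for n in sents)
        if mass > 0:
            p[j] = [y_star[n] / mass for n in sents]
        else:
            p[j] = [1.0 / len(sents)] * len(sents)
    return p
```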

55 Algorithm 1 Compressive Summarization via Graph Cuts. Require: Scores of concepts {wj} and arcs {wahm}, max summary length L. [sent-200, score-0.413]

56 Ensure: Compressive summarization z∗, where zi indicates whether the ith word is selected. [sent-201, score-0.454]

57 Solve the relaxed LP of Problem (1) (replace the binary constraint by 0 ≤ y, v ≤ 1) to get y. [sent-202, score-0.203]

58 until convergence; if Σi zi > L then λmin = λ else if Σi zi < L then λmax = λ else break; end if; end while. 4 Features and Hard Constraints We choose discriminative models to learn the scores of concepts and arcs. [sent-207, score-0.449]
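
Putting the pieces together, the outer loop of Algorithm 1 is a bisection over λ with one graph cut per candidate value. In the sketch below, solve_relaxed is an assumed helper wrapping build_sbqf and maximize_sbqf from the earlier sketches, and the initial upper bound for λ is an arbitrary assumption.

```python
# A hedged sketch of the outer loop of Algorithm 1 (binary search over lambda).
def decode(solve_relaxed, L, lam_hi=100.0, tol=1e-4):
    z = solve_relaxed(0.0)
    if sum(z.values()) <= L:                 # length constraint already satisfied
        return z
    lam_lo = 0.0
    while lam_hi - lam_lo > tol:
        lam = (lam_lo + lam_hi) / 2.0
        z = solve_relaxed(lam)
        n_selected = sum(z.values())
        if n_selected > L:                   # summary too long: penalty lambda too small
            lam_lo = lam
        elif n_selected < L:                 # summary too short: penalty lambda too large
            lam_hi = lam
        else:
            break
    return z
```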

59 For concept cj, its score is wj = θconcept^T · fconcept(cj), where fconcept(cj) is the feature vector of cj, and θconcept is the corresponding weight vector of feature fconcept(cj). [sent-208, score-0.308]

60 Similarly, score wahm is defined as wahm = θarc^T · farc(ahm). Though our algorithm can handle general word ngram concepts, we restrict the concepts to word bigrams, which have been widely used recently in the sentence-based ILP extractive summarization systems. [sent-209, score-0.829]
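
The two linear scoring functions can be made concrete with sparse feature dictionaries; the toy feature names and weights below are purely illustrative, not taken from the paper.

```python
def linear_score(theta, features):
    # theta: feature name -> learned weight; features: feature name -> value
    return sum(theta.get(name, 0.0) * value for name, value in features.items())

theta_concept = {"in_title": 1.3, "doc_freq_ge_3": 0.7}     # toy weights (assumed)
f_concept = {"in_title": 1.0, "doc_freq_ge_3": 1.0}         # toy feature vector of a bigram
w_j = linear_score(theta_concept, f_concept)                # -> 2.0
```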

61 Sentence-title similarity: word unigram/bigram cosine similarity between the sentence containing cj and the topic title. [sent-221, score-0.233]

62 For concepts appearing in multiple sentences, we choose the maximal similarity. [sent-222, score-0.196]

63 Sentence-query similarity: word unigram/bigram cosine similarity between the sentence containing cj and the topic query (concatenation of topic title and description). [sent-223, score-0.233]

64 For concepts appearing in multiple sentences, we choose the maximal similarity. [sent-224, score-0.196]

65 Sentence position: position of the sentence containing cj in the document. [sent-225, score-0.233]

66 Paragraph starter: binary feature indicating whether cj appears in the first sentence of a paragraph. [sent-229, score-0.304]
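
The sentence-title and sentence-query similarity features above boil down to cosine similarity over unigram or bigram count vectors; a plain implementation (not necessarily the one used by the authors) looks like this.

```python
from collections import Counter
from math import sqrt

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def cosine_similarity(tokens_a, tokens_b, n=2):
    a, b = Counter(ngrams(tokens_a, n)), Counter(ngrams(tokens_b, n))
    dot = sum(count * b[g] for g, count in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

print(cosine_similarity("a british soldier was killed in a blast".split(),
                        "british soldier killed in afghanistan".split()))
```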

67 For the subtree deletion model, we define the following features for arc ahm. [sent-230, score-0.320]

68 Dependency label of arc ahm and its parent arc; head word xh and child word xm. [sent-232, score-0.323]

69 We also define some hard constraints for subtree deletion to improve the readability of the generated compressed sentences. [sent-236, score-0.432]

70 This is based on our observation of the sentence compression corpus: removing prepositional phrases (PP) or sub-clauses can greatly reduce the length of a sentence while hurting readability little. [sent-243, score-0.287]

71 1 Experimental Setup Due to the lack of training data for compressive summarization, we learn the subtree deletion model and the concept model separately. [sent-246, score-0.674]

72 Specifically, the sentence compression dataset (Clarke and Lapata, 2008) (referred to as CL08) is used for subtree deletion model training (θarc). [sent-247, score-0.425]

73 A sentence pair in the corpus is kept for training the subtree deletion model if the compressed sentence can be derived by deleting subtrees from the parse tree of the original sentence. [sent-248, score-0.395]

74 Training data consist of two parts: TAC2009 for learning the concept model, and CL08 (Clarke and Lapata, 2008) for learning the subtree deletion model. [sent-254, score-0.357]

75 Recall that our algorithm is based on the assumption that scores of concepts are non-negative, ∀j, wj ≥ 0. [sent-259, score-0.229]

76 To control the contributions of the concept model and the subtree deletion model, we introduce a parameter µ and modify the original maximization problem (Problem 2) to: max_{z,v} Σj wj · vj + µ · Σ_{ahm∈A} wahm · zh(1 − zm). We tune µ on the TAC2010 dataset. [sent-265, score-0.528]

77 ROUGE has been widely used to evaluate summarization performance and can measure the informativeness of the summaries (content match between system and reference summaries). [sent-268, score-0.455]
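
For concreteness, ROUGE-2 in its simplest single-reference form is just bigram recall; real evaluations use the official ROUGE toolkit with stemming and multiple references, so the sketch below is only an approximation of the metric.

```python
from collections import Counter

def rouge2_recall(system_tokens, reference_tokens):
    sys_bigrams = Counter(zip(system_tokens, system_tokens[1:]))
    ref_bigrams = Counter(zip(reference_tokens, reference_tokens[1:]))
    overlap = sum(min(count, sys_bigrams[g]) for g, count in ref_bigrams.items())
    return overlap / sum(ref_bigrams.values()) if ref_bigrams else 0.0
```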

78 Since joint compression and summarization tends to pick isolated words to maximize the information coverage in the system generated summaries, it may have poor readability. [sent-269, score-0.443]

79 For compressive summaries, we removed the sentences with grammar errors when evaluating coherence. [sent-277, score-0.399]

80 The overall linguistic quality score is the combined score of the percentage of grammatically correct sentences and the correct ordering of the summary sentences. [sent-278, score-0.228]

81 First, we build a standard ILP based compressive summarizer without the non-negative score assumption. [sent-282, score-0.348]

82 The second one is a greedy approach, where extractive summarization is obtained by greedy search (i. [sent-293, score-0.311]

83 For comparison, we also include the sentence-based ILP extractive summarization results. [sent-299, score-0.503]

84 The non-negativity assumption has very little effect on the standard compressive summarization (comparing the first two rows). [sent-302, score-0.659]

85 3 Results on Test Dataset Table 3 shows the summarization results for various systems on the TAC2008 data set. [sent-320, score-0.311]

86 We show both the summarization performance and the speed of the system. [sent-321, score-0.311]

87 The presented systems include our graphcut based method, the ILP based compression and summarization, and the sentence-based extractive summarization. [sent-322, score-0.324]

88 , 2008), the compressive summarization system of Berg-Kirkpatrick et al. [sent-326, score-0.628]

89 For our graph-cut based method, to study the tradeoff between the readability of the summary and the ROUGE scores, we present two versions for this method: one uses all the constraints (C0-C3), the other does not use C0. [sent-329, score-0.23]

90 Regarding the grammar constraints used in our system, from the two results for our graph-cut based method, we can see that adding constraint C0 significantly decreases the R-2 score but improves the language quality. [sent-387, score-0.225]

91 This shows that word-based joint compression and summarization can improve the ROUGE score; however, we need to keep the linguistic quality in mind and find a tradeoff between the ROUGE score and the linguistic quality. [sent-388, score-0.474]

92 The results of our system and theirs showed that Lagrangian relaxation based methods combined with combinatorial optimization algorithms such as dynamic programming or minimum cut can exploit the inner structure of problems and achieve significant speedup over ILP. [sent-390, score-0.266]

93 6 Conclusion In this paper, we propose a fast decoding algorithm for compressive summarization using graph cuts. [sent-397, score-0.784]

94 Our idea is to approximate the original ILP problem using supermodular binary quadratic programming (SBQP) problem. [sent-398, score-0.428]

95 Under the assumption that scores of concepts are non-negative, we eliminate subtree constraints and grammar constraints, and relax the length constraint and non-supermodular part of the problem step by step. [sent-399, score-0.639]

96 The other is to adapt it to the social media text summarization task, where the text is much noisier. [sent-403, score-0.311]

97 As graph cut is a general method, applying it to other binary structured learning tasks is also an interesting direction. [sent-404, score-0.229]

98 Fast and robust compressive summarization with dual decomposition and multi-task learning. [sent-417, score-0.672]

99 Maximizing a supermodular pseudoboolean function: A polynomial algorithm for supermodular cubic functions. [sent-427, score-0.365]

100 From extractive to abstractive meeting summaries: Can it be done by sentence compression? [sent-509, score-0.236]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('compressive', 0.317), ('summarization', 0.311), ('zm', 0.303), ('ilp', 0.287), ('extractive', 0.192), ('cj', 0.189), ('supermodular', 0.166), ('zh', 0.158), ('zi', 0.143), ('compression', 0.132), ('subtree', 0.132), ('ahm', 0.13), ('concepts', 0.129), ('gillick', 0.123), ('deletion', 0.117), ('ojk', 0.116), ('sbqp', 0.116), ('relaxation', 0.111), ('concept', 0.108), ('vj', 0.105), ('summary', 0.105), ('quadratic', 0.1), ('pjk', 0.099), ('cut', 0.094), ('ah', 0.091), ('rouge', 0.091), ('constraint', 0.089), ('summaries', 0.083), ('wahm', 0.083), ('arcs', 0.079), ('lp', 0.074), ('binary', 0.071), ('arc', 0.071), ('readability', 0.071), ('xm', 0.069), ('wj', 0.069), ('lagrangian', 0.066), ('favre', 0.066), ('graph', 0.064), ('informativeness', 0.061), ('programming', 0.061), ('max', 0.06), ('compressed', 0.058), ('almeida', 0.058), ('constraints', 0.054), ('qian', 0.053), ('xh', 0.053), ('grammar', 0.051), ('convex', 0.05), ('awahmzh', 0.05), ('fconcept', 0.05), ('icsi', 0.05), ('ojkzi', 0.05), ('wahmzh', 0.05), ('objective', 0.049), ('martins', 0.047), ('integer', 0.047), ('fast', 0.046), ('decoding', 0.046), ('dual', 0.044), ('cuts', 0.044), ('sentence', 0.044), ('eliminate', 0.043), ('submodular', 0.043), ('clarke', 0.043), ('relaxed', 0.043), ('relax', 0.04), ('initialize', 0.04), ('length', 0.04), ('xian', 0.039), ('min', 0.038), ('bound', 0.038), ('tac', 0.038), ('maximization', 0.036), ('suffers', 0.035), ('solved', 0.035), ('optimal', 0.035), ('kth', 0.034), ('choose', 0.034), ('billionnet', 0.033), ('boros', 0.033), ('glpk', 0.033), ('nagano', 0.033), ('pseudoboolean', 0.033), ('sbqf', 0.033), ('sbqfs', 0.033), ('tightest', 0.033), ('eq', 0.033), ('pj', 0.033), ('appearing', 0.033), ('sentences', 0.031), ('assumption', 0.031), ('score', 0.031), ('grammatically', 0.03), ('problem', 0.03), ('lapata', 0.03), ('occurrence', 0.029), ('brandow', 0.029), ('aker', 0.029), ('chali', 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.000001 85 emnlp-2013-Fast Joint Compression and Summarization via Graph Cuts

Author: Xian Qian ; Yang Liu

Abstract: Extractive summarization typically uses sentences as summarization units. In contrast, joint compression and summarization can use smaller units such as words and phrases, resulting in summaries containing more information. The goal of compressive summarization is to find a subset of words that maximizes the total score of concepts and cut dependency arcs under the grammar constraints and the summary length constraint. We propose an efficient decoding algorithm for fast compressive summarization using graph cuts. Our approach first relaxes the length constraint using Lagrangian relaxation. Then we propose to bound the relaxed objective function by a supermodular binary quadratic programming problem, which can be solved efficiently using graph max-flow/min-cut. Since finding the tightest lower bound suffers from local optimality, we use convex relaxation for initialization. Experimental results on the TAC2008 dataset demonstrate that our method achieves a competitive ROUGE score and good readability, while being much faster than the integer linear programming (ILP) method.

2 0.36949798 65 emnlp-2013-Document Summarization via Guided Sentence Compression

Author: Chen Li ; Fei Liu ; Fuliang Weng ; Yang Liu

Abstract: Joint compression and summarization has been used recently to generate high quality summaries. However, such word-based joint optimization is computationally expensive. In this paper we adopt the ‘sentence compression + sentence selection’ pipeline approach for compressive summarization, but propose to perform summary guided compression, rather than generic sentence-based compression. To create an annotated corpus, the human annotators were asked to compress sentences while explicitly given the important summary words in the sentences. Using this corpus, we train a supervised sentence compression model using a set of word-, syntax-, and document-level features. During summarization, we use multiple compressed sentences in the integer linear programming framework to select salient summary sentences. Our results on the TAC 2008 and 2011 summarization data sets show that by incorporating the guided sentence compression model, our summarization system can yield significant performance gain as compared to the state-of-the-art.

3 0.2549414 36 emnlp-2013-Automatically Determining a Proper Length for Multi-Document Summarization: A Bayesian Nonparametric Approach

Author: Tengfei Ma ; Hiroshi Nakagawa

Abstract: Document summarization is an important task in the area of natural language processing, which aims to extract the most important information from a single document or a cluster of documents. In various summarization tasks, the summary length is manually defined. However, how to find the proper summary length is quite a problem; and keeping all summaries restricted to the same length is not always a good choice. It is obviously improper to generate summaries with the same length for two clusters of documents which contain quite different quantity of information. In this paper, we propose a Bayesian nonparametric model for multidocument summarization in order to automatically determine the proper lengths of summaries. Assuming that an original document can be reconstructed from its summary, we describe the ”reconstruction” by a Bayesian framework which selects sentences to form a good summary. Experimental results on DUC2004 data sets and some expanded data demonstrate the good quality of our summaries and the rationality of the length determination.

4 0.18336767 149 emnlp-2013-Overcoming the Lack of Parallel Data in Sentence Compression

Author: Katja Filippova ; Yasemin Altun

Abstract: A major challenge in supervised sentence compression is making use of rich feature representations because of very scarce parallel data. We address this problem and present a method to automatically build a compression corpus with hundreds of thousands of instances on which deletion-based algorithms can be trained. In our corpus, the syntactic trees of the compressions are subtrees of their uncompressed counterparts, and hence supervised systems which require a structural alignment between the input and output can be successfully trained. We also extend an existing unsupervised compression method with a learning module. The new system uses structured prediction to learn from lexical, syntactic and other features. An evaluation with human raters shows that the presented data harvesting method indeed produces a parallel corpus of high quality. Also, the supervised system trained on this corpus gets high scores both from human raters and in an automatic evaluation setting, significantly outperforming a strong baseline.
Multi-candidate reduction: Sentence compression as a tool for document summarization tasks. Information Processing & Management, Special Issue on Text Summarization, 43(6): 1549–1570.

5 0.14588808 174 emnlp-2013-Single-Document Summarization as a Tree Knapsack Problem

Author: Tsutomu Hirao ; Yasuhisa Yoshida ; Masaaki Nishino ; Norihito Yasuda ; Masaaki Nagata

Abstract: Recent studies on extractive text summarization formulate it as a combinatorial optimization problem such as a Knapsack Problem, a Maximum Coverage Problem or a Budgeted Median Problem. These methods successfully improved summarization quality, but they did not consider the rhetorical relations between the textual units of a source document. Thus, summaries generated by these methods may lack logical coherence. This paper proposes a single document summarization method based on the trimming of a discourse tree. This is a two-fold process. First, we propose rules for transforming a rhetorical structure theory-based discourse tree into a dependency-based discourse tree, which allows us to take a tree-trimming approach to summarization. Second, we formulate the problem of trimming a dependency-based discourse tree as a Tree Knapsack Problem, then solve it with integer linear programming (ILP). Evaluation results showed that our method improved ROUGE scores.
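The Tree Knapsack formulation described above can be illustrated without an ILP solver. The sketch below is not the authors' implementation; it uses a simple dynamic program over a discourse tree in which keeping a node requires keeping its parent, each node carries a length cost and a salience score, and the total cost must stay within a budget. The `children`, `cost`, `score`, and `budget` inputs are hypothetical placeholders.

```python
def trim_tree(children, root, cost, score, budget):
    """Best score of a rooted subtree (every kept node keeps its parent)
    whose total cost is at most `budget`. A sketch of the Tree Knapsack
    idea; the paper itself solves this with ILP.

    children: dict node -> list of child nodes
    cost, score: dicts node -> int / float
    """
    NEG = float("-inf")

    def solve(u):
        # table[b] = best score of a subtree rooted at u (u included)
        # using total cost at most b
        table = [NEG] * (budget + 1)
        if cost[u] > budget:
            return table
        for b in range(cost[u], budget + 1):
            table[b] = score[u]
        for v in children.get(u, []):
            sub = solve(v)            # recursion is fine for shallow discourse trees
            merged = table[:]         # copying keeps the "skip child v" option
            for b in range(budget + 1):
                if table[b] == NEG:
                    continue
                for extra in range(budget - b + 1):
                    if sub[extra] == NEG:
                        continue
                    if table[b] + sub[extra] > merged[b + extra]:
                        merged[b + extra] = table[b] + sub[extra]
            table = merged
        return table

    return max(solve(root))
```

For instance, with children = {0: [1, 2], 1: [3]}, costs {0: 2, 1: 3, 2: 4, 3: 1}, scores {0: 1.0, 1: 2.0, 2: 0.5, 3: 3.0} and budget 6, the call returns 6.0, which corresponds to keeping nodes 0, 1 and 3.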

6 0.12672411 179 emnlp-2013-Summarizing Complex Events: a Cross-Modal Solution of Storylines Extraction and Reconstruction

7 0.11935897 198 emnlp-2013-Using Soft Constraints in Joint Inference for Clinical Concept Recognition

8 0.11866799 48 emnlp-2013-Collective Personal Profile Summarization with Social Networks

9 0.1148776 5 emnlp-2013-A Discourse-Driven Content Model for Summarising Scientific Articles Evaluated in a Complex Question Answering Task

10 0.098919846 147 emnlp-2013-Optimized Event Storyline Generation based on Mixture-Event-Aspect Model

11 0.093874425 145 emnlp-2013-Optimal Beam Search for Machine Translation

12 0.074078038 2 emnlp-2013-A Convex Alternative to IBM Model 2

13 0.05929327 15 emnlp-2013-A Systematic Exploration of Diversity in Machine Translation

14 0.058812328 53 emnlp-2013-Cross-Lingual Discriminative Learning of Sequence Models with Posterior Regularization

15 0.056823038 195 emnlp-2013-Unsupervised Spectral Learning of WCFG as Low-rank Matrix Completion

16 0.055193137 194 emnlp-2013-Unsupervised Relation Extraction with General Domain Knowledge

17 0.054509725 160 emnlp-2013-Relational Inference for Wikification

18 0.054054208 114 emnlp-2013-Joint Learning and Inference for Grammatical Error Correction

19 0.051353533 50 emnlp-2013-Combining PCFG-LA Models with Dual Decomposition: A Case Study with Function Labels and Binarization

20 0.049401335 97 emnlp-2013-Identifying Web Search Query Reformulation using Concept based Matching


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.189), (1, 0.029), (2, -0.026), (3, 0.209), (4, -0.121), (5, -0.045), (6, 0.467), (7, -0.248), (8, 0.085), (9, 0.022), (10, 0.102), (11, -0.012), (12, 0.117), (13, -0.023), (14, -0.025), (15, -0.09), (16, 0.004), (17, -0.062), (18, -0.006), (19, 0.02), (20, -0.006), (21, 0.033), (22, -0.006), (23, -0.021), (24, -0.003), (25, -0.01), (26, -0.031), (27, -0.021), (28, 0.019), (29, 0.028), (30, -0.058), (31, 0.019), (32, -0.01), (33, -0.037), (34, -0.017), (35, 0.048), (36, 0.034), (37, 0.014), (38, -0.001), (39, 0.035), (40, 0.008), (41, 0.0), (42, -0.011), (43, 0.04), (44, 0.003), (45, -0.063), (46, -0.017), (47, -0.018), (48, -0.074), (49, -0.033)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96643895 85 emnlp-2013-Fast Joint Compression and Summarization via Graph Cuts

Author: Xian Qian ; Yang Liu

Abstract: Extractive summarization typically uses sentences as summarization units. In contrast, joint compression and summarization can use smaller units such as words and phrases, resulting in summaries containing more information. The goal of compressive summarization is to find a subset of words that maximize the total score of concepts and cutting dependency arcs under the grammar constraints and summary length constraint. We propose an efficient decoding algorithm for fast compressive summarization using graph cuts. Our approach first relaxes the length constraint using Lagrangian relaxation. Then we propose to bound the relaxed objective function by the supermodular binary quadratic programming problem, which can be solved efficiently using graph max-flow/min-cut. Since finding the tightest lower bound suffers from local optimality, we use convex relaxation for initialization. Experimental results on TAC2008 dataset demonstrate our method achieves competitive ROUGE score and has good readability, while is much faster than the integer linear programming (ILP) method.
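As a rough illustration of the decoding strategy in this abstract, the snippet below shows only the Lagrangian-relaxation step for the length constraint: the budget is dualized into a per-unit length penalty, and the multiplier is updated by projected subgradient ascent. The inner unconstrained maximizer, which the paper bounds with a supermodular quadratic program solved by graph max-flow/min-cut, is left as a plug-in callable here; all argument names are hypothetical, and this is a sketch rather than the paper's code.

```python
def relax_length_constraint(solve_inner, score, lengths, budget,
                            step=1.0, iters=100):
    """Sketch of dualizing the summary-length constraint.

    solve_inner(penalties) -> 0/1 selection list maximizing the
        unconstrained objective minus sum_i penalties[i] * x[i]
        (the part the paper handles with a supermodular bound and min-cut).
    score(x) -> value of the original objective for selection x.
    """
    lam = 0.0
    best_x, best_val = None, float("-inf")
    for t in range(1, iters + 1):
        x = solve_inner([lam * l for l in lengths])
        used = sum(l for l, keep in zip(lengths, x) if keep)
        if used <= budget and score(x) > best_val:
            best_x, best_val = x, score(x)   # keep the best feasible solution
        # Projected subgradient step: raise the penalty when the selection
        # is too long, lower it (but never below zero) otherwise.
        lam = max(0.0, lam + (step / t) * (used - budget))
    return best_x, best_val
```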

2 0.88455224 65 emnlp-2013-Document Summarization via Guided Sentence Compression

Author: Chen Li ; Fei Liu ; Fuliang Weng ; Yang Liu

Abstract: Joint compression and summarization has been used recently to generate high quality summaries. However, such word-based joint optimization is computationally expensive. In this paper we adopt the ‘sentence compression + sentence selection’ pipeline approach for compressive summarization, but propose to perform summary guided compression, rather than generic sentence-based compression. To create an annotated corpus, the human annotators were asked to compress sentences while explicitly given the important summary words in the sentences. Using this corpus, we train a supervised sentence compression model using a set of word-, syntax-, and document-level features. During summarization, we use multiple compressed sentences in the integer linear programming framework to select salient summary sentences. Our results on the TAC 2008 and 2011 summarization data sets show that by incorporating the guided sentence compression model, our summarization system can yield significant performance gain as compared to the state-of-the-art.
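The selection side of such a pipeline is typically a concept-coverage ILP over candidate compressions under a length budget. Below is a minimal sketch of that selection step, not the paper's system, assuming the PuLP package is available; the candidate list, integer concept ids, and concept weights are hypothetical inputs.

```python
import pulp
from collections import defaultdict

def select_compressions(candidates, concept_weight, budget):
    """ILP selection of compressed sentences under a length budget (a sketch).

    candidates: list of (sentence_id, length, set_of_concept_ids), with
        integer concept ids; concept_weight: dict concept_id -> weight.
    """
    prob = pulp.LpProblem("compressive_summary", pulp.LpMaximize)
    x = [pulp.LpVariable("x_%d" % i, cat="Binary") for i in range(len(candidates))]
    concept_ids = sorted({c for _, _, cs in candidates for c in cs})
    y = {c: pulp.LpVariable("y_%d" % c, cat="Binary") for c in concept_ids}

    # Objective: total weight of covered concepts.
    prob += pulp.lpSum(concept_weight.get(c, 0.0) * y[c] for c in concept_ids)
    # Length budget over the selected compressions.
    prob += pulp.lpSum(length * x[i] for i, (_, length, _) in enumerate(candidates)) <= budget
    # A concept counts as covered only if some selected candidate contains it.
    for c in concept_ids:
        prob += y[c] <= pulp.lpSum(x[i] for i, (_, _, cs) in enumerate(candidates) if c in cs)
    # Keep at most one compression per source sentence.
    by_sentence = defaultdict(list)
    for i, (sid, _, _) in enumerate(candidates):
        by_sentence[sid].append(x[i])
    for xs in by_sentence.values():
        prob += pulp.lpSum(xs) <= 1

    prob.solve()
    return [i for i, xi in enumerate(x) if xi.value() > 0.5]
```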

3 0.79032022 36 emnlp-2013-Automatically Determining a Proper Length for Multi-Document Summarization: A Bayesian Nonparametric Approach

Author: Tengfei Ma ; Hiroshi Nakagawa

Abstract: Document summarization is an important task in the area of natural language processing, which aims to extract the most important information from a single document or a cluster of documents. In various summarization tasks, the summary length is manually defined. However, how to find the proper summary length is quite a problem; and keeping all summaries restricted to the same length is not always a good choice. It is obviously improper to generate summaries with the same length for two clusters of documents which contain quite different quantity of information. In this paper, we propose a Bayesian nonparametric model for multidocument summarization in order to automatically determine the proper lengths of summaries. Assuming that an original document can be reconstructed from its summary, we describe the ”reconstruction” by a Bayesian framework which selects sentences to form a good summary. Experimental results on DUC2004 data sets and some expanded data demonstrate the good quality of our summaries and the rationality of the length determination.

4 0.68272871 149 emnlp-2013-Overcoming the Lack of Parallel Data in Sentence Compression

Author: Katja Filippova ; Yasemin Altun

Abstract: A major challenge in supervised sentence compression is making use of rich feature representations because of very scarce parallel data. We address this problem and present a method to automatically build a compression corpus with hundreds of thousands of instances on which deletion-based algorithms can be trained. In our corpus, the syntactic trees of the compressions are subtrees of their uncompressed counterparts, and hence supervised systems which require a structural alignment between the input and output can be successfully trained. We also extend an existing unsupervised compression method with a learning module. The new system uses structured prediction to learn from lexical, syntactic and other features. An evaluation with human raters shows that the presented data harvesting method indeed produces a parallel corpus of high quality. Also, the supervised system trained on this corpus gets high scores both from human raters and in an automatic evaluation setting, significantly outperforming a strong baseline.
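A compact way to picture the learning component described in this abstract is a structured-perceptron pass over dependency edges: decode a compression under the current weights, then move weight mass from spurious edges toward missed oracle edges. The sketch below is an illustration under assumed interfaces (feats, decode, and the data structures are hypothetical), not the authors' code.

```python
from collections import defaultdict

def perceptron_epoch(examples, feats, decode, w=None):
    """One structured-perceptron pass for edge-deletion compression (a sketch).

    examples: list of (edges, gold_edges) pairs, where gold_edges is the
        set of edges kept in the oracle compression.
    feats(e) -> dict of sparse feature values for edge e.
    decode(edges, w, budget) -> set of kept edges under weights w.
    """
    w = defaultdict(float) if w is None else w
    for edges, gold in examples:
        pred = decode(edges, w, budget=len(gold))   # same size as the oracle
        if pred == gold:
            continue
        for e in gold - pred:                       # reward missed gold edges
            for name, val in feats(e).items():
                w[name] += val
        for e in pred - gold:                       # penalize spurious edges
            for name, val in feats(e).items():
                w[name] -= val
    return w
```

Averaging the weight vector across updates, as is standard for the perceptron, would be a small extension of this loop.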

5 0.5993526 48 emnlp-2013-Collective Personal Profile Summarization with Social Networks

Author: Zhongqing Wang ; Shoushan LI ; Fang Kong ; Guodong Zhou

Abstract: Personal profile information on social media like LinkedIn.com and Facebook.com is at the core of many interesting applications, such as talent recommendation and contextual advertising. However, personal profiles usually lack organization confronted with the large amount of available information. Therefore, it is always a challenge for people to find desired information from them. In this paper, we address the task of personal profile summarization by leveraging both personal profile textual information and social networks. Here, using social networks is motivated by the intuition that, people with similar academic, business or social connections (e.g. co-major, co-university, and co-corporation) tend to have similar experience and summaries. To achieve the learning process, we propose a collective factor graph (CoFG) model to incorporate all these resources of knowledge to summarize personal profiles with local textual attribute functions and social connection factors. Extensive evaluation on a large-scale dataset from LinkedIn.com demonstrates the effectiveness of the proposed approach.

6 0.51555049 174 emnlp-2013-Single-Document Summarization as a Tree Knapsack Problem

7 0.51367038 5 emnlp-2013-A Discourse-Driven Content Model for Summarising Scientific Articles Evaluated in a Complex Question Answering Task

8 0.44832245 198 emnlp-2013-Using Soft Constraints in Joint Inference for Clinical Concept Recognition

9 0.43824276 179 emnlp-2013-Summarizing Complex Events: a Cross-Modal Solution of Storylines Extraction and Reconstruction

10 0.41932636 147 emnlp-2013-Optimized Event Storyline Generation based on Mixture-Event-Aspect Model

11 0.36232743 195 emnlp-2013-Unsupervised Spectral Learning of WCFG as Low-rank Matrix Completion

12 0.29515868 2 emnlp-2013-A Convex Alternative to IBM Model 2

13 0.26454172 114 emnlp-2013-Joint Learning and Inference for Grammatical Error Correction

14 0.24727784 145 emnlp-2013-Optimal Beam Search for Machine Translation

15 0.24418046 53 emnlp-2013-Cross-Lingual Discriminative Learning of Sequence Models with Posterior Regularization

16 0.23762903 86 emnlp-2013-Feature Noising for Log-Linear Structured Prediction

17 0.23492873 50 emnlp-2013-Combining PCFG-LA Models with Dual Decomposition: A Case Study with Function Labels and Binarization

18 0.22713496 40 emnlp-2013-Breaking Out of Local Optima with Count Transforms and Model Recombination: A Study in Grammar Induction

19 0.21904865 184 emnlp-2013-This Text Has the Scent of Starbucks: A Laplacian Structured Sparsity Model for Computational Branding Analytics

20 0.21724404 176 emnlp-2013-Structured Penalties for Log-Linear Language Models


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.037), (18, 0.034), (22, 0.04), (26, 0.027), (30, 0.077), (50, 0.016), (51, 0.125), (66, 0.03), (71, 0.018), (75, 0.027), (77, 0.014), (95, 0.455), (96, 0.018)]
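The table above gives this paper's LDA topic distribution as (topicId, topicWeight) pairs. The sketch below shows one way such sparse topic vectors could be compared to rank the similar papers listed next; cosine similarity is an assumption here (the pipeline does not state its measure), and the second topic vector is invented for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse topic vectors given as {topicId: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# This paper's topic weights, copied from the table above.
this_paper = {3: 0.037, 18: 0.034, 22: 0.040, 26: 0.027, 30: 0.077, 50: 0.016,
              51: 0.125, 66: 0.030, 71: 0.018, 75: 0.027, 77: 0.014, 95: 0.455, 96: 0.018}
# A hypothetical other paper's topic weights (made up for the example).
other_paper = {30: 0.05, 51: 0.20, 95: 0.40, 96: 0.02}

print(round(cosine(this_paper, other_paper), 4))
```

Ranking every candidate paper by this score and sorting in descending order would produce a list in the same (simIndex, simValue) format shown below.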

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.81654876 85 emnlp-2013-Fast Joint Compression and Summarization via Graph Cuts

Author: Xian Qian ; Yang Liu

Abstract: Extractive summarization typically uses sentences as summarization units. In contrast, joint compression and summarization can use smaller units such as words and phrases, resulting in summaries containing more information. The goal of compressive summarization is to find a subset of words that maximize the total score of concepts and cutting dependency arcs under the grammar constraints and summary length constraint. We propose an efficient decoding algorithm for fast compressive summarization using graph cuts. Our approach first relaxes the length constraint using Lagrangian relaxation. Then we propose to bound the relaxed objective function by the supermodular binary quadratic programming problem, which can be solved efficiently using graph max-flow/min-cut. Since finding the tightest lower bound suffers from local optimality, we use convex relaxation for initialization. Experimental results on TAC2008 dataset demonstrate our method achieves competitive ROUGE score and has good readability, while is much faster than the integer linear programming (ILP) method.
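The abstract above reduces decoding to maximizing a supermodular binary quadratic objective with graph max-flow/min-cut. Below is a minimal, self-contained sketch of that reduction, not the authors' implementation: maximizing a supermodular objective is recast as minimizing a submodular energy, which is then encoded as an s-t cut. networkx is assumed to be available and the toy scores are invented.

```python
import networkx as nx

def maximize_supermodular_bqp(unary, pairwise):
    """Maximize sum_i a_i*x_i + sum_{i<j} b_ij*x_i*x_j over x in {0,1}^n,
    with every pairwise weight b_ij >= 0 (supermodular), via s-t min-cut.

    unary:    dict node -> a_i (every node must appear here; use 0.0 if no individual score)
    pairwise: dict (i, j) -> b_ij with b_ij >= 0
    Returns (optimal objective value, assignment dict).
    """
    # Maximizing the supermodular objective equals minimizing the submodular
    # energy E(x) = sum_i (-a_i) x_i + sum_{ij} (-b_ij) x_i x_j.
    u = {i: -a for i, a in unary.items()}
    const = 0.0
    G = nx.DiGraph()

    def add_cap(a, b, c):
        # Accumulate capacity on arc a->b; skip non-positive capacities.
        if c <= 0:
            return
        if G.has_edge(a, b):
            G[a][b]["capacity"] += c
        else:
            G.add_edge(a, b, capacity=c)

    # Pairwise term w*x_i*x_j with w = -b_ij <= 0:
    #   w*x_i*x_j = w*x_j + (-w)*(1 - x_i)*x_j
    # i.e. shift w onto node j's unary cost and add an i->j arc of capacity -w.
    for (i, j), b in pairwise.items():
        w = -b
        u[j] = u.get(j, 0.0) + w
        add_cap(i, j, -w)

    # Unary terms: a positive cost when x_i = 1 becomes an s->i arc;
    # a negative cost is paid as a constant, charging |cost| when x_i = 0 (i->t arc).
    for i, c in u.items():
        if c >= 0:
            add_cap("s", i, c)
        else:
            add_cap(i, "t", -c)
            const += c

    cut_value, (source_side, sink_side) = nx.minimum_cut(G, "s", "t")
    assignment = {i: int(i in sink_side) for i in unary}   # sink side <-> x_i = 1
    best_energy = cut_value + const                        # minimum of E(x)
    return -best_energy, assignment                        # maximum of the original objective

# Toy example (scores invented): three words with individual scores and one
# supermodular interaction rewarding keeping words 0 and 1 together.
value, x = maximize_supermodular_bqp({0: 1.0, 1: -0.5, 2: -2.0}, {(0, 1): 1.0})
print(value, x)
```

On this toy instance the optimum keeps words 0 and 1 and drops word 2, with objective value 1.5; the cut is computed in low-order polynomial time, which is the efficiency argument made in the abstract.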

2 0.72033679 1 emnlp-2013-A Constrained Latent Variable Model for Coreference Resolution

Author: Kai-Wei Chang ; Rajhans Samdani ; Dan Roth

Abstract: Coreference resolution is a well-known clustering task in Natural Language Processing. In this paper, we describe the Latent Left Linking model (L3M), a novel, principled, and linguistically motivated latent structured prediction approach to coreference resolution. We show that L3M admits efficient inference and can be augmented with knowledge-based constraints; we also present a fast stochastic gradient-based learning procedure. Experiments on ACE and Ontonotes data show that L3M and its constrained version, CL3M, are more accurate than several state-of-the-art approaches as well as some structured prediction models proposed in the literature.
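As a rough illustration of the left-linking idea (a sketch only, not the authors' L3M training or constrained inference), the decoder below links each mention to its best-scoring antecedent on its left, with a self-link meaning "start a new cluster"; the scoring function is assumed to come from some trained model and the toy scores are invented.

```python
def left_link_decode(n_mentions, score):
    """Greedy best-left-link decoding for coreference.

    score(i, j) is the compatibility of mention j linking to an earlier
    mention i (i < j); score(j, j) is the score of j starting a new cluster.
    Returns a cluster id for each mention.
    """
    cluster_of = {}
    next_cluster = 0
    clusters = []
    for j in range(n_mentions):
        # Candidates: every mention to the left of j, plus j itself (= new cluster).
        best = max(range(j + 1), key=lambda i: score(i, j))
        if best == j:
            cluster_of[j] = next_cluster
            next_cluster += 1
        else:
            cluster_of[j] = cluster_of[best]
        clusters.append(cluster_of[j])
    return clusters

# Toy scores (invented): mention 2 prefers linking back to mention 0.
toy = {(0, 0): 0.0, (1, 1): 0.0, (0, 1): -1.0, (2, 2): 0.0, (0, 2): 2.0, (1, 2): 0.5}
print(left_link_decode(3, lambda i, j: toy[(i, j)]))   # -> [0, 1, 0]
```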

3 0.52444059 65 emnlp-2013-Document Summarization via Guided Sentence Compression

Author: Chen Li ; Fei Liu ; Fuliang Weng ; Yang Liu

Abstract: Joint compression and summarization has been used recently to generate high quality summaries. However, such word-based joint optimization is computationally expensive. In this paper we adopt the ‘sentence compression + sentence selection’ pipeline approach for compressive summarization, but propose to perform summary guided compression, rather than generic sentence-based compression. To create an annotated corpus, the human annotators were asked to compress sentences while explicitly given the important summary words in the sentences. Using this corpus, we train a supervised sentence compression model using a set of word-, syntax-, and document-level features. During summarization, we use multiple compressed sentences in the integer linear programming framework to select salient summary sentences. Our results on the TAC 2008 and 2011 summarization data sets show that by incorporating the guided sentence compression model, our summarization system can yield significant performance gain as compared to the state-of-the-art.
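A compact sketch of the kind of ILP selection step this abstract describes, not the authors' exact model: pick at most one compression per source sentence, stay within a length budget, and maximize the total weight of covered concepts. PuLP and its bundled CBC solver are assumed to be available; candidates, lengths, and concept weights are invented.

```python
import pulp

# Invented toy data: each source sentence has candidate compressions,
# each with a length and a set of concept (e.g. bigram) ids it covers.
candidates = {
    ("s1", "c1"): {"len": 12, "concepts": {"a", "b"}},
    ("s1", "c2"): {"len": 7,  "concepts": {"a"}},
    ("s2", "c1"): {"len": 10, "concepts": {"b", "c"}},
}
concept_weight = {"a": 3.0, "b": 2.0, "c": 1.0}
budget = 20

prob = pulp.LpProblem("compressive_summarization", pulp.LpMaximize)
x = {k: pulp.LpVariable(f"x_{k[0]}_{k[1]}", cat="Binary") for k in candidates}
y = {c: pulp.LpVariable(f"y_{c}", cat="Binary") for c in concept_weight}

# Objective: total weight of covered concepts.
prob += pulp.lpSum(concept_weight[c] * y[c] for c in concept_weight)

# Summary length budget.
prob += pulp.lpSum(candidates[k]["len"] * x[k] for k in candidates) <= budget

# At most one compression per source sentence.
for s in {k[0] for k in candidates}:
    prob += pulp.lpSum(x[k] for k in candidates if k[0] == s) <= 1

# A concept counts as covered only if some selected compression contains it.
for c in concept_weight:
    prob += y[c] <= pulp.lpSum(x[k] for k in candidates if c in candidates[k]["concepts"])

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print([k for k in candidates if x[k].value() == 1])
```

On this toy instance the solver picks the short compression of s1 plus the s2 compression (total length 17, covering all three concepts), rather than the longer s1 compression that would exceed the budget when combined with s2.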

4 0.51690614 110 emnlp-2013-Joint Bootstrapping of Corpus Annotations and Entity Types

Author: Hrushikesh Mohapatra ; Siddhanth Jain ; Soumen Chakrabarti

Abstract: Web search can be enhanced in powerful ways if token spans in Web text are annotated with disambiguated entities from large catalogs like Freebase. Entity annotators need to be trained on sample mention snippets. Wikipedia entities and annotated pages offer high-quality labeled data for training and evaluation. Unfortunately, Wikipedia features only one-ninth as many entities as Freebase, and these are a highly biased sample of well-connected, frequently mentioned “head” entities. To bring hope to “tail” entities, we broaden our goal to a second task: assigning types to entities in Freebase but not Wikipedia. The two tasks are synergistic: knowing the types of unfamiliar entities helps disambiguate mentions, and words in mention contexts help assign types to entities. We present TMI, a bipartite graphical model for joint type-mention inference. TMI attempts no schema integration or entity resolution, but exploits the above-mentioned synergy. In experiments involving 780,000 people in Wikipedia, 2.3 million people in Freebase, 700 million Web pages, and over 20 professional editors, TMI shows considerable annotation accuracy improvement (e.g., 70%) compared to baselines (e.g., 46%), especially for “tail” and emerging entities. We also compare with Google’s recent annotations of the same corpus with Freebase entities, and report considerable improvements within the people domain.

5 0.48051679 198 emnlp-2013-Using Soft Constraints in Joint Inference for Clinical Concept Recognition

Author: Prateek Jindal ; Dan Roth

Abstract: This paper introduces IQPs (Integer Quadratic Programs) as a way to model joint inference for the task of concept recognition in the clinical domain. IQPs make it possible to easily incorporate soft constraints in the optimization framework and still support exact global inference. We show that soft constraints give statistically significant performance improvements when compared to hard constraints.
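To make the hard-versus-soft distinction concrete, here is a generic sketch, not the paper's clinical IQP: a soft constraint is encoded with a slack variable whose use is penalized in the objective rather than forbidden outright, so the solver may violate it when the gain outweighs the penalty. PuLP is assumed to be available and all numbers are invented.

```python
import pulp

prob = pulp.LpProblem("soft_constraint_demo", pulp.LpMaximize)
x = [pulp.LpVariable(f"x{i}", cat="Binary") for i in range(3)]
gain = [4.0, 3.0, 2.0]

# Soft constraint "select at most one of x0, x1": instead of forbidding a
# violation, introduce a slack variable and charge a penalty rho for using it.
rho = 2.5
slack = pulp.LpVariable("slack", lowBound=0)
prob += pulp.lpSum(g * xi for g, xi in zip(gain, x)) - rho * slack
prob += x[0] + x[1] <= 1 + slack

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print([xi.value() for xi in x], slack.value())
```

Here the combined gain of selecting both x0 and x1 exceeds the penalty, so the soft constraint is violated (slack = 1); raising rho high enough recovers the hard-constraint behaviour.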

6 0.40403998 36 emnlp-2013-Automatically Determining a Proper Length for Multi-Document Summarization: A Bayesian Nonparametric Approach

7 0.39365837 114 emnlp-2013-Joint Learning and Inference for Grammatical Error Correction

8 0.38567862 149 emnlp-2013-Overcoming the Lack of Parallel Data in Sentence Compression

9 0.37925425 174 emnlp-2013-Single-Document Summarization as a Tree Knapsack Problem

10 0.37226897 48 emnlp-2013-Collective Personal Profile Summarization with Social Networks

11 0.3674913 195 emnlp-2013-Unsupervised Spectral Learning of WCFG as Low-rank Matrix Completion

12 0.36061278 160 emnlp-2013-Relational Inference for Wikification

13 0.35657132 50 emnlp-2013-Combining PCFG-LA Models with Dual Decomposition: A Case Study with Function Labels and Binarization

14 0.35630944 179 emnlp-2013-Summarizing Complex Events: a Cross-Modal Solution of Storylines Extraction and Reconstruction

15 0.35624188 69 emnlp-2013-Efficient Collective Entity Linking with Stacking

16 0.35598814 86 emnlp-2013-Feature Noising for Log-Linear Structured Prediction

17 0.35528499 194 emnlp-2013-Unsupervised Relation Extraction with General Domain Knowledge

18 0.35023955 146 emnlp-2013-Optimal Incremental Parsing via Best-First Dynamic Programming

19 0.34912229 89 emnlp-2013-Gender Inference of Twitter Users in Non-English Contexts

20 0.34237057 3 emnlp-2013-A Corpus Level MIRA Tuning Strategy for Machine Translation