emnlp emnlp2013 emnlp2013-174 knowledge-graph by maker-knowledge-mining

174 emnlp-2013-Single-Document Summarization as a Tree Knapsack Problem

Source: pdf

Author: Tsutomu Hirao ; Yasuhisa Yoshida ; Masaaki Nishino ; Norihito Yasuda ; Masaaki Nagata

Abstract: Recent studies on extractive text summarization formulate it as a combinatorial optimization problem such as a Knapsack Problem, a Maximum Coverage Problem or a Budgeted Median Problem. These methods successfully improved summarization quality, but they did not consider the rhetorical relations between the textual units of a source document. Thus, summaries generated by these methods may lack logical coherence. This paper proposes a single document summarization method based on the trimming of a discourse tree. This is a two-fold process. First, we propose rules for transforming a rhetorical structure theory-based discourse tree into a dependency-based discourse tree, which allows us to take a tree-trimming approach to summarization. Second, we formulate the problem of trimming a dependency-based discourse tree as a Tree Knapsack Problem, then solve it with integer linear programming (ILP). Evaluation results showed that our method improved ROUGE scores.

Reference: text

Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 These methods successfully improved summarization quality, but they did not consider the rhetorical relations between the textual units of a source document. [sent-14, score-0.52]

2 Thus, summaries generated by these methods may lack logical coherence. [sent-15, score-0.089]

3 This paper proposes a single document summarization method based on the trimming of a discourse tree. [sent-16, score-0.583]

4 First, we propose rules for transforming a rhetorical structure theory-based discourse tree into a dependency-based discourse tree, which allows us to take a tree-trimming approach to summarization. [sent-18, score-0.7]

5 Second, we formulate the problem of trimming a dependency-based discourse tree as a Tree Knapsack Problem, then solve it with integer linear programming (ILP). [sent-20, score-0.549]

6 1 Introduction State-of-the-art extractive text summarization methods regard a document (or a document set) as a set of textual units (e. [sent-22, score-0.549]

7 sentences, clauses, phrases) and formulate summarization as a combinatorial optimization problem, i. [sent-24, score-0.377]

8 selecting a subset of the set of textual units that maximizes an objective without violating a length constraint. [sent-26, score-0.204]

9 For example, McDonald (2007) formulated text summarization as a Knapsack Problem, where he selects a set of textual [sent-27, score-0.295]

10 units that maximize the sum of significance scores of each unit. [sent-30, score-0.107]

11 (2004) proposed a summarization method based on a Maximum Coverage Problem, in which they select a set of textual units that maximizes the weighted sum of the conceptual units (e. [sent-32, score-0.509]

12 (2009b) regarded summarization as a Budgeted Median Problem and obtained exact solutions with integer linear programming. [sent-38, score-0.198]

13 Since these methods are based on subset selection, the summaries they generate cannot preserve the rhetorical structure of the textual units of a source document. [sent-40, score-0.411]

14 Thus, the resulting summary may lack coherence and may not include significant textual units from a source document. [sent-41, score-0.25]

15 One potentially powerful way to overcome this problem is to include discourse tree constraints in the summarization procedure. [sent-42, score-0.537]

16 Marcu (1998) regarded a document as a Rhetorical Structure Theory (RST) (Mann and Thompson, 1988)-based discourse tree (RST-DT) and selected textual units according to a preference ranking derived from the tree structure to make a summary. [sent-43, score-0.745]

17 (2002) proposed a document compression method that directly models the probability of a summary given an RST-DT by using a noisy-channel model. [sent-45, score-0.131]

18 These methods generate well-organized summaries; however, since they do not formulate summarization as combinatorial (page footer: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, Washington, USA, 18-21 October 2013) [sent-46, score-0.149]

19 optimization problems, the optimality of the generated summaries is not guaranteed. [sent-61, score-0.089]

20 In this paper, we propose a single document summarization method based on the trimming of a discourse tree, formulated as a Tree Knapsack Problem. [sent-62, score-0.529]

21 If a discourse tree explicitly represents parent-child relationships between textual units, we can apply the well-known tree-trimming approach to a discourse tree and reap the benefit of combinatorial optimization methods. [sent-63, score-0.945]

22 In other words, to apply the tree-trimming approach, we need a tree in which every node represents a textual unit. [sent-64, score-0.268]

23 Unfortunately, the RST-DT does not allow this, because textual units in the RST-DT are located only on leaf nodes, and parent-child relationships between textual units are represented only implicitly at higher positions in the tree. [sent-65, score-0.408]

24 Therefore, we first propose rules that transform an RST-DT into a dependency-based discourse tree (DEP-DT) that explicitly defines the parent-child relationships. [sent-66, score-0.366]

25 Second, we treat it as a rooted subtree selection, in other words, a Tree Knapsack Problem and formulate the problem as an ILP. [sent-67, score-0.256]

26 1 RST-DT According to RST, a document is represented as an RST-DT whose terminal nodes correspond to elementary discourse units (EDUs) and whose non-terminal nodes indicate the role of the contiguous (Footnote 1: EDUs roughly correspond to clauses.) [sent-69, score-0.414]

27 A nucleus is more important than a satellite in terms of the writer’s purpose. [sent-72, score-0.172]

28 That is, a satellite is a child of a nucleus in the RST-DT. [sent-73, score-0.227]

29 Some discourse relations such as ‘Elaboration’, ‘Contrast’ and ‘Evidence’ between a nucleus and a satellite or two nuclei are defined. [sent-74, score-0.367]

30 2 DEP-DT An RST-DT is not suitable for tree trimming because it does not always explicitly define parent-child relationships between textual units. [sent-77, score-0.442]

31 For example, if we consider how to trim the RST-DT in Figure 1, when we drop e8, we have to drop e7 because of the parent-child relationship defined between e7 and e8, i. [sent-78, score-0.082]

32 e7 is a satellite (child) of the nucleus (parent) e8. [sent-80, score-0.172]

33 On the other hand, we cannot judge whether we have to drop e9 or e10 because the parent-child relationships are not explicitly defined between e8 and e9, e8 and e10. [sent-81, score-0.11]

34 This view motivates us to produce a discourse tree that explicitly defines parent-child relationships and whose root node represents the most important EDU in a source document. [sent-82, score-0.485]

35 If we can obtain such a tree, it is easy to formulate summarization as a Tree Knapsack Problem. [sent-83, score-0.276]

36 To construct a discourse tree that represents the parent-child relationships between EDUs, we propose rules for transforming an RST-DT to a dependency-based discourse tree (DEP-DT). [sent-84, score-0.768]

37 For each non-terminal node excluding the par- Figure 3: The DEP-DT obtained from the RST-DT in Figure 1. [sent-86, score-0.05]

38 Here, a ‘head’ of a non-terminal node is the leftmost descendant EDU whose parent is N. [sent-88, score-0.14]

39 For each EDU whose parent is N, we pick the nearest S with a ‘head’ from the EDU’s ancestors and we add the EDU to the DEP-DT as a child of the head of the S’s parent. [sent-91, score-0.312]

40 If there is no nearest S, the EDU is the root of the DEP-DT. [sent-92, score-0.058]

41 For example, in Figure 2, the nearest S to e3 that has a head is node 5 and the head of node 5’s parent is e2. [sent-93, score-0.369]

42 For each EDU whose parent is S, we pick the nearest non-terminal with a ‘head’ from the ancestors and we add the EDU to the DEP-DT as a child of the head of the non-terminal node. [sent-96, score-0.312]

43 For example, the nearest non-terminal node of e9 that has a head is node 16 and the head of node 16 is e10. [sent-97, score-0.356]

44 Therefore, we have to drop e7, e9 and e10 when we drop e8. [sent-101, score-0.082]

45 Note that, by applying the rules, discourse relations defined between non-terminals of an RST-DT are eliminated. [sent-102, score-0.195]

46 However, we believe that these relations are not needed for the summarization that we are attempting to realize. [sent-103, score-0.198]
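The transformation rules listed above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the `Node` class, the merging of each EDU leaf with its N/S parent into one object, and the helper names are all hypothetical. The first rule is truncated in this extract, so the sketch assumes that the nodes excluded from head assignment are the direct parents of EDUs. It follows the rules literally and has only been exercised on the small nucleus-first toy tree below.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """Toy RST-DT node. `label` is 'N', 'S' or 'Root'.

    Hypothetical encoding: a node with `edu` set stands for the N/S parent
    of that EDU (leaf and pre-terminal are merged into one object).
    """
    label: str
    children: List["Node"] = field(default_factory=list)
    edu: Optional[str] = None
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)


def link(node: Node) -> Node:
    """Fill in parent pointers for the whole tree and return its root."""
    for child in node.children:
        child.parent = node
        link(child)
    return node


def edus_in_order(node: Node) -> List[Node]:
    """Pre-terminal nodes under `node`, in left-to-right order."""
    if node.edu is not None:
        return [node]
    return [leaf for child in node.children for leaf in edus_in_order(child)]


def head(node: Node) -> Optional[str]:
    """Leftmost descendant EDU whose parent is N (assumed: none for pre-terminals)."""
    if node.edu is not None:
        return None
    for leaf in edus_in_order(node):
        if leaf.label == "N":
            return leaf.edu
    return None


def to_dep_dt(root: Node) -> Dict[str, Optional[str]]:
    """Map each EDU to its DEP-DT parent EDU; the DEP-DT root maps to None."""
    dep: Dict[str, Optional[str]] = {}
    for leaf in edus_in_order(root):
        anc = leaf.parent
        if leaf.label == "N":
            # Nearest ancestor labelled S that has a head; attach to the
            # head of that S node's parent. No such S means the EDU is the root.
            while anc is not None and not (anc.label == "S" and head(anc) is not None):
                anc = anc.parent
            dep[leaf.edu] = head(anc.parent) if anc is not None else None
        else:
            # Nearest non-terminal ancestor that has a head; attach to that head.
            while anc is not None and head(anc) is None:
                anc = anc.parent
            dep[leaf.edu] = head(anc) if anc is not None else None
    return dep


# Toy tree (not the paper's Figure 1): a nucleus span (e1, e2) and a satellite span (e3, e4).
toy = link(Node("Root", [
    Node("N", [Node("N", edu="e1"), Node("S", edu="e2")]),
    Node("S", [Node("N", edu="e3"), Node("S", edu="e4")]),
]))
dep = to_dep_dt(toy)
```

In the toy tree, e1 becomes the DEP-DT root, e2 and e3 attach to e1, and e4 attaches to e3, so dropping e3 would also force dropping e4.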

47 1 Formalization We denote T as a set of all possible rooted subtrees obtained from a DEP-DT. [sent-105, score-0.097]

48 F(t) is the significance score for a rooted subtree t ∈ T and L is the maximum number of words allowed in a summary. [sent-106, score-0.178]

49 The optimal subtree t∗ is defined as follows: $t^* = \arg\max_{t \in T} F(t)$ s. [sent-107, score-0.081]

50 (3) E(t) is the set of EDUs contained in t, Depth(e) is the depth of an EDU e within the DEP-DT. [sent-111, score-0.068]

51 (4) [summation over $w \in W(e)$.] W(e) is the set of words contained in e and tf(w, D) is the term frequency of word w in a document D. [sent-114, score-0.058]
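The scoring definitions above can be illustrated with a short sketch. Equations (3) and (4) are not legible in this extract, so the form used here is an assumption: each EDU is weighted by the summed term frequency of its word set W(e), divided by Depth(e), and F(t) adds these weights over E(t). This is consistent with the later remark that the KP baseline sets every depth to 1. The function names, the toy tokens and the convention that the root has depth 1 are hypothetical.

```python
from collections import Counter
from typing import List


def term_frequencies(edus: List[List[str]]) -> Counter:
    """tf(w, D): raw count of word w over all EDUs of document D."""
    return Counter(word for edu in edus for word in edu)


def edu_scores(edus: List[List[str]], depths: List[int]) -> List[float]:
    """Assumed per-EDU score: sum of tf over the word set W(e), divided by Depth(e)."""
    tf = term_frequencies(edus)
    return [sum(tf[w] for w in set(edu)) / depth
            for edu, depth in zip(edus, depths)]


# Toy document (made-up tokens); depths are hypothetical DEP-DT depths, root = 1.
edus = [["fuji", "apple", "sweet"], ["apple", "red"], ["red", "delicious", "apple"]]
depths = [1, 2, 2]
scores = edu_scores(edus, depths)
subtree_score = scores[0] + scores[2]  # F(t) for the rooted subtree {EDU 0, EDU 2}
```

Dividing by depth means that EDUs closer to the root of the DEP-DT contribute more for the same words, which favours shallow subtrees.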

52 $\sum_{i=1}^{N} \ell_i x_i \le L$ (6), $\forall i: x_{parent(i)} \ge x_i$ (7), $\forall i: x_i \in \{0, 1\}$ (8). (Footnote 2: A similar approach has been applied to sentence compression (Filippova and Strube, 2008).) [sent-118, score-0.079]

53 N is the number of EDUs in a document, ℓi is the length (the number of words) of the i-th EDU, and parent(i) indicates the ID of the parent of the i-th EDU in the DEP-DT. [sent-124, score-0.063]

54 Constraint (6) ensures that the length of a summary is less than limit L. [sent-125, score-0.046]

55 Constraint (7) ensures that a summary is a rooted subtree of the DEP-DT. [sent-126, score-0.224]

56 In general, the Tree Knapsack Problem is NP-hard, but fortunately ILP solvers can obtain the optimal solution in a feasible time for documents of practical tree size. [sent-128, score-0.144]
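To make constraints (6) to (8) concrete, the sketch below solves a tiny Tree Knapsack instance by exhaustive enumeration. It is not the ILP solver used in the paper; enumeration is swapped in only because it needs no external library and mirrors the constraints directly, and it is practical only for very small trees. The objective (the sum of per-EDU scores of the selected EDUs) is an assumption, since equation (5) is not legible in this extract, and the function name and toy numbers are hypothetical.

```python
from itertools import product
from typing import List, Optional, Tuple


def tree_knapsack(scores: List[float], lengths: List[int],
                  parent: List[Optional[int]], limit: int) -> Tuple[float, List[int]]:
    """Best rooted subtree by brute force; returns (score, selected EDU indices)."""
    n = len(scores)
    best_score, best_x = float("-inf"), None
    for x in product((0, 1), repeat=n):                      # constraint (8): x_i in {0, 1}
        if sum(l * xi for l, xi in zip(lengths, x)) > limit:  # constraint (6): length <= L
            continue
        # Constraint (7): an EDU may be selected only if its parent is selected.
        if any(x[i] and parent[i] is not None and not x[parent[i]] for i in range(n)):
            continue
        score = sum(s * xi for s, xi in zip(scores, x))
        if score > best_score:
            best_score, best_x = score, x
    return best_score, [i for i in range(n) if best_x[i]]


# Toy DEP-DT: EDU 0 is the root, EDUs 1 and 2 are its children, EDU 3 is a child of 2.
scores = [5.0, 2.5, 3.0, 4.0]
lengths = [3, 2, 3, 2]
parent = [None, 0, 0, 2]
best_score, chosen = tree_knapsack(scores, lengths, parent, limit=8)
```

With a limit of 8 words the sketch selects EDUs 0, 2 and 3: EDU 3 scores well, but constraint (7) forces its ancestors 2 and 0 into the summary as well. Dropping the constraint (7) check turns the same loop into the plain knapsack baseline.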

57 1 Settings We conducted an experimental evaluation on the test collection for single document summarization evaluation contained in the RST Discourse Treebank (RST-DTB)(Carlson et al. [sent-131, score-0.256]

58 The average length of the reference summaries corresponds to about 10 % of the words in the source 3http : / /www . [sent-134, score-0.089]

59 We compared our method (TKP) with Marcu’s method (Marcu) (Marcu, 1998), a simple knapsack model (KP), a maximum coverage model (MCP) and a lead method (LEAD). [sent-140, score-0.394]

60 MCP is known to be a state-of-the-art method for multiple document summarization and we believe that MCP also performs well in terms of single document summarization. [sent-141, score-0.314]

61 LEAD is also a widely used summarizer that simply takes the first K textual units of the document. [sent-142, score-0.26]

62 Although this is a simple heuristic rule, it is known as a state-of-the-art summarizer (Nenkova and McKeown, 2011). [sent-143, score-0.056]

63 For Marcu’s method, we examined both the gold RST-DT and HILDA’s RST-DT. [sent-148, score-0.026]

64 For KP, we exclude constraint (7) from the ILP formulation of TKP and set the depth of all EDUs in equations (3) and (5) at 1. [sent-152, score-0.068]

65 For MCP, we use tf (equation (4)) as the word weight. [sent-153, score-0.038]

66 We evaluated the summarization systems with ROUGE version 1. [sent-154, score-0.198]

67 From the table, TKP(G) and Marcu(G) achieved (Footnote 4: Options used: -n 2 -s -m -x.) Reference: The Fuji apple may one day replace the Red Delicious as the number one U. [sent-163, score-0.236]

68 Since the Red Delicious has been over-planted and prices have dropped to new lows, the apple industry seems ready for change. [sent-166, score-0.286]

69 Although the Fuji is smaller and not as perfectly shaped as the Red Delicious, it is much sweeter, less mealy and has a longer shelf life. [sent-168, score-0.076]

70 Some fruit visionaries say the Fuji could someday tumble the Red Delicious from the top of America’s apple heap. [sent-171, score-0.338]

71 The Fuji could someday tumble the Red Delicious from the top of America’s apple heap. [sent-180, score-0.338]

72 More than twice as many Red Delicious apples are grown as the Golden variety, America’s No. [sent-183, score-0.04]

73 They still buy apples mainly for big, red good looks. [sent-192, score-0.123]

74 But no matter how much Japan gets under our skin, we’ll still have mom and apple pie. [sent-202, score-0.292]

75 A Japanese apple called the Fuji is cropping up in orchards the way Hondas did on U. [sent-204, score-0.236]

76 These results support the effectiveness of our method that utilizes the discourse structure. [sent-211, score-0.195]

77 The results confirm the effectiveness of our summarization model and trimming proposal for DEP-DT. [sent-214, score-0.33]

78 This implies that our method is more robust against discourse parser error than Marcu’s method. [sent-216, score-0.225]

79 Figure 4 shows the example summaries generated by TKP(G), Marcu(G), MCP and LEAD, respectively for an article, wsj 1 128. [sent-217, score-0.116]

80 Since TKP(G) and Marcu(G) utilize a discourse tree, the summary generated by TKP(G) is similar to that generated by Marcu(G) but it is different from those generated by MCP and LEAD. [sent-218, score-0.241]

81 5 Conclusion This paper proposed rules for transforming an RST-DT to a DEP-DT to obtain the parent-child relationships between EDUs. [sent-219, score-0.09]

82 We treated a single document summarization method as a Tree Knapsack Problem, i. [sent-220, score-0.256]

83 the summarizer selects the best rooted subtree from a DEP-DT. [sent-222, score-0.234]

84 Building a discourse-tagged corpus in the framework of rhetorical structure theory. [sent-231, score-0.118]

85 A depth-first dynamic programming algorithm for the tree knapsack problem. [sent-236, score-0.475]

86 A novel discourse parser based on support vector machine classification. [sent-245, score-0.225]

87 HILDA: A discourse parser using support vector machine classification. [sent-260, score-0.225]

88 Heuristic and exact algorithms for the precedenceconstrained knapsack problem. [sent-289, score-0.331]

89 Text summarization model based on maximum coverage problem and its variant. [sent-293, score-0.225]

90 Text summarization model based on the budgeted median problem. [sent-298, score-0.302]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('tkp', 0.355), ('knapsack', 0.331), ('marcu', 0.304), ('apple', 0.236), ('mcp', 0.228), ('delicious', 0.199), ('summarization', 0.198), ('discourse', 0.195), ('fuji', 0.152), ('tree', 0.144), ('hilda', 0.132), ('trimming', 0.132), ('kp', 0.121), ('rhetorical', 0.118), ('units', 0.107), ('rst', 0.106), ('textual', 0.097), ('rooted', 0.097), ('edus', 0.094), ('rouge', 0.093), ('summaries', 0.089), ('satellite', 0.088), ('takamura', 0.088), ('edu', 0.088), ('nucleus', 0.084), ('red', 0.083), ('subtree', 0.081), ('elaboration', 0.081), ('formulate', 0.078), ('duverle', 0.076), ('rstdt', 0.076), ('shelf', 0.076), ('head', 0.074), ('combinatorial', 0.071), ('depth', 0.068), ('budgeted', 0.066), ('parent', 0.063), ('ilp', 0.058), ('nearest', 0.058), ('document', 0.058), ('mom', 0.056), ('summarizer', 0.056), ('child', 0.055), ('annear', 0.051), ('benjamini', 0.051), ('growers', 0.051), ('hegemony', 0.051), ('hernault', 0.051), ('nishino', 0.051), ('ripe', 0.051), ('samphaiboon', 0.051), ('someday', 0.051), ('tumble', 0.051), ('xparent', 0.051), ('industry', 0.05), ('node', 0.05), ('transforming', 0.048), ('summary', 0.046), ('hiroya', 0.044), ('anytime', 0.044), ('yasuda', 0.044), ('prendinger', 0.044), ('relationships', 0.042), ('america', 0.042), ('drop', 0.041), ('filippova', 0.04), ('apples', 0.04), ('ntt', 0.04), ('ll', 0.039), ('median', 0.038), ('tf', 0.038), ('radical', 0.038), ('manabu', 0.038), ('filatova', 0.038), ('cho', 0.038), ('lead', 0.036), ('ancestors', 0.035), ('wilcoxon', 0.035), ('japanese', 0.035), ('mann', 0.032), ('extractive', 0.031), ('nenkova', 0.031), ('optimization', 0.03), ('won', 0.03), ('masaaki', 0.03), ('carlson', 0.03), ('parser', 0.03), ('helmut', 0.029), ('sandra', 0.029), ('grow', 0.028), ('whose', 0.027), ('compression', 0.027), ('ldc', 0.027), ('wsj', 0.027), ('explicitly', 0.027), ('coverage', 0.027), ('charles', 0.026), ('dp', 0.026), ('examined', 0.026), ('xi', 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 174 emnlp-2013-Single-Document Summarization as a Tree Knapsack Problem

Author: Tsutomu Hirao ; Yasuhisa Yoshida ; Masaaki Nishino ; Norihito Yasuda ; Masaaki Nagata

2 0.1893838 76 emnlp-2013-Exploiting Discourse Analysis for Article-Wide Temporal Classification

Author: Jun-Ping Ng ; Min-Yen Kan ; Ziheng Lin ; Wei Feng ; Bin Chen ; Jian Su ; Chew Lim Tan

Abstract: In this paper we classify the temporal relations between pairs of events on an article-wide basis. This is in contrast to much of the existing literature which focuses on just event pairs which are found within the same or adjacent sentences. To achieve this, we leverage on discourse analysis as we believe that it provides more useful semantic information than typical lexico-syntactic features. We propose the use of several discourse analysis frameworks, including 1) Rhetorical Structure Theory (RST), 2) PDTB-styled discourse relations, and 3) topical text segmentation. We explain how features derived from these frameworks can be effectively used with support vector machines (SVM) paired with convolution kernels. Experiments show that our proposal is effective in improving on the state-of-the-art significantly by as much as 16% in terms of F1, even if we only adopt less-than-perfect automatic discourse analyzers and parsers. Making use of more accurate discourse analysis can further boost gains to 35%.

3 0.14870356 36 emnlp-2013-Automatically Determining a Proper Length for Multi-Document Summarization: A Bayesian Nonparametric Approach

Author: Tengfei Ma ; Hiroshi Nakagawa

Abstract: Document summarization is an important task in the area of natural language processing, which aims to extract the most important information from a single document or a cluster of documents. In various summarization tasks, the summary length is manually defined. However, how to find the proper summary length is quite a problem; and keeping all summaries restricted to the same length is not always a good choice. It is obviously improper to generate summaries with the same length for two clusters of documents which contain quite different quantity of information. In this paper, we propose a Bayesian nonparametric model for multidocument summarization in order to automatically determine the proper lengths of summaries. Assuming that an original document can be reconstructed from its summary, we describe the ”reconstruction” by a Bayesian framework which selects sentences to form a good summary. Experimental results on DUC2004 data sets and some expanded data demonstrate the good quality of our summaries and the rationality of the length determination.

4 0.14588808 85 emnlp-2013-Fast Joint Compression and Summarization via Graph Cuts

Author: Xian Qian ; Yang Liu

Abstract: Extractive summarization typically uses sentences as summarization units. In contrast, joint compression and summarization can use smaller units such as words and phrases, resulting in summaries containing more information. The goal of compressive summarization is to find a subset of words that maximize the total score of concepts and cutting dependency arcs under the grammar constraints and summary length constraint. We propose an efficient decoding algorithm for fast compressive summarization using graph cuts. Our approach first relaxes the length constraint using Lagrangian relaxation. Then we propose to bound the relaxed objective function by the supermodular binary quadratic programming problem, which can be solved efficiently using graph max-flow/min-cut. Since finding the tightest lower bound suffers from local optimality, we use convex relaxation for initialization. Experimental results on TAC2008 dataset demonstrate our method achieves competitive ROUGE score and has good readability, while is much faster than the integer linear programming (ILP) method.

5 0.12928079 65 emnlp-2013-Document Summarization via Guided Sentence Compression

Author: Chen Li ; Fei Liu ; Fuliang Weng ; Yang Liu

Abstract: Joint compression and summarization has been used recently to generate high quality summaries. However, such word-based joint optimization is computationally expensive. In this paper we adopt the ‘sentence compression + sentence selection’ pipeline approach for compressive summarization, but propose to perform summary guided compression, rather than generic sentence-based compression. To create an annotated corpus, the human annotators were asked to compress sentences while explicitly given the important summary words in the sentences. Using this corpus, we train a supervised sentence compression model using a set of word-, syntax-, and documentlevel features. During summarization, we use multiple compressed sentences in the integer linear programming framework to select . salient summary sentences. Our results on the TAC 2008 and 2011 summarization data sets show that by incorporating the guided sentence compression model, our summarization system can yield significant performance gain as compared to the state-of-the-art.

6 0.10565983 106 emnlp-2013-Inducing Document Plans for Concept-to-Text Generation

7 0.09643811 5 emnlp-2013-A Discourse-Driven Content Model for Summarising Scientific Articles Evaluated in a Complex Question Answering Task

8 0.08401452 149 emnlp-2013-Overcoming the Lack of Parallel Data in Sentence Compression

9 0.077057719 48 emnlp-2013-Collective Personal Profile Summarization with Social Networks

10 0.068179548 147 emnlp-2013-Optimized Event Storyline Generation based on Mixture-Event-Aspect Model

11 0.064172454 179 emnlp-2013-Summarizing Complex Events: a Cross-Modal Solution of Storylines Extraction and Reconstruction

12 0.056741409 187 emnlp-2013-Translation with Source Constituency and Dependency Trees

13 0.055216551 152 emnlp-2013-Predicting the Presence of Discourse Connectives

14 0.051891301 58 emnlp-2013-Dependency Language Models for Sentence Completion

15 0.047590971 63 emnlp-2013-Discourse Level Explanatory Relation Extraction from Product Reviews Using First-Order Logic

16 0.046404958 66 emnlp-2013-Dynamic Feature Selection for Dependency Parsing

17 0.046009608 175 emnlp-2013-Source-Side Classifier Preordering for Machine Translation

18 0.043583311 176 emnlp-2013-Structured Penalties for Log-Linear Language Models

19 0.042802375 171 emnlp-2013-Shift-Reduce Word Reordering for Machine Translation

20 0.038196675 17 emnlp-2013-A Walk-Based Semantically Enriched Tree Kernel Over Distributed Word Representations

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.143), (1, 0.03), (2, -0.021), (3, 0.172), (4, -0.07), (5, -0.041), (6, 0.2), (7, -0.173), (8, 0.027), (9, 0.056), (10, 0.061), (11, 0.007), (12, 0.017), (13, 0.048), (14, 0.021), (15, 0.104), (16, -0.04), (17, -0.005), (18, 0.006), (19, 0.09), (20, 0.056), (21, -0.101), (22, -0.038), (23, -0.021), (24, -0.01), (25, 0.035), (26, 0.053), (27, -0.102), (28, 0.037), (29, -0.139), (30, -0.039), (31, 0.074), (32, 0.132), (33, -0.04), (34, -0.09), (35, -0.232), (36, -0.177), (37, -0.038), (38, -0.068), (39, -0.006), (40, 0.119), (41, 0.074), (42, -0.104), (43, 0.012), (44, -0.045), (45, 0.077), (46, -0.074), (47, 0.056), (48, -0.047), (49, 0.153)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96952569 174 emnlp-2013-Single-Document Summarization as a Tree Knapsack Problem

Author: Tsutomu Hirao ; Yasuhisa Yoshida ; Masaaki Nishino ; Norihito Yasuda ; Masaaki Nagata

2 0.62360454 106 emnlp-2013-Inducing Document Plans for Concept-to-Text Generation

Author: Ioannis Konstas ; Mirella Lapata

Abstract: In a language generation system, a content planner selects which elements must be included in the output text and the ordering between them. Recent empirical approaches perform content selection without any ordering and have thus no means to ensure that the output is coherent. In this paper we focus on the problem of generating text from a database and present a trainable end-to-end generation system that includes both content selection and ordering. Content plans are represented intuitively by a set of grammar rules that operate on the document level and are acquired automatically from training data. We develop two approaches: the first one is inspired from Rhetorical Structure Theory and represents the document as a tree of discourse relations between database records; the second one requires little linguistic sophistication and uses tree structures to represent global patterns of database record sequences within a document. Experimental evaluation on two domains yields considerable improvements over the state of the art for both approaches.

3 0.60216385 36 emnlp-2013-Automatically Determining a Proper Length for Multi-Document Summarization: A Bayesian Nonparametric Approach

Author: Tengfei Ma ; Hiroshi Nakagawa

4 0.55763626 5 emnlp-2013-A Discourse-Driven Content Model for Summarising Scientific Articles Evaluated in a Complex Question Answering Task

Author: Maria Liakata ; Simon Dobnik ; Shyamasree Saha ; Colin Batchelor ; Dietrich Rebholz-Schuhmann

Abstract: We present a method which exploits automatically generated scientific discourse annotations to create a content model for the summarisation of scientific articles. Full papers are first automatically annotated using the CoreSC scheme, which captures 11 contentbased concepts such as Hypothesis, Result, Conclusion etc at the sentence level. A content model which follows the sequence of CoreSC categories observed in abstracts is used to provide the skeleton of the summary, making a distinction between dependent and independent categories. Summary creation is also guided by the distribution of CoreSC categories found in the full articles, in order to adequately represent the article content. Fi- nally, we demonstrate the usefulness of the summaries by evaluating them in a complex question answering task. Results are very encouraging as summaries of papers from automatically obtained CoreSCs enable experts to answer 66% of complex content-related questions designed on the basis of paper abstracts. The questions were answered with a precision of 75%, where the upper bound for human summaries (abstracts) was 95%.

5 0.5408774 76 emnlp-2013-Exploiting Discourse Analysis for Article-Wide Temporal Classification

Author: Jun-Ping Ng ; Min-Yen Kan ; Ziheng Lin ; Wei Feng ; Bin Chen ; Jian Su ; Chew Lim Tan

6 0.52257168 48 emnlp-2013-Collective Personal Profile Summarization with Social Networks

7 0.47745574 85 emnlp-2013-Fast Joint Compression and Summarization via Graph Cuts

8 0.35040635 65 emnlp-2013-Document Summarization via Guided Sentence Compression

9 0.33574703 63 emnlp-2013-Discourse Level Explanatory Relation Extraction from Product Reviews Using First-Order Logic

10 0.32988667 152 emnlp-2013-Predicting the Presence of Discourse Connectives

11 0.29914403 176 emnlp-2013-Structured Penalties for Log-Linear Language Models

12 0.29384932 10 emnlp-2013-A Multi-Teraflop Constituency Parser using GPUs

13 0.25564882 147 emnlp-2013-Optimized Event Storyline Generation based on Mixture-Event-Aspect Model

14 0.24156749 58 emnlp-2013-Dependency Language Models for Sentence Completion

15 0.23243859 179 emnlp-2013-Summarizing Complex Events: a Cross-Modal Solution of Storylines Extraction and Reconstruction

16 0.23007181 43 emnlp-2013-Cascading Collective Classification for Bridging Anaphora Recognition using a Rich Linguistic Feature Set

17 0.22657931 149 emnlp-2013-Overcoming the Lack of Parallel Data in Sentence Compression

18 0.22227272 171 emnlp-2013-Shift-Reduce Word Reordering for Machine Translation

19 0.22070234 187 emnlp-2013-Translation with Source Constituency and Dependency Trees

20 0.21294877 50 emnlp-2013-Combining PCFG-LA Models with Dual Decomposition: A Case Study with Function Labels and Binarization

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.073), (18, 0.02), (22, 0.033), (26, 0.011), (30, 0.06), (45, 0.015), (50, 0.016), (51, 0.099), (66, 0.046), (71, 0.03), (75, 0.019), (77, 0.031), (90, 0.01), (92, 0.394), (95, 0.044), (96, 0.014), (97, 0.016)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.74027687 174 emnlp-2013-Single-Document Summarization as a Tree Knapsack Problem

Author: Tsutomu Hirao ; Yasuhisa Yoshida ; Masaaki Nishino ; Norihito Yasuda ; Masaaki Nagata

2 0.564785 190 emnlp-2013-Ubertagging: Joint Segmentation and Supertagging for English

Author: Rebecca Dridan

Abstract: A precise syntacto-semantic analysis of English requires a large detailed lexicon with the possibility of treating multiple tokens as a single meaning-bearing unit, a word-with-spaces. However parsing with such a lexicon, as included in the English Resource Grammar, can be very slow. We show that we can apply supertagging techniques over an ambiguous token lattice without resorting to previously used heuristics, a process we call ubertagging. Our model achieves an ubertagging accuracy that can lead to a four to eight fold speed up while improving parser accuracy. 1 Introduction and Motivation Over the last decade or so, supertagging has become a standard method for increasing parser efficiency for heavily lexicalised grammar formalisms such as LTAG (Bangalore and Joshi, 1999), CCG (Clark and Curran, 2007) and HPSG (Matsuzaki et al., 2007). In each of these systems, fine-grained lexical categories, known as supertags, are used to prune the parser search space prior to full syntactic parsing, leading to faster parsing at the risk of removing necessary lexical items. Various methods are used to configure the degree of pruning in order to balance this trade-off. The English Resource Grammar (ERG; Flickinger (2000)) is a large hand-written HPSG-based grammar of English that produces fine-grained syntacto-semantic analyses. Given the high level of lexical ambiguity in its lexicon, parsing with the ERG should therefore also benefit from supertagging, but while various attempts have shown possibilities (Blunsom, 2007; Dridan et al., 2008; Dridan, 2009), supertagging is still not a standard element in the ERG parsing pipeline. There are two main reasons for this. The first is that the ERG lexicon does not assign simple atomic categories to words, but instead builds complex structured signs from information about lemmas and lexical rules, and hence the shape and integration of the supertags is not straightforward. 
Bangalore and Joshi (2010) define a supertag as a primitive structure that contains all the information about a lexical item, including argument structure, and where the arguments should be found. Within the ERG, that information is not all contained in the lexicon, but comes from different places. The choice, therefore, of what information may be predicted prior to parsing and how it should be integrated into parsing is an open question. The second reason that supertagging is not standard with ERG processing is one that is rarely considered when processing English, namely ambiguous segmentation. In most mainstream English parsing, the segmentation of parser input into tokens that will become the leaves of the parse tree is considered a fixed, unambiguous process. While recent work (Dridan and Oepen, 2012) has shown that producing even these tokens is not a solved problem, the issue we focus on here is the ambiguous mapping from these tokens to meaning-bearing units that we might call words. Within the ERG lexicon are many multi-token lexical entries that are sometimes referred to as words-with-spaces. These multi-token entries are added to the lexicon where the grammarian finds that the semantics of a fixed expression is non-compositional and has the distributional properties of other single word entries. Some examples include an adverb-like all of a sudden, a preposition-like for example and an adjective-like over and done with. Each of these entries creates a segmentation ambiguity between treating the whole expression as a single unit, or allowing analyses comprising entries triggered by the individual tokens. 
Previous supertagging research using the ERG has either used the gold standard tokenisation, hence making the task artificially easier, or else tagged the individual tokens, using various heuristics to apply multi-token tags to single tokens. Neither approach has been wholly satisfactory. In this work we avoid the heuristic approaches and learn a sequential classification model that can simultaneously determine the most likely segmentation and supertag sequences, a process we dub ubertagging. We also experiment with more fine-grained tag sets than have been previously used, and find that it is possible to achieve a level of ubertagging accuracy that can improve both parser speed and accuracy for a precise semantic parser.

2 Previous Work

As stated above, supertagging has become a standard tool for particular parsing paradigms, but the definitions of a supertag, the methods used to learn them, and the way they are used in parsing vary across formalisms. The original supertags were 300 LTAG elementary trees, predicted using a fairly simple trigram tagger that provided a configurable number of tags per token, since the tagger was not accurate enough to make assigning a single tree viable parser input (Bangalore and Joshi, 1999). The C&C CCG parser uses a more complex Maximum Entropy tagger to assign tags from a set of 425 CCG lexical categories (Clark and Curran, 2007). They also found it necessary to supply more than one tag per token, and hence assign all tags that have a probability within a percentage β of the most likely tag for each token. Their standard parser configuration uses a very restrictive β value initially, relaxing it when no parse can be found. Matsuzaki et al. (2007) use a supertagger similar to the C&C tagger alongside a CFG filter to improve the speed of their HPSG parser, feeding sequences of single tags to the parser until a parse is possible.
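As a rough illustration of the β-based multi-tagging described above for the C&C tagger, the following Python sketch keeps every tag whose probability is at least β times that of the most likely tag, and relaxes β when no parse is found. The function names, the example β values and the `try_parse` callback are hypothetical stand-ins chosen for this sketch; they are not the C&C implementation or its actual settings.

```python
def multitag(tag_probs, beta):
    """Keep every tag whose probability is within a factor beta of the best tag."""
    best = max(tag_probs.values())
    return {tag for tag, p in tag_probs.items() if p >= beta * best}


def parse_with_backoff(sentence_probs, betas, try_parse):
    """Try increasingly permissive beta values until the parser succeeds.

    sentence_probs: one {tag: probability} dict per token.
    betas:          beta values ordered from most to least restrictive.
    try_parse:      hypothetical callback returning a parse or None.
    """
    for beta in betas:
        tags_per_token = [multitag(probs, beta) for probs in sentence_probs]
        parse = try_parse(tags_per_token)
        if parse is not None:
            return parse, beta
    return None, None
```

A smaller β admits more tags per token, so the parser sees a larger search space only for the sentences that need it.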
As in the ERG, category and inflectional information are separate in the automatically-extracted ENJU grammar used by Matsuzaki et al. (2007): their supertag set consists of 1361 tags constructed by combining lexical categories and lexical rules. Figure 1 shows examples of supertags from these three tag sets, all describing the simple transitive use of lends.

[Figure 1: Examples of supertags from LTAG, CCG and ENJU HPSG, for the word lends. (a) LTAG: the elementary tree [S NP0↓ [VP [V lends] NP1↓]]; (b) CCG: (S[dcl]\NP)/NP; (c) ENJU HPSG: [NP.nom NP.acc]-singular3rd_verb_rule.]

The ALPINO system for parsing Dutch is the closest in spirit to our ERG parsing setup, since it also uses a hand-written HPSG-based grammar, including multi-token entries in its lexicon. Prins and van Noord (2003) use a trigram HMM tagger to calculate the likelihood of up to 2392 supertags, and discard those that are not within τ of the most likely tag. For their multi-token entries, they assign a constructed category to each token, so that instead of assigning preposition to the expression met betrekking tot (“with respect to”), they use (1 preposition), (2 preposition), (3 preposition). Without these constructed categories, they would only have 1365 supertags.

Most previous supertagging attempts with the ERG have used the grammar’s lexical types, which describe the coarse-grained part of speech, and the subcategorisation of a word, but not the inflection. Hence both lends and lent have a possible lexical type v_np*_pp*_to_le, which indicates a verb, with optional noun phrase and prepositional phrase arguments, where the preposition has the form to. The number of lexical types changes as the grammar grows, and is currently just over 1000. Dridan (2009) and Fares (2013) experimented with other tag types, but both found lexical types to be the optimal balance between predictability and efficiency. Both used a multi-tagging approach dubbed selective tagging to integrate the supertags into the parser.
This involved only applying the supertag filter when the tag probability is above a configurable threshold, and not pruning otherwise.

For multi-token entries, both Blunsom (2007) and Dridan (2009) assigned separate tags to each token, with Blunsom (2007) assigning a special ditto tag to all but the initial token of a multi-token entry, while Dridan (2009) just assigned the same tag to each token (leading to example in the expression for example receiving p_np_i_le, a preposition-type category). Both of these solutions (demonstrated in Figure 2), as well as that of Prins and van Noord (2003), in some ways defeat one of the purposes of treating these expressions as fixed units. The grammarian, by assigning the same category to, for example, all of a sudden and suddenly, is declaring that these two expressions have the same distributional properties, the properties that a sequential classifier is trying to exploit. Separating the tokens loses that information, and introduces extra noise into the sequence model.

[Figure 2: Options for tagging parts of the multi-token adverb all in all separately.
    all        in         all
    adverb     adverb     adverb
    adverb     ditto      ditto
    1 adverb   2 adverb   3 adverb]

Ytrestøl (2012) and Fares (2013) treat the multi-entry tokens as single expressions for tagging, but with no ambiguity. Ytrestøl (2012) manages this by using gold standard tokenisation, which is, as he states, the standard practice for statistical parsing, but is an artificially simplified setup. Fares (2013) is the only work we know of that has tried to predict the final segmentation that the ERG produces. We compare segmentation accuracy between our joint model and his stand-alone tokeniser in Section 6.

Looking at other instances of joint segmentation and tagging leads to work in non-whitespace separated languages such as Chinese (Zhang and Clark, 2010) and Japanese (Kudo et al., 2004).
While at a high level this work is solving the same problem, the shapes of the problems are quite different from a data point of view. Regular joint morphological analysis and segmentation has much greater ambiguity in terms of possible segmentations but, in most cases, less ambiguity in terms of labelling than our situation. This also holds for other lemmatisation and morphological research, such as Toutanova and Cherry (2009). While we drew inspiration from this related area, as well as from the speech recognition field, differences in the relative frequency of observations and labels, as well as in segmentation ambiguity, mean that conclusions found in these areas did not always hold true in our problem space.

[Figure 3: A selection from the 70 lexitems instantiated for Foreign lending increased as well.]

3 The Parser

The parsing environment we work with is the PET parser (Callmeier, 2000), a unification-based chart parser that has been engineered for efficiency with precision grammars, and incorporates subsumption-based ambiguity packing (Oepen and Carroll, 2000) and statistical model driven selective unpacking (Zhang et al., 2007). Parsing in PET is divided into two stages. The first stage, lexical parsing, covers everything from tokenising the raw input string to populating the base of the parse chart with the appropriate lexical items, ready for the second syntactic parsing stage. In this work, we embed our ubertagging model between the two stages. By this point, the input has been segmented into what we call internal tokens, which broadly means splitting at whitespace and hyphens, and making ’s a separate token.
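The broad description of internal tokens above can be approximated in a few lines of Python. This is only a sketch of that one-sentence description, not the ERG's actual tokenisation rules: the function name `internal_tokens` is invented here, and keeping the hyphen attached to the left-hand piece is an assumption. Punctuation is deliberately left attached to the word, since the text treats it through affixation rules rather than as separate tokens.

```python
import re


def internal_tokens(text):
    """Split at whitespace and after hyphens, and split off a final 's."""
    tokens = []
    for chunk in text.split():
        # A zero-width split after each hyphen keeps the hyphen on the left piece.
        for piece in re.split(r"(?<=-)", chunk):
            if not piece:
                continue
            possessive = re.fullmatch(r"(.+?)(['’]s)", piece)
            if possessive:
                tokens.extend([possessive.group(1), possessive.group(2)])
            else:
                tokens.append(piece)
    return tokens
```

Splitting on a zero-width pattern with `re.split` requires Python 3.7 or later.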
The internal tokens are subject to a morphological analysis component which proposes possible inflectional and derivational rules based on word form, and then are used in retrieving possible lexical entries from the lexicon. The results of applying the appropriate lexical rules, plus affixation rules triggered by punctuation, to the lexical entries form a lexical item object, which for this work we dub a lexitem. Figure 3 shows some examples of lexitems instantiated after the lexical parsing stage when analysing Foreign lending increased as well. The pre-terminal labels on these subtrees are the lexical types that have previously been used as supertags for the ERG. For uninflected words, with no punctuation affixed, the lexical type is the only element in the lexitem, other than the word form (e.g. Foreign, as). In this example, we also see lexitems with inflectional rules (v_prp_olr, v_pst_olr), derivational rules (v_nger-tr_dlr) and punctuation affixation rules (w_period_plr). These lexitems are put into a chart, forming a lexical lattice, and it is over this lattice that we apply our ubertagging model, removing unlikely lexitems before they are seen by the syntactic parsing stage.

4 The Data

The primary data sets we use in these experiments are from the 1.0 version of DeepBank (Flickinger et al., 2012), an HPSG annotation of the Wall Street Journal text used for the Penn Treebank (PTB; Marcus et al. (1993)). The current version has gold standard annotations for approximately 85% of the first 22 sections. We follow the recommendations of the DeepBank developers in using Sections 00–19 for training, Section 20 (WSJ20) for development and Section 21 (WSJ21) as test data.
In addition, we use two further sources of training data: the training portions of the LinGO Redwoods Treebank (Oepen et al., 2004), a steadily growing collection of gold standard HPSG annotations in a variety of domains; and the Wall Street Journal section of the North American News Corpus (NANC), which has been parsed, but not manually annotated. This builds on observations by Prins and van Noord (2003), Dridan (2009) and Ytrestøl (2012) that even uncorrected parser output makes very good training data for a supertagger, since the constraints in the parser lead to viable, if not entirely correct, sequences. This allows us to use much larger training sets than would be possible if we required manually annotated data.

In final testing, we also include two further data sets to observe how domain affects the contribution of the ubertagging. These are both from the test portion of the Redwoods Treebank: CatB, an essay about open-source software;¹ and WeScience13, text from Wikipedia articles about Natural Language Processing from the WeScience project (Ytrestøl et al., 2009). Table 1 summarises the vital statistics of the data we use.

¹ http://catb.org/esr/writings/cathedral-bazaar/ by Eric S. Raymond

    Data Set      Source               Use    Gold?  Trees     Lexitems (All)  Lexitems (M-T)
    WSJ00:19      DeepBank 1.0 §00–19  train  yes    33783     661451          6309
    Redwoods      Redwoods Treebank    train  yes    39478     432873          6568
    NANC          LDC2008T15           train  no     2185323   42376523        399936
    WSJ20         DeepBank 1.0 §20     dev    yes    1721      34063           312
    WSJ21         DeepBank 1.0 §21     test   yes    1414      27515           253
    WeScience13   Redwoods Treebank    test   yes    802       11844           153
    CatB          Redwoods Treebank    test   yes    608       11653           115

Table 1: Test, development and training data used in these experiments. The final two columns show the total number of lexitems used for training (All), as well as how many of those were multi-token lexitems (M-T).

With the focus on multi-token lexitems, it is instructive to see just how frequent they are. In terms of type frequency, almost 10% of the approximately 38500 lexical entries in the current ERG lexicon have more than one token in their canonical form.² However, while this is a significant percentage of the lexicon, they do not account for the same percentage of tokens during parsing. An analysis of WSJ00:19 shows that approximately one third of the sentences had at least one multi-token lexitem in the unpruned lexical lattice, and in just under half of those, the gold standard analysis included a multi-word entry. That gives the multi-token lexitems the awkward property of being rare enough to be difficult for a statistical classifier to accurately detect (just under 1% of the leaves of gold parse trees contain multiple tokens), but too frequent to ignore. In addition, since these multi-token expressions have often been distinguished because they are non-compositional, failing to detect the multi-word usage can lead to a disproportionately adverse effect on the semantic analysis of the text.

² While the parser has mechanisms for handling words unknown to the lexicon, with the current grammar these mechanisms will never propose a multi-token lexitem, and so only the multi-token entries explicitly in the lexicon will be recognised as such.

5 Ubertagging Model

Our ubertagging model is very similar to a standard trigram Hidden Markov Model (HMM), except that the states are not all of the same length. Our states are based on the lexitems in the lexical lattice produced by the lexical parsing stage of PET, and as such, can be partially overlapping. We formalise this by defining each state by its start position, end position, and tag. This turns out to make our model equivalent to a type of Hidden semi-Markov Model called a segmental HMM in Murphy (2002). In a segmental HMM, the states are segments with a tag (t) and a length in frames (l). In our setup, the frames are the ERG internal tokens and the segments are the lexitems, which are the potential candidates to become leaves of the parse tree. As indicated above, the majority of segments (over 99%) will be one frame long, but segments of up to four frames are regularly seen in the training data.

A standard trigram HMM has a transition probability matrix $A$, where the elements $A_{ijk}$ represent the probability $P(k \mid ij)$, and an emission probability matrix $B$ whose elements $B_{jo}$ record the probabilities $P(o \mid j)$. Given these matrices and a vector of observed frames, $O$, the posterior probabilities of each state at frame $v$ are calculated as:³

\[ P(q_v = q_y \mid O) = \frac{\alpha_v(q_y)\,\beta_v(q_y)}{P(O)} \tag{1} \]

where $\alpha_v(q_y)$ is the forward probability at frame $v$, given a current state $q_y$ (i.e. the probability of the observation up to $v$, given the state):

\[ \alpha_v(q_y) \equiv P(O_{0:v} \mid q_v = q_y) \tag{2} \]
\[ \phantom{\alpha_v(q_y)} = \sum_{q_x} \alpha_v(q_x q_y) \tag{3} \]
\[ \alpha_v(q_x q_y) = B_{q_y o_v} \sum_{q_w} \alpha_{v-1}(q_w q_x)\, A_{q_w q_x q_y} \tag{4} \]

$\beta_v(q_y)$ is the backwards probability at frame $v$, given a current state $q_y$ (the probability of the observation from $v$, given the state):

\[ \beta_v(q_y) \equiv P(O_{v+1:V} \mid q_v = q_y) \tag{5} \]
\[ \phantom{\beta_v(q_y)} = \sum_{q_x} \beta_v(q_x q_y) \tag{6} \]
\[ \beta_v(q_x q_y) = \sum_{q_z} \beta_{v+1}(q_y q_z)\, A_{q_x q_y q_z}\, B_{q_z o_{v+1}} \tag{7} \]

and the probability of the full observation sequence is equal to the forward probability at the end of the sequence, or the backwards probability at the start of the sequence:

\[ P(O) = \alpha_V(\langle E \rangle) = \beta_0(\langle S \rangle) \tag{8} \]

³ Since we will require per-state probabilities for integration into the parser, we focus on the calculation of posterior probabilities, rather than determining the single best path.

In implementation, our model varies only in what we consider the previous or next states. While $v$ still indexes frames, $q_v$ now indicates a state that ends with frame $v$, and we look forwards and backwards to adjacent states, not frames, formally designated in terms of $l$, the length of the state. Hence, we modify equation (4):

\[ \alpha_v(q_x q_y) = B_{q_y O_{v-l+1:v}} \sum_{q_w} \alpha_{v-l}(q_w q_x)\, A_{q_w q_x q_y} \tag{9} \]

where $v-l$ indexes the frame before the current state starts, and hence we are summing over all states that lead directly to our current state. An equivalent modification to equation (7) gives:

\[ \beta_v(q_x q_y) = \sum_{q_z \in Q_n} \beta_{v+l(q_z)}(q_y q_z)\, A_{q_x q_y q_z}\, B_{q_z O_{v+1:v+l(q_z)}} \tag{10} \]

where $Q_n$ is the set of states that start at $v+1$ (i.e., the states immediately following the current state), and $l(q_z)$ is the length of state $q_z$.

We construct the transition and emission probability matrices using relative frequencies directly observed from the training data, where we make the simplifying assumption that $P(q_k \mid q_i q_j) \equiv P(t(q_k) \mid t(q_i)\,t(q_j))$. That is to say, while lexitems with the same tag but different lengths will trigger distinct states with distinct emission probabilities, they will have the same transition probabilities, given the same preceding tags.⁴

    Type    Example                                   #Tags
    LTYPE   v_np-pp*_to_le                            1028
    INFL    v_np-pp*_to_le:v_pas_odlr                 3626
    FULL    v_np-pp*_to_le:v_pas_odlr:w_period_plr    21866

Figure 4: Possible tag types and their tag set size, with examples derived from the lexitem for recommended. (w_period_plr over v_pas_odlr over v_np-pp*_to_le).

Even with our large training set, some tag trigrams are rare or unseen.
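The modified recursions (9) and (10) can be illustrated with a small Python sketch. To stay short it makes a simplification that is not the model described above: transitions are first-order (tag bigrams) rather than trigrams, so the forward and backward scores are indexed by a single state instead of a state pair. The lattice encoding, the function name `lattice_posteriors`, the tag names and the toy probabilities are all invented for illustration.

```python
from collections import defaultdict


def lattice_posteriors(segments, n_frames, trans, start="<S>", end="<E>"):
    """Posterior probability of each segment in a lexical lattice.

    segments: list of (first_frame, end_frame, tag, emission_prob), where
              end_frame is exclusive, so a one-frame segment is (v, v + 1, ...).
    trans:    dict mapping (previous_tag, tag) to a transition probability
              (first-order here; the model in the text conditions on two tags).
    """
    by_start, by_end = defaultdict(list), defaultdict(list)
    for i, (s, e, _, _) in enumerate(segments):
        by_start[s].append(i)
        by_end[e].append(i)

    # Forward pass: visit segments by end position so that every segment
    # ending where this one starts has already been scored.
    alpha = [0.0] * len(segments)
    for i in sorted(range(len(segments)), key=lambda k: segments[k][1]):
        s, _, tag, emit = segments[i]
        if s == 0:
            incoming = trans.get((start, tag), 0.0)
        else:
            incoming = sum(alpha[p] * trans.get((segments[p][2], tag), 0.0)
                           for p in by_end[s])
        alpha[i] = emit * incoming

    # Backward pass: visit segments by decreasing start position and look
    # ahead to the segments that begin where this one ends.
    beta = [0.0] * len(segments)
    for i in sorted(range(len(segments)), key=lambda k: -segments[k][0]):
        _, e, tag, _ = segments[i]
        if e == n_frames:
            beta[i] = trans.get((tag, end), 0.0)
        else:
            beta[i] = sum(trans.get((tag, segments[z][2]), 0.0)
                          * segments[z][3] * beta[z] for z in by_start[e])

    # P(O): total probability of all complete paths through the lattice.
    total = sum(alpha[i] * beta[i] for i in by_end[n_frames])
    return [a * b / total for a, b in zip(alpha, beta)]


# Toy lattice over two frames: two single-token segments compete with
# one multi-token segment that covers both frames.
segments = [(0, 1, "p", 0.5), (1, 2, "n", 0.4), (0, 2, "mt", 0.1)]
trans = {("<S>", "p"): 0.3, ("<S>", "mt"): 0.2, ("p", "n"): 0.5,
         ("n", "<E>"): 0.4, ("mt", "<E>"): 0.5}
posteriors = lattice_posteriors(segments, 2, trans)
```

In this toy lattice the posteriors of all segments covering any one frame sum to one, which is the property that makes a per-lexitem probability threshold meaningful even when segments overlap.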
To smooth the transition probabilities, we use deleted interpolation to calculate a weighted sum of the trigram, bigram and unigram probabilities, since it has been successfully used in effective PoS taggers like the TnT tagger (Brants, 2000). Future work will look more closely at the effects of different smoothing methods.

⁴ Since the multi-token lexical entries are defined because they have the same properties as the single token variants, there is no reason to think the length of a state should influence the tag sequence probability.

6 Intrinsic Ubertag Evaluation

In order to develop and tune the ubertagging model, we first looked at segmentation and tagging performance in isolation over the development set. We looked at three tag granularities: lexical types (LTYPE), which have previously been shown to be the optimal granularity for supertagging with the ERG; inflected types (INFL), which encompass inflectional and derivational rules applied to the lexical type; and the full lexical item (FULL), which also includes affixation rules used for punctuation handling. Examples of each tag type are shown in Figure 4, along with the number of tags of each type seen in the training data.

    Tag Type   Segmentation F1   Segmentation Sent.   Tagging F1   Tagging Sent.
    FULL       99.55             94.48                93.92        42.13
    INFL       99.45             93.55                93.74        41.49
    LTYPE      99.40             93.03                93.27        38.12

Table 2: Segmentation and tagging performance of the best path found for each model, measured per segment in terms of F1, and also as complete sentence accuracy.

Single sequence results

Table 2 shows the results when considering the best path through the lattice. In terms of segmentation, our sentence accuracy is comparable to that of the stand-alone segmentation performance reported by Fares et al. (2013) over similar data.⁵ In that work, the authors used a binary CRF classifier to label points between objects they called micro-tokens as either SPLIT or NOSPLIT.
The CRF classifier used a less informed input (since it was external to the parser), but a much more complex model, to produce a best single path sentence accuracy of 94.06%. Encouragingly, this level of segmentation performance was shown in later work to produce a viable parser input (Fares, 2013).

⁵ Fares et al. (2013) used a different section of an earlier version of DeepBank, but with the same style of annotation.

Switching to the tagging results, we see that the F1 numbers are quite good for tag sets of this size.⁶ The best tag accuracy seen for ERG LTYPE-style tags was 95.55 in Ytrestøl (2012), using gold standard segmentation on a different data set. Dridan (2009) experimented with a tag granularity similar to our INFL (letype+morph) and saw a tag accuracy of 91.51, but with much less training data. From other formalisms, Kummerfeld et al. (2010) report a single tag accuracy of 95.91, with the smaller CCG supertag set. Despite the promising tag F1 numbers, however, the sentence level accuracy still indicates a performance level unacceptable for parser input. Comparing between tag types, we see that, possibly surprisingly, the more fine-grained tags are more accurately assigned, although the differences are small. While instinctively a larger tag set should present a more difficult problem, we find that this is mitigated both by the sparse lexical lattice provided by the parser, and by the extra constraints provided by the more informative tags.

⁶ We need to measure F1 rather than tag accuracy here, since the number of tokens tagged will vary according to the segmentation.

Multi-tagging results

The multi-tagging methods from previous supertagging work become more complicated when dealing with ambiguous tokenisation. Where, in other setups, one can compare tag probabilities for all tags for a particular token, that no longer holds directly when tokens can partially overlap.
Since the parser ultimately uses lexitems, which encompass segmentation and tagging information, we decided to use a simple integration method, where we remove any lexitem to which our model assigns a probability below a certain threshold (ρ). The effect of the different tag granularities is now mediated by the relationship between the states in the ubertagging lattice and the lexitems in the parser’s lattice: for the FULL model, this is a one-to-one relationship, but states from the models that use coarser-grained tags may affect multiple lexitems. To illustrate this point, Figure 5 shows some lexitems for the token forecast, where there are multiple possible analyses for the comma. A FULL tag of v_cp_le:v_pst_olr:w_comma_plr will select only lexitem (b), whereas an INFL tag v_cp_le:v_pst_olr will select (b) and (c), and the LTYPE tag v_cp_le picks out (a), (b) and (c). On the other hand, where there is no ambiguity in inflection or affixation, an LTYPE tag of n_-_mc_le may relate to only a single lexitem ((f) in this case).

[Figure 5: Some of the lexitems triggered by forecast, in Despite the gloomy forecast, profits were up. Panels (a)–(c) are v_cp_le analyses, (d) and (e) are v_np_le analyses, and (f) is an n_-_mc_le analysis, each with a comma affixation rule.]

Since we are using an absolute, rather than relative, threshold, the number needs to be tuned for each model⁷ and comparisons between models can only be made based on the effects (accuracy or pruning power) of the threshold. Table 3 shows how a selection of threshold values affects the accuracy and pruning impact of our different disambiguation models, where the accuracy is measured in terms of percentage of gold lexitems retained. The pruning effect is given both as percentage of lexitems retained after pruning, and average number of lexitems per initial token.⁸ Comparison between the different models can be more easily made by examining Figure 6. Here we see clearly that the LTYPE model provides much less pruning for any given level of lexitem accuracy, while the performance of the other models is almost indistinguishable.

⁷ A tag set size of 1028 will lead to higher probabilities in general than a tag set size of 21866.

    Tag Type   ρ         Acc.    Lexitems Kept (%)   Lexitems Ave.
    FULL       0.00001   99.71   41.6                3.34
    FULL       0.0001    99.44   33.1                2.66
    FULL       0.001     98.92   25.5                2.05
    FULL       0.01      97.75   19.4                1.56
    INFL       0.0001    99.67   37.9                3.04
    INFL       0.001     99.25   29.0                2.33
    INFL       0.01      98.21   21.6                1.73
    INFL       0.02      97.68   19.7                1.58
    LTYPE      0.0002    99.75   66.3                5.33
    LTYPE      0.002     99.43   55.0                4.42
    LTYPE      0.02      98.41   43.5                3.50
    LTYPE      0.05      97.54   39.4                3.17

Table 3: Accuracy and ambiguity after pruning lexitems in WSJ20, at a selection of thresholds ρ for each model. Accuracy is measured as the percentage of gold lexitems remaining after pruning, while ambiguity is presented both as a percentage of lexitems kept, and the average number of lexitems per initial token still remaining.

[Figure 6: Accuracy over gold lexitems versus average lexitems per initial token over the development set, for each of the different ubertagging models. Plot title: Tag accuracy versus ambiguity; x-axis: Average lexitems per initial token.]

Analysis

The current state-of-the-art POS tagging accuracy (using the 45 tags in the PTB) is approximately 97.5%. The most restrictive ρ value we report for each model was selected to demonstrate that level of accuracy, which we can see would lead to pruning over 80% of lexitems when using FULL tags, an average of 1.56 tags per token. While this level of accuracy has been sufficient for statistical treebank parsing, previous work (Dridan, 2009) has shown that tag accuracy cannot directly predict parser performance, since errors of different types can have very different effects.
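The threshold-based pruning and the three measures reported in Table 3 can be sketched in Python as follows. This is a minimal illustration under assumed data structures: lexitems are represented by arbitrary hashable identifiers, each paired with the posterior probability its state received, and the function names are hypothetical. It does not cover the mapping from coarser-grained INFL or LTYPE states onto several lexitems.

```python
def prune_lexitems(lexitems, posteriors, rho):
    """Remove every lexitem whose probability falls below the threshold rho."""
    return [item for item, p in zip(lexitems, posteriors) if p >= rho]


def pruning_stats(lexitems, kept, gold, n_tokens):
    """Return (accuracy, percentage kept, average lexitems per initial token).

    accuracy: percentage of gold lexitems that survive pruning.
    n_tokens: number of initial tokens in the input.
    """
    kept_set = set(kept)
    accuracy = 100.0 * sum(1 for g in gold if g in kept_set) / len(gold)
    kept_pct = 100.0 * len(kept) / len(lexitems)
    average = len(kept) / n_tokens
    return accuracy, kept_pct, average
```

Because ρ is an absolute threshold, the same value prunes very differently under tag sets of different sizes, which is why each model in Table 3 is reported over its own range of ρ.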
The effect of a given error type is hard to quantify without parsing, but we made a qualitative analysis of the lexitems that were incorrectly being pruned. For all models, the most difficult lexitems to get correct were proper nouns, particularly those that are also used as common nouns (e.g. Bank, Airline, Report). While capitalisation provides a clue here, it is not always deterministic, particularly since the treebank incorporates detailed decisions regarding the distinction between a name and a capitalised common noun that require real world knowledge, and are not necessarily always consistent. Almost two thirds of the errors made by the FULL and INFL models are related to these decisions, but only about 40% for the LTYPE model. The other errors are predominantly over noun and verb type lexitems, as the open classes, with the only difference between models being that the FULL model seems marginally better at classifying verbs. The next section describes the end-to-end setup and results when parsing the development set.

⁸ The average number of lexitems per token for the unrestricted parser is 8.03, although the actual assignment is far from uniform, with up to 70 lexitems per token seen for the very ambiguous tokens.

7 Parsing

With encouraging ubertagging results, we now take the next step and evaluate the effect on end-to-end parsing. Apart from the issue of different error types having unpredictable effects, there are two other factors that make the isolated ubertagging results only an approximate indication of parsing performance. The first confounding factor is the statistical parsing disambiguation model. To show the effect of ubertagging in a realistic configuration, we only evaluate the first analysis that the parser returns. That means that when the unrestricted parser does not rank the gold analysis first, errors made by our model may not be visible, because we would never see the gold analysis in any case.
On the other hand, it is possible to improve parser accuracy by pruning incorrect lexitems that were in a top ranked, non-gold analysis. The second new factor that parser integration brings to the picture is the effect of resource limitations. For reasons of tractability, PET is run with per sentence time and memory limits. For treebank creation, these limits are quite high (up to four minutes), but for these experiments, we set the timeout to a more practical 60 seconds and the memory limit to 2048 MB. Without lexical pruning, this leads to approximately 3% of sentences not receiving an analysis. Since the main aim of ubertagging is to increase efficiency, we would expect to regain at least some of these unanalysed sentences, even when a lexitem needed for the gold analysis has been removed.

    Tag Type     ρ         Lexitem F1   Bracket F1   Time
    No Pruning             94.06        88.58        6.58
    FULL         0.00001   95.62        89.84        3.99
    FULL         0.0001    95.95        90.09        2.69
    FULL         0.001     95.81        89.88        1.34
    FULL         0.01      94.19        88.29        0.64
    INFL         0.0001    96.10        90.37        3.45
    INFL         0.001     96.14        90.33        1.78
    INFL         0.01      95.07        89.27        0.84
    INFL         0.02      94.32        88.49        0.64
    LTYPE        0.0002    95.37        89.63        4.73
    LTYPE        0.002     96.03        90.20        2.89
    LTYPE        0.02      95.04        89.04        1.23
    LTYPE        0.05      93.36        87.26        0.88

Table 4: Lexitem and bracket F1 over WSJ20, with average per sentence parsing time in seconds.

Table 4 shows the parsing results at the same threshold values used in Table 3. Accuracy is calculated in terms of F1 both over lexitems, and PARSEVAL-style labelled brackets (Black et al., 1991), while efficiency is represented by average parsing time per sentence. We can see here that an ubertagging F1 of below 98 (cf. Table 3) leads to a drop in parser accuracy, but that an ubertagging performance of between 98 and 99 can improve parser F1 while also achieving speed increases up to 8-fold. From the table we confirm that, contrary to earlier pipeline supertagging configurations, tags of a finer granularity than LTYPE can deliver better performance, both in terms of accuracy and efficiency.
Again, comparing graphically in Figure 7 gives a clearer picture. Here we have graphed labelled bracket F1 against parsing time for the full range of threshold values explored, with the unpruned parsing results indicated by a cross. From this figure, we see that the INFL model, despite being marginally less accurate when measured in isolation, leads to slightly more accurate parse results than the FULL model at all levels of efficiency. Looking at the same graph for different samples of the development set (not shown) shows some variance in which threshold value gives the best F1, but the relative differences and basic curve shape remain the same. From these different views, using the guideline of maximum efficiency without harming accuracy, we selected our final configuration: the INFL model with a threshold value of 0.001 (marked with a circle in Figure 7). On the development set, this configuration leads to a 1.75 point improvement in F1 in 27% of the parsing time.

[Figure 7: Labelled bracket F1 versus parsing time per sentence over the development set, for each of the different ubertagging models. The cross indicates unpruned performance, while the circle pinpoints the configuration we chose for the final test runs. Plot title: Parser accuracy versus efficiency; x-axis: Time per sentence.]

8 Final Results

Table 5 shows the results obtained when parsing using the configuration selected on the development set, over our three test sets. The first, WSJ21, is from the same domain as the development set. Here we see that the effect over the WSJ21 set fairly closely mirrored that of the development set, with an F1 increase of 1.81 in 29% of the parsing time. The Wikipedia domain of our WeScience13 test set, while very different to the newswire domain of the development set, could still be considered in domain for the parsing and ubertagging models, since there is Wikipedia data in the training sets.
With an average sentence length of 15.18 (compared to 18.86 in WSJ21), the baseline parsing time is faster than for WSJ21, and the speedup is not quite as large but still welcome, at 36% of the baseline time. The increase in accuracy is likewise smaller (due to fewer issues with resource exhaustion in the baseline), but as our primary goal is to not harm accuracy, the results are pleasing.

    Data Set      Baseline F1   Baseline Time   Pruned F1   Pruned Time
    WSJ21         88.12         6.06            89.93       1.77
    WeScience13   86.25         4.09            87.14       1.48
    CatB          86.31         5.00            87.11       1.78

Table 5: Parsing accuracy in terms of labelled bracket F1 and average time per sentence when parsing the test sets, without pruning, and then with lexical pruning using the INFL model with a threshold of 0.001.

The CatB test set is the standard out-of-domain test for the parser, and is also out of domain for the ubertagging model. The average sentence length is not much below that of WSJ21, at 18.61, but the baseline parsing speed is still noticeably faster, which appears to be a reflection of greater structural ambiguity in the newswire text. We still achieve a reduction in parsing time to 35% of the baseline, again with a small improvement in accuracy.

The across-the-board performance improvement on all our test sets suggests that, while tuning the pruning threshold could help, it is a robust parameter that can provide good performance across a variety of domains. This means that we finally have a robust supertagging setup for use with the ERG that doesn’t require heuristic shortcuts and can be reliably applied in general parsing.

9 Conclusions and Outlook

In this work we have demonstrated a lexical disambiguation process dubbed ubertagging that can assign fine-grained supertags over an ambiguous token lattice, a setup previously ignored for English.
It is the first completely integrated supertagging setup for use with the English Resource Grammar, which avoids the previously necessary heuristics for dealing with ambiguous tokenisation, and can be robustly configured for improved performance without loss of accuracy. Indeed, by learning a joint segmentation and supertagging model, we have been able to achieve usefully high tagging accuracies for very fine-grained tags, which leads to potential parser speedups of between 4 and 8 fold. Analysis of the tagging errors still being made has suggested some possibly avoidable inconsistencies in the grammar and treebank, which have been fed back to the developers, hopefully leading to even better results in the future.

In future work, we will investigate more advanced smoothing methods to try to boost the ubertagging accuracy. We also intend to more fully explore the domain adaptation potential of the lexical model that has been seen in other parsing setups (see Rimell and Clark (2008) for example), as well as examine the limits on the effects of more training data. Finally, we would like to explore just how much the statistical properties of our data dictate the success of the model by looking at related problems like morphological analysis of unsegmented languages such as Japanese.

Acknowledgements

I am grateful to my colleagues from the Oslo Language Technology Group and the DELPH-IN consortium for many discussions on the issues involved in this work, and particularly to Stephan Oepen who inspired the initial lattice tagging idea. Thanks also to three anonymous reviewers for their very constructive feedback which improved the final version. Large-scale experimentation and engineering is made possible through access to the TITAN high-performance computing facilities at the University of Oslo, and I am grateful to the Scientific Computing staff at UiO, as well as to the Norwegian Metacenter for Computational Science and the Norwegian taxpayer.
References

Srinivas Bangalore and Aravind K. Joshi. 1999. Supertagging: an approach to almost parsing. Computational Linguistics, 25(2):237–265.
Srinivas Bangalore and Aravind Joshi, editors. 2010. Supertagging: Using Complex Lexical Descriptions in Natural Language Processing. The MIT Press, Cambridge, US.
Ezra Black, Steve Abney, Dan Flickinger, Claudia Gdaniec, Ralph Grishman, Phil Harrison, Don Hindle, Robert Ingria, Fred Jelinek, Judith Klavans, Mark Liberman, Mitch Marcus, S. Roukos, Beatrice Santorini, and Tomek Strzalkowski. 1991. A procedure for quantitatively comparing the syntactic coverage of English grammars. In Proceedings of the Workshop on Speech and Natural Language, pages 306–311, Pacific Grove, USA.
Philip Blunsom. 2007. Structured Classification for Multilingual Natural Language Processing. Ph.D. thesis, Department of Computer Science and Software Engineering, University of Melbourne.
Thorsten Brants. 2000. TnT - a statistical part-of-speech tagger. In Proceedings of the Sixth Conference on Applied Natural Language Processing ANLP-2000, pages 224–231, Seattle, USA.
Ulrich Callmeier. 2000. PET. A platform for experimentation with efficient HPSG processing techniques. Natural Language Engineering, 6(1):99–108, March.
Stephen Clark and James R. Curran. 2007. Formalism-independent parser evaluation with CCG and DepBank. In Proceedings of the 45th Meeting of the Association for Computational Linguistics, pages 248–255, Prague, Czech Republic.
Rebecca Dridan and Stephan Oepen. 2012. Tokenization. Returning to a long solved problem. A survey, contrastive experiment, recommendations, and toolkit. In Proceedings of the 50th Meeting of the Association for Computational Linguistics, pages 378–382, Jeju, Republic of Korea, July.
Rebecca Dridan, Valia Kordoni, and Jeremy Nicholson. 2008. Enhancing performance of lexicalised grammars. pages 613–621.
Rebecca Dridan. 2009. Using lexical statistics to improve HPSG parsing. Ph.D. thesis, Department of Computational Linguistics, Saarland University.
Murhaf Fares, Stephan Oepen, and Yi Zhang. 2013. Machine learning for high-quality tokenization. Replicating variable tokenization schemes. In Computational Linguistics and Intelligent Text Processing, pages 231–244. Springer.
Murhaf Fares. 2013. ERG tokenization and lexical categorization: a sequence labeling approach. Master’s thesis, Department of Informatics, University of Oslo.
Dan Flickinger, Yi Zhang, and Valia Kordoni. 2012. DeepBank. A dynamically annotated treebank of the Wall Street Journal. In Proceedings of the 11th International Workshop on Treebanks and Linguistic Theories, pages 85–96, Lisbon, Portugal. Edições Colibri.
Dan Flickinger. 2000. On building a more efficient grammar by exploiting types. Natural Language Engineering, 6(1):15–28.
Taku Kudo, Kaoru Yamamoto, and Yuji Matsumoto. 2004. Applying conditional random fields to Japanese morphological analysis. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 230–237.
Jonathan K. Kummerfeld, Jessika Roesner, Tim Dawborn, James Haggerty, James R. Curran, and Stephen Clark. 2010. Faster parsing by supertagger adaptation. In Proceedings of the 48th Meeting of the Association for Computational Linguistics, pages 345–355, Uppsala, Sweden.
Mitchell Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19:313–330.
Takuya Matsuzaki, Yusuke Miyao, and Jun’ichi Tsujii. 2007. Efficient HPSG parsing with supertagging and CFG-filtering. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI 2007), pages 1671–1676, Hyderabad, India.
Kevin P. Murphy. 2002. Hidden semi-Markov models (HSMMs).
Stephan Oepen and John Carroll. 2000. Ambiguity packing in constraint-based parsing. Practical results. In Proceedings of the 1st Meeting of the North American Chapter of the Association for Computational Linguistics, pages 162–169, Seattle, WA, USA.
Stephan Oepen, Daniel Flickinger, Kristina Toutanova, and Christopher D. Manning. 2004. LinGO Redwoods. A rich and dynamic treebank for HPSG. Research on Language and Computation, 2(4):575–596.
Robbert Prins and Gertjan van Noord. 2003. Reinforcing parser preferences through tagging. Traitement Automatique des Langues, 44(3):121–139.
Laura Rimell and Stephen Clark. 2008. Adapting a lexicalized-grammar parser to contrasting domains. pages 475–484.
Kristina Toutanova and Colin Cherry. 2009. A global model for joint lemmatization and part-of-speech prediction. In Proceedings of the 47th Meeting of the Association for Computational Linguistics, pages 486–494, Singapore.
Gisle Ytrestøl. 2012. Transition-based Parsing for Large-scale Head-Driven Phrase Structure Grammars. Ph.D. thesis, Department of Informatics, University of Oslo.
Gisle Ytrestøl, Stephan Oepen, and Dan Flickinger. 2009. Extracting and annotating Wikipedia subdomains. In Proceedings of the 7th International Workshop on Treebanks and Linguistic Theories, pages 185–197, Groningen, The Netherlands.
Yue Zhang and Stephen Clark. 2010. A fast decoder for joint word segmentation and POS-tagging using a single discriminative model. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 843–852, Cambridge, MA, USA.
Yi Zhang, Stephan Oepen, and John Carroll. 2007. Efficiency in unification-based n-best parsing. In Proceedings of the 10th International Conference on Parsing Technologies, pages 48–59, Prague, Czech Republic, July.

3 0.54913902 81 emnlp-2013-Exploring Demographic Language Variations to Improve Multilingual Sentiment Analysis in Social Media

Author: Svitlana Volkova ; Theresa Wilson ; David Yarowsky

Abstract: Different demographics, e.g., gender or age, can demonstrate substantial variation in their language use, particularly in informal contexts such as social media. In this paper we focus on learning gender differences in the use of subjective language in English, Spanish, and Russian Twitter data, and explore cross-cultural differences in emoticon and hashtag use for male and female users. We show that gender differences in subjective language can effectively be used to improve sentiment analysis, and in particular, polarity classification for Spanish and Russian. Our results show statistically significant relative F-measure improvements over the gender-independent baseline: 1.5% and 1% for Russian, 2% and 0.5% for Spanish, and 2.5% and 5% for English, for polarity and subjectivity classification respectively.

4 0.33667892 12 emnlp-2013-A Semantically Enhanced Approach to Determine Textual Similarity

Author: Eduardo Blanco ; Dan Moldovan

Abstract: This paper presents a novel approach to determine textual similarity. A layered methodology to transform text into logic forms is proposed, and semantic features are derived from a logic prover. Experimental results show that incorporating the semantic structure of sentences is beneficial. When training data is unavailable, scores obtained from the logic prover in an unsupervised manner outperform supervised methods.

5 0.33531696 61 emnlp-2013-Detecting Promotional Content in Wikipedia

Author: Shruti Bhosale ; Heath Vinicombe ; Raymond Mooney

Abstract: This paper presents an approach for detecting promotional content in Wikipedia. By incorporating stylometric features, including features based on n-gram and PCFG language models, we demonstrate improved accuracy at identifying promotional articles, compared to using only lexical information and metafeatures.

6 0.33481717 65 emnlp-2013-Document Summarization via Guided Sentence Compression

7 0.33381653 5 emnlp-2013-A Discourse-Driven Content Model for Summarising Scientific Articles Evaluated in a Complex Question Answering Task

8 0.33236399 110 emnlp-2013-Joint Bootstrapping of Corpus Annotations and Entity Types

9 0.33150882 114 emnlp-2013-Joint Learning and Inference for Grammatical Error Correction

10 0.33144796 36 emnlp-2013-Automatically Determining a Proper Length for Multi-Document Summarization: A Bayesian Nonparametric Approach

11 0.33011758 56 emnlp-2013-Deep Learning for Chinese Word Segmentation and POS Tagging

12 0.32803714 107 emnlp-2013-Interactive Machine Translation using Hierarchical Translation Models

13 0.32762524 6 emnlp-2013-A Generative Joint, Additive, Sequential Model of Topics and Speech Acts in Patient-Doctor Communication

14 0.32757765 140 emnlp-2013-Of Words, Eyes and Brains: Correlating Image-Based Distributional Semantic Models with Neural Representations of Concepts

15 0.32722795 143 emnlp-2013-Open Domain Targeted Sentiment

16 0.32644376 48 emnlp-2013-Collective Personal Profile Summarization with Social Networks

17 0.32632259 85 emnlp-2013-Fast Joint Compression and Summarization via Graph Cuts

18 0.32292569 175 emnlp-2013-Source-Side Classifier Preordering for Machine Translation

19 0.32165694 83 emnlp-2013-Exploring the Utility of Joint Morphological and Syntactic Learning from Child-directed Speech

20 0.32157037 76 emnlp-2013-Exploiting Discourse Analysis for Article-Wide Temporal Classification