emnlp emnlp2013 emnlp2013-147 knowledge-graph by maker-knowledge-mining

147 emnlp-2013-Optimized Event Storyline Generation based on Mixture-Event-Aspect Model

Source: pdf

Author: Lifu Huang ; Lian'en Huang

Abstract: Recently, much research focuses on event storyline generation, which aims to produce a concise, global and temporal event summary from a collection of articles. Generally, each event contains multiple sub-events and the storyline should be composed by the component summaries of all the sub-events. However, different sub-events have different part-whole relationship with the major event, which is important to correspond to users’ interests but seldom considered in previous work. To distinguish different types of sub-events, we propose a mixture-event-aspect model which models different sub-events into local and global aspects. Combining these local/global aspects with summarization requirements together, we utilize an optimization method to generate the component summaries along the timeline. We develop experimental systems on 6 distinctively different datasets. Evaluation and comparison results indicate the effectiveness of our proposed method.

Reference: text

Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract Recently, much research focuses on event storyline generation, which aims to produce a concise, global and temporal event summary from a collection of articles. [sent-5, score-1.321]

2 Generally, each event contains multiple sub-events and the storyline should be composed by the component summaries of all the sub-events. [sent-6, score-0.808]

3 To distinguish different types of sub-events, we propose a mixture-event-aspect model which models different sub-events into local and global aspects. [sent-8, score-0.24]

4 Combining these local/global aspects with summarization requirements together, we utilize an optimization method to generate the component summaries along the timeline. [sent-9, score-0.617]

5 So how to get a concise and global picture for a given event subject is an urgent problem to be solved. [sent-14, score-0.491]

6 multi-document summarization systems, to generate a compressed summary by extracting the major information from the collection of documents, they ignored the dynamic development information of an event. [sent-20, score-0.4]

7 Intuitively, each event is long-running and contains multiple sub-events, including related events. [sent-21, score-0.317]

8 This motivates us to study the task of generating event storyline from a collection of web documents related to an event subject. [sent-23, score-1.059]

9 The research of event storyline summarization is popular in recent years. [sent-24, score-0.767]

10 Its task is to summarize a collection of web documents by extracting representative information based on all the sub-events and generate a global summary. [sent-25, score-0.263]

11 Generally, generating such a global storyline is quite interesting for the following main reasons: (1) It can help people catch the whole incident based on an overall temporal structured summary for a given subject, and understand the cause, climax, development process and result of an event. [sent-26, score-0.609]

12 (2) It can also make people know what other events are related, or the effect of this incident to subsequent events, which can present the evolution of an event along a timeline. [sent-27, score-0.444]

13 Though several methods of generating event storyline have been proposed recently, there are still some problems unresolved. [sent-28, score-0.628]

14 As event storyline summarization is a process to generate component summaries based on the multiple sub-events, which is different from traditional summarization focusing on only one subject, so how to exactly extract all the sub-events is the first challenge. [sent-29, score-1.086]

15 [page footer: … Methods in Natural Language Processing, pages 726–735. ©2013 Association for Computational Linguistics] consistency with the given event subject, so the subevents should not be considered equally when generating the component summaries. [sent-32, score-0.516]

16 The component summaries should be correlative across different dates based on the global collection (Yan et al. [sent-34, score-0.394]

17 To be different, in this paper we introduce “local/global” property to distinguish different part-whole relationship between the sub-events and the major event, which have not been considered before in storyline generation or summarization, to improve the quality of the storyline. [sent-37, score-0.517]

18 Other sub-events often share common properties with each other and have a close relationship with the major event; we call them “global-sub-events”. [sent-42, score-0.434]

19 For the event “Connecticut school shooting” which occurred on Dec. [sent-44, score-0.38]

20 Inspired by these, to detect different types of subevents based on word co-occurrences between subevents and the major event, we propose a mixture-event-aspect (MEA) model to formalize different types of sub-events into local/global aspects, which are implicated with clusters of sentences. [sent-46, score-0.467]

21 Then combining the local/global aspects with summarization requirements together, we utilized an optimization approach to get the optimal component summaries along the timeline. [sent-47, score-0.651]

22 In section 3 we present the details of optimized event storyline generation based on mixture-event-aspect model. [sent-52, score-0.676]

23 2 Related Work Our work is related to several lines of research in the literature including multi-document summarization (MDS), topic detection and tracking (TDT), temporal text mining (TXM) and temporal news summarization (TNS). [sent-55, score-0.543]

24 Multi-document summarization is a process to generate a summary by reducing documents in size while retaining the main information. [sent-56, score-0.232]

25 Kumaran and Allan (Kumaran and Allan, 2004) showed how performance on new event detection could be improved by the use of text classification techniques as well as by using named entities in a new way. [sent-68, score-0.317]

26 The task of temporal news summarization is to generate news summaries along the timeline from massive data. [sent-80, score-0.438]

27 (Chieu and Lee, 2004) built a system that extracted events relevant to a query from a collection of related documents and placed such events along a timeline. [sent-82, score-0.256]

28 , 2011b) designed an evolutionary timeline summarization approach to construct a timeline of a topic by optimizing the relevance, coverage, coherence, and diversity. [sent-85, score-0.393]

29 3 Approach Details In this section, we first propose a mixture-event-aspect model to detect local/global sub-events based on part-whole relationship with the major event and then present a new method to estimate the bursty of each aspect on a certain date. [sent-90, score-0.984]

30 Afterwards we utilize an optimization method based on local/global aspects to extract the qualified summary. [sent-91, score-0.304]

31 1 Mixture-Event-Aspect Model The key challenge to our storyline generation task is to detect and distinguish different types of subevents contained in the article collection. [sent-93, score-0.594]

32 In the collection, each sentence is assigned with a certain date and sentences that are assigned with the same date are grouped into the same sub-collection. [sent-94, score-0.394]
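The date-based grouping described in sentence 32 can be sketched as follows. This is a minimal illustration, assuming each sentence arrives as a (date, text) pair; the function name `group_by_date`, the date labels `t0`/`t2` and the sample sentences are hypothetical and do not come from the paper.

```python
from collections import defaultdict


def group_by_date(dated_sentences):
    """Group (date, sentence) pairs into per-date sub-collections C_t."""
    sub_collections = defaultdict(list)
    for date, sentence in dated_sentences:
        sub_collections[date].append(sentence)
    # Return a plain dict; keys keep first-seen order.
    return dict(sub_collections)


# Hypothetical toy input, for illustration only.
sample = [
    ("t0", "A shooting was reported at an elementary school."),
    ("t0", "Schools in the area were in lockdown."),
    ("t2", "A speech called for an end to such incidents."),
]
C = group_by_date(sample)
```

Each value in `C` plays the role of one sub-collection Ct, i.e. the sentences assigned to date t.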

33 Considering the consistency of content between the subevents and the major event, we model different subevents into two types: local-sub-event and global-sub-event, and introduce local/global aspects correspondingly. [sent-95, score-0.546]

34 Generally, local aspects which correspond to local-sub-events have distinctive words distribution from each other and sustain for a local context while the global aspects corresponding to global-sub-events have coincident words distribution with the major event. [sent-96, score-0.851]

35 Inspired by these ideas, we rely on word co-occurrences within local period context to detect mixed local and global aspects implicated in the whole collection. [sent-100, score-0.662]

36 We name this model the “Mixture-Event-Aspect (MEA)” model, which can simultaneously detect local/global aspects and cluster sentences and words into different aspects. [sent-101, score-0.309]

37 We model two distinct types of aspects: global aspects and local aspects, based on their relationship with the major event. [sent-106, score-0.52]

38 The distribution of global aspects is fixed for the collection while the distribution of local aspects is fixed to a local period of sub-collections. [sent-107, score-0.894]

39 That means a sentence is sampled either from the mixture of the global aspects or from the local aspects specific for the local context. [sent-108, score-0.741]

40 Here we take the event “Connecticut school shooting” as an example. [sent-109, score-0.38]

41 and calling for an end to such incidents”, the words such as “Obama”, “lecture”, “express” occur only within the local period of two days and have no co-occurrence with sentences from neighboring periods, so we sample the sentence as a local aspect sentence. [sent-113, score-0.612]

42 state of Connecticut were in lockdown after a shooting was reported at a local elementary school”, the words such as “Connecticut”, “shooting”, “elementary” have high co-occurrence frequency in the whole collection, so we sample the sentence as a global aspect sentence. [sent-116, score-0.63]

43 To detect aspects, we first divide words into two types: aspect words and background words. [sent-117, score-0.322]

44 Background words are commonly used in the whole event corpus while aspect words are clearly associated with the aspects of the sentences they occur in. [sent-118, score-0.807]

45 We associate each time window with a distribution over local aspects and a distribution defining preference of local aspects versus global aspects. [sent-121, score-0.783]

46 e, event subject, Ct represents the collection of sentences which are assigned with the date t. [sent-128, score-0.638]

47 which generates words for all sub-collections, and draw Agl global aspect unigram language models for global aspects and Aloc word distributions for local aspects. [sent-134, score-0.804]

48 There is also a multinomial distribution π that controls in each sentence how often the word occurs as a background word or an aspect word. [sent-136, score-0.307]

49 We assign each window v a distribution over local aspects and a distribution ρ defining preference for local aspects versus global aspects. [sent-138, score-0.783]

50 Figure 2: Mixture-Event-Aspect Model. Let SC denote the number of sentences in collection C, Nc,s denote the number of words in sentence s of collection c, and wc,s,n denote the nth word in sentence s. [sent-143, score-0.353]

51 There are two kinds of hidden variables: zc,s for each sentence to indicate the aspect a sentence belongs to, and yc,s,n for each word to indicate whether a word is generated from the background model or the aspect model. [sent-144, score-0.573]

52 Given a sentence s in the collection c, we apply Gibbs Sampling to estimate the conditional probability for local/global aspects using the following rules: p(vc,s = vh,ηc,s = gl, zc,s = a|v′, z′,y,w) ∝ n(cn. [sent-148, score-0.353]

53 $n^{gl}_{c,v_h}$ and $n^{loc}_{c,v_h}$ are the number of sentences in window vh that are assigned to global or local aspects. [sent-161, score-0.475]

54 $n^{gl}_{c,a}$ is the number of sentences in all global aspects that are assigned to aspect a and $n^{loc}_{c,v_h,a}$ is the number of local aspect sentences in the window vh that are assigned to aspect a. [sent-164, score-1.481]

55 Agl is the number of global aspects in collection C while Aloc is the number of local aspects. [sent-165, score-0.517]

56 E(l) represents the number of times of word l occurs in the current sentence and is assigned to be an aspect word, while E(. [sent-166, score-0.32]

57 ) is the total number of words in the current sentence that are assigned to be an aspect word. [sent-167, score-0.32]

58 $C^{B}_{(\cdot)}$ is the total number of background words and $C^{a}_{(\cdot)}$ is the number of words assigned to aspect a. [sent-171, score-0.326]

59 , 2009) to measure the popularity of the event on a certain date. [sent-176, score-0.317]

60 Intuitively, each aspect has different bursties on different dates. [sent-177, score-0.313]

61 In this section, we try to obtain the temporal aspect sequences of an event based on the bursty periods of all the aspects. [sent-178, score-0.925]

62 During its bursty period, one aspect should (1) be more popular than other aspects and (2) be continuously more popular than at other times. [sent-179, score-0.704]

63 Following these intuitions, we design a method to measure the bursty of each aspect and get the bursty period. [sent-180, score-0.803]

64 Let Ak be the kth aspect obtained from the mixture-event-aspect model, we estimate the bursty of Ak at a certain date t as follows. [sent-181, score-0.598]

65 $\mathrm{bursty}(A_k, t) = p(t \mid A_k) = \frac{p(A_k \mid t)\, p(t)}{\sum_{t'} p(A_k \mid t')\, p(t')}$, where p(Ak|t) is measured by the number of sentences assigned to aspect Ak in date t divided by the total number of sentences in date t. [sent-182, score-0.493]

66 p(t) is estimated by the total number of sentences in aspect Ak divided by the overall number of sentences in the collection C. [sent-183, score-0.455]
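The burstiness estimate can be sketched from sentence counts as follows. This is a hedged illustration, not the authors' implementation: it assumes every sentence carries a (date, aspect) label, and it estimates p(t) as the share of all sentences dated t, which is an assumption of this sketch. The name `burstiness` is hypothetical.

```python
from collections import Counter


def burstiness(assignments, aspect):
    """Return {date: p(date | aspect)} via Bayes' rule over sentence counts.

    `assignments` is a list of (date, aspect) labels, one per sentence.
    """
    total = len(assignments)
    per_date = Counter(d for d, _ in assignments)
    per_date_aspect = Counter(d for d, a in assignments if a == aspect)
    # Numerator p(A_k | t) * p(t); p(t) is assumed to be the share of
    # sentences dated t (an assumption of this sketch).
    joint = {
        d: (per_date_aspect[d] / per_date[d]) * (per_date[d] / total)
        for d in per_date
    }
    norm = sum(joint.values())
    if norm == 0:
        return {d: 0.0 for d in per_date}
    return {d: v / norm for d, v in joint.items()}


# Hypothetical labels: dates 0 and 1, aspects "a" and "b".
labels = [(0, "a"), (0, "a"), (0, "b"), (1, "a"), (1, "b"), (1, "b")]
b = burstiness(labels, "a")
```

The values for one aspect sum to 1 over all dates, so they can be read as a distribution of that aspect's popularity along the timeline.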

67 After getting the bursty of aspect Ak at each date, we can find the most popular date and expand on both sides to obtain the whole burst period in which the bursties are higher than the neighboring aspects and continuous higher than other dates. [sent-184, score-0.958]
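One way to picture the peak-and-expand step is the sketch below. It simplifies the stopping rule: the burst is expanded from the peak date while the burstiness stays above a threshold, taken here as the mean of the series. That threshold is an assumption of this sketch; the comparison against neighboring aspects mentioned in the text is not modeled. The name `burst_period` is hypothetical.

```python
def burst_period(series, threshold=None):
    """Return (start, end) indices of the burst around the peak of `series`.

    `series` holds one aspect's burstiness per date, in date order.
    """
    if not series:
        raise ValueError("series must not be empty")
    if threshold is None:
        # Assumed stopping rule: stay above the series mean.
        threshold = sum(series) / len(series)
    peak = max(range(len(series)), key=series.__getitem__)
    left = right = peak
    # Expand to earlier dates while they remain above the threshold.
    while left > 0 and series[left - 1] > threshold:
        left -= 1
    # Expand to later dates while they remain above the threshold.
    while right < len(series) - 1 and series[right + 1] > threshold:
        right += 1
    return left, right


# Hypothetical burstiness values for six consecutive dates.
period = burst_period([0.05, 0.1, 0.3, 0.35, 0.15, 0.05])
```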

68 3 Optimization-based Storyline Generation With the methods discussed in previous sections, we can get the local/global aspect sequence. [sent-186, score-0.265]

69 Each aspect contains numbers of sentences and we are aiming to select the most representative ones to compose the final storyline. [sent-187, score-0.286]

70 Considering users’ bias and the length requirement, different aspects should have different proportions in the last storyline. [sent-188, score-0.248]

71 For global aspects which correspond more to users’ interest, they should share a larger proportion in the final storyline than local aspects. [sent-189, score-0.714]

72 Thus, we use an optimization method to determine if a sentence is selected to be an summary sentence or to be discarded based on the multiple local/global aspects and finally get the optimal storyline. [sent-190, score-0.455]

73 We formalize this problem as selecting a subset of sentences S from the aspect Ak to minimize the information loss. [sent-191, score-0.325]

74 To model different costs between global or local aspects and determine the proportions of different aspects in the final storyline, we utilize a function ζ(s). [sent-195, score-0.651]

75 When sentences z and s are local aspect sentences, ζ(s) = χ; otherwise, ζ(s) = 1 − χ. [sent-196, score-0.385]

76 decreasing/increasing logistic functions, ℓ1(x) = 1/(1 + e^x) and ℓ2(x) = e^x/(1 + e^x), to define the cost function as O(z, s) = ζ(s) · ℓ1(S(s)) · ℓ2(S(z)) · DKL(s, z), where S(s) and S(z) are the ranking scores of sentences s and z among the aspect Ak with the LexPageRank algorithm. [sent-198, score-0.286]
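The cost function can be sketched directly from its definition. This is a minimal illustration under stated assumptions: sentences are represented by word distributions of equal length (the source does not specify how DKL(s, z) is computed), the smoothing constant `eps` and the default `chi=0.3` are hypothetical, and all function names are illustrative.

```python
import math


def l1(x):
    """Decreasing logistic function 1 / (1 + e^x)."""
    return 1.0 / (1.0 + math.exp(x))


def l2(x):
    """Increasing logistic function e^x / (1 + e^x)."""
    return math.exp(x) / (1.0 + math.exp(x))


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two aligned discrete distributions (eps is assumed smoothing)."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0
    )


def zeta(is_local, chi):
    """Aspect weight: chi for local aspect sentences, 1 - chi otherwise."""
    return chi if is_local else 1.0 - chi


def cost(score_s, score_z, dist_s, dist_z, is_local, chi=0.3):
    """O(z, s) = zeta(s) * l1(S(s)) * l2(S(z)) * D_KL(s, z)."""
    return (
        zeta(is_local, chi)
        * l1(score_s)
        * l2(score_z)
        * kl_divergence(dist_s, dist_z)
    )


# Hypothetical ranking scores and word distributions.
c_local = cost(0.0, 0.0, [0.5, 0.5], [0.9, 0.1], is_local=True)
c_global = cost(0.0, 0.0, [0.5, 0.5], [0.9, 0.1], is_local=False)
```

Because ℓ1 is decreasing and ℓ2 is increasing, the cost shrinks when s has a high ranking score and grows when z does; identical distributions give a cost of zero.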

77 With this optimization method, we get the representative sentences of each aspect for the given event subject. [sent-200, score-0.74]

78 Combining all the representative sentences together based on the aspect sequence, we finally generate the storyline. [sent-201, score-0.335]

79 1 Datasets To evaluate our framework for event storyline generation, we conduct our experiments on the datasets amounting to 12418 articles for 6 event subjects from 6 famous news websites, which provide date edited by professional editors. [sent-203, score-1.097]

80 To generate reference summary, we invite 12 undergraduate students with good English ability to read the sentences, and for each event subject we ask two students to label human storylines. [sent-206, score-0.357]

81 Specifically, baseline 2 and 3 are summarization systems which are similar to our storyline generation system. [sent-225, score-0.498]

82 • LDA+LexPageRank (LDALR): This method first applies standard LDA to detect latent topics from the collection and clusters sentences to multiple aspects, then utilizes PageRank to generate the most representative component summaries from all the aspects. [sent-232, score-0.445]

83 • MEA+LexPageRank (MEALR): This method applies the proposed mixture-event-aspect model to cluster sentences into multiple aspects and then utilizes PageRank to generate the most representative component summaries from all the aspects. [sent-233, score-0.535]

84 • MEA+Optimization (MEAOp): This method extracts local/global aspects with the proposed mixture-event-aspect model, and then utilizes the optimization method to get the qualified summary. [sent-234, score-0.385]

85 This is due to the fact that LexRank ranks all the sentences based on eigenvector centrality and the global relationship between sentences, which tends to select the most informative sentences as the summary. [sent-261, score-0.314]
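Ranking sentences by eigenvector centrality can be sketched with a PageRank-style power iteration over a sentence similarity matrix. This is an illustrative sketch, not the baseline systems' code: the damping factor, the iteration count, the function name `centrality` and the toy similarity matrix are all assumptions.

```python
def centrality(sim, damping=0.85, iters=50):
    """Score sentences by damped power iteration on a similarity matrix.

    `sim[i][j]` is a non-negative similarity between sentences i and j.
    """
    n = len(sim)
    # Row-normalize similarities into transition probabilities;
    # a sentence with no neighbors jumps uniformly.
    rows = []
    for row in sim:
        total = sum(row)
        rows.append([v / total for v in row] if total > 0 else [1.0 / n] * n)
    scores = [1.0 / n] * n
    for _ in range(iters):
        scores = [
            (1 - damping) / n
            + damping * sum(scores[j] * rows[j][i] for j in range(n))
            for i in range(n)
        ]
    return scores


# Hypothetical graph: sentence 0 is similar to both other sentences.
scores = centrality([[0, 1, 1], [1, 0, 0], [1, 0, 0]])
```

Sentences that are similar to many other sentences accumulate the highest scores, which matches the tendency to select the most informative sentences.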

86 5 Parameter Tuning Figure 5: Aspect sequence of event “Connecticut school shooting” (X-axis is the number of days after Dec. [sent-270, score-0.38]

87 Y-axis is bursty(Ak,t)) We take the event “Connecticut school shooting” as an example to show the usefulness of our method. [sent-282, score-0.38]

88 Figure 5 shows the aspect sequence based on the bursty periods of all aspects. [sent-283, score-0.539]

89 Table 4 shows part of the storylines for the event “Connecticut school shooting” generated by human and our method. [sent-285, score-0.494]

90 Through observation, we find that the peak of the event “Connecticut school shooting” is around the date when it occurred, and the sub-event “Gun control debate” has two bursty periods around the two peaks. [sent-286, score-0.786]

91 Compared with the human summary, our framework can extract the important sub-events contained in the collection, and satisfy users’ interest on different sub-events based on the part-whole relationship with the event subject. [sent-287, score-0.38]

92 (1) The component summaries of global aspects tend to share a larger proportion in the final storyline. (Table 4: Selected part of storyline generated by MEAOp and human.) [sent-289, score-0.79]

93 This is mainly for the reason that when researching for an event subject, users bias more to the information about the global-sub-events that have closely connection and coincident properties with the major event based on the part-whole relationship. [sent-290, score-0.782]

94 5 Conclusion In this work, we study the task of event storyline generation and present a novel method. [sent-296, score-0.676]

95 We innovatively introduce the properties of different subevents based on word co-occurrences to determine the part-whole relationship with the major event and develop a mixture-event-aspect (MEA) model to formalize different types of sub-events into local/global aspects. [sent-297, score-0.617]

96 Based on these local/global aspects, we utilize an optimization method to get the optimal component summaries along the aspect sequence. [sent-298, score-0.484]

97 Through our experiments we notice that our method generates overall better storyline than other baselines. [sent-300, score-0.311]

98 This indicates the effectiveness to detect different types of sub-events with the proposed mixture-event-aspect model and the necessity to distinguish different proportions of the component summaries based on local/global aspects. [sent-301, score-0.361]

99 Text classification and named entities for new event detection. [sent-334, score-0.317]

100 Mining correlated bursty topic patterns from coordinated text streams. [sent-413, score-0.307]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('event', 0.317), ('storyline', 0.311), ('bursty', 0.269), ('aspect', 0.231), ('connecticut', 0.206), ('aspects', 0.204), ('chieu', 0.18), ('shooting', 0.165), ('ak', 0.151), ('subevents', 0.144), ('summarization', 0.139), ('summaries', 0.125), ('mea', 0.124), ('storylines', 0.114), ('collection', 0.114), ('gl', 0.107), ('global', 0.1), ('local', 0.099), ('shenzhen', 0.098), ('lexpagerank', 0.098), ('date', 0.098), ('dir', 0.093), ('summary', 0.093), ('vh', 0.09), ('sp', 0.088), ('bursties', 0.082), ('timeline', 0.082), ('window', 0.077), ('period', 0.074), ('aloc', 0.072), ('yan', 0.071), ('draw', 0.07), ('temporal', 0.069), ('gun', 0.065), ('relationship', 0.063), ('school', 0.063), ('rouge', 0.063), ('ldalr', 0.062), ('makkonen', 0.062), ('meaop', 0.062), ('resende', 0.062), ('mei', 0.058), ('sentences', 0.055), ('component', 0.055), ('optimization', 0.054), ('news', 0.054), ('assigned', 0.054), ('major', 0.054), ('prestige', 0.054), ('kumaran', 0.054), ('minka', 0.054), ('users', 0.053), ('cc', 0.052), ('evolutionary', 0.052), ('gram', 0.052), ('events', 0.051), ('detect', 0.05), ('distinctive', 0.05), ('multi', 0.049), ('tdt', 0.049), ('representative', 0.049), ('generation', 0.048), ('pagerank', 0.048), ('loc', 0.048), ('utilizes', 0.047), ('necessity', 0.046), ('qualified', 0.046), ('balance', 0.044), ('proportions', 0.044), ('coincident', 0.041), ('countmatch', 0.041), ('distinctively', 0.041), ('divrank', 0.041), ('guangdong', 0.041), ('mealr', 0.041), ('pricai', 0.041), ('sey', 0.041), ('werneck', 0.041), ('background', 0.041), ('distinguish', 0.041), ('xiaoming', 0.041), ('radev', 0.04), ('subject', 0.04), ('along', 0.04), ('formalize', 0.039), ('debate', 0.039), ('periods', 0.039), ('titov', 0.039), ('topic', 0.038), ('cloud', 0.036), ('lappas', 0.036), ('incident', 0.036), ('dingding', 0.036), ('implicated', 0.036), ('mining', 0.035), ('sentence', 0.035), ('sigkdd', 0.035), ('wang', 0.034), ('get', 0.034), ('zhai', 0.034)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 147 emnlp-2013-Optimized Event Storyline Generation based on Mixture-Event-Aspect Model

Author: Lifu Huang ; Lian'en Huang

2 0.28754845 179 emnlp-2013-Summarizing Complex Events: a Cross-Modal Solution of Storylines Extraction and Reconstruction

Author: Shize Xu ; Shanshan Wang ; Yan Zhang

Abstract: The rapid development of Web2.0 leads to significant information redundancy. Especially for a complex news event, it is difficult to understand its general idea within a single coherent picture. A complex event often contains branches, intertwining narratives and side news which are all called storylines. In this paper, we propose a novel solution to tackle the challenging problem of storylines extraction and reconstruction. Specifically, we first investigate two requisite properties of an ideal storyline. Then a unified algorithm is devised to extract all effective storylines by optimizing these properties at the same time. Finally, we reconstruct all extracted lines and generate the high-quality story map. Experiments on real-world datasets show that our method is quite efficient and highly competitive, which can bring about quicker, clearer and deeper comprehension to readers.

3 0.23622255 16 emnlp-2013-A Unified Model for Topics, Events and Users on Twitter

Author: Qiming Diao ; Jing Jiang

Abstract: With the rapid growth of social media, Twitter has become one of the most widely adopted platforms for people to post short and instant messages. On the one hand, people tweet about their daily lives, and on the other hand, when major events happen, people also follow and tweet about them. Moreover, people’s posting behaviors on events are often closely tied to their personal interests. In this paper, we try to model topics, events and users on Twitter in a unified way. We propose a model which combines an LDA-like topic model and the Recurrent Chinese Restaurant Process to capture topics and events. We further propose a duration-based regularization component to find bursty events. We also propose to use event-topic affinity vectors to model the association between events and topics. Our experiments show that our model can accurately identify meaningful events and the event-topic affinity vectors are effective for event recommendation and grouping events by topics.

4 0.19353172 118 emnlp-2013-Learning Biological Processes with Global Constraints

Author: Aju Thalappillil Scaria ; Jonathan Berant ; Mengqiu Wang ; Peter Clark ; Justin Lewis ; Brittany Harding ; Christopher D. Manning

Abstract: Biological processes are complex phenomena involving a series of events that are related to one another through various relationships. Systems that can understand and reason over biological processes would dramatically improve the performance of semantic applications involving inference such as question answering (QA) – specifically “How?” and “Why?” questions. In this paper, we present the task of process extraction, in which events within a process and the relations between the events are automatically extracted from text. We represent processes by graphs whose edges describe a set of temporal, causal and co-reference event-event relations, and characterize the structural properties of these graphs (e.g., the graphs are connected). Then, we present a method for extracting relations between the events, which exploits these structural properties by performing joint inference over the set of extracted relations. On a novel dataset containing 148 descriptions of biological processes (released with this paper), we show significant improvement comparing to baselines that disregard process structure.

5 0.17866407 36 emnlp-2013-Automatically Determining a Proper Length for Multi-Document Summarization: A Bayesian Nonparametric Approach

Author: Tengfei Ma ; Hiroshi Nakagawa

Abstract: Document summarization is an important task in the area of natural language processing, which aims to extract the most important information from a single document or a cluster of documents. In various summarization tasks, the summary length is manually defined. However, how to find the proper summary length is quite a problem; and keeping all summaries restricted to the same length is not always a good choice. It is obviously improper to generate summaries with the same length for two clusters of documents which contain quite different quantity of information. In this paper, we propose a Bayesian nonparametric model for multidocument summarization in order to automatically determine the proper lengths of summaries. Assuming that an original document can be reconstructed from its summary, we describe the “reconstruction” by a Bayesian framework which selects sentences to form a good summary. Experimental results on DUC2004 data sets and some expanded data demonstrate the good quality of our summaries and the rationality of the length determination.

6 0.14819726 192 emnlp-2013-Unsupervised Induction of Contingent Event Pairs from Film Scenes

7 0.13951923 41 emnlp-2013-Building Event Threads out of Multiple News Articles

8 0.13160105 77 emnlp-2013-Exploiting Domain Knowledge in Aspect Extraction

9 0.12789656 74 emnlp-2013-Event-Based Time Label Propagation for Automatic Dating of News Articles

10 0.11901019 93 emnlp-2013-Harvesting Parallel News Streams to Generate Paraphrases of Event Relations

11 0.11175046 75 emnlp-2013-Event Schema Induction with a Probabilistic Entity-Driven Model

12 0.11073832 76 emnlp-2013-Exploiting Discourse Analysis for Article-Wide Temporal Classification

13 0.10367059 5 emnlp-2013-A Discourse-Driven Content Model for Summarising Scientific Articles Evaluated in a Complex Question Answering Task

14 0.098919846 85 emnlp-2013-Fast Joint Compression and Summarization via Graph Cuts

15 0.09492097 48 emnlp-2013-Collective Personal Profile Summarization with Social Networks

16 0.084803723 65 emnlp-2013-Document Summarization via Guided Sentence Compression

17 0.073195517 121 emnlp-2013-Learning Topics and Positions from Debatepedia

18 0.068179548 174 emnlp-2013-Single-Document Summarization as a Tree Knapsack Problem

19 0.063229993 90 emnlp-2013-Generating Coherent Event Schemas at Scale

20 0.061077632 7 emnlp-2013-A Hierarchical Entity-Based Approach to Structuralize User Generated Content in Social Media: A Case of Yahoo! Answers

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.219), (1, 0.174), (2, -0.124), (3, 0.308), (4, -0.003), (5, -0.215), (6, 0.056), (7, -0.041), (8, -0.061), (9, -0.002), (10, -0.039), (11, -0.004), (12, 0.045), (13, -0.121), (14, 0.012), (15, -0.023), (16, 0.147), (17, -0.018), (18, -0.084), (19, -0.156), (20, -0.058), (21, -0.107), (22, -0.079), (23, 0.077), (24, 0.104), (25, 0.124), (26, 0.011), (27, 0.081), (28, 0.027), (29, -0.103), (30, -0.032), (31, 0.08), (32, 0.029), (33, -0.07), (34, 0.064), (35, -0.067), (36, 0.035), (37, 0.041), (38, -0.03), (39, -0.032), (40, -0.031), (41, 0.024), (42, 0.032), (43, 0.043), (44, -0.023), (45, -0.071), (46, 0.017), (47, -0.082), (48, -0.044), (49, -0.15)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96071923 147 emnlp-2013-Optimized Event Storyline Generation based on Mixture-Event-Aspect Model

Author: Lifu Huang ; Lian'en Huang

2 0.73931605 179 emnlp-2013-Summarizing Complex Events: a Cross-Modal Solution of Storylines Extraction and Reconstruction

Author: Shize Xu ; Shanshan Wang ; Yan Zhang

3 0.63687623 16 emnlp-2013-A Unified Model for Topics, Events and Users on Twitter

Author: Qiming Diao ; Jing Jiang

4 0.62005919 36 emnlp-2013-Automatically Determining a Proper Length for Multi-Document Summarization: A Bayesian Nonparametric Approach

Author: Tengfei Ma ; Hiroshi Nakagawa

5 0.58944684 48 emnlp-2013-Collective Personal Profile Summarization with Social Networks

Author: Zhongqing Wang ; Shoushan LI ; Fang Kong ; Guodong Zhou

Abstract: Personal profile information on social media like LinkedIn.com and Facebook.com is at the core of many interesting applications, such as talent recommendation and contextual advertising. However, personal profiles usually lack organization confronted with the large amount of available information. Therefore, it is always a challenge for people to find desired information from them. In this paper, we address the task of personal profile summarization by leveraging both personal profile textual information and social networks. Here, using social networks is motivated by the intuition that, people with similar academic, business or social connections (e.g. co-major, co-university, and co-corporation) tend to have similar experience and summaries. To achieve the learning process, we propose a collective factor graph (CoFG) model to incorporate all these resources of knowledge to summarize personal profiles with local textual attribute functions and social connection factors. Extensive evaluation on a large-scale dataset from LinkedIn.com demonstrates the effectiveness of the proposed approach.

6 0.56816483 118 emnlp-2013-Learning Biological Processes with Global Constraints

7 0.56585813 192 emnlp-2013-Unsupervised Induction of Contingent Event Pairs from Film Scenes

8 0.53828311 5 emnlp-2013-A Discourse-Driven Content Model for Summarising Scientific Articles Evaluated in a Complex Question Answering Task

9 0.49394348 74 emnlp-2013-Event-Based Time Label Propagation for Automatic Dating of News Articles

10 0.45150816 41 emnlp-2013-Building Event Threads out of Multiple News Articles

11 0.42672977 85 emnlp-2013-Fast Joint Compression and Summarization via Graph Cuts

12 0.40328488 131 emnlp-2013-Mining New Business Opportunities: Identifying Trend related Products by Leveraging Commercial Intents from Microblogs

13 0.38772017 93 emnlp-2013-Harvesting Parallel News Streams to Generate Paraphrases of Event Relations

14 0.38372138 77 emnlp-2013-Exploiting Domain Knowledge in Aspect Extraction

15 0.33938855 75 emnlp-2013-Event Schema Induction with a Probabilistic Entity-Driven Model

16 0.32258669 94 emnlp-2013-Identifying Manipulated Offerings on Review Portals

17 0.30727047 100 emnlp-2013-Improvements to the Bayesian Topic N-Gram Models

18 0.30371079 76 emnlp-2013-Exploiting Discourse Analysis for Article-Wide Temporal Classification

19 0.2789408 174 emnlp-2013-Single-Document Summarization as a Tree Knapsack Problem

20 0.27623034 24 emnlp-2013-Application of Localized Similarity for Web Documents

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.028), (9, 0.013), (18, 0.02), (22, 0.051), (30, 0.051), (43, 0.01), (50, 0.011), (51, 0.129), (66, 0.034), (71, 0.026), (75, 0.521), (96, 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.8877005 147 emnlp-2013-Optimized Event Storyline Generation based on Mixture-Event-Aspect Model

Author: Lifu Huang ; Lian'en Huang

2 0.83813751 93 emnlp-2013-Harvesting Parallel News Streams to Generate Paraphrases of Event Relations

Author: Congle Zhang ; Daniel S. Weld

Abstract: The distributional hypothesis, which states that words that occur in similar contexts tend to have similar meanings, has inspired several Web mining algorithms for paraphrasing semantically equivalent phrases. Unfortunately, these methods have several drawbacks, such as confusing synonyms with antonyms and causes with effects. This paper introduces three Temporal Correspondence Heuristics that characterize regularities in parallel news streams, and shows how they may be used to generate high precision paraphrases for event relations. We encode the heuristics in a probabilistic graphical model to create the NEWSSPIKE algorithm for mining news streams. We present experiments demonstrating that NEWSSPIKE significantly outperforms several competitive baselines. In order to spur further research, we provide a large annotated corpus of timestamped news articles as well as the paraphrases produced by NEWSSPIKE.

3 0.81378722 31 emnlp-2013-Automatic Feature Engineering for Answer Selection and Extraction

Author: Aliaksei Severyn ; Alessandro Moschitti

Abstract: This paper proposes a framework for automatically engineering features for two important tasks of question answering: answer sentence selection and answer extraction. We represent question and answer sentence pairs with linguistic structures enriched by semantic information, where the latter is produced by automatic classifiers, e.g., question classifier and Named Entity Recognizer. Tree kernels applied to such structures enable a simple way to generate highly discriminative structural features that combine syntactic and semantic information encoded in the input trees. We conduct experiments on a public benchmark from TREC to compare with previous systems for answer sentence selection and answer extraction. The results show that our models greatly improve on the state of the art, e.g., up to 22% on F1 (relative improvement) for answer extraction, while using no additional resources and no manual feature engineering.

4 0.75557876 117 emnlp-2013-Latent Anaphora Resolution for Cross-Lingual Pronoun Prediction

Author: Christian Hardmeier ; Jorg Tiedemann ; Joakim Nivre

Abstract: This paper addresses the task of predicting the correct French translations of third-person subject pronouns in English discourse, a problem that is relevant as a prerequisite for machine translation and that requires anaphora resolution. We present an approach based on neural networks that models anaphoric links as latent variables and show that its performance is competitive with that of a system with separate anaphora resolution while not requiring any coreference-annotated training data. This demonstrates that the information contained in parallel bitexts can successfully be used to acquire knowledge about pronominal anaphora in an unsupervised way.

1 Motivation

When texts are translated from one language into another, the translation reconstructs the meaning or function of the source text with the means of the target language. Generally, this has the effect that the entities occurring in the translation and their mutual relations will display similar patterns as the entities in the source text. In particular, coreference patterns tend to be very similar in translations of a text, and this fact has been exploited with good results to project coreference annotations from one language into another by using word alignments (Postolache et al., 2006; Rahman and Ng, 2012). On the other hand, what is true in general need not be true for all types of linguistic elements. For instance, a substantial percentage of the English third-person subject pronouns he, she, it and they does not get realised as pronouns in French translations (Hardmeier, 2012).
Moreover, it has been recognised by various authors in the statistical machine translation (SMT) community (Le Nagard and Koehn, 2010; Hardmeier and Federico, 2010; Guillou, 2012) that pronoun translation is a difficult problem because, even when a pronoun does get translated as a pronoun, it may require choosing the correct word form based on agreement features that are not easily predictable from the source text.

The work presented in this paper investigates the problem of cross-lingual pronoun prediction for English-French. Given an English pronoun and its discourse context as well as a French translation of the same discourse and word alignments between the two languages, we attempt to predict the French word aligned to the English pronoun. As far as we know, this task has not been addressed in the literature before. In our opinion, it is interesting for several reasons. By studying pronoun prediction as a task in its own right, we hope to contribute towards a better understanding of pronoun translation with a long-term view to improving the performance of SMT systems. Moreover, we believe that this task can lead to interesting insights about anaphora resolution in a multi-lingual context. In particular, we show in this paper that the pronoun prediction task makes it possible to model the resolution of pronominal anaphora as a latent variable and opens up a way to solve a task relying on anaphora resolution without using any data annotated for anaphora. This is what we consider the main contribution of our present work.

We start by modelling cross-lingual pronoun prediction as an independent machine learning task after doing anaphora resolution in the source language (English) using the BART software (Broscheit et al., 2010).
We show that it is difficult to achieve satisfactory performance with standard maximum entropy classifiers especially for low-frequency pronouns such as the French feminine plural pronoun elles. We propose a neural network classifier that achieves better precision and recall and manages to make reasonable predictions for all pronoun categories in many cases.

We then go on to extend our neural network architecture to include anaphoric links as latent variables. We demonstrate that our classifier, now with its own source language anaphora resolver, can be trained successfully with backpropagation. In this setup, we no longer use the machine learning component included in the external coreference resolution system (BART) to predict anaphoric links. Anaphora resolution is done by our neural network classifier and requires only some quantity of word-aligned parallel data for training, completely obviating the need for a coreference-annotated training set.

[Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 380–391, Seattle, Washington, USA, 18-21 October 2013. ©2013 Association for Computational Linguistics]

[Figure 1: Task setup. English: "The latest version released in March is equipped with ... It is sold at ..."; French: "La dernière version lancée en mars est dotée de ... • est vendue ..."]

2 Task Setup

The overall setup of the classification task we address in this paper is shown in Figure 1. We are given an English discourse containing a pronoun along with its French translation and word alignments between the two languages, which in our case were computed automatically using a standard SMT pipeline with GIZA++ (Och and Ney, 2003). We focus on the four English third-person subject pronouns he, she, it and they.
The output of the classifier is a multinomial distribution over six classes: the four French subject pronouns il, elle, ils and elles, corresponding to masculine and feminine singular and plural, respectively; the impersonal pronoun ce/c’, which occurs in some very frequent constructions such as c’est (it is); and a sixth class OTHER, which indicates that none of these pronouns was used. In general, a pronoun may be aligned to multiple words; in this case, a training example is counted as a positive example for a class if the target word occurs among the words aligned to the pronoun, irrespective of the presence of other aligned tokens.

[Figure 2: Antecedent feature aggregation (word, candidate and training example vectors)]

This task setup resembles the problem that an SMT system would have to solve to make informed choices when translating pronouns, an aspect of translation neglected by most existing SMT systems. An important difference between the SMT setup and our own classifiers is that we use context from human-made translations for prediction. This potentially makes the task both easier and more difficult; easier, because the context can be relied on to be correctly translated, and more difficult, because human translators frequently create less literal translations than an SMT system would. Integrating pronoun prediction into the translation process would require significant changes to the standard SMT decoding setup in order to take long-range dependencies in the target language into account, which is why we do not address this issue in our current work.

In all the experiments presented in this paper, we used features from two different sources: Anaphora context features describe the source language pronoun and its immediate context consisting of three words to its left and three words to its right.
They are encoded as vectors whose dimensionality is equal to the source vocabulary size with a single non-zero component indicating the word referred to (one-hot vectors). Antecedent features describe an antecedent candidate. Antecedent candidates are represented by the target language words aligned to the syntactic head of the source language markable noun phrase as identified by the Collins head finder (Collins, 1999). The different handling of anaphora context features and antecedent features is due to the fact that we always consider a constant number of context words on the source side, whereas the number of word vectors to be considered depends on the number of antecedent candidates and on the number of target words aligned to each antecedent. The encoding of the antecedent features is illustrated in Figure 2 for a training example with two antecedent candidates translated to elle and la version, respectively. The target words are represented as one-hot vectors with the dimensionality of the target language vocabulary. These vectors are then averaged to yield a single vector per antecedent candidate. Finally, the vectors of all candidates for a given training example are weighted by the probabilities assigned to them by the anaphora resolver (p1 and p2) and summed to yield a single vector per training example.

Table 1: Distribution of classes in the training data

          TED       News
ce        16.3 %     6.4 %
elle       7.1 %    10.1 %
elles      3.0 %     3.9 %
il        17.1 %    26.5 %
ils       15.6 %    15.1 %
OTHER     40.9 %    38.0 %

3 Data Sets and External Tools

We run experiments with two different test sets. The TED data set consists of around 2.6 million tokens of lecture subtitles released in the WIT3 corpus (Cettolo et al., 2012). The WIT3 training data yields 71,052 examples, which were randomly partitioned into a training set of 63,228 examples and a test set of 7,824 examples. The official WIT3 development and test sets were not used in our experiments.
The news-commentary data set is version 6 of the parallel news-commentary corpus released as a part of the WMT 2011 training data (http://www.statmt.org/wmt11/translation-task.html, 3 July 2013). It contains around 2.8 million tokens of news text and yields 31,017 data points, which were randomly split into 27,900 training examples and 3,117 test instances. The distribution of the classes in the two training sets is shown in Table 1. One thing to note is the dominance of the OTHER class, which pools together such different phenomena as translations with other pronouns not in our list (e.g., celui-ci) and translations with full noun phrases instead of pronouns. Splitting this group into more meaningful subcategories is not straightforward and must be left to future work.

The feature setup of all our classifiers requires the detection of potential antecedents and the extraction of features pairing anaphoric pronouns with antecedent candidates. Some of our experiments also rely on an external anaphora resolution component. We use the open-source anaphora resolver BART to generate this information. BART (Broscheit et al., 2010) is an anaphora resolution toolkit consisting of a markable detection and feature extraction pipeline based on a variety of standard natural language processing (NLP) tools and a machine learning component to predict coreference links including both pronominal anaphora and noun-noun coreference. In our experiments, we always use BART’s markable detection and feature extraction machinery. Markable detection is based on the identification of noun phrases in constituency parses generated with the Stanford parser (Klein and Manning, 2003). The set of features extracted by BART is an extension of the widely used mention-pair anaphora resolution feature set by Soon et al. (2001) (see below, Section 6).

In the experiments of the next two sections, we also use BART to predict anaphoric links for pronouns.
The model used with BART is a maximum entropy ranker trained on the ACE02-npaper corpus (LDC2003T11). In order to obtain a probability distribution over antecedent candidates rather than one-best predictions or coreference sets, we modified the ranking component with which BART resolves pronouns to normalise and output the scores assigned by the ranker to all candidates instead of picking the highest-scoring candidate.

4 Baseline Classifiers

In order to create a simple, but reasonable baseline for our task, we trained a maximum entropy (ME) classifier with the MegaM software package (footnote 2) using the features described in the previous section and the anaphora links found by BART. Results are shown in Table 2.

Table 2: Maximum entropy classifier results

          TED (Accuracy: 0.685)      News commentary (Accuracy: 0.576)
          P      R      F            P      R      F
ce        0.593  0.728  0.654        0.508  0.294  0.373
elle      0.798  0.523  0.632        0.530  0.312  0.393
elles     0.812  0.164  0.273        0.538  0.062  0.111
il        0.764  0.550  0.639        0.600  0.666  0.631
ils       0.632  0.949  0.759        0.593  0.769  0.670
OTHER     0.724  0.692  0.708        0.564  0.609  0.586

Table 3: Neural network classifier with anaphoras resolved by BART

          TED (Accuracy: 0.700)      News commentary (Accuracy: 0.576)
          P      R      F            P      R      F
ce        0.634  0.747  0.686        0.477  0.344  0.400
elle      0.756  0.617  0.679        0.498  0.401  0.444
elles     0.679  0.319  0.434        0.565  0.116  0.193
il        0.719  0.591  0.649        0.655  0.626  0.640
ils       0.663  0.940  0.778        0.570  0.834  0.677
OTHER     0.743  0.678  0.709        0.567  0.573  0.570

The baseline results show an overall higher accuracy for the TED data than for the news-commentary data. While the precision is above 50 % in all categories and considerably higher in some, recall varies widely. The pronoun elles is particularly interesting. This is the feminine plural of the personal pronoun, and it usually corresponds to the English pronoun they, which is not marked for gender.
In French, elles is a marked choice which is only used if the antecedent exclusively refers to females or feminine-gendered objects. The presence of a single item with masculine grammatical gender in the antecedent will trigger the use of the masculine plural pronoun ils instead. This distinction cannot be predicted from the English source pronoun or its context; making correct predictions requires knowledge about the antecedent of the pronoun. Moreover, elles is a low-frequency pronoun. There are only 1,909 occurrences of this pronoun in the TED training data, and 1,077 in the news-commentary training set. Because of these special properties of the feminine plural class, we argue that the performance of a classifier on elles is a good indicator of how well it can represent relevant knowledge about pronominal anaphora as opposed to overfitting to source contexts or acting on prior assumptions about class frequencies.

In accordance with the general linguistic preference for ils, the classifier tends to predict ils much more often than elles when encountering an English plural pronoun. This is reflected in the fact that elles has much lower recall than ils. Clearly, the classifier achieves a good part of its accuracy by making majority choices without exploiting deeper knowledge about the antecedents of pronouns.

[Footnote 2: http://www.umiacs.umd.edu/~hal/megam/ (20 June 2013)]

An additional experiment with a subset of 27,900 training examples from the TED data confirms that the difference between TED and news commentaries is not just an effect of training data size, but that TED data is genuinely easier to predict than news commentaries. In the reduced data TED condition, the classifier achieves an accuracy of 0.673.
Precision and recall of all classifiers are much closer to the large-data TED condition than to the news commentary experiments, except for elles, where we obtain an F-score of 0.072 (P 0.818, R 0.038), indicating that small training data size is a serious problem for this low-frequency class.

[Figure 3: Neural network for pronoun prediction]

5 Neural Network Classifier

In the previous section, we saw that a simple multiclass maximum entropy classifier, while making correct predictions for much of the data set, has a significant bias towards making majority class decisions, relying more on prior assumptions about the frequency distribution of the classes than on antecedent features when handling examples of less frequent classes. In order to create a system that can be trained to rely more explicitly on antecedent information, we created a neural network classifier for our task. The introduction of a hidden layer should enable the classifier to learn abstract concepts such as gender and number that are useful across multiple output categories, so that the performance of sparsely represented classes can benefit from the training examples of the more frequent classes.

The overall structure of the network is shown in Figure 3. As inputs, the network takes the same features that were available to the baseline ME classifier, based on the source pronoun (P) with three words of context to its left (L1 to L3) and three words to its right (R1 to R3) as well as the words aligned to the syntactic head words of all possible antecedent candidates as found by BART (A). All words are encoded as one-hot vectors whose dimensionality is equal to the vocabulary size. If multiple words are aligned to the syntactic head of an antecedent candidate, their word vectors are averaged with uniform weights. The resulting vectors for each antecedent are then averaged with weights defined by the posterior distribution of the anaphora resolver in BART (p1 to p3).
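The antecedent feature aggregation described above (uniform averaging of aligned word vectors, then weighting by the resolver's probabilities) can be sketched in a few lines. This is a minimal illustration, not the authors' code: the toy vocabulary, the function names and the probabilities 0.9 and 0.1 are assumptions chosen for the example with the two candidates elle and la version.

```python
def one_hot(index, size):
    """Return a one-hot vector of the given size."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def average(vectors):
    """Uniform average of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def aggregate_antecedents(candidates, probs, vocab):
    """Combine all antecedent candidates into one feature vector.

    candidates: list of lists of target words aligned to each candidate head
    probs:      one probability per candidate (from the anaphora resolver)
    vocab:      hypothetical mapping from target word to vector index
    """
    size = len(vocab)
    total = [0.0] * size
    for words, p in zip(candidates, probs):
        cand_vec = average([one_hot(vocab[w], size) for w in words])
        total = [t + p * c for t, c in zip(total, cand_vec)]
    return total


# Toy target vocabulary and illustrative probabilities (assumptions).
vocab = {"elle": 0, "la": 1, "version": 2}
features = aggregate_antecedents([["elle"], ["la", "version"]], [0.9, 0.1], vocab)
print(features)
```

Because each candidate vector sums to one and the candidate probabilities sum to one, the aggregated vector also sums to one, which keeps the input scale independent of the number of candidates.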
The network has two hidden layers. The first layer (E) maps the input word vectors to a low-dimensional representation. In this layer, the embedding weights for all the source language vectors (the pronoun and its 6 context words) are tied, so if two words are the same, they are mapped to the same lower-dimensional embedding irrespective of their position relative to the pronoun. The embedding of the antecedent word vectors is independent, as these word vectors represent target language words. The entire embedding layer is then mapped to another hidden layer (H), which is in turn connected to a softmax output layer (S) with 6 outputs representing the classes ce, elle, elles, il, ils and OTHER. The non-linearity of both hidden layers is the logistic sigmoid function, f(x) = 1/(1 + e^{-x}). In all experiments reported in this paper, the dimensionality of the source and target language word embeddings is 20, resulting in a total embedding layer size of 160, and the size of the last hidden layer is equal to 50. These sizes are fairly small. In experiments with larger layer sizes, we were able to obtain similar, but no better results.

The neural network is trained with mini-batch stochastic gradient descent with backpropagated gradients using the RMSPROP algorithm with cross-entropy as the objective function (footnote 3). In contrast to standard gradient descent, RMSPROP normalises the magnitude of the gradient components by dividing them by a root-mean-square moving average. We found this led to faster convergence. Other features of our training algorithm include the use of momentum to even out gradient oscillations, adaptive learning rates for each weight as well as adaptation of the global learning rate as a function of current training progress. The network is regularised with an ℓ2 weight penalty. Good settings of the initial learning rate and the weight cost parameter (both around 0.001 in most experiments) were found by manual experimentation.
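The layer sizes given above can be checked with a small forward pass: seven source positions (P, L1 to L3, R1 to R3) share one embedding matrix, the aggregated antecedent vector has its own, and 7 × 20 + 20 = 160 embedding units feed a hidden layer of 50 units and a softmax over 6 classes. The following is only a sketch under stated assumptions: the vocabulary sizes, the random initialisation, the word indices and the omission of bias terms are choices made for the example and do not come from the paper.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(xs):
    m = max(xs)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def forward(source_ids, antecedent_vec, params):
    """Forward pass: embedding layer E, hidden layer H, softmax output S."""
    e_src, e_tgt, w_h, w_s = params
    embedding = []
    # A one-hot input times the tied source matrix is just a row lookup.
    for idx in source_ids:
        embedding.extend(sigmoid(v) for v in e_src[idx])
    # The antecedent input is a weighted average over target words.
    emb_dim = len(e_tgt[0])
    projected = [
        sum(a * row[d] for a, row in zip(antecedent_vec, e_tgt))
        for d in range(emb_dim)
    ]
    embedding.extend(sigmoid(v) for v in projected)
    hidden = [sigmoid(v) for v in matvec(w_h, embedding)]
    return embedding, softmax(matvec(w_s, hidden))


# Toy sizes for the vocabularies (assumptions); layer sizes follow the text.
SRC_VOCAB, TGT_VOCAB, EMB, HIDDEN, CLASSES = 30, 10, 20, 50, 6
rng = random.Random(0)
params = (
    rand_matrix(SRC_VOCAB, EMB, rng),   # tied source embedding
    rand_matrix(TGT_VOCAB, EMB, rng),   # antecedent (target) embedding
    rand_matrix(HIDDEN, 8 * EMB, rng),  # embedding layer -> hidden layer
    rand_matrix(CLASSES, HIDDEN, rng),  # hidden layer -> softmax
)
source_ids = [3, 1, 4, 0, 5, 9, 2]      # L3, L2, L1, P, R1, R2, R3 (arbitrary)
antecedent_vec = [0.9, 0.05, 0.05] + [0.0] * (TGT_VOCAB - 3)
embedding, probs = forward(source_ids, antecedent_vec, params)
print(len(embedding), len(probs))
```

Tying the source embedding means a word receives the same 20-dimensional representation in every context position, so the number of source embedding weights does not grow with the context window.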
Generally, we train our networks for 300 epochs, compute the validation error on a held-out set of some 10 % of the training data after each epoch and use the model that achieved the lowest validation error for testing.

Since the source context features are very informative and it is comparatively more difficult to learn from the antecedents, the network sometimes had a tendency to overfit to the source features and disregard antecedent information. We found that this problem can be solved effectively by presenting a part of the training data without any source features, forcing the network to learn from the information contained in the antecedents. In all experiments in this paper, we zero out all source features (input layers P, L1 to L3 and R1 to R3) with a probability of 50 % in each training example. At test time, no information is zeroed out.

Classification results with this network are shown in Table 3. We note that the accuracy has increased slightly for the TED test set and remains exactly the same for the news commentary corpus. However, a closer look at the results for individual classes reveals that the neural network makes better predictions for almost all classes. In terms of F-score, the only class that becomes slightly worse is the OTHER class for the news commentary corpus because of lower recall, indicating that the neural network classifier is less biased towards using the uninformative OTHER category. Recall for elle and elles increases considerably, but especially for elles it is still quite low. The increase in recall comes with some loss in precision, but the net effect on F-score is clearly positive.

[Footnote 3: Our training procedure is greatly inspired by a series of online lectures held by Geoffrey Hinton in 2012 (https://www.coursera.org/course/neuralnets, 10 September 2013).]
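The source-feature masking used during training (all source inputs zeroed with probability 50 % per training example, nothing zeroed at test time) could look roughly like the sketch below. The function name, the argument layout and the toy input are hypothetical; only the masking rule itself is taken from the text.

```python
import random


def mask_source_features(source_vectors, rng, drop_prob=0.5, training=True):
    """Zero out all source inputs (P, L1-L3, R1-R3) for a whole example.

    With probability drop_prob during training, every source vector is
    replaced by zeros so that only antecedent information remains.
    At test time the input is returned unchanged.
    """
    if training and rng.random() < drop_prob:
        return [[0.0] * len(vec) for vec in source_vectors]
    return source_vectors


rng = random.Random(0)
example = [[0.0, 1.0, 0.0] for _ in range(7)]  # 7 toy one-hot source vectors

# Estimate how often an example ends up without any source features.
trials = 10000
zeroed = sum(
    1
    for _ in range(trials)
    if all(x == 0.0 for vec in mask_source_features(example, rng) for x in vec)
)
fraction_zeroed = zeroed / trials
test_time = mask_source_features(example, rng, training=False)
print(fraction_zeroed)
```

Note that the mask applies to the whole example at once rather than to individual units, which is what distinguishes it from unit-level dropout.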
6 Latent Anaphora Resolution

Considering Figure 1 again, we note that the bilingual setting of our classification task adds some information not available to the monolingual anaphora resolver that can be helpful when determining the correct antecedent for a given pronoun. Knowing the gender of the translation of a pronoun limits the set of possible antecedents to those whose translation is morphologically compatible with the target language pronoun. We can exploit this fact to learn how to resolve anaphoric pronouns without requiring data with manually annotated anaphoric links.

To achieve this, we extend our neural network with a component to predict the probability of each antecedent candidate to be the correct antecedent (Figure 4). The extended network is identical to the previous version except for the upper left part dealing with anaphoric link features. The only difference between the two networks is the fact that anaphora resolution is now performed by a part of our neural network itself instead of being done by an external module and provided to the classifier as an input.

In this setup, we still use some parts of the BART toolkit to extract markables and compute features. However, we do not make use of the machine learning component in BART that makes the actual predictions. Since this is the only component trained on coreference-annotated data in a typical BART configuration, no coreference annotations are used anywhere in our system even though we continue to rely on the external anaphora resolver for preprocessing to avoid implementing our own markable and feature extractors and to make comparison easier.

For each candidate markable identified by BART’s preprocessing pipeline, the anaphora resolution model receives as input a link feature vector (T) describing relevant aspects of the antecedent candidate-anaphora pair.
This feature vector is generated by the feature extraction machinery in BART and includes a standard feature set for coreference resolution partially based on work by Soon et al. (2001). We use the following feature extractors in BART, each of which can generate multiple features:

- Anaphora mention type
- Gender match
- Number match
- String match
- Alias feature (Soon et al., 2001)
- Appositive position feature (Soon et al., 2001)
- Semantic class (Soon et al., 2001)
- Semantic class match
- Binary distance feature
- Antecedent is first mention in sentence

[Figure 4: Neural network with latent anaphora resolution]

Our baseline set of features was borrowed wholesale from a working coreference system and includes some features that are not relevant to the task at hand, e.g., features indicating that the anaphora is a pronoun, is not a named entity, etc. After removing all features that assume constant values in the training set when resolving antecedents for the set of pronouns we consider, we are left with a basic set of 37 anaphoric link features that are fed as inputs to our network. These features are exactly the same as those available to the anaphora resolution classifier in the BART system used in the previous section.

Each training example for our network can have an arbitrary number of antecedent candidates, each of which is described by an antecedent word vector (A) and by an anaphoric link vector (T). The anaphoric link features are first mapped to a regular hidden layer with logistic sigmoid units (U). The activations of the hidden units are then mapped to a single value, which functions as an element in a softmax layer over all antecedent candidates (V). This softmax layer assigns a probability to each antecedent candidate, which we then use to compute a weighted average over the antecedent word vector, replacing the probabilities pi in Figures 2 and 3.
At training time, the network’s anaphora resolution component is trained in exactly the same way as the rest of the network. The error signal from the embedding layer is backpropagated both to the weight matrix defining the antecedent word embedding and to the anaphora resolution subnetwork. Note that the number of weights in the network is the same for all training examples even though the number of antecedent candidates varies because all weights related to antecedent word features and anaphoric link features are shared between all antecedent candidates.

One slightly uncommon feature of our neural network is that it contains an internal softmax layer to generate normalised probabilities over all possible antecedent candidates. Moreover, weights are shared between all antecedent candidates, so the inputs of our internal softmax layer share dependencies on the same weight variables. When computing derivatives with backpropagation, these shared dependencies must be taken into account. In particular, the outputs y_i of the antecedent resolution layer are the result of a softmax applied to functions of some shared variables q:

\[ y_i = \frac{\exp f_i(q)}{\sum_k \exp f_k(q)} \qquad (1) \]

The derivatives of any y_i with respect to q, which can be any of the weights in the anaphora resolution subnetwork, have dependencies on the derivatives of the other softmax inputs with respect to q:

\[ \frac{\partial y_i}{\partial q} = y_i \left( \frac{\partial f_i(q)}{\partial q} - \sum_k y_k \frac{\partial f_k(q)}{\partial q} \right) \qquad (2) \]

This makes the implementation of backpropagation for this part of the network somewhat more complicated, but in the case of our networks, it has no major impact on training time.

Experimental results for this network are shown in Table 4. Compared with Table 3, we note that the overall accuracy is only very slightly lower for TED, and for the news commentaries it is actually better. When it comes to F-scores, the performance for elles improves by a small amount, while the effect on the other classes is a bit more mixed.
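The derivative of the internal softmax with respect to a shared weight (Equation 2) can be checked numerically on a toy case. The sketch below assumes a deliberately simple scoring function f_i(q) = q · t_i with a single shared scalar weight q and one made-up link feature t_i per candidate; the real subnetwork has a hidden layer and 37 link features, so this only illustrates the shared-parameter gradient, not the authors' implementation.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def scores(q, feats):
    """Toy scoring function f_i(q) = q * t_i, sharing q across candidates."""
    return [q * t for t in feats]


def analytic_grad(q, feats):
    """dy_i/dq = y_i * (df_i/dq - sum_k y_k * df_k/dq), with df_i/dq = t_i."""
    y = softmax(scores(q, feats))
    expected = sum(yk * tk for yk, tk in zip(y, feats))
    return [yi * (ti - expected) for yi, ti in zip(y, feats)]


def numeric_grad(q, feats, h=1e-6):
    """Central finite difference of each softmax output with respect to q."""
    hi = softmax(scores(q + h, feats))
    lo = softmax(scores(q - h, feats))
    return [(a - b) / (2.0 * h) for a, b in zip(hi, lo)]


feats = [0.5, -1.0, 2.0]  # one made-up link feature per antecedent candidate
q = 0.3
analytic = analytic_grad(q, feats)
numeric = numeric_grad(q, feats)
max_diff = max(abs(a - n) for a, n in zip(analytic, numeric))
print(max_diff)
```

Since the candidate probabilities always sum to one, their derivatives with respect to any shared weight sum to zero, which gives a second quick sanity check on an implementation.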
Even where it gets worse, the differences are not dramatic considering that we eliminated a very knowledge-rich resource from the training process. This demonstrates that it is possible, in our classification task, to obtain good results without using any data manually annotated for anaphora and to rely entirely on unsupervised latent anaphora resolution.

7 Further Improvements

The results presented in the preceding section represent a clear improvement over the ME classifiers in Table 2, even though the overall accuracy increased only slightly. Not only does our neural network classifier achieve better results on the classification task at hand without requiring an anaphora resolution classifier trained on manually annotated data, but it performs clearly better for the feminine categories that reflect minority choices requiring knowledge about the antecedents. Nevertheless, the performance is still not entirely satisfactory. By subjecting the output of our classifier on a development set to a manual error analysis, we found that a fairly large number of errors belong to two error types: On the one hand, the preprocessing pipeline used to identify antecedent candidates does not always include the correct antecedent in the set presented to the neural network. Whenever this occurs, it is obvious that the classifier cannot possibly find the correct antecedent. Out of 76 examples of the category elles that had been mistakenly predicted as ils, we found that 43 suffered from this problem. In other classes, the problem seems to be somewhat less common, but it still exists. On the other hand, in many cases (23 out of 76 for the category mentioned before) the anaphora resolution subnetwork does identify an antecedent manually recognised to belong to the right gender/number group, but still predicts an incorrect pronoun. This may indicate that the network has difficulties learning a correct gender/number representation for all words in the vocabulary.
7.1 Relaxing Markable Extraction

The pipeline we use to extract potential antecedent candidates is borrowed from the BART anaphora resolution toolkit. BART uses a syntactic parser to identify noun phrases as markables. When extracting antecedent candidates for coreference prediction, it starts by considering a window consisting of the sentence in which the anaphoric pronoun is located and the two immediately preceding sentences. Markables in this window are checked for morphological compatibility in terms of gender and number with the anaphoric pronoun, and only compatible markables are extracted as antecedent candidates. If no compatible markables are found in the initial window, the window is successively enlarged one sentence at a time until at least one suitable markable is found.

Our error analysis shows that this procedure misses some relevant markables both because the initial two-sentence extraction window is too small and because the morphological compatibility check incorrectly filters away some markables that should have been considered as candidates. By contrast, the extraction procedure does extract quite a number of first and second person noun phrases (I, we, you and their oblique forms) in the TED talks which are extremely unlikely to be the antecedent of a later occurrence of he, she, it or they. As a first step, we therefore adjust the extraction criteria to our task by increasing the initial extraction window to five sentences, excluding first and second person markables and removing the morphological compatibility requirement. The compatibility check is still used to control expansion of the extraction window, but it is no longer applied to filter the extracted markables.
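The relaxed extraction procedure can be sketched as a small loop over sentences. Everything concrete here is an assumption made for illustration: markables are plain strings grouped by sentence, the compatibility check is passed in as a predicate, "five sentences" is read as five preceding sentences plus the pronoun's own sentence, and the first and second person list is a small stand-in. BART's actual data structures and checks are not reproduced.

```python
# Illustrative stand-in list of first and second person forms (assumption).
FIRST_SECOND_PERSON = {"i", "me", "we", "us", "you"}


def extract_candidates(sentences, pronoun_idx, compatible, initial_window=5):
    """Collect antecedent candidates for a pronoun in sentence pronoun_idx.

    sentences:  list of sentences, each a list of markable strings
    compatible: predicate telling whether a markable agrees with the pronoun
    The compatibility check only decides whether the window must grow;
    it does not filter the markables that are returned.
    """
    start = max(0, pronoun_idx - initial_window)
    while True:
        window = [
            m
            for sentence in sentences[start:pronoun_idx + 1]
            for m in sentence
            if m.lower() not in FIRST_SECOND_PERSON
        ]
        if any(compatible(m) for m in window) or start == 0:
            return window
        start -= 1  # enlarge the window by one sentence


# Case 1: the window has to grow until a compatible markable appears.
talk = [["the company"], ["we"], ["I"], ["you"], ["us"], ["me"], ["we"]]
grown = extract_candidates(talk, 6, lambda m: m == "the company")

# Case 2: incompatible markables are kept once the window is settled.
single = [["the women", "the car"]]
unfiltered = extract_candidates(single, 0, lambda m: m == "the women")
print(grown, unfiltered)
```

Keeping incompatible markables in the candidate set leaves the decision to the anaphora resolution subnetwork, which is the point of removing the filter in the text above.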
The relaxed extraction increases the accuracy to 0.701 for TED and 0.602 for the news commentaries, while the performance for elles improves to F-scores of 0.531 (TED; P 0.690, R 0.432) and 0.304 (News commentaries; P 0.444, R 0.231), respectively. Note that these and all the following results are not directly comparable to the ME baseline results in Table 2, since they include modifications and improvements to the training data extraction procedure that might possibly lead to benefits in the ME setting as well.

Table 4: Neural network classifier with latent anaphora resolution

TED (Accuracy: 0.696)
         P      R      F
ce       0.618  0.722  0.666
elle     0.754  0.548  0.635
elles    0.737  0.340  0.465
il       0.718  0.629  0.670
ils      0.652  0.916  0.761
OTHER    0.741  0.682  0.711

News commentary (Accuracy: 0.597)
         P      R      F
ce       0.419  0.368  0.392
elle     0.547  0.460  0.500
elles    0.539  0.135  0.215
il       0.623  0.719  0.667
ils      0.596  0.783  0.677
OTHER    0.614  0.544  0.577

Table 5: Final classifier results

TED (Accuracy: 0.713)
         P      R      F
ce       0.611  0.723  0.662
elle     0.749  0.596  0.664
elles    0.602  0.616  0.609
il       0.733  0.638  0.682
ils      0.710  0.884  0.788
OTHER    0.760  0.704  0.731

News commentary (Accuracy: 0.626)
         P      R      F
ce       0.492  0.324  0.391
elle     0.526  0.439  0.478
elles    0.547  0.558  0.552
il       0.599  0.757  0.669
ils      0.671  0.878  0.761
OTHER    0.681  0.526  0.594

7.2 Adding Lexicon Knowledge

In order to make it easier for the classifier to identify the gender and number properties of infrequent words, we extend the word vectors with features indicating possible morphological features for each word. In early experiments with ME classifiers, we found that our attempts to do proper gender and number tagging in French text did not improve classification performance noticeably, presumably because the annotation was too noisy. In more recent experiments, we just add features indicating all possible morphological interpretations of each word, rather than trying to disambiguate them.
To do this, we look up the morphological annotations of the French words in the Lefff dictionary (Sagot et al., 2006) and introduce a set of new binary features to indicate whether a particular reading of a word occurs in that dictionary. These features are then added to the one-hot representation of the antecedent words. Doing so improves the classifier accuracy to 0.711 (TED) and 0.604 (News commentaries), while the F-scores for elles reach 0.589 (TED; P 0.649, R 0.539) and 0.500 (News commentaries; P 0.545, R 0.462), respectively.

7.3 More Anaphoric Link Features

Even though the modified antecedent candidate extraction with its larger context window and without the morphological filter results in better performance on both test sets, additional error analysis reveals that the classifier has greater problems identifying the correct markable in this setting. One reason for this may be that the baseline anaphoric link feature set described above (Section 6) only includes two very rough binary distance features which indicate whether or not the anaphora and the antecedent candidate occur in the same or in immediately adjacent sentences. With the larger context window, this may be too unspecific. In our final experiment, we therefore enable some additional features which are available in BART, but disabled in the baseline system:

- Distance in number of markables
- Distance in number of sentences
- Sentence distance, log-transformed
- Distance in number of words
- Part of speech of head word

Most of these encode the distance between the anaphora and the antecedent candidate in more precise ways. Complete results for this final system are presented in Table 5. Including these additional features leads to another slight increase in accuracy for both corpora, with similar or increased classifier F-scores for most classes except elle in the news commentary condition.
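As a rough illustration of the anaphoric link features listed above, the sketch below computes them for one pronoun/candidate pair. It does not reproduce BART's feature extractors: the function name `link_features`, the dictionary keys and the choice of log(1 + d) as the log transform are all assumptions made for this example, as is the reading of the two baseline features as "same sentence" and "adjacent sentence" indicators.

```python
import math


def link_features(anaphor, candidate):
    """Distance features for one (anaphor, antecedent candidate) pair.

    Both arguments are plain dicts holding hypothetical position indices:
    'sentence', 'word' and 'markable'; the candidate also carries
    'head_pos', the part-of-speech tag of its head word.
    """
    sentence_distance = anaphor["sentence"] - candidate["sentence"]
    return {
        # Coarse binary features, as assumed for the baseline system.
        "same_sentence": sentence_distance == 0,
        "adjacent_sentence": sentence_distance == 1,
        # Finer-grained features enabled in the final system.
        "markable_distance": anaphor["markable"] - candidate["markable"],
        "sentence_distance": sentence_distance,
        # Assumed transform; the text only says "log-transformed".
        "log_sentence_distance": math.log(1 + sentence_distance),
        "word_distance": anaphor["word"] - candidate["word"],
        "head_pos": candidate["head_pos"],
    }


features = link_features(
    {"sentence": 7, "word": 120, "markable": 30},
    {"sentence": 4, "word": 85, "markable": 22, "head_pos": "NNS"},
)
```

With a five-sentence window, both binary baseline features are false for a candidate three sentences back, so only the finer-grained distances distinguish it from candidates that are further away.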
In particular, we should like to point out the performance of our benchmark classifier for elles, which suffered from extremely low recall in the first classifiers and approaches the performance of the other classes, with nearly balanced precision and recall, in this final system. Since elles is a low-frequency class and cannot be reliably predicted using source context alone, we interpret this as evidence that our final neural network classifier has incorporated some relevant knowledge about pronominal anaphora that the baseline ME classifier and earlier versions of our network have no access to. This is particularly remarkable because no data manually annotated for coreference was used for training.

8 Related work

Even though it was recognised years ago that the information contained in parallel corpora may provide valuable information for the improvement of anaphora resolution systems, there have not been many attempts to cash in on this insight. Mitkov and Barbu (2003) exploit parallel data in English and French to improve pronominal anaphora resolution by combining anaphora resolvers for the individual languages with handwritten rules to resolve conflicts between the output of the language-specific resolvers. Veselovská et al. (2012) apply a similar strategy to English-Czech data to resolve different uses of the pronoun it. Other work has used word alignments to project coreference annotations from one language to another with a view to training anaphora resolvers in the target language (Postolache et al., 2006; de Souza and Orăsan, 2011). Rahman and Ng (2012) instead use machine translation to translate their test data into a language for which they have an anaphora resolver and then project the annotations back to the original language. Completely unsupervised monolingual anaphora resolution has been approached using, e.g., Markov logic (Poon and Domingos, 2008) and the Expectation-Maximisation algorithm (Cherry and Bergsma, 2005; Charniak and Elsner, 2009). To the best of our knowledge, the direct application of machine learning techniques to parallel data in a task related to anaphora resolution is novel in our work.

Neural networks and deep learning techniques have recently gained some popularity in natural language processing. They have been applied to tasks such as language modelling (Bengio et al., 2003; Schwenk, 2007), translation modelling in statistical machine translation (Le et al., 2012), but also part-of-speech tagging, chunking, named entity recognition and semantic role labelling (Collobert et al., 2011). In tasks related to anaphora resolution, standard feedforward neural networks have been tested as a classifier in an anaphora resolution system (Stuckardt, 2007), but the network design presented in our work is novel.

9 Conclusion

In this paper, we have introduced cross-lingual pronoun prediction as an independent natural language processing task. Even though it is not an end-to-end task, pronoun prediction is interesting for several reasons. It is related to the problem of pronoun translation in SMT, a currently unsolved problem that has been addressed in a number of recent research publications (Le Nagard and Koehn, 2010; Hardmeier and Federico, 2010; Guillou, 2012) without reaching a major breakthrough. In this work, we have shown that pronoun prediction can be effectively modelled in a neural network architecture with relatively simple features. More importantly, we have demonstrated that the task can be exploited to train a classifier with a latent representation of anaphoric links. With parallel text as its only supervision this classifier achieves a level of performance that is similar to, if not better than, that of a classifier using a regular anaphora resolution system trained with manually annotated data.
References

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155.
Samuel Broscheit, Massimo Poesio, Simone Paolo Ponzetto, Kepa Joseba Rodriguez, Lorenza Romano, Olga Uryupina, Yannick Versley, and Roberto Zanoli. 2010. BART: A multilingual anaphora resolution system. In Proceedings of the 5th International Workshop on Semantic Evaluations (SemEval-2010), Uppsala, Sweden, 15–16 July 2010.
Mauro Cettolo, Christian Girardi, and Marcello Federico. 2012. WIT3: Web inventory of transcribed and translated talks. In Proceedings of the 16th Conference of the European Association for Machine Translation (EAMT), pages 261–268, Trento, Italy.
Eugene Charniak and Micha Elsner. 2009. EM works for pronoun anaphora resolution. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pages 148–156, Athens, Greece.
Colin Cherry and Shane Bergsma. 2005. An Expectation Maximization approach to pronoun resolution. In Proceedings of the Ninth Conference on Computational Natural Language Learning (CoNLL-2005), pages 88–95, Ann Arbor, Michigan.
Michael Collins. 1999. Head-Driven Statistical Models for Natural Language Parsing. Ph.D. thesis, University of Pennsylvania.
Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2461–2505.
José de Souza and Constantin Orăsan. 2011. Can projected chains in parallel corpora help coreference resolution? In Iris Hendrickx, Sobha Lalitha Devi, António Branco, and Ruslan Mitkov, editors, Anaphora Processing and Applications, volume 7099 of Lecture Notes in Computer Science, pages 59–69. Springer, Berlin.
Liane Guillou. 2012. Improving pronoun translation for statistical machine translation. In Proceedings of the Student Research Workshop at the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 1–10, Avignon, France.
Christian Hardmeier and Marcello Federico. 2010. Modelling pronominal anaphora in statistical machine translation. In Proceedings of the seventh International Workshop on Spoken Language Translation (IWSLT), pages 283–289, Paris, France.
Christian Hardmeier. 2012. Discourse in statistical machine translation: A survey and a case study. Discours, 11.
Dan Klein and Christopher D. Manning. 2003. Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 423–430, Sapporo, Japan.
Hai-Son Le, Alexandre Allauzen, and François Yvon. 2012. Continuous space translation models with neural networks. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 39–48, Montréal, Canada.
Ronan Le Nagard and Philipp Koehn. 2010. Aiding pronoun translation with co-reference resolution. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, pages 252–261, Uppsala, Sweden.
Ruslan Mitkov and Catalina Barbu. 2003. Using bilingual corpora to improve pronoun resolution. Languages in Contrast, 4(2):201–211.
Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29:19–51.
Hoifung Poon and Pedro Domingos. 2008. Joint unsupervised coreference resolution with Markov Logic. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 650–659, Honolulu, Hawaii.
Oana Postolache, Dan Cristea, and Constantin Orăsan. 2006. Transferring coreference chains through word alignment. In Proceedings of the 5th Conference on International Language Resources and Evaluation (LREC-2006), pages 889–892, Genoa.
Altaf Rahman and Vincent Ng. 2012. Translation-based projection for multilingual coreference resolution. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 720–730, Montréal, Canada.
Benoît Sagot, Lionel Clément, Éric Villemonte de La Clergerie, and Pierre Boullier. 2006. The Lefff 2 syntactic lexicon for French: architecture, acquisition, use. In Proceedings of the 5th Conference on International Language Resources and Evaluation (LREC-2006), pages 1348–1351, Genoa.
Holger Schwenk. 2007. Continuous space language models. Computer Speech and Language, 21(3):492–518.
Wee Meng Soon, Hwee Tou Ng, and Daniel Chung Yong Lim. 2001. A machine learning approach to coreference resolution of noun phrases. Computational Linguistics, 27(4):521–544.
Roland Stuckardt. 2007. Applying backpropagation networks to anaphor resolution. In António Branco, editor, Anaphora: Analysis, Algorithms and Applications. 6th Discourse Anaphora and Anaphor Resolution Colloquium, DAARC 2007, number 4410 in Lecture Notes in Artificial Intelligence, pages 107–124, Berlin.
Kateřina Veselovská, Ngụy Giang Linh, and Michal Novák. 2012. Using Czech-English parallel corpora in automatic identification of it. In Proceedings of the 5th Workshop on Building and Using Comparable Corpora, pages 112–120, Istanbul, Turkey.

5 0.51419532 65 emnlp-2013-Document Summarization via Guided Sentence Compression

Author: Chen Li ; Fei Liu ; Fuliang Weng ; Yang Liu

Abstract: Joint compression and summarization has been used recently to generate high quality summaries. However, such word-based joint optimization is computationally expensive. In this paper we adopt the ‘sentence compression + sentence selection’ pipeline approach for compressive summarization, but propose to perform summary guided compression, rather than generic sentence-based compression. To create an annotated corpus, the human annotators were asked to compress sentences while explicitly given the important summary words in the sentences. Using this corpus, we train a supervised sentence compression model using a set of word-, syntax-, and document-level features. During summarization, we use multiple compressed sentences in the integer linear programming framework to select salient summary sentences. Our results on the TAC 2008 and 2011 summarization data sets show that by incorporating the guided sentence compression model, our summarization system can yield significant performance gain as compared to the state-of-the-art.

6 0.509507 80 emnlp-2013-Exploiting Zero Pronouns to Improve Chinese Coreference Resolution

7 0.48529154 194 emnlp-2013-Unsupervised Relation Extraction with General Domain Knowledge

8 0.47910434 45 emnlp-2013-Chinese Zero Pronoun Resolution: Some Recent Advances

9 0.47757596 193 emnlp-2013-Unsupervised Induction of Cross-Lingual Semantic Relations

10 0.47415027 179 emnlp-2013-Summarizing Complex Events: a Cross-Modal Solution of Storylines Extraction and Reconstruction

11 0.47378632 118 emnlp-2013-Learning Biological Processes with Global Constraints

12 0.46927977 68 emnlp-2013-Effectiveness and Efficiency of Open Relation Extraction

13 0.46046203 113 emnlp-2013-Joint Language and Translation Modeling with Recurrent Neural Networks

14 0.45795995 56 emnlp-2013-Deep Learning for Chinese Word Segmentation and POS Tagging

15 0.45613733 48 emnlp-2013-Collective Personal Profile Summarization with Social Networks

16 0.4558042 59 emnlp-2013-Deriving Adjectival Scales from Continuous Space Word Representations

17 0.44384614 156 emnlp-2013-Recurrent Continuous Translation Models

18 0.44303754 36 emnlp-2013-Automatically Determining a Proper Length for Multi-Document Summarization: A Bayesian Nonparametric Approach

19 0.43424588 160 emnlp-2013-Relational Inference for Wikification

20 0.4281418 77 emnlp-2013-Exploiting Domain Knowledge in Aspect Extraction