emnlp emnlp2013 emnlp2013-93 knowledge-graph by maker-knowledge-mining

93 emnlp-2013-Harvesting Parallel News Streams to Generate Paraphrases of Event Relations

Source: pdf

Author: Congle Zhang ; Daniel S. Weld

Abstract: The distributional hypothesis, which states that words that occur in similar contexts tend to have similar meanings, has inspired several Web mining algorithms for paraphrasing semantically equivalent phrases. Unfortunately, these methods have several drawbacks, such as confusing synonyms with antonyms and causes with effects. This paper introduces three Temporal Correspondence Heuristics that characterize regularities in parallel news streams, and shows how they may be used to generate high precision paraphrases for event relations. We encode the heuristics in a probabilistic graphical model to create the NEWSSPIKE algorithm for mining news streams. We present experiments demonstrating that NEWSSPIKE significantly outperforms several competitive baselines. In order to spur further research, we provide a large annotated corpus of timestamped news articles as well as the paraphrases produced by NEWSSPIKE.

Reference: text

Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract The distributional hypothesis, which states that words that occur in similar contexts tend to have similar meanings, has inspired several Web mining algorithms for paraphrasing semantically equivalent phrases. [sent-4, score-0.327]

2 This paper introduces three Temporal Correspondence Heuristics that characterize regularities in parallel news streams, and shows how they may be used to generate high precision paraphrases for event relations. [sent-6, score-0.705]

3 We encode the heuristics in a probabilistic graphical model to create the NEWSSPIKE algorithm for mining news streams. [sent-7, score-0.337]

4 In order to spur further research, we provide a large annotated corpus of timestamped news articles as well as the paraphrases produced by NEWSSPIKE. [sent-9, score-0.476]

5 1 Introduction Paraphrasing, the task of finding sets of semantically equivalent surface forms, is crucial to many natural language processing applications, including relation extraction (Bhagat and Ravichandran, 2008), question answering (Fader et al. [sent-10, score-0.28]

6 While the benefits of paraphrasing have been demonstrated, creating a large-scale corpus of high precision paraphrases remains a challenge especially for event relations. [sent-14, score-0.612]

7 For example, DIRT (Lin and Pantel, 2001) and Resolver (Yates and Etzioni, 2009) identify synonymous relation phrases by the distributions of their arguments. [sent-16, score-0.325]

8 It remains unclear how to accurately paraphrase less frequent relations with the distributional hypothesis. [sent-22, score-0.297]

9 This peculiar property allows the use of the temporal assumption, which assumes that phrases in articles published at the same time tend to have similar meanings. [sent-25, score-0.362]

10 (2003) identify pairs of sentential paraphrases in similar articles that have appeared in the same period of time. [sent-28, score-0.338]

11 While these approaches use temporal information as a coarse filter in the data generation stage, they still largely rely on text metrics in the prediction stage. [sent-29, score-0.266]

12 This not only reduces precision, but also limits the discovery of paraphrases with dissimilar sur1http : / / demo . [sent-30, score-0.267]

13 The goal of our research is to develop a technique to generate paraphrases for large numbers of event relations with high precision, using only minimal human effort. [sent-36, score-0.585]

14 The key to our approach is a joint cluster model using the temporal attributes of news streams, which allows us to identify semantic equivalence of event relation phrases with greater precision. [sent-37, score-0.955]

15 In summary, this paper makes the following contributions: • We formulate a set of three temporal correspondence heuristics that characterize regularities over parallel news streams. [sent-38, score-0.523]

16 • We present a series of detailed experiments demonstrating that NEWSSPIKE outperforms several competitive baselines, and show through ablation tests how each of the temporal heuristics affects performance. [sent-41, score-0.385]

17 5M time-stamped news articles, collected over a period of about 50 days from hundreds of news sources. [sent-43, score-0.348]

18 System Overview The main goal of this work is to generate high precision paraphrases for relation phrases. [sent-44, score-0.483]

19 News streams are a promising resource, since articles from different sources tend to use semantically equivalent phrases to describe the same daily events. [sent-45, score-0.381]

20 From these we can conclude that the following relation phrases are semantically similar: {step down from, resign from, cut ties with}. [sent-47, score-0.314]

21 Next, an extracted event candidate (EEC) is obtained after grouping daily extractions by argument pairs. [sent-55, score-0.282]

22 Temporal features and constraints are developed based on our temporal correspondence heuristics and encoded into a joint inference model. [sent-56, score-0.481]

23 The model finally creates the paraphrase clusters by predicting the relation phrases that describe the EEC. [sent-57, score-0.511]

24 , 2011) on the news streams to obtain a set of (a1, r, a2, t) tuples, where the ai are the arguments, r is a relation phrase, and t is the time-stamp of the corresponding news article. [sent-63, score-0.684]

25 When (a1, a2, t) suggests a real-world event, the relation r of (a1, r, a2, t) is likely to describe that event (e. [sent-64, score-0.354]

26 We call every (a1, a2, t) an extracted event candidate (EEC), and every relation describing the event an event-mention. [sent-67, score-0.526]

27 All the event-mentions in the EEC-set may be semantically equivalent and are hence candidates for a good paraphrase cluster. [sent-76, score-0.248]
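
The grouping step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `group_into_eec_sets`, the date strings, and the toy extractions are assumptions made for the example; only the (a1, r, a2, t) tuple shape and the grouping key (a1, a2, t) come from the text.

```python
from collections import defaultdict


def group_into_eec_sets(extractions):
    """Group (a1, r, a2, t) tuples by their event candidate (a1, a2, t).

    Returns a dict mapping each (a1, a2, t) key to the set of relation
    phrases (event-mentions) observed for it.
    """
    eec_sets = defaultdict(set)
    for a1, r, a2, t in extractions:
        eec_sets[(a1, a2, t)].add(r)
    return dict(eec_sets)


# Hypothetical extractions; the dates are illustrative only.
extractions = [
    ("Armstrong", "step down from", "Livestrong", "day-1"),
    ("Armstrong", "resign from", "Livestrong", "day-1"),
    ("Armstrong", "cut ties with", "Livestrong", "day-1"),
    ("Armstrong", "be the founder of", "Livestrong", "day-2"),
]
eec_sets = group_into_eec_sets(extractions)
```

Each value in `eec_sets` is only a candidate cluster; deciding which of its phrases actually describe the event is the prediction problem the model addresses.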

28 Thus, the paraphrasing problem becomes a prediction problem: for each relation ri in the EEC-set, does it or does it not describe the hypothesized event? [sent-77, score-0.321]

29 The next section proposes a set of temporal correspondence heuristics that partially characterize semantically equivalent EEC-sets. [sent-79, score-0.504]

30 Then, in Section 4, we present a joint inference model designed to use these heuristics to solve the prediction problem and to generate paraphrase clusters. [sent-80, score-0.304]

31 3 Temporal Correspondence Heuristics In this section, we propose a set of temporal heuristics that are useful to generate paraphrases at high precision. [sent-81, score-0.579]

32 For example, the two sentences “Armstrong was the chairman of Livestrong” and “Armstrong steps down from Livestrong” have past and present tense respectively, which suggests that the relation phrases are less likely to describe the same event and are thus not semantically equivalent. [sent-84, score-0.634]

33 Since two event mentions from such a mixture are much less likely to denote the same event or relation, we wish to distinguish them from the better (semantically homogeneous) EECs like the (Armstrong, Livestrong) example. [sent-92, score-0.344]

34 Therefore we can judge whether an entity pair is good for paraphrasing by looking at the history of how frequently the entity pair is mentioned in the news streams, that is, the time series of that entity pair. [sent-95, score-0.446]

35 If an entity or an entity pair appears significantly more frequently in one day's news than in recent history, the corresponding event candidates are likely to be good for generating paraphrases. [sent-98, score-0.44]

36 The temporal burstiness heuristic implies that a good EEC (a1, a2, t) tends to have a spike in the time series of its entities ai, or argument pair (a1, a2), on day t. [sent-99, score-0.616]
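
A spike test of this kind could be sketched as below. The source does not say how a spike is detected, so the rule used here (today's count exceeds the recent mean by a multiple of the recent standard deviation) and the names `has_spike`, `window`, and `threshold` are assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def has_spike(daily_counts, day_index, window=7, threshold=2.0):
    """Return True if the count on `day_index` stands out from recent history.

    Assumed rule (not from the paper): the count must exceed the mean of the
    previous `window` days by `threshold` standard deviations, with the
    deviation floored at 1.0 so that flat histories do not trigger on noise.
    """
    history = daily_counts[max(0, day_index - window):day_index]
    if not history:
        return False  # no history to compare against
    mu = mean(history)
    sigma = max(pstdev(history), 1.0)
    return daily_counts[day_index] > mu + threshold * sigma


# Hypothetical daily mention counts for one entity pair.
counts = [1, 0, 2, 1, 1, 0, 1, 15]
spike_today = has_spike(counts, day_index=7)
```

In the model, evidence like this enters as spike features rather than as a hard filter, so a threshold rule is only one possible reading.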

37 However, even if we have selected a good EEC for paraphrasing, it is likely that it contains a few relation phrases that are related to (but not synonymous with) the other relations included in the EEC. [sent-100, score-0.382]

38 ” and so both “steps down from” and “is the founder of” relation phrases would be part of the same EEC-set. [sent-103, score-0.298]

39 The one event-mention per discourse heuristic is proposed in order to gain precision at the expense of recall: the heuristic directs an algorithm to choose, from a news story, the single “best” relation phrase connecting a pair of entities. [sent-107, score-0.634]

40 4 Exploiting the Temporal Heuristics In this section we propose several models to capture the temporal correspondence heuristics, and discuss their pros and cons. [sent-111, score-0.291]

41 This baseline model captures most of the temporal functionality heuristic, except for the tense requirement. [sent-118, score-0.381]

42 2 Pairwise Model The temporal functionality heuristic suggests we exploit the tenses of the relations in an EEC-set; while the temporal burstiness heuristic suggests we exploit the time series of its arguments. [sent-126, score-0.909]

43 The tenses of the relations and time series of the arguments are encoded as features, which we call tense features and spike features respectively. [sent-131, score-0.364]

44 An example tense feature is whether one relation is past tense while the other relation is present tense; an example spike feature is the covariance of the time series. [sent-132, score-0.66]
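
The two example features can be sketched in a few lines. This is a hedged illustration: the helper names `tense_mismatch` and `covariance`, the tense labels "past" and "present", and the choice of sample covariance are assumptions, since the source names the features but not their exact definitions.

```python
def tense_mismatch(tense_a, tense_b):
    """Tense feature: 1 if one relation is past tense and the other present."""
    return int({tense_a, tense_b} == {"past", "present"})


def covariance(xs, ys):
    """Spike feature: sample covariance of two equally long time series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (len(xs) - 1)


# Hypothetical inputs: tenses of two relation phrases and two daily count series.
tense_feature = tense_mismatch("past", "present")
spike_feature = covariance([1, 2, 3], [2, 4, 6])
```

A large positive covariance means the two series rise together, which is the kind of co-occurring burst the spike features are meant to capture.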

45 The pairwise model can be considered similar to paraphrasing techniques which examine two sentences and determine whether they are semantically equivalent (Dolan and Brockett, 2005; Socher et al. [sent-133, score-0.323]

46 Unfortunately, these techniques are often based purely on text metrics and do not consider any temporal attributes. [sent-135, score-0.266]

47 A common approach to overcome the drawbacks of the pairwise model and to combine heuristics together is to introduce a joint cluster model, in which heuristics are encoded as features and constraints. [sent-145, score-0.507]

48 rm}), we introduce one event variable and m relation variables, all boolean valued. [sent-158, score-0.354]

49 The event variable indicates whether (a1, a2, t) is a good event for paraphrasing. [sent-159, score-0.344]

50 It is designed in accordance with the temporal burstiness heuristic: for the EEC (Barack Obama, the White House, Oct 17), Z should be assigned the value 0. [sent-160, score-0.338]

51 The relation variable Yr indicates whether relation r describes the EEC (a1, a2 , t) or not (i. [sent-161, score-0.364]

52 The set of all event-mentions with Yr = 1 defines a paraphrase cluster, containing relation phrases. [sent-164, score-0.332]

53 For example, the assignments Y_{step down} = Y_{resign from} = 1 produce a paraphrase cluster {step down, resign from}. [sent-165, score-0.286]

54 2 Factors and the Joint Distribution In this section, we introduce a conditional probability model defining a joint distribution over all of the event and relation variables. [sent-168, score-0.393]

55 Our model contains event factors, relation factors and joint factors. [sent-170, score-0.464]

56 The event factor ΦZ is a log-linear function with spike features, used to distinguish good events. [sent-171, score-0.268]

57 A relation factor can also be defined for a pair of relation variables (e. [sent-176, score-0.404]

58 Φ2Y in Figure 2) with features capturing the pairwise evidence for paraphrasing, such as if two relation phrases have the same tense. [sent-178, score-0.339]

59 The joint factors Φjoint are defined to apply constraints implied by the temporal heuristics. [sent-179, score-0.343]

60 The joint distribution is: 3Relation phrases in clausal complement are less useful for paraphrasing because they often do not describe a fact. [sent-186, score-0.287]

61 $p(Z = z, Y = y \mid x; \Theta) \overset{\mathrm{def}}{=} \frac{1}{Z_x}\, \Phi^{Z}(z, x) \prod_{d} \Phi^{\mathrm{joint}}(z, y_d, x) \prod_{i,j} \Phi^{Y}(y_i, y_j, x)$ where $y_d$ indicates the subset of relation variables from a particular article d, and the parameter vector Θ is the weight vector of the features in ΦZ and ΦY, which are log-linear functions; i. [sent-188, score-0.286]

62 The joint factors Φjoint are used to apply the temporal burstiness heuristic and the one event-mention per discourse heuristic. [sent-191, score-0.577]

63 The objective function is the sum of logs of the event and relation factors ΦZ and ΦY. [sent-201, score-0.425]

64 The temporal burstiness heuristic of Φjoint is encoded as a linear inequality constraint z ≥ yi; the one event-mention per discourse heuristic of Φjoint is encoded as the constraint $\sum_{y_i \in y_d} y_i \le 1$. [sent-202, score-0.529]
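
To make the two constraints concrete, the sketch below enumerates all assignments for a tiny EEC-set and keeps the best feasible one. It is an illustration under stated simplifications, not the paper's inference procedure: it uses exhaustive search instead of an ILP solver, scores each variable with a single made-up log-score, and omits pairwise relation factors. The function name `map_assignment` and all scores are hypothetical.

```python
from itertools import product


def map_assignment(event_score, relation_scores, articles):
    """Exhaustively find the best feasible (z, y) assignment.

    event_score: log-score gained by setting z = 1 (0 for z = 0).
    relation_scores: dict relation -> log-score gained by setting y_r = 1.
    articles: list of sets of relations that occur in the same article.
    Constraints: z >= y_i for all i, and at most one y_i = 1 per article.
    """
    relations = sorted(relation_scores)
    # The all-zero assignment is always feasible and scores 0.
    best = (0.0, 0, {r: 0 for r in relations})
    for z in (0, 1):
        for values in product((0, 1), repeat=len(relations)):
            y = dict(zip(relations, values))
            if any(v > z for v in values):
                continue  # violates z >= y_i
            if any(sum(y[r] for r in article) > 1 for article in articles):
                continue  # violates one event-mention per discourse
            score = z * event_score + sum(relation_scores[r] * y[r] for r in relations)
            if score > best[0]:
                best = (score, z, y)
    return best


# Hypothetical scores and article membership.
scores = {"step down from": 2.0, "resign from": 1.5, "be the founder of": 0.5}
articles = [{"step down from", "be the founder of"}, {"resign from"}]
best_score, z, y = map_assignment(1.0, scores, articles)
```

Here "be the founder of" is excluded because it shares an article with the higher-scoring "step down from". With a strongly negative event score, z = 0 forces every relation variable to 0, so no cluster is produced.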

65 3 respectively: First, does the NEWSSPIKE algorithm effectively exploit the proposed heuristics and outperform other approaches which also use news streams? [sent-219, score-0.289]

66 Secondly, do the proposed temporal heuristics paraphrase relations with greater precision than the distributional hypothesis? [sent-220, score-0.715]

67 1 Experimental Setup Since we were unable to find any elaborate timestamped, parallel, news corpus, we collected data using the following procedure: • Collect RSS news seeds, which contain the title, time-stamp, and abstract of the news items. [sent-222, score-0.522]

68 • Use these titles to query the Bing news search engine API and collect additional time-stamped news articles. [sent-223, score-0.348]

69 Text and semantic features are encoded using the relation factors of section 4. [sent-232, score-0.289]

70 Human annotators created gold paraphrase clusters for 500 EEC-sets; note that some EEC-sets yield no gold cluster, since a cluster requires at least two synonymous phrases. [sent-240, score-0.33]

71 Two annotators were shown a set of candidate relation phrases in context and asked to select a subset of these that described a shared event (if one existed). [sent-241, score-0.425]

72 Since counting these trivial paraphrases tends to exaggerate the performance of a system, we also report precision and recall on diverse clusters i. [sent-247, score-0.457]

73 , those whose relation phrases all have different head verbs. [sent-249, score-0.253]
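
The diversity criterion and a per-cluster precision/recall computation might be sketched as follows. This is an assumption-laden illustration: the source does not give the exact scoring procedure, and treating the first token of a phrase as its head verb is a simplification introduced here (a real system would use a parser or lemmatizer). The example phrases "go into" and "go to" are made up.

```python
def precision_recall(predicted, gold):
    """Precision and recall of a predicted phrase cluster against a gold cluster."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def is_diverse(cluster, head_verb=lambda phrase: phrase.split()[0]):
    """True if all phrases in the cluster have different head verbs.

    The default `head_verb` (first token) is a hypothetical stand-in.
    """
    heads = [head_verb(phrase) for phrase in cluster]
    return len(set(heads)) == len(heads)


predicted = {"step down from", "resign from", "be the founder of"}
gold = {"step down from", "resign from", "cut ties with"}
p, r = precision_recall(predicted, gold)
```

Restricting evaluation to clusters for which `is_diverse` holds removes clusters made only of variants of one verb, which is the stated motivation for reporting the diverse condition separately.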

74 Given sentential paraphrases, aligning relation phrases is natural, because OpenIE has already identified the relation phrases. [sent-266, score-0.484]

75 This demonstrates that the EEC-sets generated from the news streams are a promising resource for paraphrasing. [sent-272, score-0.328]

76 This is probably due to the fact that Socher’s method is based purely on text metrics and does not consider any temporal attributes. [sent-274, score-0.266]

77 Taking into account the features used by NEWSSPIKE, Pairwise significantly improves the precision, which demonstrates the power of our temporal correspondence heuristics. [sent-275, score-0.291]

78 Our joint cluster model, NEWSSPIKE, which considers both temporal features and constraints, gets the best performance in both conditions. [sent-276, score-0.356]

79 We conducted ablation testing to evaluate how spike features and tense features, which are particularly relevant to the temporal aspects of news streams, can improve performance. [sent-277, score-0.603]

80 We ran NEWSSPIKE over all EEC-sets except for the development set and compared to the following systems: Resolver: Resolver (Yates and Etzioni, 2009) uses a set of extraction tuples in the form of (a1, r, a2) as the input and creates a set of relation clusters as the output paraphrases. [sent-285, score-0.334]

81 We ran Resolver on the union of this and our standard test set, but report performance only on clusters whose relations were seen in our news stream. [sent-293, score-0.339]

82 CosineNYT: As for ResolverNYT, we ran CosineNYT with an extra 60 million extractions and reported the performance on relations seen in our news stream. [sent-303, score-0.305]

83 It is possible that argument pairs from news streams spanning 20 years sometimes provide incorrect evidence for paraphrasing. [sent-316, score-0.364]

84 For example, there were extractions like (the Rangers, be third in, the NHL) and (the Rangers, be fourth in, the NHL) from news in 2007 and 2003 respectively. [sent-317, score-0.248]
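
The failure mode in this example can be reproduced with a generic distributional similarity over argument pairs. The sketch below is not the CosineNYT or Resolver implementation; it is a plain cosine similarity over argument-pair counts with the time-stamp ignored, and the names `argument_vectors` and `cosine` are assumptions.

```python
from collections import Counter
from math import sqrt


def argument_vectors(extractions):
    """Map each relation phrase to a Counter over its (a1, a2) argument pairs.

    The time-stamp is deliberately discarded, as a purely distributional
    method would do.
    """
    vectors = {}
    for a1, r, a2, _t in extractions:
        vectors.setdefault(r, Counter())[(a1, a2)] += 1
    return vectors


def cosine(c1, c2):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(c1[k] * c2[k] for k in c1.keys() & c2.keys())
    norm1 = sqrt(sum(v * v for v in c1.values()))
    norm2 = sqrt(sum(v * v for v in c2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0


extractions = [
    ("the Rangers", "be third in", "the NHL", 2007),
    ("the Rangers", "be fourth in", "the NHL", 2003),
]
vectors = argument_vectors(extractions)
similarity = cosine(vectors["be third in"], vectors["be fourth in"])
```

Without the time-stamp the two phrases look identical, whereas keying on (a1, a2, t) would place them in different event candidates.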

85 NEWSSPIKE achieves greater precision than even the best results from ResolverNytTop, because NEWSSPIKE successfully captures the temporal heuristics, and does not confuse synonyms with antonyms, or causes with effects. [sent-319, score-0.309]

86 4 Discussion Unlike some domain-specific clustering methods, we tested on all relation phrases extracted by OpenIE on the collected news streams. [sent-322, score-0.427]

87 These high precision clusters can contribute substantially to generating larger paraphrase clusters. [sent-329, score-0.328]

88 While this paper gives promising results, there are still behaviors found in news streams that prove challenging. [sent-332, score-0.328]

89 Many errors are due to the discourse context: the two sentences are synonymous in the given EEC-set, but the relation phrases are not paraphrases in general. [sent-333, score-0.606]

90 6 Related Work The vast majority of paraphrasing work falls into two categories: approaches based on the distributional hypothesis or those exploiting correspondences between parallel corpora (Androutsopoulos and Malakasiotis, 2010; Madnani and Dorr, 2010). [sent-339, score-0.331]

91 Identifying the semantic equivalence of relation phrases is also called relation discovery or unsupervised semantic parsing. [sent-342, score-0.471]

92 Using Parallel Corpora: Comparable and parallel corpora, including news streams and multiple translations of the same story, have been used to generate paraphrases, both sentential (Barzilay and Lee, 2003; Dolan et al. [sent-351, score-0.435]

93 While prior work uses the temporal aspects of news streams as a coarse filter, it largely relies on text metrics, such as context similarity and edit distance, to make predictions and alignments. [sent-359, score-0.604]

94 These metrics are usually insufficient to produce high precision results; moreover they tend to produce paraphrases that are simple lexical variants (e. [sent-360, score-0.334]

95 (2012) also uses temporal information to detect the semantics of entities. [sent-372, score-0.233]

96 7 Conclusion Paraphrasing event relations is crucial to many natural language processing applications, including relation extraction, question answering, summarization, and machine translation. [sent-375, score-0.411]

97 This paper introduces three Temporal Correspondence Heuristics that characterize semantically equivalent phrases in news streams. [sent-377, score-0.343]

98 We present a novel algorithm, NEWSSPIKE, based on a probabilistic graphical model encoding these heuristics, which harvests high-quality paraphrases of event relations. [sent-378, score-0.451]

99 In order to spur future research, we are releasing an annotated corpus of time-stamped news articles and our harvested relation clusters. [sent-381, score-0.455]

100 Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. [sent-441, score-0.382]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('newsspike', 0.45), ('temporal', 0.233), ('paraphrases', 0.231), ('armstrong', 0.211), ('livestrong', 0.208), ('relation', 0.182), ('news', 0.174), ('event', 0.172), ('eec', 0.165), ('streams', 0.154), ('resolver', 0.154), ('paraphrase', 0.15), ('paraphrasing', 0.139), ('heuristics', 0.115), ('clusters', 0.108), ('burstiness', 0.105), ('oct', 0.104), ('tense', 0.1), ('spike', 0.096), ('dolan', 0.092), ('distributional', 0.09), ('resolvernyt', 0.087), ('pairwise', 0.086), ('cluster', 0.084), ('heuristic', 0.079), ('extractions', 0.074), ('synonymous', 0.072), ('phrases', 0.071), ('factors', 0.071), ('precision', 0.07), ('fader', 0.063), ('semantically', 0.061), ('parallel', 0.058), ('correspondence', 0.058), ('articles', 0.058), ('relations', 0.057), ('yr', 0.055), ('barack', 0.055), ('socher', 0.054), ('xi', 0.053), ('barzilay', 0.053), ('cosinenyt', 0.052), ('resign', 0.052), ('resolvernyttop', 0.052), ('rgold', 0.052), ('ygiold', 0.052), ('yates', 0.051), ('discourse', 0.05), ('sentential', 0.049), ('rm', 0.049), ('house', 0.049), ('functionality', 0.048), ('chairman', 0.048), ('diverse', 0.048), ('graphical', 0.048), ('obama', 0.047), ('synonyms', 0.046), ('founder', 0.045), ('openie', 0.045), ('antonyms', 0.044), ('hypothesis', 0.044), ('tuples', 0.044), ('entailment', 0.044), ('similarity', 0.043), ('etzioni', 0.041), ('spur', 0.041), ('yi', 0.04), ('variables', 0.04), ('joint', 0.039), ('white', 0.039), ('tenses', 0.038), ('clausal', 0.038), ('dirt', 0.038), ('series', 0.037), ('oren', 0.037), ('equivalent', 0.037), ('assignment', 0.036), ('discovery', 0.036), ('encoded', 0.036), ('argument', 0.036), ('esp', 0.035), ('nhl', 0.035), ('ofinference', 0.035), ('rangers', 0.035), ('zgiold', 0.035), ('dagan', 0.034), ('zi', 0.033), ('turning', 0.033), ('metrics', 0.033), ('regina', 0.032), ('article', 0.032), ('entity', 0.032), ('yd', 0.032), ('drawbacks', 0.032), ('hoffmann', 0.032), ('anthony', 0.032), ('ilp', 0.032), ('day', 0.03), ('causes', 0.03), 
('timestamped', 0.03)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999994 93 emnlp-2013-Harvesting Parallel News Streams to Generate Paraphrases of Event Relations

Author: Congle Zhang ; Daniel S. Weld

2 0.19488087 76 emnlp-2013-Exploiting Discourse Analysis for Article-Wide Temporal Classification

Author: Jun-Ping Ng ; Min-Yen Kan ; Ziheng Lin ; Wei Feng ; Bin Chen ; Jian Su ; Chew Lim Tan

Abstract: In this paper we classify the temporal relations between pairs of events on an article-wide basis. This is in contrast to much of the existing literature which focuses on just event pairs which are found within the same or adjacent sentences. To achieve this, we leverage on discourse analysis as we believe that it provides more useful semantic information than typical lexico-syntactic features. We propose the use of several discourse analysis frameworks, including 1) Rhetorical Structure Theory (RST), 2) PDTB-styled discourse relations, and 3) topical text segmentation. We explain how features derived from these frameworks can be effectively used with support vector machines (SVM) paired with convolution kernels. Experiments show that our proposal is effective in improving on the state-of-the-art significantly by as much as 16% in terms of F1, even if we only adopt less-than-perfect automatic discourse analyzers and parsers. Making use of more accurate discourse analysis can further boost gains to 35%.

3 0.18475361 41 emnlp-2013-Building Event Threads out of Multiple News Articles

Author: Xavier Tannier ; Veronique Moriceau

Abstract: We present an approach for building multidocument event threads from a large corpus of newswire articles. An event thread is basically a succession of events belonging to the same story. It helps the reader to contextualize the information contained in a single article, by navigating backward or forward in the thread from this article. A specific effort is also made on the detection of reactions to a particular event. In order to build these event threads, we use a cascade of classifiers and other modules, taking advantage of the redundancy of information in the newswire corpus. We also share interesting comments concerning our manual annotation procedure for building a training and testing set.

4 0.18306121 118 emnlp-2013-Learning Biological Processes with Global Constraints

Author: Aju Thalappillil Scaria ; Jonathan Berant ; Mengqiu Wang ; Peter Clark ; Justin Lewis ; Brittany Harding ; Christopher D. Manning

Abstract: Biological processes are complex phenomena involving a series of events that are related to one another through various relationships. Systems that can understand and reason over biological processes would dramatically improve the performance of semantic applications involving inference such as question answering (QA) – specifically “How?” and “Why?” questions. In this paper, we present the task of process extraction, in which events within a process and the relations between the events are automatically extracted from text. We represent processes by graphs whose edges describe a set of temporal, causal and co-reference event-event relations, and characterize the structural properties of these graphs (e.g., the graphs are connected). Then, we present a method for extracting relations between the events, which exploits these structural properties by performing joint inference over the set of extracted relations. On a novel dataset containing 148 descriptions of biological processes (released with this paper), we show significant improvement comparing to baselines that disregard process structure.

5 0.17636541 74 emnlp-2013-Event-Based Time Label Propagation for Automatic Dating of News Articles

Author: Tao Ge ; Baobao Chang ; Sujian Li ; Zhifang Sui

Abstract: Since many applications such as timeline summaries and temporal IR involving temporal analysis rely on document timestamps, the task of automatic dating of documents has been increasingly important. Instead of using feature-based methods as conventional models, our method attempts to date documents in a year level by exploiting relative temporal relations between documents and events, which are very effective for dating documents. Based on this intuition, we proposed an event-based time label propagation model called confidence boosting in which time label information can be propagated between documents and events on a bipartite graph. The experiments show that our event-based propagation model can predict document timestamps in high accuracy and the model combined with a MaxEnt classifier outperforms the state-of-the-art method for this task especially when the size of the training set is small.

6 0.17485996 64 emnlp-2013-Discriminative Improvements to Distributional Sentence Similarity

7 0.14197156 194 emnlp-2013-Unsupervised Relation Extraction with General Domain Knowledge

8 0.11901019 147 emnlp-2013-Optimized Event Storyline Generation based on Mixture-Event-Aspect Model

9 0.11739421 16 emnlp-2013-A Unified Model for Topics, Events and Users on Twitter

10 0.11147878 192 emnlp-2013-Unsupervised Induction of Contingent Event Pairs from Film Scenes

11 0.10664304 197 emnlp-2013-Using Paraphrases and Lexical Semantics to Improve the Accuracy and the Robustness of Supervised Models in Situated Dialogue Systems

12 0.10435805 152 emnlp-2013-Predicting the Presence of Discourse Connectives

13 0.10318698 193 emnlp-2013-Unsupervised Induction of Cross-Lingual Semantic Relations

14 0.09286461 167 emnlp-2013-Semi-Markov Phrase-Based Monolingual Alignment

15 0.086546108 179 emnlp-2013-Summarizing Complex Events: a Cross-Modal Solution of Storylines Extraction and Reconstruction

16 0.086412787 75 emnlp-2013-Event Schema Induction with a Probabilistic Entity-Driven Model

17 0.084796391 68 emnlp-2013-Effectiveness and Efficiency of Open Relation Extraction

18 0.083874151 49 emnlp-2013-Combining Generative and Discriminative Model Scores for Distant Supervision

19 0.081694581 160 emnlp-2013-Relational Inference for Wikification

20 0.071819589 12 emnlp-2013-A Semantically Enhanced Approach to Determine Textual Similarity

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.259), (1, 0.158), (2, -0.011), (3, 0.252), (4, 0.026), (5, -0.052), (6, -0.162), (7, -0.023), (8, -0.093), (9, -0.028), (10, 0.145), (11, 0.044), (12, 0.033), (13, 0.072), (14, 0.012), (15, -0.009), (16, -0.03), (17, 0.067), (18, 0.047), (19, 0.084), (20, -0.028), (21, -0.027), (22, -0.003), (23, -0.074), (24, 0.058), (25, 0.108), (26, -0.061), (27, -0.003), (28, -0.016), (29, 0.129), (30, 0.057), (31, 0.112), (32, 0.032), (33, 0.086), (34, 0.067), (35, -0.051), (36, 0.071), (37, -0.112), (38, 0.034), (39, -0.017), (40, -0.075), (41, -0.194), (42, -0.043), (43, 0.101), (44, 0.128), (45, -0.025), (46, -0.038), (47, -0.052), (48, 0.058), (49, -0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95284891 93 emnlp-2013-Harvesting Parallel News Streams to Generate Paraphrases of Event Relations

Author: Congle Zhang ; Daniel S. Weld

2 0.66455775 41 emnlp-2013-Building Event Threads out of Multiple News Articles

Author: Xavier Tannier ; Veronique Moriceau

3 0.6401279 118 emnlp-2013-Learning Biological Processes with Global Constraints

Author: Aju Thalappillil Scaria ; Jonathan Berant ; Mengqiu Wang ; Peter Clark ; Justin Lewis ; Brittany Harding ; Christopher D. Manning

4 0.61921805 74 emnlp-2013-Event-Based Time Label Propagation for Automatic Dating of News Articles

Author: Tao Ge ; Baobao Chang ; Sujian Li ; Zhifang Sui

5 0.600133 76 emnlp-2013-Exploiting Discourse Analysis for Article-Wide Temporal Classification

Author: Jun-Ping Ng ; Min-Yen Kan ; Ziheng Lin ; Wei Feng ; Bin Chen ; Jian Su ; Chew Lim Tan

6 0.59593308 64 emnlp-2013-Discriminative Improvements to Distributional Sentence Similarity

7 0.55337936 192 emnlp-2013-Unsupervised Induction of Contingent Event Pairs from Film Scenes

8 0.50013804 152 emnlp-2013-Predicting the Presence of Discourse Connectives

9 0.49985763 197 emnlp-2013-Using Paraphrases and Lexical Semantics to Improve the Accuracy and the Robustness of Supervised Models in Situated Dialogue Systems

10 0.48184872 189 emnlp-2013-Two-Stage Method for Large-Scale Acquisition of Contradiction Pattern Pairs using Entailment

11 0.47624311 194 emnlp-2013-Unsupervised Relation Extraction with General Domain Knowledge

12 0.46358627 68 emnlp-2013-Effectiveness and Efficiency of Open Relation Extraction

13 0.44380695 49 emnlp-2013-Combining Generative and Discriminative Model Scores for Distant Supervision

14 0.42436782 147 emnlp-2013-Optimized Event Storyline Generation based on Mixture-Event-Aspect Model

15 0.4064475 12 emnlp-2013-A Semantically Enhanced Approach to Determine Textual Similarity

16 0.39517549 193 emnlp-2013-Unsupervised Induction of Cross-Lingual Semantic Relations

17 0.39206442 137 emnlp-2013-Multi-Relational Latent Semantic Analysis

18 0.39008591 123 emnlp-2013-Learning to Rank Lexical Substitutions

19 0.38387823 165 emnlp-2013-Scaling to Large3 Data: An Efficient and Effective Method to Compute Distributional Thesauri

20 0.38357082 179 emnlp-2013-Summarizing Complex Events: a Cross-Modal Solution of Storylines Extraction and Reconstruction

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.033), (18, 0.039), (22, 0.058), (30, 0.047), (50, 0.012), (51, 0.143), (66, 0.03), (71, 0.026), (75, 0.457), (77, 0.018), (95, 0.013), (96, 0.031)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.91540623 147 emnlp-2013-Optimized Event Storyline Generation based on Mixture-Event-Aspect Model

Author: Lifu Huang ; Lian'en Huang

Abstract: Recently, much research focuses on event storyline generation, which aims to produce a concise, global and temporal event summary from a collection of articles. Generally, each event contains multiple sub-events and the storyline should be composed by the component summaries of all the sub-events. However, different sub-events have different part-whole relationship with the major event, which is important to correspond to users’ interests but seldom considered in previous work. To distinguish different types of sub-events, we propose a mixture-event-aspect model which models different sub-events into local and global aspects. Combining these local/global aspects with summarization requirements together, we utilize an optimization method to generate the component summaries along the timeline. We develop experimental systems on 6 distinctively different datasets. Evaluation and comparison results indicate the effectiveness of our proposed method.

same-paper 2 0.87461621 93 emnlp-2013-Harvesting Parallel News Streams to Generate Paraphrases of Event Relations

Author: Congle Zhang ; Daniel S. Weld

3 0.85001361 31 emnlp-2013-Automatic Feature Engineering for Answer Selection and Extraction

Author: Aliaksei Severyn ; Alessandro Moschitti

Abstract: This paper proposes a framework for automatically engineering features for two important tasks of question answering: answer sentence selection and answer extraction. We represent question and answer sentence pairs with linguistic structures enriched by semantic information, where the latter is produced by automatic classifiers, e.g., question classifier and Named Entity Recognizer. Tree kernels applied to such structures enable a simple way to generate highly discriminative structural features that combine syntactic and semantic information encoded in the input trees. We conduct experiments on a public benchmark from TREC to compare with previous systems for answer sentence selection and answer extraction. The results show that our models greatly improve on the state of the art, e.g., up to 22% on F1 (relative improvement) for answer extraction, while using no additional resources and no manual feature engineering.

4 0.79950535 117 emnlp-2013-Latent Anaphora Resolution for Cross-Lingual Pronoun Prediction

Author: Christian Hardmeier ; Jorg Tiedemann ; Joakim Nivre

Abstract: This paper addresses the task of predicting the correct French translations of third-person subject pronouns in English discourse, a problem that is relevant as a prerequisite for machine translation and that requires anaphora resolution. We present an approach based on neural networks that models anaphoric links as latent variables and show that its performance is competitive with that of a system with separate anaphora resolution while not requiring any coreference-annotated training data. This demonstrates that the information contained in parallel bitexts can successfully be used to acquire knowledge about pronominal anaphora in an unsupervised way.

1 Motivation

When texts are translated from one language into another, the translation reconstructs the meaning or function of the source text with the means of the target language. Generally, this has the effect that the entities occurring in the translation and their mutual relations will display similar patterns as the entities in the source text. In particular, coreference patterns tend to be very similar in translations of a text, and this fact has been exploited with good results to project coreference annotations from one language into another by using word alignments (Postolache et al., 2006; Rahman and Ng, 2012). On the other hand, what is true in general need not be true for all types of linguistic elements. For instance, a substantial percentage of the English third-person subject pronouns he, she, it and they does not get realised as pronouns in French translations (Hardmeier, 2012).
Moreover, it has been recognised by various authors in the statistical machine translation (SMT) community (Le Nagard and Koehn, 2010; Hardmeier and Federico, 2010; Guillou, 2012) that pronoun translation is a difficult problem because, even when a pronoun does get translated as a pronoun, it may require choosing the correct word form based on agreement features that are not easily predictable from the source text. The work presented in this paper investigates the problem of cross-lingual pronoun prediction for English-French. Given an English pronoun and its discourse context as well as a French translation of the same discourse and word alignments between the two languages, we attempt to predict the French word aligned to the English pronoun. As far as we know, this task has not been addressed in the literature before. In our opinion, it is interesting for several reasons. By studying pronoun prediction as a task in its own right, we hope to contribute towards a better understanding of pronoun translation with a long-term view to improving the performance of SMT systems. Moreover, we believe that this task can lead to interesting insights about anaphora resolution in a multi-lingual context. In particular, we show in this paper that the pronoun prediction task makes it possible to model the resolution of pronominal anaphora as a latent variable and opens up a way to solve a task relying on anaphora resolution without using any data annotated for anaphora. This is what we consider the main contribution of our present work. We start by modelling cross-lingual pronoun prediction as an independent machine learning task after doing anaphora resolution in the source language (English) using the BART software (Broscheit et al., 2010).
We show that it is difficult to achieve satisfactory performance with standard maximum entropy classifiers especially for low-frequency pronouns such as the French feminine plural pronoun elles. We propose a neural network classifier that achieves better precision and recall and manages to make reasonable predictions for all pronoun categories in many cases. We then go on to extend our neural network architecture to include anaphoric links as latent variables. We demonstrate that our classifier, now with its own source language anaphora resolver, can be trained successfully with backpropagation. In this setup, we no longer use the machine learning component included in the external coreference resolution system (BART) to predict anaphoric links. Anaphora resolution is done by our neural network classifier and requires only some quantity of word-aligned parallel data for training, completely obviating the need for a coreference-annotated training set.

(Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 380–391, Seattle, Washington, USA, 18-21 October 2013. ©2013 Association for Computational Linguistics)

[Figure 1: Task setup. English: "The latest version released in March is equipped with ... It is sold at ..." French: "La dernière version lancée en mars est dotée de ... • est vendue ..."]

2 Task Setup

The overall setup of the classification task we address in this paper is shown in Figure 1. We are given an English discourse containing a pronoun along with its French translation and word alignments between the two languages, which in our case were computed automatically using a standard SMT pipeline with GIZA++ (Och and Ney, 2003). We focus on the four English third-person subject pronouns he, she, it and they.
The output of the classifier is a multinomial distribution over six classes: the four French subject pronouns il, elle, ils and elles, corresponding to masculine and feminine singular and plural, respectively; the impersonal pronoun ce/c’, which occurs in some very frequent constructions such as c’est (it is); and a sixth class OTHER, which indicates that none of these pronouns was used. In general, a pronoun may be aligned to multiple words; in this case, a training example is counted as a positive example for a class if the target word occurs among the words aligned to the pronoun, irrespective of the presence of other aligned tokens.

[Figure 2: Antecedent feature aggregation]

This task setup resembles the problem that an SMT system would have to solve to make informed choices when translating pronouns, an aspect of translation neglected by most existing SMT systems. An important difference between the SMT setup and our own classifiers is that we use context from human-made translations for prediction. This potentially makes the task both easier and more difficult; easier, because the context can be relied on to be correctly translated, and more difficult, because human translators frequently create less literal translations than an SMT system would. Integrating pronoun prediction into the translation process would require significant changes to the standard SMT decoding setup in order to take long-range dependencies in the target language into account, which is why we do not address this issue in our current work. In all the experiments presented in this paper, we used features from two different sources: Anaphora context features describe the source language pronoun and its immediate context consisting of three words to its left and three words to its right.
They are encoded as vectors whose dimensionality is equal to the source vocabulary size with a single non-zero component indicating the word referred to (one-hot vectors). Antecedent features describe an antecedent candidate. Antecedent candidates are represented by the target language words aligned to the syntactic head of the source language markable noun phrase as identified by the Collins head finder (Collins, 1999). The different handling of anaphora context features and antecedent features is due to the fact that we always consider a constant number of context words on the source side, whereas the number of word vectors to be considered depends on the number of antecedent candidates and on the number of target words aligned to each antecedent. The encoding of the antecedent features is illustrated in Figure 2 for a training example with two antecedent candidates translated to elle and la version, respectively. The target words are represented as one-hot vectors with the dimensionality of the target language vocabulary. These vectors are then averaged to yield a single vector per antecedent candidate. Finally, the vectors of all candidates for a given training example are weighted by the probabilities assigned to them by the anaphora resolver (p1 and p2) and summed to yield a single vector per training example.

Table 1: Distribution of classes in the training data

Class   TED      News
ce      16.3 %    6.4 %
elle     7.1 %   10.1 %
elles    3.0 %    3.9 %
il      17.1 %   26.5 %
ils     15.6 %   15.1 %
OTHER   40.9 %   38.0 %

3 Data Sets and External Tools

We run experiments with two different test sets. The TED data set consists of around 2.6 million tokens of lecture subtitles released in the WIT3 corpus (Cettolo et al., 2012). The WIT3 training data yields 71,052 examples, which were randomly partitioned into a training set of 63,228 examples and a test set of 7,824 examples. The official WIT3 development and test sets were not used in our experiments.
The news-commentary data set is version 6 of the parallel news-commentary corpus released as a part of the WMT 2011 training data (footnote 1: http://www.statmt.org/wmt11/translation-task.html, 3 July 2013). It contains around 2.8 million tokens of news text and yields 31,017 data points, which were randomly split into 27,900 training examples and 3,117 test instances. The distribution of the classes in the two training sets is shown in Table 1. One thing to note is the dominance of the OTHER class, which pools together such different phenomena as translations with other pronouns not in our list (e.g., celui-ci) and translations with full noun phrases instead of pronouns. Splitting this group into more meaningful subcategories is not straightforward and must be left to future work. The feature setup of all our classifiers requires the detection of potential antecedents and the extraction of features pairing anaphoric pronouns with antecedent candidates. Some of our experiments also rely on an external anaphora resolution component. We use the open-source anaphora resolver BART to generate this information. BART (Broscheit et al., 2010) is an anaphora resolution toolkit consisting of a markable detection and feature extraction pipeline based on a variety of standard natural language processing (NLP) tools and a machine learning component to predict coreference links including both pronominal anaphora and noun-noun coreference. In our experiments, we always use BART’s markable detection and feature extraction machinery. Markable detection is based on the identification of noun phrases in constituency parses generated with the Stanford parser (Klein and Manning, 2003). The set of features extracted by BART is an extension of the widely used mention-pair anaphora resolution feature set by Soon et al. (2001) (see below, Section 6). In the experiments of the next two sections, we also use BART to predict anaphoric links for pronouns.
The model used with BART is a maximum entropy ranker trained on the ACE02-npaper corpus (LDC2003T11). In order to obtain a probability distribution over antecedent candidates rather than one-best predictions or coreference sets, we modified the ranking component with which BART resolves pronouns to normalise and output the scores assigned by the ranker to all candidates instead of picking the highest-scoring candidate.

4 Baseline Classifiers

In order to create a simple, but reasonable baseline for our task, we trained a maximum entropy (ME) classifier with the MegaM software package (footnote 2) using the features described in the previous section and the anaphora links found by BART. Results are shown in Table 2. The baseline results show an overall higher accuracy for the TED data than for the news-commentary data. While the precision is above 50 % in all categories and considerably higher in some, recall varies widely.

Table 2: Maximum entropy classifier results

TED (Accuracy: 0.685)
Class   P      R      F
ce      0.593  0.728  0.654
elle    0.798  0.523  0.632
elles   0.812  0.164  0.273
il      0.764  0.550  0.639
ils     0.632  0.949  0.759
OTHER   0.724  0.692  0.708

News commentary (Accuracy: 0.576)
Class   P      R      F
ce      0.508  0.294  0.373
elle    0.530  0.312  0.393
elles   0.538  0.062  0.111
il      0.600  0.666  0.631
ils     0.593  0.769  0.670
OTHER   0.564  0.609  0.586

Table 3: Neural network classifier with anaphoras resolved by BART

TED (Accuracy: 0.700)
Class   P      R      F
ce      0.634  0.747  0.686
elle    0.756  0.617  0.679
elles   0.679  0.319  0.434
il      0.719  0.591  0.649
ils     0.663  0.940  0.778
OTHER   0.743  0.678  0.709

News commentary (Accuracy: 0.576)
Class   P      R      F
ce      0.477  0.344  0.400
elle    0.498  0.401  0.444
elles   0.565  0.116  0.193
il      0.655  0.626  0.640
ils     0.570  0.834  0.677
OTHER   0.567  0.573  0.570

The pronoun elles is particularly interesting. This is the feminine plural of the personal pronoun, and it usually corresponds to the English pronoun they, which is not marked for gender.
In French, elles is a marked choice which is only used if the antecedent exclusively refers to females or feminine-gendered objects. The presence of a single item with masculine grammatical gender in the antecedent will trigger the use of the masculine plural pronoun ils instead. This distinction cannot be predicted from the English source pronoun or its context; making correct predictions requires knowledge about the antecedent of the pronoun. Moreover, elles is a low-frequency pronoun. There are only 1,909 occurrences of this pronoun in the TED training data, and 1,077 in the news-commentary training set. (Footnote 2, MegaM: http://www.umiacs.umd.edu/~hal/megam/, 20 June 2013.) Because of these special properties of the feminine plural class, we argue that the performance of a classifier on elles is a good indicator of how well it can represent relevant knowledge about pronominal anaphora as opposed to overfitting to source contexts or acting on prior assumptions about class frequencies. In accordance with the general linguistic preference for ils, the classifier tends to predict ils much more often than elles when encountering an English plural pronoun. This is reflected in the fact that elles has much lower recall than ils. Clearly, the classifier achieves a good part of its accuracy by making majority choices without exploiting deeper knowledge about the antecedents of pronouns. An additional experiment with a subset of 27,900 training examples from the TED data confirms that the difference between TED and news commentaries is not just an effect of training data size, but that TED data is genuinely easier to predict than news commentaries. In the reduced data TED condition, the classifier achieves an accuracy of 0.673.
Precision and recall of all classifiers are much closer to the large-data TED condition than to the news commentary experiments, except for elles, where we obtain an F-score of 0.072 (P 0.818, R 0.038), indicating that small training data size is a serious problem for this low-frequency class.

[Figure 3: Neural network for pronoun prediction]

5 Neural Network Classifier

In the previous section, we saw that a simple multiclass maximum entropy classifier, while making correct predictions for much of the data set, has a significant bias towards making majority class decisions, relying more on prior assumptions about the frequency distribution of the classes than on antecedent features when handling examples of less frequent classes. In order to create a system that can be trained to rely more explicitly on antecedent information, we created a neural network classifier for our task. The introduction of a hidden layer should enable the classifier to learn abstract concepts such as gender and number that are useful across multiple output categories, so that the performance of sparsely represented classes can benefit from the training examples of the more frequent classes. The overall structure of the network is shown in Figure 3. As inputs, the network takes the same features that were available to the baseline ME classifier, based on the source pronoun (P) with three words of context to its left (L1 to L3) and three words to its right (R1 to R3) as well as the words aligned to the syntactic head words of all possible antecedent candidates as found by BART (A). All words are encoded as one-hot vectors whose dimensionality is equal to the vocabulary size. If multiple words are aligned to the syntactic head of an antecedent candidate, their word vectors are averaged with uniform weights. The resulting vectors for each antecedent are then averaged with weights defined by the posterior distribution of the anaphora resolver in BART (p1 to p3).
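The two-stage averaging of antecedent word vectors described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the toy vocabulary `VOCAB`, the candidate translations and the resolver probabilities 0.9 and 0.1 are hypothetical values chosen for the example, not figures taken from the paper.

```python
def one_hot(index, size):
    """Return a one-hot vector of the given size."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def average(vectors):
    """Uniform average of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def aggregate_antecedents(candidates, probs, vocab):
    """Combine antecedent candidates into one input vector.

    candidates: one list of aligned target words per antecedent candidate.
    probs: one probability per candidate (from the anaphora resolver).
    """
    size = len(vocab)
    total = [0.0] * size
    for words, p in zip(candidates, probs):
        # Stage 1: uniform average over the words aligned to this candidate.
        cand_vec = average([one_hot(vocab[w], size) for w in words])
        # Stage 2: probability-weighted sum over candidates.
        total = [t + p * c for t, c in zip(total, cand_vec)]
    return total


# Hypothetical toy vocabulary and resolver probabilities.
VOCAB = {"elle": 0, "la": 1, "version": 2, "il": 3}
features = aggregate_antecedents([["elle"], ["la", "version"]], [0.9, 0.1], VOCAB)
```

With these toy inputs the entry for elle carries the first candidate's full probability, while la and version each receive half of the second candidate's probability.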
The network has two hidden layers. The first layer (E) maps the input word vectors to a low-dimensional representation. In this layer, the embedding weights for all the source language vectors (the pronoun and its 6 context words) are tied, so if two words are the same, they are mapped to the same lower-dimensional embedding irrespective of their position relative to the pronoun. The embedding of the antecedent word vectors is independent, as these word vectors represent target language words. The entire embedding layer is then mapped to another hidden layer (H), which is in turn connected to a softmax output layer (S) with 6 outputs representing the classes ce, elle, elles, il, ils and OTHER. The non-linearity of both hidden layers is the logistic sigmoid function, $f(x) = 1/(1 + e^{-x})$. In all experiments reported in this paper, the dimensionality of the source and target language word embeddings is 20, resulting in a total embedding layer size of 160, and the size of the last hidden layer is equal to 50. These sizes are fairly small. In experiments with larger layer sizes, we were able to obtain similar, but no better results. The neural network is trained with mini-batch stochastic gradient descent with backpropagated gradients using the RMSPROP algorithm with cross-entropy as the objective function (footnote 3). In contrast to standard gradient descent, RMSPROP normalises the magnitude of the gradient components by dividing them by a root-mean-square moving average. We found this led to faster convergence. Other features of our training algorithm include the use of momentum to even out gradient oscillations, adaptive learning rates for each weight as well as adaptation of the global learning rate as a function of current training progress. The network is regularised with an ℓ2 weight penalty. Good settings of the initial learning rate and the weight cost parameter (both around 0.001 in most experiments) were found by manual experimentation.
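As a rough illustration of the gradient normalisation mentioned above, the following sketch implements a bare RMSProp-style update. The decay factor 0.9 and the stabilising constant `eps` are assumptions made for this example; the momentum, per-weight adaptive learning rates and learning rate schedule that the text also mentions are deliberately left out.

```python
import math


def rmsprop_step(weights, grads, mean_sq, lr=0.001, decay=0.9, eps=1e-8):
    """One RMSProp-style update; returns new weights and new moving averages.

    decay and eps are illustrative defaults, not values from the paper.
    """
    new_weights, new_mean_sq = [], []
    for w, g, ms in zip(weights, grads, mean_sq):
        # Moving average of the squared gradient for this weight.
        ms = decay * ms + (1.0 - decay) * g * g
        new_mean_sq.append(ms)
        # Divide the gradient by its root-mean-square magnitude.
        new_weights.append(w - lr * g / (math.sqrt(ms) + eps))
    return new_weights, new_mean_sq


# Two gradients that differ by a factor of 100 produce almost equal step sizes.
w, ms = rmsprop_step([1.0, 1.0], [10.0, 0.1], [0.0, 0.0])
steps = [1.0 - w[0], 1.0 - w[1]]
```

The usage lines show the effect of the normalisation: the step size depends on the gradient's size relative to its recent history rather than on its absolute magnitude.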
Generally, we train our networks for 300 epochs, compute the validation error on a held-out set of some 10 % of the training data after each epoch and use the model that achieved the lowest validation error for testing. Since the source context features are very informative and it is comparatively more difficult to learn from the antecedents, the network sometimes had a tendency to overfit to the source features and disregard antecedent information. We found that this problem can be solved effectively by presenting a part of the training without any source features, forcing the network to learn from the information contained in the antecedents. In all experiments in this paper, we zero out all source features (input layers P, L1 to L3 and R1 to R3) with a probability of 50 % in each training example. At test time, no information is zeroed out. Classification results with this network are shown in Table 3. We note that the accuracy has increased slightly for the TED test set and remains exactly the same for the news commentary corpus. However, a closer look at the results for individual classes reveals that the neural network makes better predictions for almost all classes. In terms of F-score, the only class that becomes slightly worse is the OTHER class for the news commentary corpus because of lower recall, indicating that the neural network classifier is less biased towards using the uninformative OTHER category. Recall for elle and elles increases considerably, but especially for elles it is still quite low. The increase in recall comes with some loss in precision, but the net effect on F-score is clearly positive.

(Footnote 3: Our training procedure is greatly inspired by a series of online lectures held by Geoffrey Hinton in 2012, https://www.coursera.org/course/neuralnets, 10 September 2013.)
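The source-feature masking used during training in this section can be sketched as follows. The function name `mask_source_features` and its arguments are hypothetical; only the 50 % masking probability and the rule that nothing is masked at test time come from the text.

```python
import random


def mask_source_features(source_vectors, rng, drop_prob=0.5, training=True):
    """Zero out all source-side input vectors with probability drop_prob.

    source_vectors: the vectors for P, L1 to L3 and R1 to R3.
    At test time (training=False) the inputs are returned unchanged.
    """
    if training and rng.random() < drop_prob:
        return [[0.0] * len(vec) for vec in source_vectors]
    return source_vectors


rng = random.Random(0)
source = [[0.0, 1.0], [1.0, 0.0]]
at_test_time = mask_source_features(source, rng, training=False)
always_masked = mask_source_features(source, rng, drop_prob=1.0)
```

Masking the whole group of source inputs at once, rather than individual units, matches the stated goal of forcing the network to rely on antecedent information for some training examples.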
6 Latent Anaphora Resolution

Considering Figure 1 again, we note that the bilingual setting of our classification task adds some information not available to the monolingual anaphora resolver that can be helpful when determining the correct antecedent for a given pronoun. Knowing the gender of the translation of a pronoun limits the set of possible antecedents to those whose translation is morphologically compatible with the target language pronoun. We can exploit this fact to learn how to resolve anaphoric pronouns without requiring data with manually annotated anaphoric links. To achieve this, we extend our neural network with a component to predict the probability of each antecedent candidate to be the correct antecedent (Figure 4). The extended network is identical to the previous version except for the upper left part dealing with anaphoric link features. The only difference between the two networks is the fact that anaphora resolution is now performed by a part of our neural network itself instead of being done by an external module and provided to the classifier as an input. In this setup, we still use some parts of the BART toolkit to extract markables and compute features. However, we do not make use of the machine learning component in BART that makes the actual predictions. Since this is the only component trained on coreference-annotated data in a typical BART configuration, no coreference annotations are used anywhere in our system even though we continue to rely on the external anaphora resolver for preprocessing to avoid implementing our own markable and feature extractors and to make comparison easier. For each candidate markable identified by BART’s preprocessing pipeline, the anaphora resolution model receives as input a link feature vector (T) describing relevant aspects of the antecedent candidate-anaphora pair.
This feature vector is generated by the feature extraction machinery in BART and includes a standard feature set for coreference resolution partially based on work by Soon et al. (2001). We use the following feature extractors in BART, each of which can generate multiple features:

- Anaphora mention type
- Gender match
- Number match
- String match
- Alias feature (Soon et al., 2001)
- Appositive position feature (Soon et al., 2001)
- Semantic class (Soon et al., 2001)
- Semantic class match
- Binary distance feature
- Antecedent is first mention in sentence

[Figure 4: Neural network with latent anaphora resolution]

Our baseline set of features was borrowed wholesale from a working coreference system and includes some features that are not relevant to the task at hand, e.g., features indicating that the anaphora is a pronoun, is not a named entity, etc. After removing all features that assume constant values in the training set when resolving antecedents for the set of pronouns we consider, we are left with a basic set of 37 anaphoric link features that are fed as inputs to our network. These features are exactly the same as those available to the anaphora resolution classifier in the BART system used in the previous section. Each training example for our network can have an arbitrary number of antecedent candidates, each of which is described by an antecedent word vector (A) and by an anaphoric link vector (T). The anaphoric link features are first mapped to a regular hidden layer with logistic sigmoid units (U). The activations of the hidden units are then mapped to a single value, which functions as an element in a softmax layer over all antecedent candidates (V). This softmax layer assigns a probability to each antecedent candidate, which we then use to compute a weighted average over the antecedent word vector, replacing the probabilities pi in Figures 2 and 3.
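The path from link features (T) through the sigmoid layer (U) to the softmax over candidates (V) can be sketched as a small forward pass. This is a simplified illustration: the weight values and the two-dimensional link features are hypothetical, bias terms are omitted, and the same weights are reused for every candidate as the text describes.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def candidate_score(link_features, hidden_weights, output_weights):
    """Map one candidate's link features (T) through layer U to a single score."""
    hidden = [sigmoid(sum(w * x for w, x in zip(row, link_features)))
              for row in hidden_weights]
    return sum(w * h for w, h in zip(output_weights, hidden))


def antecedent_probabilities(all_link_features, hidden_weights, output_weights):
    """Softmax (V) over the scores of all candidates, with shared weights."""
    scores = [candidate_score(t, hidden_weights, output_weights)
              for t in all_link_features]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical weights and three candidates with two link features each.
HIDDEN_W = [[1.0, -1.0], [0.5, 0.5]]
OUTPUT_W = [2.0, -1.0]
probs = antecedent_probabilities([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
                                 HIDDEN_W, OUTPUT_W)
```

Because the weights are shared, the function accepts any number of candidates without changing the parameter count.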
At training time, the network’s anaphora resolution component is trained in exactly the same way as the rest of the network. The error signal from the embedding layer is backpropagated both to the weight matrix defining the antecedent word embedding and to the anaphora resolution subnetwork. Note that the number of weights in the network is the same for all training examples even though the number of antecedent candidates varies because all weights related to antecedent word features and anaphoric link features are shared between all antecedent candidates. One slightly uncommon feature of our neural network is that it contains an internal softmax layer to generate normalised probabilities over all possible antecedent candidates. Moreover, weights are shared between all antecedent candidates, so the inputs of our internal softmax layer share dependencies on the same weight variables. When computing derivatives with backpropagation, these shared dependencies must be taken into account. In particular, the outputs $y_i$ of the antecedent resolution layer are the result of a softmax applied to functions of some shared variables $q$:

$$y_i = \frac{\exp f_i(q)}{\sum_k \exp f_k(q)} \qquad (1)$$

The derivatives of any $y_i$ with respect to $q$, which can be any of the weights in the anaphora resolution subnetwork, have dependencies on the derivatives of the other softmax inputs with respect to $q$:

$$\frac{\partial y_i}{\partial q} = y_i \left( \frac{\partial f_i(q)}{\partial q} - \sum_k y_k \frac{\partial f_k(q)}{\partial q} \right) \qquad (2)$$

This makes the implementation of backpropagation for this part of the network somewhat more complicated, but in the case of our networks, it has no major impact on training time. Experimental results for this network are shown in Table 4. Compared with Table 3, we note that the overall accuracy is only very slightly lower for TED, and for the news commentaries it is actually better. When it comes to F-scores, the performance for elles improves by a small amount, while the effect on the other classes is a bit more mixed.
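The derivative of a softmax whose inputs all depend on one shared variable can be checked numerically. The sketch below assumes, purely for illustration, linear input functions f_i(q) = a_i * q with hypothetical slopes a_i, so that the derivative of each f_i with respect to q is a_i; it compares the analytic form y_i (a_i - sum_k y_k a_k) with a central finite difference.

```python
import math


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def outputs(q, slopes):
    """Softmax outputs y_i for the assumed inputs f_i(q) = a_i * q."""
    return softmax([a * q for a in slopes])


def analytic_grad(q, slopes):
    """dy_i/dq = y_i * (df_i/dq - sum_k y_k * df_k/dq), with df_i/dq = a_i."""
    y = outputs(q, slopes)
    weighted = sum(yk * ak for yk, ak in zip(y, slopes))
    return [yi * (ai - weighted) for yi, ai in zip(y, slopes)]


def numeric_grad(q, slopes, h=1e-6):
    """Central finite-difference approximation of dy_i/dq."""
    upper = outputs(q + h, slopes)
    lower = outputs(q - h, slopes)
    return [(u - l) / (2 * h) for u, l in zip(upper, lower)]


SLOPES = [0.5, -1.0, 2.0]  # hypothetical values of a_i
exact = analytic_grad(0.3, SLOPES)
approx = numeric_grad(0.3, SLOPES)
```

Since the outputs always sum to one, the derivatives across all candidates sum to zero, which gives a second quick sanity check.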
Even where it gets worse, the differences are not dramatic considering that we eliminated a very knowledge-rich resource from the training process. This demonstrates that it is possible, in our classification task, to obtain good results without using any data manually annotated for anaphora and to rely entirely on unsupervised latent anaphora resolution.

7 Further Improvements

The results presented in the preceding section represent a clear improvement over the ME classifiers in Table 2, even though the overall accuracy increased only slightly. Not only does our neural network classifier achieve better results on the classification task at hand without requiring an anaphora resolution classifier trained on manually annotated data, but it performs clearly better for the feminine categories that reflect minority choices requiring knowledge about the antecedents. Nevertheless, the performance is still not entirely satisfactory. By subjecting the output of our classifier on a development set to a manual error analysis, we found that a fairly large number of errors belong to two error types: On the one hand, the preprocessing pipeline used to identify antecedent candidates does not always include the correct antecedent in the set presented to the neural network. Whenever this occurs, it is obvious that the classifier cannot possibly find the correct antecedent. Out of 76 examples of the category elles that had been mistakenly predicted as ils, we found that 43 suffered from this problem. In other classes, the problem seems to be somewhat less common, but it still exists. On the other hand, in many cases (23 out of 76 for the category mentioned before) the anaphora resolution subnetwork does identify an antecedent manually recognised to belong to the right gender/number group, but still predicts an incorrect pronoun. This may indicate that the network has difficulties learning a correct gender/number representation for all words in the vocabulary.
7.1 Relaxing Markable Extraction

The pipeline we use to extract potential antecedent candidates is borrowed from the BART anaphora resolution toolkit. BART uses a syntactic parser to identify noun phrases as markables. When extracting antecedent candidates for coreference prediction, it starts by considering a window consisting of the sentence in which the anaphoric pronoun is located and the two immediately preceding sentences. Markables in this window are checked for morphological compatibility in terms of gender and number with the anaphoric pronoun, and only compatible markables are extracted as antecedent candidates. If no compatible markables are found in the initial window, the window is successively enlarged one sentence at a time until at least one suitable markable is found. Our error analysis shows that this procedure misses some relevant markables both because the initial two-sentence extraction window is too small and because the morphological compatibility check incorrectly filters away some markables that should have been considered as candidates. By contrast, the extraction procedure does extract quite a number of first and second person noun phrases (I, we, you and their oblique forms) in the TED talks which are extremely unlikely to be the antecedent of a later occurrence of he, she, it or they. As a first step, we therefore adjust the extraction criteria to our task by increasing the initial extraction window to five sentences, excluding first and second person markables and removing the morphological compatibility requirement. The compatibility check is still used to control expansion of the extraction window, but it is no longer applied to filter the extracted markables.
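The relaxed extraction procedure can be sketched as a loop over a growing sentence window. Several details here are assumptions made for illustration: markables are plain strings, the window is counted in preceding sentences, the list of excluded first and second person forms is abbreviated, and `is_compatible` stands in for BART's morphological compatibility check.

```python
EXCLUDED = frozenset({"i", "me", "we", "us", "you"})  # abbreviated, illustrative


def extract_candidates(sentences, pronoun_idx, is_compatible, initial_window=5):
    """Collect antecedent candidates from a growing window of sentences.

    sentences: one list of markable strings per sentence.
    pronoun_idx: index of the sentence containing the anaphoric pronoun.
    """
    window = initial_window
    while True:
        start = max(0, pronoun_idx - window)
        markables = [m for sentence in sentences[start:pronoun_idx + 1]
                     for m in sentence if m.lower() not in EXCLUDED]
        # The compatibility check only decides whether to stop expanding;
        # incompatible markables are kept as candidates.
        if any(is_compatible(m) for m in markables) or start == 0:
            return markables
        window += 1


doc = [["the women"], ["a table"], ["we"], ["you"], ["a car"], ["I"], ["a book"], []]
found = extract_candidates(doc, 7, lambda m: m == "the women")
```

In this toy document the only compatible markable lies seven sentences back, so the window grows twice before extraction stops, and the incompatible noun phrases collected along the way remain in the candidate set.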
This increases the accuracy to 0.701 for TED and 0.602 for the news commentaries, while the performance for elles improves to F-scores of 0.531 (TED; P 0.690, R 0.432) and 0.304 (News commentaries; P 0.444, R 0.231), respectively. Note that these and all the following results are not directly comparable to the ME baseline results in Table 2, since they include modifications and improvements to the training data extraction procedure that might possibly lead to benefits in the ME setting as well.

Table 4: Neural network classifier with latent anaphora resolution

TED (Accuracy: 0.696)
Class   P      R      F
ce      0.618  0.722  0.666
elle    0.754  0.548  0.635
elles   0.737  0.340  0.465
il      0.718  0.629  0.670
ils     0.652  0.916  0.761
OTHER   0.741  0.682  0.711

News commentary (Accuracy: 0.597)
Class   P      R      F
ce      0.419  0.368  0.392
elle    0.547  0.460  0.500
elles   0.539  0.135  0.215
il      0.623  0.719  0.667
ils     0.596  0.783  0.677
OTHER   0.614  0.544  0.577

Table 5: Final classifier results

TED (Accuracy: 0.713)
Class   P      R      F
ce      0.611  0.723  0.662
elle    0.749  0.596  0.664
elles   0.602  0.616  0.609
il      0.733  0.638  0.682
ils     0.710  0.884  0.788
OTHER   0.760  0.704  0.731

News commentary (Accuracy: 0.626)
Class   P      R      F
ce      0.492  0.324  0.391
elle    0.526  0.439  0.478
elles   0.547  0.558  0.552
il      0.599  0.757  0.669
ils     0.671  0.878  0.761
OTHER   0.681  0.526  0.594

7.2 Adding Lexicon Knowledge

In order to make it easier for the classifier to identify the gender and number properties of infrequent words, we extend the word vectors with features indicating possible morphological features for each word. In early experiments with ME classifiers, we found that our attempts to do proper gender and number tagging in French text did not improve classification performance noticeably, presumably because the annotation was too noisy. In more recent experiments, we just add features indicating all possible morphological interpretations of each word, rather than trying to disambiguate them.
To do this, we look up the morphological annotations of the French words in the Lefff dictionary (Sagot et al., 2006) and introduce a set of new binary features to indicate whether a particular reading of a word occurs in that dictionary. These features are then added to the one-hot representation of the antecedent words. Doing so improves the classifier accuracy to 0.711 (TED) and 0.604 (News commentaries), while the F-scores for elles reach 0.589 (TED; P 0.649, R 0.539) and 0.500 (News commentaries; P 0.545, R 0.462), respectively.

7.3 More Anaphoric Link Features

Even though the modified antecedent candidate extraction with its larger context window and without the morphological filter results in better performance on both test sets, additional error analysis reveals that the classifier has greater problems identifying the correct markable in this setting. One reason for this may be that the baseline anaphoric link feature set described above (Section 6) only includes two very rough binary distance features which indicate whether or not the anaphora and the antecedent candidate occur in the same or in immediately adjacent sentences. With the larger context window, this may be too unspecific. In our final experiment, we therefore enable some additional features which are available in BART, but disabled in the baseline system:

– Distance in number of markables
– Distance in number of sentences
– Sentence distance, log-transformed
– Distance in number of words
– Part of speech of head word

Most of these encode the distance between the anaphora and the antecedent candidate in more precise ways. Complete results for this final system are presented in Table 5. Including these additional features leads to another slight increase in accuracy for both corpora, with similar or increased classifier F-scores for most classes except elle in the news commentary condition.
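As a rough illustration of the anaphoric link features listed in Section 7.3, the following sketch computes them for a single pronoun and antecedent candidate pair. It is not BART's implementation: the `Mention` class and `link_features` function are hypothetical, the log transform is assumed to be log(1 + d) so that a distance of zero is defined, and the part-of-speech feature is assumed to refer to the head word of the candidate.

```python
import math
from dataclasses import dataclass
from typing import Dict, Union


@dataclass(frozen=True)
class Mention:
    """Hypothetical position record for a pronoun or an antecedent candidate."""
    sentence: int   # sentence index in the document
    word: int       # index of the first word in the document
    markable: int   # running index among all markables
    head_pos: str   # part-of-speech tag of the head word


def link_features(anaphor: Mention, candidate: Mention) -> Dict[str, Union[int, float, str, bool]]:
    """Features describing the link between an anaphor and one candidate."""
    sent_dist = anaphor.sentence - candidate.sentence
    return {
        # The two coarse binary features of the baseline feature set.
        "same_sentence": sent_dist == 0,
        "adjacent_sentence": sent_dist == 1,
        # The finer-grained features enabled in the final system.
        "markable_distance": anaphor.markable - candidate.markable,
        "sentence_distance": sent_dist,
        # Assumed form of the log transform: log(1 + d).
        "log_sentence_distance": math.log1p(sent_dist),
        "word_distance": anaphor.word - candidate.word,
        # Assumption: the POS tag is taken from the candidate's head word.
        "head_pos": candidate.head_pos,
    }
```

With a five-sentence window, both binary baseline features are false for most candidates, which is why the graded distances carry information the baseline set cannot express.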
In particular, we should like to point out the performance of our benchmark classifier for elles, which suffered from extremely low recall in the first classifiers and approaches the performance of the other classes, with nearly balanced precision and recall, in this final system. Since elles is a low-frequency class and cannot be reliably predicted using source context alone, we interpret this as evidence that our final neural network classifier has incorporated some relevant knowledge about pronominal anaphora that the baseline ME classifier and earlier versions of our network have no access to. This is particularly remarkable because no data manually annotated for coreference was used for training.

8 Related work

Even though it was recognised years ago that the information contained in parallel corpora may provide valuable information for the improvement of anaphora resolution systems, there have not been many attempts to cash in on this insight. Mitkov and Barbu (2003) exploit parallel data in English and French to improve pronominal anaphora resolution by combining anaphora resolvers for the individual languages with handwritten rules to resolve conflicts between the output of the language-specific resolvers. Veselovská et al. (2012) apply a similar strategy to English-Czech data to resolve different uses of the pronoun it. Other work has used word alignments to project coreference annotations from one language to another with a view to training anaphora resolvers in the target language (Postolache et al., 2006; de Souza and Orăsan, 2011). Rahman and Ng (2012) instead use machine translation to translate their test data into a language for which they have an anaphora resolver and then project the annotations back to the original language. Completely unsupervised monolingual anaphora resolution has been approached using, e.g., Markov logic (Poon and Domingos, 2008) and the Expectation-Maximisation algorithm (Cherry and Bergsma, 2005; Charniak and Elsner, 2009). To the best of our knowledge, the direct application of machine learning techniques to parallel data in a task related to anaphora resolution is novel in our work.

Neural networks and deep learning techniques have recently gained some popularity in natural language processing. They have been applied to tasks such as language modelling (Bengio et al., 2003; Schwenk, 2007) and translation modelling in statistical machine translation (Le et al., 2012), but also to part-of-speech tagging, chunking, named entity recognition and semantic role labelling (Collobert et al., 2011). In tasks related to anaphora resolution, standard feedforward neural networks have been tested as a classifier in an anaphora resolution system (Stuckardt, 2007), but the network design presented in our work is novel.

9 Conclusion

In this paper, we have introduced cross-lingual pronoun prediction as an independent natural language processing task. Even though it is not an end-to-end task, pronoun prediction is interesting for several reasons. It is related to the problem of pronoun translation in SMT, a currently unsolved problem that has been addressed in a number of recent research publications (Le Nagard and Koehn, 2010; Hardmeier and Federico, 2010; Guillou, 2012) without reaching a major breakthrough. In this work, we have shown that pronoun prediction can be effectively modelled in a neural network architecture with relatively simple features. More importantly, we have demonstrated that the task can be exploited to train a classifier with a latent representation of anaphoric links. With parallel text as its only supervision this classifier achieves a level of performance that is similar to, if not better than, that of a classifier using a regular anaphora resolution system trained with manually annotated data.
References

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155.
Samuel Broscheit, Massimo Poesio, Simone Paolo Ponzetto, Kepa Joseba Rodriguez, Lorenza Romano, Olga Uryupina, Yannick Versley, and Roberto Zanoli. 2010. BART: A multilingual anaphora resolution system. In Proceedings of the 5th International Workshop on Semantic Evaluations (SemEval-2010), Uppsala, Sweden, 15–16 July 2010.
Mauro Cettolo, Christian Girardi, and Marcello Federico. 2012. WIT3: Web inventory of transcribed and translated talks. In Proceedings of the 16th Conference of the European Association for Machine Translation (EAMT), pages 261–268, Trento, Italy.
Eugene Charniak and Micha Elsner. 2009. EM works for pronoun anaphora resolution. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pages 148–156, Athens, Greece.
Colin Cherry and Shane Bergsma. 2005. An Expectation Maximization approach to pronoun resolution. In Proceedings of the Ninth Conference on Computational Natural Language Learning (CoNLL-2005), pages 88–95, Ann Arbor, Michigan.
Michael Collins. 1999. Head-Driven Statistical Models for Natural Language Parsing. Ph.D. thesis, University of Pennsylvania.
Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2461–2505.
José de Souza and Constantin Orăsan. 2011. Can projected chains in parallel corpora help coreference resolution? In Iris Hendrickx, Sobha Lalitha Devi, António Branco, and Ruslan Mitkov, editors, Anaphora Processing and Applications, volume 7099 of Lecture Notes in Computer Science, pages 59–69. Springer, Berlin.
Liane Guillou. 2012. Improving pronoun translation for statistical machine translation. In Proceedings of the Student Research Workshop at the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 1–10, Avignon, France.
Christian Hardmeier and Marcello Federico. 2010. Modelling pronominal anaphora in statistical machine translation. In Proceedings of the seventh International Workshop on Spoken Language Translation (IWSLT), pages 283–289, Paris, France.
Christian Hardmeier. 2012. Discourse in statistical machine translation: A survey and a case study. Discours, 11.
Dan Klein and Christopher D. Manning. 2003. Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 423–430, Sapporo, Japan.
Hai-Son Le, Alexandre Allauzen, and François Yvon. 2012. Continuous space translation models with neural networks. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 39–48, Montréal, Canada.
Ronan Le Nagard and Philipp Koehn. 2010. Aiding pronoun translation with co-reference resolution. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, pages 252–261, Uppsala, Sweden.
Ruslan Mitkov and Catalina Barbu. 2003. Using bilingual corpora to improve pronoun resolution. Languages in Contrast, 4(2):201–211.
Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29:19–51.
Hoifung Poon and Pedro Domingos. 2008. Joint unsupervised coreference resolution with Markov Logic. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 650–659, Honolulu, Hawaii.
Oana Postolache, Dan Cristea, and Constantin Orăsan. 2006. Transferring coreference chains through word alignment. In Proceedings of the 5th Conference on International Language Resources and Evaluation (LREC-2006), pages 889–892, Genoa.
Altaf Rahman and Vincent Ng. 2012. Translation-based projection for multilingual coreference resolution. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 720–730, Montréal, Canada.
Benoît Sagot, Lionel Clément, Éric Villemonte de La Clergerie, and Pierre Boullier. 2006. The Lefff 2 syntactic lexicon for French: architecture, acquisition, use. In Proceedings of the 5th Conference on International Language Resources and Evaluation (LREC-2006), pages 1348–1351, Genoa.
Holger Schwenk. 2007. Continuous space language models. Computer Speech and Language, 21(3):492–518.
Wee Meng Soon, Hwee Tou Ng, and Daniel Chung Yong Lim. 2001. A machine learning approach to coreference resolution of noun phrases. Computational Linguistics, 27(4):521–544.
Roland Stuckardt. 2007. Applying backpropagation networks to anaphor resolution. In António Branco, editor, Anaphora: Analysis, Algorithms and Applications. 6th Discourse Anaphora and Anaphor Resolution Colloquium, DAARC 2007, number 4410 in Lecture Notes in Artificial Intelligence, pages 107–124, Berlin.
Kateřina Veselovská, Ngụy Giang Linh, and Michal Novák. 2012. Using Czech-English parallel corpora in automatic identification of it. In Proceedings of the 5th Workshop on Building and Using Comparable Corpora, pages 112–120, Istanbul, Turkey.

5 0.57492894 65 emnlp-2013-Document Summarization via Guided Sentence Compression

Author: Chen Li ; Fei Liu ; Fuliang Weng ; Yang Liu

Abstract: Joint compression and summarization has been used recently to generate high quality summaries. However, such word-based joint optimization is computationally expensive. In this paper we adopt the ‘sentence compression + sentence selection’ pipeline approach for compressive summarization, but propose to perform summary guided compression, rather than generic sentence-based compression. To create an annotated corpus, the human annotators were asked to compress sentences while explicitly given the important summary words in the sentences. Using this corpus, we train a supervised sentence compression model using a set of word-, syntax-, and document-level features. During summarization, we use multiple compressed sentences in the integer linear programming framework to select salient summary sentences. Our results on the TAC 2008 and 2011 summarization data sets show that by incorporating the guided sentence compression model, our summarization system can yield significant performance gain as compared to the state-of-the-art.

6 0.56656444 80 emnlp-2013-Exploiting Zero Pronouns to Improve Chinese Coreference Resolution

7 0.54164225 194 emnlp-2013-Unsupervised Relation Extraction with General Domain Knowledge

8 0.53988123 193 emnlp-2013-Unsupervised Induction of Cross-Lingual Semantic Relations

9 0.53460991 45 emnlp-2013-Chinese Zero Pronoun Resolution: Some Recent Advances

10 0.53411216 179 emnlp-2013-Summarizing Complex Events: a Cross-Modal Solution of Storylines Extraction and Reconstruction

11 0.52855241 68 emnlp-2013-Effectiveness and Efficiency of Open Relation Extraction

12 0.52653849 118 emnlp-2013-Learning Biological Processes with Global Constraints

13 0.51778132 56 emnlp-2013-Deep Learning for Chinese Word Segmentation and POS Tagging

14 0.51754659 48 emnlp-2013-Collective Personal Profile Summarization with Social Networks

15 0.51076514 113 emnlp-2013-Joint Language and Translation Modeling with Recurrent Neural Networks

16 0.50038791 36 emnlp-2013-Automatically Determining a Proper Length for Multi-Document Summarization: A Bayesian Nonparametric Approach

17 0.49652433 156 emnlp-2013-Recurrent Continuous Translation Models

18 0.49541485 59 emnlp-2013-Deriving Adjectival Scales from Continuous Space Word Representations

19 0.49536601 160 emnlp-2013-Relational Inference for Wikification

20 0.48885086 77 emnlp-2013-Exploiting Domain Knowledge in Aspect Extraction