emnlp emnlp2013 emnlp2013-80 knowledge-graph by maker-knowledge-mining

80 emnlp-2013-Exploiting Zero Pronouns to Improve Chinese Coreference Resolution

Source: pdf

Author: Fang Kong ; Hwee Tou Ng

Abstract: Coreference resolution plays a critical role in discourse analysis. This paper focuses on exploiting zero pronouns to improve Chinese coreference resolution. In particular, a simplified semantic role labeling framework is proposed to identify clauses and to detect zero pronouns effectively, and two effective methods (refining syntactic parser and refining learning example generation) are employed to exploit zero pronouns for Chinese coreference resolution. Evaluation on the CoNLL-2012 shared task data set shows that zero pronouns can significantly improve Chinese coreference resolution.

Reference: text

Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 This paper focuses on exploiting zero pronouns to improve Chinese coreference resolution. [sent-4, score-1.385]

2 Evaluation on the CoNLL-2012 shared task data set shows that zero pronouns can significantly improve Chinese coreference resolution. [sent-6, score-1.401]

3 1 Introduction As one of the most important tasks in discourse analysis, coreference resolution aims to link a given mention (i. [sent-7, score-0.778]

4 Over the last decade, various machine learning techniques have been applied to coreference resolution and have performed reasonably well (Soon et al. [sent-10, score-0.602]

5 In this paper, we focus on exploiting one of the key characteristics of Chinese text, zero pronouns (ZPs), to improve Chinese coreference resolution. [Author affiliation: Hwee Tou Ng, Department of Computer Science, National University of Singapore, 13 Computing Drive, Singapore 117417] [sent-15, score-0.994]

6 In particular, a simplified semantic role labeling (SRL) framework is proposed to identify Chinese clauses and to detect zero pronouns effectively, and two effective methods are employed to exploit zero pronouns for Chinese coreference resolution. [sent-19, score-2.485]

7 Our work is novel in that it is the first work that incorporates the use of zero pronouns to significantly improve Chinese coreference resolution. The rest of this paper is organized as follows. [sent-21, score-1.571]

8 Section 2 describes our baseline Chinese coreference resolution system. [sent-22, score-0.602]

9 Section 3 motivates how the detection of zero pronouns can improve Chinese coreference resolution, using an illustrating example. [sent-23, score-1.386]

10 Section 5 proposes two methods to exploit zero pronouns to improve Chinese coreference resolution, based on a corpus study and preliminary experiments. [sent-25, score-1.36]

11 Our Chinese coreference resolution system also contains these two components. [sent-29, score-0.622]

12 ... a mention is anaphoric or not, and then employ an independently-trained coreference resolution system to resolve those mentions which are classified as anaphoric. [Page footer: ... Methods in Natural Language Processing, pages 278–28..., © 2013 Association for Computational Linguistics] [sent-32, score-0.952]

13 The lack of gender and number makes both anaphoricity determination and coreference resolution in Chinese more difficult. [sent-33, score-0.853]

14 1 Anaphoricity Determination Since only the mentions that take part in coreference chains are annotated in the CoNLL-2012 shared task data set, we first generate a high-recall, low-precision mention extraction module to extract as many mentions as possible. [sent-35, score-0.839]

15 2 Coreference Resolution Our Chinese coreference resolution system adopts the same learning-based model and the same set of 12 features as Soon et al. [sent-46, score-0.622]

16 , a large proportion of personal pronouns and the organization of a text into several parts) and preparing for dealing with zero pronouns, we add some features shown in Table 2. [sent-50, score-0.969]

17 For the feature ANPronounRanking, the relative ranking of a given pronoun is based on its semantic role and surface position, and we assign the highest rank to zero pronouns, similar to Kong et al. [sent-53, score-0.711]

18 Table 4: Performance of our Chinese coreference resolution system on the CoNLL-2012 test set [sent-73, score-0.622]

19 The SVM-light toolkit (Joachims, 1999) with radial basis kernel and default learning parameters is employed in both anaphoricity determination and coreference resolution. [sent-75, score-0.705]

20 Table 4 reports the performance of our Chinese coreference resolution system on the CoNLL-2012 test set under three different experimental settings: with automatic mentions (AM), with gold mention boundaries (GMB), and with gold mentions (GM). [sent-80, score-1.156]

21 Table 1: Features employed in our anaphoricity determination system. Feature NPType: Type of the current mention (pronoun, demonstrative, proper NP). [sent-94, score-0.989]

22 Table 2 feature AN/CAPronounType: Whether the anaphor or the antecedent candidate is a zero pronoun, first person, second person, third person, neutral pronoun, or others. In our coreference resolution system, a zero pronoun is viewed as a kind of special pronoun. Table 2 feature AN/CAGrammaticalRole: Whet... [sent-95, score-1.295]

23 Table 2 feature ANPronounRanking: Whether the anaphor is a pronoun and is ranked highest among the pronouns (including zero pronouns) of the sentence. [sent-99, score-1.267]

24 Table 2: Additional features employed in our Chinese coreference resolution system. From the results, we find that: [sent-104, score-0.919]

25 Thus our coreference resolution system can benefit much from using gold mention boundaries (especially the recall). [sent-116, score-0.92]

26 3 Motivation In order to analyze the impact of zero pronouns on Chinese coreference resolution, we first use the released OntoNotes v5. [sent-120, score-1.379]

27 Statistics show that anaphoric zero pronouns account for 10. [sent-124, score-1.027]

28 The experimental results of our Chinese coreference resolution system (i. [sent-127, score-0.622]

29 , the baseline) show that using both gold mention boundaries and gold mentions significantly 281 improves system performance, especially for recall, largely due to improved parser performance. [sent-129, score-0.578]

30 We then analyze the impact of zero pronouns on Chinese syntactic parsing. [sent-130, score-0.988]

31 As a preliminary exploration, we integrate Chinese zero pronouns into the Berkeley parser (Petrov et al. [sent-131, score-1.012]

32 , 2006), experimenting with gold-standard or automatically determined zero pronouns kept or stripped off (using gold-standard word segmentation provided in the CoNLL-2012 data). [sent-132, score-0.969]

33 Using automatically determined zero pronouns by our zero pronoun detector to be introduced in Section 4, parsing performance also improves by 1. [sent-135, score-1.725]

34 In order to illustrate the impact of zero pronouns on parsing performance, consider the following example: Example (1): 将来我们有一个重建计划 #分公景点园成七个区域，。 #带来多一些的。 [sent-137, score-0.988]

35 ) Without considering zero pronouns, the parse tree of the second sentence output by the Berkeley parser is shown in Figure 1. [sent-147, score-0.646]

36 Prior to parsing, using our zero pronoun detector to be introduced in Section 4, the presence of zero pronouns (denoted by #) can be detected. [sent-148, score-1.701]

37 Footnote 3: In this paper, zero pronouns are denoted by “#” and mentions in the same coreference chain are shown in bold for all examples. [sent-149, score-1.51]

38 Table 5 (columns: MD, MUC, BCUBED, CEAF, Avg): Performance (F-measure) of the three best Chinese coreference resolution systems on the CoNLL-2012 test set. Figure 2 shows the new parse tree, which includes the detected zero pronouns, output by the Berkeley parser on the same sentence. [sent-150, score-1.246]

39 Comparing these two parse trees, we can see that the detected zero pronouns contribute to better division of clauses and improved parsing performance, which in turn leads to improved Chinese coreference resolution. [sent-151, score-1.526]

40 Detecting the presence of zero pronouns also helps to improve local salience modeling, leading to improved Chinese coreference resolution. [sent-152, score-1.405]

41 , the first and second zero pronouns of Example (1)), but can also be scattered across multiple sentences (e. [sent-157, score-0.969]

42 , the first and third zero pronouns of Example (1)). [sent-159, score-0.969]

43 4 Detection of zero pronouns improves local salience modeling, and leads to the correct identification of all the noun phrases of the coreference chain in Example (1). [sent-161, score-1.465]

44 Among empty elements, type *pro*, namely zero pronoun, is either used for dropped subjects or objects, which can be recovered from the context (anaphoric), or it is of little interest for the reader or listener to know (non-anaphoric). [sent-165, score-0.569]

45 Thus, zero pronouns are very important in bridging the information gap in a Chinese text. [sent-167, score-0.969]

46 In this section, we will introduce our zero pronoun detector. [sent-168, score-0.693]

47 In Chinese, a zero pronoun always occurs just before a predicate phrase node (e. [sent-169, score-0.768]

48 We carry out zero pronoun detection for every predicate phrase subtree in an iterative manner from a parse tree, i. [sent-176, score-0.896]

49 , determining whether there is a zero pronoun before the given predicate phrase subtree. [sent-178, score-0.768]

50 Figure 1: The parse tree without considering zero pronouns [sent-223, score-1.091]

51 Viewing the position before the given predicate phrase subtree as a zero pronoun candidate, we can perform zero pronoun detection using a machine learning approach. [sent-224, score-1.517]

52 During training, if a zero pronoun candidate has a counterpart in the same position in the annotated training corpus (either anaphoric or non-anaphoric), a positive example is generated. [sent-225, score-0.781]

53 During testing, each zero pronoun candidate is presented to the zero pronoun detector to determine whether it is a zero pronoun. [sent-227, score-1.936]
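The candidate generation and labeling steps in sentences 47 to 53 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the nested-tuple tree encoding, the choice of `VP` as the only predicate phrase label, the helper names, and the toy English tokens are all assumptions, and candidates without an annotated counterpart are assumed to become negative examples.

```python
from typing import List, Set, Tuple, Union

# Hypothetical encoding: a tree is (label, child, child, ...); a leaf is a token string.
Tree = Union[str, tuple]


def predicate_phrase_starts(tree: Tree, predicate_labels=("VP",)) -> List[int]:
    """Return the token offset at which each predicate phrase subtree begins.

    Each offset marks the gap just before that subtree, i.e. one zero pronoun candidate.
    """
    starts: List[int] = []

    def walk(node: Tree, offset: int) -> int:
        if isinstance(node, str):        # a leaf token consumes one position
            return offset + 1
        label, *children = node
        if label in predicate_labels:
            starts.append(offset)
        for child in children:
            offset = walk(child, offset)
        return offset

    walk(tree, 0)
    return sorted(set(starts))           # nested predicate phrases can share one gap


def label_candidates(candidates: List[int], gold_zp_positions: Set[int]) -> List[Tuple[int, int]]:
    """Label a candidate 1 if the annotation has a zero pronoun at that gap, else 0."""
    return [(pos, 1 if pos in gold_zp_positions else 0) for pos in candidates]


# Toy tree over the tokens: we(0) have(1) plan(2) bring(3) more(4).
toy_tree = (
    "IP",
    ("IP", ("NP", "we"), ("VP", ("VV", "have"), ("NP", "plan"))),
    ("IP", ("VP", ("VV", "bring"), ("NP", "more"))),
)
candidates = predicate_phrase_starts(toy_tree)
examples = label_candidates(candidates, gold_zp_positions={3})
```

In the toy tree the second clause has no overt subject, so the gap before its predicate phrase is the positive example, while the gap before the first predicate phrase (already preceded by an overt subject) is negative.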

54 The features that are employed to detect zero pronouns mainly model the context of the clause itself, the left and right siblings, and the path of the clause to the root node. [sent-228, score-1.295]

55 1 Results and Analysis We evaluate our zero pronoun detector using gold parse trees and automatic parse trees produced by the Berkeley parser. [sent-231, score-1.031]

56 Table 7: Performance of zero pronoun detection on the test set using gold and automatic parse trees. We find that the performance of our zero pronoun detector drops about 12% in F-measure when using automatic parse trees, compared to using gold parse trees. [sent-243, score-1.863]

57 That is, the performance of zero pronoun detection also depends on the performance of the syntactic parser. [sent-244, score-0.719]

58 5 Exploiting Zero Pronouns to Improve Chinese Coreference Resolution In this section, we will propose two methods, refining the syntactic parser and refining learning example generation, to exploit zero pronouns to improve Chinese coreference resolution. [sent-245, score-1.561]

59 Figure 2: The parse tree with the detected zero pronouns [sent-285, score-1.117]

60 1 Refining the Syntactic Parser Similar to our preliminary experiments, we retrain the Berkeley parser with explicit, automatically detected zero pronouns in the training set and parse the test set with explicit, automatically detected zero pronouns using the retrained model. [sent-286, score-2.172]

61 In both anaphoricity determination and coreference resolution, the output results of the retrained parser are employed to generate all features. [sent-287, score-0.751]

62 2 Refining Learning Example Generation In order to model the salience of all entities, we regard all zero pronouns as a special kind of NPs when generating the learning examples. [sent-289, score-1.014]

63 Considering the modest performance of our anaphoricity determination module, we do not determine the anaphoricity of zero pronouns. [sent-290, score-0.879]

64 Instead, in the coreference resolution stage, all zero pronouns will be considered during learning example generation (including both training and test example generation). [sent-291, score-1.593]

65 For example, consider a coreference chain A1-A2-Z0-A3-A4 containing one zero pronoun found in an annotated training document. [sent-292, score-1.12]

66 , A2-Z0 and Z0-A3) are associated with a zero pronoun, which can act as both an anaphor and an antecedent. [sent-299, score-0.567]

67 , Z0-A3, we find any noun phrase and zero pronoun occurring between the anaphor A3 and the antecedent Z0, and pair each of them with A3 to form a negative example. [sent-302, score-0.886]

68 Similarly, test examples can be generated except that only the preceding mentions and zero pronouns in the current and previous two sentences will be paired with an anaphor. [sent-303, score-1.108]

69 Incorporating zero pronouns models salience of all entities more accurately. [sent-304, score-1.014]

70 The ratio of positive to negative examples is also less skewed as a result of considering zero pronouns the ratio changes from 1:7. [sent-305, score-0.991]

71 , zero pronouns) are considered during coreference resolution for Chinese, they are not used in the CoNLL-2012 shared task (i. [sent-311, score-1.343]

72 Table 6: Features employed to detect zero pronouns. Feature ClauseClass: Whether the given clause is a terminal clause or non-terminal clause. [sent-321, score-1.076]

73 , in the gold evaluation keys, all the links formed by zero pronouns are removed). [sent-323, score-1.065]

74 2, during training and testing, all links associated with zero pronouns will be considered in our coreference resolution system. [sent-325, score-1.588]

75 That is, we do not distinguish zero pronoun resolution from traditional coreference resolution, and only view zero pronouns as special pronouns. [sent-326, score-2.264]

76 After generating all the links, zero pronouns are included in coreference chains. [sent-327, score-1.36]

77 For every coreference chain, all zero pronouns will be removed before evaluation. [sent-328, score-1.36]
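The post-processing step in sentences 76 and 77 can be sketched as follows. The function name and the `Z` prefix convention for zero pronouns are hypothetical, and dropping chains that end up empty is an assumption rather than something the text states.

```python
from typing import Callable, List


def strip_zero_pronouns(chains: List[List[str]], is_zero: Callable[[str], bool]) -> List[List[str]]:
    """Remove zero pronouns from every chain; chains left with no mentions are dropped."""
    cleaned = []
    for chain in chains:
        kept = [mention for mention in chain if not is_zero(mention)]
        if kept:             # assumption: a chain made only of zero pronouns disappears
            cleaned.append(kept)
    return cleaned


# Toy output of the resolver: names starting with "Z" stand for zero pronouns.
system_chains = [["A1", "A2", "Z0", "A3", "A4"], ["Z1", "Z2"]]
evaluated_chains = strip_zero_pronouns(system_chains, lambda m: m.startswith("Z"))
```

In this toy case A2 and A3 stay in the same chain after Z0 is removed, which is how a zero pronoun can bridge two overt mentions without appearing in the evaluated output.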

78 Table 8 lists the coreference resolution performance incorporating automatically detected zero pronouns. [sent-333, score-1.155]

79 The results show that: • Using automatically detected zero pronouns achieves better performance under all experimental settings. [sent-334, score-1.017]

80 Using gold mention boundaries, automatic zero pronouns contribute 1. [sent-339, score-1.223]

81 Table 8: Performance of our Chinese coreference resolution system incorporating zero pronouns. ... contribution of zero pronouns is only 0. [sent-350, score-2.584]

82 • Our system incorporating zero pronouns outperforms the three best systems in the CoNLL-2012 shared task when using automatic mentions or gold mention boundaries. [sent-353, score-1.405]

83 5 Table 9 presents the contribution of our two methods of exploiting zero pronouns and the impact of gold-standard zero pronouns. [sent-355, score-1.494]

84 While the refined parser improves the recall of mention detection and coreference resolution, refined example generation contributes more to precision. [sent-358, score-0.728]

85 [...] a performance gap of [...] in F-measure on the MUC, BCUBED, and CEAF evaluation metrics, respectively, between the coreference resolution system with gold-standard zero pronouns and without zero pronouns. [sent-363, score-2.072]

86 This suggests the usefulness of zero pronoun detection in Chinese coreference resolution. [sent-364, score-1.11]

87 Our proposed methods incorporating automatic zero pronouns reduce the performance gap by about half. [sent-365, score-0.993]

88 Discussion Although the evaluation of the CoNLL-2012 shared task does not consider zero pronouns, we also evaluate the performance of zero pronoun resolution on the development data set (i. [sent-367, score-1.426]

89 , extracting all the resolved coreference links containing zero pronouns, acting as anaphor or antecedent, to conduct the evaluation independently). [sent-369, score-0.975]

90 So viewing zero pronouns as a special kind of NP, zero pronouns can bridge salience and contribute to coreference resolution. [sent-373, score-2.391]

91 In Example (1), the zero pronouns occurring in the second sentence help to bridge the coreferential relation between the mention “这个计划/this plan” in the last sentence and the mention “一个重建计划/a reconstruction plan” in the first sentence. [sent-374, score-1.285]

92 There is less research on Chinese coreference resolution compared to English. [sent-380, score-0.602]

93 Although zero pronouns are prevalent in Chinese, there is relatively little work on this topic. [sent-381, score-0.969]

94 For Chinese zero pronoun resolution, representative work includes Converse (2006), Zhao and Ng (2007), and Kong and Zhou (2010). [sent-382, score-0.693]

95 7 Conclusion In this paper, we focus on exploiting one of the key characteristics of Chinese text, zero pronouns, to improve Chinese coreference resolution. [sent-390, score-0.897]

96 In particular, a simplified semantic role labeling framework is proposed to detect zero pronouns effectively, and two effective methods are employed to incorporate zero pronouns into Chinese coreference resolution. [sent-391, score-2.456]

97 To the best of our knowledge, this is the first attempt at incorporating zero pronouns into Chinese coreference resolution. [sent-393, score-1.384]

98 Stanford’s multi-pass sieve coreference resolution system at the CoNLL-2011 shared task. [sent-478, score-0.663]

99 A machine learning approach to coreference resolution of noun phrases. [sent-498, score-0.602]

100 Identification and resolution of Chinese zero pronouns: A machine learning approach. [sent-522, score-0.692]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('pronouns', 0.488), ('zero', 0.481), ('coreference', 0.391), ('pronoun', 0.212), ('resolution', 0.211), ('chinese', 0.186), ('mention', 0.158), ('anaphoricity', 0.147), ('clause', 0.13), ('mentions', 0.114), ('determination', 0.104), ('anaphor', 0.086), ('gold', 0.079), ('refining', 0.079), ('antecedent', 0.078), ('parse', 0.072), ('auto', 0.062), ('boundaries', 0.061), ('np', 0.059), ('anaphoric', 0.058), ('vp', 0.057), ('gmbacm', 0.057), ('hhhnp', 0.057), ('hhnp', 0.057), ('rpf', 0.057), ('ip', 0.052), ('detected', 0.048), ('predicate', 0.046), ('salience', 0.045), ('pro', 0.044), ('parser', 0.043), ('hchlp', 0.043), ('employed', 0.043), ('kong', 0.042), ('vv', 0.042), ('shared', 0.041), ('detector', 0.039), ('singapore', 0.038), ('trees', 0.038), ('subjects', 0.037), ('bcubed', 0.037), ('ceaf', 0.037), ('chain', 0.036), ('berkeley', 0.034), ('featuredescription', 0.034), ('refined', 0.032), ('ng', 0.032), ('empty', 0.031), ('subtree', 0.03), ('candidate', 0.03), ('chung', 0.03), ('phrase', 0.029), ('soon', 0.029), ('clauses', 0.029), ('ambcam', 0.028), ('cveeunacratbfigoened', 0.028), ('hhh', 0.028), ('hhhhhhhhh', 0.028), ('fernandes', 0.028), ('reg', 0.028), ('muc', 0.028), ('tree', 0.028), ('fang', 0.027), ('hwee', 0.027), ('rp', 0.026), ('tou', 0.026), ('detection', 0.026), ('current', 0.025), ('gs', 0.025), ('exploiting', 0.025), ('simplified', 0.024), ('siblings', 0.024), ('improves', 0.024), ('incorporating', 0.024), ('anaphora', 0.023), ('nn', 0.023), ('detect', 0.023), ('retrained', 0.023), ('generation', 0.022), ('considering', 0.022), ('module', 0.021), ('pu', 0.021), ('zhong', 0.021), ('employing', 0.02), ('dropped', 0.02), ('srl', 0.02), ('radial', 0.02), ('subordinate', 0.02), ('conll', 0.02), ('system', 0.02), ('labeling', 0.019), ('impact', 0.019), ('chen', 0.019), ('ad', 0.019), ('discourse', 0.018), ('role', 0.018), ('sibling', 0.018), ('drive', 0.018), ('contribute', 0.017), ('links', 0.017)]
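The page does not say how the similarity values in the list that follows are computed. A value of 0.99999928 for the same-paper entry is consistent with cosine similarity between tf-idf vectors, so the sketch below shows that calculation under that assumption. The first vector reuses the five heaviest weights from the list above; the second vector is entirely made up for illustration.

```python
import math
from typing import Dict


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Five heaviest terms of this paper, taken from the list above.
this_paper = {"pronouns": 0.488, "zero": 0.481, "coreference": 0.391,
              "pronoun": 0.212, "resolution": 0.211}
# Hypothetical weights for some other coreference paper.
other_paper = {"coreference": 0.6, "resolution": 0.4, "errors": 0.5}

self_similarity = cosine(this_paper, this_paper)
cross_similarity = cosine(this_paper, other_paper)
```

Under this reading, a paper compared with itself scores 1 up to floating-point error, and papers sharing only some heavy terms score strictly between 0 and 1.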

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999928 80 emnlp-2013-Exploiting Zero Pronouns to Improve Chinese Coreference Resolution

Author: Fang Kong ; Hwee Tou Ng

2 0.43547663 73 emnlp-2013-Error-Driven Analysis of Challenges in Coreference Resolution

Author: Jonathan K. Kummerfeld ; Dan Klein

Abstract: Coreference resolution metrics quantify errors but do not analyze them. Here, we consider an automated method of categorizing errors in the output of a coreference system into intuitive underlying error types. Using this tool, we first compare the error distributions across a large set of systems, then analyze common errors across the top ten systems, empirically characterizing the major unsolved challenges of the coreference resolution task.

3 0.43323615 67 emnlp-2013-Easy Victories and Uphill Battles in Coreference Resolution

Author: Greg Durrett ; Dan Klein

Abstract: Classical coreference systems encode various syntactic, discourse, and semantic phenomena explicitly, using heterogenous features computed from hand-crafted heuristics. In contrast, we present a state-of-the-art coreference system that captures such phenomena implicitly, with a small number of homogeneous feature templates examining shallow properties of mentions. Surprisingly, our features are actually more effective than the corresponding hand-engineered ones at modeling these key linguistic phenomena, allowing us to win “easy victories” without crafted heuristics. These features are successful on syntax and discourse; however, they do not model semantic compatibility well, nor do we see gains from experiments with shallow semantic features from the literature, suggesting that this approach to semantics is an “uphill battle.” Nonetheless, our final system outperforms the Stanford system (Lee et al. (2011), the winner of the CoNLL 2011 shared task) by 3.5% absolute on the CoNLL metric and outperforms the IMS system (Björkelund and Farkas (2012), the best publicly available English coreference system) by 1.9% absolute.

4 0.28581616 1 emnlp-2013-A Constrained Latent Variable Model for Coreference Resolution

Author: Kai-Wei Chang ; Rajhans Samdani ; Dan Roth

Abstract: Coreference resolution is a well known clustering task in Natural Language Processing. In this paper, we describe the Latent Left Linking model (L3M), a novel, principled, and linguistically motivated latent structured prediction approach to coreference resolution. We show that L3M admits efficient inference and can be augmented with knowledge-based constraints; we also present a fast stochastic gradient based learning. Experiments on ACE and Ontonotes data show that L3M and its constrained version, CL3M, are more accurate than several state-of-the-art approaches as well as some structured prediction models proposed in the literature.

5 0.23744383 117 emnlp-2013-Latent Anaphora Resolution for Cross-Lingual Pronoun Prediction

Author: Christian Hardmeier ; Jorg Tiedemann ; Joakim Nivre

Abstract: This paper addresses the task of predicting the correct French translations of third-person subject pronouns in English discourse, a problem that is relevant as a prerequisite for machine translation and that requires anaphora resolution. We present an approach based on neural networks that models anaphoric links as latent variables and show that its performance is competitive with that of a system with separate anaphora resolution while not requiring any coreference-annotated training data. This demonstrates that the information contained in parallel bitexts can successfully be used to acquire knowledge about pronominal anaphora in an unsupervised way.

1 Motivation

When texts are translated from one language into another, the translation reconstructs the meaning or function of the source text with the means of the target language. Generally, this has the effect that the entities occurring in the translation and their mutual relations will display similar patterns as the entities in the source text. In particular, coreference patterns tend to be very similar in translations of a text, and this fact has been exploited with good results to project coreference annotations from one language into another by using word alignments (Postolache et al., 2006; Rahman and Ng, 2012). On the other hand, what is true in general need not be true for all types of linguistic elements. For instance, a substantial percentage of the English third-person subject pronouns he, she, it and they does not get realised as pronouns in French translations (Hardmeier, 2012).
Moreover, it has been recognised by various authors in the statistical machine translation (SMT) community (Le Nagard and Koehn, 2010; Hardmeier and Federico, 2010; Guillou, 2012) that pronoun translation is a difficult problem because, even when a pronoun does get translated as a pronoun, it may require choosing the correct word form based on agreement features that are not easily predictable from the source text. The work presented in this paper investigates the problem of cross-lingual pronoun prediction for English-French. Given an English pronoun and its discourse context as well as a French translation of the same discourse and word alignments between the two languages, we attempt to predict the French word aligned to the English pronoun. As far as we know, this task has not been addressed in the literature before. In our opinion, it is interesting for several reasons. By studying pronoun prediction as a task in its own right, we hope to contribute towards a better understanding of pronoun translation with a long-term view to improving the performance of SMT systems. Moreover, we believe that this task can lead to interesting insights about anaphora resolution in a multi-lingual context. In particular, we show in this paper that the pronoun prediction task makes it possible to model the resolution of pronominal anaphora as a latent variable and opens up a way to solve a task relying on anaphora resolution without using any data annotated for anaphora. This is what we consider the main contribution of our present work. We start by modelling cross-lingual pronoun prediction as an independent machine learning task after doing anaphora resolution in the source language (English) using the BART software (Broscheit et al., 2010).
We show that it is difficult to achieve satisfactory performance with standard maximum entropy classifiers especially for low-frequency pronouns such as the French feminine plural pronoun elles. We propose a neural network classifier that achieves better precision and recall and manages to make reasonable predictions for all pronoun categories in many cases. We then go on to extend our neural network architecture to include anaphoric links as latent variables. We demonstrate that our classifier, now with its own source language anaphora resolver, can be trained successfully with backpropagation. In this setup, we no longer use the machine learning component included in the external coreference resolution system (BART) to predict anaphoric links. Anaphora resolution is done by our neural network classifier and requires only some quantity of word-aligned parallel data for training, completely obviating the need for a coreference-annotated training set.

[Page footer: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 380–391, Seattle, Washington, USA, 18-21 October 2013. © 2013 Association for Computational Linguistics]

[Figure 1: Task setup. English: "The latest version released in March is equipped with ... It is sold at ..." French: "La dernière version lancée en mars est dotée de ... • est vendue ..."]

2 Task Setup

The overall setup of the classification task we address in this paper is shown in Figure 1. We are given an English discourse containing a pronoun along with its French translation and word alignments between the two languages, which in our case were computed automatically using a standard SMT pipeline with GIZA++ (Och and Ney, 2003). We focus on the four English third-person subject pronouns he, she, it and they.
The output of the classifier is a multinomial distribution over six classes: the four French subject pronouns il, elle, ils and elles, corresponding to masculine and feminine singular and plural, respectively; the impersonal pronoun ce/c’, which occurs in some very frequent constructions such as c’est (it is); and a sixth class OTHER, which indicates that none of these pronouns was used. In general, a pronoun may be aligned to multiple words; in this case, a training example is counted as a positive example for a class if the target word occurs among the words aligned to the pronoun, irrespective of the presence of other aligned tokens.

[Figure 2: Antecedent feature aggregation]

This task setup resembles the problem that an SMT system would have to solve to make informed choices when translating pronouns, an aspect of translation neglected by most existing SMT systems. An important difference between the SMT setup and our own classifiers is that we use context from human-made translations for prediction. This potentially makes the task both easier and more difficult; easier, because the context can be relied on to be correctly translated, and more difficult, because human translators frequently create less literal translations than an SMT system would. Integrating pronoun prediction into the translation process would require significant changes to the standard SMT decoding setup in order to take long-range dependencies in the target language into account, which is why we do not address this issue in our current work. In all the experiments presented in this paper, we used features from two different sources: Anaphora context features describe the source language pronoun and its immediate context consisting of three words to its left and three words to its right.
They are encoded as vectors whose dimensionality is equal to the source vocabulary size with a single non-zero component indicating the word referred to (one-hot vectors). Antecedent features describe an antecedent candidate. Antecedent candidates are represented by the target language words aligned to the syntactic head of the source language markable noun phrase as identified by the Collins head finder (Collins, 1999). The different handling of anaphora context features and antecedent features is due to the fact that we always consider a constant number of context words on the source side, whereas the number of word vectors to be considered depends on the number of antecedent candidates and on the number of target words aligned to each antecedent. The encoding of the antecedent features is illustrated in Figure 2 for a training example with two antecedent candidates translated to elle and la version, respectively. The target words are represented as one-hot vectors with the dimensionality of the target language vocabulary. These vectors are then averaged to yield a single vector per antecedent candidate. Finally, the vectors of all candidates for a given training example are weighted by the probabilities assigned to them by the anaphora resolver (p1 and p2) and summed to yield a single vector per training example.

Table 1: Distribution of classes in the training data

Class   TED      News
ce      16.3 %    6.4 %
elle     7.1 %   10.1 %
elles    3.0 %    3.9 %
il      17.1 %   26.5 %
ils     15.6 %   15.1 %
OTHER   40.9 %   38.0 %

3 Data Sets and External Tools

We run experiments with two different test sets. The TED data set consists of around 2.6 million tokens of lecture subtitles released in the WIT3 corpus (Cettolo et al., 2012). The WIT3 training data yields 71,052 examples, which were randomly partitioned into a training set of 63,228 examples and a test set of 7,824 examples. The official WIT3 development and test sets were not used in our experiments.
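The antecedent feature aggregation described for Figure 2 (one-hot vectors averaged per candidate, then weighted by the resolver's probabilities and summed) can be sketched in a few lines. The three-word vocabulary, the function names, and the probabilities 0.9 and 0.1 are hypothetical values chosen for illustration; only the two candidate translations, elle and la version, come from the text.

```python
from typing import List, Sequence, Tuple


def one_hot(word: str, vocab: Sequence[str]) -> List[float]:
    """Vector with a single non-zero component at the position of `word`."""
    return [1.0 if entry == word else 0.0 for entry in vocab]


def average(vectors: List[List[float]]) -> List[float]:
    """Component-wise mean of equally sized vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def aggregate(candidates: List[Tuple[List[str], float]], vocab: Sequence[str]) -> List[float]:
    """Sum the per-candidate averages, each weighted by its antecedent probability."""
    total = [0.0] * len(vocab)
    for aligned_words, probability in candidates:
        candidate_vector = average([one_hot(w, vocab) for w in aligned_words])
        total = [t + probability * c for t, c in zip(total, candidate_vector)]
    return total


vocab = ["elle", "la", "version"]                        # toy target vocabulary
candidates = [(["elle"], 0.9), (["la", "version"], 0.1)]  # (aligned words, hypothetical p1 / p2)
feature_vector = aggregate(candidates, vocab)
```

The result has one component per target vocabulary word regardless of how many candidates or aligned words there are, which is the point of the aggregation: the classifier always sees a fixed-size antecedent input.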
The news-commentary data set is version 6 of the parallel news-commentary corpus released as a part of the WMT 2011 training data (footnote 1: http://www.statmt.org/wmt11/translation-task.html, 3 July 2013). It contains around 2.8 million tokens of news text and yields 31,017 data points, which were randomly split into 27,900 training examples and 3,117 test instances. The distribution of the classes in the two training sets is shown in Table 1. One thing to note is the dominance of the OTHER class, which pools together such different phenomena as translations with other pronouns not in our list (e.g., celui-ci) and translations with full noun phrases instead of pronouns. Splitting this group into more meaningful subcategories is not straightforward and must be left to future work.

The feature setup of all our classifiers requires the detection of potential antecedents and the extraction of features pairing anaphoric pronouns with antecedent candidates. Some of our experiments also rely on an external anaphora resolution component. We use the open-source anaphora resolver BART to generate this information. BART (Broscheit et al., 2010) is an anaphora resolution toolkit consisting of a markable detection and feature extraction pipeline based on a variety of standard natural language processing (NLP) tools and a machine learning component to predict coreference links including both pronominal anaphora and noun-noun coreference. In our experiments, we always use BART’s markable detection and feature extraction machinery. Markable detection is based on the identification of noun phrases in constituency parses generated with the Stanford parser (Klein and Manning, 2003). The set of features extracted by BART is an extension of the widely used mention-pair anaphora resolution feature set by Soon et al. (2001) (see below, Section 6). In the experiments of the next two sections, we also use BART to predict anaphoric links for pronouns.
The model used with BART is a maximum entropy ranker trained on the ACE02-npaper corpus (LDC2003T11). In order to obtain a probability distribution over antecedent candidates rather than one-best predictions or coreference sets, we modified the ranking component with which BART resolves pronouns to normalise and output the scores assigned by the ranker to all candidates instead of picking the highest-scoring candidate.

4 Baseline Classifiers

In order to create a simple, but reasonable baseline for our task, we trained a maximum entropy (ME) classifier with the MegaM software package (footnote 2) using the features described in the previous section and the anaphora links found by BART. Results are shown in Table 2. The baseline results show an overall higher accuracy for the TED data than for the news-commentary data. While the precision is above 50 % in all categories and considerably higher in some, recall varies widely.

Table 2: Maximum entropy classifier results

TED (Accuracy: 0.685)
Class   P      R      F
ce      0.593  0.728  0.654
elle    0.798  0.523  0.632
elles   0.812  0.164  0.273
il      0.764  0.550  0.639
ils     0.632  0.949  0.759
OTHER   0.724  0.692  0.708

News commentary (Accuracy: 0.576)
Class   P      R      F
ce      0.508  0.294  0.373
elle    0.530  0.312  0.393
elles   0.538  0.062  0.111
il      0.600  0.666  0.631
ils     0.593  0.769  0.670
OTHER   0.564  0.609  0.586

Table 3: Neural network classifier with anaphoras resolved by BART

TED (Accuracy: 0.700)
Class   P      R      F
ce      0.634  0.747  0.686
elle    0.756  0.617  0.679
elles   0.679  0.319  0.434
il      0.719  0.591  0.649
ils     0.663  0.940  0.778
OTHER   0.743  0.678  0.709

News commentary (Accuracy: 0.576)
Class   P      R      F
ce      0.477  0.344  0.400
elle    0.498  0.401  0.444
elles   0.565  0.116  0.193
il      0.655  0.626  0.640
ils     0.570  0.834  0.677
OTHER   0.567  0.573  0.570

The pronoun elles is particularly interesting. This is the feminine plural of the personal pronoun, and it usually corresponds to the English pronoun they, which is not marked for gender.
In French, elles is a marked choice which is only used if the antecedent exclusively refers to females or feminine-gendered objects. The presence of a single item with masculine grammatical gender in the antecedent will trigger the use of the masculine plural pronoun ils instead. This distinction cannot be predicted from the English source pronoun or its context; making correct predictions requires knowledge about the antecedent of the pronoun. Moreover, elles is a low-frequency pronoun. There are only 1,909 occurrences of this pronoun in the TED training data, and 1,077 in the news-commentary training set. Because of these special properties of the feminine plural class, we argue that the performance of a classifier on elles is a good indicator of how well it can represent relevant knowledge about pronominal anaphora as opposed to overfitting to source contexts or acting on prior assumptions about class frequencies. In accordance with the general linguistic preference for ils, the classifier tends to predict ils much more often than elles when encountering an English plural pronoun. This is reflected in the fact that elles has much lower recall than ils. Clearly, the classifier achieves a good part of its accuracy by making majority choices without exploiting deeper knowledge about the antecedents of pronouns.

Footnote 2 (MegaM): http://www.umiacs.umd.edu/~hal/megam/ (20 June 2013).

An additional experiment with a subset of 27,900 training examples from the TED data confirms that the difference between TED and news commentaries is not just an effect of training data size, but that TED data is genuinely easier to predict than news commentaries. In the reduced data TED condition, the classifier achieves an accuracy of 0.673.
Precision and recall of all classifiers are much closer to the large-data TED condition than to the news commentary experiments, except for elles, where we obtain an F-score of 0.072 (P 0.818, R 0.038), indicating that small training data size is a serious problem for this low-frequency class.

5 Neural Network Classifier

In the previous section, we saw that a simple multiclass maximum entropy classifier, while making correct predictions for much of the data set, has a significant bias towards making majority class decisions, relying more on prior assumptions about the frequency distribution of the classes than on antecedent features when handling examples of less frequent classes. In order to create a system that can be trained to rely more explicitly on antecedent information, we created a neural network classifier for our task. The introduction of a hidden layer should enable the classifier to learn abstract concepts such as gender and number that are useful across multiple output categories, so that the performance of sparsely represented classes can benefit from the training examples of the more frequent classes.

[Figure 3: Neural network for pronoun prediction]

The overall structure of the network is shown in Figure 3. As inputs, the network takes the same features that were available to the baseline ME classifier, based on the source pronoun (P) with three words of context to its left (L1 to L3) and three words to its right (R1 to R3) as well as the words aligned to the syntactic head words of all possible antecedent candidates as found by BART (A). All words are encoded as one-hot vectors whose dimensionality is equal to the vocabulary size. If multiple words are aligned to the syntactic head of an antecedent candidate, their word vectors are averaged with uniform weights. The resulting vectors for each antecedent are then averaged with weights defined by the posterior distribution of the anaphora resolver in BART (p1 to p3).
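The antecedent input described above (uniform averaging of the aligned word vectors, then a probability-weighted sum over candidates) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names, the three-word toy vocabulary and the resolver probabilities 0.8 and 0.2 are hypothetical and are not taken from the paper.

```python
def one_hot(index, size):
    """Return a one-hot vector of the given size."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def average(vectors):
    """Component-wise mean of equally long vectors (uniform weights)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def aggregate_antecedents(candidates, probs, vocab):
    """Combine all antecedent candidates into one input vector.

    candidates: one list of aligned target words per antecedent candidate
    probs:      one resolver probability per candidate (assumed to sum to 1)
    vocab:      hypothetical mapping from target word to vector index
    """
    size = len(vocab)
    result = [0.0] * size
    for words, p in zip(candidates, probs):
        # Average the one-hot vectors of the words aligned to this candidate.
        candidate_vec = average([one_hot(vocab[w], size) for w in words])
        # Weight by the resolver probability and add to the running sum.
        result = [r + p * c for r, c in zip(result, candidate_vec)]
    return result


# Toy example with two candidates translated as "elle" and "la version".
vocab = {"elle": 0, "la": 1, "version": 2}
features = aggregate_antecedents([["elle"], ["la", "version"]], [0.8, 0.2], vocab)
```

Because every candidate vector sums to one and the weights are probabilities, the aggregated vector also sums to one, whatever the number of candidates.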
The network has two hidden layers. The first layer (E) maps the input word vectors to a low-dimensional representation. In this layer, the embedding weights for all the source language vectors (the pronoun and its 6 context words) are tied, so if two words are the same, they are mapped to the same lower-dimensional embedding irrespective of their position relative to the pronoun. The embedding of the antecedent word vectors is independent, as these word vectors represent target language words. The entire embedding layer is then mapped to another hidden layer (H), which is in turn connected to a softmax output layer (S) with 6 outputs representing the classes ce, elle, elles, il, ils and OTHER. The non-linearity of both hidden layers is the logistic sigmoid function, f(x) = 1/(1 + e^(-x)). In all experiments reported in this paper, the dimensionality of the source and target language word embeddings is 20, resulting in a total embedding layer size of 160 (seven source words and one antecedent vector, 20 dimensions each), and the size of the last hidden layer is equal to 50. These sizes are fairly small. In experiments with larger layer sizes, we were able to obtain similar, but no better results.

The neural network is trained with mini-batch stochastic gradient descent with backpropagated gradients using the RMSPROP algorithm with cross-entropy as the objective function (footnote 3). In contrast to standard gradient descent, RMSPROP normalises the magnitude of the gradient components by dividing them by a root-mean-square moving average. We found this led to faster convergence. Other features of our training algorithm include the use of momentum to even out gradient oscillations, adaptive learning rates for each weight as well as adaptation of the global learning rate as a function of current training progress. The network is regularised with an ℓ2 weight penalty. Good settings of the initial learning rate and the weight cost parameter (both around 0.001 in most experiments) were found by manual experimentation.
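The layer structure described above can be made concrete with a small forward-pass sketch. It is a simplified illustration, not the authors' implementation: bias terms are omitted, the weights are random, the vocabulary sizes (30 and 25) are made up, and all function names are hypothetical. Only the layer sizes (20-dimensional embeddings, 160 embedding units, 50 hidden units, 6 outputs), the tied source embedding and the sigmoid and softmax non-linearities come from the text.

```python
import math
import random

CLASSES = ["ce", "elle", "elles", "il", "ils", "OTHER"]
EMB, HID = 20, 50  # embedding size and hidden layer size given in the text


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def forward(source_ids, antecedent_vec, params):
    """source_ids: indices of the 7 source words (L3, L2, L1, P, R1, R2, R3)."""
    src_emb, tgt_emb, w_hidden, w_out = params
    layer_e = []
    for idx in source_ids:
        # A one-hot input selects one column of the embedding matrix; the
        # same matrix is reused for every source position (tied weights).
        layer_e += [sigmoid(row[idx]) for row in src_emb]
    # The antecedent input is a weighted average, so a full product is needed.
    layer_e += [sigmoid(x) for x in matvec(tgt_emb, antecedent_vec)]
    layer_h = [sigmoid(x) for x in matvec(w_hidden, layer_e)]
    return softmax(matvec(w_out, layer_h))


rng = random.Random(0)
SRC_VOCAB, TGT_VOCAB = 30, 25  # hypothetical toy vocabulary sizes
params = (
    rand_matrix(EMB, SRC_VOCAB, rng),     # tied source embedding
    rand_matrix(EMB, TGT_VOCAB, rng),     # independent target embedding
    rand_matrix(HID, 8 * EMB, rng),       # embedding layer E (160) -> H (50)
    rand_matrix(len(CLASSES), HID, rng),  # H (50) -> softmax S (6)
)
antecedent = [0.0] * TGT_VOCAB
antecedent[3] = 1.0
probs = forward([1, 2, 3, 4, 5, 6, 7], antecedent, params)
```

The result `probs` is a distribution over the six output classes. Training (RMSPROP, momentum, the weight penalty) is deliberately left out of this sketch.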
Generally, we train our networks for 300 epochs, compute the validation error on a held-out set of some 10 % of the training data after each epoch and use the model that achieved the lowest validation error for testing.

Footnote 3: Our training procedure is greatly inspired by a series of online lectures held by Geoffrey Hinton in 2012 (https://www.coursera.org/course/neuralnets, 10 September 2013).

Since the source context features are very informative and it is comparatively more difficult to learn from the antecedents, the network sometimes had a tendency to overfit to the source features and disregard antecedent information. We found that this problem can be solved effectively by presenting a part of the training data without any source features, forcing the network to learn from the information contained in the antecedents. In all experiments in this paper, we zero out all source features (input layers P, L1 to L3 and R1 to R3) with a probability of 50 % in each training example. At test time, no information is zeroed out.

Classification results with this network are shown in Table 3. We note that the accuracy has increased slightly for the TED test set and remains exactly the same for the news commentary corpus. However, a closer look at the results for individual classes reveals that the neural network makes better predictions for almost all classes. In terms of F-score, the only class that becomes slightly worse is the OTHER class for the news commentary corpus because of lower recall, indicating that the neural network classifier is less biased towards using the uninformative OTHER category. Recall for elle and elles increases considerably, but especially for elles it is still quite low. The increase in recall comes with some loss in precision, but the net effect on F-score is clearly positive.
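The source-feature zeroing used during training in this section can be sketched as follows. This is a minimal illustration with hypothetical names and toy three-dimensional vectors; only the 50 % probability and the fact that nothing is zeroed at test time come from the text.

```python
import random


def mask_source_features(source_vectors, rng, drop_prob=0.5, training=True):
    """Return the source vectors, or all-zero copies with probability drop_prob.

    Masking is applied per training example and never at test time.
    """
    if training and rng.random() < drop_prob:
        return [[0.0] * len(vec) for vec in source_vectors]
    return source_vectors


rng = random.Random(0)
# Seven toy one-hot vectors standing in for P, L1 to L3 and R1 to R3.
source = [[0.0, 1.0, 0.0] for _ in range(7)]

# Count how often the source side is blanked out over 1000 training draws.
masked_count = sum(
    1
    for _ in range(1000)
    if all(v == 0.0 for vec in mask_source_features(source, rng) for v in vec)
)

# At test time the input is passed through unchanged.
at_test_time = mask_source_features(source, rng, training=False)
```

In a masked example the only informative input left is the antecedent vector, which is what pushes the network to use it.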
6 Latent Anaphora Resolution

Considering Figure 1 again, we note that the bilingual setting of our classification task adds some information not available to the monolingual anaphora resolver that can be helpful when determining the correct antecedent for a given pronoun. Knowing the gender of the translation of a pronoun limits the set of possible antecedents to those whose translation is morphologically compatible with the target language pronoun. We can exploit this fact to learn how to resolve anaphoric pronouns without requiring data with manually annotated anaphoric links.

To achieve this, we extend our neural network with a component to predict the probability that each antecedent candidate is the correct antecedent (Figure 4). The extended network is identical to the previous version except for the upper left part dealing with anaphoric link features. The only difference between the two networks is the fact that anaphora resolution is now performed by a part of our neural network itself instead of being done by an external module and provided to the classifier as an input.

In this setup, we still use some parts of the BART toolkit to extract markables and compute features. However, we do not make use of the machine learning component in BART that makes the actual predictions. Since this is the only component trained on coreference-annotated data in a typical BART configuration, no coreference annotations are used anywhere in our system even though we continue to rely on the external anaphora resolver for preprocessing to avoid implementing our own markable and feature extractors and to make comparison easier.

For each candidate markable identified by BART’s preprocessing pipeline, the anaphora resolution model receives as input a link feature vector (T) describing relevant aspects of the antecedent candidate-anaphora pair.
This feature vector is generated by the feature extraction machinery in BART and includes a standard feature set for coreference resolution partially based on work by Soon et al. (2001). We use the following feature extractors in BART, each of which can generate multiple features:

- Anaphora mention type
- Gender match
- Number match
- String match
- Alias feature (Soon et al., 2001)
- Appositive position feature (Soon et al., 2001)
- Semantic class (Soon et al., 2001)
- Semantic class match
- Binary distance feature
- Antecedent is first mention in sentence

[Figure 4: Neural network with latent anaphora resolution]

Our baseline set of features was borrowed wholesale from a working coreference system and includes some features that are not relevant to the task at hand, e.g., features indicating that the anaphora is a pronoun, is not a named entity, etc. After removing all features that assume constant values in the training set when resolving antecedents for the set of pronouns we consider, we are left with a basic set of 37 anaphoric link features that are fed as inputs to our network. These features are exactly the same as those available to the anaphora resolution classifier in the BART system used in the previous section.

Each training example for our network can have an arbitrary number of antecedent candidates, each of which is described by an antecedent word vector (A) and by an anaphoric link vector (T). The anaphoric link features are first mapped to a regular hidden layer with logistic sigmoid units (U). The activations of the hidden units are then mapped to a single value, which functions as an element in a softmax layer over all antecedent candidates (V). This softmax layer assigns a probability to each antecedent candidate, which we then use to compute a weighted average over the antecedent word vectors, replacing the probabilities pi in Figures 2 and 3.
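The scoring path from link features (T) through the hidden layer (U) to the softmax over candidates (V) can be sketched as below. This is an illustration under assumptions rather than the authors' code: the hidden layer size of 10, the random weights, the toy data and all names are hypothetical, and biases are omitted. Only the 37 link features, the sigmoid hidden layer, the single score per candidate and the softmax-weighted average come from the text.

```python
import math
import random

N_LINK = 37   # number of anaphoric link features stated in the text
N_HIDDEN = 10  # size of layer U; an assumption, not given in the text


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def resolve(link_vectors, word_vectors, w_u, w_v):
    """Score every candidate with shared weights and mix the word vectors.

    link_vectors: one link feature vector (T) per antecedent candidate
    word_vectors: one antecedent word vector (A) per candidate
    w_u, w_v:     weights shared between all candidates
    """
    scores = []
    for t in link_vectors:
        hidden = [sigmoid(sum(w * x for w, x in zip(row, t))) for row in w_u]
        scores.append(sum(w * h for w, h in zip(w_v, hidden)))
    probs = softmax(scores)  # layer V: one probability per candidate
    size = len(word_vectors[0])
    mixed = [
        sum(p * vec[j] for p, vec in zip(probs, word_vectors))
        for j in range(size)
    ]
    return probs, mixed


rng = random.Random(0)
w_u = [[rng.uniform(-0.5, 0.5) for _ in range(N_LINK)] for _ in range(N_HIDDEN)]
w_v = [rng.uniform(-0.5, 0.5) for _ in range(N_HIDDEN)]

# Three candidates with random binary link features and toy one-hot word vectors.
links = [[float(rng.randint(0, 1)) for _ in range(N_LINK)] for _ in range(3)]
words = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
probs, mixed = resolve(links, words, w_u, w_v)

# The same weights handle an example with a different number of candidates.
probs2, mixed2 = resolve(links[:2], words[:2], w_u, w_v)
```

Because `w_u` and `w_v` are reused for every candidate, the number of parameters does not depend on how many candidates an example has.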
At training time, the network’s anaphora resolution component is trained in exactly the same way as the rest of the network. The error signal from the embedding layer is backpropagated both to the weight matrix defining the antecedent word embedding and to the anaphora resolution subnetwork. Note that the number of weights in the network is the same for all training examples even though the number of antecedent candidates varies because all weights related to antecedent word features and anaphoric link features are shared between all antecedent candidates.

One slightly uncommon feature of our neural network is that it contains an internal softmax layer to generate normalised probabilities over all possible antecedent candidates. Moreover, weights are shared between all antecedent candidates, so the inputs of our internal softmax layer share dependencies on the same weight variables. When computing derivatives with backpropagation, these shared dependencies must be taken into account. In particular, the outputs y_i of the antecedent resolution layer are the result of a softmax applied to functions of some shared variables q:

$$y_i = \frac{\exp f_i(q)}{\sum_k \exp f_k(q)} \tag{1}$$

The derivatives of any y_i with respect to q, which can be any of the weights in the anaphora resolution subnetwork, have dependencies on the derivatives of the other softmax inputs with respect to q:

$$\frac{\partial y_i}{\partial q} = y_i \left( \frac{\partial f_i(q)}{\partial q} - \sum_k y_k \frac{\partial f_k(q)}{\partial q} \right) \tag{2}$$

This makes the implementation of backpropagation for this part of the network somewhat more complicated, but in the case of our networks, it has no major impact on training time.

Experimental results for this network are shown in Table 4. Compared with Table 3, we note that the overall accuracy is only very slightly lower for TED, and for the news commentaries it is actually better. When it comes to F-scores, the performance for elles improves by a small amount, while the effect on the other classes is a bit more mixed.
Even where it gets worse, the differences are not dramatic considering that we eliminated a very knowledge-rich resource from the training process. This demonstrates that it is possible, in our classification task, to obtain good results without using any data manually annotated for anaphora and to rely entirely on unsupervised latent anaphora resolution.

7 Further Improvements

The results presented in the preceding section represent a clear improvement over the ME classifiers in Table 2, even though the overall accuracy increased only slightly. Not only does our neural network classifier achieve better results on the classification task at hand without requiring an anaphora resolution classifier trained on manually annotated data, but it performs clearly better for the feminine categories that reflect minority choices requiring knowledge about the antecedents. Nevertheless, the performance is still not entirely satisfactory.

By subjecting the output of our classifier on a development set to a manual error analysis, we found that a fairly large number of errors belong to two error types. On the one hand, the preprocessing pipeline used to identify antecedent candidates does not always include the correct antecedent in the set presented to the neural network. Whenever this occurs, the classifier cannot possibly find the correct antecedent. Out of 76 examples of the category elles that had been mistakenly predicted as ils, we found that 43 suffered from this problem. In other classes, the problem seems to be somewhat less common, but it still exists. On the other hand, in many cases (23 out of 76 for the category mentioned before) the anaphora resolution subnetwork does identify an antecedent manually recognised to belong to the right gender/number group, but still predicts an incorrect pronoun. This may indicate that the network has difficulties learning a correct gender/number representation for all words in the vocabulary.
7.1 Relaxing Markable Extraction

The pipeline we use to extract potential antecedent candidates is borrowed from the BART anaphora resolution toolkit. BART uses a syntactic parser to identify noun phrases as markables. When extracting antecedent candidates for coreference prediction, it starts by considering a window consisting of the sentence in which the anaphoric pronoun is located and the two immediately preceding sentences. Markables in this window are checked for morphological compatibility in terms of gender and number with the anaphoric pronoun, and only compatible markables are extracted as antecedent candidates. If no compatible markables are found in the initial window, the window is successively enlarged one sentence at a time until at least one suitable markable is found.

Our error analysis shows that this procedure misses some relevant markables both because the initial two-sentence extraction window is too small and because the morphological compatibility check incorrectly filters away some markables that should have been considered as candidates. By contrast, the extraction procedure does extract quite a number of first and second person noun phrases (I, we, you and their oblique forms) in the TED talks which are extremely unlikely to be the antecedent of a later occurrence of he, she, it or they. As a first step, we therefore adjust the extraction criteria to our task by increasing the initial extraction window to five sentences, excluding first and second person markables and removing the morphological compatibility requirement. The compatibility check is still used to control expansion of the extraction window, but it is no longer applied to filter the extracted markables.
This increases the accuracy to 0.701 for TED and 0.602 for the news commentaries, while the performance for elles improves to F-scores of 0.531 (TED; P 0.690, R 0.432) and 0.304 (News commentaries; P 0.444, R 0.231), respectively. Note that these and all the following results are not directly comparable to the ME baseline results in Table 2, since they include modifications and improvements to the training data extraction procedure that might possibly lead to benefits in the ME setting as well.

Table 4: Neural network classifier with latent anaphora resolution

TED (Accuracy: 0.696)
Class   P      R      F
ce      0.618  0.722  0.666
elle    0.754  0.548  0.635
elles   0.737  0.340  0.465
il      0.718  0.629  0.670
ils     0.652  0.916  0.761
OTHER   0.741  0.682  0.711

News commentary (Accuracy: 0.597)
Class   P      R      F
ce      0.419  0.368  0.392
elle    0.547  0.460  0.500
elles   0.539  0.135  0.215
il      0.623  0.719  0.667
ils     0.596  0.783  0.677
OTHER   0.614  0.544  0.577

Table 5: Final classifier results

TED (Accuracy: 0.713)
Class   P      R      F
ce      0.611  0.723  0.662
elle    0.749  0.596  0.664
elles   0.602  0.616  0.609
il      0.733  0.638  0.682
ils     0.710  0.884  0.788
OTHER   0.760  0.704  0.731

News commentary (Accuracy: 0.626)
Class   P      R      F
ce      0.492  0.324  0.391
elle    0.526  0.439  0.478
elles   0.547  0.558  0.552
il      0.599  0.757  0.669
ils     0.671  0.878  0.761
OTHER   0.681  0.526  0.594

7.2 Adding Lexicon Knowledge

In order to make it easier for the classifier to identify the gender and number properties of infrequent words, we extend the word vectors with features indicating possible morphological features for each word. In early experiments with ME classifiers, we found that our attempts to do proper gender and number tagging in French text did not improve classification performance noticeably, presumably because the annotation was too noisy. In more recent experiments, we just add features indicating all possible morphological interpretations of each word, rather than trying to disambiguate them.
To do this, we look up the morphological annotations of the French words in the Lefff dictionary (Sagot et al., 2006) and introduce a set of new binary features to indicate whether a particular reading of a word occurs in that dictionary. These features are then added to the one-hot representation of the antecedent words. Doing so improves the classifier accuracy to 0.711 (TED) and 0.604 (News commentaries), while the F-scores for elles reach 0.589 (TED; P 0.649, R 0.539) and 0.500 (News commentaries; P 0.545, R 0.462), respectively.

7.3 More Anaphoric Link Features

Even though the modified antecedent candidate extraction with its larger context window and without the morphological filter results in better performance on both test sets, additional error analysis reveals that the classifier has greater problems identifying the correct markable in this setting. One reason for this may be that the baseline anaphoric link feature set described above (Section 6) only includes two very rough binary distance features which indicate whether or not the anaphora and the antecedent candidate occur in the same or in immediately adjacent sentences. With the larger context window, this may be too unspecific. In our final experiment, we therefore enable some additional features which are available in BART, but disabled in the baseline system:

- Distance in number of markables
- Distance in number of sentences
- Sentence distance, log-transformed
- Distance in number of words
- Part of speech of head word

Most of these encode the distance between the anaphora and the antecedent candidate in more precise ways. Complete results for this final system are presented in Table 5. Including these additional features leads to another slight increase in accuracy for both corpora, with similar or increased classifier F-scores for most classes except elle in the news commentary condition.
In particular, we should like to point out the performance of our benchmark classifier for elles, which suffered from extremely low recall in the first classifiers and approaches the performance of the other classes, with nearly balanced precision and recall, in this final system. Since elles is a low-frequency class and cannot be reliably predicted using source context alone, we interpret this as evidence that our final neural network classifier has incorporated some relevant knowledge about pronominal anaphora that the baseline ME classifier and earlier versions of our network have no access to. This is particularly remarkable because no data manually annotated for coreference was used for training.

8 Related work

Even though it was recognised years ago that the information contained in parallel corpora may be valuable for improving anaphora resolution systems, there have not been many attempts to exploit this insight. Mitkov and Barbu (2003) exploit parallel data in English and French to improve pronominal anaphora resolution by combining anaphora resolvers for the individual languages with handwritten rules to resolve conflicts between the output of the language-specific resolvers. Veselovská et al. (2012) apply a similar strategy to English-Czech data to resolve different uses of the pronoun it. Other work has used word alignments to project coreference annotations from one language to another with a view to training anaphora resolvers in the target language (Postolache et al., 2006; de Souza and Orăsan, 2011). Rahman and Ng (2012) instead use machine translation to translate their test data into a language for which they have an anaphora resolver and then project the annotations back to the original language. Completely unsupervised monolingual anaphora resolution has been approached using, e.g., Markov logic (Poon and Domingos, 2008) and the Expectation-Maximisation algorithm (Cherry and Bergsma, 2005; Charniak and Elsner, 2009). To the best of our knowledge, the direct application of machine learning techniques to parallel data in a task related to anaphora resolution is novel in our work.

Neural networks and deep learning techniques have recently gained some popularity in natural language processing. They have been applied to tasks such as language modelling (Bengio et al., 2003; Schwenk, 2007), translation modelling in statistical machine translation (Le et al., 2012), but also part-of-speech tagging, chunking, named entity recognition and semantic role labelling (Collobert et al., 2011). In tasks related to anaphora resolution, standard feedforward neural networks have been tested as a classifier in an anaphora resolution system (Stuckardt, 2007), but the network design presented in our work is novel.

9 Conclusion

In this paper, we have introduced cross-lingual pronoun prediction as an independent natural language processing task. Even though it is not an end-to-end task, pronoun prediction is interesting for several reasons. It is related to the problem of pronoun translation in SMT, a currently unsolved problem that has been addressed in a number of recent research publications (Le Nagard and Koehn, 2010; Hardmeier and Federico, 2010; Guillou, 2012) without reaching a major breakthrough. In this work, we have shown that pronoun prediction can be effectively modelled in a neural network architecture with relatively simple features. More importantly, we have demonstrated that the task can be exploited to train a classifier with a latent representation of anaphoric links. With parallel text as its only supervision, this classifier achieves a level of performance that is similar to, if not better than, that of a classifier using a regular anaphora resolution system trained with manually annotated data.
References

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155.
Samuel Broscheit, Massimo Poesio, Simone Paolo Ponzetto, Kepa Joseba Rodriguez, Lorenza Romano, Olga Uryupina, Yannick Versley, and Roberto Zanoli. 2010. BART: A multilingual anaphora resolution system. In Proceedings of the 5th International Workshop on Semantic Evaluations (SemEval-2010), Uppsala, Sweden, 15–16 July 2010.
Mauro Cettolo, Christian Girardi, and Marcello Federico. 2012. WIT3: Web inventory of transcribed and translated talks. In Proceedings of the 16th Conference of the European Association for Machine Translation (EAMT), pages 261–268, Trento, Italy.
Eugene Charniak and Micha Elsner. 2009. EM works for pronoun anaphora resolution. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pages 148–156, Athens, Greece.
Colin Cherry and Shane Bergsma. 2005. An Expectation Maximization approach to pronoun resolution. In Proceedings of the Ninth Conference on Computational Natural Language Learning (CoNLL-2005), pages 88–95, Ann Arbor, Michigan.
Michael Collins. 1999. Head-Driven Statistical Models for Natural Language Parsing. Ph.D. thesis, University of Pennsylvania.
Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2461–2505.
José de Souza and Constantin Orăsan. 2011. Can projected chains in parallel corpora help coreference resolution? In Iris Hendrickx, Sobha Lalitha Devi, António Branco, and Ruslan Mitkov, editors, Anaphora Processing and Applications, volume 7099 of Lecture Notes in Computer Science, pages 59–69. Springer, Berlin.
Liane Guillou. 2012. Improving pronoun translation for statistical machine translation. In Proceedings of the Student Research Workshop at the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 1–10, Avignon, France.
Christian Hardmeier and Marcello Federico. 2010. Modelling pronominal anaphora in statistical machine translation. In Proceedings of the seventh International Workshop on Spoken Language Translation (IWSLT), pages 283–289, Paris, France.
Christian Hardmeier. 2012. Discourse in statistical machine translation: A survey and a case study. Discours, 11.
Dan Klein and Christopher D. Manning. 2003. Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 423–430, Sapporo, Japan.
Hai-Son Le, Alexandre Allauzen, and François Yvon. 2012. Continuous space translation models with neural networks. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 39–48, Montréal, Canada.
Ronan Le Nagard and Philipp Koehn. 2010. Aiding pronoun translation with co-reference resolution. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, pages 252–261, Uppsala, Sweden.
Ruslan Mitkov and Catalina Barbu. 2003. Using bilingual corpora to improve pronoun resolution. Languages in Contrast, 4(2):201–211.
Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29:19–51.
Hoifung Poon and Pedro Domingos. 2008. Joint unsupervised coreference resolution with Markov Logic. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 650–659, Honolulu, Hawaii.
Oana Postolache, Dan Cristea, and Constantin Orăsan. 2006. Transferring coreference chains through word alignment. In Proceedings of the 5th Conference on International Language Resources and Evaluation (LREC-2006), pages 889–892, Genoa.
Altaf Rahman and Vincent Ng. 2012. Translation-based projection for multilingual coreference resolution. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 720–730, Montréal, Canada.
Benoît Sagot, Lionel Clément, Éric Villemonte de La Clergerie, and Pierre Boullier. 2006. The Lefff 2 syntactic lexicon for French: architecture, acquisition, use. In Proceedings of the 5th Conference on International Language Resources and Evaluation (LREC-2006), pages 1348–1351, Genoa.
Holger Schwenk. 2007. Continuous space language models. Computer Speech and Language, 21(3):492–518.
Wee Meng Soon, Hwee Tou Ng, and Daniel Chung Yong Lim. 2001. A machine learning approach to coreference resolution of noun phrases. Computational Linguistics, 27(4):521–544.
Roland Stuckardt. 2007. Applying backpropagation networks to anaphor resolution. In António Branco, editor, Anaphora: Analysis, Algorithms and Applications. 6th Discourse Anaphora and Anaphor Resolution Colloquium, DAARC 2007, number 4410 in Lecture Notes in Artificial Intelligence, pages 107–124, Berlin.
Kateřina Veselovská, Ngụy Giang Linh, and Michal Novák. 2012. Using Czech-English parallel corpora in automatic identification of it. In Proceedings of the 5th Workshop on Building and Using Comparable Corpora, pages 112–120, Istanbul, Turkey.

6 0.22624208 112 emnlp-2013-Joint Coreference Resolution and Named-Entity Linking with Multi-Pass Sieves

7 0.18324457 45 emnlp-2013-Chinese Zero Pronoun Resolution: Some Recent Advances

8 0.10604194 160 emnlp-2013-Relational Inference for Wikification

9 0.07896962 108 emnlp-2013-Interpreting Anaphoric Shell Nouns using Antecedents of Cataphoric Shell Nouns as Training Data

10 0.075896688 56 emnlp-2013-Deep Learning for Chinese Word Segmentation and POS Tagging

11 0.074984185 82 emnlp-2013-Exploring Representations from Unlabeled Data with Co-training for Chinese Word Segmentation

12 0.065926954 62 emnlp-2013-Detection of Product Comparisons - How Far Does an Out-of-the-Box Semantic Role Labeling System Take You?

13 0.064218014 63 emnlp-2013-Discourse Level Explanatory Relation Extraction from Product Reviews Using First-Order Logic

14 0.059977829 88 emnlp-2013-Flexible and Efficient Hypergraph Interactions for Joint Hierarchical and Forest-to-String Decoding

15 0.057779543 43 emnlp-2013-Cascading Collective Classification for Bridging Anaphora Recognition using a Rich Linguistic Feature Set

16 0.056371782 118 emnlp-2013-Learning Biological Processes with Global Constraints

17 0.055298664 75 emnlp-2013-Event Schema Induction with a Probabilistic Entity-Driven Model

18 0.05368796 69 emnlp-2013-Efficient Collective Entity Linking with Stacking

19 0.053527374 47 emnlp-2013-Collective Opinion Target Extraction in Chinese Microblogs

20 0.051957533 89 emnlp-2013-Gender Inference of Twitter Users in Non-English Contexts

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.204), (1, 0.257), (2, 0.485), (3, -0.192), (4, 0.045), (5, -0.095), (6, 0.039), (7, -0.104), (8, -0.128), (9, 0.039), (10, -0.089), (11, 0.022), (12, 0.071), (13, -0.013), (14, 0.101), (15, 0.089), (16, -0.006), (17, -0.009), (18, 0.002), (19, -0.058), (20, 0.032), (21, -0.037), (22, -0.038), (23, -0.015), (24, -0.083), (25, -0.034), (26, -0.038), (27, 0.082), (28, 0.005), (29, -0.015), (30, -0.015), (31, 0.018), (32, -0.016), (33, -0.028), (34, 0.005), (35, -0.066), (36, -0.033), (37, -0.02), (38, -0.013), (39, 0.015), (40, -0.013), (41, 0.0), (42, 0.015), (43, -0.058), (44, -0.024), (45, -0.066), (46, 0.041), (47, -0.005), (48, 0.035), (49, 0.034)]
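
The page does not say how simValue is derived from these topic weights. A common choice for comparing topic vectors is cosine similarity, sketched below as an illustration only. The function names and the second vector are hypothetical, and the scores listed on this page may have been computed differently.

```python
import ast
import math


def parse_topic_vector(text):
    """Parse a '[(topicId, topicWeight), ...]' string into a {topicId: weight} dict."""
    return {int(topic): float(weight) for topic, weight in ast.literal_eval(text)}


def cosine_similarity(u, v):
    """Cosine similarity of two sparse vectors stored as dicts (assumed metric)."""
    dot = sum(weight * v[topic] for topic, weight in u.items() if topic in v)
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an all-zero vector has no direction to compare
    return dot / (norm_u * norm_v)


# First three LSI weights listed above for this paper.
this_paper = parse_topic_vector("[(0, -0.204), (1, 0.257), (2, 0.485)]")
# Hypothetical weights for another paper, made up for illustration.
other_paper = parse_topic_vector("[(0, -0.1), (2, 0.4)]")

similarity = cosine_similarity(this_paper, other_paper)
```

Storing each vector as a dict keyed by topicId lets the same code handle the dense LSI list and the sparse LDA list, where most topics are absent.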

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.98660976 80 emnlp-2013-Exploiting Zero Pronouns to Improve Chinese Coreference Resolution

Author: Fang Kong ; Hwee Tou Ng

2 0.88227129 67 emnlp-2013-Easy Victories and Uphill Battles in Coreference Resolution

Author: Greg Durrett ; Dan Klein

3 0.83927232 73 emnlp-2013-Error-Driven Analysis of Challenges in Coreference Resolution

Author: Jonathan K. Kummerfeld ; Dan Klein

4 0.7837339 1 emnlp-2013-A Constrained Latent Variable Model for Coreference Resolution

Author: Kai-Wei Chang ; Rajhans Samdani ; Dan Roth

5 0.76024389 45 emnlp-2013-Chinese Zero Pronoun Resolution: Some Recent Advances

Author: Chen Chen ; Vincent Ng

Abstract: We extend Zhao and Ng's (2007) Chinese anaphoric zero pronoun resolver by (1) using a richer set of features and (2) exploiting the coreference links between zero pronouns during resolution. Results on OntoNotes show that our approach significantly outperforms two state-of-the-art anaphoric zero pronoun resolvers. To our knowledge, this is the first work to report results obtained by an end-to-end Chinese zero pronoun resolver.

6 0.74157393 112 emnlp-2013-Joint Coreference Resolution and Named-Entity Linking with Multi-Pass Sieves

7 0.5945102 117 emnlp-2013-Latent Anaphora Resolution for Cross-Lingual Pronoun Prediction

8 0.41942105 108 emnlp-2013-Interpreting Anaphoric Shell Nouns using Antecedents of Cataphoric Shell Nouns as Training Data

9 0.35848632 23 emnlp-2013-Animacy Detection with Voting Models

10 0.28715605 160 emnlp-2013-Relational Inference for Wikification

11 0.28593203 43 emnlp-2013-Cascading Collective Classification for Bridging Anaphora Recognition using a Rich Linguistic Feature Set

12 0.22468352 62 emnlp-2013-Detection of Product Comparisons - How Far Does an Out-of-the-Box Semantic Role Labeling System Take You?

13 0.22414002 111 emnlp-2013-Joint Chinese Word Segmentation and POS Tagging on Heterogeneous Annotated Corpora with Multiple Task Learning

14 0.21921599 114 emnlp-2013-Joint Learning and Inference for Grammatical Error Correction

15 0.21760562 153 emnlp-2013-Predicting the Resolution of Referring Expressions from User Behavior

16 0.21401559 82 emnlp-2013-Exploring Representations from Unlabeled Data with Co-training for Chinese Word Segmentation

17 0.20624147 35 emnlp-2013-Automatically Detecting and Attributing Indirect Quotations

18 0.20155515 56 emnlp-2013-Deep Learning for Chinese Word Segmentation and POS Tagging

19 0.19992441 188 emnlp-2013-Tree Kernel-based Negation and Speculation Scope Detection with Structured Syntactic Parse Features

20 0.19803259 69 emnlp-2013-Efficient Collective Entity Linking with Stacking

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.022), (18, 0.024), (22, 0.04), (30, 0.083), (39, 0.18), (45, 0.014), (50, 0.028), (51, 0.138), (58, 0.033), (66, 0.042), (71, 0.039), (75, 0.117), (77, 0.015), (96, 0.099)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.80854297 80 emnlp-2013-Exploiting Zero Pronouns to Improve Chinese Coreference Resolution

Author: Fang Kong ; Hwee Tou Ng

2 0.72615731 31 emnlp-2013-Automatic Feature Engineering for Answer Selection and Extraction

Author: Aliaksei Severyn ; Alessandro Moschitti

Abstract: This paper proposes a framework for automatically engineering features for two important tasks of question answering: answer sentence selection and answer extraction. We represent question and answer sentence pairs with linguistic structures enriched by semantic information, where the latter is produced by automatic classifiers, e.g., question classifier and Named Entity Recognizer. Tree kernels applied to such structures enable a simple way to generate highly discriminative structural features that combine syntactic and semantic information encoded in the input trees. We conduct experiments on a public benchmark from TREC to compare with previous systems for answer sentence selection and answer extraction. The results show that our models greatly improve on the state of the art, e.g., up to 22% on F1 (relative improvement) for answer extraction, while using no additional resources and no manual feature engineering.

3 0.72424322 21 emnlp-2013-An Empirical Study Of Semi-Supervised Chinese Word Segmentation Using Co-Training

Author: Fan Yang ; Paul Vozila

Abstract: In this paper we report an empirical study on semi-supervised Chinese word segmentation using co-training. We utilize two segmenters: 1) a word-based segmenter leveraging a word-level language model, and 2) a character-based segmenter using character-level features within a CRF-based sequence labeler. These two segmenters are initially trained with a small amount of segmented data, and then iteratively improve each other using the large amount of unlabelled data. Our experimental results show that co-training captures 20% and 31% of the performance improvement achieved by supervised training with an order of magnitude more data for the SIGHAN Bakeoff 2005 PKU and CU corpora respectively.

4 0.72163945 64 emnlp-2013-Discriminative Improvements to Distributional Sentence Similarity

Author: Yangfeng Ji ; Jacob Eisenstein

Abstract: Matrix and tensor factorization have been applied to a number of semantic relatedness tasks, including paraphrase identification. The key idea is that similarity in the latent space implies semantic relatedness. We describe three ways in which labeled data can improve the accuracy of these approaches on paraphrase classification. First, we design a new discriminative term-weighting metric called TF-KLD, which outperforms TF-IDF. Next, we show that using the latent representation from matrix factorization as features in a classification algorithm substantially improves accuracy. Finally, we combine latent features with fine-grained n-gram overlap features, yielding performance that is 3% more accurate than the prior state-of-the-art.

5 0.70948011 93 emnlp-2013-Harvesting Parallel News Streams to Generate Paraphrases of Event Relations

Author: Congle Zhang ; Daniel S. Weld

Abstract: The distributional hypothesis, which states that words that occur in similar contexts tend to have similar meanings, has inspired several Web mining algorithms for paraphrasing semantically equivalent phrases. Unfortunately, these methods have several drawbacks, such as confusing synonyms with antonyms and causes with effects. This paper introduces three Temporal Correspondence Heuristics that characterize regularities in parallel news streams, and shows how they may be used to generate high precision paraphrases for event relations. We encode the heuristics in a probabilistic graphical model to create the NEWSSPIKE algorithm for mining news streams. We present experiments demonstrating that NEWSSPIKE significantly outperforms several competitive baselines. In order to spur further research, we provide a large annotated corpus of timestamped news articles as well as the paraphrases produced by NEWSSPIKE.

6 0.70565265 117 emnlp-2013-Latent Anaphora Resolution for Cross-Lingual Pronoun Prediction

7 0.69990301 16 emnlp-2013-A Unified Model for Topics, Events and Users on Twitter

8 0.69956285 7 emnlp-2013-A Hierarchical Entity-Based Approach to Structuralize User Generated Content in Social Media: A Case of Yahoo! Answers

9 0.69750088 48 emnlp-2013-Collective Personal Profile Summarization with Social Networks

10 0.69609785 147 emnlp-2013-Optimized Event Storyline Generation based on Mixture-Event-Aspect Model

11 0.69443911 17 emnlp-2013-A Walk-Based Semantically Enriched Tree Kernel Over Distributed Word Representations

12 0.69283253 112 emnlp-2013-Joint Coreference Resolution and Named-Entity Linking with Multi-Pass Sieves

13 0.6717546 194 emnlp-2013-Unsupervised Relation Extraction with General Domain Knowledge

14 0.67079216 45 emnlp-2013-Chinese Zero Pronoun Resolution: Some Recent Advances

15 0.66832888 56 emnlp-2013-Deep Learning for Chinese Word Segmentation and POS Tagging

16 0.66310573 179 emnlp-2013-Summarizing Complex Events: a Cross-Modal Solution of Storylines Extraction and Reconstruction

17 0.66228497 65 emnlp-2013-Document Summarization via Guided Sentence Compression

18 0.66125685 67 emnlp-2013-Easy Victories and Uphill Battles in Coreference Resolution

19 0.65743428 18 emnlp-2013-A temporal model of text periodicities using Gaussian Processes

20 0.65683758 193 emnlp-2013-Unsupervised Induction of Cross-Lingual Semantic Relations