emnlp emnlp2013 emnlp2013-193 knowledge-graph by maker-knowledge-mining

193 emnlp-2013-Unsupervised Induction of Cross-Lingual Semantic Relations

Source: pdf

Author: Mike Lewis ; Mark Steedman

Abstract: Creating a language-independent meaning representation would benefit many crosslingual NLP tasks. We introduce the first unsupervised approach to this problem, learning clusters of semantically equivalent English and French relations between referring expressions, based on their named-entity arguments in large monolingual corpora. The clusters can be used as language-independent semantic relations, by mapping clustered expressions in different languages onto the same relation. Our approach needs no parallel text for training, but outperforms a baseline that uses machine translation on a cross-lingual question answering task. We also show how to use the semantics to improve the accuracy of machine translation, by using it in a simple reranker.

Reference: text

Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 Unsupervised Induction of Cross-lingual Semantic Relations Mike Lewis School of Informatics University of Edinburgh Edinburgh, EH8 9AB, UK mike . [sent-1, score-0.076]

2 uk Abstract Creating a language-independent meaning representation would benefit many crosslingual NLP tasks. [sent-4, score-0.299]

3 We introduce the first unsupervised approach to this problem, learning clusters of semantically equivalent English and French relations between referring expressions, based on their named-entity arguments in large monolingual corpora. [sent-5, score-0.657]

4 The clusters can be used as language-independent semantic relations, by mapping clustered expressions in different languages onto the same relation. [sent-6, score-0.542]

5 Our approach needs no parallel text for training, but outperforms a baseline that uses machine translation on a cross-lingual question answering task. [sent-7, score-0.43]

6 We also show how to use the semantics to improve the accuracy of machine translation, by using it in a simple reranker. [sent-8, score-0.115]

7 1 Introduction Identifying a language-independent semantics is a major long term goal of computational linguistics, and is interesting both theoretically and for practical applications. [sent-9, score-0.161]

8 It assumes that semantically equivalent sentences in any language can be mapped onto a common meaning representation. [sent-10, score-0.416]

9 Such a representation would be of great utility for tasks such as translation, relation extraction, summarization, question answering, and information retrieval. [sent-11, score-0.179]

10 Regardless of whether it is even possible to create such a semantics, we show that an incomplete version can be useful for downstream tasks. [sent-12, score-0.043]

11 Semantic machine translation aims to map a source language to a language-independent meaning representation, and then generate the target language 681 Mark Steedman School of Informatics University of Edinburgh Edinburgh, EH8 9AB, UK steedman@ inf . [sent-13, score-0.362]

12 It is hoped this would alleviate the difficulties of simpler models when translating between languages with very different word ordering and syntax (Vauquois, 1968). [sent-17, score-0.3]

13 Despite many attempts to define interlingual representations (Mitamura et al. [sent-18, score-0.12]

14 , 2013), state-of-the-art machine translation still uses phrase-based models (Koehn et al. [sent-21, score-0.139]

15 The major obstacle to defining interlinguas has been devising a meaning representation that is languageindependent, but capable of expressing the limitless number of meanings that natural languages can express (Dorr et al. [sent-23, score-0.572]

16 Our approach avoids this problem by utilizing the methods of distributional semantics. [sent-25, score-0.126]

17 Recent work has shown that paraphrases of expressions can be learned by clustering those with similar arguments (Poon and Domingos, 2009; Yao et al. [sent-26, score-0.23]

18 , 2011; Lewis and Steedman, 2013)—for example learning that X wrote Y and X is the author of Y are equivalent if they appear in a corpus with similar (X, Y) argument- pairs such as {(Shakespeare, Macbeth), (Dickens, pOaliivrser s Twist)}. [sent-27, score-0.229]

19 WShea keexstepnedar teh,i sM to tbheteh multilingual case, aiming }to. [sent-28, score-0.092]

20 a Wlseo map dth teh iFsre tnoc thh equivalents a Xl a ´e crit Y and Y est un roman de X on to the same cluster as the English paraphrases. [sent-29, score-0.532]

21 Conceptually, we treat a foreign expression as a paraphrase of an English expression. [sent-30, score-0.075]

22 The cluster identifier can be used as a predicate in a logical form, suggesting that the fundamental predicates of an interlingua can be learnt in an unsupervised manner via clustering. [sent-31, score-0.845]

23 In this paper we focus on learning binary relations between named entities. [sent-32, score-0.164]

24 This problem is much simpler than attempting complete interlingual semantic ProceSe datintlges, o Wfa tsh ein 2g01to3n, C UoSnfAe,re 1n8c-e2 o1n O Ecmtopbier ic 2a0l1 M3. [sent-33, score-0.268]

25 This class of expressions has proved extremely useful in the monolingual case, with direct applications for question answering and relation extraction (Poon and Domingos, 2009; Mintz et al. [sent-36, score-0.522]

26 It is important to be able to extract knowledge across languages, as many facts will not be expressed in all languages—either due to less- complete encyclopedias being available in some languages, or facts being most relevant to a single country. [sent-38, score-0.208]

27 In contrast to most previous work on machine translation and cross-lingual clustering, our method requires no parallel text (see Section 8 for discussion of some exceptions). [sent-39, score-0.236]

28 The limited size of parallel corpora is a significant bottleneck for machine translation (Resnik and Smith, 2003), whereas our approach can be used on much larger monolingual corpora. [sent-41, score-0.469]

29 This means it is potentially useful for language-pairs where little parallel text is available, for domain adaptation, or for semisupervised approaches. [sent-42, score-0.138]

30 2 Basic Approach Our work builds on clustering-based approaches to monolingual distributional semantics, aiming to create clusters of semantically equivalent predicates, based on their arguments in a corpus. [sent-43, score-0.707]

31 In each language, we first map each sentence in a large monolingual corpus onto a simple logical form, by extracting binary predicates between named entities. [sent-44, score-0.989]

32 Then, we cluster predicates both within and between languages into those with similar arguments. [sent-45, score-0.574]

33 When parsing a new sentence, instead of using the monolingual predicate, we use the cluster identifier as a language-independent semantic relation, as shown in Figure 1. [sent-46, score-0.444]

34 The resulting logical form can be used for inference in question answering. [sent-47, score-0.239]

35 Unlike traditional approaches to translation, this does not require parallel text—but it does impose some additional constraints on language resources. [sent-48, score-0.139]

36 Our approach requires: • A large amount of factual text, as we rely on tAhe l same fmacotus being expressed i,n a sdi wffeere rently yla onnguages. [sent-49, score-0.06]

37 • A method for extracting binary relations from sAen mteentchoesd. [sent-52, score-0.213]

38 • 3 A method for linking entities in the training dAata m ettoh some c lainnkoninicgal e representation. [sent-57, score-0.037]

39 Predicate Extraction Our method relies on extracting binary predicates between entities from sentences. [sent-62, score-0.488]

40 Various representations have been suggested for binary predicates, such as Reverb patterns (Fader et al. [sent-63, score-0.075]

41 , 2011), dependency paths (Lin and Pantel, 2001 ; Yao et al. [sent-64, score-0.038]

42 , 2011), and binarized predicate-argument relations derived from a CCG-parse (Lewis and Steedman, 2013). [sent-65, score-0.143]

43 Our approach is formalism-independent, and is compatible with any method of expressing binary predicates. [sent-66, score-0.186]

44 We choose the CCG-based parser of Lewis and Steedman (2013) for several reasons. [sent-67, score-0.038]

45 It outputs a logical form derived automatically from the CCG-parse, containing predicates such as: writearg0,arg1(shakespeare,macbeth). [sent-68, score-0.493]

46 Using a dependency-based representation, these would have different predicates, which would need to be clustered later. [sent-70, score-0.071]

47 CCG also has a well developed theory of operator semantics (Steedman, 2012), so is able to represent semantic operators such as quantifiers, negation and tense— understanding these is crucial to high performance on question answering or translation tasks. [sent-71, score-0.619]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('steedman', 0.339), ('predicates', 0.327), ('macbeth', 0.243), ('shakespeare', 0.24), ('lewis', 0.197), ('monolingual', 0.185), ('logical', 0.166), ('languages', 0.149), ('translation', 0.139), ('edinburgh', 0.123), ('equivalent', 0.122), ('answering', 0.121), ('interlingual', 0.12), ('semantics', 0.115), ('identifier', 0.113), ('wrote', 0.107), ('semantically', 0.104), ('expressions', 0.102), ('onto', 0.1), ('ccg', 0.099), ('fader', 0.099), ('uk', 0.098), ('predicate', 0.098), ('cluster', 0.098), ('parallel', 0.097), ('aiming', 0.092), ('domingos', 0.092), ('poon', 0.092), ('meaning', 0.09), ('relations', 0.089), ('map', 0.087), ('arguments', 0.085), ('yao', 0.082), ('mike', 0.076), ('binary', 0.075), ('expressing', 0.074), ('question', 0.073), ('facts', 0.072), ('informatics', 0.072), ('clusters', 0.072), ('clustered', 0.071), ('crit', 0.07), ('twist', 0.07), ('limitless', 0.07), ('representation', 0.065), ('encyclopedias', 0.064), ('devising', 0.064), ('obstacle', 0.06), ('equivalents', 0.06), ('thh', 0.06), ('factual', 0.06), ('mintz', 0.06), ('roman', 0.06), ('quantifiers', 0.057), ('languageindependent', 0.054), ('binarized', 0.054), ('dth', 0.051), ('difficulties', 0.051), ('conceptually', 0.051), ('tahe', 0.051), ('xl', 0.051), ('reverb', 0.051), ('simpler', 0.051), ('school', 0.05), ('attempting', 0.049), ('syntax', 0.049), ('extracting', 0.049), ('semantic', 0.048), ('bottleneck', 0.048), ('dorr', 0.048), ('distributional', 0.047), ('june', 0.046), ('inf', 0.046), ('crosslingual', 0.046), ('sm', 0.046), ('theoretically', 0.046), ('est', 0.046), ('learnt', 0.043), ('downstream', 0.043), ('paraphrases', 0.043), ('conjunctions', 0.043), ('impose', 0.042), ('resnik', 0.042), ('negation', 0.042), ('avoids', 0.041), ('clauses', 0.041), ('operator', 0.041), ('semisupervised', 0.041), ('relation', 0.041), ('operators', 0.04), ('newswire', 0.039), ('tense', 0.039), ('utilizing', 0.038), ('exceptions', 0.038), ('paths', 0.038), ('foreign', 0.038), ('parser', 0.038), ('entities', 0.037), ('compatible', 0.037), ('paraphrase', 0.037)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 193 emnlp-2013-Unsupervised Induction of Cross-Lingual Semantic Relations

Author: Mike Lewis ; Mark Steedman

2 0.23744793 166 emnlp-2013-Semantic Parsing on Freebase from Question-Answer Pairs

Author: Jonathan Berant ; Andrew Chou ; Roy Frostig ; Percy Liang

Abstract: In this paper, we train a semantic parser that scales up to Freebase. Instead of relying on annotated logical forms, which is especially expensive to obtain at large scale, we learn from question-answer pairs. The main challenge in this setting is narrowing down the huge number of possible logical predicates for a given question. We tackle this problem in two ways: First, we build a coarse mapping from phrases to predicates using a knowledge base and a large text corpus. Second, we use a bridging operation to generate additional predicates based on neighboring predicates. On the dataset ofCai and Yates (2013), despite not having annotated logical forms, our system outperforms their state-of-the-art parser. Additionally, we collected a more realistic and challenging dataset of question-answer pairs and improves over a natural baseline.

3 0.1789142 62 emnlp-2013-Detection of Product Comparisons - How Far Does an Out-of-the-Box Semantic Role Labeling System Take You?

Author: Wiltrud Kessler ; Jonas Kuhn

Abstract: This short paper presents a pilot study investigating the training of a standard Semantic Role Labeling (SRL) system on product reviews for the new task of detecting comparisons. An (opinionated) comparison consists of a comparative “predicate” and up to three “arguments”: the entity evaluated positively, the entity evaluated negatively, and the aspect under which the comparison is made. In user-generated product reviews, the “predicate” and “arguments” are expressed in highly heterogeneous ways; but since the elements are textually annotated in existing datasets, SRL is technically applicable. We address the interesting question how well training an outof-the-box SRL model works for English data. We observe that even without any feature engineering or other major adaptions to our task, the system outperforms a reasonable heuristic baseline in all steps (predicate identification, argument identification and argument classification) and in three different datasets.

4 0.16983914 164 emnlp-2013-Scaling Semantic Parsers with On-the-Fly Ontology Matching

Author: Tom Kwiatkowski ; Eunsol Choi ; Yoav Artzi ; Luke Zettlemoyer

Abstract: We consider the challenge of learning semantic parsers that scale to large, open-domain problems, such as question answering with Freebase. In such settings, the sentences cover a wide variety of topics and include many phrases whose meaning is difficult to represent in a fixed target ontology. For example, even simple phrases such as ‘daughter’ and ‘number of people living in’ cannot be directly represented in Freebase, whose ontology instead encodes facts about gender, parenthood, and population. In this paper, we introduce a new semantic parsing approach that learns to resolve such ontological mismatches. The parser is learned from question-answer pairs, uses a probabilistic CCG to build linguistically motivated logicalform meaning representations, and includes an ontology matching model that adapts the output logical forms for each target ontology. Experiments demonstrate state-of-the-art performance on two benchmark semantic parsing datasets, including a nine point accuracy improvement on a recent Freebase QA corpus.

5 0.16600871 119 emnlp-2013-Learning Distributions over Logical Forms for Referring Expression Generation

Author: Nicholas FitzGerald ; Yoav Artzi ; Luke Zettlemoyer

Abstract: We present a new approach to referring expression generation, casting it as a density estimation problem where the goal is to learn distributions over logical expressions identifying sets of objects in the world. Despite an extremely large space of possible expressions, we demonstrate effective learning of a globally normalized log-linear distribution. This learning is enabled by a new, multi-stage approximate inference technique that uses a pruning model to construct only the most likely logical forms. We train and evaluate the approach on a new corpus of references to sets of visual objects. Experiments show the approach is able to learn accurate models, which generate over 87% of the expressions people used. Additionally, on the previously studied special case of single object reference, we show a 35% relative error reduction over previous state of the art.

6 0.16067423 12 emnlp-2013-A Semantically Enhanced Approach to Determine Textual Similarity

7 0.10508352 183 emnlp-2013-The VerbCorner Project: Toward an Empirically-Based Semantic Decomposition of Verbs

8 0.10318698 93 emnlp-2013-Harvesting Parallel News Streams to Generate Paraphrases of Event Relations

9 0.097558312 194 emnlp-2013-Unsupervised Relation Extraction with General Domain Knowledge

10 0.094876781 169 emnlp-2013-Semi-Supervised Representation Learning for Cross-Lingual Text Classification

11 0.086135834 63 emnlp-2013-Discourse Level Explanatory Relation Extraction from Product Reviews Using First-Order Logic

12 0.084721774 135 emnlp-2013-Monolingual Marginal Matching for Translation Model Adaptation

13 0.082726866 57 emnlp-2013-Dependency-Based Decipherment for Resource-Limited Machine Translation

14 0.081831656 84 emnlp-2013-Factored Soft Source Syntactic Constraints for Hierarchical Machine Translation

15 0.08043737 96 emnlp-2013-Identifying Phrasal Verbs Using Many Bilingual Corpora

16 0.075719386 104 emnlp-2013-Improving Statistical Machine Translation with Word Class Models

17 0.07277929 64 emnlp-2013-Discriminative Improvements to Distributional Sentence Similarity

18 0.072142616 167 emnlp-2013-Semi-Markov Phrase-Based Monolingual Alignment

19 0.071420424 127 emnlp-2013-Max-Margin Synchronous Grammar Induction for Machine Translation

20 0.069410749 136 emnlp-2013-Multi-Domain Adaptation for SMT Using Multi-Task Learning

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.234), (1, -0.007), (2, 0.001), (3, 0.068), (4, 0.003), (5, 0.242), (6, -0.171), (7, -0.019), (8, 0.175), (9, -0.011), (10, 0.131), (11, -0.103), (12, 0.275), (13, -0.065), (14, 0.085), (15, 0.12), (16, -0.038), (17, -0.042), (18, -0.071), (19, 0.003), (20, -0.043), (21, 0.021), (22, -0.016), (23, -0.087), (24, -0.022), (25, 0.052), (26, -0.152), (27, 0.079), (28, 0.083), (29, 0.112), (30, 0.027), (31, -0.061), (32, 0.006), (33, -0.089), (34, -0.041), (35, -0.09), (36, 0.058), (37, -0.02), (38, -0.017), (39, 0.048), (40, -0.058), (41, -0.072), (42, 0.062), (43, 0.012), (44, -0.011), (45, -0.03), (46, -0.029), (47, 0.011), (48, 0.005), (49, -0.045)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95894533 193 emnlp-2013-Unsupervised Induction of Cross-Lingual Semantic Relations

Author: Mike Lewis ; Mark Steedman

2 0.71445435 12 emnlp-2013-A Semantically Enhanced Approach to Determine Textual Similarity

Author: Eduardo Blanco ; Dan Moldovan

Abstract: This paper presents a novel approach to determine textual similarity. A layered methodology to transform text into logic forms is proposed, and semantic features are derived from a logic prover. Experimental results show that incorporating the semantic structure of sentences is beneficial. When training data is unavailable, scores obtained from the logic prover in an unsupervised manner outperform supervised methods.

3 0.68210846 166 emnlp-2013-Semantic Parsing on Freebase from Question-Answer Pairs

Author: Jonathan Berant ; Andrew Chou ; Roy Frostig ; Percy Liang

4 0.63586932 164 emnlp-2013-Scaling Semantic Parsers with On-the-Fly Ontology Matching

Author: Tom Kwiatkowski ; Eunsol Choi ; Yoav Artzi ; Luke Zettlemoyer

5 0.59274477 62 emnlp-2013-Detection of Product Comparisons - How Far Does an Out-of-the-Box Semantic Role Labeling System Take You?

Author: Wiltrud Kessler ; Jonas Kuhn

6 0.58276165 183 emnlp-2013-The VerbCorner Project: Toward an Empirically-Based Semantic Decomposition of Verbs

7 0.52491432 119 emnlp-2013-Learning Distributions over Logical Forms for Referring Expression Generation

8 0.40483621 68 emnlp-2013-Effectiveness and Efficiency of Open Relation Extraction

9 0.3835429 63 emnlp-2013-Discourse Level Explanatory Relation Extraction from Product Reviews Using First-Order Logic

10 0.38298178 93 emnlp-2013-Harvesting Parallel News Streams to Generate Paraphrases of Event Relations

11 0.37534824 64 emnlp-2013-Discriminative Improvements to Distributional Sentence Similarity

12 0.36899146 194 emnlp-2013-Unsupervised Relation Extraction with General Domain Knowledge

13 0.35815334 57 emnlp-2013-Dependency-Based Decipherment for Resource-Limited Machine Translation

14 0.35345966 103 emnlp-2013-Improving Pivot-Based Statistical Machine Translation Using Random Walk

15 0.34362188 135 emnlp-2013-Monolingual Marginal Matching for Translation Model Adaptation

16 0.32793647 152 emnlp-2013-Predicting the Presence of Discourse Connectives

17 0.30977744 196 emnlp-2013-Using Crowdsourcing to get Representations based on Regular Expressions

18 0.29762015 169 emnlp-2013-Semi-Supervised Representation Learning for Cross-Lingual Text Classification

19 0.29754797 156 emnlp-2013-Recurrent Continuous Translation Models

20 0.29329211 136 emnlp-2013-Multi-Domain Adaptation for SMT Using Multi-Task Learning

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.034), (18, 0.075), (22, 0.041), (30, 0.068), (51, 0.251), (66, 0.024), (71, 0.02), (75, 0.105), (77, 0.049), (89, 0.211), (96, 0.048)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.87669909 193 emnlp-2013-Unsupervised Induction of Cross-Lingual Semantic Relations

Author: Mike Lewis ; Mark Steedman

2 0.85608876 179 emnlp-2013-Summarizing Complex Events: a Cross-Modal Solution of Storylines Extraction and Reconstruction

Author: Shize Xu ; Shanshan Wang ; Yan Zhang

Abstract: The rapid development of Web2.0 leads to significant information redundancy. Especially for a complex news event, it is difficult to understand its general idea within a single coherent picture. A complex event often contains branches, intertwining narratives and side news which are all called storylines. In this paper, we propose a novel solution to tackle the challenging problem of storylines extraction and reconstruction. Specifically, we first investigate two requisite properties of an ideal storyline. Then a unified algorithm is devised to extract all effective storylines by optimizing these properties at the same time. Finally, we reconstruct all extracted lines and generate the high-quality story map. Experiments on real-world datasets show that our method is quite efficient and highly competitive, which can bring about quicker, clearer and deeper comprehension to readers.

3 0.78800964 51 emnlp-2013-Connecting Language and Knowledge Bases with Embedding Models for Relation Extraction

Author: Jason Weston ; Antoine Bordes ; Oksana Yakhnenko ; Nicolas Usunier

Abstract: This paper proposes a novel approach for relation extraction from free text which is trained to jointly use information from the text and from existing knowledge. Our model is based on scoring functions that operate by learning low-dimensional embeddings of words, entities and relationships from a knowledge base. We empirically show on New York Times articles aligned with Freebase relations that our approach is able to efficiently use the extra information provided by a large subset of Freebase data (4M entities, 23k relationships) to improve over methods that rely on text features alone.

4 0.78648096 64 emnlp-2013-Discriminative Improvements to Distributional Sentence Similarity

Author: Yangfeng Ji ; Jacob Eisenstein

Abstract: Matrix and tensor factorization have been applied to a number of semantic relatedness tasks, including paraphrase identification. The key idea is that similarity in the latent space implies semantic relatedness. We describe three ways in which labeled data can improve the accuracy of these approaches on paraphrase classification. First, we design a new discriminative term-weighting metric called TF-KLD, which outperforms TF-IDF. Next, we show that using the latent representation from matrix factorization as features in a classification algorithm substantially improves accuracy. Finally, we combine latent features with fine-grained n-gram overlap features, yielding performance that is 3% more accurate than the prior state-of-the-art.

5 0.78253764 117 emnlp-2013-Latent Anaphora Resolution for Cross-Lingual Pronoun Prediction

Author: Christian Hardmeier ; Jorg Tiedemann ; Joakim Nivre

Abstract: This paper addresses the task of predicting the correct French translations of third-person subject pronouns in English discourse, a problem that is relevant as a prerequisite for machine translation and that requires anaphora resolution. We present an approach based on neural networks that models anaphoric links as latent variables and show that its performance is competitive with that of a system with separate anaphora resolution while not requiring any coreference-annotated training data. This demonstrates that the information contained in parallel bitexts can successfully be used to acquire knowledge about pronominal anaphora in an unsupervised way. 1 Motivation When texts are translated from one language into another, the translation reconstructs the meaning or function of the source text with the means of the target language. Generally, this has the effect that the entities occurring in the translation and their mutual relations will display similar patterns as the entities in the source text. In particular, coreference patterns tend to be very similar in translations of a text, and this fact has been exploited with good results to project coreference annotations from one language into another by using word alignments (Postolache et al., 2006; Rahman and Ng, 2012). On the other hand, what is true in general need not be true for all types of linguistic elements. For instance, a substantial percentage ofthe English thirdperson subject pronouns he, she, it and they does not get realised as pronouns in French translations (Hardmeier, 2012). Moreover, it has been recognised 380 by various authors in the statistical machine translation (SMT) community (Le Nagard and Koehn, 2010; Hardmeier and Federico, 2010; Guillou, 2012) that pronoun translation is a difficult problem because, even when a pronoun does get translated as a pronoun, it may require choosing the correct word form based on agreement features that are not easily pre- dictable from the source text. The work presented in this paper investigates the problem of cross-lingual pronoun prediction for English-French. Given an English pronoun and its discourse context as well as a French translation of the same discourse and word alignments between the two languages, we attempt to predict the French word aligned to the English pronoun. As far as we know, this task has not been addressed in the literature before. In our opinion, it is interesting for several reasons. By studying pronoun prediction as a task in its own right, we hope to contribute towards a better understanding of pronoun translation with a longterm view to improving the performance of SMT systems. Moreover, we believe that this task can lead to interesting insights about anaphora resolution in a multi-lingual context. In particular, we show in this paper that the pronoun prediction task makes it possible to model the resolution of pronominal anaphora as a latent variable and opens up a way to solve a task relying on anaphora resolution without using any data annotated for anaphora. This is what we consider the main contribution of our present work. We start by modelling cross-lingual pronoun pre- diction as an independent machine learning task after doing anaphora resolution in the source language (English) using the BART software (Broscheit et al., 2010). We show that it is difficult to achieve satisfactory performance with standard maximumProceSe datintlges, o Wfa tsh ein 2g01to3n, C UoSnfAe,re 1n8c-e2 o1n O Ecmtopbier ic 2a0l1 M3.et ?hc o2d0s1 i3n A Nsastoucria lti Loan fgoura Cgoem Ppruotcaetsiosin agl, L piang eusis 3t8ic0s–391, The latest version released in March is equipped with ...It is sold at ... La dernière version lancée en mars est dotée de ... • est vendue ... • Figure 1: Task setup entropy classifiers especially for low-frequency pronouns such as the French feminine plural pronoun elles. We propose a neural network classifier that achieves better precision and recall and manages to make reasonable predictions for all pronoun categories in many cases. We then go on to extend our neural network architecture to include anaphoric links as latent variables. We demonstrate that our classifier, now with its own source language anaphora resolver, can be trained successfully with backpropagation. In this setup, we no longer use the machine learning component included in the external coreference resolution system (BART) to predict anaphoric links. Anaphora resolution is done by our neural network classifier and requires only some quantity of word-aligned parallel data for training, completely obviating the need for a coreference-annotated training set. 2 Task Setup The overall setup of the classification task we address in this paper is shown in Figure 1. We are given an English discourse containing a pronoun along with its French translation and word alignments between the two languages, which in our case were computed automatically using a standard SMT pipeline with GIZA++ (Och and Ney, 2003). We focus on the four English third-person subject pronouns he, she, it and they. The output of the classifier is a multinomial distribution over six classes: the four French subject pronouns il, elle, ils and elles, corresponding to masculine and feminine singular and plural, respectively; the impersonal pronoun ce/c’, which occurs in some very frequent constructions such as c’est (it is); and a sixth class OTHER, which indicates that none of these pronouns was used. In general, a pronoun may be aligned to multiple words; in this case, a training example is counted as a positive example for a class if the target word occurs among the words aligned to the pronoun, irrespective of the presence of other 381 word candidate training ex. verseiol ena0 0 1 01 10 0 0 .0510 .50 p 12= . 910. 5.9 050 Figure 2: Antecedent feature aggregation aligned tokens. This task setup resembles the problem that an SMT system would have to solve to make informed choices when translating pronouns, an aspect oftranslation neglected by most existing SMT systems. An important difference between the SMT setup and our own classifiers is that we use context from humanmade translations for prediction. This potentially makes the task both easier and more difficult; easier, because the context can be relied on to be correctly translated, and more difficult, because human translators frequently create less literal translations than an SMT system would. Integrating pronoun prediction into the translation process would require significant changes to the standard SMT decoding setup in order to take long-range dependencies in the target language into account, which is why we do not address this issue in our current work. In all the experiments presented in this paper, we used features from two different sources: Anaphora context features describe the source language pronoun and its immediate context consisting of three words to its left and three words to its right. They are encoded as vectors whose dimensionality is equal to the source vocabulary size with a single non-zero component indicating the word referred to (one-hot vectors). Antecedent features describe an antecedent candidate. Antecedent candidates are represented by the target language words aligned to the syntactic head of the source language markable TED News ce 16.3 % 6.4 % elle 7.1 % 10.1 % elles 3.0 % 3.9 % il 17.1 % 26.5 % ils 15.6 % 15.1 % OTHER 40.9 % 38.0 % – – Table 1: Distribution of classes in the training data noun phrase as identified by the Collins head finder (Collins, 1999). The different handling of anaphora context features and antecedent features is due to the fact that we always consider a constant number of context words on the source side, whereas the number of word vectors to be considered depends on the number of antecedent candidates and on the number of target words aligned to each antecedent. The encoding of the antecedent features is illustrated in Figure 2 for a training example with two antecedent candidates translated to elle and la version, respectively. The target words are represented as one-hot vectors with the dimensionality of the target language vocabulary. These vectors are then averaged to yield a single vector per antecedent candidate. Finally, the vectors of all candidates for a given training example are weighted by the probabilities assigned to them by the anaphora resolver (p1 and p2) and summed to yield a single vector per training example. 3 Data Sets and External Tools We run experiments with two different test sets. The TED data set consists of around 2.6 million tokens of lecture subtitles released in the WIT3 corpus (Cettolo et al., 2012). The WIT3 training data yields 71,052 examples, which were randomly partitioned into a training set of 63,228 examples and a test set of 7,824 examples. The official WIT3 development and test sets were not used in our experiments. The news-commentary data set is version 6 of the parallel news-commentary corpus released as a part of the WMT 2011training data1 . It contains around 2.8 million tokens ofnews text and yields 3 1,017 data points, 1http: //www. statmt .org/wmt11/translation-task. html (3 July 2013). 382 which were randomly split into 27,900 training examples and 3,117 test instances. The distribution of the classes in the two training sets is shown in Table 1. One thing to note is the dominance of the OTHER class, which pools together such different phenomena as translations with other pronouns not in our list (e. g., celui-ci) and translations with full noun phrases instead of pronouns. Splitting this group into more meaningful subcategories is not straightforward and must be left to future work. The feature setup of all our classifiers requires the detection of potential antecedents and the extraction of features pairing anaphoric pronouns with antecedent candidates. Some of our experiments also rely on an external anaphora resolution component. We use the open-source anaphora resolver BART to generate this information. BART (Broscheit et al., 2010) is an anaphora resolution toolkit consisting of a markable detection and feature extraction pipeline based on a variety of standard natural language processing (NLP) tools and a machine learning component to predict coreference links including both pronominal anaphora and noun-noun coreference. In our experiments, we always use BART’s markable detection and feature extraction machinery. Markable detection is based on the identification of noun phrases in constituency parses generated with the Stanford parser (Klein and Manning, 2003). The set of features extracted by BART is an extension of the widely used mention-pair anaphora resolution feature set by Soon et al. (2001) (see below, Section 6). In the experiments of the next two sections, we also use BART to predict anaphoric links for pronouns. The model used with BART is a maximum entropy ranker trained on the ACE02-npaper corpus (LDC2003T1 1). In order to obtain a probability distribution over antecedent candidates rather than onebest predictions or coreference sets, we modified the ranking component with which BART resolves pronouns to normalise and output the scores assigned by the ranker to all candidates instead of picking the highest-scoring candidate. 4 Baseline Classifiers In order to create a simple, but reasonable baseline for our task, we trained a maximum entropy (ME) ce TED (Accuracy: 0.685) P R 0.593 0.728 F 0.654 elle 0.798 0.523 elles 0.812 0.164 il 0.764 0.550 ils 0.632 0.949 OTHER 0.724 0.692 News commentary (Accuracy: 0.576) ce elle elles il ils OTHER P 0.508 0.530 0.538 0.600 0.593 0.564 R 0.294 0.312 0.062 0.666 0.769 0.609 Table 2: Maximum entropy classifier results 0.632 0.273 0.639 0.759 0.708 F 0.373 0.393 0.111 0.631 0.670 0.586 TED (Accuracy: 0.700) P R ce 0.634 0.747 elle 0.756 0.617 elles 0.679 0.319 il 0.719 0.591 ils 0.663 0.940 OTHER 0.743 0.678 News commentary (Accuracy: 0.576) F 0.686 0.679 0.434 0.649 0.778 0.709 P 0.477 0.498 F 0.400 0.444 ce elle R 0.344 0.401 elles il ils OTHER 0.565 0.655 0.570 0.567 0.116 0.626 0.834 0.573 0.193 0.640 0.677 0.570 Table 3: Neural network classifier with anaphoras resolved by BART classifier with the MegaM software package2 using the features described in the previous section and the anaphora links found by BART. Results are shown in Table 2. The baseline results show an overall higher accuracy for the TED data than for the newscommentary data. While the precision is above 50 % in all categories and considerably higher in some, recall varies widely. The pronoun elles is particularly interesting. This is the feminine plural of the personal pronoun, and it usually corresponds to the English pronoun they, which is not marked for gender. In French, elles is a marked choice which is only used if the antecedent exclusively refers to females or feminine-gendered objects. The presence of a single item with masculine grammatical gender in the antecedent will trigger the use of the masculine plural pronoun ils instead. This distinction cannot be predicted from the English source pronoun or its context; making correct predictions requires knowledge about the antecedent of the pronoun. Moreover, elles is a low-frequency pronoun. There are only 1,909 occurrences of this pro2http : //www . umiacs .umd .edu/~hal/megam/ (20 June 2013). 383 noun in the TED training data, and 1,077 in the newscommentary training set. Because of these special properties of the feminine plural class, we argue that the performance of a classifier on elles is a good indicator ofhow well it can represent relevant knowledge about pronominal anaphora as opposed to overfitting to source contexts or acting on prior assumptions about class frequencies. In accordance with the general linguistic preference for ils, the classifier tends to predict ils much more often than elles when encountering an English plural pronoun. This is reflected in the fact that elles has much lower recall than ils. Clearly, the classifier achieves a good part of its accuracy by making ma- jority choices without exploiting deeper knowledge about the antecedents of pronouns. An additional experiment with a subset of 27,900 training examples from the TED data confirms that the difference between TED and news commentaries is not just an effect of training data size, but that TED data is genuinely easier to predict than news commentaries. In the reduced data TED condition, the classifier achieves an accuracy of 0.673. Precision and recall of all classifiers are much closer to the Figure 3: Neural network for pronoun prediction large-data TED condition than to the news commentary experiments, except for elles, where we obtain an F-score of 0.072 (P 0.818, R 0.038), indicating that small training data size is a serious problem for this low-frequency class. 5 Neural Network Classifier In the previous section, we saw that a simple multiclass maximum entropy classifier, while making correct predictions for much of the data set, has a significant bias towards making majority class decisions, relying more on prior assumptions about the frequency distribution of the classes than on antecedent features when handling examples of less frequent classes. In order to create a system that can be trained to rely more explicitly on antecedent information, we created a neural network classifier for our task. The introduction of a hidden layer should enable the classifier to learn abstract concepts such as gender and number that are useful across multiple output categories, so that the performance of sparsely represented classes can benefit from the training examples of the more frequent classes. The overall structure of the network is shown in Figure 3. As inputs, the network takes the same features that were available to the baseline ME classifier, based on the source pronoun (P) with three words of context to its left (L1 to L3) and three words to its right (R1 to R3) as well as the words aligned to the syntactic head words of all possible antecedent candidates as found by BART (A). All words are 384 encoded as one-hot vectors whose dimensionality is equal to the vocabulary size. If multiple words are aligned to the syntactic head of an antecedent candidate, their word vectors are averaged with uniform weights. The resulting vectors for each antecedent are then averaged with weights defined by the posterior distribution of the anaphora resolver in BART (p1 to p3). The network has two hidden layers. The first layer (E) maps the input word vectors to a low-dimensional representation. In this layer, the embedding weights for all the source language vectors (the pronoun and its 6 context words) are tied, so if two words are the same, they are mapped to the same lowerdimensional embedding irrespective of their position relative to the pronoun. The embedding of the antecedent word vectors is independent, as these word vectors represent target language words. The entire embedding layer is then mapped to another hidden layer (H), which is in turn connected to a softmax output layer (S) with 6 outputs representing the classes ce, elle, elles, il, ils and OTHER. The non-linearity of both hidden layers is the logistic sigmoid function, f(x) = 1/(1 + e−x). In all experiments reported in this paper, the dimensionality of the source and target language word embeddings is 20, resulting in a total embedding layer size of 160, and the size of the last hidden layer is equal to 50. These sizes are fairly small. In experiments with larger layer sizes, we were able to obtain similar, but no better results. The neural network is trained with mini-batch stochastic gradient descent with backpropagated gradients using the RMSPROP algorithm with crossentropy as the objective function.3 In contrast to standard gradient descent, RMSPROP normalises the magnitude of the gradient components by dividing them by a root-mean-square moving average. We found this led to faster convergence. Other features of our training algorithm include the use of momentum to even out gradient oscillations, adaptive learning rates for each weight as well as adaptation of the global learning rate as a function of current training progress. The network is regularised with an ‘2 weight penalty. Good settings of the initial learning rate and the weight cost parameter (both around 0.001 in most experiments) were found by manual experi- mentation. Generally, we train our networks for 300 epochs, compute the validation error on a held-out set of some 10 % of the training data after each epoch and use the model that achieved the lowest validation error for testing. Since the source context features are very informative and it is comparatively more difficult to learn from the antecedents, the network sometimes had a tendency to overfit to the source features and disregard antecedent information. We found that this problem can be solved effectively by presenting a part of the training without any source features, forcing the network to learn from the information contained in the antecedents. In all experiments in this paper, we zero out all source features (input layers P, L1to L3 and R1 to R3) with a probability of 50 % in each training example. At test time, no information is zeroed out. Classification results with this network are shown in Table 3. We note that the accuracy has increased slightly for the TED test set and remains exactly the same for the news commentary corpus. However, a closer look on the results for individual classes reveals that the neural network makes better predictions for almost all classes. In terms of F-score, the only class that becomes slightly worse is the OTHER class for the news commentary corpus because of lower recall, indicating that the neural network classifier is less biased towards using the uninformative OTHER 3Our training procedure is greatly inspired by a series of online lectures held by Geoffrey Hinton in 2012 (https : //www . coursera. .org/course/neuralnets, 10 September 2013). 385 category. Recall for elle and elles increases considerably, but especially for elles it is still quite low. The increase in recall comes with some loss in precision, but the net effect on F-score is clearly positive. 6 Latent Anaphora Resolution Considering Figure 1 again, we note that the bilingual setting of our classification task adds some information not available to the monolingual anaphora resolver that can be helpful when determining the correct antecedent for a given pronoun. Knowing the gender of the translation of a pronoun limits the set of possible antecedents to those whose translation is morphologically compatible with the target language pronoun. We can exploit this fact to learn how to resolve anaphoric pronouns without requiring data with manually annotated anaphoric links. To achieve this, we extend our neural network with a component to predict the probability of each antecedent candidate to be the correct antecedent (Figure 4). The extended network is identical to the previous version except for the upper left part dealing with anaphoric link features. The only difference between the two networks is the fact that anaphora resolution is now performed by a part of our neural network itself instead of being done by an external module and provided to the classifier as an input. In this setup, we still use some parts of the BART toolkit to extract markables and compute features. However, we do not make use of the machine learning component in BART that makes the actual predictions. Since this is the only component trained on coreference-annotated data in a typical BART configuration, no coreference annotations are used anywhere in our system even though we continue to rely on the external anaphora resolver for preprocessing to avoid implementing our own markable and feature extractors and to make comparison easier. For each candidate markable identified by BART’s preprocessing pipeline, the anaphora resolution model receives as input a link feature vector (T) describing relevant aspects of the antecedent candidateanaphora pair. This feature vector is generated by the feature extraction machinery in BART and includes a standard feature set for coreference resolution partially based on work by Soon et al. (2001). We use the following feature extractors in BART, each of Figure 4: Neural network with latent anaphora resolution which can generate multiple features: Anaphora mention type Gender match Number match String match Alias feature (Soon et al., 2001) Appositive position feature (Soon et al., 2001) Semantic class (Soon et al., 2001) – – – – – – – Semantic class match Binary distance feature Antecedent is first mention in sentence Our baseline set of features was borrowed wholesale from a working coreference system and includes some features that are not relevant to the task at hand, e. g., features indicating that the anaphora is a pronoun, is not a named entity, etc. After removing all features that assume constant values in the training set when resolving antecedents for the set of pronouns we consider, we are left with a basic set of 37 anaphoric link features that are fed as inputs to our network. These features are exactly the same as those available to the anaphora resolution classifier in the BART system used in the previous section. Each training example for our network can have an arbitrary number of antecedent candidates, each of which is described by an antecedent word vector (A) and by an anaphoric link vector (T). The anaphoric link features are first mapped to a regular hidden layer with logistic sigmoid units (U). The activations of the hidden units are then mapped to a single value, which – – – 386 functions as an element in a softmax layer over all an- tecedent candidates (V). This softmax layer assigns a probability to each antecedent candidate, which we then use to compute a weighted average over the antecedent word vector, replacing the probabilities pi in Figures 2 and 3. At training time, the network’s anaphora resolution component is trained in exactly the same way as the rest of the network. The error signal from the embedding layer is backpropagated both to the weight matrix defining the antecedent word embedding and to the anaphora resolution subnetwork. Note that the number of weights in the network is the same for all training examples even though the number of antecedent candidates varies because all weights related to antecedent word features and anaphoric link features are shared between all antecedent candidates. One slightly uncommon feature of our neural network is that it contains an internal softmax layer to generate normalised probabilities over all possible antecedent candidates. Moreover, weights are shared between all antecedent candidates, so the inputs of our internal softmax layer share dependencies on the same weight variables. When computing derivatives with backpropagation, these shared dependen- cies must be taken into account. In particular, the outputs yi ofthe antecedent resolution layer are the result of a softmax applied to functions of some shared variables q: yi=∑kexepxp fi( fkq()q) (1) The derivatives of any yi with respect to q, which can be any of the weights in the anaphora resolution subnetwork, have dependencies on the derivatives of the other softmax inputs with respect to q: ∂∂yqi= yi ∂ f∂i(qq)−∑kyk∂ f∂k(qq)! (2) This makes the implementation of backpropagation for this part of the network somewhat more complicated, but in the case of our networks, it has no major impact on training time. Experimental results for this network are shown in Table 4. Compared with Table 3, we note that the overall accuracy is only very slightly lower for TED, and for the news commentaries it is actually better. When it comes to F-scores, the performance for elles improves by a small amount, while the effect on the other classes is a bit more mixed. Even where it gets worse, the differences are not dramatic considering that we eliminated a very knowledge-rich resource from the training process. This demonstrates that it is possible, in our classification task, to obtain good results without using any data manually annotated for anaphora and to rely entirely on unsupervised latent anaphora resolution. 7 Further Improvements The results presented in the preceding section represent a clear improvement over the ME classifiers in Table 2, even though the overall accuracy increased only slightly. Not only does our neural network classifier achieve better results on the classification task at hand without requiring an anaphora resolution classifier trained on manually annotated data, but it performs clearly better for the feminine categories that reflect minority choices requiring knowledge about the antecedents. Nevertheless, the performance is still not entirely satisfactory. By subjecting the output of our classifier on a development set to a manual error analysis, we found that a fairly large number oferrors belong to two error types: On the one hand, the preprocessing pipeline used to identify antecedent candidates does not always include the correct antecedent in the set presented to the neural network. Whenever this occurs, it is obvious that the classifier cannot possibly find 387 the correct antecedent. Out of 76 examples of the category elles that had been mistakenly predicted as ils, we found that 43 suffered from this problem. In other classes, the problem seems to be somewhat less common, but it still exists. On the other hand, in many cases (23 out of 76 for the category mentioned before) the anaphora resolution subnetwork does identify an antecedent manually recognised to belong to the right gender/number group, but still predicts an incorrect pronoun. This may indicate that the network has difficulties learning a correct gender/number representation for all words in the vocabulary. 7.1 Relaxing Markable Extraction The pipeline we use to extract potential antecedent candidates is borrowed from the BART anaphora resolution toolkit. BART uses a syntactic parser to identify noun phrases as markables. When extracting antecedent candidates for coreference prediction, it starts by considering a window consisting of the sentence in which the anaphoric pronoun is located and the two immediately preceding sentences. Markables in this window are checked for morphological compatibility in terms of gender and number with the anaphoric pronoun, and only compatible markables are extracted as antecedent candidates. If no compatible markables are found in the initial window, the window is successively enlarged one sentence at a time until at least one suitable markable is found. Our error analysis shows that this procedure misses some relevant markables both because the initial two-sentence extraction window is too small and because the morphological compatibility check incorrectly filters away some markables that should have been considered as candidates. By contrast, the extraction procedure does extract quite a number of first and second person noun phrases (I, we, you and their oblique forms) in the TED talks which are extremely unlikely to be the antecedent of a later occurrence of he, she, it or they. As a first step, we therefore adjust the extraction criteria to our task by increasing the initial extraction window to five sentences, excluding first and second person markables and removing the morphological compatibility requirement. The compatibility check is still used to control expansion of the extraction window, but it is no longer applied to filter the extracted markables. This increases the accuracy to 0.701 for TED and 0.602 for the news TED (Accuracy: 0.696) P R ce 0.618 0.722 elle 0.754 0.548 elles 0.737 0.340 il 0.718 0.629 ils 0.652 0.916 OTHER 0.741 0.682 F 0.666 0.635 0.465 0.670 0.761 0.711 News commentary (Accuracy: 0.597) ce elle elles il ils OTHER P 0.419 0.547 0.539 0.623 0.596 0.614 R 0.368 0.460 0.135 0.719 0.783 0.544 F 0.392 0.500 0.215 0.667 0.677 0.577 Table 4: Neural network classifier with latent anaphora resolution TED (Accuracy: 0.713) ce elle P 0.61 1 0.749 R 0.723 0.596 F 0.662 0.664 elles 0.602 0.616 il 0.733 0.638 ils 0.710 0.884 OTHER 0.760 0.704 News commentary (Accuracy: 0.626) ce elle elles il ils OTHER P 0.492 0.526 0.547 0.599 0.671 0.681 Table 5: Final classifier R 0.324 0.439 0.558 0.757 0.878 0.526 0.609 0.682 0.788 0.731 F 0.391 0.478 0.552 0.669 0.761 0.594 results commentaries, while the performance for elles im- proves to F-scores of 0.531 (TED; P 0.690, R 0.432) and 0.304 (News commentaries; P 0.444, R 0.231), respectively. Note that these and all the following results are not directly comparable to the ME baseline results in Table 2, since they include modifications and improvements to the training data extraction procedure that might possibly lead to benefits in the ME setting as well. 7.2 Adding Lexicon Knowledge In order to make it easier for the classifier to identify the gender and number properties of infrequent words, we extend the word vectors with features indicating possible morphological features for each word. In early experiments with ME classifiers, we found that our attempts to do proper gender and number tagging in French text did not improve classification performance noticeably, presumably because the annotation was too noisy. In more recent experiments, we just add features indicating all possible morphological interpretations of each word, rather than trying to disambiguate them. To do this, we look up the morphological annotations of the French words in the Lefff dictionary (Sagot et al., 2006) and intro- 388 duce a set of new binary features to indicate whether a particular reading of a word occurs in that dictionary. These features are then added to the one-hot representation of the antecedent words. Doing so improves the classifier accuracy to 0.71 1 (TED) and 0.604 (News commentaries), while the F-scores for elles reach 0.589 (TED; P 0.649, R 0.539) and 0.500 (News commentaries; P 0.545, R 0.462), respectively. 7.3 More Anaphoric Link Features Even though the modified antecedent candidate extraction with its larger context window and without the morphological filter results in better performance on both test sets, additional error analysis reveals that the classifiers has greater problems identifying the correct markable in this setting. One reason for this may be that the baseline anaphoric link feature set described above (Section 6) only includes two very rough binary distance features which indicate whether or not the anaphora and the antecedent candidate occur in the same or in immediately adjacent sentences. With the larger context window, this may be too unspecific. In our final experiment, we there- fore enable some additional features which are available in BART, but disabled in the baseline system: Distance in number of markables Distance in number of sentences Sentence distance, log-transformed Distance in number of words Part of speech of head word Most of these encode the distance between the anaphora and the antecedent candidate in more precise ways. Complete results for this final system are presented in Table 5. Including these additional features leads to another slight increase in accuracy for both corpora, with similar or increased classifier F-scores for most classes except elle in the news commentary condition. In particular, we should like to point out the performance of our benchmark classifier for elles, which suffered from extremely low recall in the first classifiers and approaches the performance ofthe other classes, with nearly balanced precision and recall, in this final system. Since elles is a low-frequency class and cannot be reliably predicted using source context alone, we interpret this as evidence that our final neural network classifier has incorporated some relevant knowledge about pronominal anaphora that the baseline ME clas– – – – – sifier and earlier versions of our network have no access to. This is particularly remarkable because no data manually annotated for coreference was used for training. 8 Related work Even though it was recognised years ago that the information contained in parallel corpora may provide valuable information for the improvement of anaphora resolution systems, there have not been many attempts to cash in on this insight. Mitkov and Barbu (2003) exploit parallel data in English and French to improve pronominal anaphora resolution by combining anaphora resolvers for the individual languages with handwritten rules to resolve conflicts between the output of the language-specific resolvers. Veselovská et al. (2012) apply a similar strategy to English-Czech data to resolve different uses of the pronoun it. Other work has used word alignments to project coreference annotations from one language to another with a view to training anaphora resolvers in the target language (Postolache et al., 2006; de Souza and Or˘ asan, 2011). Rahman and Ng (2012) instead use machine translation to translate their test 389 data into a language for which they have an anaphora resolver and then project the annotations back to the original language. Completely unsupervised monolingual anaphora resolution has been approached using, e. g., Markov logic (Poon and Domingos, 2008) and the Expectation-Maximisation algorithm (Cherry and Bergsma, 2005; Charniak and Elsner, 2009). To the best of our knowledge, the direct application of machine learning techniques to parallel data in a task related to anaphora resolution is novel in our work. Neural networks and deep learning techniques have recently gained some popularity in natural language processing. They have been applied to tasks such as language modelling (Bengio et al., 2003; Schwenk, 2007), translation modelling in statistical machine translation (Le et al., 2012), but also part-ofspeech tagging, chunking, named entity recognition and semantic role labelling (Collobert et al., 2011). In tasks related to anaphora resolution, standard feedforward neural networks have been tested as a classifier in an anaphora resolution system (Stuckardt, 2007), but the network design presented in our work is novel. 9 Conclusion In this paper, we have introduced cross-lingual pronoun prediction as an independent natural language processing task. Even though it is not an end-to-end task, pronoun prediction is interesting for several reasons. It is related to the problem of pronoun translation in SMT, a currently unsolved problem that has been addressed in a number of recent research publications (Le Nagard and Koehn, 2010; Hardmeier and Federico, 2010; Guillou, 2012) without reaching a majorbreakthrough. In this work, we have shown that pronoun prediction can be effectively modelled in a neural network architecture with relatively simple features. More importantly, we have demonstrated that the task can be exploited to train a classifier with a latent representation of anaphoric links. With parallel text as its only supervision this classifier achieves a level of performance that is similar to, if not better than, that of a classifier using a regular anaphora resolution system trained with manually annotated data. References Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A neural probabilistic language model. Journal ofMachine Learning Research, 3:1137–1 155. Samuel Broscheit, Massimo Poesio, Simone Paolo Ponzetto, Kepa Joseba Rodriguez, Lorenza Romano, Olga Uryupina, Yannick Versley, and Roberto Zanoli. 2010. BART: A multilingual anaphora resolution system. In Proceedings of the 5th International Workshop on Semantic Evaluations (SemEval-2010), Uppsala, Sweden, 15–16 July 2010. Mauro Cettolo, Christian Girardi, and Marcello Federico. 2012. WIT3: Web inventory of transcribed and translated talks. In Proceedings of the 16th Conference of the European Associationfor Machine Translation (EAMT), pages 261–268, Trento, Italy. Eugene Charniak and Micha Elsner. 2009. EM works for pronoun anaphora resolution. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pages 148–156, Athens, Greece. Colin Cherry and Shane Bergsma. 2005. An Expectation Maximization approach to pronoun resolution. In Proceedings of the Ninth Conference on Computational Natural Language Learning (CoNLL-2005), pages 88– 95, Ann Arbor, Michigan. Michael Collins. 1999. Head-Driven Statistical Models forNatural Language Parsing. Ph.D. thesis, University of Pennsylvania. Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal ofMachine Learning Research, 12:2461–2505. José de Souza and Constantin Or˘ asan. 2011. Can projected chains in parallel corpora help coreference resolution? In Iris Hendrickx, Sobha Lalitha Devi, António Branco, and Ruslan Mitkov, editors, Anaphora Processing and Applications, volume 7099 of Lecture Notes in Computer Science, pages 59–69. Springer, Berlin. Liane Guillou. 2012. Improving pronoun translation for statistical machine translation. In Proceedings of the Student Research Workshop at the 13th Conference of the European Chapter of the Associationfor Computational Linguistics, pages 1–10, Avignon, France. Christian Hardmeier and Marcello Federico. 2010. Modelling pronominal anaphora in statistical machine translation. In Proceedings of the seventh International Workshop on Spoken Language Translation (IWSLT), pages 283–289, Paris, France. Christian Hardmeier. 2012. Discourse in statistical machine translation: A survey and a case study. Discours, 11. Dan Klein and Christopher D. Manning. 390 2003. Accu- rate unlexicalized parsing. In Proceedings of the 41st Annual Meeting of the Associationfor Computational Linguistics, pages 423–430, Sapporo, Japan. Hai-Son Le, Alexandre Allauzen, and François Yvon. 2012. Continuous space translation models with neural networks. In Proceedings ofthe 2012 Conference ofthe North American Chapter of the Associationfor Computational Linguistics: Human Language Technologies, pages 39–48, Montréal, Canada. Ronan Le Nagard and Philipp Koehn. 2010. Aiding pronoun translation with co-reference resolution. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, pages 252–261, Uppsala, Sweden. Ruslan Mitkov and Catalina Barbu. 2003. Using bilingual corpora to improve pronoun resolution. Languages in Contrast, 4(2):201–21 1. Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational linguistics, 29: 19–51. Hoifung Poon and Pedro Domingos. 2008. Joint unsupervised coreference resolution with Markov Logic. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 650– 659, Honolulu, Hawaii. Oana Postolache, Dan Cristea, and Constantin Or˘ asan. 2006. Transferring coreference chains through word alignment. In Proceedings of the 5th Conference on International Language Resources and Evaluation (LREC-2006), pages 889–892, Genoa. Altaf Rahman and Vincent Ng. 2012. Translation-based projection for multilingual coreference resolution. In Proceedings of the 2012 Conference of the North American Chapter of the Associationfor Computational Linguistics: Human Language Technologies, pages 720– 730, Montréal, Canada. Benoît Sagot, Lionel Clément, Éric Villemonte de La Clergerie, and Pierre Boullier. 2006. The Lefff 2 syntactic lexicon for French: architecture, acquisition, use. In Proceedings of the 5th Conference on International Language Resources and Evaluation (LREC2006), pages 1348–1351, Genoa. Holger Schwenk. 2007. Continuous space language models. Computer Speech and Language, 21(3):492–5 18. Wee Meng Soon, Hwee Tou Ng, and Daniel Chung Yong Lim. 2001. A machine learning approach to coreference resolution of noun phrases. Computational linguistics, 27(4):521–544. Roland Stuckardt. 2007. Applying backpropagation networks to anaphor resolution. In António Branco, editor, Anaphora: Analysis, Algorithms and Applications. 6th Discourse Anaphora and Anaphor Resolution Collo- 2007, number 4410 in Lecture Notes in Artificial Intelligence, pages 107–124, Berlin. Kate ˇrina Veselovská, Ngu.y Giang Linh, and Michal Novák. 2012. Using Czech-English parallel corpora in quium, DAARC automatic identification of it. In Proceedings of the 5th Workshop on Building and Using Comparable Corpora, pages 112–120, Istanbul, Turkey. 391

6 0.78150761 80 emnlp-2013-Exploiting Zero Pronouns to Improve Chinese Coreference Resolution

7 0.77683151 164 emnlp-2013-Scaling Semantic Parsers with On-the-Fly Ontology Matching

8 0.77439749 53 emnlp-2013-Cross-Lingual Discriminative Learning of Sequence Models with Posterior Regularization

9 0.77389169 69 emnlp-2013-Efficient Collective Entity Linking with Stacking

10 0.7724492 48 emnlp-2013-Collective Personal Profile Summarization with Social Networks

11 0.77060676 73 emnlp-2013-Error-Driven Analysis of Challenges in Coreference Resolution

12 0.76930654 152 emnlp-2013-Predicting the Presence of Discourse Connectives

13 0.76772636 56 emnlp-2013-Deep Learning for Chinese Word Segmentation and POS Tagging

14 0.76590711 132 emnlp-2013-Mining Scientific Terms and their Definitions: A Study of the ACL Anthology

15 0.76511008 79 emnlp-2013-Exploiting Multiple Sources for Open-Domain Hypernym Discovery

16 0.76480496 194 emnlp-2013-Unsupervised Relation Extraction with General Domain Knowledge

17 0.76411712 62 emnlp-2013-Detection of Product Comparisons - How Far Does an Out-of-the-Box Semantic Role Labeling System Take You?

18 0.76277804 108 emnlp-2013-Interpreting Anaphoric Shell Nouns using Antecedents of Cataphoric Shell Nouns as Training Data

19 0.76219064 68 emnlp-2013-Effectiveness and Efficiency of Open Relation Extraction

20 0.76173055 167 emnlp-2013-Semi-Markov Phrase-Based Monolingual Alignment