emnlp emnlp2013 emnlp2013-31 knowledge-graph by maker-knowledge-mining

31 emnlp-2013-Automatic Feature Engineering for Answer Selection and Extraction


Source: pdf

Author: Aliaksei Severyn ; Alessandro Moschitti

Abstract: This paper proposes a framework for automatically engineering features for two important tasks of question answering: answer sentence selection and answer extraction. We represent question and answer sentence pairs with linguistic structures enriched by semantic information, where the latter is produced by automatic classifiers, e.g., question classifier and Named Entity Recognizer. Tree kernels applied to such structures enable a simple way to generate highly discriminative structural features that combine syntactic and semantic information encoded in the input trees. We conduct experiments on a public benchmark from TREC to compare with previous systems for answer sentence selection and answer extraction. The results show that our models greatly improve on the state of the art, e.g., up to 22% on F1 (relative improvement) for answer extraction, while using no additional resources and no manual feature engineering.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 This paper proposes a framework for automatically engineering features for two important tasks of question answering: answer sentence selection and answer extraction. [sent-3, score-1.725]

2 We represent question and answer sentence pairs with linguistic structures enriched by semantic information, where the latter is produced by automatic classifiers, e.g., a question classifier and a Named Entity Recognizer. [sent-4, score-1.097]

3 We conduct experiments on a public benchmark from TREC to compare with previous systems for answer sentence selection and answer extraction. [sent-8, score-1.522]

4 The results show that our models greatly improve on the state of the art, e.g., up to 22% on F1 (relative improvement) for answer extraction, while using no additional resources and no manual feature engineering. [sent-11, score-0.723]

5 Question Answering (QA) systems are typically built from three main macro-modules: (i) search and retrieval of candidate passages; (ii) reranking or selection of the most promising passages; and (iii) answer extraction. [sent-12, score-0.899]

6 Answer sentence selection refers to the task of selecting the sentence containing the correct answer. [sent-14, score-0.988]

7 The definition of rules for both tasks is conceptually demanding and involves the use of syntactic and semantic properties of the question and its related answer passages. [sent-20, score-0.89]

8 For example, a rule detecting the semantic links between Johnny Appleseed’s real name and the correct answer John Chapman in the answer sentence has to be engineered. [sent-25, score-1.594]

9 (X) with real name is (X) of the answer sentence. [sent-27, score-0.725]

10 In this paper, we show that tree kernels (Collins and Duffy, 2002; Moschitti, 2006) can be applied to automatically learn complex structural patterns for both answer sentence selection and answer extraction. [sent-41, score-1.903]

11 Such patterns are syntactic/semantic structures occurring in question and answer passages. [sent-42, score-0.966]

12 In more detail, we (i) design a pair of shallow syntactic trees (one for the question and one for the answer sentence); (ii) connect them with relational nodes (i.e. [sent-50, score-1.272]

13 , those matching the same words in the question and in the answer passages); (iii) label the tree nodes with semantic information such as question category, focus, and NEs; and (iv) use the NE type to establish additional semantic links between the candidate answer, i.e. [sent-52, score-1.61]

14 Finally, for the task of answer extraction we also connect such semantic information to the answer sentence trees such that we can learn factoid answer patterns. [sent-55, score-2.414]

15 We show that our models are very effective in producing features for both answer selection and extraction by experimenting with TREC QA corpora and directly comparing with the state of the art, e.g. [sent-56, score-0.859]

16 The results show that our methods greatly improve on both tasks, yielding a large improvement in Mean Average Precision for answer selection and in F1 for answer extraction: up to 22% relative improvement in F1 when little training data is used. [sent-61, score-1.483]

17 Sec. 4 describes our model for answer selection and extraction, Sec. [sent-68, score-0.783]

18 The latter can measure the similarity between question and answer pairs. [sent-85, score-0.903]

19 We define each question/answer pair x as a triple composed of a question tree Tq, an answer sentence tree Ts, and a similarity feature vector v, i.e., x = ⟨Tq, Ts, v⟩. [sent-86, score-1.338]
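
The triple representation above lends itself to a composed kernel: a tree kernel on the question trees, a tree kernel on the answer-sentence trees, and a linear kernel on the similarity vectors. The sketch below illustrates that composition only; the function and variable names are illustrative, and `tree_kernel` stands for whichever tree kernel (STK, PTK) is plugged in, not the authors' exact code.

```python
# A sketch of a pair kernel over question/answer pairs x = (Tq, Ts, v):
# similarity of two pairs = tree kernel on the question trees
#                         + tree kernel on the answer-sentence trees
#                         + dot product of the similarity feature vectors.

def pair_kernel(x1, x2, tree_kernel):
    (tq1, ts1, v1), (tq2, ts2, v2) = x1, x2
    k_q = tree_kernel(tq1, tq2)                 # question-tree similarity
    k_s = tree_kernel(ts1, ts2)                 # answer-tree similarity
    k_v = sum(a * b for a, b in zip(v1, v2))    # feature-vector dot product
    return k_q + k_s + k_v
```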

20 In contrast, we solve the lack of feature pairing by annotating the trees with relational tags that link the question tree fragments to the related fragments of the answer sentence. [sent-101, score-1.342]

21 In the next section, we show simple structural models that we used in our experiments for question and answer pair classification. [sent-103, score-1.005]

22 It generalizes the syntactic tree kernel (STK) (Collins and Duffy, 2002), which maps a tree into the space of all possible tree fragments constrained by the rule that sibling nodes cannot be separated. [sent-108, score-0.656]
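
As a reference point, the sketch below re-implements the simpler of the two kernels, the STK of Collins and Duffy (2002): it counts the tree fragments shared by two trees without ever splitting sibling nodes (PTK relaxes exactly that constraint). Trees are nested `(label, [children])` tuples and `lam` is the usual decay factor; this is an illustrative Python version, not the SVM-light-TK implementation used in the paper.

```python
# Syntactic tree kernel (STK) sketch: sum over all node pairs of the
# (decayed) count of common fragments rooted at those nodes.

def production(node):
    label, children = node
    return (label, tuple(child[0] for child in children))

def is_preterminal(node):
    return len(node[1]) == 1 and not node[1][0][1]   # single leaf child

def delta(n1, n2, lam=0.4):
    """(Decayed) number of common fragments rooted at n1 and n2."""
    if production(n1) != production(n2):
        return 0.0
    if not n1[1] or is_preterminal(n1):              # leaf or pre-terminal
        return lam
    score = lam
    for c1, c2 in zip(n1[1], n2[1]):                 # same production => same arity
        score *= 1.0 + delta(c1, c2, lam)
    return score

def collect_nodes(tree):
    nodes = [tree] if tree[1] else []                # internal nodes only
    for child in tree[1]:
        nodes.extend(collect_nodes(child))
    return nodes

def syntactic_tree_kernel(t1, t2, lam=0.4):
    return sum(delta(n1, n2, lam)
               for n1 in collect_nodes(t1) for n2 in collect_nodes(t2))
```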

23 Our shallow tree structure is a two-level syntactic hierarchy built from word lemmas (leaves) and part-of-speech tags, organized into chunks identified by a shallow syntactic parser (Fig. [sent-117, score-0.724]
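
A minimal sketch of how such a two-level tree could be assembled from chunker output is given below. The input format, `(lemma, POS, BIO chunk tag)` triples, and the hypothetical analysis of the example question are assumptions standing in for whatever shallow parser the authors used; the tuple-based tree format matches the kernel sketch above.

```python
# Build a shallow tree: sentence root -> chunk nodes -> POS nodes -> lemma leaves.

def build_shallow_tree(tagged_tokens):
    chunks, current = [], None
    for lemma, pos, bio in tagged_tokens:
        pos_node = (pos, [(lemma, [])])               # POS node over a lemma leaf
        if bio == "O" or bio.startswith("B-") or current is None:
            label = "O" if bio == "O" else bio[2:]
            current = (label, [])
            chunks.append(current)
        current[1].append(pos_node)
    return ("S", chunks)

# Question "What is Johnny Appleseed's real name?" with hypothetical tags:
question_tokens = [
    ("what", "WP", "B-NP"), ("be", "VBZ", "B-VP"),
    ("johnny", "NNP", "B-NP"), ("appleseed", "NNP", "I-NP"),
    ("'s", "POS", "I-NP"), ("real", "JJ", "I-NP"), ("name", "NN", "I-NP"),
]
question_tree = build_shallow_tree(question_tokens)
```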

24 We defined a similar structure in (Severyn and Moschitti, 2012) for answer passage reranking, which improved on feature vector baselines. [sent-119, score-0.748]

25 This simple linguistic representation is suitable for building a rather expressive answer sentence selection model. [sent-120, score-0.849]

26 Moreover, the use of a shallow parser is motivated by the need to generate text spans to produce candidate answers required by an answer extraction system. [sent-121, score-1.083]

27 It is important to establish a correspondence between the question and the answer sentence, aligning related concepts from both. [sent-123, score-1.195]

28 We take a two-level approach, where we first use plain lexical matching to connect common lemmas from the question and its candidate answer sentence. [sent-124, score-1.035]
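
A sketch of that first, purely lexical step: chunks of the question and answer trees sharing at least one non-stopword lemma get a REL prefix on their label. The stopword list and the exact node on which the tag is placed are illustrative assumptions.

```python
# Mark REL on chunks that share a non-stopword lemma across the q/a pair.

STOPWORDS = {"be", "do", "the", "a", "an", "of", "to", "in",
             "what", "who", "when", "where", "why", "how"}

def chunk_lemmas(chunk):
    """Lemmas under one chunk node of a shallow tree."""
    return {pos_node[1][0][0] for pos_node in chunk[1]}

def mark_rel(question_tree, answer_tree):
    q_lemmas = {l for c in question_tree[1] for l in chunk_lemmas(c)} - STOPWORDS
    a_lemmas = {l for c in answer_tree[1] for l in chunk_lemmas(c)} - STOPWORDS
    shared = q_lemmas & a_lemmas

    def relabel(tree):
        children = [("REL-" + c[0], c[1]) if chunk_lemmas(c) & shared else c
                    for c in tree[1]]
        return (tree[0], children)

    return relabel(question_tree), relabel(answer_tree)
```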

29 Secondly, we establish semantic links between NEs extracted from the answer sentence and the question focus word, which encodes the expected lexical answer type (LAT). [sent-125, score-1.818]

30 Dashed arrows (red) indicate the tree fragments (red dashed boxes) in the question and its answer sentence linked by the relational REL tag, which is established via a match on the word lemmas. [sent-128, score-1.317]

31 Solid arrows (blue) connect a question focus word name with the related named entities of type Person corresponding to the question category (HUM) via a relational tag REL-HUM. [sent-129, score-0.786]

32 An additional ANS tag is used to mark chunks containing the candidate answer (here the correct answer John Chapman). [sent-130, score-1.613]

33 Our question classification model is simpler than before: we use a multiclass SVM with tree kernels to automatically extract the question class. [sent-145, score-0.714]

34 This set of question categories is sufficient to capture the coarse semantic answer type of the candidate answers found in TREC. [sent-151, score-1.204]

35 The question focus word specifies the lexical answer type, capturing the target information need posed by a question; to make this piece of information effective, the focus word needs to be linked to the target candidate answer. [sent-154, score-0.92]

36 The link can be established by a direct match in the answer sentence, or the match can be established using semantic information. [sent-156, score-0.752]

37 Hence, we propose to exploit the question focus along with the related named entities (according to the mapping from Table 1) of the answer sentence to establish relational links between the tree fragments. [sent-160, score-1.423]

38 In particular, once the question focus and question category are determined, we link the focus word w_focus in the question with all the named entities whose type matches the question class (Table 1). [sent-161, score-0.811]
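
The sketch below illustrates this typed linking, reusing `chunk_lemmas` from the lexical-matching sketch above. The `CATEGORY_TO_NE` map stands in for the paper's Table 1 (not reproduced here), and the entity input format is an assumption.

```python
# Link the focus chunk and the answer chunks whose NEs match the question class.

CATEGORY_TO_NE = {                      # illustrative subset only
    "HUM": {"PERSON"},
    "LOC": {"LOCATION", "GPE"},
    "NUM": {"NUMBER", "DATE", "MONEY"},
    "ENTY": {"ORGANIZATION", "MISC"},
}

def add_typed_links(question_tree, answer_tree, focus_lemma, category,
                    answer_entities):
    """answer_entities: list of (lemma, NE type) pairs from an NER."""
    wanted = CATEGORY_TO_NE.get(category, set())
    ne_lemmas = {lemma for lemma, ne_type in answer_entities if ne_type in wanted}
    tag = "REL-" + category

    def relabel(tree, targets):
        children = [(tag + "-" + c[0], c[1]) if chunk_lemmas(c) & targets else c
                    for c in tree[1]]
        return (tree[0], children)

    return relabel(question_tree, {focus_lemma}), relabel(answer_tree, ne_lemmas)
```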

39 Fig. 1 shows an example q/a pair where the typed relational tag is used in the shallow syntactic tree representation to link the chunk containing the question focus name with the named entities of the corresponding type Person, i.e. [sent-166, score-0.945]

40 This section describes our approach to (i) answer sentence selection, used to select the most promising answer sentences, and (ii) answer extraction, which returns the answer keyword (for factoid questions). [sent-169, score-3.07]

41 We cast the task of answer sentence selection as a classification problem. [sent-171, score-0.822]

42 We use this model to predict whether a given pair of a question and an answer sentence is correct or not. [sent-174, score-0.992]

43 We train a binary SVM with tree kernels to obtain an answer sentence classifier. [sent-175, score-0.891]

44 The prediction scores obtained from the classifier are used to rerank the answer candidates (pointwise reranking). [sent-176, score-0.77]
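
A sketch of this pointwise scheme is given below. The paper trains with SVM-light-TK; here scikit-learn's `SVC` with a precomputed Gram matrix is used as a stand-in, reusing the `pair_kernel` sketch above, and candidates are simply sorted by the SVM decision score.

```python
# Pointwise reranking: train a kernel SVM, then sort candidates by its score.

import numpy as np
from sklearn.svm import SVC

def gram(pairs_a, pairs_b, tree_kernel):
    return np.array([[pair_kernel(a, b, tree_kernel) for b in pairs_b]
                     for a in pairs_a])

def train_selector(train_pairs, labels, tree_kernel):
    clf = SVC(kernel="precomputed")
    clf.fit(gram(train_pairs, train_pairs, tree_kernel), labels)
    return clf

def rerank(clf, train_pairs, candidate_pairs, tree_kernel):
    scores = clf.decision_function(gram(candidate_pairs, train_pairs, tree_kernel))
    order = np.argsort(-scores)                  # highest score first
    return [(candidate_pairs[i], float(scores[i])) for i in order]
```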

45 The goal of answer extraction is to extract a text span from a given candidate answer sentence. [sent-184, score-1.551]

46 Such a span represents a correct answer phrase for a given question. [sent-185, score-0.75]

47 Different from previous work, which casts the answer extraction task as a tagging problem and applies a CRF to learn an answer phrase tagger (Yao et al. [sent-186, score-1.476]

48 In particular, for each example, represented as a triple ⟨a, Tq, Ts⟩ composed of the answer a and the question and answer sentence trees, we generate a set of training examples E with every candidate chunk marked with an ANS tag (one at a time). [sent-190, score-1.806]

49 To reduce the number of generated examples for each answer sentence, we only consider NP chunks, since other types of chunks, e.g. [sent-191, score-0.7]

50 Finally, an original untagged tree is used to generate a positive example (line 8), when the answer sentence contains a correct answer, and a negative example (line 10), when it does not contain a correct answer. [sent-194, score-0.991]

51 At classification time, given a question and a candidate answer sentence, all NP nodes of the sentence are marked with ANS (one at a time) as the possible answer, generating a set of tree candidates. [sent-195, score-1.13]
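
The candidate-generation step, common to training and classification, can be sketched as follows; the training-time labeling criterion shown is a hypothetical simplification, not necessarily the authors' exact rule.

```python
# Mark each NP chunk with an ANS tag, one at a time, producing one candidate
# tree per NP of the answer-sentence tree.

def generate_ans_candidates(answer_tree):
    candidates = []
    chunks = answer_tree[1]
    for i, chunk in enumerate(chunks):
        if not chunk[0].endswith("NP"):           # only NP chunks are candidates
            continue
        marked = list(chunks)
        marked[i] = ("ANS-" + chunk[0], chunk[1])
        candidates.append(((answer_tree[0], marked), chunk))
    return candidates                             # [(candidate tree, chunk), ...]

def label_candidates(candidates, correct_answer_lemmas):
    """Hypothetical training-time labeling: positive iff the marked chunk
    contains all lemmas of the correct answer."""
    return [(tree, chunk_lemmas(chunk) >= set(correct_answer_lemmas))
            for tree, chunk in candidates]
```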

52 We provide results on two related yet different tasks: answer sentence selection and answer extraction. [sent-199, score-1.522]

53 The goal of the former is to learn a model scoring correct question and answer sentence pairs so that sentences containing the correct answers are brought to the top positions. [sent-200, score-1.042]

54 The first two datasets are very domain-specific, while the dataset from (Bunescu and Huang, 2010) is more generic, containing the first 2,000 questions from the answer type dataset of Li and Roth, annotated with focus words. [sent-214, score-0.849]

55 We use the data from (Wang et al., 2007) to enable direct comparison with previous work on answer sentence selection. [sent-228, score-0.739]

56 The data provided for training comes as two sets: a small set of 94 questions (TRAIN) that were manually curated for errors, and 1,229 questions from the entire TREC 8-12 that contain at least one correct answer sentence (ALL). [sent-230, score-0.967]

57 The latter set represents a noisier setting, since many answer sentences are erroneously marked as correct simply because they match a regular expression. [sent-231, score-0.75]

58 Table 4 compares our kernel-based structural model with the previous state-of-the-art systems for answer sentence selection. [sent-233, score-0.841]

59 In particular, we compare with the four most recent state-of-the-art answer sentence reranker models (Wang et al. [sent-234, score-0.761]

60 We use question focus and question category classifiers coupled with NERs to establish a semantic mapping between words in a q/a pair. [sent-243, score-0.613]

61 In TREC, correct answers are identified by regex matching using the provided answer pattern files. Table 3 summarizes the TREC data for answer extraction used in (Yao et al., 2013). [sent-244, score-1.662]

62 Our experiments on answer extraction replicate the setting of (Yao et al., 2013). [sent-268, score-0.798]

63 This is the most recent work on answer extraction reporting state-of-the-art results. [sent-269, score-0.776]

64 Here the focus is on the ability of an answer extraction system to recover as many correct answers as possible from each answer sentence candidate. [sent-271, score-1.761]

65 Recall (R) encodes the percentage of correct answer sentences for which the system correctly extracts an answer (for TREC 13 there are a total of 284 correct answer sentences), while Precision (P) reflects how many answers extracted by the system are actually correct. [sent-273, score-2.336]
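
A small sketch of this evaluation, under an assumed data layout (dictionaries keyed by sentence id): recall is measured over the correct answer sentences only, precision over everything the system extracted.

```python
# Precision/recall/F1 for answer extraction as described above.

def extraction_prf(extracted, gold):
    """extracted: sentence id -> extracted answer string (or None);
    gold: sentence id -> set of acceptable answers (correct sentences only)."""
    hits = sum(1 for sid, answers in gold.items()
               if extracted.get(sid) in answers)
    n_extracted = sum(1 for ans in extracted.values() if ans is not None)
    recall = hits / len(gold) if gold else 0.0
    precision = hits / n_extracted if n_extracted else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```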

66 On the other hand, a high-precision system would attempt to answer fewer questions (often extracting no answer at all) but get those it does answer right. [sent-275, score-0.949]

67 Unlike the CRF model, which obtains higher precision, our system acts as a high-recall system able to recover most of the answers from the correct answer sentences. [sent-278, score-0.911]

68 Having high recall is preferable to high precision in answer extraction, since producing more correct answers can help the final voting scheme come up with a single best answer. [sent-279, score-1.099]

69 Yao et al. (2013) apply fairly complex outlier resolution techniques to force answer predictions, thus aiming at increasing the number of extracted answers. [sent-281, score-0.722]

70 Nevertheless, it has a substantial effect on the number of questions that can be answered correctly (assuming perfect single best answer selection). [sent-283, score-0.789]

71 Clearly, our system is able to recover a large number of answers from the correct answer sentences, while its lower precision, i.e. [sent-284, score-0.886]

72 , extracting answer candidates from sentences that do not contain a correct answer, can be overcome by further applying various best answer selection strategies, which we explore in the next section. [sent-286, score-1.574]

73 Since the final step of the answer extraction module is to select, for each question, a single best answer from the set of extracted candidate answers, an answer selection scheme is required. [sent-288, score-2.562]

74 We adopt a simple majority voting strategy, where we aggregate the extracted answers produced by our answer extraction model. [sent-289, score-0.975]
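
A sketch of that voting step: all answer keywords extracted for a question are aggregated and the most frequent one wins. The string normalization and the arbitrary tie breaking are assumptions of the sketch.

```python
# Majority voting over the answers extracted for a single question.

from collections import Counter

def best_answer_by_voting(extracted_answers):
    """extracted_answers: answer strings extracted for one question."""
    votes = Counter(a.strip().lower() for a in extracted_answers if a)
    if not votes:
        return None
    answer, _ = votes.most_common(1)[0]
    return answer
```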

75 P/R: precision and recall; pairs: number of QA pairs with a correctly extracted answer; q: number of questions with at least one correct answer extracted; F1 sets an upper bound on the performance, assuming the selected best answer among the extracted candidates is always correct. [sent-295, score-1.604]

76 * marks the setting where we exclude incorrect question/answer pairs from training. [sent-296, score-0.903]

77 Table 6: Results on finding the best answer with voting. [sent-319, score-0.7]

78 Table 6 shows the results after majority voting is applied to select a single best answer for each question. [sent-341, score-0.7]

79 There are several sources of errors affecting the final performance of our answer extraction system: (i) chunking, (ii) named entity recognition and semantic linking, (iii) answer extraction, and (iv) single best answer selection. [sent-345, score-2.253]

80 Our system uses text spans identified by a chunker to extract answer candidates, which makes it impossible to extract answers that lie outside the chunk boundaries. [sent-347, score-0.931]

81 While our answer extraction model works on all NP chunks, the semantic tags from the NER serve as a strong cue to the classifier that a given chunk has a high probability of containing an answer. [sent-351, score-0.952]

82 , 2011), is required to boost the quality of candidates considered for answer extraction. [sent-358, score-0.741]

83 Our answer extraction model acts as a high-recall system, while it suffers from low precision, extracting answers from many incorrect sentences. [sent-361, score-0.961]

84 Improving the precision without sacrificing the recall would ease the subsequent task of best answer selection, since having fewer incorrect answer candidates would result in a better final performance. [sent-362, score-0.83]

85 Introducing additional constraints in the form of semantic tags to allow for better selection of answer candidates could also improve our system. [sent-363, score-0.899]

86 We apply a naïve majority voting scheme to select a single best answer from a set of extracted answer candidates. [sent-365, score-1.488]

87 This step has a dramatic impact on the final performance of the answer extraction system, resulting in a large drop in recall, i.e. [sent-366, score-0.776]

88 A more refined strategy, e.g., performing joint answer sentence re-ranking and answer extraction, is required to yield better performance. [sent-373, score-1.439]

89 Tree kernel methods have found many applications for the task of answer reranking, which are reported in (Moschitti, 2008; Moschitti, 2009; Moschitti and Quarteroni, 2008; Severyn and Moschitti, 2012). [sent-374, score-0.845]

90 However, their methods lack the use of important relational information between a question and a candidate answer, which is essential to learn accurate relational patterns. [sent-375, score-0.576]

91 In contrast, this paper relies on structures directly encoding the output of question and focus classifiers to connect the focus word with good candidate answer keywords (represented by NEs) of the answer passage. [sent-380, score-1.95]

92 Additionally, previous work on kernel-based approaches does not target answer extraction. [sent-382, score-0.7]

93 One of the best models for answer sentence selection has been proposed in (Wang et al., 2007). [sent-383, score-0.822]

94 They use the paradigm of quasi-synchronous grammar to model relations between a question and a candidate answer with syntactic transformations. [sent-385, score-1.027]

95 Furthermore, semantically enriched relational structures, where the semantic information is provided by automatic classifiers, have been previously explored for answer passage reranking in (Severyn et al. [sent-399, score-0.978]

96 This paper demonstrates that this model also works for building a reranker on the sentence level, and extends the previous work by applying the idea of automatic feature engineering with tree kernels to answer extraction. [sent-402, score-1.04]

97 We use syntactic structures, e.g., shallow, constituency, or dependency trees, representing the question and its candidate answer sentences, and let the kernel learning framework learn to use discriminative tree fragments for the target task. [sent-410, score-1.281]

98 The comparison with previous work on a public benchmark from TREC suggests that our approach is very promising, as we can improve the state of the art in both answer selection and extraction by a large margin (up to 22% relative improvement in F1 for answer extraction). [sent-416, score-1.559]

99 To achieve state-of-the-art results in answer sentence selection and answer extraction, it is sufficient to provide our model with a suitable tree structure encoding relevant syntactic information, e.g. [sent-418, score-1.723]

100 Towards a general model of answer typing: Question focus identification. [sent-427, score-0.76]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('answer', 0.7), ('severyn', 0.228), ('question', 0.203), ('trec', 0.161), ('tree', 0.152), ('relational', 0.149), ('answers', 0.136), ('kernels', 0.127), ('yao', 0.122), ('ptk', 0.106), ('kernel', 0.104), ('structural', 0.102), ('moschitti', 0.102), ('shallow', 0.096), ('chunk', 0.095), ('xx', 0.092), ('questions', 0.089), ('selection', 0.083), ('qa', 0.077), ('extraction', 0.076), ('aliaksei', 0.076), ('candidate', 0.075), ('fc', 0.074), ('factoid', 0.072), ('alessandro', 0.069), ('zanzotto', 0.063), ('structures', 0.063), ('voting', 0.063), ('appleseed', 0.061), ('ktk', 0.061), ('quarteroni', 0.061), ('focus', 0.06), ('chunks', 0.055), ('nes', 0.053), ('semantic', 0.052), ('correct', 0.05), ('syntactic', 0.049), ('passage', 0.048), ('fragments', 0.047), ('bunescu', 0.046), ('nicosia', 0.046), ('trees', 0.044), ('candidates', 0.041), ('reranking', 0.041), ('enriched', 0.04), ('damljanovic', 0.04), ('swer', 0.04), ('sentence', 0.039), ('coarse', 0.038), ('heilman', 0.037), ('establish', 0.036), ('wang', 0.034), ('constituency', 0.033), ('linking', 0.033), ('textual', 0.033), ('tag', 0.033), ('classifiers', 0.033), ('entities', 0.031), ('connect', 0.031), ('currency', 0.03), ('ferrucci', 0.03), ('fwornced', 0.03), ('johnny', 0.03), ('kv', 0.03), ('multiwords', 0.03), ('prager', 0.03), ('qatar', 0.03), ('uiuic', 0.03), ('classifier', 0.029), ('edit', 0.028), ('links', 0.028), ('forced', 0.028), ('passages', 0.028), ('ans', 0.028), ('iii', 0.027), ('representation', 0.027), ('chapman', 0.027), ('category', 0.026), ('hum', 0.026), ('roth', 0.026), ('lemmas', 0.026), ('scheme', 0.025), ('named', 0.025), ('name', 0.025), ('linked', 0.025), ('relies', 0.025), ('recall', 0.025), ('massimo', 0.024), ('duffy', 0.024), ('entailments', 0.024), ('tq', 0.024), ('encode', 0.024), ('constituent', 0.024), ('precision', 0.024), ('additional', 0.023), ('entailment', 0.023), ('replicate', 0.022), ('reranker', 0.022), ('outlier', 0.022), ('constituting', 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 31 emnlp-2013-Automatic Feature Engineering for Answer Selection and Extraction

Author: Aliaksei Severyn ; Alessandro Moschitti

Abstract: This paper proposes a framework for automatically engineering features for two important tasks of question answering: answer sentence selection and answer extraction. We represent question and answer sentence pairs with linguistic structures enriched by semantic information, where the latter is produced by automatic classifiers, e.g., question classifier and Named Entity Recognizer. Tree kernels applied to such structures enable a simple way to generate highly discriminative structural features that combine syntactic and semantic information encoded in the input trees. We conduct experiments on a public benchmark from TREC to compare with previous systems for answer sentence selection and answer extraction. The results show that our models greatly improve on the state of the art, e.g., up to 22% on F1 (relative improvement) for answer extraction, while using no additional resources and no manual feature engineering.

2 0.21041392 180 emnlp-2013-The Answer is at your Fingertips: Improving Passage Retrieval for Web Question Answering with Search Behavior Data

Author: Mikhail Ageev ; Dmitry Lagun ; Eugene Agichtein

Abstract: Passage retrieval is a crucial first step of automatic Question Answering (QA). While existing passage retrieval algorithms are effective at selecting document passages most similar to the question, or those that contain the expected answer types, they do not take into account which parts of the document the searchers actually found useful. We propose, to the best of our knowledge, the first successful attempt to incorporate searcher examination data into passage retrieval for question answering. Specifically, we exploit detailed examination data, such as mouse cursor movements and scrolling, to infer the parts of the document the searcher found interesting, and then incorporate this signal into passage retrieval for QA. Our extensive experiments and analysis demonstrate that our method significantly improves passage retrieval, compared to using textual features alone. As an additional contribution, we make available to the research community the code and the search behavior data used in this study, with the hope of encouraging further research in this area.

3 0.18424928 17 emnlp-2013-A Walk-Based Semantically Enriched Tree Kernel Over Distributed Word Representations

Author: Shashank Srivastava ; Dirk Hovy ; Eduard Hovy

Abstract: In this paper, we propose a walk-based graph kernel that generalizes the notion of treekernels to continuous spaces. Our proposed approach subsumes a general framework for word-similarity, and in particular, provides a flexible way to incorporate distributed representations. Using vector representations, such an approach captures both distributional semantic similarities among words as well as the structural relations between them (encoded as the structure of the parse tree). We show an efficient formulation to compute this kernel using simple matrix operations. We present our results on three diverse NLP tasks, showing state-of-the-art results.

4 0.18405235 126 emnlp-2013-MCTest: A Challenge Dataset for the Open-Domain Machine Comprehension of Text

Author: Matthew Richardson ; Christopher J.C. Burges ; Erin Renshaw

Abstract: We present MCTest, a freely available set of stories and associated questions intended for research on the machine comprehension of text. Previous work on machine comprehension (e.g., semantic modeling) has made great strides, but primarily focuses either on limited-domain datasets, or on solving a more restricted goal (e.g., open-domain relation extraction). In contrast, MCTest requires machines to answer multiple-choice reading comprehension questions about fictional stories, directly tackling the high-level goal of open-domain machine comprehension. Reading comprehension can test advanced abilities such as causal reasoning and understanding the world, yet, by being multiple-choice, still provide a clear metric. By being fictional, the answer typically can be found only in the story itself. The stories and questions are also carefully limited to those a young child would understand, reducing the world knowledge that is required for the task. We present the scalable crowd-sourcing methods that allow us to cheaply construct a dataset of 500 stories and 2000 questions. By screening workers (with grammar tests) and stories (with grading), we have ensured that the data is the same quality as another set that we manually edited, but at one tenth the editing cost. By being open-domain, yet carefully restricted, we hope MCTest will serve to encourage research and provide a clear metric for advancement on the machine comprehension of text.

5 0.16858251 59 emnlp-2013-Deriving Adjectival Scales from Continuous Space Word Representations

Author: Joo-Kyung Kim ; Marie-Catherine de Marneffe

Abstract: Continuous space word representations extracted from neural network language models have been used effectively for natural language processing, but until recently it was not clear whether the spatial relationships of such representations were interpretable. Mikolov et al. (2013) show that these representations do capture syntactic and semantic regularities. Here, we push the interpretation of continuous space word representations further by demonstrating that vector offsets can be used to derive adjectival scales (e.g., okay < good < excellent). We evaluate the scales on the indirect answers to yes/no questions corpus (de Marneffe et al., 2010). We obtain 72.8% accuracy, which outperforms previous results (∼60%) on this corpus and highlights the quality of the scales extracted, providing further support that the continuous space word representations are meaningful.
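
To make the vector-offset idea in this abstract concrete, the sketch below orders a handful of adjectives by projecting hypothetical word vectors onto an offset direction such as excellent − bad. The tiny 3-dimensional vectors are invented for illustration, and the projection heuristic is only an approximation of the approach the abstract describes, not the authors' actual method.

```python
# A minimal sketch of ordering adjectives along a scale by projecting their
# word vectors onto an offset vector (e.g., "excellent" - "bad"). The toy
# 3-dimensional vectors below are invented for illustration; real experiments
# would use embeddings taken from a neural network language model.
import numpy as np

vectors = {            # hypothetical word vectors
    "bad":       np.array([-1.0, 0.2, 0.1]),
    "okay":      np.array([-0.1, 0.3, 0.2]),
    "good":      np.array([ 0.5, 0.4, 0.1]),
    "excellent": np.array([ 1.1, 0.5, 0.2]),
}

offset = vectors["excellent"] - vectors["bad"]   # direction of the scale
offset /= np.linalg.norm(offset)

# Project each adjective onto the scale direction and sort by the projection.
scale = sorted(vectors, key=lambda w: float(vectors[w] @ offset))
print(" < ".join(scale))   # e.g. bad < okay < good < excellent
```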

6 0.15391544 155 emnlp-2013-Question Difficulty Estimation in Community Question Answering Services

7 0.11058248 7 emnlp-2013-A Hierarchical Entity-Based Approach to Structuralize User Generated Content in Social Media: A Case of Yahoo! Answers

8 0.091304064 160 emnlp-2013-Relational Inference for Wikification

9 0.087354854 166 emnlp-2013-Semantic Parsing on Freebase from Question-Answer Pairs

10 0.083318949 49 emnlp-2013-Combining Generative and Discriminative Model Scores for Distant Supervision

11 0.081823416 5 emnlp-2013-A Discourse-Driven Content Model for Summarising Scientific Articles Evaluated in a Complex Question Answering Task

12 0.077270731 69 emnlp-2013-Efficient Collective Entity Linking with Stacking

13 0.076659247 194 emnlp-2013-Unsupervised Relation Extraction with General Domain Knowledge

14 0.073071972 164 emnlp-2013-Scaling Semantic Parsers with On-the-Fly Ontology Matching

15 0.069625936 173 emnlp-2013-Simulating Early-Termination Search for Verbose Spoken Queries

16 0.067086644 167 emnlp-2013-Semi-Markov Phrase-Based Monolingual Alignment

17 0.066485241 196 emnlp-2013-Using Crowdsourcing to get Representations based on Regular Expressions

18 0.062587343 187 emnlp-2013-Translation with Source Constituency and Dependency Trees

19 0.061425067 179 emnlp-2013-Summarizing Complex Events: a Cross-Modal Solution of Storylines Extraction and Reconstruction

20 0.058278468 132 emnlp-2013-Mining Scientific Terms and their Definitions: A Study of the ACL Anthology


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.216), (1, 0.068), (2, -0.01), (3, 0.045), (4, -0.034), (5, 0.171), (6, 0.023), (7, 0.04), (8, 0.112), (9, 0.007), (10, 0.068), (11, 0.229), (12, -0.178), (13, -0.146), (14, 0.117), (15, 0.143), (16, 0.096), (17, 0.346), (18, 0.129), (19, -0.068), (20, 0.044), (21, -0.034), (22, 0.11), (23, -0.112), (24, 0.043), (25, -0.13), (26, -0.048), (27, 0.003), (28, -0.015), (29, 0.034), (30, 0.081), (31, -0.023), (32, -0.076), (33, -0.015), (34, -0.036), (35, 0.007), (36, 0.035), (37, -0.053), (38, 0.114), (39, 0.097), (40, 0.013), (41, 0.137), (42, -0.079), (43, 0.086), (44, 0.012), (45, -0.022), (46, 0.027), (47, -0.036), (48, -0.001), (49, -0.034)]
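
The pairs above are (topicId, topicWeight) coordinates of this paper in the LSI space, and the simValue numbers in the list that follows are presumably cosine similarities computed in that space. The sketch below shows one plausible way such scores could be produced with gensim; the toy abstracts and the number of topics are placeholders, since the site's actual indexing pipeline is not documented here.

```python
# A sketch of how "similar papers computed by lsi model" scores like the
# simValue column could be produced: represent each paper's abstract as a
# bag of words, fit an LSI model, and rank papers by cosine similarity in
# the latent space. The abstracts and num_topics are illustrative only.
from gensim import corpora, models, similarities

abstracts = {
    "31":  "answer sentence selection and extraction with tree kernels",
    "126": "machine comprehension reading comprehension dataset questions",
    "59":  "adjectival scales from continuous space word representations",
}

texts = [doc.lower().split() for doc in abstracts.values()]
dictionary = corpora.Dictionary(texts)
bows = [dictionary.doc2bow(t) for t in texts]

lsi = models.LsiModel(bows, id2word=dictionary, num_topics=2)
index = similarities.MatrixSimilarity(lsi[bows])

query = bows[0]                      # paper 31 as the query document
sims = index[lsi[query]]             # cosine similarities to every paper
for pid, score in zip(abstracts, sims):
    print(pid, round(float(score), 3))
```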

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.98813194 31 emnlp-2013-Automatic Feature Engineering for Answer Selection and Extraction

Author: Aliaksei Severyn ; Alessandro Moschitti

Abstract: This paper proposes a framework for automatically engineering features for two important tasks of question answering: answer sentence selection and answer extraction. We represent question and answer sentence pairs with linguistic structures enriched by semantic information, where the latter is produced by automatic classifiers, e.g., question classifier and Named Entity Recognizer. Tree kernels applied to such structures enable a simple way to generate highly discriminative structural features that combine syntactic and semantic information encoded in the input trees. We conduct experiments on a public benchmark from TREC to compare with previous systems for answer sentence selection and answer extraction. The results show that our models greatly improve on the state of the art, e.g., up to 22% on F1 (relative improvement) for answer extraction, while using no additional resources and no manual feature engineering.

2 0.78057528 126 emnlp-2013-MCTest: A Challenge Dataset for the Open-Domain Machine Comprehension of Text

Author: Matthew Richardson ; Christopher J.C. Burges ; Erin Renshaw

Abstract: Christopher J.C. Burges Microsoft Research One Microsoft Way Redmond, WA 98052 cburge s @micro so ft . com Erin Renshaw Microsoft Research One Microsoft Way Redmond, WA 98052 erinren@mi cros o ft . com disciplines are focused on this problem: for example, information extraction, relation extraction, We present MCTest, a freely available set of stories and associated questions intended for research on the machine comprehension of text. Previous work on machine comprehension (e.g., semantic modeling) has made great strides, but primarily focuses either on limited-domain datasets, or on solving a more restricted goal (e.g., open-domain relation extraction). In contrast, MCTest requires machines to answer multiple-choice reading comprehension questions about fictional stories, directly tackling the high-level goal of open-domain machine comprehension. Reading comprehension can test advanced abilities such as causal reasoning and understanding the world, yet, by being multiple-choice, still provide a clear metric. By being fictional, the answer typically can be found only in the story itself. The stories and questions are also carefully limited to those a young child would understand, reducing the world knowledge that is required for the task. We present the scalable crowd-sourcing methods that allow us to cheaply construct a dataset of 500 stories and 2000 questions. By screening workers (with grammar tests) and stories (with grading), we have ensured that the data is the same quality as another set that we manually edited, but at one tenth the editing cost. By being open-domain, yet carefully restricted, we hope MCTest will serve to encourage research and provide a clear metric for advancement on the machine comprehension of text. 1 Reading Comprehension A major goal for NLP is for machines to be able to understand text as well as people. Several research 193 semantic role labeling, and recognizing textual entailment. Yet these techniques are necessarily evaluated individually, rather than by how much they advance us towards the end goal. On the other hand, the goal of semantic parsing is the machine comprehension of text (MCT), yet its evaluation requires adherence to a specific knowledge representation, and it is currently unclear what the best representation is, for open-domain text. We believe that it is useful to directly tackle the top-level task of MCT. For this, we need a way to measure progress. One common method for evaluating someone’s understanding of text is by giving them a multiple-choice reading comprehension test. This has the advantage that it is objectively gradable (vs. essays) yet may test a range of abilities such as causal or counterfactual reasoning, inference among relations, or just basic understanding of the world in which the passage is set. Therefore, we propose a multiple-choice reading comprehension task as a way to evaluate progress on MCT. We have built a reading comprehension dataset containing 500 fictional stories, with 4 multiple choice questions per story. It was built using methods which can easily scale to at least 5000 stories, since the stories were created, and the curation was done, using crowd sourcing almost entirely, at a total of $4.00 per story. We plan to periodically update the dataset to ensure that methods are not overfitting to the existing data. The dataset is open-domain, yet restricted to concepts and words that a 7 year old is expected to understand. 
This task is still beyond the capability of today’s computers and algorithms. ProceSe datintlges, o Wfa tsh ein 2g01to3n, C UoSnfAe,re 1n8c-e2 o1n O Ecmtopbier ic 2a0l1 M3.et ?hc o2d0s1 i3n A Nsastoucria lti Loan fgoura Cgoem Ppruotcaetsiosin agl, L piang eusis 1t9ic3s–203, By restricting the concept space, we gain the difficulty of being an open-domain problem, without the full complexity of the real world (for example, there will be no need for the machine to understand politics, technology, or to have any domain specific expertise). The multiple choice task avoids ambiguities (such as when the task is to find a sentence that best matches a question, as in some early reading comprehension tasks: see Section 2), and also avoids the need for additional grading, such as is needed in some TREC tasks. The stories were chosen to be fictional to focus work on finding the answer in the story itself, rather than in knowledge repositories such as Wikipedia; the goal is to build technology that actually understands stories and paragraphs on a deep level (as opposed to using information retrieval methods and the redundancy of the web to find the answers). We chose to use crowd sourcing, as opposed to, for example, contracting teachers or paying for existing standardized tests, for three reasons, namely: (1) scalability, both for the sizes of datasets we can provide, and also for the ease of regularly refreshing the data; (2) for the variety in story-telling that having many different authors brings; and (3) for the free availability that can only result from providing non-copyrighted data. The content is freely available at http://research.microsoft.com/mct, and we plan to use that site to track published results and provide other resources, such as labels of various kinds. 2 Previous Work The research goal of mapping text to meaning representations in order to solve particular tasks has a long history. DARPA introduced the Airline Travel Information System (ATIS) in the early 90’s: there the task was to slot-fill flight-related information by modeling the intent of spoken language (see Tur et al., 2010, for a review). This data continues to be a used in the semantic modeling community (see, for example, Zettlemoyer and Collins, 2009). The Geoquery database contains 880 geographical facts about the US and has played a similar role for written (as opposed to spoken) natural language queries against a database (Zelle and Mooney, 1996) and it also continues to spur research (see for example Goldwasser et al., 2011), as does the similar Jobs database, which provides mappings of 640 sentences to a listing of jobs 194 (Tang and Mooney, 2001). More recently, Zweig and Burges (2012) provided a set of 1040 sentences that comprise an SAT-style multiple choice sentence completion task. The idea of using story-based reading comprehension questions to evaluate methods for machine reading itself goes back over a decade, when Hirschmann et al. (1999) showed that a bag of words approach, together with some heuristic linguistic modeling, could achieve 40% accuracy for the task of picking the sentence that best matches the query for “who / what / when / where / why” questions, on a small reading comprehension dataset from Remedia. This dataset spurred several research efforts, for example using reinforcement learning (Grois and Wilkins, 2005), named entity resolution (Harabagiu et al., 2003) and mapping questions and answers to logical form (Wellner et al., 2006). 
Work on story understanding itself goes back much further, to 1972, when Charniak proposed using a background model to answer questions about children’s stories. Similarly, the TREC (and TAC) Question Answering tracks (e.g., Voorhees and Tice, 1999) aim to evaluate systems on their ability to answer factual questions such as “Where is the Taj Mahal”. The QA4MRE task also aims to evaluate machine reading systems through question answering (e.g., Clark et al., 2012). Earlier work has also aimed at controlling the scope by limiting the text to children’s stories: Breck et al. (2001) collected 75 stories from the Canadian Broadcasting Corporation’s web site for children, and generated 650 questions for them manually, where each question was answered by a sentence in the text. Leidner et al. (2003) both enriched the CBC4kids data by adding several layers of annotation (such as semantic and POS tags), and measured QA performance as a function of question difficulty. For a further compendium of resources related to the story comprehension task, see Mueller (2010). The task proposed here differs from the above work in several ways. Most importantly, the data collection is scalable: if the dataset proves sufficiently useful to others, it would be straightforward to gather an order of magnitude more. Even the dataset size presented here is an order of magnitude larger than the Remedia or the CBC4kids data and many times larger than QA4MRE. Second, the multiple choice task presents less ambiguity (and is consequently easier to collect data for) than the ly from MC500 train set). task of finding the most appropriate sentence, and may be automatically evaluated. Further, our stories are fictional, which means that the information to answer the question is contained only in the story itself (as opposed to being able to directly leverage knowledge repositories such as Wikipedia). 195 This design was chosen to focus the task on the machine understanding of short passages, rather than the ability to match against an existing knowledge base. In addition, while in the CBC4kids data each answer was a sentence from the story, here we required that approximately half of the questions require at least two sentences from the text to answer; being able to control complexity in this way is a further benefit of using multiple choice answers. Finally, as explained in Section 1, the use of free-form input makes the problem open domain (as opposed to the ATIS, Geoquery and Jobs data), leading to the hope that solutions to the task presented here will be easier to apply to novel, unrelated tasks. 3 Generating the Stories and Questions Our aim was to generate a corpus of fictional story that could be scaled with as little expert input as possible. Thus, we designed the process to be gated by cost, and keeping the costs low was a high priority. Crowd-sourcing seemed particularly appropriate, given the nature of the task, so we opted to use Amazon Mechanical Turk2 (AMT). With over 500,000 workers3, it provides the work sets1 force required to both achieve scalability and, equally importantly, to provide diversity in the stories and types of questions. We restricted our task to AMT workers (workers) residing in the United States. The average worker is 36 years old, more educated than the United States population in general (Paolacci et al., 2010), and the majority of workers are female. 
3.1 The Story and Questions Workers were instructed to write a short (150-300 words) fictional story, and to write as if for a child in grade school. The choice of 150-300 was made to keep the task an appropriate size for workers while still allowing for complex stories and questions. The workers were free to write about any topic they desired (as long as it was appropriate for a young child), and so there is a wide range, including vacations, animals, school, cars, eating, gardening, fairy tales, spaceships, and cowboys. 1 We use the term “story set” to denote the fictional story together with its multiple choice questions, hypothetical answers, and correct answer labels. 2 http://www.mturk.com 3 https://requester.mturk.com/tour Workers were also asked to provide four reading comprehension questions pertaining to their story and, for each, four multiple-choice answers. Coming up with incorrect alternatives (distractors) is a difficult task (see, e.g., Agarwal, 2011) but workers were requested to provide “reasonable” incorrect answers that at least include words from the story so that their solution is not trivial. For example, for the question “What is the name of the dog?”, if only one of the four answers occurs in the story, then that answer must be the correct one. Finally, workers were asked to design their questions and answers such that at least two of the four questions required multiple sentences from the story to answer them. That is, for those questions it should not be possible to find the answer in any individual sentence. The motivation for this was to ensure that the task could not be fully solved using lexical techniques, such as word matching, alone. Whilst it is still possible that a sophisticated lexical analysis could completely solve the task, requiring that answers be constructed from at least two different sentences in the story makes this much less likely; our hope is that the solution will instead require some inference and some form of limited reasoning. This hope rests in part upon the observation that standardized reading comprehension tests, whose goal after all is to test comprehension, generally avoid questions that can be answered by reading a single sentence. 3.2 Automatic Validation Besides verifying that the story and all of the questions and answers were provided, we performed the following automatic validation before allowing the worker to complete the task: Limited vocabulary: The lowercase words in the story, questions, and answers were stemmed and checked against a vocabulary list of approximately 8000 words that a 7-year old is likely to know (Kuperman et al., 2012). Any words not on the list were highlighted in red as the worker typed, and the task could not be submitted unless all of the words satisfied this vocabulary criterion. To allow the use of arbitrary proper nouns, capitalized words were not checked against the vocabulary list. Multiple-sentence questions: As described earlier, we required that at least two of the questions need multiple sentences to answer. Workers were simply asked to mark whether a question needs one 196 or multiple sentences and we required that at least two are marked as multiple. 3.3 The Workers Workers were required to reside in the United States and to have completed 100 HITs with an over 95% approval The median worker took 22 minutes to complete the task. We paid workers $2.50 per story set and allowed each to do a maximum of 8 tasks (5 in MC500). 
We did not experiment with paying less, but this rate amounts to $6.82/hour, which is approximately the rate paid by other writing tasks on AMT at the time, though is also significantly higher than the median wage of $1.38 found in 2010 (Horton and Chilton, 2010). Workers could optionally leave feedback on the task, which was overwhelmingly positive – the most frequent non-stopword in the comments was “fun” and the most frequent phrase was “thank you”. The only negative comments (in <1% of submissions) were when the worker felt that a particular word should have been on the allowed vocabulary list. Given the positive feedback, it may be possible to pay less if we collect more data in the future. We did not enforce story length constraints, but some workers interpreted our suggestion that the story be 150-300 words as a hard rate4. constraint, and some asked to be able to write a longer story. The MCTest corpus contains two sets of stories, named MC160 and MC500, and containing 160 and 500 stories respectively. MC160 was gathered first, then some improvements were made before gathering MC500. We give details on the differences between these two sets below. 3.4 MC160: Manually Curated for Quality In addition to the details described above, MC160 workers were given a target elementary grade school level (1-4) and a sample story matching that grade level5. The intent was to produce a set of stories and questions that varied in difficulty so that research work can progress grade-by-grade if needed. However, we found little difference between grades in the corpus.. After gathering the stories, we manually curated the MC160 corpus by reading each story set and 4 The latter two are the default AMT requirements. 5 From http://www.englishforeveryone.org/. correcting errors. The most common mistakes were grammatical, though occasionally questions and/or answers needed to be fixed. 66% of the stories have at least one correction. We provide both the curated and original corpuses in order to allow research on reading comprehension in the presence of grammar, spelling, and other mistakes. 3.5 MC500: Adding a Grammar Test Though the construction of MC160 was successful, it requires a costly curation process which will not scale to larger data sets (although the curation was useful, both for improving the design of MC500, and for assessing the effectiveness of automated curation techniques). To more fully automate the process, we added two more stages: (1) A grammar test that automatically pre-screens workers for writing ability, and (2) a second Mechanical Turk task whereby new workers take the reading comprehension tests and rate their quality. We will discuss stage (2) in the next section. The grammar test consisted of 20 sentences, half of which had one grammatical error (see Figure 2). The incorrect sentences were written using common errors such as you ’re vs. your, using ‘s to indicate plurality, incorrect use of tense, it’ ’s vs. its, 197 NGoGramram mar TreTsets Q(u134a-.52l3i) tyaAn73ib m0o% aults Table 1. Pre-screening workers using a grammar test improves both quality and diversity of stories. Both differences are significant using the two-tailed t-test (p<0.05 for quality and p<0.01 for animals). less vs. fewer, I me, etc. Workers were required vs. to indicate for each sentence whether it was grammatically correct or not, and had to pass with at least 80% accuracy in order to qualify for the task. 
The 80% threshold was chosen to trade off worker quality with the rate at which the tasks would be completed; initial experiments using a threshold of 90% indicated that collecting 500 stories would take many weeks instead of days. Note that each worker is allowed to write at most 5 stores, so we required at least 100 workers to pass the qualification test. To validate the use of the qualification test, we gathered 30 stories requiring the test (qual) and 30 stories without. We selected a random set of 20 stories (10 from each), hid their origin, and then graded the overall quality of the story and questions from 1-5, meaning do not attempt to fix, bad but rescuable, has non-minor problems, has only minor problems, and has no problems, respectively. Results are shown in Table 1. The difference is statistically significant (p<0.05, using the twotailed t-test). The qual stories were also more diverse, with fewer of them about animals (the most common topic). Additional Modifications: Based on our experience curating MC160, we also made the following modifications to the task. In order to eliminate trivially-answerable questions, we required that each answer be unique, and that either the correct answer did not appear in the story or, if it did appear, that at least two of the incorrect answers also appeared in the story. This is to prevent questions that are trivially answered by checking which answer appears in the story. The condition on whether the correct answer appears is to allow questions such as “How many candies did Susan eat?”, where the total may never appear in the story, even though the information needed to derive it does. An answer is considered to appear in the story if at least half (rounded down) of its non-stopword terms appear in the story (ignoring word endings). This check is done automatically and must be satisfied before the worker is able to complete the task. Workers could also bypass the check if they felt it was incorrect, by adding a special term to their answer. We were also concerned that the sample story might bias the workers when writing the story set, particularly when designing questions that require multiple sentences to answer. So, we removed the sample story and grade level from the task. Finally, in order to encourage more diversity of stories, we added creativity terms, a set of 15 nouns chosen at random from the allowed vocabulary set. Workers were asked to “please consider” using one or more of the terms in their story, but use of the words was strictly optional. On average, workers used 3.9 of the creativity terms in their stories. 4 Rating the Stories and Questions In this section we discuss the crowd-sourced rating of story sets. We wished to ensure story set quality despite the fact that MC500 was only minimally manually curated (see below). Pre-qualifying workers with a grammar test was one step of this process. The second step was to have additional workers on Mechanical Turk both evaluate each story and take its corresponding test. Each story was evaluated in this way by 10 workers, each of whom provided scores for each of ageappropriateness (yes/maybe/no), grammaticality (few/some/many errors), and story clarity (excellent/reasonable/poor). When answering the four reading comprehension questions, workers could also mark a question as “unclear”. Each story set was rated by 10 workers who were each paid $0. 15 per set. 
Since we know the purportedly correct answer, we can estimate worker quality by measuring what fraction of questions that worker got right. Workers with less than 80% accuracy (ignoring those questions marked as unclear) were removed from the set. This constituted just 4.1% of the raters and 4.2% of the judgments (see Figure 3). Only one rater appeared to be an intentional spammer, answering 1056 questions with only 29% accuracy. The others primarily judged only one story. Only one worker fell between, answering 336 questions with just 75% accuracy. 198 Figure 3. Just 4.1% of raters had an accuracy below 80% (constituting 4.2% of the judgments). For the remaining workers (those who achieved at least 80% accuracy), we measured median story appropriateness, grammar, and clarity. For each category, stories for which less than half of the ratings were the best possible (e.g., excellent story clarity) were inspected and optionally removed from the data set. This required inspecting 40 (<10%) of the stories, only 2 of which were deemed poor enough to be removed (both of which had over half of the ratings all the way at the bot- tom end of the scale, indicating we could potentially have inspected many fewer stories with the same results). We also inspected questions for which at least 5 workers answered incorrectly, or answered “unclear”. In total, 29 questions (<2%) were inspected. 5 were fixed by changing the question, 8 by changing the answers, 2 by changing both, 6 by changing the story, and 8 were left unmodified. Note that while not fully automated, this process of inspecting stories and repairing questions took one person one day, so is still scalable to at least an order of magnitude more stories. 5 Dataset Analysis In Table 2, we present results demonstrating the value of the grammar test and curation process. As expected, manually curating MC160 resulted in increased grammar quality and percent of questions answered correctly by raters. The goal of MC500 was to find a more scalable method to achieve the same quality as the curated MC160. As Table 2 shows, the grammar test improved story grammar quality from 1.70 to 1.77 (both uncurated). The rating and one-day curation process in- TS51a06bet0l cu2r.ateAdvra1 Ag.9e8241aAgpe1 Cp.67la13r57oiptyrae1 Gn.7er8a90s74mǂ, satroy9C 567oc.lr397a eritcy, grammar quality (0-2, with 2 being best), and percent of questions answered correctly by raters, for the original and curated versions of the data. Bold indicates statistical significance vs. the original version of the same set, using the two-sample t-test with unequal variance. The indicates the only statistical difference between 500 curated and 160 curated. ǂ TMCaob lre5p10u63s. CoSr51tp06uise tawM2it06sreimt dcinsea gfnorS2Mt1o0C2r4Ay1v6eQ0raug7e8n.ds0W7tMionCrd5s0APne3.sr:4w er creases this to 1.79, whereas a fully manual curation results in a score of 1.84. Curation also improved the percent of questions answered correctly for both MC160 and MC500, but, unlike with grammar, there is no significant difference between the two curated sets. Indeed, the only statis- tically significant difference between the two is in grammar. So, the MC500 grammar test and curation process is a very scalable method for collecting stories of nearly the quality of the costly manual curation of MC160. We also computed correlations between these measures of quality and various factors such as story length and time spent writing the story. 
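
As a small illustration of the rater filter described above (workers kept only if they answered at least 80% of the non-"unclear" questions correctly), the following sketch computes per-worker accuracy over a made-up judgment list; the data layout is an assumption, not the format of the released rating files.

```python
# Sketch of the rater filter: compute each worker's accuracy over the
# questions they judged (ignoring "unclear" answers) and keep workers at
# or above 80%. The judgments list below is invented for illustration.
from collections import defaultdict

judgments = [  # (worker_id, question_id, chosen_answer, correct_answer)
    ("w1", "q1", "A", "A"), ("w1", "q2", "B", "B"),
    ("w2", "q1", "C", "A"), ("w2", "q2", "unclear", "B"),
]

correct = defaultdict(int)
total = defaultdict(int)
for worker, _, chosen, gold in judgments:
    if chosen == "unclear":          # unclear answers are not counted
        continue
    total[worker] += 1
    correct[worker] += int(chosen == gold)

kept = {w for w in total if correct[w] / total[w] >= 0.8}
print(kept)                          # e.g. {'w1'}
```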
On MC500, there is a mild correlation between a worker’s grammar test score and the judged grammar quality of that worker’s story (correlation of 0.24). Interestingly, this relation disappeared once MC500 was curated, likely due to repairing the stories with the worst grammar. On MC160, there is a mild correlation between the clarity and the number of words in the question and answer (0.20 and 0.18). All other correlations were below 0. 15. These factors could be integrated into an estimate for age-appropriateness, clarity, and grammar, potentially reducing the need for raters. Table 3 provides statistics on each corpus. MC160 and MC500 are similar in average number of words per story, question, and answer, as well as the median writing time. The most commonly used 199 Baseline Algorithms Require: Passage P, set of passage words PW, ith word in passage Pi, set of words in question Q, set of words in hypothesized answers A1..4, and set of stop words U, Define: ( ) ∑( ) Define: ( ) ( ( )). Algorithm 1 Sliding Window Algorithm 1 Sliding Window for i= 1to 4 do | | ∑ | |{ ( ) end for return Algorithm 2 Distance Based for i= 1to 4 do ( ) (( ) ) if | | else or | | | |( ), where ()is the minimum number of words between an occurrence of q and an occurrence of a in P, plus one. end if end for return Algorithm Return SW Algorithm SW+D Return Figure 4. The two lexical-based algorithms used for the baselines. nouns in MC500 are: day, friend, time, home, house, mother, dog, mom, school, dad, cat, tree, and boy. The stories vary widely in theme. The first 10 stories of the randomly-ordered MC500 set are about: travelling to Miami to visit friends, waking up and saying hello to pets, a bully on a schoolyard, visiting a farm, collecting insects at Grandpa’s house, planning a friend’s birthday party, selecting clothes for a school dance, keeping animals from eating your ice cream, animals ordering food, and adventures of a boy and his dog. TSMaAiblCnuge1tli460.Per5TcS9reW.an54it360 caonrQdS6e’W7c8Dst.4+1e7vfD5o:rthem465S8u.W4l28t93ip Tl4e0cQsht:o’S5i76Wc e.8+2q1D95ues- tions for MC160. SW: sliding window algorithm. SW+D: combined results with sliding window and distance based algorithms. Single/Multi: questions marked by worker as requiring a single/multiple sentence(s) to answer. All differences between SW and SW+D are significant (p<0.01 using the two-tailed paired t-test). TASMabiluCnge5tli0.Pe5T4rSc92a.We18ni304t ac0noSQrd56W’8e1D.sc2+7et8Dv1fo:rt5hS1eW.85m603uTletQsiSp5W:l’76es.31+c570hDoiceS65Wq0A7u.4l5+e 3Ds- tions for MC500, notation as above. All differences between SW and SW+D are significant (p<0.01, tested as above). We randomly divided MC160 and MC500 into train, development, and test sets of 70, 30, and 60 stories and 300, 50, and 150 stories, respectively. 6 Baseline System and Results We wrote two baseline systems, both using only simple lexical features. The first system used a sliding window, matching a bag of words constructed from the question and hypothesized answer to the text. Since this ignored long range dependencies, we added a second, word-distance based algorithm. The distance-based score was simply subtracted from the window-based score to arrive at the final score (we tried scaling the distance score before subtraction but this did not improve results on the MC160 train set). The algorithms are summarized in Figure 4. A coin flip is used to break ties. The use of inverse word counts was inspired by TF-IDF. Results for MC160 and MC500 are shown in Table 4 and Table 5. 
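
The baseline description above (Figure 4) is partly garbled by PDF extraction, so the following Python sketch is only an approximate reconstruction of the sliding-window and distance-based lexical baselines: a window the size of the question-plus-answer word set is scored with inverse-count weights, and a normalized word-distance penalty is subtracted from it. Details such as the exact inverse-count weighting and the plus-one in the distance term may differ from the original.

```python
# Approximate reconstruction of the two lexical baselines (SW and SW+D).
# Treat this as an illustration, not a reference implementation.
import math

def ic(word, counts):
    """Inverse-count weight of a word, log(1 + 1/count)."""
    return math.log(1.0 + 1.0 / counts[word])

def sliding_window_score(passage, target_words):
    counts = {w: passage.count(w) for w in set(passage)}
    size = len(target_words)
    best = 0.0
    for j in range(len(passage)):
        window = passage[j:j + size]
        best = max(best, sum(ic(w, counts) for w in window if w in target_words))
    return best

def distance_score(passage, q_words, a_words, stopwords):
    q = {w for w in q_words if w in passage and w not in stopwords}
    a = {w for w in a_words if w in passage and w not in stopwords}
    if not q or not a:
        return 1.0
    positions = lambda ws: [i for i, w in enumerate(passage) if w in ws]
    d = min(abs(i - j) for i in positions(q) for j in positions(a))
    return d / float(len(passage) - 1)

def answer_scores(passage, question, answers, stopwords=frozenset()):
    scores = []
    for ans in answers:
        sw = sliding_window_score(passage, set(question) | set(ans))
        scores.append(sw - distance_score(passage, question, ans, stopwords))
    return scores

# Toy example; the best-scoring answer should be the one supported by the text.
passage = "james went to the store and pulled pudding off the shelves".split()
question = "what did james pull off the shelves".split()
answers = [["pudding"], ["books"], ["toys"], ["bread"]]
print(answer_scores(passage, question, answers))
```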
The MC160 train and development sets were used for tuning. The baseline algorithm was authored without seeing any portion of MC500, so both the MC160 test set and all of 200 BRCoaTsmEelbin e d(SW+D)65 M967C. 76219506ǂ 0Test5 6M603C. 685 7320ǂ 0Test Table 6. Percent correct for MC160 and MC500 test sets. The ǂ indicates statistical significance vs. baseline (p<0.01 using the two-tailed paired t-test). MC160 combined vs. baseline has p-value 0.063. MC500 were used for testing (although we nevertheless report results on the train/test split). Note that adding the distance based algorithm improved accuracy by approximately 10% absolute on MC160 and approximately 6% on MC500. Overall, error rates on MC500 are higher than on MC160, which agrees with human performance (see Table 2), suggesting that MC500’s questions are more difficult. 7 Recognizing Textual Entailment Results We also tried using a “recognizing textual entailment” (RTE) system to answer MCTest questions. The goal of RTE (Dagan et al., 2005) is to determine whether a given statement can be inferred from a particular text. We can cast MCTest as an RTE task by converting each question-answer pair into a statement, and then selecting the answer whose statement has the highest likelihood of being entailed by the story. For example, in the sample story given in Figure 1, the second question can be converted into four statements (one for each answer), and the RTE system should select the statement “James pulled pudding off of the shelves in the grocery store” as the most likely one. For converting question-answer pairs to statements, we used the rules employed in a web-based question answering system (Cucerzan and Agichtein, 2005). For RTE, we used BIUTEE (Stern and Dagan, 2011), which performs better than the median system in the past four RTE competitions. We ran BIUTEE both in its default configuration, as well as with its optional additional data sources (FrameNet, ReVerb, DIRT, and others as found on the BIUTEE home page). The default configuration performed better so we present its results here. The results in Table 6 show that the RTE method performed worse than the baseline. We also combined the baseline and RTE system by training BIUTEE on the train set and using the development set to optimize a linear combination of BIUTEE with the baseline; the combined system outperforms either component system on MC500. It is possible that with some tuning, an RTE system will outperform our baseline system. Nevertheless, these RTE results, and the performance of the baseline system, both suggest that the reading comprehension task described here will not be trivially solved by off-the-shelf techniques. 8 Making Data and Results an Ongoing Resource Our goal in constructing this data is to encourage research and innovation in the machine comprehension of text. Thus, we have made both MC160 and MC500 freely available for download at http://research.microsoft.com/mct. To our knowledge, these are the largest copyright-free reading comprehension data sets publicly available. To further encourage research on these data, we will be continually updating the webpage with the bestknown published results to date, along with pointers to those publications. One of the difficulties in making progress on a particular task is implementing previous work in order to apply improvements to it. To mitigate this difficulty, we are encouraging researchers who use the data to (optionally) provide per-answer scores from their system. 
Doing so has three benefits: (a) a new system can be measured in the context of the errors made by the previous systems, allowing each research effort to incrementally add useful functionality without needing to also re-implement the current state-of-the-art; (b) it allows system performance to be measured using paired statistical testing, which will substantially increase the ability to determine whether small improvements are significant; and (c) it enables researchers to perform error analysis on any of the existing systems, simplifying the process of identifying and tackling common sources of error. We will also periodically ensemble the known systems using standard machine learning techniques and make those results available as well (unless the existing state-of-theart already does such ensembling). The released data contains the stories and questions, as well as the results from workers who rated 201 the stories and took the tests. The latter may be used, for example, to measure machine performance vs. human performance on a per-question basis (i.e., does your algorithm make similar mistakes to humans?), or vs. the judged clarity of each story. The ratings, as well as whether a question needs multiple sentences to answer, should typically only be used in evaluation, since such information is not generally available for most text. We will also provide an anonymized author id for each story, which could allow additional research such as using other works by the same author when understanding a story, or research on authorship attribution (e.g., Stamatatos, 2009). 9 Future Work We plan to use this dataset to evaluate approaches for machine comprehension, but are making it available now so that others may do the same. If MCTest is used we will collect more story sets and will continue to refine the collection process. One interesting research direction is ensuring that the questions are difficult enough to challenge state-ofthe-art techniques as they develop. One idea for this is to apply existing techniques automatically during story set creation to see whether a question is too easily answered by a machine. By requiring authors to create difficult questions, each data set will be made more and more difficult (but still answerable by humans) as the state-of-the-art methods advance. We will also experiment with timing the raters as they answer questions to see if we can find those that are too easy for people to answer. Removing such questions may increase the difficulty for machines as well. Additionally, any divergence between how easily a person answers a question vs. how easily a machine does may point toward new techniques for improving machine comprehension; we plan to conduct research in this direction as well as make any such data available for others. 10 Conclusion We present the MCTest dataset in the hope that it will help spur research into the machine comprehension of text. The metric (the accuracy on the question sets) is clearly defined, and on that metric, lexical baseline algorithms only attain approximately 58% correct on test data (the MC500 set) as opposed to the 100% correct that the majority of crowd-sourced judges attain. A key component of MCTest is the scalable design: we have shown that data whose quality approaches that of expertly cu- rated data can be generated using crowd sourcing coupled with expert correction of worker-identified errors. 
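
Benefit (b) above, paired statistical testing, becomes straightforward once two systems publish per-question results on the same question set. The sketch below runs a two-tailed paired t-test on invented 0/1 correctness vectors; scipy is assumed to be available, and real comparisons would use the released per-answer score files.

```python
# Paired comparison of two systems that report per-question correctness
# (1 = correct, 0 = wrong) on the same question set. The vectors are made up.
from scipy.stats import ttest_rel

system_a = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]   # hypothetical per-question results
system_b = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]

t_stat, p_value = ttest_rel(system_a, system_b)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```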
Should MCTest prove useful to the community, we will continue to gather data, both to increase the corpus size, and to keep the test sets fresh. The data is available at http://research.microsoft.com/mct and any submitted results will be posted there too. Because submissions will be requested to include the score for each test item, researchers will easily be able to compare their systems with those of others, and investigation of ensembles comprised of components from several different teams will be straightforward. MCTest also contains supplementary material that researchers may find useful, such as worker accuracies on a grammar test and crowd-sourced measures of the quality of their stories. Acknowledgments We would like to thank Silviu Cucerzan and Lucy Vanderwende for their help with converting questions to statements and other useful discussions. References M. Agarwal and P. Mannem. 2011. Automatic Gap-fill Question Generation from Text Books. In Proceed- ings of the Sixth Workshop on Innovative Use of NLP for Building Educational Applications, 56–64. E. Breck, M. Light, G.S.Mann, E. Riloff, B. Brown, P. Anand, M. Rooth M. Thelen. 2001. Looking under the hood: Tools for diagnosing your question answering engine. In Proceedings of the workshop on Opendomain question answering, 12, 1-8. E. Charniak. 1972. Toward a Model of Children’s Story Comprehension. Technical Report, 266, MIT Artificial Intelligence Laboratory, Cambridge, MA. P. Clark, P. Harrison, and X. Yao. An Entailment-Based Approach to the QA4MRE Challenge. 2012. In Proceedings of the Conference and Labs of the Evaluation Forum (CLEF) 2012. S. Cucerzan and E. Agichtein. 2005. Factoid Question Answering over Unstructured and Structured Content on the Web. In Proceedings of the Fourteenth Text Retrieval Conference (TREC). I. Dagan, O. Glickman, and B. Magnini. 2006. The PASCAL Recognising Textual Entailment Challenge. In J. Quiñonero-Candela, I. Dagan, B. Magnini, F. d'Alché-Buc (Eds.), Machine Learning 202 Challenges. Lecture Notes in Computer Science, Vol. 3944, pp. 177-190, Springer. D. Goldwasser, R. Reichart, J. Clarke, D. Roth. 2011. Confidence Driven Unsupervised Semantic Parsing. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, 1486-1495. E. Grois and D.C. Wilkins. 2005. Learning Strategies for Story Comprehension: A Reinforcement Learning Approach. In Proceedings of the Twenty Second International Conference on Machine Learning, 257264. S.M. Harabagiu, S.J. Maiorano, and M.A. Pasca. 2003. Open-Domain Textual Question Answering Techniques. Natural Language Engineering, 9(3): 1-38. Cambridge University Press, Cambridge, UK. L. Hirschman, M. Light, E. Breck, and J.D. Burger. 1999. Deep Read: A Reading Comprehension System. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL), 325-332. J. Horton and L. Chilton. 2010. The labor economics of paid crowdsourcing. In Proceedings of the 11th ACM Conference on Electronic Commerce, 209-218. V. Kuperman, H. Stadthagen-Gonzalez, M. Brysbaert. 2012. Age-of-acquisition ratings for 30,000 English words. Behavior Research Methods, 44(4):978-990. J.L. Leidner, T. Dalmas, B. Webber, J. Bos, C. Grover. 2003. Automatic Multi-Layer Corpus Annotation for Evaluating Question Answering Methods: CBC4Kids. In Proceedings of the 3rd International Workshop on Linguistically Interpreted Corpora. E.T. Mueller. 2010. Story Understanding Resources. http://xenia.media.mit.edu/~mueller/storyund/storyre s.html. G. 
Paolacci, J. Chandler, and P. Iperirotis. 2010. Running experiments on Amazon Mechanical Turk. Judgment and Decision Making. 5(5):41 1-419. E. Stamatatos. 2009. A survey of modern authorship attribution methods. J. Am. Soc. Inf. Sci., 60:538– 556. A. Stern and I. Dagan. 2011. A Confidence Model for Syntactically-Motivated Entailment Proofs. In Proceedings of Recent Advances in Natural Language Processing (RANLP). L.R. Tang and R.J. Mooney. 2001. Using Multiple Clause Constructors in Inductive Logic Programming for Semantic Parsing. In Proceedings of the European Conference on Machine Learning (ECML), 466-477. G. Tur, D. Hakkani-Tur, and L.Heck. 2010. What is left to be understood in ATIS? Spoken Language Technology Workshop, 19-24. E.M. Voorhees and D.M. Tice. 1999. The TREC-8 Question Answering Track Evaluation. In Proceedings of the Eighth Text Retrieval Conference (TREC8). 12th Wellner, L. Ferro, W. Greiff, and L. Hirschman. 2005. Reading comprehension tests for computerbased understand evaluation. Natural Language Engineering, 12(4):305-334. Cambridge University Press, Cambridge, UK. J.M. Zelle and R.J. Mooney. 1996. Learning to Parse Database Queries using Inductive Logic Programming. In Proceedings of the Thirteenth National Conference on Artificial Intelligence (AAAI), 10501055. B. Zettlemoyer and M. Collins. 2009. Learning Context-Dependent Mappings from Sentences to Logical Form. In Proceedings of the 47th Annual Meeting of the Association for Computation Linguistics (ACL), 976-984. G. Zweig and C.J.C. Burges. 2012. A Challenge Set for Advancing Language Modeling. In Proceedings of the Workshop on the Future of Language Modeling for HLT, NAACL-HLT. L.S. 203

3 0.77276289 155 emnlp-2013-Question Difficulty Estimation in Community Question Answering Services

Author: Jing Liu ; Quan Wang ; Chin-Yew Lin ; Hsiao-Wuen Hon

Abstract: In this paper, we address the problem of estimating question difficulty in community question answering services. We propose a competition-based model for estimating question difficulty by leveraging pairwise comparisons between questions and users. Our experimental results show that our model significantly outperforms a PageRank-based approach. Most importantly, our analysis shows that the text of question descriptions reflects the question difficulty. This implies the possibility of predicting question difficulty from the text of question descriptions.
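
The abstract describes a competition-based model built from pairwise comparisons between questions and users, but does not spell the model out here. As a loose, purely illustrative stand-in (not the authors' method), an Elo-style rating update can encode the same intuition: a question "wins" when a user fails to answer it, so questions that defeat strong users end up with high difficulty ratings.

```python
# Illustrative Elo-style update for question difficulty; the paper's actual
# competition-based model may differ substantially from this sketch.
def elo_update(question_rating, user_rating, question_won, k=32.0):
    expected = 1.0 / (1.0 + 10 ** ((user_rating - question_rating) / 400.0))
    outcome = 1.0 if question_won else 0.0
    delta = k * (outcome - expected)
    return question_rating + delta, user_rating - delta

q, u = 1500.0, 1500.0
for user_failed in [True, True, False]:   # hypothetical answer attempts
    q, u = elo_update(q, u, question_won=user_failed)
print(round(q, 1), round(u, 1))           # higher q => harder question
```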

4 0.75473911 180 emnlp-2013-The Answer is at your Fingertips: Improving Passage Retrieval for Web Question Answering with Search Behavior Data

Author: Mikhail Ageev ; Dmitry Lagun ; Eugene Agichtein

Abstract: Passage retrieval is a crucial first step of automatic Question Answering (QA). While existing passage retrieval algorithms are effective at selecting document passages most similar to the question, or those that contain the expected answer types, they do not take into account which parts of the document the searchers actually found useful. We propose, to the best of our knowledge, the first successful attempt to incorporate searcher examination data into passage retrieval for question answering. Specifically, we exploit detailed examination data, such as mouse cursor movements and scrolling, to infer the parts of the document the searcher found interesting, and then incorporate this signal into passage retrieval for QA. Our extensive experiments and analysis demonstrate that our method significantly improves passage retrieval, compared to using textual features alone. As an additional contribution, we make available to the research community the code and the search behavior data used in this study, with the hope of encouraging further research in this area.

5 0.56408077 59 emnlp-2013-Deriving Adjectival Scales from Continuous Space Word Representations

Author: Joo-Kyung Kim ; Marie-Catherine de Marneffe

Abstract: Continuous space word representations extracted from neural network language models have been used effectively for natural language processing, but until recently it was not clear whether the spatial relationships of such representations were interpretable. Mikolov et al. (2013) show that these representations do capture syntactic and semantic regularities. Here, we push the interpretation of continuous space word representations further by demonstrating that vector offsets can be used to derive adjectival scales (e.g., okay < good < excellent). We evaluate the scales on the indirect answers to yes/no questions corpus (de Marneffe et al., 2010). We obtain 72.8% accuracy, which outperforms previous results (∼60%) on this corpus and highlights the quality of the scales extracted, providing further support that the continuous space word representations are meaningful.

6 0.49683926 17 emnlp-2013-A Walk-Based Semantically Enriched Tree Kernel Over Distributed Word Representations

7 0.42697486 196 emnlp-2013-Using Crowdsourcing to get Representations based on Regular Expressions

8 0.41460699 188 emnlp-2013-Tree Kernel-based Negation and Speculation Scope Detection with Structured Syntactic Parse Features

9 0.39322397 7 emnlp-2013-A Hierarchical Entity-Based Approach to Structuralize User Generated Content in Social Media: A Case of Yahoo! Answers

10 0.39051104 5 emnlp-2013-A Discourse-Driven Content Model for Summarising Scientific Articles Evaluated in a Complex Question Answering Task

11 0.38353038 18 emnlp-2013-A temporal model of text periodicities using Gaussian Processes

12 0.35610417 49 emnlp-2013-Combining Generative and Discriminative Model Scores for Distant Supervision

13 0.33197209 161 emnlp-2013-Rule-Based Information Extraction is Dead! Long Live Rule-Based Information Extraction Systems!

14 0.32834506 189 emnlp-2013-Two-Stage Method for Large-Scale Acquisition of Contradiction Pattern Pairs using Entailment

15 0.3275722 173 emnlp-2013-Simulating Early-Termination Search for Verbose Spoken Queries

16 0.32754999 179 emnlp-2013-Summarizing Complex Events: a Cross-Modal Solution of Storylines Extraction and Reconstruction

17 0.29954553 166 emnlp-2013-Semantic Parsing on Freebase from Question-Answer Pairs

18 0.28497124 45 emnlp-2013-Chinese Zero Pronoun Resolution: Some Recent Advances

19 0.283337 108 emnlp-2013-Interpreting Anaphoric Shell Nouns using Antecedents of Cataphoric Shell Nouns as Training Data

20 0.27668855 203 emnlp-2013-With Blinkers on: Robust Prediction of Eye Movements across Readers


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.028), (18, 0.033), (22, 0.038), (29, 0.011), (30, 0.086), (50, 0.016), (51, 0.159), (66, 0.035), (71, 0.02), (75, 0.382), (77, 0.023), (96, 0.055), (97, 0.015)]
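
The pairs above are (topicId, topicWeight) entries from the LDA model for this paper. The sketch below shows how such a per-document topic distribution could be obtained with gensim's LdaModel; the toy corpus, number of topics, and random seed are placeholders rather than the site's real configuration.

```python
# A sketch of where (topicId, topicWeight) pairs like the list above could
# come from: an LDA model's topic distribution for one paper's text.
from gensim import corpora, models

docs = [
    "answer selection tree kernels question answering".split(),
    "event storyline generation summarization timeline".split(),
    "paraphrase news streams event relations".split(),
]
dictionary = corpora.Dictionary(docs)
bows = [dictionary.doc2bow(d) for d in docs]

lda = models.LdaModel(bows, id2word=dictionary, num_topics=3, random_state=0)
print(lda.get_document_topics(bows[0], minimum_probability=0.0))
```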

similar papers list:

simIndex simValue paperId paperTitle

1 0.94295448 147 emnlp-2013-Optimized Event Storyline Generation based on Mixture-Event-Aspect Model

Author: Lifu Huang ; Lian'en Huang

Abstract: Recently, much research focuses on event storyline generation, which aims to produce a concise, global and temporal event summary from a collection of articles. Generally, each event contains multiple sub-events and the storyline should be composed by the component summaries of all the sub-events. However, different sub-events have different part-whole relationship with the major event, which is important to correspond to users’ interests but seldom considered in previous work. To distinguish different types of sub-events, we propose a mixture-event-aspect model which models different sub-events into local and global aspects. Combining these local/global aspects with summarization requirements together, we utilize an optimization method to generate the component summaries along the timeline. We develop experimental systems on 6 distinctively different datasets. Evaluation and comparison results indicate the effectiveness of our proposed method.

2 0.91429865 93 emnlp-2013-Harvesting Parallel News Streams to Generate Paraphrases of Event Relations

Author: Congle Zhang ; Daniel S. Weld

Abstract: The distributional hypothesis, which states that words that occur in similar contexts tend to have similar meanings, has inspired several Web mining algorithms for paraphrasing semantically equivalent phrases. Unfortunately, these methods have several drawbacks, such as confusing synonyms with antonyms and causes with effects. This paper introduces three Temporal Correspondence Heuristics, which characterize regularities in parallel news streams, and shows how they may be used to generate high precision paraphrases for event relations. We encode the heuristics in a probabilistic graphical model to create the NEWSSPIKE algorithm for mining news streams. We present experiments demonstrating that NEWSSPIKE significantly outperforms several competitive baselines. In order to spur further research, we provide a large annotated corpus of timestamped news articles as well as the paraphrases produced by NEWSSPIKE.

same-paper 3 0.90706646 31 emnlp-2013-Automatic Feature Engineering for Answer Selection and Extraction

Author: Aliaksei Severyn ; Alessandro Moschitti

Abstract: This paper proposes a framework for automatically engineering features for two important tasks of question answering: answer sentence selection and answer extraction. We represent question and answer sentence pairs with linguistic structures enriched by semantic information, where the latter is produced by automatic classifiers, e.g., question classifier and Named Entity Recognizer. Tree kernels applied to such structures enable a simple way to generate highly discriminative structural features that combine syntactic and semantic information encoded in the input trees. We conduct experiments on a public benchmark from TREC to compare with previous systems for answer sentence selection and answer extraction. The results show that our models greatly improve on the state of the art, e.g., up to 22% on F1 (relative improvement) for answer extraction, while using no additional resources and no manual feature engineering.
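As a rough illustration of the tree-kernel computation this abstract relies on, the sketch below counts shared subtree fragments between two toy question/answer trees in the spirit of the subset-tree kernel of Collins and Duffy (2002); the tree encoding, decay factor and example trees are simplifications, not the authors' implementation.

```python
# Minimal subset-tree (SST) kernel sketch.
# Trees are nested tuples: (label, child1, child2, ...); leaves are strings.
# The decay factor LAMBDA and the toy trees are illustrative assumptions.
LAMBDA = 0.4

def nodes(tree):
    """All internal nodes (tuples) of a tree."""
    if isinstance(tree, str):
        return []
    result = [tree]
    for child in tree[1:]:
        result.extend(nodes(child))
    return result

def production(node):
    """A node's label plus the labels of its immediate children."""
    return (node[0],) + tuple(c if isinstance(c, str) else c[0] for c in node[1:])

def delta(n1, n2):
    """Decayed count of common subtree fragments rooted at n1 and n2."""
    if isinstance(n1, str) or isinstance(n2, str):
        return 0.0                                   # fragments are rooted at internal nodes
    if production(n1) != production(n2):
        return 0.0
    if all(isinstance(c, str) for c in n1[1:]):      # pre-terminal node
        return LAMBDA
    prod = LAMBDA
    for c1, c2 in zip(n1[1:], n2[1:]):
        prod *= 1.0 + delta(c1, c2)
    return prod

def tree_kernel(t1, t2):
    return sum(delta(a, b) for a in nodes(t1) for b in nodes(t2))

q = ("S", ("WHNP", ("WP", "What")), ("VP", ("VBZ", "is"), ("NP", ("NN", "name"))))
a = ("S", ("NP", ("NN", "name")), ("VP", ("VBZ", "is"), ("NP", ("NNP", "Chapman"))))
print(tree_kernel(q, a))
```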

4 0.85539615 117 emnlp-2013-Latent Anaphora Resolution for Cross-Lingual Pronoun Prediction

Author: Christian Hardmeier ; Jörg Tiedemann ; Joakim Nivre

Abstract: This paper addresses the task of predicting the correct French translations of third-person subject pronouns in English discourse, a problem that is relevant as a prerequisite for machine translation and that requires anaphora resolution. We present an approach based on neural networks that models anaphoric links as latent variables and show that its performance is competitive with that of a system with separate anaphora resolution while not requiring any coreference-annotated training data. This demonstrates that the information contained in parallel bitexts can successfully be used to acquire knowledge about pronominal anaphora in an unsupervised way.
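A minimal sketch of the latent-link idea this abstract describes: each antecedent candidate is scored, a softmax over candidates gives a soft antecedent choice, and the probability-weighted antecedent representation feeds a classifier over the six French pronoun classes, so anaphora resolution is trained only from the pronoun-prediction signal. All sizes, weights and features below are illustrative placeholders, not the authors' architecture.

```python
# Minimal sketch of pronoun prediction with a latent antecedent choice.
# Sizes, weights, and features are random placeholders for illustration only.
import numpy as np

rng = np.random.default_rng(0)
CLASSES = ["ce", "elle", "elles", "il", "ils", "OTHER"]

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Inputs: source-pronoun context features and K antecedent candidates,
# each with an anaphoric-link feature vector and a target-word embedding.
context = rng.normal(size=32)          # pronoun + surrounding words (encoded)
K = 3
link_feats = rng.normal(size=(K, 10))  # e.g. distance, gender/number match
ante_embed = rng.normal(size=(K, 20))  # embeddings of aligned target words

# Latent anaphora resolution: score each candidate, softmax over candidates.
W_link = rng.normal(size=(10, 1))
p_antecedent = softmax(link_feats @ W_link).ravel()        # shape (K,)

# Soft antecedent representation = probability-weighted average embedding.
antecedent = p_antecedent @ ante_embed                     # shape (20,)

# Pronoun classifier on [context ; antecedent].
h_in = np.concatenate([context, antecedent])
W_h, W_out = rng.normal(size=(52, 50)), rng.normal(size=(50, len(CLASSES)))
hidden = 1.0 / (1.0 + np.exp(-(h_in @ W_h)))               # logistic units
probs = softmax(hidden @ W_out)

print(dict(zip(CLASSES, np.round(probs, 3))))
print("soft antecedent choice:", np.round(p_antecedent, 3))
```

Training such a model end-to-end, backpropagating the classification loss through the candidate softmax, is what removes the need for coreference-annotated data.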

5 0.65620261 80 emnlp-2013-Exploiting Zero Pronouns to Improve Chinese Coreference Resolution

Author: Fang Kong ; Hwee Tou Ng

Abstract: Coreference resolution plays a critical role in discourse analysis. This paper focuses on exploiting zero pronouns to improve Chinese coreference resolution. In particular, a simplified semantic role labeling framework is proposed to identify clauses and to detect zero pronouns effectively, and two effective methods (refining the syntactic parser and refining learning example generation) are employed to exploit zero pronouns for Chinese coreference resolution. Evaluation on the CoNLL-2012 shared task data set shows that zero pronouns can significantly improve Chinese coreference resolution.

6 0.6485076 65 emnlp-2013-Document Summarization via Guided Sentence Compression

7 0.63068867 194 emnlp-2013-Unsupervised Relation Extraction with General Domain Knowledge

8 0.62670213 193 emnlp-2013-Unsupervised Induction of Cross-Lingual Semantic Relations

9 0.62258887 45 emnlp-2013-Chinese Zero Pronoun Resolution: Some Recent Advances

10 0.612719 56 emnlp-2013-Deep Learning for Chinese Word Segmentation and POS Tagging

11 0.61171246 48 emnlp-2013-Collective Personal Profile Summarization with Social Networks

12 0.61089659 179 emnlp-2013-Summarizing Complex Events: a Cross-Modal Solution of Storylines Extraction and Reconstruction

13 0.60692036 113 emnlp-2013-Joint Language and Translation Modeling with Recurrent Neural Networks

14 0.6038987 68 emnlp-2013-Effectiveness and Efficiency of Open Relation Extraction

15 0.5959087 118 emnlp-2013-Learning Biological Processes with Global Constraints

16 0.59102821 156 emnlp-2013-Recurrent Continuous Translation Models

17 0.58944255 36 emnlp-2013-Automatically Determining a Proper Length for Multi-Document Summarization: A Bayesian Nonparametric Approach

18 0.57859105 160 emnlp-2013-Relational Inference for Wikification

19 0.57233477 64 emnlp-2013-Discriminative Improvements to Distributional Sentence Similarity

20 0.57044291 75 emnlp-2013-Event Schema Induction with a Probabilistic Entity-Driven Model