emnlp emnlp2013 emnlp2013-155 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jing Liu ; Quan Wang ; Chin-Yew Lin ; Hsiao-Wuen Hon
Abstract: In this paper, we address the problem of estimating question difficulty in community question answering services. We propose a competition-based model for estimating question difficulty by leveraging pairwise comparisons between questions and users. Our experimental results show that our model significantly outperforms a PageRank-based approach. Most importantly, our analysis shows that the text of question descriptions reflects the question difficulty. This implies the possibility of predicting question difficulty from the text of question descriptions.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract In this paper, we address the problem of estimating question difficulty in community question answering services. [sent-12, score-1.376]
2 We propose a competition-based model for estimating question difficulty by leveraging pairwise comparisons between questions and users. [sent-13, score-0.989]
3 Most importantly, our analysis shows that the text of question descriptions reflects the question difficulty. [sent-15, score-0.809]
4 This implies the possibility of predicting question difficulty from the text of question descriptions. [sent-16, score-1.208]
5 1 Introduction In recent years, community question answering (CQA) services such as Stack Overflow and Yahoo! Answers have emerged. [sent-17, score-0.615]
6 A great deal of research effort has been conducted on CQA, including: (1) question search (Xue et al. [sent-19, score-0.374]
7 , 2008); (3) user expertise estimation (Jurczyk and Agichtein, 2007; Zhang et al. [sent-30, score-0.381]
8 However, less attention has been paid to question difficulty estimation in CQA. [sent-41, score-0.814]
9 Question difficulty estimation can benefit many applications: (1) Experts are usually under time constraints. [sent-42, score-0.44]
10 We do not want to bore experts by routing every question (including both easy and hard ones) to them. [sent-43, score-0.533]
11 Assigning questions to experts by matching question difficulty with expertise level, not just question topic, will make better use of the experts’ time and expertise (Ackerman and McDonald, 1996). [sent-44, score-1.661]
12 Nam et al. (2009) found that winning the point awards offered by the reputation system is a driving factor in user participation in CQA. [sent-46, score-0.14]
13 Question difficulty estimation would be helpful in designing a better incentive mechanism by assigning higher point awards to more difficult questions. [sent-47, score-0.475]
14 (3) Question difficulty estimation can help analyze user behavior in CQA, since users may make strategic choices when encountering questions of different difficulty levels. [sent-48, score-1.137]
15 To the best of our knowledge, not much research has been conducted on the problem of estimating question difficulty in CQA. [sent-49, score-0.829]
16 The closest prior work proposed a PageRank-based approach (2008) to estimate task difficulty in crowdsourcing contest services. [sent-51, score-0.39]
17 Their key idea is to construct a graph of tasks: creating an edge from a task t1 to a task t2 when a user u wins task t1 but loses task t2, implying that task t2 is likely to be more difficult than task t1. [sent-52, score-0.186]
18 This approach implicitly assumes that task difficulty is the only factor affecting the outcomes of competitions. [sent-56, score-0.544]
19 However, the outcomes of competitions depend on both the difficulty levels of tasks and the expertise levels of competitors. [sent-59, score-0.876]
20 (2011), we propose a competition-based approach which jointly models question difficulty and user expertise level. [sent-65, score-1.095]
21 Given the two assumptions, we can determine the question difficulty score and user expertise score through pairwise comparisons between (1) a question and an asker, (2) a question and a best answerer, (3) a best answerer and an asker, and (4) a best answerer and all other non-best answerers. [sent-67, score-2.394]
22 The main contributions of this paper are: • We propose a competition-based approach to estimate question difficulty (Sec. [sent-68, score-0.764]
23 • We compare our model with the PageRank-based approach (2008) for estimating question difficulty on the data of Stack Overflow (Sec. [sent-71, score-0.829]
24 • Additionally, we calibrate question difficulty scores across two CQA services to verify the effectiveness of our model (Sec. [sent-74, score-0.852]
25 • Most importantly, we demonstrate that different words or tags in the question descriptions indicate different question difficulty levels. [sent-77, score-1.199]
26 This implies the possibility of predicting question difficulty purely from the text of question descriptions (Sec. [sent-78, score-1.269]
27 2 Competition-based Question Difficulty Estimation CQA is a virtual community where people can ask questions and seek opinions from others. [sent-81, score-0.276]
28 Formally, when an asker ua posts a question q, several answerers will answer her question. [sent-82, score-0.93]
29 One of the received answers will be selected as the best answer, either by the asker ua or by community vote. [sent-83, score-0.504]
30 The user who provides the best answer is called the best answerer ub, and we denote the set of all non-best answerers as S = {uo1 , · · · , uoM }. [sent-84, score-0.561]
31 This is intuitive since the best answerer ub correctly responds to question q, which the asker ua does not know how to answer. [sent-86, score-0.97]
32 • The expertise score of the best answerer ub is higher than that of the asker ua and all answerers in S. [sent-87, score-0.935]
33 This is straightforward since the best answerer ub solves question q better than the asker ua and all non-best answerers in S. [sent-88, score-1.286]
34 Additionally, pseudo user uq wins the first competition and the best answerer ub wins all remaining (|S| + 2) competitions. [sent-91, score-0.95]
35 Hence, the problem of estimating the question difficulty score (and the user expertise score) is cast as a problem of learning the relative skills of players from the win-loss results of the generated two-player competitions. [sent-92, score-1.312]
36 Formally, let Q denote the set of all questions in one category (or topic), and Rq denote the set of all two-player competitions generated from question q ∈ Q, i.e., [sent-93, score-0.616]
37 Rq = {(ua ≺ uq), (uq ≺ ub), (ua ≺ ub), (uo1 ≺ ub), · · · , (uo|S| ≺ ub)}, where uj ≺ ui means that user i beats user j in the competition. [sent-95, score-0.21]
38 Our problem is then to learn the relative skills of players from R, the union of all Rq. [sent-97, score-0.152]
39 The learned skills of the pseudo question users are question difficulty scores, and the learned skills of all other users are their expertise scores. [sent-98, score-1.714]
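To make the construction of Rq concrete, here is a minimal Python sketch; the identifiers and the pseudo-user id scheme are hypothetical, not the authors' code. It generates the (winner, loser) pairs for one resolved question: the pseudo question user uq beats the asker, and the best answerer beats uq, the asker, and every non-best answerer.

```python
from typing import List, Tuple

def generate_competitions(question_id: str,
                          asker: str,
                          best_answerer: str,
                          other_answerers: List[str]) -> List[Tuple[str, str]]:
    """Return R_q for one resolved question as (winner, loser) pairs."""
    uq = "Q::" + question_id                      # pseudo user standing in for question q
    pairs = [(uq, asker)]                         # u_q beats the asker u_a
    pairs.append((best_answerer, uq))             # u_b beats the pseudo user u_q
    pairs.append((best_answerer, asker))          # u_b beats the asker u_a
    pairs.extend((best_answerer, uo) for uo in other_answerers)  # u_b beats each non-best answerer
    return pairs

# With |S| = 2 non-best answerers: 1 + (|S| + 2) = 5 competitions in total
print(generate_competitions("q42", "alice", "bob", ["carol", "dave"]))
```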
40 , 2011) and apply TrueSkill to learn the relative skills of players from the set of generated competitions R (Equ. [sent-100, score-0.257]
41 TrueSkill (Herbrich et al., 2007) is a Bayesian skill rating model developed for estimating the relative skill levels of players in games. [sent-103, score-0.839]
42 TrueSkill assumes that the practical performance of each player in a game follows a normal distribution N(µ, σ²), where µ is the skill level of the player and σ is the uncertainty of the estimated skill level. [sent-105, score-0.94]
43 Basically, TrueSkill learns the skill levels of players by leveraging Bayes’ theorem. [sent-106, score-0.449]
44 Given the current estimated skill levels of two players (prior probability) and the outcome of a new game between them (likelihood), the TrueSkill model updates its estimate of the players' skill levels (posterior probability). [sent-107, score-1.088]
45 TrueSkill updates the skill level µ and the uncertainty σ intuitively: (a) if the outcome of a new competition is expected, i.e., [sent-108, score-0.57]
46 the player with the higher skill level wins the game, it causes small updates to the skill level µ and the uncertainty σ; (b) if the outcome of a new competition is unexpected, i.e., [sent-110, score-1.078]
47 the player with the lower skill level wins the game, it causes large updates to the skill level µ and the uncertainty σ. [sent-112, score-0.935]
48 Here, ε is a parameter representing the probability of a draw in one game, and v(t, ε) and w(t, ε) are weighting factors for the skill level µ and the standard deviation σ, respectively. [sent-114, score-0.356]
49 In this paper, we set the initial values of the skill level µ and the standard deviation σ of each player to the default values used in (Herbrich et al. [sent-117, score-0.427]
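For reference, the following is a simplified Python/SciPy sketch of the standard two-player win/loss TrueSkill update (Herbrich et al., 2007). It is not the authors' implementation: it omits the dynamics factor and proper draw handling, and the parameter values (µ0 = 25, σ0 = 25/3, β = σ0/2, and the draw margin EPS) are illustrative defaults rather than values confirmed by the paper.

```python
import math
from scipy.stats import norm

MU0, SIGMA0 = 25.0, 25.0 / 3.0     # default prior mean and uncertainty
BETA = SIGMA0 / 2.0                # performance variability within one game
EPS = 0.2                          # draw margin, tied to the draw probability

def v_win(t, eps):
    """Weighting factor for the mean update after a win/loss outcome."""
    return norm.pdf(t - eps) / norm.cdf(t - eps)

def w_win(t, eps):
    """Weighting factor for the variance update after a win/loss outcome."""
    v = v_win(t, eps)
    return v * (v + t - eps)

def trueskill_update(winner, loser):
    """One Bayesian update of (mu, sigma) for the winner and the loser.

    The shift is governed by v and w: small when the higher-rated player
    wins (expected outcome) and large otherwise (unexpected outcome).
    """
    (mu_w, sig_w), (mu_l, sig_l) = winner, loser
    c = math.sqrt(2 * BETA ** 2 + sig_w ** 2 + sig_l ** 2)
    t, eps = (mu_w - mu_l) / c, EPS / c
    v, w = v_win(t, eps), w_win(t, eps)
    new_winner = (mu_w + sig_w ** 2 / c * v,
                  sig_w * math.sqrt(max(1.0 - sig_w ** 2 / c ** 2 * w, 1e-12)))
    new_loser = (mu_l - sig_l ** 2 / c * v,
                 sig_l * math.sqrt(max(1.0 - sig_l ** 2 / c ** 2 * w, 1e-12)))
    return new_winner, new_loser

# Rating every generated competition in R sequentially: pseudo question users
# end up with difficulty scores, real users with expertise scores.
skills = {}
R = [("Q::q42", "alice"), ("bob", "Q::q42"), ("bob", "alice")]
for winner_id, loser_id in R:
    w_rating = skills.get(winner_id, (MU0, SIGMA0))
    l_rating = skills.get(loser_id, (MU0, SIGMA0))
    skills[winner_id], skills[loser_id] = trueskill_update(w_rating, l_rating)
print(skills)
```

In practice, the open-source `trueskill` package on PyPI provides a complete implementation (including draws); the sketch above only shows where the weighting factors v and w enter the update.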
50 SO contains questions on various topics, such as programming, mathematics, and English. [sent-122, score-0.137]
51 We use the C++ (SO/CPP) and mathematics (SO/Math) questions for our main experiments. [sent-125, score-0.137]
52 Additionally, we use the data of Math Overflow (MO) for calibrating question difficulty scores across communities (Sec. [sent-126, score-0.88]
53 To evaluate the effectiveness of our proposed model for estimating question difficulty scores, we randomly sampled 300 question pairs from both SO/CPP and SO/Math, and we asked experts to compare the difficulty of every pair. [sent-131, score-1.664]
54 We had two graduate students majoring in computer science annotate the SO/CPP question pairs, and two graduate students majoring in mathematics annotate the SO/Math question pairs. [sent-132, score-0.98]
55 When annotating each question pair, only the titles, descriptions, and tags of the questions were shown; other information was hidden. [sent-133, score-0.511]
56 There were 260 SO/CPP question pairs and 280 SO/Math question pairs remaining. [sent-142, score-0.748]
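With expert judgments of this form, a natural way to score a model is pairwise accuracy: the fraction of annotated pairs whose order agrees with the estimated difficulty scores. A minimal sketch follows; the identifiers are hypothetical and this is not necessarily the exact metric reported in the paper.

```python
def pairwise_accuracy(annotated_pairs, difficulty):
    """annotated_pairs: list of (harder_qid, easier_qid) expert judgments.
    difficulty: dict mapping a question id to its estimated difficulty score."""
    correct = sum(1 for harder, easier in annotated_pairs
                  if difficulty[harder] > difficulty[easier])
    return correct / len(annotated_pairs)

gold = [("q1", "q2"), ("q3", "q4")]                       # q1 judged harder than q2, etc.
scores = {"q1": 31.0, "q2": 24.5, "q3": 22.0, "q4": 27.3}
print(pairwise_accuracy(gold, scores))                    # -> 0.5
```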
57 Figure 1: The distributions of calibrated question difficulty scores of MO and SO/Math. [sent-210, score-0.864]
58 This is because the PageRank-based approach models the outcomes of competitions as being affected only by question difficulty. [sent-215, score-0.528]
59 However, the outcomes of competitions depend on both the question difficulty levels and the expertise levels of competitors. [sent-216, score-1.25]
60 3 Calibrating Question Difficulty across CQA Services Both MO and SO/Math are CQA services for asking mathematics questions. [sent-225, score-0.147]
61 MO’s primary goal is asking and answering research-level mathematics questions. [sent-227, score-0.186]
62 Usually, the community members in MO are not interested in basic mathematics questions. [sent-229, score-0.153]
63 If a posted question is too elementary, someone will suggest moving it to SO/Math. [sent-233, score-0.396]
64 Similarly, if a posted question is advanced, the community members in SO/Math will recommend moving it to MO. [sent-234, score-0.493]
65 Hence, it is expected that the ratio of difficult questions in MO is higher than in SO/Math. [sent-235, score-0.137]
66 We first calibrate the estimated question difficulty scores across these two services onto the same scale. [sent-237, score-0.944]
67 We assume that if a user u1 on MO and a user u2 on SO/Math have the same home page URL, they should be linked as one natural person in the real world. [sent-240, score-0.262]
68 Then, we can calibrate the estimated question difficulty scores across the two services by running the competition-based model on the joint data set. [sent-248, score-0.944]
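Operationally, this calibration amounts to linking accounts that share a home page URL and then running the same competition-based model once over the pooled competitions. Below is a minimal sketch of the linking step; the field names and canonical-id scheme are assumptions for illustration, not the authors' code.

```python
def link_accounts(mo_users, math_users):
    """Map accounts with the same home page URL to one canonical person id.

    mo_users / math_users: dicts of user id -> home page URL (may be None).
    Returns a dict from (service, user id) -> canonical person id.
    """
    by_url, canonical = {}, {}
    for service, users in (("MO", mo_users), ("SO/Math", math_users)):
        for uid, url in users.items():
            if url:
                person = by_url.setdefault(url.strip().lower(), f"{service}::{uid}")
            else:
                person = f"{service}::{uid}"              # unlinked account keeps its own id
            canonical[(service, uid)] = person
    return canonical

# Competitions from both services are then rewritten onto the canonical ids and
# fed to a single TrueSkill run, which places all difficulty scores on one scale.
```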
69 Figure 1 shows the distributions of the calibrated question difficulty scores of MO and SO/Math on the same scale. [sent-249, score-0.864]
70 As expected, we observed that the ratio of difficult questions in MO was higher than in SO/Math. [sent-250, score-0.137]
71 4 Analysis on the Question Descriptions In this section, we analyze the text of question descriptions on the scale of question difficulty scores estimated by the competition model. [sent-255, score-1.374]
72 Micro Level We first examine the frequency distributions of individual words over the question difficulty scores. [sent-256, score-0.796]
73 We observe that the words ’list’ and ’array’ have lower mean difficulty scores than the words ’virtual’ and ’gcc’. [sent-258, score-0.39]
74 This is reasonable, since ’list’ and ’array’ concern more elementary programming topics than ’virtual’ and ’gcc’.
Figure 2: Tag clouds of SO/Math questions with different difficulty levels.
75 Figure 3: The frequency distributions of words on the scale of question difficulty scores (SO/CPP). [sent-306, score-0.836]
76 It can be observed that the ordering of these words by mean difficulty score is well aligned with our learning process. [sent-308, score-0.43]
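The micro-level check reduces to computing, for each word, the mean difficulty score of the questions whose descriptions contain it. A minimal sketch on toy data follows; whitespace tokenization is an illustrative simplification.

```python
from collections import defaultdict

def mean_difficulty_per_word(questions):
    """questions: list of (description_text, difficulty_score) pairs."""
    totals, counts = defaultdict(float), defaultdict(int)
    for text, score in questions:
        for word in set(text.lower().split()):   # count each word once per question
            totals[word] += score
            counts[word] += 1
    return {word: totals[word] / counts[word] for word in totals}

means = mean_difficulty_per_word([
    ("how do i sort a list", 21.0),
    ("virtual inheritance and gcc linker flags", 30.5),
    ("array index out of range", 22.4),
])
print(sorted(means.items(), key=lambda kv: kv[1]))  # 'list'/'array' rank lowest here
```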
77 Macro Level We evenly split the range of question difficulty scores into n buckets and grouped the questions into these buckets according to their difficulty scores. [sent-309, score-1.509]
78 Then, we had n question buckets, each corresponding to a word distribution over its questions. [sent-310, score-0.512]
79 Let variable X denote the distance between the difficulty scores in two question buckets (which is the difference between the average difficulty scores of questions in the two buckets), and variable Y denote the Jensen-Shannon distance between word distributions in two question buckets. [sent-311, score-1.89]
80 In other words, when the distance between the difficulty scores of two buckets becomes larger, the two word distributions become less similar, and vice versa. [sent-317, score-0.688]
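A minimal sketch of this macro-level computation with NumPy and SciPy is given below; the bucketing, tokenization, and add-one smoothing are illustrative assumptions rather than the paper's exact setup.

```python
import itertools
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import pearsonr

def bucket_distance_correlation(questions, n_buckets=10):
    """questions: list of (description_text, difficulty_score) pairs.

    Returns the Pearson correlation between X (gap in average difficulty
    between two buckets) and Y (Jensen-Shannon distance between the word
    distributions of the two buckets).
    """
    scores = np.array([score for _, score in questions])
    edges = np.linspace(scores.min(), scores.max(), n_buckets + 1)
    bucket_of = np.clip(np.digitize(scores, edges[1:-1]), 0, n_buckets - 1)

    vocab = sorted({w for text, _ in questions for w in text.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.ones((n_buckets, len(vocab)))            # add-one smoothing
    for (text, _), b in zip(questions, bucket_of):
        for w in text.lower().split():
            counts[b, index[w]] += 1
    word_dist = counts / counts.sum(axis=1, keepdims=True)

    # Assumes every bucket is non-empty; otherwise filter empty buckets first.
    mean_score = np.array([scores[bucket_of == b].mean() for b in range(n_buckets)])

    xs, ys = [], []
    for a, b in itertools.combinations(range(n_buckets), 2):
        xs.append(abs(mean_score[a] - mean_score[b]))         # X: difficulty gap
        ys.append(jensenshannon(word_dist[a], word_dist[b]))  # Y: JS distance
    r, p_value = pearsonr(xs, ys)
    return r, p_value
```

With enough questions per bucket, a strongly positive correlation reproduces the qualitative finding stated above.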
81 We further visualized the word distribution in each question bucket. [sent-318, score-0.374]
82 We set n to 3, and we had three question buckets: (1) easy questions; (2) normal questions; and (3) hard questions. [sent-319, score-0.374]
83 Figure 4 plots the tag clouds of SO/Math questions in the three buckets. [sent-321, score-0.2]
84 We observed that (1) the tags ’homework’ and ’calculus’ become smaller from easy questions to hard questions; and (2) the tag ’set-theory’ becomes larger. [sent-323, score-0.181]
85 The above experimental results show that different words or tags of question descriptions reflect the question difficulty levels. [sent-325, score-1.199]
86 This implies the possibility of predicting question difficulty purely from the text of question descriptions. [sent-326, score-1.208]
87 4 Conclusion and Future Work In this paper, we address the problem of estimating question difficulty in CQA services. [sent-327, score-0.829]
88 Our proposed competition-based model for estimating question difficulty significantly outperforms the PageRankbased approach. [sent-328, score-0.829]
89 Most importantly, our analysis shows that the text of question descriptions reflects the question difficulty. [sent-329, score-0.809]
90 In the future, we would like to explore predicting question difficulty from the text of question descriptions. [sent-330, score-1.161]
91 A generalized framework of exploring category information for question retrieval in community question answer archives. [sent-369, score-0.915]
92 Searching questions by identifying question topic and question focus. [sent-378, score-0.885]
93 Question-answer topic model for question retrieval in community question answering. [sent-399, score-0.845]
94 Discovering authorities in question answer communities by using link analysis. [sent-405, score-0.479]
95 Routing questions to appropriate answerers in community question answering services. [sent-411, score-0.806]
96 Question routing in community question answering: putting category in its place. [sent-419, score-0.559]
97 Expert identification in community question answering: exploring question selection bias. [sent-453, score-0.845]
98 The use ofdependency relation graph to enhance the term weighting in question retrieval. [sent-485, score-0.374]
99 Routing questions to the right users in online communities. [sent-495, score-0.202]
100 Phrase-based translation model for question retrieval in community question answer archives. [sent-499, score-0.915]
wordName wordTfidf (topN-words)
[('difficulty', 0.39), ('question', 0.374), ('skill', 0.325), ('answerer', 0.264), ('expertise', 0.226), ('asker', 0.203), ('trueskill', 0.183), ('ub', 0.162), ('ua', 0.161), ('mo', 0.138), ('questions', 0.137), ('cqa', 0.135), ('answerers', 0.122), ('buckets', 0.113), ('competition', 0.111), ('user', 0.105), ('competitions', 0.105), ('community', 0.097), ('routing', 0.088), ('uq', 0.088), ('ackerman', 0.081), ('herbrich', 0.081), ('skills', 0.081), ('wins', 0.081), ('answering', 0.076), ('players', 0.071), ('experts', 0.071), ('player', 0.071), ('loser', 0.071), ('answer', 0.07), ('services', 0.068), ('users', 0.065), ('estimating', 0.065), ('bian', 0.061), ('jeon', 0.061), ('descriptions', 0.061), ('pseudo', 0.058), ('winner', 0.057), ('mathematics', 0.056), ('quan', 0.053), ('levels', 0.053), ('game', 0.053), ('cao', 0.051), ('agichtein', 0.051), ('estimation', 0.05), ('outcomes', 0.049), ('liu', 0.049), ('calibrate', 0.048), ('rq', 0.048), ('virtual', 0.042), ('bouguessa', 0.041), ('calibrating', 0.041), ('clouds', 0.041), ('gcc', 0.041), ('jurczyk', 0.041), ('overflow', 0.041), ('stackexchange', 0.041), ('suryanto', 0.041), ('scores', 0.04), ('uncertainty', 0.04), ('answers', 0.038), ('awards', 0.035), ('majoring', 0.035), ('communities', 0.035), ('zhou', 0.033), ('distributions', 0.032), ('outcome', 0.032), ('updates', 0.031), ('students', 0.031), ('level', 0.031), ('cong', 0.03), ('calibrated', 0.028), ('nam', 0.027), ('yahoo', 0.027), ('home', 0.027), ('pal', 0.027), ('importantly', 0.027), ('cui', 0.026), ('harbin', 0.026), ('assumptions', 0.025), ('linked', 0.025), ('duan', 0.025), ('array', 0.025), ('acc', 0.025), ('cai', 0.025), ('bucket', 0.025), ('possibility', 0.025), ('estimated', 0.024), ('pagerank', 0.024), ('additionally', 0.023), ('pairwise', 0.023), ('predicting', 0.023), ('asking', 0.023), ('proceedings', 0.023), ('king', 0.022), ('tag', 0.022), ('stack', 0.022), ('posted', 0.022), ('graduate', 0.022), ('implies', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999964 155 emnlp-2013-Question Difficulty Estimation in Community Question Answering Services
Author: Jing Liu ; Quan Wang ; Chin-Yew Lin ; Hsiao-Wuen Hon
Abstract: In this paper, we address the problem of estimating question difficulty in community question answering services. We propose a competition-based model for estimating question difficulty by leveraging pairwise comparisons between questions and users. Our experimental results show that our model significantly outperforms a PageRank-based approach. Most importantly, our analysis shows that the text of question descriptions reflects the question difficulty. This implies the possibility of predicting question difficulty from the text of question descriptions.
2 0.15391544 31 emnlp-2013-Automatic Feature Engineering for Answer Selection and Extraction
Author: Aliaksei Severyn ; Alessandro Moschitti
Abstract: This paper proposes a framework for automatically engineering features for two important tasks of question answering: answer sentence selection and answer extraction. We represent question and answer sentence pairs with linguistic structures enriched by semantic information, where the latter is produced by automatic classifiers, e.g., question classifier and Named Entity Recognizer. Tree kernels applied to such structures enable a simple way to generate highly discriminative structural features that combine syntactic and semantic information encoded in the input trees. We conduct experiments on a public benchmark from TREC to compare with previous systems for answer sentence selection and answer extraction. The results show that our models greatly improve on the state of the art, e.g., up to 22% on F1 (relative improvement) for answer extraction, while using no additional resources and no manual feature engineering.
Author: Mikhail Ageev ; Dmitry Lagun ; Eugene Agichtein
Abstract: Passage retrieval is a crucial first step of automatic Question Answering (QA). While existing passage retrieval algorithms are effective at selecting document passages most similar to the question, or those that contain the expected answer types, they do not take into account which parts of the document the searchers actually found useful. We propose, to the best of our knowledge, the first successful attempt to incorporate searcher examination data into passage retrieval for question answering. Specifically, we exploit detailed examination data, such as mouse cursor movements and scrolling, to infer the parts of the document the searcher found interesting, and then incorporate this signal into passage retrieval for QA. Our extensive experiments and analysis demonstrate that our method significantly improves passage retrieval, compared to using textual features alone. As an additional contribution, we make available to the research community the code and the search behavior data used in this study, with the hope of encouraging further research in this area.
Author: Baichuan Li ; Jing Liu ; Chin-Yew Lin ; Irwin King ; Michael R. Lyu
Abstract: Social media like forums and microblogs have accumulated a huge amount of user generated content (UGC) containing human knowledge. Currently, most of UGC is listed as a whole or in pre-defined categories. This “list-based” approach is simple, but hinders users from browsing and learning knowledge of certain topics effectively. To address this problem, we propose a hierarchical entity-based approach for structuralizing UGC in social media. By using a large-scale entity repository, we design a three-step framework to organize UGC in a novel hierarchical structure called “cluster entity tree (CET)”. With Yahoo! Answers as a test case, we conduct experiments and the results show the effectiveness of our framework in constructing CET. We further evaluate the performance of CET on UGC organization in both user and system aspects. From a user aspect, our user study demonstrates that, with CET-based structure, users perform significantly better in knowledge learning than using traditional list-based approach. From a system aspect, CET substantially boosts the performance of two information retrieval models (i.e., vector space model and query likelihood language model).
5 0.11409018 126 emnlp-2013-MCTest: A Challenge Dataset for the Open-Domain Machine Comprehension of Text
Author: Matthew Richardson ; Christopher J.C. Burges ; Erin Renshaw
Abstract: We present MCTest, a freely available set of stories and associated questions intended for research on the machine comprehension of text. Previous work on machine comprehension (e.g., semantic modeling) has made great strides, but primarily focuses either on limited-domain datasets, or on solving a more restricted goal (e.g., open-domain relation extraction). In contrast, MCTest requires machines to answer multiple-choice reading comprehension questions about fictional stories, directly tackling the high-level goal of open-domain machine comprehension. Reading comprehension can test advanced abilities such as causal reasoning and understanding the world, yet, by being multiple-choice, still provide a clear metric. By being fictional, the answer typically can be found only in the story itself. The stories and questions are also carefully limited to those a young child would understand, reducing the world knowledge that is required for the task. We present the scalable crowd-sourcing methods that allow us to cheaply construct a dataset of 500 stories and 2000 questions. By screening workers (with grammar tests) and stories (with grading), we have ensured that the data is the same quality as another set that we manually edited, but at one tenth the editing cost. By being open-domain, yet carefully restricted, we hope MCTest will serve to encourage research and provide a clear metric for advancement on the machine comprehension of text.
6 0.055700947 59 emnlp-2013-Deriving Adjectival Scales from Continuous Space Word Representations
7 0.055512816 166 emnlp-2013-Semantic Parsing on Freebase from Question-Answer Pairs
8 0.047669243 200 emnlp-2013-Well-Argued Recommendation: Adaptive Models Based on Words in Recommender Systems
9 0.045703813 91 emnlp-2013-Grounding Strategic Conversation: Using Negotiation Dialogues to Predict Trades in a Win-Lose Game
10 0.044410687 5 emnlp-2013-A Discourse-Driven Content Model for Summarising Scientific Articles Evaluated in a Complex Question Answering Task
11 0.040990662 193 emnlp-2013-Unsupervised Induction of Cross-Lingual Semantic Relations
12 0.040622775 89 emnlp-2013-Gender Inference of Twitter Users in Non-English Contexts
13 0.040433057 121 emnlp-2013-Learning Topics and Positions from Debatepedia
14 0.040426932 179 emnlp-2013-Summarizing Complex Events: a Cross-Modal Solution of Storylines Extraction and Reconstruction
15 0.034459554 16 emnlp-2013-A Unified Model for Topics, Events and Users on Twitter
16 0.033393387 4 emnlp-2013-A Dataset for Research on Short-Text Conversations
17 0.033078514 153 emnlp-2013-Predicting the Resolution of Referring Expressions from User Behavior
18 0.032864369 196 emnlp-2013-Using Crowdsourcing to get Representations based on Regular Expressions
19 0.03162872 173 emnlp-2013-Simulating Early-Termination Search for Verbose Spoken Queries
20 0.030293033 97 emnlp-2013-Identifying Web Search Query Reformulation using Concept based Matching
topicId topicWeight
[(0, -0.114), (1, 0.044), (2, -0.041), (3, 0.021), (4, -0.017), (5, 0.082), (6, 0.021), (7, 0.076), (8, 0.106), (9, -0.009), (10, -0.042), (11, 0.15), (12, -0.045), (13, -0.152), (14, 0.106), (15, 0.092), (16, 0.154), (17, 0.218), (18, 0.041), (19, -0.077), (20, 0.035), (21, 0.02), (22, 0.074), (23, -0.089), (24, 0.115), (25, -0.091), (26, -0.068), (27, -0.011), (28, -0.095), (29, 0.075), (30, -0.04), (31, -0.05), (32, 0.02), (33, 0.042), (34, -0.027), (35, 0.014), (36, -0.101), (37, 0.048), (38, -0.034), (39, 0.055), (40, -0.012), (41, 0.014), (42, 0.039), (43, -0.084), (44, 0.019), (45, -0.056), (46, -0.008), (47, -0.05), (48, 0.001), (49, 0.066)]
simIndex simValue paperId paperTitle
same-paper 1 0.98687828 155 emnlp-2013-Question Difficulty Estimation in Community Question Answering Services
Author: Jing Liu ; Quan Wang ; Chin-Yew Lin ; Hsiao-Wuen Hon
Abstract: In this paper, we address the problem of estimating question difficulty in community question answering services. We propose a competition-based model for estimating question difficulty by leveraging pairwise comparisons between questions and users. Our experimental results show that our model significantly outperforms a PageRank-based approach. Most importantly, our analysis shows that the text of question descriptions reflects the question difficulty. This implies the possibility of predicting question difficulty from the text of question descriptions.
Author: Mikhail Ageev ; Dmitry Lagun ; Eugene Agichtein
Abstract: Passage retrieval is a crucial first step of automatic Question Answering (QA). While existing passage retrieval algorithms are effective at selecting document passages most similar to the question, or those that contain the expected answer types, they do not take into account which parts of the document the searchers actually found useful. We propose, to the best of our knowledge, the first successful attempt to incorporate searcher examination data into passage retrieval for question answering. Specifically, we exploit detailed examination data, such as mouse cursor movements and scrolling, to infer the parts of the document the searcher found interesting, and then incorporate this signal into passage retrieval for QA. Our extensive experiments and analysis demonstrate that our method significantly improves passage retrieval, compared to using textual features alone. As an additional contribution, we make available to the research community the code and the search behavior data used in this study, with the hope of encouraging further research in this area.
3 0.74328125 126 emnlp-2013-MCTest: A Challenge Dataset for the Open-Domain Machine Comprehension of Text
Author: Matthew Richardson ; Christopher J.C. Burges ; Erin Renshaw
Abstract: We present MCTest, a freely available set of stories and associated questions intended for research on the machine comprehension of text. Previous work on machine comprehension (e.g., semantic modeling) has made great strides, but primarily focuses either on limited-domain datasets, or on solving a more restricted goal (e.g., open-domain relation extraction). In contrast, MCTest requires machines to answer multiple-choice reading comprehension questions about fictional stories, directly tackling the high-level goal of open-domain machine comprehension. Reading comprehension can test advanced abilities such as causal reasoning and understanding the world, yet, by being multiple-choice, still provide a clear metric. By being fictional, the answer typically can be found only in the story itself. The stories and questions are also carefully limited to those a young child would understand, reducing the world knowledge that is required for the task. We present the scalable crowd-sourcing methods that allow us to cheaply construct a dataset of 500 stories and 2000 questions. By screening workers (with grammar tests) and stories (with grading), we have ensured that the data is the same quality as another set that we manually edited, but at one tenth the editing cost. By being open-domain, yet carefully restricted, we hope MCTest will serve to encourage research and provide a clear metric for advancement on the machine comprehension of text. 1 Reading Comprehension A major goal for NLP is for machines to be able to understand text as well as people. Several research disciplines are focused on this problem: for example, information extraction, relation extraction, semantic role labeling, and recognizing textual entailment. Yet these techniques are necessarily evaluated individually, rather than by how much they advance us towards the end goal. On the other hand, the goal of semantic parsing is the machine comprehension of text (MCT), yet its evaluation requires adherence to a specific knowledge representation, and it is currently unclear what the best representation is, for open-domain text. We believe that it is useful to directly tackle the top-level task of MCT. For this, we need a way to measure progress. One common method for evaluating someone’s understanding of text is by giving them a multiple-choice reading comprehension test. This has the advantage that it is objectively gradable (vs. essays) yet may test a range of abilities such as causal or counterfactual reasoning, inference among relations, or just basic understanding of the world in which the passage is set. Therefore, we propose a multiple-choice reading comprehension task as a way to evaluate progress on MCT. We have built a reading comprehension dataset containing 500 fictional stories, with 4 multiple choice questions per story. It was built using methods which can easily scale to at least 5000 stories, since the stories were created, and the curation was done, using crowd sourcing almost entirely, at a total of $4.00 per story. We plan to periodically update the dataset to ensure that methods are not overfitting to the existing data. The dataset is open-domain, yet restricted to concepts and words that a 7 year old is expected to understand. 
This task is still beyond the capability of today’s computers and algorithms. By restricting the concept space, we gain the difficulty of being an open-domain problem, without the full complexity of the real world (for example, there will be no need for the machine to understand politics, technology, or to have any domain specific expertise). The multiple choice task avoids ambiguities (such as when the task is to find a sentence that best matches a question, as in some early reading comprehension tasks: see Section 2), and also avoids the need for additional grading, such as is needed in some TREC tasks. The stories were chosen to be fictional to focus work on finding the answer in the story itself, rather than in knowledge repositories such as Wikipedia; the goal is to build technology that actually understands stories and paragraphs on a deep level (as opposed to using information retrieval methods and the redundancy of the web to find the answers). We chose to use crowd sourcing, as opposed to, for example, contracting teachers or paying for existing standardized tests, for three reasons, namely: (1) scalability, both for the sizes of datasets we can provide, and also for the ease of regularly refreshing the data; (2) for the variety in story-telling that having many different authors brings; and (3) for the free availability that can only result from providing non-copyrighted data. The content is freely available at http://research.microsoft.com/mct, and we plan to use that site to track published results and provide other resources, such as labels of various kinds. 2 Previous Work The research goal of mapping text to meaning representations in order to solve particular tasks has a long history. DARPA introduced the Airline Travel Information System (ATIS) in the early 90’s: there the task was to slot-fill flight-related information by modeling the intent of spoken language (see Tur et al., 2010, for a review). This data continues to be used in the semantic modeling community (see, for example, Zettlemoyer and Collins, 2009). The Geoquery database contains 880 geographical facts about the US and has played a similar role for written (as opposed to spoken) natural language queries against a database (Zelle and Mooney, 1996) and it also continues to spur research (see for example Goldwasser et al., 2011), as does the similar Jobs database, which provides mappings of 640 sentences to a listing of jobs (Tang and Mooney, 2001). More recently, Zweig and Burges (2012) provided a set of 1040 sentences that comprise an SAT-style multiple choice sentence completion task. The idea of using story-based reading comprehension questions to evaluate methods for machine reading itself goes back over a decade, when Hirschman et al. (1999) showed that a bag of words approach, together with some heuristic linguistic modeling, could achieve 40% accuracy for the task of picking the sentence that best matches the query for “who / what / when / where / why” questions, on a small reading comprehension dataset from Remedia. This dataset spurred several research efforts, for example using reinforcement learning (Grois and Wilkins, 2005), named entity resolution (Harabagiu et al., 2003) and mapping questions and answers to logical form (Wellner et al., 2006). 
Work on story understanding itself goes back much further, to 1972, when Charniak proposed using a background model to answer questions about children’s stories. Similarly, the TREC (and TAC) Question Answering tracks (e.g., Voorhees and Tice, 1999) aim to evaluate systems on their ability to answer factual questions such as “Where is the Taj Mahal”. The QA4MRE task also aims to evaluate machine reading systems through question answering (e.g., Clark et al., 2012). Earlier work has also aimed at controlling the scope by limiting the text to children’s stories: Breck et al. (2001) collected 75 stories from the Canadian Broadcasting Corporation’s web site for children, and generated 650 questions for them manually, where each question was answered by a sentence in the text. Leidner et al. (2003) both enriched the CBC4kids data by adding several layers of annotation (such as semantic and POS tags), and measured QA performance as a function of question difficulty. For a further compendium of resources related to the story comprehension task, see Mueller (2010). The task proposed here differs from the above work in several ways. Most importantly, the data collection is scalable: if the dataset proves sufficiently useful to others, it would be straightforward to gather an order of magnitude more. Even the dataset size presented here is an order of magnitude larger than the Remedia or the CBC4kids data and many times larger than QA4MRE. Second, the multiple choice task presents less ambiguity (and is consequently easier to collect data for) than the task of finding the most appropriate sentence, and may be automatically evaluated. Further, our stories are fictional, which means that the information to answer the question is contained only in the story itself (as opposed to being able to directly leverage knowledge repositories such as Wikipedia). This design was chosen to focus the task on the machine understanding of short passages, rather than the ability to match against an existing knowledge base. In addition, while in the CBC4kids data each answer was a sentence from the story, here we required that approximately half of the questions require at least two sentences from the text to answer; being able to control complexity in this way is a further benefit of using multiple choice answers. Finally, as explained in Section 1, the use of free-form input makes the problem open domain (as opposed to the ATIS, Geoquery and Jobs data), leading to the hope that solutions to the task presented here will be easier to apply to novel, unrelated tasks. 3 Generating the Stories and Questions Our aim was to generate a corpus of fictional story sets1 that could be scaled with as little expert input as possible. Thus, we designed the process to be gated by cost, and keeping the costs low was a high priority. Crowd-sourcing seemed particularly appropriate, given the nature of the task, so we opted to use Amazon Mechanical Turk2 (AMT). With over 500,000 workers3, it provides the workforce required to both achieve scalability and, equally importantly, to provide diversity in the stories and types of questions. We restricted our task to AMT workers (workers) residing in the United States. The average worker is 36 years old, more educated than the United States population in general (Paolacci et al., 2010), and the majority of workers are female. 
3.1 The Story and Questions Workers were instructed to write a short (150-300 words) fictional story, and to write as if for a child in grade school. The choice of 150-300 was made to keep the task an appropriate size for workers while still allowing for complex stories and questions. The workers were free to write about any topic they desired (as long as it was appropriate for a young child), and so there is a wide range, including vacations, animals, school, cars, eating, gardening, fairy tales, spaceships, and cowboys. (Footnotes: 1 We use the term “story set” to denote the fictional story together with its multiple choice questions, hypothetical answers, and correct answer labels. 2 http://www.mturk.com 3 https://requester.mturk.com/tour) Workers were also asked to provide four reading comprehension questions pertaining to their story and, for each, four multiple-choice answers. Coming up with incorrect alternatives (distractors) is a difficult task (see, e.g., Agarwal, 2011) but workers were requested to provide “reasonable” incorrect answers that at least include words from the story so that their solution is not trivial. For example, for the question “What is the name of the dog?”, if only one of the four answers occurs in the story, then that answer must be the correct one. Finally, workers were asked to design their questions and answers such that at least two of the four questions required multiple sentences from the story to answer them. That is, for those questions it should not be possible to find the answer in any individual sentence. The motivation for this was to ensure that the task could not be fully solved using lexical techniques, such as word matching, alone. Whilst it is still possible that a sophisticated lexical analysis could completely solve the task, requiring that answers be constructed from at least two different sentences in the story makes this much less likely; our hope is that the solution will instead require some inference and some form of limited reasoning. This hope rests in part upon the observation that standardized reading comprehension tests, whose goal after all is to test comprehension, generally avoid questions that can be answered by reading a single sentence. 3.2 Automatic Validation Besides verifying that the story and all of the questions and answers were provided, we performed the following automatic validation before allowing the worker to complete the task: Limited vocabulary: The lowercase words in the story, questions, and answers were stemmed and checked against a vocabulary list of approximately 8000 words that a 7-year old is likely to know (Kuperman et al., 2012). Any words not on the list were highlighted in red as the worker typed, and the task could not be submitted unless all of the words satisfied this vocabulary criterion. To allow the use of arbitrary proper nouns, capitalized words were not checked against the vocabulary list. Multiple-sentence questions: As described earlier, we required that at least two of the questions need multiple sentences to answer. Workers were simply asked to mark whether a question needs one or multiple sentences and we required that at least two are marked as multiple. 3.3 The Workers Workers were required to reside in the United States and to have completed 100 HITs with an over 95% approval rate4. The median worker took 22 minutes to complete the task. We paid workers $2.50 per story set and allowed each to do a maximum of 8 tasks (5 in MC500). 
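The vocabulary check described above is easy to approximate. The sketch below is illustrative only: the file name vocab_7yo.txt, the crude suffix-stripping stemmer, and the token regex are assumptions standing in for the authors' actual word list and stemmer, which are not specified here.

```python
import re

def load_vocabulary(path="vocab_7yo.txt"):
    # Hypothetical file: one allowed word per line (roughly 8000 entries).
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def crude_stem(word):
    # Stand-in for the unspecified stemmer: strip a few common suffixes.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def out_of_vocabulary(text, vocab):
    """Return the words that would be flagged (highlighted in red for the worker)."""
    flagged = []
    for token in re.findall(r"[A-Za-z]+", text):
        if token[0].isupper():
            continue  # capitalized words (e.g., proper nouns) are not checked
        if token.lower() not in vocab and crude_stem(token.lower()) not in vocab:
            flagged.append(token)
    return flagged

# A submission would be blocked until out_of_vocabulary(story_text, vocab) is empty.
```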
We did not experiment with paying less, but this rate amounts to $6.82/hour, which is approximately the rate paid by other writing tasks on AMT at the time, though is also significantly higher than the median wage of $1.38 found in 2010 (Horton and Chilton, 2010). Workers could optionally leave feedback on the task, which was overwhelmingly positive – the most frequent non-stopword in the comments was “fun” and the most frequent phrase was “thank you”. The only negative comments (in <1% of submissions) were when the worker felt that a particular word should have been on the allowed vocabulary list. Given the positive feedback, it may be possible to pay less if we collect more data in the future. We did not enforce story length constraints, but some workers interpreted our suggestion that the story be 150-300 words as a hard constraint, and some asked to be able to write a longer story. The MCTest corpus contains two sets of stories, named MC160 and MC500, and containing 160 and 500 stories respectively. MC160 was gathered first, then some improvements were made before gathering MC500. We give details on the differences between these two sets below. 3.4 MC160: Manually Curated for Quality In addition to the details described above, MC160 workers were given a target elementary grade school level (1-4) and a sample story matching that grade level5. The intent was to produce a set of stories and questions that varied in difficulty so that research work can progress grade-by-grade if needed. However, we found little difference between grades in the corpus. After gathering the stories, we manually curated the MC160 corpus by reading each story set and correcting errors. (Footnotes: 4 The latter two are the default AMT requirements. 5 From http://www.englishforeveryone.org/.) The most common mistakes were grammatical, though occasionally questions and/or answers needed to be fixed. 66% of the stories have at least one correction. We provide both the curated and original corpuses in order to allow research on reading comprehension in the presence of grammar, spelling, and other mistakes. 3.5 MC500: Adding a Grammar Test Though the construction of MC160 was successful, it requires a costly curation process which will not scale to larger data sets (although the curation was useful, both for improving the design of MC500, and for assessing the effectiveness of automated curation techniques). To more fully automate the process, we added two more stages: (1) A grammar test that automatically pre-screens workers for writing ability, and (2) a second Mechanical Turk task whereby new workers take the reading comprehension tests and rate their quality. We will discuss stage (2) in the next section. The grammar test consisted of 20 sentences, half of which had one grammatical error (see Figure 2). The incorrect sentences were written using common errors such as you’re vs. your, using ‘s to indicate plurality, incorrect use of tense, it’s vs. its, less vs. fewer, I vs. me, etc. [Table 1 — numeric values garbled in extraction; it compares story quality (1-5) and the percent of stories about animals with vs. without the grammar test. Caption: Pre-screening workers using a grammar test improves both quality and diversity of stories. Both differences are significant using the two-tailed t-test (p<0.05 for quality and p<0.01 for animals).] Workers were required to indicate for each sentence whether it was grammatically correct or not, and had to pass with at least 80% accuracy in order to qualify for the task. 
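The significance claim in the Table 1 caption above can be checked with an off-the-shelf two-sample t-test once per-story quality ratings are in hand. The sketch below assumes SciPy and uses made-up rating vectors, since the actual 1-5 grades for the 20 inspected stories are not reproduced here.

```python
from scipy import stats

# Hypothetical 1-5 quality grades for the ten stories graded in each condition.
quality_with_grammar_test = [5, 4, 5, 4, 5, 5, 4, 5, 4, 5]
quality_without_test = [3, 4, 3, 5, 3, 4, 3, 4, 3, 4]

# Two-tailed, two-sample t-test; equal_var=False gives Welch's variant,
# the "two-sample t-test with unequal variance" later used for Table 2.
t_stat, p_value = stats.ttest_ind(quality_with_grammar_test,
                                  quality_without_test,
                                  equal_var=False)
print(f"t = {t_stat:.2f}, two-tailed p = {p_value:.4f}")
```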
The 80% threshold was chosen to trade off worker quality with the rate at which the tasks would be completed; initial experiments using a threshold of 90% indicated that collecting 500 stories would take many weeks instead of days. Note that each worker is allowed to write at most 5 stories, so we required at least 100 workers to pass the qualification test. To validate the use of the qualification test, we gathered 30 stories requiring the test (qual) and 30 stories without. We selected a random set of 20 stories (10 from each), hid their origin, and then graded the overall quality of the story and questions from 1-5, meaning do not attempt to fix, bad but rescuable, has non-minor problems, has only minor problems, and has no problems, respectively. Results are shown in Table 1. The difference is statistically significant (p<0.05, using the two-tailed t-test). The qual stories were also more diverse, with fewer of them about animals (the most common topic). Additional Modifications: Based on our experience curating MC160, we also made the following modifications to the task. In order to eliminate trivially-answerable questions, we required that each answer be unique, and that either the correct answer did not appear in the story or, if it did appear, that at least two of the incorrect answers also appeared in the story. This is to prevent questions that are trivially answered by checking which answer appears in the story. The condition on whether the correct answer appears is to allow questions such as “How many candies did Susan eat?”, where the total may never appear in the story, even though the information needed to derive it does. An answer is considered to appear in the story if at least half (rounded down) of its non-stopword terms appear in the story (ignoring word endings). This check is done automatically and must be satisfied before the worker is able to complete the task. Workers could also bypass the check if they felt it was incorrect, by adding a special term to their answer. We were also concerned that the sample story might bias the workers when writing the story set, particularly when designing questions that require multiple sentences to answer. So, we removed the sample story and grade level from the task. Finally, in order to encourage more diversity of stories, we added creativity terms, a set of 15 nouns chosen at random from the allowed vocabulary set. Workers were asked to “please consider” using one or more of the terms in their story, but use of the words was strictly optional. On average, workers used 3.9 of the creativity terms in their stories. 4 Rating the Stories and Questions In this section we discuss the crowd-sourced rating of story sets. We wished to ensure story set quality despite the fact that MC500 was only minimally manually curated (see below). Pre-qualifying workers with a grammar test was one step of this process. The second step was to have additional workers on Mechanical Turk both evaluate each story and take its corresponding test. Each story was evaluated in this way by 10 workers, each of whom provided scores for each of age-appropriateness (yes/maybe/no), grammaticality (few/some/many errors), and story clarity (excellent/reasonable/poor). When answering the four reading comprehension questions, workers could also mark a question as “unclear”. Each story set was rated by 10 workers who were each paid $0.15 per set. 
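The "answer appears in the story" rule and the resulting triviality check described above can be sketched as follows. The stopword list, the suffix stripping used to "ignore word endings", and the handling of one-word answers are assumptions, not the authors' exact implementation.

```python
import re

STOPWORDS = {"a", "an", "the", "of", "in", "on", "to", "and", "or", "is", "was"}

def normalize(word):
    # Crude way to "ignore word endings": strip a few common suffixes.
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def appears_in_story(answer, story):
    story_terms = {normalize(w) for w in re.findall(r"[A-Za-z]+", story)}
    answer_terms = [normalize(w) for w in re.findall(r"[A-Za-z]+", answer)
                    if w.lower() not in STOPWORDS]
    if not answer_terms:
        return False
    matched = sum(term in story_terms for term in answer_terms)
    # "At least half (rounded down)"; requiring at least one match for
    # single-word answers is our assumption.
    return matched >= max(1, len(answer_terms) // 2)

def question_is_acceptable(story, correct_answer, distractors):
    # Accept if the correct answer does not appear, or if it does appear
    # and at least two distractors also appear.
    if not appears_in_story(correct_answer, story):
        return True
    return sum(appears_in_story(d, story) for d in distractors) >= 2
```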
Since we know the purportedly correct answer, we can estimate worker quality by measuring what fraction of questions that worker got right. Workers with less than 80% accuracy (ignoring those questions marked as unclear) were removed from the set. This constituted just 4.1% of the raters and 4.2% of the judgments (see Figure 3). Only one rater appeared to be an intentional spammer, answering 1056 questions with only 29% accuracy. The others primarily judged only one story. Only one worker fell between, answering 336 questions with just 75% accuracy. [Figure 3. Just 4.1% of raters had an accuracy below 80% (constituting 4.2% of the judgments).] For the remaining workers (those who achieved at least 80% accuracy), we measured median story appropriateness, grammar, and clarity. For each category, stories for which less than half of the ratings were the best possible (e.g., excellent story clarity) were inspected and optionally removed from the data set. This required inspecting 40 (<10%) of the stories, only 2 of which were deemed poor enough to be removed (both of which had over half of the ratings all the way at the bottom end of the scale, indicating we could potentially have inspected many fewer stories with the same results). We also inspected questions for which at least 5 workers answered incorrectly, or answered “unclear”. In total, 29 questions (<2%) were inspected. 5 were fixed by changing the question, 8 by changing the answers, 2 by changing both, 6 by changing the story, and 8 were left unmodified. Note that while not fully automated, this process of inspecting stories and repairing questions took one person one day, so is still scalable to at least an order of magnitude more stories. 5 Dataset Analysis In Table 2, we present results demonstrating the value of the grammar test and curation process. As expected, manually curating MC160 resulted in increased grammar quality and percent of questions answered correctly by raters. The goal of MC500 was to find a more scalable method to achieve the same quality as the curated MC160. As Table 2 shows, the grammar test improved story grammar quality from 1.70 to 1.77 (both uncurated). The rating and one-day curation process increases this to 1.79, whereas a fully manual curation results in a score of 1.84. [Table 2 — numeric values garbled in extraction. Caption: Average age appropriateness, clarity, grammar quality (0-2, with 2 being best), and percent of questions answered correctly by raters, for the original and curated versions of the data. Bold indicates statistical significance vs. the original version of the same set, using the two-sample t-test with unequal variance. The ǂ indicates the only statistical difference between 500 curated and 160 curated.] [Table 3 — numeric values garbled in extraction; it lists corpus statistics for MC160 and MC500, including the number of stories and the average words per story, question, and answer.] Curation also improved the percent of questions answered correctly for both MC160 and MC500, but, unlike with grammar, there is no significant difference between the two curated sets. Indeed, the only statistically significant difference between the two is in grammar. So, the MC500 grammar test and curation process is a very scalable method for collecting stories of nearly the quality of the costly manual curation of MC160. We also computed correlations between these measures of quality and various factors such as story length and time spent writing the story. 
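A sketch of the rater filter described above (drop raters below 80% accuracy, ignoring judgments marked "unclear"); the judgment record layout is assumed, not the released file format.

```python
from collections import defaultdict

def filter_raters(judgments, threshold=0.80):
    """judgments: iterable of dicts with keys 'worker', 'chosen', 'gold',
    and 'unclear' (an assumed layout, not the released schema)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for j in judgments:
        if j["unclear"]:
            continue  # questions marked "unclear" are ignored when scoring raters
        total[j["worker"]] += 1
        correct[j["worker"]] += int(j["chosen"] == j["gold"])
    kept = {w for w in total if correct[w] / total[w] >= threshold}
    return [j for j in judgments if j["worker"] in kept]
```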
On MC500, there is a mild correlation between a worker’s grammar test score and the judged grammar quality of that worker’s story (correlation of 0.24). Interestingly, this relation disappeared once MC500 was curated, likely due to repairing the stories with the worst grammar. On MC160, there is a mild correlation between the clarity and the number of words in the question and answer (0.20 and 0.18). All other correlations were below 0.15. These factors could be integrated into an estimate for age-appropriateness, clarity, and grammar, potentially reducing the need for raters. Table 3 provides statistics on each corpus. MC160 and MC500 are similar in average number of words per story, question, and answer, as well as the median writing time. The most commonly used nouns in MC500 are: day, friend, time, home, house, mother, dog, mom, school, dad, cat, tree, and boy. The stories vary widely in theme. The first 10 stories of the randomly-ordered MC500 set are about: travelling to Miami to visit friends, waking up and saying hello to pets, a bully on a schoolyard, visiting a farm, collecting insects at Grandpa’s house, planning a friend’s birthday party, selecting clothes for a school dance, keeping animals from eating your ice cream, animals ordering food, and adventures of a boy and his dog. We randomly divided MC160 and MC500 into train, development, and test sets of 70, 30, and 60 stories and 300, 50, and 150 stories, respectively. 6 Baseline System and Results We wrote two baseline systems, both using only simple lexical features. The first system used a sliding window, matching a bag of words constructed from the question and hypothesized answer to the text. Since this ignored long range dependencies, we added a second, word-distance based algorithm. The distance-based score was simply subtracted from the window-based score to arrive at the final score (we tried scaling the distance score before subtraction but this did not improve results on the MC160 train set). The algorithms are summarized in Figure 4. A coin flip is used to break ties. The use of inverse word counts was inspired by TF-IDF. [Figure 4 — formulas garbled in extraction. Caption: The two lexical-based algorithms used for the baselines. Algorithm 1 is a sliding window that scores each hypothesized answer by matching the bag of words built from the question and answer against the passage using inverse word counts; Algorithm 2 is a distance-based score built from the minimum number of words between an occurrence of a question word and an occurrence of an answer word in the passage, plus one. (Inputs: passage P, set of passage words PW, ith word in passage Pi, set of words in question Q, sets of words in hypothesized answers A1..4, and set of stop words U.)] Results for MC160 and MC500 are shown in Table 4 and Table 5. [Table 4 — numeric values garbled in extraction; percent correct on the MC160 questions. SW: sliding window algorithm. SW+D: combined results with sliding window and distance based algorithms. Single/Multi: questions marked by worker as requiring a single/multiple sentence(s) to answer. All differences between SW and SW+D are significant (p<0.01 using the two-tailed paired t-test).] [Table 5 — numeric values garbled in extraction; the same for MC500, notation as above. All differences between SW and SW+D are significant (p<0.01, tested as above).]
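Because the formulas in Figure 4 did not survive extraction, the following is only a plausible reading of the prose description: each candidate answer is scored by a sliding window that sums inverse word counts over matched words from the question-plus-answer bag, and a distance-based score (built from the minimum separation of question and answer words in the passage) is subtracted from it. The window size, the exact inverse-count weighting, the stopword list, and the normalization are guesses, not the authors' definitions.

```python
import math
import re
from collections import Counter

STOPWORDS = {"a", "an", "the", "of", "in", "on", "to", "and", "or", "is", "was", "what", "who"}

def tokens(text):
    return [w.lower() for w in re.findall(r"[A-Za-z]+", text)]

def inverse_counts(passage_tokens):
    # One plausible inverse-count weighting (TF-IDF-inspired); the exact form is a guess.
    counts = Counter(passage_tokens)
    return {w: math.log(1.0 + 1.0 / c) for w, c in counts.items()}

def sliding_window_score(passage_tokens, target, ic):
    window = len(target)
    best = 0.0
    for start in range(len(passage_tokens)):
        score = sum(ic.get(w, 0.0)
                    for w in passage_tokens[start:start + window] if w in target)
        best = max(best, score)
    return best

def distance_score(passage_tokens, q_words, a_words):
    positions = {}
    for i, w in enumerate(passage_tokens):
        positions.setdefault(w, []).append(i)
    q_hits = [w for w in q_words if w in positions and w not in STOPWORDS]
    a_hits = [w for w in a_words if w in positions and w not in STOPWORDS]
    if not q_hits or not a_hits:
        return 1.0  # worst-case penalty when either side is unmatched (assumption)
    d = min(abs(i - j) for q in q_hits for a in a_hits
            for i in positions[q] for j in positions[a])
    return (d + 1) / max(len(passage_tokens) - 1, 1)

def answer_question(passage, question, answers):
    p = tokens(passage)
    ic = inverse_counts(p)
    q = set(tokens(question))
    scores = [sliding_window_score(p, q | set(tokens(a)), ic)
              - distance_score(p, q, set(tokens(a)))
              for a in answers]
    # Ties go to the first index here rather than the coin flip used in the paper.
    return max(range(len(answers)), key=lambda i: scores[i])
```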
The MC160 train and development sets were used for tuning. The baseline algorithm was authored without seeing any portion of MC500, so both the MC160 test set and all of MC500 were used for testing (although we nevertheless report results on the train/test split). [Table 6 — numeric values garbled in extraction; percent correct on the MC160 and MC500 test sets for the baseline (SW+D), RTE, and combined systems. The ǂ indicates statistical significance vs. baseline (p<0.01 using the two-tailed paired t-test). MC160 combined vs. baseline has p-value 0.063.] Note that adding the distance based algorithm improved accuracy by approximately 10% absolute on MC160 and approximately 6% on MC500. Overall, error rates on MC500 are higher than on MC160, which agrees with human performance (see Table 2), suggesting that MC500’s questions are more difficult. 7 Recognizing Textual Entailment Results We also tried using a “recognizing textual entailment” (RTE) system to answer MCTest questions. The goal of RTE (Dagan et al., 2005) is to determine whether a given statement can be inferred from a particular text. We can cast MCTest as an RTE task by converting each question-answer pair into a statement, and then selecting the answer whose statement has the highest likelihood of being entailed by the story. For example, in the sample story given in Figure 1, the second question can be converted into four statements (one for each answer), and the RTE system should select the statement “James pulled pudding off of the shelves in the grocery store” as the most likely one. For converting question-answer pairs to statements, we used the rules employed in a web-based question answering system (Cucerzan and Agichtein, 2005). For RTE, we used BIUTEE (Stern and Dagan, 2011), which performs better than the median system in the past four RTE competitions. We ran BIUTEE both in its default configuration, as well as with its optional additional data sources (FrameNet, ReVerb, DIRT, and others as found on the BIUTEE home page). The default configuration performed better so we present its results here. The results in Table 6 show that the RTE method performed worse than the baseline. We also combined the baseline and RTE system by training BIUTEE on the train set and using the development set to optimize a linear combination of BIUTEE with the baseline; the combined system outperforms either component system on MC500. It is possible that with some tuning, an RTE system will outperform our baseline system. Nevertheless, these RTE results, and the performance of the baseline system, both suggest that the reading comprehension task described here will not be trivially solved by off-the-shelf techniques. 8 Making Data and Results an Ongoing Resource Our goal in constructing this data is to encourage research and innovation in the machine comprehension of text. Thus, we have made both MC160 and MC500 freely available for download at http://research.microsoft.com/mct. To our knowledge, these are the largest copyright-free reading comprehension data sets publicly available. To further encourage research on these data, we will be continually updating the webpage with the best-known published results to date, along with pointers to those publications. One of the difficulties in making progress on a particular task is implementing previous work in order to apply improvements to it. To mitigate this difficulty, we are encouraging researchers who use the data to (optionally) provide per-answer scores from their system. 
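A sketch of the kind of score combination described above: given per-answer scores from two systems, grid-search the mixing weight that maximizes development-set accuracy and apply it to the test set. The data layout (per-question lists of four scores plus a gold answer index) is assumed, not taken from the released files, and this is not BIUTEE's own training procedure.

```python
def accuracy(weight, scores_a, scores_b, gold):
    correct = 0
    for sa, sb, g in zip(scores_a, scores_b, gold):
        combined = [weight * a + (1.0 - weight) * b for a, b in zip(sa, sb)]
        correct += int(max(range(len(combined)), key=lambda i: combined[i]) == g)
    return correct / len(gold)

def tune_weight(dev_a, dev_b, dev_gold, steps=101):
    # Simple grid search over the mixing weight on the development set.
    return max((i / (steps - 1) for i in range(steps)),
               key=lambda w: accuracy(w, dev_a, dev_b, dev_gold))

# Usage sketch (hypothetical variables):
# w = tune_weight(dev_baseline_scores, dev_rte_scores, dev_gold)
# print(accuracy(w, test_baseline_scores, test_rte_scores, test_gold))
```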
Doing so has three benefits: (a) a new system can be measured in the context of the errors made by the previous systems, allowing each research effort to incrementally add useful functionality without needing to also re-implement the current state-of-the-art; (b) it allows system performance to be measured using paired statistical testing, which will substantially increase the ability to determine whether small improvements are significant; and (c) it enables researchers to perform error analysis on any of the existing systems, simplifying the process of identifying and tackling common sources of error. We will also periodically ensemble the known systems using standard machine learning techniques and make those results available as well (unless the existing state-of-the-art already does such ensembling). The released data contains the stories and questions, as well as the results from workers who rated the stories and took the tests. The latter may be used, for example, to measure machine performance vs. human performance on a per-question basis (i.e., does your algorithm make similar mistakes to humans?), or vs. the judged clarity of each story. The ratings, as well as whether a question needs multiple sentences to answer, should typically only be used in evaluation, since such information is not generally available for most text. We will also provide an anonymized author id for each story, which could allow additional research such as using other works by the same author when understanding a story, or research on authorship attribution (e.g., Stamatatos, 2009). 9 Future Work We plan to use this dataset to evaluate approaches for machine comprehension, but are making it available now so that others may do the same. If MCTest is used we will collect more story sets and will continue to refine the collection process. One interesting research direction is ensuring that the questions are difficult enough to challenge state-ofthe-art techniques as they develop. One idea for this is to apply existing techniques automatically during story set creation to see whether a question is too easily answered by a machine. By requiring authors to create difficult questions, each data set will be made more and more difficult (but still answerable by humans) as the state-of-the-art methods advance. We will also experiment with timing the raters as they answer questions to see if we can find those that are too easy for people to answer. Removing such questions may increase the difficulty for machines as well. Additionally, any divergence between how easily a person answers a question vs. how easily a machine does may point toward new techniques for improving machine comprehension; we plan to conduct research in this direction as well as make any such data available for others. 10 Conclusion We present the MCTest dataset in the hope that it will help spur research into the machine comprehension of text. The metric (the accuracy on the question sets) is clearly defined, and on that metric, lexical baseline algorithms only attain approximately 58% correct on test data (the MC500 set) as opposed to the 100% correct that the majority of crowd-sourced judges attain. A key component of MCTest is the scalable design: we have shown that data whose quality approaches that of expertly curated data can be generated using crowd sourcing coupled with expert correction of worker-identified errors. 
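Shared per-answer scores make benefit (b) straightforward: once each system's predictions are reduced to per-question 0/1 correctness over the same question list, a paired test applies directly. A sketch with SciPy, on made-up outcomes:

```python
from scipy import stats

def paired_significance(correct_sys1, correct_sys2):
    """Both arguments: equal-length 0/1 correctness lists over the same
    questions, in the same order."""
    return stats.ttest_rel(correct_sys1, correct_sys2)

# Made-up outcomes on ten shared questions.
sys_a = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
sys_b = [1, 0, 0, 1, 0, 1, 0, 1, 0, 0]
t_stat, p_value = paired_significance(sys_a, sys_b)
print(f"t = {t_stat:.2f}, two-tailed p = {p_value:.4f}")
```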
Should MCTest prove useful to the community, we will continue to gather data, both to increase the corpus size, and to keep the test sets fresh. The data is available at http://research.microsoft.com/mct and any submitted results will be posted there too. Because submissions will be requested to include the score for each test item, researchers will easily be able to compare their systems with those of others, and investigation of ensembles comprised of components from several different teams will be straightforward. MCTest also contains supplementary material that researchers may find useful, such as worker accuracies on a grammar test and crowd-sourced measures of the quality of their stories. Acknowledgments We would like to thank Silviu Cucerzan and Lucy Vanderwende for their help with converting questions to statements and other useful discussions. References M. Agarwal and P. Mannem. 2011. Automatic Gap-fill Question Generation from Text Books. In Proceedings of the Sixth Workshop on Innovative Use of NLP for Building Educational Applications, 56–64. E. Breck, M. Light, G.S. Mann, E. Riloff, B. Brown, P. Anand, M. Rooth, M. Thelen. 2001. Looking under the hood: Tools for diagnosing your question answering engine. In Proceedings of the Workshop on Open-domain Question Answering, 12, 1-8. E. Charniak. 1972. Toward a Model of Children’s Story Comprehension. Technical Report, 266, MIT Artificial Intelligence Laboratory, Cambridge, MA. P. Clark, P. Harrison, and X. Yao. 2012. An Entailment-Based Approach to the QA4MRE Challenge. In Proceedings of the Conference and Labs of the Evaluation Forum (CLEF) 2012. S. Cucerzan and E. Agichtein. 2005. Factoid Question Answering over Unstructured and Structured Content on the Web. In Proceedings of the Fourteenth Text Retrieval Conference (TREC). I. Dagan, O. Glickman, and B. Magnini. 2006. The PASCAL Recognising Textual Entailment Challenge. In J. Quiñonero-Candela, I. Dagan, B. Magnini, F. d'Alché-Buc (Eds.), Machine Learning Challenges. Lecture Notes in Computer Science, Vol. 3944, pp. 177-190, Springer. D. Goldwasser, R. Reichart, J. Clarke, D. Roth. 2011. Confidence Driven Unsupervised Semantic Parsing. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, 1486-1495. E. Grois and D.C. Wilkins. 2005. Learning Strategies for Story Comprehension: A Reinforcement Learning Approach. In Proceedings of the Twenty Second International Conference on Machine Learning, 257-264. S.M. Harabagiu, S.J. Maiorano, and M.A. Pasca. 2003. Open-Domain Textual Question Answering Techniques. Natural Language Engineering, 9(3):1-38. Cambridge University Press, Cambridge, UK. L. Hirschman, M. Light, E. Breck, and J.D. Burger. 1999. Deep Read: A Reading Comprehension System. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL), 325-332. J. Horton and L. Chilton. 2010. The labor economics of paid crowdsourcing. In Proceedings of the 11th ACM Conference on Electronic Commerce, 209-218. V. Kuperman, H. Stadthagen-Gonzalez, M. Brysbaert. 2012. Age-of-acquisition ratings for 30,000 English words. Behavior Research Methods, 44(4):978-990. J.L. Leidner, T. Dalmas, B. Webber, J. Bos, C. Grover. 2003. Automatic Multi-Layer Corpus Annotation for Evaluating Question Answering Methods: CBC4Kids. In Proceedings of the 3rd International Workshop on Linguistically Interpreted Corpora. E.T. Mueller. 2010. Story Understanding Resources. http://xenia.media.mit.edu/~mueller/storyund/storyres.html.
G. Paolacci, J. Chandler, and P. Ipeirotis. 2010. Running experiments on Amazon Mechanical Turk. Judgment and Decision Making, 5(5):411-419. E. Stamatatos. 2009. A survey of modern authorship attribution methods. J. Am. Soc. Inf. Sci., 60:538–556. A. Stern and I. Dagan. 2011. A Confidence Model for Syntactically-Motivated Entailment Proofs. In Proceedings of Recent Advances in Natural Language Processing (RANLP). L.R. Tang and R.J. Mooney. 2001. Using Multiple Clause Constructors in Inductive Logic Programming for Semantic Parsing. In Proceedings of the European Conference on Machine Learning (ECML), 466-477. G. Tur, D. Hakkani-Tur, and L. Heck. 2010. What is left to be understood in ATIS? Spoken Language Technology Workshop, 19-24. E.M. Voorhees and D.M. Tice. 1999. The TREC-8 Question Answering Track Evaluation. In Proceedings of the Eighth Text Retrieval Conference (TREC-8). Wellner, L. Ferro, W. Greiff, and L. Hirschman. 2005. Reading comprehension tests for computer-based understanding evaluation. Natural Language Engineering, 12(4):305-334. Cambridge University Press, Cambridge, UK. J.M. Zelle and R.J. Mooney. 1996. Learning to Parse Database Queries using Inductive Logic Programming. In Proceedings of the Thirteenth National Conference on Artificial Intelligence (AAAI), 1050-1055. L.S. Zettlemoyer and M. Collins. 2009. Learning Context-Dependent Mappings from Sentences to Logical Form. In Proceedings of the 47th Annual Meeting of the Association for Computational Linguistics (ACL), 976-984. G. Zweig and C.J.C. Burges. 2012. A Challenge Set for Advancing Language Modeling. In Proceedings of the Workshop on the Future of Language Modeling for HLT, NAACL-HLT.
4 0.7117483 31 emnlp-2013-Automatic Feature Engineering for Answer Selection and Extraction
Author: Aliaksei Severyn ; Alessandro Moschitti
Abstract: This paper proposes a framework for automatically engineering features for two important tasks of question answering: answer sentence selection and answer extraction. We represent question and answer sentence pairs with linguistic structures enriched by semantic information, where the latter is produced by automatic classifiers, e.g., question classifier and Named Entity Recognizer. Tree kernels applied to such structures enable a simple way to generate highly discriminative structural features that combine syntactic and semantic information encoded in the input trees. We conduct experiments on a public benchmark from TREC to compare with previous systems for answer sentence selection and answer extraction. The results show that our models greatly improve on the state of the art, e.g., up to 22% on F1 (relative improvement) for answer extraction, while using no additional resources and no manual feature engineering.
Author: Baichuan Li ; Jing Liu ; Chin-Yew Lin ; Irwin King ; Michael R. Lyu
Abstract: Social media like forums and microblogs have accumulated a huge amount of user generated content (UGC) containing human knowledge. Currently, most of UGC is listed as a whole or in pre-defined categories. This “list-based” approach is simple, but hinders users from browsing and learning knowledge of certain topics effectively. To address this problem, we propose a hierarchical entity-based approach for structuralizing UGC in social media. By using a large-scale entity repository, we design a three-step framework to organize UGC in a novel hierarchical structure called “cluster entity tree (CET)”. With Yahoo! Answers as a test case, we conduct experiments and the results show the effectiveness of our framework in constructing CET. We further evaluate the performance of CET on UGC organization in both user and system aspects. From a user aspect, our user study demonstrates that, with CET-based structure, users perform significantly better in knowledge learning than using traditional list-based approach. From a system aspect, CET substantially boosts the performance of two information retrieval models (i.e., vector space model and query likelihood language model).
6 0.43215239 200 emnlp-2013-Well-Argued Recommendation: Adaptive Models Based on Words in Recommender Systems
7 0.42742005 59 emnlp-2013-Deriving Adjectival Scales from Continuous Space Word Representations
8 0.38247317 153 emnlp-2013-Predicting the Resolution of Referring Expressions from User Behavior
9 0.35081646 173 emnlp-2013-Simulating Early-Termination Search for Verbose Spoken Queries
10 0.32226536 203 emnlp-2013-With Blinkers on: Robust Prediction of Eye Movements across Readers
11 0.31337908 179 emnlp-2013-Summarizing Complex Events: a Cross-Modal Solution of Storylines Extraction and Reconstruction
12 0.29261571 161 emnlp-2013-Rule-Based Information Extraction is Dead! Long Live Rule-Based Information Extraction Systems!
13 0.27219418 91 emnlp-2013-Grounding Strategic Conversation: Using Negotiation Dialogues to Predict Trades in a Win-Lose Game
15 0.25676802 121 emnlp-2013-Learning Topics and Positions from Debatepedia
17 0.24292843 166 emnlp-2013-Semantic Parsing on Freebase from Question-Answer Pairs
18 0.23392594 137 emnlp-2013-Multi-Relational Latent Semantic Analysis
19 0.21486147 4 emnlp-2013-A Dataset for Research on Short-Text Conversations
20 0.21336116 6 emnlp-2013-A Generative Joint, Additive, Sequential Model of Topics and Speech Acts in Patient-Doctor Communication
topicId topicWeight
[(3, 0.036), (9, 0.01), (18, 0.038), (22, 0.04), (30, 0.043), (51, 0.144), (66, 0.015), (71, 0.022), (75, 0.031), (77, 0.012), (90, 0.47), (96, 0.039)]
simIndex simValue paperId paperTitle
same-paper 1 0.83506179 155 emnlp-2013-Question Difficulty Estimation in Community Question Answering Services
Author: Jing Liu ; Quan Wang ; Chin-Yew Lin ; Hsiao-Wuen Hon
Abstract: In this paper, we address the problem of estimating question difficulty in community question answering services. We propose a competition-based model for estimating question difficulty by leveraging pairwise comparisons between questions and users. Our experimental results show that our model significantly outperforms a PageRank-based approach. Most importantly, our analysis shows that the text of question descriptions reflects the question difficulty. This implies the possibility of predicting question difficulty from the text of question descriptions.
2 0.80192864 33 emnlp-2013-Automatic Knowledge Acquisition for Case Alternation between the Passive and Active Voices in Japanese
Author: Ryohei Sasano ; Daisuke Kawahara ; Sadao Kurohashi ; Manabu Okumura
Abstract: We present a method for automatically acquiring knowledge for case alternation between the passive and active voices in Japanese. By leveraging several linguistic constraints on alternation patterns and lexical case frames obtained from a large Web corpus, our method aligns a case frame in the passive voice to a corresponding case frame in the active voice and finds an alignment between their cases. We then apply the acquired knowledge to a case alternation task and prove its usefulness.
Author: Eva Maria Vecchi ; Roberto Zamparelli ; Marco Baroni
Abstract: In this study, we use compositional distributional semantic methods to investigate restrictions in adjective ordering. Specifically, we focus on properties distinguishing Adjective-Adjective-Noun phrases in which there is flexibility in the adjective ordering from those bound to a rigid order. We explore a number of measures extracted from the distributional representation of AAN phrases which may indicate a word order restriction. We find that we are able to distinguish the relevant classes and the correct order based primarily on the degree of modification of the adjectives. Our results offer fresh insight into the semantic properties that determine adjective ordering, building a bridge between syntax and distributional semantics.
4 0.67908669 29 emnlp-2013-Automatic Domain Partitioning for Multi-Domain Learning
Author: Di Wang ; Chenyan Xiong ; William Yang Wang
Abstract: [...] might not be generalizable to other domains (Ben-David et al., 2006; Ben-David et al., 2010). Multi-Domain learning (MDL) assumes that the domain labels in the dataset are known. However, when there are multiple metadata attributes available, it is not always straightforward to select a single best attribute for domain partition, and it is possible that combining more than one metadata attributes (including continuous attributes) can lead to better MDL performance. In this work, we propose an automatic domain partitioning approach that aims at providing better domain identities for MDL. We use a supervised clustering approach that learns the domain distance between data instances, and then cluster the data into better domains for MDL. Our experiment on real multi-domain datasets shows that using our automatically generated domain partition improves over popular MDL methods.
5 0.62822658 192 emnlp-2013-Unsupervised Induction of Contingent Event Pairs from Film Scenes
Author: Zhichao Hu ; Elahe Rahimtoroghi ; Larissa Munishkina ; Reid Swanson ; Marilyn A. Walker
Abstract: Human engagement in narrative is partially driven by reasoning about discourse relations between narrative events, and the expectations about what is likely to happen next that results from such reasoning. Researchers in NLP have tackled modeling such expectations from a range of perspectives, including treating it as the inference of the CONTINGENT discourse relation, or as a type of common-sense causal reasoning. Our approach is to model likelihood between events by drawing on several of these lines of previous work. We implement and evaluate different unsupervised methods for learning event pairs that are likely to be CONTINGENT on one another. We refine event pairs that we learn from a corpus of film scene descriptions utilizing web search counts, and evaluate our results by collecting human judgments of contingency. Our results indicate that the use of web search counts increases the average accuracy of our best method to 85.64% over a baseline of 50%, as compared to an average accuracy of 75.15% without web search.
6 0.38075724 89 emnlp-2013-Gender Inference of Twitter Users in Non-English Contexts
7 0.37931839 46 emnlp-2013-Classifying Message Board Posts with an Extracted Lexicon of Patient Attributes
8 0.37747771 154 emnlp-2013-Prior Disambiguation of Word Tensors for Constructing Sentence Vectors
9 0.37738132 48 emnlp-2013-Collective Personal Profile Summarization with Social Networks
10 0.37664452 51 emnlp-2013-Connecting Language and Knowledge Bases with Embedding Models for Relation Extraction
11 0.3672812 189 emnlp-2013-Two-Stage Method for Large-Scale Acquisition of Contradiction Pattern Pairs using Entailment
12 0.36546841 179 emnlp-2013-Summarizing Complex Events: a Cross-Modal Solution of Storylines Extraction and Reconstruction
13 0.36501795 62 emnlp-2013-Detection of Product Comparisons - How Far Does an Out-of-the-Box Semantic Role Labeling System Take You?
14 0.36456925 98 emnlp-2013-Image Description using Visual Dependency Representations
15 0.36377656 152 emnlp-2013-Predicting the Presence of Discourse Connectives
17 0.3625665 160 emnlp-2013-Relational Inference for Wikification
18 0.3583253 126 emnlp-2013-MCTest: A Challenge Dataset for the Open-Domain Machine Comprehension of Text
19 0.3537384 87 emnlp-2013-Fish Transporters and Miracle Homes: How Compositional Distributional Semantics can Help NP Parsing
20 0.35341176 183 emnlp-2013-The VerbCorner Project: Toward an Empirically-Based Semantic Decomposition of Verbs