acl acl2011 acl2011-91 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Paul Piwek ; Svetlana Stoyanchev
Abstract: This short paper introduces an implemented and evaluated monolingual Text-to-Text generation system. The system takes monologue and transforms it to two-participant dialogue. After briefly motivating the task of monologue-to-dialogue generation, we describe the system and present an evaluation in terms of fluency and accuracy.
Reference: text
sentIndex sentText sentNum sentScore
1 Data-oriented Monologue-to-Dialogue Generation Paul Piwek Centre for Research in Computing The Open University Walton Hall, Milton Keynes, UK p.piwek@open.ac.uk. [sent-1, score-0.258]
2 Abstract This short paper introduces an implemented and evaluated monolingual Text-to-Text generation system. [sent-4, score-0.067]
3 The system takes monologue and transforms it to two-participant dialogue. [sent-5, score-0.497]
4 After briefly motivating the task of monologue-to-dialogue generation, we describe the system and present an evaluation in terms of fluency and accuracy. [sent-6, score-0.17]
5 1 Introduction Several empirical studies show that delivering information in the form of a dialogue, as opposed to monologue, can be particularly effective for education (Craig et al. [sent-7, score-0.03]
6 Information-delivering or expository dialogue was already employed by Plato to communicate his philosophy. [sent-10, score-0.673]
7 It is used primarily to convey information and possibly also to make an argument; this is in contrast with dramatic dialogue, which focuses on character development and narrative. [sent-11, score-0.604]
8 Expository dialogue lends itself well to presentation through computer-animated agents (Prendinger and Ishizuka, 2004). [sent-12, score-0.714]
9 Automatic generation of dialogue from text in monologue makes it possible to convert information into dialogue as and when needed. [sent-14, score-1.772]
10 The approach is data-oriented in that the mapping rules have been automatically derived from an annotated parallel monologue/dialogue corpus, rather than being handcrafted. [sent-19, score-0.062]
11 2 Related Work For the past decade, generation of information-delivering dialogues has been approached primarily as an AI planning task. [sent-25, score-0.255]
12 (2000) describe a system, based on a centralised dialogue planner, that creates dialogues between a virtual car buyer and seller from a database; this approach has been extended by van Deemter et al. [sent-27, score-0.796]
13 Others have used (semi-) autonomous agents for dialogue generation (Cavazza and Charles, 2005; Mateas and Stern, 2005). [sent-29, score-0.734]
14 More recently, first steps have been taken towards treating dialogue generation as an instance of Text-to-Text generation (Rus et al. [sent-30, score-0.738]
15 , 2007) employs rules that map text annotated with discourse structures, along the lines of Rhetorical Structure Theory (Mann and Thompson, 1988), to specific dialogue sequences. [sent-33, score-0.757]
16 Common to all the approaches discussed so far has been the manual creation of generation resources, whether it be mappings from knowledge representations or discourse to dialogue structure. [sent-34, score-0.857]
17 This corpus consists of approximately 700 turns of dialogue, by acclaimed authors such as Mark Twain, that are aligned with monologue that was written on the basis of the dialogue, with the specific aim of expressing the same information as the dialogue (see footnote 2). [sent-38, score-0.497]
18 The monologue side has been annotated with discourse relations, using an adaptation of the annotation guidelines of Carlson and Marcu (2001), whereas the dialogue side has been marked up with dialogue acts, using tags inspired by the schemes of Bunt (2000), Carletta et al. [sent-39, score-1.844]
19 As we will describe in the next section, our approach uses the CODA corpus to extract mappings from monologue to dialogue. [sent-41, score-0.569]
20 3 Monologue-to-Dialogue Generation Approach Our approach is based on five principal steps: I Discourse parsing: analysis of the input monologue in terms of the underlying discourse relations. [sent-42, score-0.611]
21 II Relation conversion: mapping of text annotated with discourse relations to a sequence of dialogue acts, with segments of the input text assigned to corresponding dialogue acts. [sent-43, score-1.384]
22 III Verbalisation: verbal realisation of dialogue acts based on the dialogue act type and text of the corresponding monologue segment. [sent-44, score-1.836]
23 IV Combination: putting the verbalised dialogue acts together to create a complete dialogue, and V Presentation: rendering of the dialogue (this can range from simple textual dialogue scripts to computer-animated spoken dialogue). [sent-45, score-1.418]
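To make the five steps above concrete, the following is a minimal, hypothetical sketch of the pipeline in Python. All class and function names are illustrative and not taken from the CODA system; steps I-III are left as stubs because they stand in for a discourse parser, the corpus-derived conversion rules, and the verbaliser described in the remainder of this section.

```python
# Minimal sketch of the five-step monologue-to-dialogue pipeline described above.
# All names here are hypothetical; they are not the CODA system's actual code.

from dataclasses import dataclass

@dataclass
class DiscourseRelation:
    relation: str   # e.g. "Manner-Means"
    nucleus: str    # nucleus text span
    satellite: str  # satellite text span

@dataclass
class DialogueAct:
    act_type: str   # e.g. "ComplQ", "Expl"
    content: str    # monologue text assigned to this act

def parse_discourse(monologue: str) -> list[DiscourseRelation]:
    """Step I: discourse parsing (human annotation or a parser such as DAS/HILDA)."""
    raise NotImplementedError

def convert_relations(relations: list[DiscourseRelation]) -> list[DialogueAct]:
    """Step II: map discourse relations to dialogue-act sequences (corpus-derived rules)."""
    raise NotImplementedError

def verbalise(act: DialogueAct) -> str:
    """Step III: realise one dialogue act (copy, canned text, or question generation)."""
    raise NotImplementedError

def combine(utterances: list[str], speakers=("A", "B")) -> str:
    """Step IV: interleave the utterances into a two-participant dialogue script."""
    return "\n".join(f"{speakers[i % 2]}: {u}" for i, u in enumerate(utterances))

def present(dialogue_script: str) -> str:
    """Step V: presentation; here simply the verbatim dialogue text."""
    return dialogue_script

def monologue_to_dialogue(monologue: str) -> str:
    acts = convert_relations(parse_discourse(monologue))
    return present(combine([verbalise(a) for a in acts]))
```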
24 Footnote 2: Consequently, the corpus was not constructed entirely of pre-existing text; some of the text was authored as part of the corpus construction. [sent-50, score-0.028]
25 For step I we rely on human annotation or existing discourse parsers such as DAS (Le and Abeysinghe, 2003) and HILDA (duVerle and Prendinger, 2009). [sent-52, score-0.114]
26 For the current study, the final step, V, consists simply of verbatim presentation of the dialogue text. [sent-53, score-0.651]
27 Step II is data-oriented in that we have extracted mappings from discourse relation occurrences in the corpus to corresponding dialogue act sequences, following the approach described in Piwek and Stoyanchev (2010). [sent-55, score-0.883]
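As a rough illustration of what extracting such mappings could look like, the sketch below counts which dialogue-act sequences co-occur with each discourse relation in an aligned corpus and ranks them by frequency. The input record format is an assumption made for this example; it is not the actual CODA annotation scheme or the extraction procedure of Piwek and Stoyanchev (2010).

```python
# Hypothetical sketch: harvest relation-to-dialogue-act mappings from aligned
# monologue/dialogue annotations by counting co-occurrences.

from collections import Counter, defaultdict

def extract_mappings(aligned_examples):
    """aligned_examples: iterable of (relation_name, dialogue_act_sequence) pairs,
    e.g. ("Manner-Means", ("ComplQ", "Expl"))."""
    counts = defaultdict(Counter)
    for relation, da_sequence in aligned_examples:
        counts[relation][tuple(da_sequence)] += 1
    # For each relation, return its attested dialogue-act sequences ranked by frequency.
    return {relation: seqs.most_common() for relation, seqs in counts.items()}

# Example: extract_mappings([("Manner-Means", ("ComplQ", "Expl"))])
# -> {"Manner-Means": [(("ComplQ", "Expl"), 1)]}
```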
28 Table 1 shows the mapping from text with discourse relations to dialogue act sequences (i indicates implemented mappings). [sent-57, score-0.881]
29 [Table 1 header columns: DA sequence, A, CD, CT, ER, MM, TR] Table 1: Mappings from discourse relations (A = Attribution, CD = Condition, CT = Contrast, ER = Explanation-Reason, MM = Manner-Means) to dialogue act sequences (explained below), together with the type of verbalisation transformation TR being d(irect) or c(omplex). [sent-58, score-0.968]
30 For comparison, the table also shows the much less varied mappings implemented by the T2D system (indicated with t). [sent-59, score-0.072]
31 Note that the actual mappings of the T2D system are directly from discourse relation to dialogue text. [sent-60, score-0.813]
32 The dialogue acts are not explicitly represented by the system, in contrast with the current two-stage approach, which distinguishes between relation conversion and verbalisation. [sent-61, score-0.688]
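A possible in-memory form for the extracted mappings is sketched below: a lookup from a discourse relation to candidate dialogue-act sequences with their d(irect)/c(omplex) label. The specific sequences and labels shown are examples consistent with the text of this section, not a verbatim reproduction of Table 1.

```python
# Illustrative lookup for the relation-conversion step (step II). Sequences and d/c labels
# are examples only; consult Table 1 of the paper for the attested mappings.

RELATION_TO_DA_SEQUENCES = {
    "Attribution":        [(("ComplQ", "Expl"), "c")],
    "Condition":          [(("ComplQ", "Expl"), "c")],
    "Contrast":           [],  # left empty here; see Table 1 for the attested sequences
    "Explanation-Reason": [(("ComplQ", "Expl"), "c"), (("Expl", "ComplQ", "Expl"), "d")],
    "Manner-Means":       [(("ComplQ", "Expl"), "c")],
}

def candidate_sequences(relation: str):
    """Return candidate (dialogue-act sequence, transformation type) pairs for a relation."""
    return RELATION_TO_DA_SEQUENCES.get(relation, [])
```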
33 Verbalisation, step III, takes a dialogue act type and the specification of its semantic content as given by the input monologue text. [sent-62, score-1.171]
34 Mapping this to the appropriate dialogue act requires mappings that vary in complexity. [sent-63, score-0.746]
35 For example, Expl(ain) can be generated by simply copying a monologue segment to a dialogue utterance. [sent-64, score-1.168]
36 The dialogue acts Yes and Agreement can be generated using canned text, such as “That is true” and “I agree with you”. [sent-65, score-0.755]
37 To generate YNQ and FactQ, we use the CMU Question Generation tool (Heilman and Smith, 2010) which is based on a combination of syntactic transformation rules implemented with tregex (Levy and Andrew, 2006) and statistical methods. [sent-67, score-0.103]
38 To generate the Compl(ex) Q(uestion) in the ComplQ;Expl Dialogue Act (DA) sequence, we use a combination of the CMU tool and lexical transformation rules (see footnote 3). [sent-68, score-0.032]
39 The GEN example in Table 2 illustrates this: the input monologue has a Manner-Means relation between the nucleus 'In September, Ashland settled the long-simmering dispute' and the satellite 'by agreeing to pay Iran 325 million USD'. [sent-69, score-0.742]
40 The satellite is copied without alteration to the Explain dialogue act. [sent-70, score-0.655]
41 The nucleus is mapped to the complex question ('How did Ashland settle the long-simmering dispute?') that is obtained with the CMU QG tool from the declarative input sentence. [sent-72, score-0.022]
42 A similar approach is applied for the other relations (Attribution, Condition and Explanation-Reason) that can lead to a ComplQ; Expl dialogue act sequence (see Table 1). [sent-73, score-0.713]
43 Generally, sequences requiring only copying or canned text are labelled d(irect) in Table 1, whereas those requiring syntactic transformation are labelled c(omplex). [sent-74, score-0.142]
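The dispatch between direct and complex verbalisation could look roughly like the sketch below, a more concrete (still hypothetical) version of the verbalisation step: Expl copies the monologue segment, Yes and Agreement use the canned strings quoted above, and question acts are delegated to a question generator. The generate_question function is a placeholder, not the real interface of the CMU QG tool.

```python
# Hypothetical sketch of the verbalisation step (step III): direct acts copy the monologue
# segment or use canned text; complex acts call a question generator.

CANNED_TEXT = {
    "Yes": "That is true.",
    "Agreement": "I agree with you.",
}

def generate_question(declarative_sentence: str, question_type: str) -> str:
    """Placeholder for transformation-based question generation (YNQ, FactQ, ComplQ)."""
    raise NotImplementedError("stand-in for the CMU QG tool (Heilman and Smith, 2010)")

def verbalise_act(act_type: str, segment: str) -> str:
    if act_type == "Expl":                       # direct: copy the monologue segment verbatim
        return segment
    if act_type in CANNED_TEXT:                  # direct: canned text
        return CANNED_TEXT[act_type]
    if act_type in ("YNQ", "FactQ", "ComplQ"):   # complex: syntactic transformation
        return generate_question(segment, act_type)
    raise ValueError(f"unknown dialogue act type: {act_type}")
```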
44 Footnote 3: In contrast, the ComplQ in the DA sequence Expl;ComplQ;Expl is generated using canned text such as 'Why?' [sent-75, score-0.09]
45 4 Evaluation We evaluate the output generated with both complex and direct rules for the relations of Table 1. [sent-78, score-0.151]
46 4.1 Materials, Judges and Procedure The input monologues were text excerpts from the Wall Street Journal as annotated in the RST Discourse Treebank. [sent-80, score-0.078]
47 To factor out the quality of the discourse annotations, we used the gold standard annotations of the Discourse Treebank and checked these for correctness, discarding a small number of incorrect annotations (see footnote 5). [sent-82, score-0.114]
48 We included text fragments with a variety of clause length, ordering of nucleus and satellite, and syntactic structure of clauses. [sent-83, score-0.044]
49 Table 2 shows examples of monologue/dialogue pairs: one with a generated dialogue and the other from the corpus. [sent-84, score-0.643]
50 We collected judgements on 53 pairs of monologue and corresponding dialogue. [sent-86, score-0.497]
51 19 pairs were judged by all four judges to obtain inter-annotator agreement statistics; the remainder was parcelled out. [sent-87, score-0.097]
52 38 pairs consisted of WSJ monologue and generated dialogue, henceforth GEN, and 15 pairs of CODA corpus monologue and human-authored dialogue, henceforth CORPUS (instances of generated and corpus dialogue were randomly interleaved); see Table 2 for examples. [sent-88, score-1.676]
53 Footnote 5: For instance, in our view 'without wondering' is incorrectly connected with the attribution relation to 'whether she is moving as gracefully as the scenery.' [sent-94, score-0.046]
54 GEN Monologue: In September, Ashland settled the long-simmering dispute by agreeing to pay Iran 325 million USD. [sent-95, score-0.174]
55 GEN Dialogue (ComplQ; Expl): A: How did Ashland settle the long-simmering dispute in December? B: By agreeing to pay Iran 325 million USD. [sent-96, score-0.097]
56 4.2 Results Accuracy: Three of the four judges marked 90% of monologue-dialogue pairs as presenting the same information (with pairwise κ of . [sent-102, score-0.118]
57 One judge interpreted the question differently and marked only 39% of pairs as containing the same information. [sent-106, score-0.092]
58 For the instances marked by more than one judge, we took the majority vote. [sent-108, score-0.05]
59 We found that 12 out of 13 instances (or 92%) of dialogue and monologue pairs from the CORPUS benchmark sample were judged to contain the same information. [sent-109, score-1.203]
60 For the GEN monologue-dialogue pairs, 28 out of 31 (90%) were judged to contain the same information. [sent-110, score-0.027]
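For illustration, the accuracy aggregation described above (majority vote over the judges for each pair, then the proportion of pairs judged to convey the same information) can be sketched as follows; the label strings are assumptions.

```python
# Sketch of the same-information accuracy computation: majority vote per pair, then the
# proportion of pairs whose majority label is "same".

from collections import Counter

def majority_vote(labels):
    """Most frequent label among the judges for one monologue-dialogue pair."""
    return Counter(labels).most_common(1)[0][0]

def same_information_accuracy(judgements_per_pair):
    """judgements_per_pair: list of per-pair label lists, each label 'same' or 'different'."""
    votes = [majority_vote(labels) for labels in judgements_per_pair]
    return sum(v == "same" for v in votes) / len(votes)

# e.g. same_information_accuracy([["same", "same", "different"], ["same"]]) -> 1.0
```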
61 Fluency: Although absolute agreement between judges was low (see footnote 6), pairwise agreement in terms of Spearman rank correlation (ρ) is reasonable (average: . [sent-111, score-0.093]
62 For the subset of instances with multiple annotations, we used the data from the judge with the highest average pairwise agreement (ρ = . [sent-115, score-0.037]
63 Judges ranked both monologues and dialogues for the GEN sample higher than for the CORPUS sample (possibly as a result of slightly greater length of the CORPUS fragments and some use of archaic language). (Footnote 6: For the four judges, we had an average pairwise κ of .) [sent-117, score-0.25]
[Figure 1: Mean Fluency Rating for Monologues and Dialogues (for 15 CORPUS and 38 GEN instances) with 95% confidence intervals] [sent-121, score-0.05]
65 However, the drop in fluency (see Figure 2) from monologue to dialogue is greater for the GEN sample (average: .89 points on the rating scale) than the CORPUS sample (average: . [sent-122, score-1.183]
67 05), suggesting that there is scope for improving the generation algorithm. [sent-125, score-0.067]
68 Figure 2: Fluency drop from monologue to corresponding dialogue (for 15 CORPUS and 38 GEN instances). [sent-126, score-1.158]
69 On the x-axis the fluency drop is marked, starting from no fluency drop (0) to a fluency drop of 3 (i.e., the dialogue is rated 3 points less than the monologue on the rating scale). [sent-127, score-0.681]
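A small sketch of the fluency analysis, under the assumption that ratings are numeric and that the drop for a pair is the monologue rating minus the dialogue rating, averaged per condition; pairwise judge agreement is computed as Spearman's ρ, as in the text. The scipy dependency is used only for the correlation.

```python
# Sketch of the fluency-drop and judge-agreement computations described above.

from statistics import mean
from scipy.stats import spearmanr

def mean_fluency_drop(monologue_ratings, dialogue_ratings):
    """Average drop in fluency from each monologue to its corresponding dialogue."""
    return mean(m - d for m, d in zip(monologue_ratings, dialogue_ratings))

def judge_agreement(ratings_judge_a, ratings_judge_b):
    """Pairwise agreement between two judges as Spearman's rank correlation."""
    rho, _p_value = spearmanr(ratings_judge_a, ratings_judge_b)
    return rho
```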
71 Direct versus Complex rules We examined the difference in fluency drop between direct and complex rules. [sent-130, score-0.3]
72 Figure 3 shows that the drop in fluency for dialogues generated with complex rules is higher than for the dialogues generated using direct rules (T-test p<. [sent-131, score-0.715]
73 This suggests that use of direct rules is more likely to result in high quality dialogue. [sent-133, score-0.073]
74 This is encouraging, given that Stoyanchev and Piwek (2010a) report higher frequencies in professionally authored dialogues of dialogue acts (YNQ, Expl) that can be dealt with using direct verbalisation (in contrast with low frequency of, e. [sent-134, score-0.954]
75 The system relies on discourse-to-dialogue structure rules that were automatically extracted from a parallel monologue/dialogue corpus. [sent-138, score-0.039]
76 An evaluation against a benchmark sample from the human-written corpus shows that both accuracy and fluency of generated dialogues are not worse than those of human-written dialogues. [sent-139, score-0.408]
77 However, the drop in fluency between input monologue and output dialogue is slightly worse for generated dialogues than for the benchmark sample. [sent-140, score-1.541]
78 We also established a difference in quality of output generated with complex versus direct discourse-to-dialogue rules, which can be exploited to improve overall output quality. [sent-141, score-0.073]
79 In future research, we aim to evaluate the accuracy and fluency of longer stretches of generated dialogue. [sent-142, score-0.209]
80 Additionally, we are currently carrying out a task-related evaluation of monologue versus dialogue to determine the utility of each. [sent-143, score-1.101]
81 The automated design of believable dialogues for animated presentation teams. [sent-157, score-0.196]
82 A novel discourse parser based on support vector machines. [sent-208, score-0.114]
83 A study to improve the efficiency of a discourse parsing system. [sent-222, score-0.114]
wordName wordTfidf (topN-words)
[('dialogue', 0.604), ('monologue', 0.497), ('piwek', 0.195), ('fluency', 0.17), ('dialogues', 0.149), ('coda', 0.136), ('expl', 0.136), ('gen', 0.121), ('stoyanchev', 0.12), ('complq', 0.117), ('discourse', 0.114), ('ashland', 0.078), ('factq', 0.078), ('monologues', 0.078), ('verbalisation', 0.078), ('mappings', 0.072), ('act', 0.07), ('judges', 0.07), ('expository', 0.069), ('prendinger', 0.069), ('generation', 0.067), ('dispute', 0.063), ('agents', 0.063), ('acts', 0.061), ('ynq', 0.058), ('drop', 0.057), ('satellite', 0.051), ('canned', 0.051), ('iran', 0.047), ('presentation', 0.047), ('da', 0.047), ('nucleus', 0.044), ('virtual', 0.043), ('agreeing', 0.042), ('pay', 0.042), ('uk', 0.04), ('relations', 0.039), ('rules', 0.039), ('bunt', 0.039), ('decl', 0.039), ('facta', 0.039), ('informationdelivering', 0.039), ('irect', 0.039), ('isard', 0.039), ('klesen', 0.039), ('mateas', 0.039), ('omplex', 0.039), ('procs', 0.039), ('generated', 0.039), ('judge', 0.037), ('direct', 0.034), ('cavazza', 0.034), ('walton', 0.034), ('duverle', 0.034), ('mellish', 0.034), ('settle', 0.034), ('transformation', 0.032), ('persuasion', 0.032), ('embodied', 0.032), ('keynes', 0.032), ('milton', 0.032), ('tregex', 0.032), ('deemter', 0.032), ('ishizuka', 0.032), ('sequences', 0.031), ('interactive', 0.03), ('rating', 0.03), ('question', 0.03), ('december', 0.03), ('graesser', 0.03), ('heilman', 0.03), ('education', 0.03), ('cmu', 0.029), ('rus', 0.028), ('authored', 0.028), ('entertainment', 0.028), ('copying', 0.028), ('judged', 0.027), ('settled', 0.027), ('tutoring', 0.026), ('carletta', 0.026), ('isi', 0.026), ('sample', 0.025), ('instances', 0.025), ('marked', 0.025), ('springer', 0.025), ('benchmark', 0.025), ('ii', 0.024), ('levy', 0.024), ('september', 0.024), ('relation', 0.023), ('mapping', 0.023), ('craig', 0.023), ('attribution', 0.023), ('carlson', 0.023), ('dale', 0.023), ('open', 0.023), ('pairwise', 0.023), ('declarative', 0.022), ('rhetorical', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000012 91 acl-2011-Data-oriented Monologue-to-Dialogue Generation
Author: Paul Piwek ; Svetlana Stoyanchev
Abstract: This short paper introduces an implemented and evaluated monolingual Text-to-Text generation system. The system takes monologue and transforms it to two-participant dialogue. After briefly motivating the task of monologue-to-dialogue generation, we describe the system and present an evaluation in terms of fluency and accuracy.
Author: Fabrizio Morbini ; Kenji Sagae
Abstract: Individual utterances often serve multiple communicative purposes in dialogue. We present a data-driven approach for identification of multiple dialogue acts in single utterances in the context of dialogue systems with limited training data. Our approach results in significantly increased understanding of user intent, compared to two strong baselines.
3 0.38452524 33 acl-2011-An Affect-Enriched Dialogue Act Classification Model for Task-Oriented Dialogue
Author: Kristy Boyer ; Joseph Grafsgaard ; Eun Young Ha ; Robert Phillips ; James Lester
Abstract: Dialogue act classification is a central challenge for dialogue systems. Although the importance of emotion in human dialogue is widely recognized, most dialogue act classification models make limited or no use of affective channels in dialogue act classification. This paper presents a novel affect-enriched dialogue act classifier for task-oriented dialogue that models facial expressions of users, in particular, facial expressions related to confusion. The findings indicate that the affectenriched classifiers perform significantly better for distinguishing user requests for feedback and grounding dialogue acts within textual dialogue. The results point to ways in which dialogue systems can effectively leverage affective channels to improve dialogue act classification. 1
4 0.30777049 272 acl-2011-Semantic Information and Derivation Rules for Robust Dialogue Act Detection in a Spoken Dialogue System
Author: Wei-Bin Liang ; Chung-Hsien Wu ; Chia-Ping Chen
Abstract: In this study, a novel approach to robust dialogue act detection for error-prone speech recognition in a spoken dialogue system is proposed. First, partial sentence trees are proposed to represent a speech recognition output sentence. Semantic information and the derivation rules of the partial sentence trees are extracted and used to model the relationship between the dialogue acts and the derivation rules. The constructed model is then used to generate a semantic score for dialogue act detection given an input speech utterance. The proposed approach is implemented and evaluated in a Mandarin spoken dialogue system for tour-guiding service. Combined with scores derived from the ASR recognition probability and the dialogue history, the proposed approach achieves 84.3% detection accuracy, an absolute improvement of 34.7% over the baseline of the semantic slot-based method with 49.6% detection accuracy.
5 0.26142836 227 acl-2011-Multimodal Menu-based Dialogue with Speech Cursor in DICO II+
Author: Staffan Larsson ; Alexander Berman ; Jessica Villing
Abstract: Alexander Berman Jessica Villing Talkamatic AB University of Gothenburg Sweden Sweden alex@ t alkamat i . se c jessi ca@ l ing .gu . s e 2 In-vehicle dialogue systems This paper describes Dico II+, an in-vehicle dialogue system demonstrating a novel combination of flexible multimodal menu-based dialogueand a “speech cursor” which enables menu navigation as well as browsing long list using haptic input and spoken output.
6 0.17895639 260 acl-2011-Recognizing Authority in Dialogue with an Integer Linear Programming Constrained Model
7 0.15002757 312 acl-2011-Turn-Taking Cues in a Human Tutoring Corpus
8 0.13057183 21 acl-2011-A Pilot Study of Opinion Summarization in Conversations
9 0.11380421 118 acl-2011-Entrainment in Speech Preceding Backchannels.
10 0.11165957 149 acl-2011-Hierarchical Reinforcement Learning and Hidden Markov Models for Task-Oriented Natural Language Generation
11 0.097903356 53 acl-2011-Automatically Evaluating Text Coherence Using Discourse Relations
12 0.096078739 226 acl-2011-Multi-Modal Annotation of Quest Games in Second Life
13 0.082131073 252 acl-2011-Prototyping virtual instructors from human-human corpora
14 0.079718515 257 acl-2011-Question Detection in Spoken Conversations Using Textual Conversations
15 0.056299273 267 acl-2011-Reversible Stochastic Attribute-Value Grammars
16 0.052370749 156 acl-2011-IMASS: An Intelligent Microblog Analysis and Summarization System
17 0.049628861 2 acl-2011-AM-FM: A Semantic Framework for Translation Quality Assessment
18 0.043690704 96 acl-2011-Disambiguating temporal-contrastive connectives for machine translation
19 0.043541104 101 acl-2011-Disentangling Chat with Local Coherence Models
20 0.042221565 47 acl-2011-Automatic Assessment of Coverage Quality in Intelligence Reports
topicId topicWeight
[(0, 0.113), (1, 0.045), (2, -0.051), (3, 0.026), (4, -0.389), (5, 0.421), (6, -0.1), (7, -0.034), (8, -0.023), (9, -0.001), (10, 0.161), (11, 0.019), (12, 0.114), (13, -0.065), (14, 0.058), (15, -0.012), (16, 0.02), (17, 0.016), (18, -0.027), (19, 0.068), (20, -0.1), (21, 0.036), (22, 0.086), (23, -0.164), (24, -0.0), (25, -0.0), (26, -0.138), (27, 0.046), (28, 0.093), (29, 0.009), (30, 0.013), (31, -0.004), (32, -0.073), (33, 0.013), (34, -0.02), (35, -0.024), (36, -0.05), (37, -0.004), (38, 0.024), (39, -0.003), (40, 0.018), (41, 0.02), (42, 0.019), (43, -0.031), (44, 0.001), (45, 0.007), (46, 0.028), (47, -0.003), (48, 0.006), (49, -0.01)]
simIndex simValue paperId paperTitle
same-paper 1 0.97769386 91 acl-2011-Data-oriented Monologue-to-Dialogue Generation
Author: Paul Piwek ; Svetlana Stoyanchev
Abstract: This short paper introduces an implemented and evaluated monolingual Text-to-Text generation system. The system takes monologue and transforms it to two-participant dialogue. After briefly motivating the task of monologue-to-dialogue generation, we describe the system and present an evaluation in terms of fluency and accuracy.
Author: Fabrizio Morbini ; Kenji Sagae
Abstract: Individual utterances often serve multiple communicative purposes in dialogue. We present a data-driven approach for identification of multiple dialogue acts in single utterances in the context of dialogue systems with limited training data. Our approach results in significantly increased understanding of user intent, compared to two strong baselines.
3 0.94751102 33 acl-2011-An Affect-Enriched Dialogue Act Classification Model for Task-Oriented Dialogue
Author: Kristy Boyer ; Joseph Grafsgaard ; Eun Young Ha ; Robert Phillips ; James Lester
Abstract: Dialogue act classification is a central challenge for dialogue systems. Although the importance of emotion in human dialogue is widely recognized, most dialogue act classification models make limited or no use of affective channels in dialogue act classification. This paper presents a novel affect-enriched dialogue act classifier for task-oriented dialogue that models facial expressions of users, in particular, facial expressions related to confusion. The findings indicate that the affectenriched classifiers perform significantly better for distinguishing user requests for feedback and grounding dialogue acts within textual dialogue. The results point to ways in which dialogue systems can effectively leverage affective channels to improve dialogue act classification. 1
4 0.93434983 227 acl-2011-Multimodal Menu-based Dialogue with Speech Cursor in DICO II+
Author: Staffan Larsson ; Alexander Berman ; Jessica Villing
Abstract: Alexander Berman Jessica Villing Talkamatic AB University of Gothenburg Sweden Sweden alex@ t alkamat i . se c jessi ca@ l ing .gu . s e 2 In-vehicle dialogue systems This paper describes Dico II+, an in-vehicle dialogue system demonstrating a novel combination of flexible multimodal menu-based dialogueand a “speech cursor” which enables menu navigation as well as browsing long list using haptic input and spoken output.
5 0.80169374 272 acl-2011-Semantic Information and Derivation Rules for Robust Dialogue Act Detection in a Spoken Dialogue System
Author: Wei-Bin Liang ; Chung-Hsien Wu ; Chia-Ping Chen
Abstract: In this study, a novel approach to robust dialogue act detection for error-prone speech recognition in a spoken dialogue system is proposed. First, partial sentence trees are proposed to represent a speech recognition output sentence. Semantic information and the derivation rules of the partial sentence trees are extracted and used to model the relationship between the dialogue acts and the derivation rules. The constructed model is then used to generate a semantic score for dialogue act detection given an input speech utterance. The proposed approach is implemented and evaluated in a Mandarin spoken dialogue system for tour-guiding service. Combined with scores derived from the ASR recognition probability and the dialogue history, the proposed approach achieves 84.3% detection accuracy, an absolute improvement of 34.7% over the baseline of the semantic slot-based method with 49.6% detection accuracy.
6 0.62976521 260 acl-2011-Recognizing Authority in Dialogue with an Integer Linear Programming Constrained Model
7 0.56646079 312 acl-2011-Turn-Taking Cues in a Human Tutoring Corpus
8 0.53184479 118 acl-2011-Entrainment in Speech Preceding Backchannels.
9 0.48431155 226 acl-2011-Multi-Modal Annotation of Quest Games in Second Life
10 0.4652462 252 acl-2011-Prototyping virtual instructors from human-human corpora
11 0.36565125 149 acl-2011-Hierarchical Reinforcement Learning and Hidden Markov Models for Task-Oriented Natural Language Generation
12 0.2739408 21 acl-2011-A Pilot Study of Opinion Summarization in Conversations
13 0.24331194 257 acl-2011-Question Detection in Spoken Conversations Using Textual Conversations
14 0.22744742 215 acl-2011-MACAON An NLP Tool Suite for Processing Word Lattices
15 0.22648472 156 acl-2011-IMASS: An Intelligent Microblog Analysis and Summarization System
16 0.20308679 53 acl-2011-Automatically Evaluating Text Coherence Using Discourse Relations
17 0.15693578 35 acl-2011-An ERP-based Brain-Computer Interface for text entry using Rapid Serial Visual Presentation and Language Modeling
18 0.15172686 96 acl-2011-Disambiguating temporal-contrastive connectives for machine translation
19 0.14469917 291 acl-2011-SystemT: A Declarative Information Extraction System
20 0.14415911 101 acl-2011-Disentangling Chat with Local Coherence Models
topicId topicWeight
[(5, 0.036), (17, 0.032), (37, 0.042), (39, 0.036), (41, 0.058), (55, 0.015), (59, 0.044), (72, 0.46), (91, 0.035), (96, 0.116)]
simIndex simValue paperId paperTitle
1 0.88730633 302 acl-2011-They Can Help: Using Crowdsourcing to Improve the Evaluation of Grammatical Error Detection Systems
Author: Nitin Madnani ; Martin Chodorow ; Joel Tetreault ; Alla Rozovskaya
Abstract: Despite the rising interest in developing grammatical error detection systems for non-native speakers of English, progress in the field has been hampered by a lack of informative metrics and an inability to directly compare the performance of systems developed by different researchers. In this paper we address these problems by presenting two evaluation methodologies, both based on a novel use of crowdsourcing.
same-paper 2 0.85519373 91 acl-2011-Data-oriented Monologue-to-Dialogue Generation
Author: Paul Piwek ; Svetlana Stoyanchev
Abstract: This short paper introduces an implemented and evaluated monolingual Text-to-Text generation system. The system takes monologue and transforms it to two-participant dialogue. After briefly motivating the task of monologue-to-dialogue generation, we describe the system and present an evaluation in terms of fluency and accuracy.
3 0.84146237 130 acl-2011-Extracting Comparative Entities and Predicates from Texts Using Comparative Type Classification
Author: Seon Yang ; Youngjoong Ko
Abstract: The automatic extraction of comparative information is an important text mining problem and an area of increasing interest. In this paper, we study how to build a Korean comparison mining system. Our work is composed of two consecutive tasks: 1) classifying comparative sentences into different types and 2) mining comparative entities and predicates. We perform various experiments to find relevant features and learning techniques. As a result, we achieve outstanding performance enough for practical use. 1
4 0.82723105 142 acl-2011-Generalized Interpolation in Decision Tree LM
Author: Denis Filimonov ; Mary Harper
Abstract: In the face of sparsity, statistical models are often interpolated with lower order (backoff) models, particularly in Language Modeling. In this paper, we argue that there is a relation between the higher order and the backoff model that must be satisfied in order for the interpolation to be effective. We show that in n-gram models, the relation is trivially held, but in models that allow arbitrary clustering of context (such as decision tree models), this relation is generally not satisfied. Based on this insight, we also propose a generalization of linear interpolation which significantly improves the performance of a decision tree language model.
5 0.78094578 152 acl-2011-How Much Can We Gain from Supervised Word Alignment?
Author: Jinxi Xu ; Jinying Chen
Abstract: Word alignment is a central problem in statistical machine translation (SMT). In recent years, supervised alignment algorithms, which improve alignment accuracy by mimicking human alignment, have attracted a great deal of attention. The objective of this work is to explore the performance limit of supervised alignment under the current SMT paradigm. Our experiments used a manually aligned ChineseEnglish corpus with 280K words recently released by the Linguistic Data Consortium (LDC). We treated the human alignment as the oracle of supervised alignment. The result is surprising: the gain of human alignment over a state of the art unsupervised method (GIZA++) is less than 1point in BLEU. Furthermore, we showed the benefit of improved alignment becomes smaller with more training data, implying the above limit also holds for large training conditions. 1
6 0.76412243 252 acl-2011-Prototyping virtual instructors from human-human corpora
7 0.76263416 261 acl-2011-Recognizing Named Entities in Tweets
8 0.68252146 88 acl-2011-Creating a manually error-tagged and shallow-parsed learner corpus
9 0.59526157 32 acl-2011-Algorithm Selection and Model Adaptation for ESL Correction Tasks
10 0.55681705 108 acl-2011-EdIt: A Broad-Coverage Grammar Checker Using Pattern Grammar
11 0.52504182 160 acl-2011-Identifying Sarcasm in Twitter: A Closer Look
12 0.52274996 64 acl-2011-C-Feel-It: A Sentiment Analyzer for Micro-blogs
13 0.51465249 147 acl-2011-Grammatical Error Correction with Alternating Structure Optimization
14 0.51214916 48 acl-2011-Automatic Detection and Correction of Errors in Dependency Treebanks
15 0.51085567 86 acl-2011-Coreference for Learning to Extract Relations: Yes Virginia, Coreference Matters
16 0.49992117 141 acl-2011-Gappy Phrasal Alignment By Agreement
17 0.49980527 8 acl-2011-A Corpus of Scope-disambiguated English Text
18 0.49811924 40 acl-2011-An Error Analysis of Relation Extraction in Social Media Documents
19 0.49264669 46 acl-2011-Automated Whole Sentence Grammar Correction Using a Noisy Channel Model
20 0.49237227 119 acl-2011-Evaluating the Impact of Coder Errors on Active Learning