acl acl2011 acl2011-252 knowledge-graph by maker-knowledge-mining

252 acl-2011-Prototyping virtual instructors from human-human corpora


Source: pdf

Author: Luciana Benotti ; Alexandre Denis

Abstract: Virtual instructors can be used in several applications, ranging from trainers in simulated worlds to non player characters for virtual games. In this paper we present a novel algorithm for rapidly prototyping virtual instructors from human-human corpora without manual annotation. Automatically prototyping full-fledged dialogue systems from corpora is far from being a reality nowadays. Our algorithm is restricted in that only the virtual instructor can perform speech acts while the user responses are limited to physical actions in the virtual world. We evaluate a virtual instructor, generated using this algorithm, with human users. We compare our results both with human instructors and rule-based virtual instructors hand-coded for the same task.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Prototyping virtual instructors from human-human corpora Luciana Benotti PLN Group, FAMAF National University of Córdoba Córdoba, Argentina luciana . [sent-1, score-0.867]

2 Abstract Virtual instructors can be used in several applications, ranging from trainers in simulated worlds to non player characters for virtual games. [sent-3, score-0.935]

3 In this paper we present a novel algorithm for rapidly prototyping virtual instructors from human-human corpora without manual annotation. [sent-4, score-0.927]

4 Automatically prototyping full-fledged dialogue systems from corpora is far from being a reality nowadays. [sent-5, score-0.156]

5 Our algorithm is restricted in that only the virtual instructor can perform speech acts while the user responses are limited to physical actions in the virtual world. [sent-6, score-1.702]

6 We evaluate a virtual instructor, generated using this algorithm, with human users. [sent-7, score-0.623]

7 We compare our results both with human instructors and rule-based virtual instructors hand-coded for the same task. [sent-8, score-1.041]

8 1 Introduction Virtual human characters constitute a promising contribution to many fields, including simulation, training and interactive games (Kenny et al. [sent-9, score-0.08]

9 The ability to communicate using natural language is important for believable and effective virtual humans. [sent-12, score-0.623]

10 Nowadays, most conversational systems operate on a dialogue-act level and require extensive annotation efforts in order to be fit for their task (Rieser and Lemon, 2010). [sent-14, score-0.089]

11 Semantic annotation and rule authoring have long been known as bottlenecks for developing conversational systems for new domains. [sent-15, score-0.089]

12 In this paper, we present a novel algorithm for generating virtual instructors from automatically annotated corpora. [sent-16, score-0.832]

13 Our algorithm, when given a task-based corpus situated in a virtual world, generates an instructor that robustly helps a user achieve a given task in the virtual world of the corpus. [sent-20, score-1.708]

14 The selection approach to generation has only been used in conversational systems that are not task-oriented such as negotiating agents (Gandhe and Traum, 2007), question answering characters (Kenny et al. [sent-26, score-0.147]

15 Our algorithm can be seen as a novel way of doing robust generation by selection and interaction management for task-oriented systems. [sent-29, score-0.077]

16 Section 3 presents the two phases of our algorithm, namely automatic annotation and dialogue management through selection. [sent-31, score-0.073]

17 In Section 4 we present a fragment of an interaction with a virtual instructor generated using the corpus and the algorithm introduced in the previous sections. [sent-32, score-0.931]

18 We evaluate the virtual instructor in interactions with human subjects using objective as well as subjective metrics. [sent-33, score-1.027]

19 We compare our results with both human and rule-based virtual instructors hand-coded for the same task. [sent-35, score-0.832]

20 Finally, Section 6 concludes the paper proposing an improved virtual instructor designed as a result of our error analysis. [sent-36, score-0.901]

21 (2010)) is a shared task in which Natural Language Generation systems must generate real-time instructions that guide a user in a virtual world. [sent-40, score-0.984]

22 , 2010), a corpus of human instruction giving in virtual environments. [sent-42, score-0.861]

23 We use the English part of the corpus which consists of 63 American English written discourses in which one subject guided another in a treasure hunting task in 3 different 3D worlds. [sent-43, score-0.073]

24 The “direction follower” (DF) moved about in the virtual world with the goal of completing a treasure hunting task, but had no knowledge of the map of the world or the specific behavior of objects within that world (such as, which buttons to press to open doors). [sent-45, score-0.924]

25 The other partner acted as the “direction giver” (DG), who was given complete knowledge of the world and had to give instructions to the DF to guide him/her to accomplish the task. [sent-46, score-0.367]

26 The GIVE-2 corpus is a multimodal corpus which consists of all the instructions uttered by the DG, and all the object manipulations done by the DF with the corresponding timestamp. [sent-47, score-0.308]

27 3 The unsupervised conversational model Our algorithm consists of two phases: an annotation phase and a selection phase. [sent-49, score-0.168]

28 The annotation phase is performed only once and consists of automatically associating the DG instruction to the DF reaction. [sent-50, score-0.316]

29 The selection phase is performed every time the virtual instructor generates an instruction and consists of picking out from the annotated corpus the most appropriate instruction at a given point. [sent-51, score-1.456]

30 1 The automatic annotation The basic idea of the annotation is straightforward: associate each utterance with its corresponding reaction. [sent-53, score-0.186]

31 We assume that a reaction captures the semantics of its associated instruction. [sent-54, score-0.14]

32 Defining reaction involves two subtle issues, namely boundary 63 determination and discretization. [sent-55, score-0.14]

33 We define the boundaries of a reaction as follows. [sent-57, score-0.14]

34 A reaction rk to an instruction uk begins right after the instruction uk is uttered and ends right before the next instruction uk+1 is uttered. [sent-58, score-1.025]

35 In the following example, instruction 1 corresponds to the reaction ⟨2, 3, 4⟩, instruction 5 corresponds to ⟨6⟩, and instruction 7 to ⟨8⟩. [sent-59, score-0.476]

36 We discuss in Section 5 the impact that inappropriate associations have on the performance of a virtual instructor. [sent-66, score-0.623]

37 It is well known that there is not a unique way to discretize an action into sub-actions. [sent-68, score-0.073]

38 However, the same discretization mechanism used for annotation has to be used during selection, for the dialogue manager to work properly. [sent-71, score-0.177]

39 , in order to decide what to say next) any virtual instructor needs to have a planner and a planning domain representation, i. [sent-74, score-1.02]

40 , a specification of how the virtual world works and a way to represent the state of the virtual world. [sent-76, score-1.34]

41 Let Sk be the state of the virtual world when uttering instruction uk, Sk+1 be the state of the world when uttering the next utterance uk+1 and D be the planning domain representation. [sent-79, score-1.291]

42 The reaction to uk is defined as the sequence of actions returned by the planner with Sk as initial state, Sk+1 as goal state and D as planning domain. [sent-80, score-0.429]

43 The annotation of the corpus then consists of automatically associating each utterance to its (discretized) reaction. [sent-81, score-0.14]
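
A minimal sketch of this annotation phase, assuming the corpus is represented as timestamp-ordered (utterance, world-state) pairs and that a classical planner exposes a plan(initial, goal, domain) call; both interfaces are illustrative assumptions rather than the authors' actual implementation.

```python
def annotate_corpus(corpus, planner, domain):
    """Associate each DG utterance u_k with its (discretized) DF reaction.

    `corpus` is assumed to be a list of (utterance, world_state) pairs ordered
    by time, where world_state is the state S_k of the virtual world at the
    moment u_k was uttered. `planner` is assumed to expose
    plan(initial, goal, domain) -> list of actions.
    """
    annotated = []
    for k in range(len(corpus) - 1):
        utterance_k, state_k = corpus[k]
        _, state_k_plus_1 = corpus[k + 1]
        # The reaction to u_k is the action sequence returned by the planner
        # with S_k as initial state and S_{k+1} (the state of the world when
        # the next utterance u_{k+1} was produced) as goal state.
        reaction_k = planner.plan(state_k, state_k_plus_1, domain)
        annotated.append((utterance_k, reaction_k))
    return annotated
```

Note that the granularity of the reactions is fixed by the planning domain used here, which is why the same domain must be reused during selection.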

44 2 Selecting what to say next In this section we describe how the selection phase is performed every time the virtual instructor generates an instruction. [sent-83, score-0.98]

45 The instruction selection algorithm consists in finding in the corpus the set of candidate utterances C for the current task plan P; P being the sequence of actions returned by the same planner and planning domain used for discretization. [sent-84, score-0.662]

46 first actions of the current plan P exactly match the reaction associated to the utterance. [sent-89, score-0.277]

47 All the utterances that pass this test are considered paraphrases and hence suitable in the current context. [sent-90, score-0.121]

48 While P does not change, the virtual instructor iterates through the set C, verbalizing a different utterance at fixed time intervals (e. [sent-91, score-0.995]

49 In other words, the virtual instructor offers alternative paraphrases of the intended instruction. [sent-94, score-0.934]

50 When P changes as a result of the actions of the DF, C is recalculated. [sent-95, score-0.088]
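
A sketch of the selection phase under the same assumptions; the game-state and speech-output interfaces (`get_current_plan`, `say`) are stand-ins, the interval value is illustrative, and the fallback utterance "go" follows the default mentioned later in the error analysis.

```python
import itertools
import time

def candidate_utterances(annotated_corpus, plan):
    """Return the set C of corpus utterances whose annotated reaction matches
    the first actions of the current task plan P (prefix match). Utterances
    with an empty reaction are skipped in this sketch, since they would
    trivially match any plan."""
    return [utt for utt, reaction in annotated_corpus
            if reaction and plan[:len(reaction)] == reaction]

def run_instructor(annotated_corpus, get_current_plan, say, interval_seconds=3):
    """Verbalize a different paraphrase from C at fixed intervals while the
    plan P is unchanged; recompute C whenever the DF's actions change P."""
    plan = get_current_plan()
    paraphrases = itertools.cycle(candidate_utterances(annotated_corpus, plan) or ["go"])
    while plan:  # an empty plan means the task is complete
        say(next(paraphrases))
        time.sleep(interval_seconds)
        new_plan = get_current_plan()
        if new_plan != plan:  # the DF acted, so the candidate set is recalculated
            plan = new_plan
            paraphrases = itertools.cycle(candidate_utterances(annotated_corpus, plan) or ["go"])
```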

51 It is important to notice that the discretization used for annotation and selection directly impacts the behavior of the virtual instructor. [sent-96, score-0.786]

52 If the granularity is too coarse, many instructions in the corpus will have an empty associated reaction. [sent-98, score-0.314]

53 For instance, in the absence of the representation of the user orientation in the planning domain (as is the case for the virtual instructor we evaluate in Section 5), instructions like “turn left” and “turn right” will have empty reactions making them indistinguishable during selection. [sent-99, score-1.382]

54 However, if the granularity is too fine the user may get into situations that do not occur in the corpus, causing the selection algorithm to return an empty set of candidate utterances. [sent-100, score-0.208]

55 It is the responsibility of the virtual instructor developer to find a granularity sufficient to capture the diversity of the instructions he wants to distinguish during selection. [sent-101, score-1.182]

56 4 A virtual instructor for a virtual world We implemented an English virtual instructor for one of the worlds used in the corpus collection we presented in Section 2. [sent-102, score-2.525]

57 2 instructions from the human DG, and took about 543 seconds on average for the human DF to complete the task. [sent-105, score-0.243]

58 In Figures 1 to 4 we show an excerpt of an interaction between the system and a real user that we collected during the evaluation. [sent-106, score-0.12]

59 The first candidate utterance selected is “red closest to the chair in front of you”. [sent-110, score-0.129]

60 Notice that the referring expression uniquely identifies the target object using the spatial proximity of the target to the chair. [sent-111, score-0.082]

61 This referring expression is generated without any reasoning on the target distractors, just by considering the current state of the task plan and the user position. [sent-112, score-0.25]

62 After receiving the instruction the user gets closer to the button as shown in Figure 2. [sent-113, score-0.45]

63 As a result of the new user position, a new task plan exists, the set of candidate utterances is recalculated and the system selects a new utterance, namely “the closet one”. [sent-114, score-0.335]

64 The generation of the ellipsis of the button or the chair is a direct consequence of the utterances normally said in the corpus at this stage of the task plan (that is, when the user is about to manipulate this object). [sent-115, score-0.416]

65 From the point of view of referring expression algorithms, the referring expression may not be optimal because it is over-specified (a pronoun would be preferred, as in “click it”). Furthermore, the instruction contains a spelling error (‘closet’ instead of ‘closest’). [sent-116, score-0.402]

66 In spite of this non-optimality, the instruction led our user to execute the intended reaction, namely pushing the button. [sent-117, score-0.361]

67 Right after the user clicks on the button (Figure 3), the system selects an utterance corresponding to the new task plan. [sent-118, score-0.309]

68 The player position stayed the same so the only change in the plan is that the button no longer needs to be pushed. [sent-119, score-0.181]

69 In this task state, DGs usually give acknowledgements, and this is then what our selection algorithm selects: “good”. [sent-120, score-0.078]

70 After receiving the acknowledgement, the user turns around and walks forward, and the next action in the plan is to leave the room (Figure 4). [sent-121, score-0.24]

71 The system selects the utterance “exit the way you entered” which refers to the previous interaction. [sent-122, score-0.126]

72 Again, the system keeps no representation of the past actions of the user, but such utterances are the ones that are found at this stage of the task plan. [sent-123, score-0.209]

73 5 Evaluation and error analysis In this section we present the results of the evaluation we carried out on the virtual instructor presented in Section 4 which was generated using the dialogue model algorithm introduced in Section 3. [sent-124, score-0.962]

74 1 and subjective measures which we discuss in Section 5. [sent-129, score-0.095]

75 1 Objective metrics The objective metrics we extracted from the logs of interaction are summarized in Table 1. [sent-132, score-0.155]

76 The table compares our results with both human instructors and the three rule-based virtual instructors that were top rated in the GIVE-2 Challenge. [sent-133, score-1.075]

77 To ensure comparability, time until task completion, number of instructions received by users, and mouse actions are only counted on successfully completed games. [sent-140, score-0.385]

78 In particular, our system helped users better identify the objects that they needed to manipulate in the virtual world, as shown by the low number of mouse actions required to complete the task (a high number indicates that the user must have manipulated wrong objects). [sent-145, score-0.956]

79 This correlates with the subjective evaluation of referring expression quality (see next section). [sent-146, score-0.177]

80 We performed a detailed analysis of the instructions uttered by our system that were unsuccessful, that is, all the instructions that did not cause the intended reaction as annotated in the corpus. [sent-147, score-0.724]

81 From the 2081 instructions uttered in the 13 interactions, 1304 (63%) of them were successful and 777 (37%) were unsuccessful. [sent-148, score-0.308]

82 1 (wrong annotation of correction utterances and no representation of user orientation) we classified the unsuccessful utterances using lexical cues into 1) correction (‘no’,‘don’t’,‘keep’ , etc. [sent-150, score-0.492]

83 We found that 25% of the unsuccessful utterances are of type 1, 40% are type 2, 34% are type 3 (1% corresponds to the default utterance “go” that our system utters when the set of candidate utterances is empty). [sent-153, score-0.45]
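
A rough illustration of the cue-based triage described above; only the type-1 cues (‘no’, ‘don’t’, ‘keep’) are quoted in the text, so the type-2 cue list and the type labels used here are assumptions.

```python
CORRECTION_CUES = {"no", "don't", "dont", "keep"}        # cues quoted in the text (type 1)
ORIENTATION_CUES = {"turn", "left", "right", "around"}   # assumed cues for orientation-related errors (type 2)

def classify_unsuccessful(utterance):
    """Assign an unsuccessful instruction to one of three error types using
    simple lexical cues (a simplification of the analysis described above)."""
    tokens = set(utterance.lower().split())
    if tokens & CORRECTION_CUES:
        return "correction"   # type 1
    if tokens & ORIENTATION_CUES:
        return "orientation"  # type 2 (assumed label)
    return "other"            # type 3
```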

84 Frequently, these errors led to contradictions confusing the player and significantly affecting the completion time of the task as shown in Table 1. [sent-154, score-0.08]

85 In Section 6 we propose an improved virtual instructor designed as a result of this error analysis. [sent-155, score-0.901]

86 2 Subjective metrics The subjective measures were obtained from responses to the GIVE-2 questionnaire that was presented to users after each game. [sent-157, score-0.213]

87 1, we cannot compare against human instructors because these subjective metrics were not collected in (Gargett et al. [sent-162, score-0.351]

88 For almost all of these metrics we got similar or slightly lower results than those obtained by the three hand-coded systems, except for three metrics which we show in Table 2. [sent-166, score-0.094]

89 We suspect that the low results obtained for Q5 and Q22 relate to the unsuccessful utterances identified and discussed in Section 5. [sent-167, score-0.235]

90 The unexpectedly high result in Q6 is probably correlated with the low number of mouse actions mentioned in Section 5. [sent-169, score-0.142]

91 As Table 3 shows, in spite of the unsuccessful utterances, our system is rated as more natural and more engaging (in general) than the best systems that competed in the GIVE-2 Challenge. [sent-172, score-0.148]

92 Using our algorithm and the GIVE corpus we have generated a virtual instructor for a game-like virtual environment. [sent-174, score-1.246]

93 We obtained encouraging results in the evaluation of the virtual instructor with human users. [sent-175, score-0.663]

94 Our system outperforms rule-based virtual instructors hand-coded for the same task both in terms of objective and subjective metrics. [sent-176, score-0.958]

95 Our algorithm requires humanhuman corpora collected on the target task and environment, but it is independent of the particular instruction giving task. [sent-178, score-0.273]

96 our discretization mechanism in order to take orientation into account. [sent-185, score-0.123]

97 Finally, if we could identify corrections automatically, as suggested in (Raux and Nakano, 2010), we could get another increase in performance, because we would be able to treat them as corrections and not as instructions as we do now. [sent-187, score-0.327]

98 In sum, this paper presents a novel way of automatically prototyping, from corpora, task-oriented virtual agents that are able to effectively and naturally help a user complete a task in a virtual world. [sent-188, score-1.459]

99 The GIVE-2 corpus of giving instructions in virtual environments. [sent-195, score-0.866]

100 Report on the second challenge on generating instructions in virtual environments (GIVE-2). [sent-219, score-0.893]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('virtual', 0.623), ('instructor', 0.278), ('instructions', 0.243), ('instruction', 0.238), ('instructors', 0.209), ('df', 0.156), ('reaction', 0.14), ('dg', 0.122), ('utterances', 0.121), ('unsuccessful', 0.114), ('subjective', 0.095), ('prototyping', 0.095), ('utterance', 0.094), ('button', 0.093), ('user', 0.09), ('actions', 0.088), ('saar', 0.086), ('koller', 0.076), ('gargett', 0.076), ('discretization', 0.07), ('leuski', 0.065), ('uttered', 0.065), ('world', 0.065), ('planning', 0.062), ('dialogue', 0.061), ('sk', 0.057), ('planner', 0.057), ('kenny', 0.057), ('mouse', 0.054), ('orientation', 0.053), ('uk', 0.053), ('anton', 0.052), ('games', 0.051), ('plan', 0.049), ('referring', 0.049), ('selection', 0.047), ('metrics', 0.047), ('annotation', 0.046), ('closet', 0.043), ('ordoba', 0.043), ('uttering', 0.043), ('conversational', 0.043), ('corrections', 0.042), ('completion', 0.041), ('users', 0.04), ('na', 0.039), ('player', 0.039), ('granularity', 0.038), ('hunting', 0.038), ('rieser', 0.038), ('striegnitz', 0.038), ('gandhe', 0.038), ('discretize', 0.038), ('famaf', 0.038), ('raux', 0.038), ('room', 0.037), ('nm', 0.037), ('action', 0.035), ('humanhuman', 0.035), ('cassell', 0.035), ('justine', 0.035), ('chair', 0.035), ('treasure', 0.035), ('byron', 0.035), ('donna', 0.035), ('johanna', 0.035), ('luciana', 0.035), ('worlds', 0.035), ('rated', 0.034), ('green', 0.034), ('objects', 0.033), ('naturalness', 0.033), ('traum', 0.033), ('empty', 0.033), ('intended', 0.033), ('expression', 0.033), ('selects', 0.032), ('phase', 0.032), ('questionnaire', 0.031), ('door', 0.031), ('sigdial', 0.031), ('entered', 0.031), ('iva', 0.031), ('engagement', 0.031), ('objective', 0.031), ('give', 0.031), ('interaction', 0.03), ('pushes', 0.03), ('red', 0.029), ('kristina', 0.029), ('state', 0.029), ('characters', 0.029), ('receiving', 0.029), ('situated', 0.029), ('agents', 0.028), ('manipulate', 0.028), ('skills', 0.028), ('guide', 0.028), ('environments', 0.027), ('phases', 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 252 acl-2011-Prototyping virtual instructors from human-human corpora

Author: Luciana Benotti ; Alexandre Denis

Abstract: Virtual instructors can be used in several applications, ranging from trainers in simulated worlds to non player characters for virtual games. In this paper we present a novel algorithm for rapidly prototyping virtual instructors from human-human corpora without manual annotation. Automatically prototyping full-fledged dialogue systems from corpora is far from being a reality nowadays. Our algorithm is restricted in that only the virtual instructor can perform speech acts while the user responses are limited to physical actions in the virtual world. We evaluate a virtual instructor, generated using this algorithm, with human users. We compare our results both with human instructors and rule-based virtual instructors hand-coded for the same task.

2 0.17357001 149 acl-2011-Hierarchical Reinforcement Learning and Hidden Markov Models for Task-Oriented Natural Language Generation

Author: Nina Dethlefs ; Heriberto Cuayahuitl

Abstract: Surface realisation decisions in language generation can be sensitive to a language model, but also to decisions of content selection. We therefore propose the joint optimisation of content selection and surface realisation using Hierarchical Reinforcement Learning (HRL). To this end, we suggest a novel reward function that is induced from human data and is especially suited for surface realisation. It is based on a generation space in the form of a Hidden Markov Model (HMM). Results in terms of task success and human-likeness suggest that our unified approach performs better than greedy or random baselines.

3 0.13734722 185 acl-2011-Joint Identification and Segmentation of Domain-Specific Dialogue Acts for Conversational Dialogue Systems

Author: Fabrizio Morbini ; Kenji Sagae

Abstract: Individual utterances often serve multiple communicative purposes in dialogue. We present a data-driven approach for identification of multiple dialogue acts in single utterances in the context of dialogue systems with limited training data. Our approach results in significantly increased understanding of user intent, compared to two strong baselines.

4 0.11607717 180 acl-2011-Issues Concerning Decoding with Synchronous Context-free Grammar

Author: Tagyoung Chung ; Licheng Fang ; Daniel Gildea

Abstract: We discuss some of the practical issues that arise from decoding with general synchronous context-free grammars. We examine problems caused by unary rules and we also examine how virtual nonterminals resulting from binarization can best be handled. We also investigate adding more flexibility to synchronous context-free grammars by adding glue rules and phrases.

5 0.099541582 296 acl-2011-Terminal-Aware Synchronous Binarization

Author: Licheng Fang ; Tagyoung Chung ; Daniel Gildea

Abstract: We present an SCFG binarization algorithm that combines the strengths of early terminal matching on the source language side and early language model integration on the target language side. We also examine how different strategies of target-side terminal attachment during binarization can significantly affect translation quality.

6 0.091409683 226 acl-2011-Multi-Modal Annotation of Quest Games in Second Life

7 0.082936034 205 acl-2011-Learning to Grade Short Answer Questions using Semantic Similarity Measures and Dependency Graph Alignments

8 0.082131073 91 acl-2011-Data-oriented Monologue-to-Dialogue Generation

9 0.078595117 223 acl-2011-Modeling Wisdom of Crowds Using Latent Mixture of Discriminative Experts

10 0.07680317 33 acl-2011-An Affect-Enriched Dialogue Act Classification Model for Task-Oriented Dialogue

11 0.070007242 207 acl-2011-Learning to Win by Reading Manuals in a Monte-Carlo Framework

12 0.068729594 227 acl-2011-Multimodal Menu-based Dialogue with Speech Cursor in DICO II+

13 0.062751152 260 acl-2011-Recognizing Authority in Dialogue with an Integer Linear Programming Constrained Model

14 0.062443957 288 acl-2011-Subjective Natural Language Problems: Motivations, Applications, Characterizations, and Implications

15 0.054830119 177 acl-2011-Interactive Group Suggesting for Twitter

16 0.053428695 257 acl-2011-Question Detection in Spoken Conversations Using Textual Conversations

17 0.049551345 272 acl-2011-Semantic Information and Derivation Rules for Robust Dialogue Act Detection in a Spoken Dialogue System

18 0.046757676 21 acl-2011-A Pilot Study of Opinion Summarization in Conversations

19 0.043546952 19 acl-2011-A Mobile Touchable Application for Online Topic Graph Extraction and Exploration of Web Content

20 0.042219486 95 acl-2011-Detection of Agreement and Disagreement in Broadcast Conversations


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.106), (1, 0.028), (2, -0.012), (3, 0.006), (4, -0.154), (5, 0.135), (6, -0.032), (7, -0.043), (8, -0.015), (9, -0.039), (10, -0.001), (11, -0.009), (12, 0.044), (13, 0.014), (14, 0.0), (15, 0.004), (16, 0.019), (17, -0.009), (18, -0.012), (19, 0.014), (20, 0.036), (21, 0.047), (22, -0.033), (23, -0.011), (24, -0.041), (25, -0.057), (26, -0.029), (27, 0.015), (28, 0.007), (29, -0.082), (30, 0.039), (31, 0.082), (32, 0.001), (33, 0.065), (34, 0.071), (35, 0.13), (36, 0.028), (37, 0.095), (38, 0.061), (39, 0.051), (40, 0.055), (41, -0.053), (42, -0.044), (43, 0.039), (44, -0.001), (45, -0.031), (46, -0.039), (47, 0.067), (48, -0.082), (49, 0.006)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94320923 252 acl-2011-Prototyping virtual instructors from human-human corpora

Author: Luciana Benotti ; Alexandre Denis

Abstract: Virtual instructors can be used in several applications, ranging from trainers in simulated worlds to non player characters for virtual games. In this paper we present a novel algorithm for rapidly prototyping virtual instructors from human-human corpora without manual annotation. Automatically prototyping full-fledged dialogue systems from corpora is far from being a reality nowadays. Our algorithm is restricted in that only the virtual instructor can perform speech acts while the user responses are limited to physical actions in the virtual world. We evaluate a virtual instructor, generated using this algorithm, with human users. We compare our results both with human instructors and rule-based virtual instructors hand-coded for the same task.

2 0.70136267 149 acl-2011-Hierarchical Reinforcement Learning and Hidden Markov Models for Task-Oriented Natural Language Generation

Author: Nina Dethlefs ; Heriberto Cuayahuitl

Abstract: Surface realisation decisions in language generation can be sensitive to a language model, but also to decisions of content selection. We therefore propose the joint optimisation of content selection and surface realisation using Hierarchical Reinforcement Learning (HRL). To this end, we suggest a novel reward function that is induced from human data and is especially suited for surface realisation. It is based on a generation space in the form of a Hidden Markov Model (HMM). Results in terms of task success and human-likeness suggest that our unified approach performs better than greedy or random baselines.

3 0.57432783 227 acl-2011-Multimodal Menu-based Dialogue with Speech Cursor in DICO II+

Author: Staffan Larsson ; Alexander Berman ; Jessica Villing

Abstract: This paper describes Dico II+, an in-vehicle dialogue system demonstrating a novel combination of flexible multimodal menu-based dialogue and a “speech cursor” which enables menu navigation as well as browsing long lists using haptic input and spoken output.

4 0.54686487 33 acl-2011-An Affect-Enriched Dialogue Act Classification Model for Task-Oriented Dialogue

Author: Kristy Boyer ; Joseph Grafsgaard ; Eun Young Ha ; Robert Phillips ; James Lester

Abstract: Dialogue act classification is a central challenge for dialogue systems. Although the importance of emotion in human dialogue is widely recognized, most dialogue act classification models make limited or no use of affective channels in dialogue act classification. This paper presents a novel affect-enriched dialogue act classifier for task-oriented dialogue that models facial expressions of users, in particular, facial expressions related to confusion. The findings indicate that the affectenriched classifiers perform significantly better for distinguishing user requests for feedback and grounding dialogue acts within textual dialogue. The results point to ways in which dialogue systems can effectively leverage affective channels to improve dialogue act classification. 1

5 0.52673382 91 acl-2011-Data-oriented Monologue-to-Dialogue Generation

Author: Paul Piwek ; Svetlana Stoyanchev

Abstract: This short paper introduces an implemented and evaluated monolingual Text-to-Text generation system. The system takes monologue and transforms it to two-participant dialogue. After briefly motivating the task of monologue-to-dialogue generation, we describe the system and present an evaluation in terms of fluency and accuracy.

6 0.514862 99 acl-2011-Discrete vs. Continuous Rating Scales for Language Evaluation in NLP

7 0.50432074 185 acl-2011-Joint Identification and Segmentation of Domain-Specific Dialogue Acts for Conversational Dialogue Systems

8 0.50186235 226 acl-2011-Multi-Modal Annotation of Quest Games in Second Life

9 0.4949244 102 acl-2011-Does Size Matter - How Much Data is Required to Train a REG Algorithm?

10 0.49286449 35 acl-2011-An ERP-based Brain-Computer Interface for text entry using Rapid Serial Visual Presentation and Language Modeling

11 0.49152771 260 acl-2011-Recognizing Authority in Dialogue with an Integer Linear Programming Constrained Model

12 0.45791027 288 acl-2011-Subjective Natural Language Problems: Motivations, Applications, Characterizations, and Implications

13 0.45676768 296 acl-2011-Terminal-Aware Synchronous Binarization

14 0.4502649 272 acl-2011-Semantic Information and Derivation Rules for Robust Dialogue Act Detection in a Spoken Dialogue System

15 0.44382036 317 acl-2011-Underspecifying and Predicting Voice for Surface Realisation Ranking

16 0.42274246 312 acl-2011-Turn-Taking Cues in a Human Tutoring Corpus

17 0.41060096 74 acl-2011-Combining Indicators of Allophony

18 0.4083668 118 acl-2011-Entrainment in Speech Preceding Backchannels.

19 0.40737331 207 acl-2011-Learning to Win by Reading Manuals in a Monte-Carlo Framework

20 0.40281445 120 acl-2011-Even the Abstract have Color: Consensus in Word-Colour Associations


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.04), (17, 0.065), (26, 0.038), (37, 0.044), (39, 0.043), (41, 0.074), (54, 0.043), (55, 0.015), (59, 0.04), (72, 0.308), (78, 0.023), (91, 0.025), (96, 0.114), (97, 0.016), (98, 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.90770257 302 acl-2011-They Can Help: Using Crowdsourcing to Improve the Evaluation of Grammatical Error Detection Systems

Author: Nitin Madnani ; Martin Chodorow ; Joel Tetreault ; Alla Rozovskaya

Abstract: Despite the rising interest in developing grammatical error detection systems for non-native speakers of English, progress in the field has been hampered by a lack of informative metrics and an inability to directly compare the performance of systems developed by different researchers. In this paper we address these problems by presenting two evaluation methodologies, both based on a novel use of crowdsourcing.

2 0.90283006 91 acl-2011-Data-oriented Monologue-to-Dialogue Generation

Author: Paul Piwek ; Svetlana Stoyanchev

Abstract: This short paper introduces an implemented and evaluated monolingual Text-to-Text generation system. The system takes monologue and transforms it to two-participant dialogue. After briefly motivating the task of monologue-to-dialogue generation, we describe the system and present an evaluation in terms of fluency and accuracy.

3 0.8877632 130 acl-2011-Extracting Comparative Entities and Predicates from Texts Using Comparative Type Classification

Author: Seon Yang ; Youngjoong Ko

Abstract: The automatic extraction of comparative information is an important text mining problem and an area of increasing interest. In this paper, we study how to build a Korean comparison mining system. Our work is composed of two consecutive tasks: 1) classifying comparative sentences into different types and 2) mining comparative entities and predicates. We perform various experiments to find relevant features and learning techniques. As a result, we achieve outstanding performance enough for practical use. 1

4 0.85837758 142 acl-2011-Generalized Interpolation in Decision Tree LM

Author: Denis Filimonov ; Mary Harper

Abstract: In the face of sparsity, statistical models are often interpolated with lower order (backoff) models, particularly in Language Modeling. In this paper, we argue that there is a relation between the higher order and the backoff model that must be satisfied in order for the interpolation to be effective. We show that in n-gram models, the relation is trivially held, but in models that allow arbitrary clustering of context (such as decision tree models), this relation is generally not satisfied. Based on this insight, we also propose a generalization of linear interpolation which significantly improves the performance of a decision tree language model.

same-paper 5 0.85456508 252 acl-2011-Prototyping virtual instructors from human-human corpora

Author: Luciana Benotti ; Alexandre Denis

Abstract: Virtual instructors can be used in several applications, ranging from trainers in simulated worlds to non player characters for virtual games. In this paper we present a novel algorithm for rapidly prototyping virtual instructors from human-human corpora without manual annotation. Automatically prototyping full-fledged dialogue systems from corpora is far from being a reality nowadays. Our algorithm is restricted in that only the virtual instructor can perform speech acts while the user responses are limited to physical actions in the virtual world. We evaluate a virtual instructor, generated using this algorithm, with human users. We compare our results both with human instructors and rule-based virtual instructors hand-coded for the same task.

6 0.8294189 261 acl-2011-Recognizing Named Entities in Tweets

7 0.82347769 152 acl-2011-How Much Can We Gain from Supervised Word Alignment?

8 0.77904493 88 acl-2011-Creating a manually error-tagged and shallow-parsed learner corpus

9 0.70476925 32 acl-2011-Algorithm Selection and Model Adaptation for ESL Correction Tasks

10 0.66677088 108 acl-2011-EdIt: A Broad-Coverage Grammar Checker Using Pattern Grammar

11 0.63442492 160 acl-2011-Identifying Sarcasm in Twitter: A Closer Look

12 0.62769985 64 acl-2011-C-Feel-It: A Sentiment Analyzer for Micro-blogs

13 0.62393516 48 acl-2011-Automatic Detection and Correction of Errors in Dependency Treebanks

14 0.62165123 40 acl-2011-An Error Analysis of Relation Extraction in Social Media Documents

15 0.62043154 141 acl-2011-Gappy Phrasal Alignment By Agreement

16 0.61819607 86 acl-2011-Coreference for Learning to Extract Relations: Yes Virginia, Coreference Matters

17 0.61706614 147 acl-2011-Grammatical Error Correction with Alternating Structure Optimization

18 0.61274999 8 acl-2011-A Corpus of Scope-disambiguated English Text

19 0.60844517 119 acl-2011-Evaluating the Impact of Coder Errors on Active Learning

20 0.60703737 246 acl-2011-Piggyback: Using Search Engines for Robust Cross-Domain Named Entity Recognition