acl acl2012 acl2012-24 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Srinivasan Janarthanam ; Oliver Lemon ; Xingkun Liu
Abstract: We demonstrate a web-based environment for development and testing of different pedestrian route instruction-giving systems. The environment contains a City Model, a TTS interface, a game-world, and a user GUI including a simulated street-view. We describe the environment and components, the metrics that can be used for the evaluation of pedestrian route instruction-giving systems, and the shared challenge which is being organised using this environment.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We demonstrate a web-based environment for development and testing of different pedestrian route instruction-giving systems. [sent-5, score-0.466]
2 The environment contains a City Model, a TTS interface, a game-world, and a user GUI including a simulated street-view. [sent-6, score-0.442]
3 We describe the environment and components, the metrics that can be used for the evaluation of pedestrian route instruction-giving systems, and the shared challenge which is being organised using this environment. [sent-7, score-0.532]
4 1 Introduction Generating navigation instructions in the real world for pedestrians is an interesting research problem for researchers in both computational linguistics and geo-informatics (Dale et al.). [sent-8, score-0.898]
5 These systems generate verbal route directions for users to go from A to B, and techniques range from giving ‘a priori’ route directions (i.e. [sent-10, score-0.503]
6 all route information in a single turn) and incremental ‘in-situ’ instructions, to full interactive dialogue systems (see section 4). [sent-12, score-0.314]
7 One of the major problems in developing such systems is evaluating them with real users in the real world. [sent-13, score-0.295]
8 Consequently, there is a need for a common platform to effectively compare the performance of verbal navigation systems developed by different teams using a variety of techniques (e.g. [sent-15, score-0.783]
9 49 This demonstration system brings together existing online data resources and software toolkits to create a low-cost framework for evaluation of pedestrian route instruction systems. [sent-20, score-0.532]
10 We have built a web-based environment containing a simulated real world in which users can simulate walking on the streets of real cities whilst interacting with different navigation systems. [sent-21, score-1.361]
11 2 Related work The GIVE challenge developed a 3D virtual indoor environment for development and evaluation of indoor pedestrian navigation instruction systems (Koller et al.). [sent-23, score-1.414]
12 The user is instructed by a navigation system that generates route instructions. [sent-27, score-1.077]
13 The basic idea was to have several such navigation systems hosted on the GIVE server and evaluate them in the same game worlds, with a number of users over the internet. [sent-28, score-1.146]
14 Conceptually our work is very similar to the GIVE framework, but our objective is to evaluate systems that instruct pedestrian users in the real world. [sent-29, score-0.429]
15 The GIVE framework has been successfully used for comparative evaluation of several systems generating instructions in virtual indoor environments. [sent-30, score-0.364]
16 Another system, “Virtual Navigator”, is a 3D environment that simulates the real world to train blind and visually impaired people to learn often-used routes and develop basic navigation skills (McGookin et al.). [sent-31, score-1.067]
17 It uses haptic force-feedback and spatialised auditory feedback to simulate the interaction between users and the environment they are in. [sent-35, score-0.457]
18 The users simulate walking by using arrow keys on a keyboard and by using a device that works as a 3D mouse to simulate a virtual white cane. [sent-36, score-0.459]
19 Auditory clues are provided to the cane user to indicate, for example, the difference between rush hour and a quiet evening in the environment. [sent-37, score-0.25]
20 While this simulated environment focuses on providing the right kind of tactile and auditory feedback to its users, we focus on providing a simulated environment where people can look at landmarks and navigate based on spatial and visual instructions provided to them. [sent-38, score-0.739]
21 User simulation modules are usually developed to train and test reinforcement learning based interactive spoken dialogue systems (Janarthanam and Lemon, 2009; Georgila et al. [sent-39, score-0.268]
22 These agents replace real users in interaction with dialogue systems. [sent-42, score-0.289]
23 In contrast to this approach, we propose a system where only the spatial and visual environment is simulated. [sent-46, score-0.174]
24 See section 4 for a discussion of different pedestrian navigation systems. [sent-47, score-0.886]
25 The server side consists of a broker module, navigation system, game-world server, TTS engine, and a city model. [sent-49, score-0.984]
26 On the user’s side is a web-based client that consists of the simulated real-world and the interaction panel. [sent-50, score-0.333]
27 1 Game-world module Walking aimlessly in the simulated real world can be a boring task. [sent-52, score-0.265]
28 Therefore, instead of giving web users navigation tasks from A to B, we embed navigation tasks in a game-world overlaid on top of the simulated real world. [sent-53, score-1.751]
29 We developed a “treasure hunting” game in which users solve several pieces of a puzzle to discover the location of the treasure chest. [sent-54, score-0.432]
30 In order to solve the puzzle, they interact with game characters (e.g. [sent-55, score-0.219]
31 This sets the user a number of navigation tasks to acquire the next clues until they find the treasure. [sent-58, score-0.934]
Game characters are laid out on real streets, making it easy to develop a game without developing a game-world. [sent-62, score-0.295]
33 New game-worlds can be easily scripted using JavaScript, where the location (latitude and longitude) and behaviour of the game characters are defined. [sent-63, score-0.248]
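To make this concrete, here is a minimal sketch of what such a game-world script might look like. The paper only states that locations and behaviours are defined in JavaScript; the schema, property names, and coordinates below are illustrative assumptions.

```javascript
// Hypothetical game-world script (assumed schema): each character has a
// location on a real street and a behaviour triggered when the user arrives.
const gameWorld = {
  name: "treasure-hunt-demo",
  characters: [
    {
      id: "pirate",
      position: { lat: 55.9446, lng: -3.1876 }, // latitude/longitude on a real street
      onApproach: () => "Ask the shopkeeper on the next street for a clue!",
    },
    {
      id: "shopkeeper",
      position: { lat: 55.9462, lng: -3.1869 },
      onApproach: () => "The treasure chest is near the old clock tower...",
    },
  ],
};
```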
34 2 Broker The broker module is a web server that connects the web clients to their corresponding navigation systems. [sent-66, score-0.969]
35 Subsequent messages from the users will be routed to the assigned navigation system. [sent-69, score-0.803]
36 The broker communicates with the navigation systems via a communication platform, thereby ensuring that navigation systems developed in different languages (such as C++, Java, Python, etc.) are supported. [sent-70, score-1.581]
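As a rough illustration, the broker's routing logic might look like the sketch below. The assignment policy, transport, and message format are assumptions; the paper only states that each client is paired with a navigation system and that later messages are routed over that pairing.

```javascript
// Hypothetical broker routing table: one navigation-system connection per
// web client. Transport details (HTTP, WebSocket, ...) are elided.
const routes = new Map(); // clientId -> navigation-system connection

function onClientConnect(clientId, availableSystems) {
  // Example policy: round-robin over the registered navigation systems.
  const system = availableSystems[routes.size % availableSystems.length];
  routes.set(clientId, system);
}

function onClientMessage(clientId, message) {
  // Subsequent messages go to the assigned system. Because only a serialized
  // message crosses this boundary, the navigation system itself may be
  // written in C++, Java, Python, or any other language.
  routes.get(clientId).send(JSON.stringify(message));
}
```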
37 3 Navigation system The navigation system is the central component of this architecture: it provides the user with instructions to reach their destination. [sent-72, score-1.021]
38 Each navigation system is run as a server remotely. [sent-73, score-0.797]
39 When a user’s client connects to the server, it instantiates a navigation system object and assigns it to the user exclusively. [sent-74, score-1.034]
40 Every user is identified using a unique id (UUID), which is used to map the user to his/her respective navigation system. [sent-75, score-1.126]
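A minimal sketch of this per-user instantiation, with assumed class and method names:

```javascript
// One navigation-system object per user, keyed by UUID (assumed design,
// following the paper's description; names are illustrative).
const sessions = new Map(); // uuid -> NavigationSystem instance

class NavigationSystem {
  constructor(uuid) {
    this.uuid = uuid;
    this.lastKnownPosition = null;
  }
  onUserLocation(lat, lng) {
    // Called for each periodic position update from the web client.
    this.lastKnownPosition = { lat, lng };
  }
}

function systemFor(uuid) {
  if (!sessions.has(uuid)) sessions.set(uuid, new NavigationSystem(uuid));
  return sessions.get(uuid); // this user's exclusive instance
}
```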
41 The navigation system is introduced in the game scenario as a “buddy” that will help the user in his objective: finding the treasure. [sent-76, score-1.197]
42 The web client sends the user’s location to the system periodically (every few seconds). [sent-77, score-0.211]
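On the client side, the periodic update could be as simple as the following sketch; the endpoint name and payload shape are assumptions rather than the framework's actual protocol, and `panorama` is the Street View client created in the web-based client section below.

```javascript
// Hypothetical periodic location update from the web client. The endpoint
// "/broker/location" and the payload fields are illustrative assumptions.
const userId = crypto.randomUUID(); // the UUID identifying this user

setInterval(() => {
  const pos = panorama.getPosition(); // current Street View position
  fetch("/broker/location", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ userId, lat: pos.lat(), lng: pos.lng() }),
  });
}, 5000); // "every few seconds"
```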
43 4 TTS engine Alongside the navigation system, we use the Cereproc text-to-speech engine, which converts the system’s utterances into speech. [sent-79, score-0.742]
44 5 City Model The navigation system is supported by a database called the City Model. [sent-83, score-0.684]
45 The City Model is a GIS database containing a variety of data required to support navigation tasks. [sent-84, score-0.684]
46 The amenities and landmarks are represented as nodes (with latitude and longitude information). [sent-92, score-0.179]
Subroutines give access to the required information, such as the nearest amenity or the distance or route from A to B. [sent-95, score-0.209]
48 These subroutines provide the interface between the navigation systems and the database. [sent-96, score-0.761]
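As an illustration, a nearest-amenity subroutine could be sketched as below; the node schema is an assumption, with only the latitude/longitude representation taken from the paper.

```javascript
// Great-circle distance in metres between two {lat, lng} points (haversine).
function haversineMetres(a, b) {
  const R = 6371000; // mean Earth radius in metres
  const rad = (d) => (d * Math.PI) / 180;
  const dLat = rad(b.lat - a.lat);
  const dLng = rad(b.lng - a.lng);
  const h =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(rad(a.lat)) * Math.cos(rad(b.lat)) * Math.sin(dLng / 2) ** 2;
  return 2 * R * Math.asin(Math.sqrt(h));
}

// Hypothetical City Model query: nearest amenity of a given type.
// nodes: [{ type, lat, lng, ... }]; assumes at least one node of that type.
function nearestAmenity(nodes, userPos, type) {
  return nodes
    .filter((n) => n.type === type)
    .reduce((best, n) =>
      haversineMetres(userPos, n) < haversineMetres(userPos, best) ? n : best
    );
}
```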
49 6 Web-based client The web-based client is a JavaScript/HTML program running on the user’s web browser software (e.g. [sent-98, score-0.288]
50 It has two parts: the streetview panel and the interaction panel. [sent-102, score-0.317]
51 Streetview panel: the streetview panel presents a simulated real world visually to the user. [sent-103, score-0.469]
52 When the page loads, a Google Streetview client (Google Maps API) is created with an initial user coordinate. [sent-104, score-0.35]
53 Google Streetview is a web service that renders a panoramic view of real streets in major cities around the world. [sent-105, score-0.199]
54 This client allows the web user to get a panoramic view of the streets around the user’s virtual location. [sent-106, score-0.587]
55 A game-world received from the server is overlaid on the simulated real world. [sent-107, score-0.421]
56 The user can walk around and interact with game characters using the arrow keys on his keyboard or the mouse. [sent-108, score-0.499]
57 As the user walks around, his location (stored in the form of latitude and longitude coordinates) gets updated locally. [sent-109, score-0.376]
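A minimal sketch of that set-up using the real Google Maps JavaScript API; the container id, starting coordinate, and point of view are illustrative:

```javascript
// Create the Street View panorama with an initial user coordinate.
const panorama = new google.maps.StreetViewPanorama(
  document.getElementById("streetview-panel"), // assumed container id
  {
    position: { lat: 55.9446, lng: -3.1876 }, // initial user coordinate
    pov: { heading: 90, pitch: 0 },
  }
);

// As the user walks with the arrow keys or mouse, track the locally
// updated position (latitude/longitude).
panorama.addListener("position_changed", () => {
  const pos = panorama.getPosition();
  console.log("user now at", pos.lat(), pos.lng());
});
```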
58 Interaction panel: the web-client also includes an interaction panel that lets the user interact with his buddy navigation system. [sent-111, score-1.297]
59 In addition to user location information, users can also interact with the navigation system using textual utterances or their equivalents. [sent-112, score-1.132]
60 We provide users with two types of interaction panel: a GUI panel and a text panel. [sent-113, score-0.326]
61 For example, the user can request a route to a destination by selecting the street name from a drop-down list and clicking on the Send button. [sent-117, score-0.479]
62 Of course, both types of input are parsed by the navigation system. [sent-121, score-0.684]
63 In the future, we also plan to add an input channel that can stream user speech to the navigation system. [sent-122, score-0.905]
64 4 Candidate Navigation Systems This framework can be used to evaluate a variety of navigation systems. [sent-123, score-0.731]
65 Route navigation has been an interesting research topic in both geoinformatics and computational linguistics. [sent-124, score-0.684]
66 Several prototype navigation systems have been developed over the years. [sent-125, score-0.756]
67 Although there are several systems that do not use language as a means of communication for navigation tasks (instead using geotagged photographs (Beeharee and Steed, 2006; Hiley et al.)). [sent-126, score-0.761]
68 Therefore, our framework does not include systems that generate routes on 2D/3D maps as navigation aids. [sent-131, score-0.835]
69 • ‘A priori’ route instruction systems: these systems describe the entire route before the user starts navigating. [sent-133, score-0.433]
70 • ‘In-situ’ or incremental route instruction systems: these systems generate route instructions incrementally along the route (see the sketch after this list). [sent-137, score-0.323]
71 They keep track of the user’s location and issue the next instruction when the user reaches the next node on the planned route. [sent-142, score-0.384]
72 The next instruction tells the user how to reach the new next node. [sent-143, score-0.332]
73 Some systems do not keep track of the user, but require the user to request the next instruction when they reach the next node. [sent-144, score-0.406]
74 • Interactive navigation systems: these systems are both incremental and interactive. [sent-145, score-0.724]
75 These systems keep track of the user’s location and proactively generate instructions based on user proximity to the next node. [sent-149, score-0.429]
76 In addition, they can interact with users by asking them questions about entities in their viewshed. [sent-150, score-0.201]
77 Questions like these will let the system assess the user’s location and thereby adapt its instruction to the situated context. [sent-153, score-0.19]
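As referenced in the ‘in-situ’ item above, here is a minimal sketch of the proximity-triggered strategy, reusing the haversineMetres helper sketched in the City Model section; the route shape, threshold, and instruction text are illustrative assumptions.

```javascript
// Hypothetical in-situ navigator: proactively issue the next instruction
// when the user gets close to the next node on the planned route.
function makeInSituNavigator(route, say) {
  // route: [{ position: { lat, lng }, instruction: "..." }, ...]
  let next = 0;
  return function onLocation(userPos) {
    if (next >= route.length) return; // destination reached
    if (haversineMetres(userPos, route[next].position) < 20) {
      say(route[next].instruction); // e.g. "Turn left at the church"
      next += 1; // now guide the user towards the new next node
    }
  };
}
```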
78 Objective metrics include the time taken by the user to finish each navigation task and the game, the distance travelled, the number of wrong turns, etc. [sent-155, score-0.905]
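A sketch of how these objective metrics could be computed from a logged trace of timestamped positions; the trace format is an assumption, wrong-turn counting would additionally require the planned route, and the distance helper is the haversineMetres sketch above.

```javascript
// trace: [{ lat, lng, t }, ...] with t in milliseconds (assumed log format).
function objectiveMetrics(trace) {
  let distance = 0;
  for (let i = 1; i < trace.length; i++) {
    distance += haversineMetres(trace[i - 1], trace[i]); // helper from above
  }
  return {
    taskTimeSeconds: (trace[trace.length - 1].t - trace[0].t) / 1000,
    distanceTravelledMetres: distance,
  };
}
```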
79 Subjective metrics based on each user’s ratings of different features of the system can be obtained through user satisfaction questionnaires. [sent-157, score-0.221]
80 The questionnaire consists of questions about the game, the buddy, and the user himself, for example: • Was the game engaging? [sent-159, score-0.442]
81 Figure 2: Snapshot of the web client. • Were the buddy instructions easy to understand? [sent-164, score-0.404]
82 • Were the buddy instructions ever wrong or misplaced? [sent-165, score-0.245]
83 6 Evaluation scenarios We aim to evaluate navigation systems under a variety of scenarios. [sent-168, score-0.724]
84 • GPS positioning errors: one scenario for evaluation is to test how robustly navigation systems handle erroneous GPS signals from the user’s end. [sent-170, score-0.724]
85 • Output modalities: the output of navigation systems can be presented in two modalities: text and speech. [sent-171, score-0.724]
86 While speech may enable hands-free, eyes-free navigation, text displayed on navigation aids like smartphones may increase cognitive load. [sent-172, score-0.716]
87 • Noise in user speech: for systems that take user speech as input, it is important to handle noise in such a channel. [sent-174, score-0.512]
88 Noise due to wind and traffic is most common in pedestrian scenarios. [sent-175, score-0.202]
89 • Adaptation to users: returning users may have learned the layout of the game world. [sent-177, score-0.282]
90 An interesting scenario is to examine how navigation systems adapt to a user’s increasing spatial and visual knowledge. [sent-178, score-0.806]
91 Errors in GPS positioning of the user and noise in user speech can be simulated at the server end, thereby creating a range of challenging scenarios to evaluate the robustness of the systems. [sent-179, score-0.741]
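For instance, server-side GPS error could be simulated by jittering each reported coordinate before the navigation system sees it, as in this sketch; the noise model and magnitude are assumptions.

```javascript
// Add zero-mean Gaussian jitter to a reported position. Near the equator,
// 1e-5 degrees of latitude is roughly one metre, so sigma = 1e-4 simulates
// GPS error on the order of ten metres.
function withGpsNoise(pos, sigma = 1e-4) {
  const gauss = () =>
    Math.sqrt(-2 * Math.log(1 - Math.random())) *
    Math.cos(2 * Math.PI * Math.random()); // Box-Muller standard normal
  return { lat: pos.lat + sigma * gauss(), lng: pos.lng + sigma * gauss() };
}
```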
92 7 The Shared Challenge We plan to organise a shared challenge for outdoor pedestrian route instruction generation, in which a variety of systems can be evaluated. [sent-180, score-0.599]
93 Participating research teams will be able to use our interfaces and modules to develop navigation systems. [sent-181, score-0.741]
94 Developed systems will be hosted on our challenge server and a web-based evaluation will be organised in consultation with the research community (Janarthanam and Lemon, 2011). [sent-183, score-0.276]
95 8 Demonstration system At the demonstration, we will present the evaluation framework along with a demo navigation dialogue system. [sent-184, score-0.833]
96 The navigation system and other server modules will run on a remote server. [sent-186, score-0.827]
97 A natural wayfinding exploiting photos in pedestrian navigation systems. [sent-194, score-0.886]
98 AudioGPS: Spatial audio navigation with a minimal attention interface. [sent-232, score-0.725]
99 Virtual navigator: Developing a simulator for independent route learning. [sent-274, score-0.172]
100 A survey of statistical user simulation techniques for reinforcement-learning of dialogue management strategies. [sent-282, score-0.387]
wordName wordTfidf (topN-words)
[('navigation', 0.684), ('user', 0.221), ('pedestrian', 0.202), ('route', 0.172), ('game', 0.163), ('panel', 0.132), ('simulated', 0.129), ('client', 0.129), ('buddy', 0.129), ('users', 0.119), ('instructions', 0.116), ('server', 0.113), ('instruction', 0.111), ('streetview', 0.11), ('virtual', 0.106), ('dialogue', 0.102), ('environment', 0.092), ('janarthanam', 0.092), ('gps', 0.08), ('interaction', 0.075), ('broker', 0.074), ('gameworld', 0.074), ('simulate', 0.068), ('real', 0.068), ('gui', 0.064), ('routes', 0.064), ('simulation', 0.064), ('streets', 0.064), ('tts', 0.059), ('oliver', 0.059), ('interact', 0.056), ('spatial', 0.056), ('auditory', 0.055), ('buttons', 0.055), ('indoor', 0.055), ('longitude', 0.055), ('location', 0.052), ('haptic', 0.048), ('latitude', 0.048), ('ubiquitous', 0.048), ('framework', 0.047), ('landmarks', 0.044), ('lemon', 0.044), ('byron', 0.041), ('audio', 0.041), ('systems', 0.04), ('mobile', 0.04), ('city', 0.039), ('walking', 0.039), ('module', 0.038), ('challenge', 0.037), ('beeharee', 0.037), ('bosman', 0.037), ('coral', 0.037), ('georgila', 0.037), ('geotagged', 0.037), ('hiley', 0.037), ('malaka', 0.037), ('mcgookin', 0.037), ('navigator', 0.037), ('oberlander', 0.037), ('organise', 0.037), ('overlaid', 0.037), ('panoramic', 0.037), ('richter', 0.037), ('schatzmann', 0.037), ('srini', 0.037), ('subroutines', 0.037), ('treasure', 0.037), ('zandbergen', 0.037), ('koller', 0.035), ('request', 0.034), ('dale', 0.033), ('behaviour', 0.033), ('developed', 0.032), ('amenities', 0.032), ('holland', 0.032), ('questionnaire', 0.032), ('smartphones', 0.032), ('stoia', 0.032), ('priori', 0.032), ('keys', 0.032), ('web', 0.03), ('noise', 0.03), ('world', 0.03), ('modules', 0.03), ('organised', 0.029), ('puzzle', 0.029), ('clues', 0.029), ('jones', 0.029), ('engine', 0.029), ('teams', 0.027), ('hosted', 0.027), ('keyboard', 0.027), ('architecture', 0.027), ('thereby', 0.027), ('visual', 0.026), ('street', 0.026), ('questions', 0.026), ('click', 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 24 acl-2012-A Web-based Evaluation Framework for Spatial Instruction-Giving Systems
Author: Srinivasan Janarthanam ; Oliver Lemon ; Xingkun Liu
Abstract: We demonstrate a web-based environment for development and testing of different pedestrian route instruction-giving systems. The environment contains a City Model, a TTS interface, a game-world, and a user GUI including a simulated street-view. We describe the environment and components, the metrics that can be used for the evaluation of pedestrian route instruction-giving systems, and the shared challenge which is being organised using this environment.
2 0.42532447 93 acl-2012-Fast Online Lexicon Learning for Grounded Language Acquisition
Author: David Chen
Abstract: Learning a semantic lexicon is often an important first step in building a system that learns to interpret the meaning of natural language. It is especially important in language grounding where the training data usually consist of language paired with an ambiguous perceptual context. Recent work by Chen and Mooney (2011) introduced a lexicon learning method that deals with ambiguous relational data by taking intersections of graphs. While the algorithm produced good lexicons for the task of learning to interpret navigation instructions, it only works in batch settings and does not scale well to large datasets. In this paper we introduce a new online algorithm that is an order of magnitude faster and surpasses the state-of-the-art results. We show that by changing the grammar of the formal meaning representation language and training on additional data collected from Amazon’s Mechanical Turk we can further improve the results. We also include experimental results on a Chinese translation of the training data to demonstrate the generality of our approach.
3 0.20263773 59 acl-2012-Corpus-based Interpretation of Instructions in Virtual Environments
Author: Luciana Benotti ; Martin Villalba ; Tessa Lau ; Julian Cerruti
Abstract: Previous approaches to instruction interpretation have required either extensive domain adaptation or manually annotated corpora. This paper presents a novel approach to instruction interpretation that leverages a large amount of unannotated, easy-to-collect data from humans interacting with a virtual world. We compare several algorithms for automatically segmenting and discretizing this data into (utterance, reaction) pairs and training a classifier to predict reactions given the next utterance. Our empirical analysis shows that the best algorithm achieves 70% accuracy on this task, with no manual annotation required. 1 Introduction and motivation Mapping instructions into automatically executable actions would enable the creation of natural language interfaces to many applications (Lau et al., 2009; Branavan et al., 2009; Orkin and Roy, 2009). In this paper, we focus on the task of navigation and manipulation of a virtual environment (Vogel and Jurafsky, 2010; Chen and Mooney, 2011). Current symbolic approaches to the problem are brittle to the natural language variation present in instructions and require intensive rule authoring to be fit for a new task (Dzikovska et al., 2008). Current statistical approaches require extensive manual annotations of the corpora used for training (MacMahon et al., 2006; Matuszek et al., 2010; Gorniak and Roy, 2007; Rieser and Lemon, 2010). Manual annotation and rule authoring by natural language engineering experts are bottlenecks for developing conversational systems for new domains. This paper proposes a fully automated approach to interpreting natural language instructions to complete a task in a virtual world based on unsupervised recordings of human-human interactions performing that task in that virtual world. Given unannotated corpora collected from humans following other humans’ instructions, our system automatically segments the corpus into labeled training data for a classification algorithm. Our interpretation algorithm is based on the observation that similar instructions uttered in similar contexts should lead to similar actions being taken in the virtual world. Given a previously unseen instruction, our system outputs actions that can be directly executed in the virtual world, based on what humans did when given similar instructions in the past. 2 Corpora situated in virtual worlds Our environment consists of six virtual worlds designed for the natural language generation shared task known as the GIVE Challenge (Koller et al., 2010), where a pair of partners must collaborate to solve a task in a 3D space (Figure 1). The “instruction follower” (IF) can move around in the virtual world, but has no knowledge of the task. The “instruction giver” (IG) types instructions to the IF in order to guide him to accomplish the task. Each corpus contains the IF’s actions and position recorded every 200 milliseconds, as well as the IG’s instructions with their timestamps. We used two corpora for our experiments. The Cm corpus (Gargett et al., 2010) contains instructions given by multiple people, consisting of 37 games spanning 2163 instructions over 8:17 hs. Figure 1: A screenshot of a virtual world. 
The world consists of interconnecting hallways, rooms and objects. The Cs corpus (Benotti and Denis, 2011), gathered using a single IG, is composed of 63 games and 3417 instructions, and was recorded in a span of 6:09 hs. It took less than 15 hours to collect the corpora through the web, and the subjects reported that the experiment was fun. While the environment is restricted, people describe the same route and the same objects in extremely different ways. Below are some examples of instructions from our corpus, all given for the same route shown in Figure 1:
1) out
2) walk down the passage
3) nowgo [sic] to the pink room
4) back to the room with the plant
5) Go through the door on the left
6) go through opening with yellow wall paper
People describe routes using landmarks (4) or specific actions (2). They may describe the same object differently (5 vs 6). Instructions also differ in their scope (3 vs 1). Thus, even ignoring spelling and grammatical errors, navigation instructions contain considerable variation, which makes interpreting them a challenging problem. 3 Learning from previous interpretations Our algorithm consists of two phases: annotation and interpretation. Annotation is performed only once and consists of automatically associating each IG instruction to an IF reaction. Interpretation is performed every time the system receives an instruction and consists of predicting an appropriate reaction given reactions observed in the corpus. Our method is based on the assumption that a reaction captures the semantics of the instruction that caused it. Therefore, if two utterances result in the same reaction, they are paraphrases of each other, and similar utterances should generate the same reaction. This approach enables us to predict reactions for previously-unseen instructions. 3.1 Annotation phase The key challenge in learning from massive amounts of easily-collected data is to automatically annotate an unannotated corpus. Our annotation method consists of two parts: first, segmenting a low-level interaction trace into utterances and corresponding reactions, and second, discretizing those reactions into canonical action sequences. Segmentation enables our algorithm to learn from traces of IFs interacting directly with a virtual world. Since the IF can move freely in the virtual world, his actions are a stream of continuous behavior. Segmentation divides these traces into reactions that follow from each utterance of the IG. Consider the following example starting at the situation shown in Figure 1:
IG(1): go through the yellow opening
IF(2): [walks out of the room]
IF(3): [turns left at the intersection]
IF(4): [enters the room with the sofa]
IG(5): stop
It is not clear whether the IF is doing ⟨3, 4⟩ because he is reacting to 1 or because he is being proactive. While one could manually annotate this data to remove extraneous actions, our goal is to develop automated solutions that enable learning from massive amounts of data. We decided to approach this problem by experimenting with two alternative formal definitions: 1) a strict definition that considers the maximum reaction according to the IF behavior, and 2) a loose definition based on the empirical observation that, in situated interaction, most instructions are constrained by the current visually perceived affordances (Gibson, 1979; Stoia et al., 2006). We formally define behavior segmentation (Bhv) as follows. 
A reaction r_k to an instruction u_k begins right after the instruction u_k is uttered and ends right before the next instruction u_{k+1} is uttered. In the example, instruction 1 corresponds to ⟨2, 3, 4⟩. We formally define visibility segmentation (Vis) as follows. A reaction r_k to an instruction u_k begins right after the instruction u_k is uttered and ends right before the next instruction u_{k+1} is uttered, or right after the IF leaves the area visible at 360° from where u_k was uttered. In the example, instruction 1’s reaction would be limited to ⟨2⟩ because the intersection is not visible from where the instruction was uttered. The Bhv and Vis methods define how to segment an interaction trace into utterances and their corresponding reactions. However, users frequently perform noisy behavior that is irrelevant to the goal of the task. For example, after hearing an instruction, an IF might go into the wrong room, realize the error, and leave the room. A reaction should not include such irrelevant actions. In addition, IFs may accomplish the same goal using different behaviors: two different IFs may interpret “go to the pink room” by following different paths to the same destination. We would like to be able to generalize both reactions into one canonical reaction. As a result, our approach discretizes reactions into higher-level action sequences with less noise and less variation. Our discretization algorithm uses an automated planner and a planning representation of the task. This planning representation includes: (1) the task goal, (2) the actions which can be taken in the virtual world, and (3) the current state of the virtual world. Using the planning representation, the planner calculates an optimal path between the starting and ending states of the reaction, eliminating all unnecessary actions. While we use the classical planner FF (Hoffmann, 2003), our technique could also work with classical planning (Nau et al., 2004) or other techniques such as probabilistic planning (Bonet and Geffner, 2005). It is also not dependent on a particular discretization of the world in terms of actions. Now we are ready to define canonical reaction c_k formally. Let S_k be the state of the virtual world when instruction u_k was uttered, S_{k+1} be the state of the world where the reaction ends (as defined by Bhv or Vis segmentation), and D be the planning domain representation of the virtual world. The canonical reaction to u_k is defined as the sequence of actions returned by the planner with S_k as initial state, S_{k+1} as goal state and D as planning domain. 3.2 Interpretation phase The annotation phase results in a collection of (u_k, c_k) pairs. The interpretation phase uses these pairs to interpret new utterances in three steps. First, we filter the set of pairs into those whose reactions can be directly executed from the current IF position. Second, we group the filtered pairs according to their reactions. Third, we select the group with utterances most similar to the new utterance, and output that group’s reaction. Figure 2 shows the output of the first two steps: three groups of pairs whose reactions can all be executed from the IF’s current position. Figure 2: Utterance groups for this situation. Colored arrows show the reaction associated with each group. We treat the third step, selecting the most similar group for a new utterance, as a classification problem. We compare three different classification methods. 
One method uses nearest-neighbor classification with three different similarity metrics: Jaccard and Overlap coefficients (both of which measure the degree of overlap between two sets, differing only in the normalization of the final value (Nikravesh et al., 2005)), and Levenshtein Distance (a string metric for measuring the amount of differences between two sequences of words (Levenshtein, 1966)). Our second classification method employs a strategy in which we considered each group as a set of possible machine translations of our utterance, using the BLEU measure (Papineni et al., 2002) to select which group could be considered the best translation of our utterance. Finally, we trained an SVM classifier (Cortes and Vapnik, 1995) using the unigrams of each paraphrase and the position of the IF as features, and setting their group as the output class using a libSVM wrapper (Chang and Lin, 2011).

Table 1: Accuracy comparison between Cm and Cs for Bhv and Vis segmentation

  Algorithm     Cm/Bhv  Cm/Vis  Cs/Bhv  Cs/Vis
  Jaccard         47%     54%     54%     70%
  Overlap         43%     53%     45%     60%
  BLEU            44%     52%     54%     50%
  SVM             33%     29%     45%     29%
  Levenshtein     21%     20%      8%     17%

When the system misinterprets an instruction we use a similar approach to what people do in order to overcome misunderstandings. If the system executes an incorrect reaction, the IG can tell the system to cancel its current interpretation and try again using a paraphrase, selecting a different reaction. 4 Evaluation For the evaluation phase, we annotated both the Cm and Cs corpora entirely, and then we split them in an 80/20 proportion; the first 80% of data collected in each virtual world was used for training, while the remaining 20% was used for testing. For each pair (u_k, c_k) in the testing set, we used our algorithm to predict the reaction to the selected utterance, and then compared this result against the automatically annotated reaction. Table 1 shows the results. Comparing the Bhv and Vis segmentation strategies, Vis tends to obtain better results than Bhv. In addition, accuracy on the Cs corpus was generally higher than Cm. Given that Cs contained only one IG, we believe this led to less variability in the instructions and less noise in the training data. We evaluated the impact of user corrections by simulating them using the existing corpus. In case of a wrong response, the algorithm receives a second utterance with the same reaction (a paraphrase of the previous one). Then the new utterance is tested over the same set of possible groups, except for the one which was returned before. If the correct reaction is not predicted after four tries, or there are no utterances with the same reaction, the predictions are registered as wrong. To measure the effects of user corrections vs. without, we used a different evaluation process for this algorithm: first, we split the corpus in a 50/50 proportion, and then we moved correctly predicted utterances from the testing set towards training, until either there was nothing more to learn or the training set reached 80% of the entire corpus size. As expected, user corrections significantly improve accuracy, as shown in Figure 3. The worst algorithm’s results improve linearly with each try, while the best ones behave asymptotically, barely improving after the second try. The best algorithm reaches 92% with just one correction from the IG. 5 Discussion and future work We presented an approach to instruction interpretation which learns from non-annotated logs of human behavior. 
Our empirical analysis shows that our best algorithm achieves 70% accuracy on this task, with no manual annotation required. When corrections are added, accuracy goes up to 92% for just one correction. We consider our results promising since state-of-the-art semi-unsupervised approaches to instruction interpretation (Chen and Mooney, 2011) report a 55% accuracy on manually segmented data. We plan to compare our system’s performance against human performance in comparable situations. Our informal observations of the GIVE corpus indicate that humans often follow instructions incorrectly, so our automated system’s performance may be on par with human performance. Although we have presented our approach in the context of 3D virtual worlds, we believe our technique is also applicable to other domains such as the web, video games, or Human Robot Interaction. Figure 3: Accuracy values with corrections over Cs. References Luciana Benotti and Alexandre Denis. 2011. CL system: Giving instructions by corpus based selection. In Proceedings of the Generation Challenges Session at the 13th European Workshop on Natural Language Generation, pages 296–301, Nancy, France, September. Association for Computational Linguistics. Blai Bonet and Héctor Geffner. 2005. mGPT: a probabilistic planner based on heuristic search. Journal of Artificial Intelligence Research, 24:933–944. S.R.K. Branavan, Harr Chen, Luke Zettlemoyer, and Regina Barzilay. 2009. Reinforcement learning for mapping instructions to actions. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 82–90, Suntec, Singapore, August. Association for Computational Linguistics. Chih-Chung Chang and Chih-Jen Lin. 2011. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm. David L. Chen and Raymond J. Mooney. 2011. Learning to interpret natural language navigation instructions from observations. In Proceedings of the 25th AAAI Conference on Artificial Intelligence (AAAI-2011), pages 859–865, August. Corinna Cortes and Vladimir Vapnik. 1995. Support-vector networks. Machine Learning, 20:273–297. Myroslava O. Dzikovska, James F. Allen, and Mary D. Swift. 2008. Linking semantic and knowledge representations in a multi-domain dialogue system. Journal of Logic and Computation, 18:405–430, June. Andrew Gargett, Konstantina Garoufi, Alexander Koller, and Kristina Striegnitz. 2010. The GIVE-2 corpus of giving instructions in virtual environments. In Proceedings of the 7th Conference on International Language Resources and Evaluation (LREC), Malta. James J. Gibson. 1979. The Ecological Approach to Visual Perception, volume 40. Houghton Mifflin. Peter Gorniak and Deb Roy. 2007. Situated language understanding as filtering perceived affordances. Cognitive Science, 31(2):197–231. Jörg Hoffmann. 2003. The Metric-FF planning system: Translating “ignoring delete lists” to numeric state variables. Journal of Artificial Intelligence Research (JAIR), 20:291–341. Alexander Koller, Kristina Striegnitz, Andrew Gargett, Donna Byron, Justine Cassell, Robert Dale, Johanna Moore, and Jon Oberlander. 2010. Report on the second challenge on generating instructions in virtual environments (GIVE-2). In Proceedings of the 6th International Natural Language Generation Conference (INLG), Dublin. 
Tessa Lau, Clemens Drews, and Jeffrey Nichols. 2009. Interpreting written how-to instructions. In Proceedings of the 21st International Joint Conference on Artificial Intelligence, pages 1433–1438, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc. Vladimir I. Levenshtein. 1966. Binary codes capable of correcting deletions, insertions, and reversals. Technical Report 8. Matt MacMahon, Brian Stankiewicz, and Benjamin Kuipers. 2006. Walk the talk: connecting language, knowledge, and action in route instructions. In Proceedings of the 21st National Conference on Artificial Intelligence - Volume 2, pages 1475–1482. AAAI Press. Cynthia Matuszek, Dieter Fox, and Karl Koscher. 2010. Following directions using statistical machine translation. In Proceedings of the 5th ACM/IEEE International Conference on Human-Robot Interaction, HRI '10, pages 251–258, New York, NY, USA. ACM. Dana Nau, Malik Ghallab, and Paolo Traverso. 2004. Automated Planning: Theory & Practice. Morgan Kaufmann Publishers Inc., California, USA. Masoud Nikravesh, Tomohiro Takagi, Masanori Tajima, Akiyoshi Shinmura, Ryosuke Ohgaya, Koji Taniguchi, Kazuyosi Kawahara, Kouta Fukano, and Akiko Aizawa. 2005. Soft computing for perception-based decision processing and analysis: Web-based BISC-DSS. In Masoud Nikravesh, Lotfi Zadeh, and Janusz Kacprzyk, editors, Soft Computing for Information Processing and Analysis, volume 164 of Studies in Fuzziness and Soft Computing, chapter 4, pages 93–188. Springer Berlin / Heidelberg. Jeff Orkin and Deb Roy. 2009. Automatic learning and generation of social behavior from collective human gameplay. In Proceedings of The 8th International Conference on Autonomous Agents and Multiagent Systems - Volume 1, pages 385–392. International Foundation for Autonomous Agents and Multiagent Systems. Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL '02, pages 311–318, Stroudsburg, PA, USA. Association for Computational Linguistics. Verena Rieser and Oliver Lemon. 2010. Learning human multimodal dialogue strategies. Natural Language Engineering, 16:3–23. Laura Stoia, Donna K. Byron, Darla Magdalene Shockley, and Eric Fosler-Lussier. 2006. Sentence planning for realtime navigational instructions. In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers, NAACL-Short '06, pages 157–160, Stroudsburg, PA, USA. Association for Computational Linguistics. Adam Vogel and Dan Jurafsky. 2010. Learning to follow navigational directions. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL '10, pages 806–814, Stroudsburg, PA, USA. Association for Computational Linguistics.
4 0.11386483 114 acl-2012-IRIS: a Chat-oriented Dialogue System based on the Vector Space Model
Author: Rafael E. Banchs ; Haizhou Li
Abstract: This system demonstration paper presents IRIS (Informal Response Interactive System), a chat-oriented dialogue system based on the vector space model framework. The system belongs to the class of example-based dialogue systems and builds its chat capabilities on a dual search strategy over a large collection of dialogue samples. Additional strategies allowing for system adaptation and learning implemented over the same vector model space framework are also described and discussed.
5 0.076592527 160 acl-2012-Personalized Normalization for a Multilingual Chat System
Author: Ai Ti Aw ; Lian Hau Lee
Abstract: This paper describes the personalized normalization of a multilingual chat system that supports chatting in user defined short-forms or abbreviations. One of the major challenges for multilingual chat realized through machine translation technology is the normalization of non-standard, self-created short-forms in the chat message to standard words before translation. Due to the lack of training data and the variations of short-forms used among different social communities, it is hard to normalize and translate chat messages if the user uses vocabularies outside the training data and creates short-forms freely. We develop a personalized chat normalizer for English and integrate it with a multilingual chat system, allowing the user to create and use personalized short-forms in multilingual chat.
6 0.071912535 149 acl-2012-Movie-DiC: a Movie Dialogue Corpus for Research and Development
7 0.066957422 26 acl-2012-Applications of GPC Rules and Character Structures in Games for Learning Chinese Characters
8 0.063958853 166 acl-2012-Qualitative Modeling of Spatial Prepositions and Motion Expressions
9 0.062757351 43 acl-2012-Building Trainable Taggers in a Web-based, UIMA-Supported NLP Workbench
10 0.061221186 205 acl-2012-Tweet Recommendation with Graph Co-Ranking
11 0.055095673 113 acl-2012-INPROwidth.3emiSS: A Component for Just-In-Time Incremental Speech Synthesis
12 0.054394346 182 acl-2012-Spice it up? Mining Refinements to Online Instructions from User Generated Content
13 0.049659148 138 acl-2012-LetsMT!: Cloud-Based Platform for Do-It-Yourself Machine Translation
14 0.049284451 165 acl-2012-Probabilistic Integration of Partial Lexical Information for Noise Robust Haptic Voice Recognition
15 0.04892299 142 acl-2012-Mining Entity Types from Query Logs via User Intent Modeling
16 0.046585094 44 acl-2012-CSNIPER - Annotation-by-query for Non-canonical Constructions in Large Corpora
17 0.043917798 86 acl-2012-Exploiting Latent Information to Predict Diffusions of Novel Topics on Social Networks
18 0.041622099 70 acl-2012-Demonstration of IlluMe: Creating Ambient According to Instant Message Logs
19 0.039753977 13 acl-2012-A Graphical Interface for MT Evaluation and Error Analysis
20 0.038407575 186 acl-2012-Structuring E-Commerce Inventory
topicId topicWeight
[(0, -0.094), (1, 0.055), (2, 0.002), (3, 0.027), (4, 0.009), (5, 0.11), (6, 0.076), (7, 0.062), (8, -0.016), (9, 0.11), (10, -0.044), (11, 0.197), (12, -0.266), (13, 0.24), (14, -0.234), (15, -0.016), (16, -0.347), (17, 0.108), (18, 0.007), (19, 0.017), (20, 0.012), (21, 0.065), (22, -0.014), (23, 0.275), (24, 0.121), (25, 0.048), (26, -0.145), (27, 0.129), (28, -0.059), (29, 0.057), (30, -0.066), (31, 0.079), (32, -0.05), (33, 0.005), (34, -0.023), (35, -0.003), (36, 0.052), (37, 0.05), (38, -0.001), (39, 0.034), (40, 0.074), (41, -0.095), (42, 0.052), (43, 0.026), (44, -0.051), (45, -0.092), (46, 0.012), (47, 0.024), (48, -0.043), (49, 0.073)]
simIndex simValue paperId paperTitle
same-paper 1 0.97449213 24 acl-2012-A Web-based Evaluation Framework for Spatial Instruction-Giving Systems
2 0.8099367 93 acl-2012-Fast Online Lexicon Learning for Grounded Language Acquisition
3 0.73324233 59 acl-2012-Corpus-based Interpretation of Instructions in Virtual Environments
Author: Luciana Benotti ; Martin Villalba ; Tessa Lau ; Julian Cerruti
Abstract: Previous approaches to instruction interpretation have required either extensive domain adaptation or manually annotated corpora. This paper presents a novel approach to instruction interpretation that leverages a large amount of unannotated, easy-to-collect data from humans interacting with a virtual world. We compare several algorithms for automatically segmenting and discretizing this data into (utterance, reaction) pairs and training a classifier to predict reactions given the next utterance. Our empirical analysis shows that the best algorithm achieves 70% accuracy on this task, with no manual annotation required. 1 Introduction and motivation Mapping instructions into automatically executable actions would enable the creation of natural lan- , guage interfaces to many applications (Lau et al., 2009; Branavan et al., 2009; Orkin and Roy, 2009). In this paper, we focus on the task of navigation and manipulation of a virtual environment (Vogel and Jurafsky, 2010; Chen and Mooney, 2011). Current symbolic approaches to the problem are brittle to the natural language variation present in instructions and require intensive rule authoring to be fit for a new task (Dzikovska et al., 2008). Current statistical approaches require extensive manual annotations of the corpora used for training (MacMahon et al., 2006; Matuszek et al., 2010; Gorniak and Roy, 2007; Rieser and Lemon, 2010). Manual annotation and rule authoring by natural language engineering experts are bottlenecks for developing conversational systems for new domains. 181 t e s s al au @ us . ibm . com, j ce rrut i ar .ibm . com @ This paper proposes a fully automated approach to interpreting natural language instructions to complete a task in a virtual world based on unsupervised recordings of human-human interactions perform- ing that task in that virtual world. Given unannotated corpora collected from humans following other humans’ instructions, our system automatically segments the corpus into labeled training data for a classification algorithm. Our interpretation algorithm is based on the observation that similar instructions uttered in similar contexts should lead to similar actions being taken in the virtual world. Given a previously unseen instruction, our system outputs actions that can be directly executed in the virtual world, based on what humans did when given similar instructions in the past. 2 Corpora situated in virtual worlds Our environment consists of six virtual worlds designed for the natural language generation shared task known as the GIVE Challenge (Koller et al., 2010), where a pair of partners must collaborate to solve a task in a 3D space (Figure 1). The “instruction follower” (IF) can move around in the virtual world, but has no knowledge of the task. The “instruction giver” (IG) types instructions to the IF in order to guide him to accomplish the task. Each corpus contains the IF’s actions and position recorded every 200 milliseconds, as well as the IG’s instruc- tions with their timestamps. We used two corpora for our experiments. The Cm corpus (Gargett et al., 2010) contains instructions given by multiple people, consisting of 37 games spanning 2163 instructions over 8: 17 hs. The Proce dJienjgus, R ofep thueb 5lic0t hof A Knonruea ,l M 8-e1e4ti Jnugly o f2 t0h1e2 A.s ?c so2c0ia1t2io Ans fso rc Ciatoiomnp fuotart Cio nmaplu Ltiantgiounisatlic Lsi,n pgaugiestsi1c 8s1–186, Figure 1: A screenshot of a virtual world. 
The world consists of interconnecting hallways, rooms and objects Cs corpus (Benotti and Denis, 2011), gathered using a single IG, is composed of 63 games and 3417 in- structions, and was recorded in a span of 6:09 hs. It took less than 15 hours to collect the corpora through the web and the subjects reported that the experiment was fun. While the environment is restricted, people describe the same route and the same objects in extremely different ways. Below are some examples of instructions from our corpus all given for the same route shown in Figure 1. 1) out 2) walk down the passage 3) nowgo [sic] to the pink room 4) back to the room with the plant 5) Go through the door on the left 6) go through opening with yellow wall paper People describe routes using landmarks (4) or specific actions (2). They may describe the same object differently (5 vs 6). Instructions also differ in their scope (3 vs 1). Thus, even ignoring spelling and grammatical errors, navigation instructions contain considerable variation which makes interpreting them a challenging problem. 3 Learning from previous interpretations Our algorithm consists of two phases: annotation and interpretation. Annotation is performed only once and consists of automatically associating each IG instruction to an IF reaction. Interpretation is performed every time the system receives an instruc182 tion and consists of predicting an appropriate reaction given reactions observed in the corpus. Our method is based on the assumption that a reaction captures the semantics of the instruction that caused it. Therefore, if two utterances result in the same reaction, they are paraphrases of each other, and similar utterances should generate the same reaction. This approach enables us to predict reactions for previously-unseen instructions. 3.1 Annotation phase The key challenge in learning from massive amounts of easily-collected data is to automatically annotate an unannotated corpus. Our annotation method consists of two parts: first, segmenting a low-level interaction trace into utterances and corresponding reactions, and second, discretizing those reactions into canonical action sequences. Segmentation enables our algorithm to learn from traces of IFs interacting directly with a virtual world. Since the IF can move freely in the virtual world, his actions are a stream of continuous behavior. Segmentation divides these traces into reactions that follow from each utterance of the IG. Consider the following example starting at the situation shown in Figure 1: IG(1): go through the yellow opening IF(2): [walks out of the room] IF(3): [turns left at the intersection] IF(4): [enters the room with the sofa] IG(5): stop It is not clear whether the IF is doing h3, 4i because h neo tis c reacting htoe r1 t or Fbec isadu soei hge h 3is, being proactive. While one could manually annotate this data to remove extraneous actions, our goal is to develop automated solutions that enable learning from massive amounts of data. We decided to approach this problem by experimenting with two alternative formal definitions: 1) a strict definition that considers the maximum reaction according to the IF behavior, and 2) a loose defini- tion based on the empirical observation that, in situated interaction, most instructions are constrained by the current visually perceived affordances (Gibson, 1979; Stoia et al., 2006). We formally define behavior segmentation (Bhv) as follows. 
A reaction rk to an instruction uk begins right after the instruction uk is uttered and ends right before the next instruction uk+1 is uttered. In the example, instruction 1corresponds to h2, 3, 4i . We formally d inefsitnrue visibility segmentation (Vis) as f Wolelows. A reaction rk to an instruction uk begins right after the instruction uk is uttered and ends right before the next instruction uk+1 is uttered or right after the IF leaves the area visible at 360◦ from where uk was uttered. In the example, instruction 1’s reaction would be limited to h2i because the intersection is nwootu vldisi bbele l ifmroimte dw htoer he2 tihe b eicnasutrsuec ttihoen was suetctetiroend. The Bhv and Vis methods define how to segment an interaction trace into utterances and their corresponding reactions. However, users frequently perform noisy behavior that is irrelevant to the goal of the task. For example, after hearing an instruction, an IF might go into the wrong room, realize the error, and leave the room. A reaction should not in- clude such irrelevant actions. In addition, IFs may accomplish the same goal using different behaviors: two different IFs may interpret “go to the pink room” by following different paths to the same destination. We would like to be able to generalize both reactions into one canonical reaction. As a result, our approach discretizes reactions into higher-level action sequences with less noise and less variation. Our discretization algorithm uses an automated planner and a planning representation of the task. This planning representation includes: (1) the task goal, (2) the actions which can be taken in the virtual world, and (3) the current state of the virtual world. Using the planning representation, the planner calculates an optimal path between the starting and ending states of the reaction, eliminating all unnecessary actions. While we use the classical planner FF (Hoffmann, 2003), our technique could also work with classical planning (Nau et al., 2004) or other techniques such as probabilistic planning (Bonet and Geffner, 2005). It is also not dependent on a particular discretization of the world in terms of actions. Now we are ready to define canonical reaction ck formally. Let Sk be the state of the virtual world when instruction uk was uttered, Sk+1 be the state of the world where the reaction ends (as defined by Bhv or Vis segmentation), and D be the planning domain representation of the virtual world. The canonical reaction to uk is defined as the sequence of actions 183 returned by the planner with Sk as initial state, Sk+1 as goal state and D as planning domain. 3.2 Interpretation phase The annotation phase results in a collection of (uk, ck) pairs. The interpretation phase uses these pairs to interpret new utterances in three steps. First, we filter the set of pairs into those whose reactions can be directly executed from the current IF position. Second, we group the filtered pairs according to their reactions. Third, we select the group with utterances most similar to the new utterance, and output that group’s reaction. Figure 2 shows the output of the first two steps: three groups of pairs whose reactions can all be executed from the IF’s current position. Figure 2: Utterance groups for this situation. Colored arrows show the reaction associated with each group. We treat the third step, selecting the most similar group for a new utterance, as a classification problem. We compare three different classification methods. 
One method uses nearest-neighbor classification with three different similarity metrics: Jaccard and Overlap coefficients (both of which measure the degree of overlap between two sets, differing only in the normalization of the final value (Nikravesh et al., 2005)), and Levenshtein Distance (a string met- ric for measuring the amount of differences between two sequences of words (Levenshtein, 1966)). Our second classification method employs a strategy in which we considered each group as a set of possible machine translations of our utterance, using the BLEU measure (Papineni et al., 2002) to select which group could be considered the best translation of our utterance. Finally, we trained an SVM classifier (Cortes and Vapnik, 1995) using the unigrams Corpus Cm Corpus Cs Algorithm Bhv Vis Bhv Vis Jaccard47%54%54%70% Overlap BLEU SVM Levenshtein 43% 44% 33% 21% 53% 52% 29% 20% 45% 54% 45% 8% 60% 50% 29% 17% Table 1: Accuracy comparison between Cm and Cs for Bhv and Vis segmentation of each paraphrase and the position of the IF as features, and setting their group as the output class using a libSVM wrapper (Chang and Lin, 2011). When the system misinterprets an instruction we use a similar approach to what people do in order to overcome misunderstandings. If the system executes an incorrect reaction, the IG can tell the system to cancel its current interpretation and try again using a paraphrase, selecting a different reaction. 4 Evaluation For the evaluation phase, we annotated both the Cm and Cs corpora entirely, and then we split them in an 80/20 proportion; the first 80% of data collected in each virtual world was used for training, while the remaining 20% was used for testing. For each pair (uk, ck) in the testing set, we used our algorithm to predict the reaction to the selected utterance, and then compared this result against the automatically annotated reaction. Table 1 shows the results. Comparing the Bhv and Vis segmentation strategies, Vis tends to obtain better results than Bhv. In addition, accuracy on the Cs corpus was generally higher than Cm. Given that Cs contained only one IG, we believe this led to less variability in the instructions and less noise in the training data. We evaluated the impact of user corrections by simulating them using the existing corpus. In case of a wrong response, the algorithm receives a second utterance with the same reaction (a paraphrase of the previous one). Then the new utterance is tested over the same set of possible groups, except for the one which was returned before. If the correct reaction is not predicted after four tries, or there are no utterances with the same reaction, the predictions are registered as wrong. To measure the effects of user corrections vs. without, we used a different evalu184 ation process for this algorithm: first, we split the corpus in a 50/50 proportion, and then we moved correctly predicted utterances from the testing set towards training, until either there was nothing more to learn or the training set reached 80% of the entire corpus size. As expected, user corrections significantly improve accuracy, as shown in Figure 3. The worst algorithm’s results improve linearly with each try, while the best ones behave asymptotically, barely improving after the second try. The best algorithm reaches 92% with just one correction from the IG. 5 Discussion and future work We presented an approach to instruction interpretation which learns from non-annotated logs of human behavior. 
5 Discussion and future work

We presented an approach to instruction interpretation that learns from non-annotated logs of human behavior. Our empirical analysis shows that our best algorithm achieves 70% accuracy on this task with no manual annotation required; with a single correction, accuracy rises to 92%. We consider these results promising, since the state-of-the-art semi-unsupervised approach to instruction interpretation (Chen and Mooney, 2011) reports 55% accuracy on manually segmented data. We plan to compare our system's performance against human performance in comparable situations. Our informal observations of the GIVE corpus indicate that humans often follow instructions incorrectly, so our automated system's performance may be on par with human performance. Although we have presented our approach in the context of 3D virtual worlds, we believe the technique is also applicable to other domains such as the web, video games, or Human-Robot Interaction.

Figure 3: Accuracy values with corrections over Cs.

References

Luciana Benotti and Alexandre Denis. 2011. CL system: Giving instructions by corpus based selection. In Proceedings of the Generation Challenges Session at the 13th European Workshop on Natural Language Generation, pages 296–301, Nancy, France, September. Association for Computational Linguistics.

Blai Bonet and Héctor Geffner. 2005. mGPT: A probabilistic planner based on heuristic search. Journal of Artificial Intelligence Research, 24:933–944.

S.R.K. Branavan, Harr Chen, Luke Zettlemoyer, and Regina Barzilay. 2009. Reinforcement learning for mapping instructions to actions. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 82–90, Suntec, Singapore, August. Association for Computational Linguistics.

Chih-Chung Chang and Chih-Jen Lin. 2011. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

David L. Chen and Raymond J. Mooney. 2011. Learning to interpret natural language navigation instructions from observations. In Proceedings of the 25th AAAI Conference on Artificial Intelligence (AAAI-2011), pages 859–865, August.

Corinna Cortes and Vladimir Vapnik. 1995. Support-vector networks. Machine Learning, 20:273–297.

Myroslava O. Dzikovska, James F. Allen, and Mary D. Swift. 2008. Linking semantic and knowledge representations in a multi-domain dialogue system. Journal of Logic and Computation, 18:405–430, June.

Andrew Gargett, Konstantina Garoufi, Alexander Koller, and Kristina Striegnitz. 2010. The GIVE-2 corpus of giving instructions in virtual environments. In Proceedings of the 7th Conference on International Language Resources and Evaluation (LREC), Malta.

James J. Gibson. 1979. The Ecological Approach to Visual Perception, volume 40. Houghton Mifflin.

Peter Gorniak and Deb Roy. 2007. Situated language understanding as filtering perceived affordances. Cognitive Science, 31(2):197–231.

Jörg Hoffmann. 2003. The Metric-FF planning system: Translating "ignoring delete lists" to numeric state variables. Journal of Artificial Intelligence Research (JAIR), 20:291–341.

Alexander Koller, Kristina Striegnitz, Andrew Gargett, Donna Byron, Justine Cassell, Robert Dale, Johanna Moore, and Jon Oberlander. 2010. Report on the second challenge on generating instructions in virtual environments (GIVE-2). In Proceedings of the 6th International Natural Language Generation Conference (INLG), Dublin.
Tessa Lau, Clemens Drews, and Jeffrey Nichols. 2009. Interpreting written how-to instructions. In Proceedings of the 21st International Joint Conference on Artificial Intelligence, pages 1433–1438, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

Vladimir I. Levenshtein. 1966. Binary codes capable of correcting deletions, insertions, and reversals. Technical Report 8.

Matt MacMahon, Brian Stankiewicz, and Benjamin Kuipers. 2006. Walk the talk: Connecting language, knowledge, and action in route instructions. In Proceedings of the 21st National Conference on Artificial Intelligence - Volume 2, pages 1475–1482. AAAI Press.

Cynthia Matuszek, Dieter Fox, and Karl Koscher. 2010. Following directions using statistical machine translation. In Proceedings of the 5th ACM/IEEE International Conference on Human-Robot Interaction, HRI '10, pages 251–258, New York, NY, USA. ACM.

Dana Nau, Malik Ghallab, and Paolo Traverso. 2004. Automated Planning: Theory & Practice. Morgan Kaufmann Publishers Inc., California, USA.

Masoud Nikravesh, Tomohiro Takagi, Masanori Tajima, Akiyoshi Shinmura, Ryosuke Ohgaya, Koji Taniguchi, Kazuyosi Kawahara, Kouta Fukano, and Akiko Aizawa. 2005. Soft computing for perception-based decision processing and analysis: Web-based BISC-DSS. In Masoud Nikravesh, Lotfi Zadeh, and Janusz Kacprzyk, editors, Soft Computing for Information Processing and Analysis, volume 164 of Studies in Fuzziness and Soft Computing, chapter 4, pages 93–188. Springer Berlin / Heidelberg.

Jeff Orkin and Deb Roy. 2009. Automatic learning and generation of social behavior from collective human gameplay. In Proceedings of the 8th International Conference on Autonomous Agents and Multiagent Systems - Volume 1, pages 385–392. International Foundation for Autonomous Agents and Multiagent Systems.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL '02, pages 311–318, Stroudsburg, PA, USA. Association for Computational Linguistics.

Verena Rieser and Oliver Lemon. 2010. Learning human multimodal dialogue strategies. Natural Language Engineering, 16:3–23.

Laura Stoia, Donna K. Byron, Darla Magdalene Shockley, and Eric Fosler-Lussier. 2006. Sentence planning for realtime navigational instructions. In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers, NAACL-Short '06, pages 157–160, Stroudsburg, PA, USA. Association for Computational Linguistics.

Adam Vogel and Dan Jurafsky. 2010. Learning to follow navigational directions. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL '10, pages 806–814, Stroudsburg, PA, USA. Association for Computational Linguistics.
4 0.32180339 166 acl-2012-Qualitative Modeling of Spatial Prepositions and Motion Expressions
Author: Inderjeet Mani ; James Pustejovsky
Abstract: unknown-abstract
5 0.29734254 129 acl-2012-Learning High-Level Planning from Text
Author: S.R.K. Branavan ; Nate Kushman ; Tao Lei ; Regina Barzilay
Abstract: Comprehending action preconditions and effects is an essential step in modeling the dynamics of the world. In this paper, we express the semantics of precondition relations extracted from text in terms of planning operations. The challenge of modeling this connection is to ground language at the level of relations. This type of grounding enables us to create high-level plans based on language abstractions. Our model jointly learns to predict precondition relations from text and to perform high-level planning guided by those relations. We implement this idea in the reinforcement learning framework using feedback automatically obtained from plan execution attempts. When applied to a complex virtual world and text describing that world, our relation extraction technique performs on par with a supervised baseline, yielding an F-measure of 66% compared to the baseline's 65%. Additionally, we show that a high-level planner utilizing these extracted relations significantly outperforms a strong, text-unaware baseline, successfully completing 80% of planning tasks as compared to 69% for the baseline.
6 0.29496673 114 acl-2012-IRIS: a Chat-oriented Dialogue System based on the Vector Space Model
7 0.28588328 70 acl-2012-Demonstration of IlluMe: Creating Ambient According to Instant Message Logs
8 0.25294378 77 acl-2012-Ecological Evaluation of Persuasive Messages Using Google AdWords
9 0.25067315 160 acl-2012-Personalized Normalization for a Multilingual Chat System
10 0.24584985 138 acl-2012-LetsMT!: Cloud-Based Platform for Do-It-Yourself Machine Translation
11 0.22036208 164 acl-2012-Private Access to Phrase Tables for Statistical Machine Translation
12 0.22025394 149 acl-2012-Movie-DiC: a Movie Dialogue Corpus for Research and Development
13 0.21998131 43 acl-2012-Building Trainable Taggers in a Web-based, UIMA-Supported NLP Workbench
14 0.21984749 215 acl-2012-WizIE: A Best Practices Guided Development Environment for Information Extraction
15 0.21176597 182 acl-2012-Spice it up? Mining Refinements to Online Instructions from User Generated Content
16 0.19878635 26 acl-2012-Applications of GPC Rules and Character Structures in Games for Learning Chinese Characters
17 0.19159304 113 acl-2012-INPRO_iSS: A Component for Just-In-Time Incremental Speech Synthesis
18 0.19029608 44 acl-2012-CSNIPER - Annotation-by-query for Non-canonical Constructions in Large Corpora
19 0.18053897 13 acl-2012-A Graphical Interface for MT Evaluation and Error Analysis
20 0.17910381 219 acl-2012-langid.py: An Off-the-shelf Language Identification Tool
topicId topicWeight
[(25, 0.029), (26, 0.049), (28, 0.039), (30, 0.014), (37, 0.029), (39, 0.057), (59, 0.048), (74, 0.027), (82, 0.014), (84, 0.018), (85, 0.043), (90, 0.049), (92, 0.077), (94, 0.013), (98, 0.367), (99, 0.054)]
simIndex simValue paperId paperTitle
same-paper 1 0.81410974 24 acl-2012-A Web-based Evaluation Framework for Spatial Instruction-Giving Systems
Author: Srinivasan Janarthanam ; Oliver Lemon ; Xingkun Liu
Abstract: We demonstrate a web-based environment for development and testing of different pedestrian route instruction-giving systems. The environment contains a City Model, a TTS interface, a game-world, and a user GUI including a simulated street-view. We describe the environment and components, the metrics that can be used for the evaluation of pedestrian route instruction-giving systems, and the shared challenge which is being organised using this environment.
2 0.80435312 113 acl-2012-INPRO_iSS: A Component for Just-In-Time Incremental Speech Synthesis
Author: Timo Baumann ; David Schlangen
Abstract: We present a component for incremental speech synthesis (iSS) and a set of applications that demonstrate its capabilities. This component can be used to increase the responsivity and naturalness of spoken interactive systems. While iSS can show its full strength in systems that generate output incrementally, we also discuss how even otherwise unchanged systems may profit from its capabilities.
3 0.53073806 214 acl-2012-Verb Classification using Distributional Similarity in Syntactic and Semantic Structures
Author: Danilo Croce ; Alessandro Moschitti ; Roberto Basili ; Martha Palmer
Abstract: In this paper, we propose innovative representations for automatic classification of verbs according to mainstream linguistic theories, namely VerbNet and FrameNet. First, syntactic and semantic structures capturing essential lexical and syntactic properties of verbs are defined. Then, we design advanced similarity functions between such structures, i.e., semantic tree kernel functions, for exploiting distributional and grammatical information in Support Vector Machines. The extensive empirical analysis on VerbNet class and frame detection shows that our models capture meaningful syntactic/semantic structures, which allows for improving the state-of-the-art.
4 0.49718922 167 acl-2012-QuickView: NLP-based Tweet Search
Author: Xiaohua Liu ; Furu Wei ; Ming Zhou ; QuickView Team Microsoft
Abstract: Tweets have become a comprehensive repository for real-time information. However, it is often hard for users to quickly get information they are interested in from tweets, owing to the sheer volume of tweets as well as their noisy and informal nature. We present QuickView, an NLP-based tweet search platform to tackle this issue. Specifically, it exploits a series of natural language processing technologies, such as tweet normalization, named entity recognition, semantic role labeling, sentiment analysis, tweet classification, to extract useful information, i.e., named entities, events, opinions, etc., from a large volume of tweets. Then, non-noisy tweets, together with the mined information, are indexed, on top of which two brand new scenarios are enabled, i.e., categorized browsing and advanced search, allowing users to effectively access either the tweets or fine-grained information they are interested in.
Author: Eric Xing
Abstract: Probabilistic topic models have recently gained much popularity in information retrieval and related areas. Via such models, one can project high-dimensional objects such as text documents into a low-dimensional space where their latent semantics are captured and modeled; can integrate multiple sources of information to "share statistical strength" among components of a hierarchical probabilistic model; and can structurally display and classify the otherwise unstructured object collections. However, to many practitioners, how topic models work, what to and not to expect from a topic model, how they are different from and related to classical matrix algebraic techniques such as LSI and NMF in NLP, how to empower topic models to deal with complex scenarios such as multimodal data, contractual text in social media, evolving corpora, or presence of supervision such as labeling and rating, and how to make topic modeling computationally tractable even on web-scale data, in a principled way, remain unclear. In this tutorial, I will demystify the conceptual, mathematical, and computational issues behind all such problems surrounding the topic models and their applications by presenting a systematic overview of the mathematical foundation of topic modeling, and its connections to a number of related methods popular in other fields such as the LDA, admixture model, mixed membership model, latent space models, and sparse coding. I will offer a simple and unifying view of all these techniques under the framework of multi-view latent space embedding, and outline the roadmap of model extension and algorithmic design toward different applications in IR and NLP. A main theme of this tutorial, which ties together a wide range of issues and problems, will build on the "probabilistic graphical model" formalism, a formalism that exploits the conjoined talents of graph theory and probability theory to build complex models out of simpler pieces. I will use this formalism as a main aid to discuss both the mathematical underpinnings for the models and the related computational issues in a unified, simple, transparent, and actionable fashion.
6 0.32315412 36 acl-2012-BIUTEE: A Modular Open-Source System for Recognizing Textual Entailment
7 0.32274666 84 acl-2012-Estimating Compact Yet Rich Tree Insertion Grammars
8 0.32257774 174 acl-2012-Semantic Parsing with Bayesian Tree Transducers
9 0.32147694 59 acl-2012-Corpus-based Interpretation of Instructions in Virtual Environments
10 0.31938776 206 acl-2012-UWN: A Large Multilingual Lexical Knowledge Base
11 0.31599 132 acl-2012-Learning the Latent Semantics of a Concept from its Definition
12 0.31428564 80 acl-2012-Efficient Tree-based Approximation for Entailment Graph Learning
13 0.31189698 139 acl-2012-MIX Is Not a Tree-Adjoining Language
14 0.31115851 21 acl-2012-A System for Real-time Twitter Sentiment Analysis of 2012 U.S. Presidential Election Cycle
15 0.31003529 31 acl-2012-Authorship Attribution with Author-aware Topic Models
16 0.3097522 29 acl-2012-Assessing the Effect of Inconsistent Assessors on Summarization Evaluation
17 0.3093282 48 acl-2012-Classifying French Verbs Using French and English Lexical Resources
18 0.30789685 205 acl-2012-Tweet Recommendation with Graph Co-Ranking
19 0.30590525 154 acl-2012-Native Language Detection with Tree Substitution Grammars
20 0.30368656 104 acl-2012-Graph-based Semi-Supervised Learning Algorithms for NLP