acl acl2012 acl2012-129 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: S.R.K. Branavan ; Nate Kushman ; Tao Lei ; Regina Barzilay
Abstract: Comprehending action preconditions and effects is an essential step in modeling the dynamics of the world. In this paper, we express the semantics of precondition relations extracted from text in terms of planning operations. The challenge of modeling this connection is to ground language at the level of relations. This type of grounding enables us to create high-level plans based on language abstractions. Our model jointly learns to predict precondition relations from text and to perform high-level planning guided by those relations. We implement this idea in the reinforcement learning framework using feedback automatically obtained from plan execution attempts. When applied to a complex virtual world and text describing that world, our relation extraction technique performs on par with a supervised baseline, yielding an F-measure of 66% compared to the baseline’s 65%. Additionally, we show that a high-level planner utilizing these extracted relations significantly outperforms a strong, text unaware baseline successfully completing 80% of planning tasks as compared to 69% for the baseline.1 –
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract: Comprehending action preconditions and effects is an essential step in modeling the dynamics of the world. [sent-6, score-0.32]
2 In this paper, we express the semantics of precondition relations extracted from text in terms of planning operations. [sent-7, score-1.096]
3 This type of grounding enables us to create high-level plans based on language abstractions. [sent-9, score-0.242]
4 Our model jointly learns to predict precondition relations from text and to perform high-level planning guided by those relations. [sent-10, score-1.109]
5 When applied to a complex virtual world and text describing that world, our relation extraction technique performs on par with a supervised baseline, yielding an F-measure of 66% compared to the baseline’s 65%. [sent-12, score-0.285]
6 Additionally, we show that a high-level planner utilizing these extracted relations significantly outperforms a strong, text unaware baseline successfully completing 80% of planning tasks as compared to 69% for the baseline. [sent-13, score-0.93]
7 1 Introduction. Understanding action preconditions and effects is a basic step in modeling the dynamics of the world. [sent-14, score-0.32]
8 For example, having seeds is a precondition for growing wheat. [sent-15, score-0.485]
9 In NLP, precondition relations have been studied in terms of the linguistic mechanisms that realize them. (Footnote 1: The code, data and experimental setup for this work are available at http://groups. [sent-18, score-0.601]
10 Figure 1: Text description of preconditions and effects (a), and the low-level actions connecting them (b). (a) Low-level actions for: wood → pickaxe → stone. [sent-22, score-0.563]
11 In this paper, we bring these two parallel views together, grounding the linguistic realization of these relations in the semantics of planning operations. [sent-24, score-0.726]
12 The challenge and opportunity of this fusion comes from the mismatch between the abstractions of human language and the granularity of planning primitives. [sent-25, score-0.453]
13 Consider, for example, text describing a virtual world such as Minecraft2 and a formal description of that world using planning primitives. [sent-26, score-0.707]
14 Due to the mismatch in granularity, even the simple relations between wood, pickaxe and stone described in the sentence in Figure 1a result in dozens of low-level planning actions in the world, as can be seen in Figure 1b. [sent-27, score-0.901]
15 On the other hand, planning with low-level actions does not suffer from this limitation, but is computationally intractable for even moderately complex tasks. [sent-29, score-0.511]
16 As a consequence, in many practical domains, planning algorithms rely on manually-crafted high-level plans. (Footnote 2: http://www. [sent-30, score-0.413]
17 The central idea of our work is to express the semantics of precondition relations extracted from text in terms of planning operations. [sent-36, score-1.096]
18 For instance, the precondition relation between pickaxe and stone described in the sentence in Figure 1a indicates that plans which involve obtaining stone will likely need to first obtain a pickaxe. [sent-37, score-0.902]
19 We build on the intuition that the validity of precondition relations extracted from text can be informed by the execution of a low-level planner. [sent-39, score-0.671]
20 Moreover, we can use the learned relations to guide a high level planner and ultimately improve planning performance. [sent-41, score-0.859]
21 We implement these ideas in the reinforcement learning framework, wherein our model jointly learns to predict precondition relations from text and to perform high-level planning guided by those relations. [sent-42, score-1.202]
22 For a given planning task and a set of candidate relations, our model repeatedly predicts a sequence of subgoals where each subgoal specifies an attribute of the world that must be made true. [sent-43, score-1.049]
23 It then asks the low-level planner to find a plan between each consecutive pair of subgoals in the sequence. [sent-44, score-0.558]
24 The observed feedback (whether the low-level planner succeeded or failed at each step) is utilized to update the policy for both text analysis and high-level planning. [sent-45, score-0.551]
25 Our results demonstrate the strength of our relation extraction technique: while using planning feedback as its only source of supervision, it achieves a precondition relation extraction accuracy on par with that of a supervised SVM baseline. [sent-47, score-1.073]
26 As baselines (footnote 3: if a planner can find a plan to successfully obtain stone after obtaining a pickaxe, then a pickaxe is likely a precondition for stone). [sent-50, score-1.101]
27 Conversely, if a planner obtains stone without first obtaining a pickaxe, then it is likely not a precondition. [sent-51, score-0.407]
28 For this evaluation, we employ as baselines the Metric-FF planner (Hoffmann and Nebel, 2001), as well as a text-unaware variant of our model. [sent-52, score-0.33]
29 Our results show that our text-driven high-level planner significantly outperforms all baselines in terms of completed planning tasks: it successfully solves 80%, as compared to 41% for the Metric-FF planner and 69% for the text-unaware variant of our model. [sent-53, score-1.14]
30 In fact, the performance of our method approaches that of an oracle planner which uses manually-annotated preconditions. [sent-54, score-0.33]
31 2 Related Work Extracting Event Semantics from Text The task of extracting preconditions and effects has previously been addressed in the context of lexical semantics (Sil et al. [sent-55, score-0.254]
32 However, our only source of supervision is the feedback provided by the planning task which utilizes the predictions. [sent-62, score-0.484]
33 Text (input): A pickaxe, which is used to harvest stone, can be made from wood. [sent-75, score-0.253]
34 Precondition Relations: wood → pickaxe, pickaxe → stone. Subgoal sequence: initial state → (subgoal 1: pickaxe) → (subgoal 2: stone) → (goal). Figure 2: A high-level plan showing two subgoals in a precondition relation. [sent-76, score-0.713]
35 From this text, it extracts relations between objects in the world which hold independently of any given task. [sent-85, score-0.267]
36 Task-specific solutions are then constructed by a planner that relies on these relations to perform effective high-level planning. [sent-86, score-0.446]
37 Hierarchical Planning It is widely accepted that high-level plans that factorize a planning problem can greatly reduce the corresponding search space (Newell et al. [sent-87, score-0.5]
38 Previous work in planning has studied the theoretical properties of valid abstractions and proposed a number of techniques for generating them (Jonsson and Barto, 2005; Wolfe and Barto, 2005; Mehta et al. [sent-89, score-0.498]
39 As input, we are given a world defined by the tuple ⟨S, A, T⟩, where S is the set of possible world states, A is the set of possible actions and T is a deterministic state transition function. [sent-99, score-0.247]
40 Given our definition of the world state, preconditions and effects are merely single-term predicates, xi, in this world state. [sent-106, score-0.418]
41 Thus, for each predicate pair ⟨xk, xl⟩, we want to utilize the text to predict whether xk is a precondition for xl; i.e. [sent-108, score-0.619]
42 For example, from the text in Figure 2, we want to predict that possessing a pickaxe is a precondition for possessing stone. [sent-111, score-0.776]
43 Each planning goal g ∈ G is defined by a starting state s0g and a final goal state sfg. [sent-115, score-0.547]
44 In the planning part of our task, our objective is to find a sequence of actions that connects s0g to sfg. [sent-117, score-0.542]
45 We define a high-level plan as a sequence of subgoals, where each subgoal is represented by a single-term predicate, xi, that needs to be set in the corresponding world state. [sent-120, score-0.535]
46 Thus the set of possible subgoals is defined by the set of all possible single-term predicates in the domain. [sent-123, score-0.247]
47 Our algorithm for textually informed high-level planning operates in four steps: 1. [sent-125, score-0.413]
48 Use text to predict the preconditions of each subgoal. [sent-126, score-0.28]
49 Given a planning goal and the induced preconditions, predict a subgoal sequence that achieves the given goal. [sent-129, score-0.813]
50 Execute the predicted sequence by giving each pair of consecutive subgoals to a low-level planner. [sent-131, score-0.246]
51 This planner, treated as a black-box, computes the low-level plan actions necessary to transition from one subgoal to the next. [sent-132, score-0.432]
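The loop described by these steps can be sketched as follows. All of the component names here (`extract_preconditions`, `predict_subgoals`, the planner and model objects) are hypothetical stand-ins for the pieces described in the text, not the authors' released code; the update step from planner feedback is the one described earlier in the reinforcement learning setup.

```python
# Sketch of the textually informed high-level planning loop.
# Component functions are illustrative stand-ins, not the authors' code.

def run_task(document, goal, low_level_planner, model):
    # 1. Use text to predict the preconditions of each subgoal.
    preconditions = model.extract_preconditions(document)
    # 2. Predict a subgoal sequence that achieves the given goal.
    subgoals = model.predict_subgoals(goal, preconditions)
    # 3. Execute each consecutive pair of subgoals with the
    #    black-box low-level planner, recording success/failure.
    feedback = []
    for prev, nxt in zip(subgoals, subgoals[1:]):
        success = low_level_planner.plan(prev, nxt)
        feedback.append((prev, nxt, success))
        if not success:
            break
    # 4. Use the observed feedback to update both policies.
    model.update(feedback)
    return all(ok for _, _, ok in feedback)
```

A failed low-level subtask ends the attempt early, so the feedback localizes which subgoal transition was invalid.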
52 Modeling Precondition Relations: Given a document d, and a set of subgoal pairs ⟨xi, xj⟩, we want to predict whether subgoal xi is a precondition for xj. [sent-136, score-1.166]
53 We assume that precondition relations are generally described within single sentences. [sent-137, score-0.601]
54 We first use our seed grounding in a preprocessing step where we extract all predicate pairs where both predicates are mentioned in the same sentence. [sent-138, score-0.294]
55 Note that this set will contain many invalid relations since co-occurrence in a sentence does not necessarily imply a valid precondition relation. [sent-140, score-0.68]
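This preprocessing step can be sketched as below. The grounding format (a predicate name mapped to the surface words that realize it) is an assumed representation for illustration; any sentence mentioning two grounded predicates yields both ordered candidate pairs, including the invalid ones the model must later filter out.

```python
from itertools import permutations

def candidate_relations(sentences, grounding):
    """Extract candidate precondition pairs: every ordered pair of
    predicates mentioned together in a sentence. `grounding` maps a
    predicate name to its surface words (illustrative format)."""
    candidates = set()
    for sent in sentences:
        words = set(sent.lower().split())
        mentioned = [p for p, terms in grounding.items()
                     if words & set(terms)]
        # co-occurrence alone does not fix a direction, so emit
        # both orderings as candidates
        for xi, xj in permutations(mentioned, 2):
            candidates.add((xi, xj))
    return candidates
```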
56 Input: a document d, a set of planning tasks G, a set of candidate precondition relations C_all, a reward function r(), and the number of iterations T. Initialization: model parameters θx = 0 and θc = 0. [sent-144, score-1.046]
57 for i = 1 · · · T do
    Sample valid preconditions: C ← ∅
    foreach ⟨xi, xj⟩ ∈ C_all do
      foreach sentence w̄k containing xi and xj do
        v ∼ p(xi → xj | w̄k, qk; θc)
        if v = 1 then C ← C ∪ {⟨xi, xj⟩}
      end
    end
    Predict subgoal sequences for each task g. [sent-145, score-0.529]
58 foreach g ∈ G do
    Sample a subgoal sequence as follows:
    for t = 1 · · · n do
      Sample next subgoal: xt ∼ p(x | xt−1, s0g, sfg, C; θx)
      Construct low-level subtask from xt−1 to xt
      Execute low-level planner on subtask
    end
    Update subgoal prediction model using Eqn. [sent-146, score-0.845]
59 Given these per-sentence decisions, we predict the set of all valid precondition relations, C, in a deterministic fashion. [sent-150, score-0.585]
60 We do this by considering a precondition xi → xj as valid if it is predicted to be valid by at least one sentence. [sent-151, score-0.691]
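The per-sentence decision and this deterministic OR-aggregation can be sketched as follows. The feature names are illustrative, and the per-sentence decision is thresholded here for determinism, whereas the algorithm above samples it during learning.

```python
import math

def sentence_prob(features, theta_c):
    """Log-linear probability that one sentence asserts xi -> xj.
    Feature names and weights are illustrative stand-ins."""
    score = sum(theta_c.get(f, 0.0) for f in features)
    return 1.0 / (1.0 + math.exp(-score))

def aggregate_relations(per_sentence):
    """A relation is valid if at least one sentence predicts it.
    per_sentence: iterable of ((xi, xj), probability) pairs."""
    valid = set()
    for pair, p in per_sentence:
        if p >= 0.5:   # thresholded here; sampled in the algorithm
            valid.add(pair)
    return valid
```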
61 Modeling Subgoal Sequences Given a planning goal g, defined by initial and final goal states s0g and sfg, our task is to predict a sequence of subgoals which will achieve the goal. [sent-152, score-0.733]
62 We condition this decision on our predicted set of valid preconditions C, by modeling the distribution over subgoal sequences as:
p(x̄ | s0g, sfg, C; θx) = ∏_{t=1}^{n} p(xt | xt−1, s0g, sfg, C; θx),
where p(xt | xt−1, s0g, sfg, C; θx) ∝ exp(θx · φx(xt, xt−1, s0g, sfg, C)). [sent-153, score-0.263]
63 Here we assume that subgoal sequences are Markovian in nature and model individual subgoal predictions using a log-linear model. [sent-154, score-0.608]
64 Additionally, we assume a special stop symbol, x∅, which indicates the end of the subgoal sequence. [sent-157, score-0.288]
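The log-linear next-subgoal distribution, including the stop symbol, can be sketched as below. The feature map and the "STOP" symbol are illustrative stand-ins for φx and x∅ in the text.

```python
import math

def next_subgoal_dist(prev, preconds, theta, subgoals):
    """Distribution over the next subgoal under a log-linear model:
    p(x_t | x_{t-1}, C) proportional to exp(theta . phi(x_t, x_{t-1}, C))."""
    def phi(x):
        feats = [("pair", prev, x)]        # Cartesian-product feature
        if (prev, x) in preconds:          # induced precondition edge
            feats.append(("precond-edge",))
        return feats
    scores = {x: math.exp(sum(theta.get(f, 0.0) for f in phi(x)))
              for x in list(subgoals) + ["STOP"]}
    z = sum(scores.values())
    return {x: s / z for x, s in scores.items()}
```

A positive weight on the precondition-edge feature shifts probability mass toward subgoals that the extracted relations say are enabled by the previous one.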
65 Specifically, once the model has predicted a subgoal sequence for a given goal, the sequence is given to the low-level planner for execution. [sent-159, score-0.713]
66 Therefore, subgoal pairs ⟨xk, xl⟩, where xl is reachable from xk, are given positive reward. [sent-168, score-0.316]
67 The objective of the planning component of our model is to predict subgoal sequences that successfully achieve the given planning goals. [sent-174, score-1.169]
68 Thus we directly use plan-success as a binary reward signal, which is applied to each subgoal decision in a sequence. [sent-175, score-0.344]
69 where t indexes into the subgoal sequence and αx is the learning rate (Eqn. 3). [sent-178, score-0.319]
70 Figure 3: Example of the precondition dependencies present in the Minecraft domain. [sent-179, score-0.485]
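The gradient update applied to each subgoal decision is a standard policy-gradient step for a log-linear policy. A sketch, under the assumption that each decision records its chosen features and the full candidate distribution (this data layout is illustrative, not the authors'):

```python
def policy_gradient_update(theta, decisions, reward, alpha=0.1):
    """REINFORCE-style update for a log-linear policy: for each
    decision, theta += alpha * r * (phi(chosen) - E_p[phi]).
    decisions: list of (chosen_feats, {cand: (prob, feats)})."""
    for chosen_feats, dist in decisions:
        # expected feature counts under the current policy
        expected = {}
        for prob, feats in dist.values():
            for f in feats:
                expected[f] = expected.get(f, 0.0) + prob
        # gradient of log p(chosen): observed minus expected features
        grad = {}
        for f in chosen_feats:
            grad[f] = grad.get(f, 0.0) + 1.0
        for f, e in expected.items():
            grad[f] = grad.get(f, 0.0) - e
        for f, g in grad.items():
            theta[f] = theta.get(f, 0.0) + alpha * reward * g
    return theta
```

With the binary plan-success reward, features of decisions in successful sequences are reinforced and those in failed ones are suppressed.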
71 Defining the Domain: In order to execute a traditional planner on the Minecraft domain, we define the domain using the Planning Domain Definition Language (PDDL) (Fox and Long, 2003). [sent-187, score-0.408]
72 Our subgoals xi and our task goals sfg map directly to these predicates. [sent-190, score-0.308]
73 Table 1 compares the complexity of our domain with some typical planning domains used in the IPC. [sent-192, score-0.455]
74 Low-level Planner: As our low-level planner we employ Metric-FF (Hoffmann and Nebel, 2001), the state-of-the-art baseline used in the 2008 International Planning Competition. [sent-195, score-0.361]
75 The sequence prediction component takes as input both the preconditions induced by the text component as well as the planning state and the previous subgoal. [sent-204, score-0.71]
76 Thus φx contains features which check whether two subgoals are connected via an induced precondition relation, in addition to features which are simply the Cartesian product of domain predicates. [sent-205, score-0.709]
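This feature map can be sketched as follows; the concrete feature keys are illustrative, not the authors' feature set.

```python
def phi_x(x_prev, x_next, preconds):
    """Illustrative stand-in for phi_x: an indicator for each ordered
    pair of domain predicates (the Cartesian-product features), plus
    a feature that fires when the pair is connected by an induced
    precondition relation."""
    feats = {("pair", x_prev, x_next): 1.0}
    if (x_prev, x_next) in preconds:
        feats[("precond-edge",)] = 1.0
    return feats
```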
77 Our manually constructed seed grounding of predicates contains 74 entries, examples of which can be seen in Table 3. [sent-207, score-0.255]
78 We use this seed grounding to identify a set of 242 sentences that reference predicates in the Minecraft domain. [sent-208, score-0.255]
79 On average, execution of the sequence of low-level plans takes 35 actions, with 3 actions for the shortest plan and 123 actions for the longest. [sent-222, score-0.36]
80 For evaluation purposes we manually identify a set of Gold Relations consisting of all precondition relations that are valid in this domain, including those not discussed in the text. [sent-225, score-0.646]
81 This evaluation measure simply assesses whether the planner completes a task within a predefined time. [sent-228, score-0.33]
82 The FF baseline directly runs the Metric-FF planner on the given task, while the No Text baseline is a variant of our model that learns to plan in the reinforcement learning framework. [sent-233, score-0.531]
83 Seeds for growing wheat can be obtained by breaking tall grass ✘ (false negative). Figure 4: Examples of precondition relations predicted by our model from text. [sent-236, score-0.664]
84 Figure 5: The performance of our model and a supervised SVM baseline on the precondition prediction task. [sent-239, score-0.541]
85 Each of these trials is run for 200 learning iterations with a maximum subgoal sequence length of 10. [sent-250, score-0.319]
86 To find a low-level plan between each consecutive pair of subgoals, our high-level planner internally uses Metric-FF. [sent-251, score-0.376]
87 This is an upper bound on the time that our high-level planner can take over the 200 learning iterations, with subgoal sequences of length at most 10 and a one minute timeout. [sent-257, score-0.618]
88 These results support our hypothesis that planning feedback is a powerful source of supervision for analyzing a given text corpus. [sent-264, score-0.524]
89 Planning Performance: As shown in Table 4, our text-enriched planning model outperforms the text-free baselines by more than 10%. [sent-266, score-0.44]
90 The gap of 35% between Manual Text and Gold shows the importance of the precondition information that is missing from the text. [sent-271, score-0.485]
91 Figure 6 breaks down the results based on the difficulty of the corresponding planning task. [sent-273, score-0.413]
92 However, performance varies greatly on the more challenging tasks, directly correlating with planner sophistication. [sent-277, score-0.33]
93 Both models picked up on the words that indicate precondition relations in this domain. [sent-280, score-0.601]
94 Figure 7: The top words and dependency types learned by our model (above) and by SVM (below) for precondition prediction. [sent-284, score-0.485]
95 8 Conclusions In this paper, we presented a novel technique for inducing precondition relations from text by grounding them in the semantics of planning operations. [sent-289, score-1.251]
96 While using planning feedback as its only source of supervision, our method for relation extraction achieves a performance on par with that of a supervised baseline. [sent-290, score-0.548]
97 Furthermore, relation grounding provides a new view on classical planning problems which enables us to create high-level plans based on language abstractions. [sent-291, score-0.695]
98 PDDL2.1: An extension to PDDL for expressing temporal planning domains. [sent-359, score-0.443]
99 The FF planning system: Fast plan generation through heuristic search. [sent-373, score-0.459]
100 Identifying useful subgoals in reinforcement learning by local graph partitioning. [sent-459, score-0.275]
wordName wordTfidf (topN-words)
[('precondition', 0.485), ('planning', 0.413), ('planner', 0.33), ('subgoal', 0.288), ('preconditions', 0.185), ('subgoals', 0.182), ('grounding', 0.155), ('minecraft', 0.136), ('pickaxe', 0.136), ('relations', 0.116), ('branavan', 0.106), ('world', 0.103), ('actions', 0.098), ('reinforcement', 0.093), ('plans', 0.087), ('stone', 0.077), ('sfg', 0.076), ('sutton', 0.066), ('predicates', 0.065), ('action', 0.063), ('barto', 0.061), ('craft', 0.061), ('lowlevel', 0.061), ('sil', 0.061), ('xt', 0.06), ('reward', 0.056), ('predict', 0.055), ('xi', 0.05), ('virtual', 0.048), ('objects', 0.048), ('plan', 0.046), ('beamer', 0.045), ('blanco', 0.045), ('hxi', 0.045), ('qk', 0.045), ('causal', 0.045), ('valid', 0.045), ('dynamics', 0.045), ('policy', 0.043), ('girju', 0.043), ('semantics', 0.042), ('domain', 0.042), ('feedback', 0.041), ('state', 0.041), ('relation', 0.04), ('text', 0.04), ('ff', 0.04), ('abstractions', 0.04), ('xji', 0.04), ('wood', 0.04), ('predicate', 0.039), ('causality', 0.036), ('execute', 0.036), ('update', 0.036), ('regina', 0.036), ('svm', 0.036), ('seed', 0.035), ('invalid', 0.034), ('gradient', 0.034), ('predicted', 0.033), ('xj', 0.033), ('candidate', 0.032), ('situated', 0.032), ('instructions', 0.032), ('predictions', 0.032), ('sequence', 0.031), ('baseline', 0.031), ('event', 0.03), ('execution', 0.03), ('avirup', 0.03), ('bacchus', 0.03), ('jonsson', 0.03), ('kaelbling', 0.03), ('lekav', 0.03), ('nebel', 0.03), ('newell', 0.03), ('pddl', 0.03), ('planners', 0.03), ('possessing', 0.03), ('wheat', 0.03), ('wolfe', 0.03), ('xli', 0.03), ('supervision', 0.03), ('vogel', 0.029), ('par', 0.029), ('hoffmann', 0.029), ('xl', 0.028), ('manual', 0.028), ('effects', 0.027), ('baselines', 0.027), ('iron', 0.026), ('hxk', 0.026), ('fleischman', 0.026), ('mehta', 0.026), ('ghallab', 0.026), ('highlevel', 0.026), ('leslie', 0.026), ('goal', 0.026), ('supervised', 0.025), ('gold', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999964 129 acl-2012-Learning High-Level Planning from Text
Author: S.R.K. Branavan ; Nate Kushman ; Tao Lei ; Regina Barzilay
Abstract: Comprehending action preconditions and effects is an essential step in modeling the dynamics of the world. In this paper, we express the semantics of precondition relations extracted from text in terms of planning operations. The challenge of modeling this connection is to ground language at the level of relations. This type of grounding enables us to create high-level plans based on language abstractions. Our model jointly learns to predict precondition relations from text and to perform high-level planning guided by those relations. We implement this idea in the reinforcement learning framework using feedback automatically obtained from plan execution attempts. When applied to a complex virtual world and text describing that world, our relation extraction technique performs on par with a supervised baseline, yielding an F-measure of 66% compared to the baseline’s 65%. Additionally, we show that a high-level planner utilizing these extracted relations significantly outperforms a strong, text unaware baseline successfully completing 80% of planning tasks as compared to 69% for the baseline.1 –
2 0.18842202 59 acl-2012-Corpus-based Interpretation of Instructions in Virtual Environments
Author: Luciana Benotti ; Martin Villalba ; Tessa Lau ; Julian Cerruti
Abstract: Previous approaches to instruction interpretation have required either extensive domain adaptation or manually annotated corpora. This paper presents a novel approach to instruction interpretation that leverages a large amount of unannotated, easy-to-collect data from humans interacting with a virtual world. We compare several algorithms for automatically segmenting and discretizing this data into (utterance, reaction) pairs and training a classifier to predict reactions given the next utterance. Our empirical analysis shows that the best algorithm achieves 70% accuracy on this task, with no manual annotation required. 1 Introduction and motivation Mapping instructions into automatically executable actions would enable the creation of natural language interfaces to many applications (Lau et al., 2009; Branavan et al., 2009; Orkin and Roy, 2009). In this paper, we focus on the task of navigation and manipulation of a virtual environment (Vogel and Jurafsky, 2010; Chen and Mooney, 2011). Current symbolic approaches to the problem are brittle to the natural language variation present in instructions and require intensive rule authoring to be fit for a new task (Dzikovska et al., 2008). Current statistical approaches require extensive manual annotations of the corpora used for training (MacMahon et al., 2006; Matuszek et al., 2010; Gorniak and Roy, 2007; Rieser and Lemon, 2010). Manual annotation and rule authoring by natural language engineering experts are bottlenecks for developing conversational systems for new domains. This paper proposes a fully automated approach to interpreting natural language instructions to complete a task in a virtual world based on unsupervised recordings of human-human interactions performing that task in that virtual world.
Given unannotated corpora collected from humans following other humans’ instructions, our system automatically segments the corpus into labeled training data for a classification algorithm. Our interpretation algorithm is based on the observation that similar instructions uttered in similar contexts should lead to similar actions being taken in the virtual world. Given a previously unseen instruction, our system outputs actions that can be directly executed in the virtual world, based on what humans did when given similar instructions in the past. 2 Corpora situated in virtual worlds Our environment consists of six virtual worlds designed for the natural language generation shared task known as the GIVE Challenge (Koller et al., 2010), where a pair of partners must collaborate to solve a task in a 3D space (Figure 1). The “instruction follower” (IF) can move around in the virtual world, but has no knowledge of the task. The “instruction giver” (IG) types instructions to the IF in order to guide him to accomplish the task. Each corpus contains the IF’s actions and position recorded every 200 milliseconds, as well as the IG’s instructions with their timestamps. We used two corpora for our experiments. The Cm corpus (Gargett et al., 2010) contains instructions given by multiple people, consisting of 37 games spanning 2163 instructions over 8:17 hs. (Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pages 181–186, Jeju, Republic of Korea, 8-14 July 2012.) Figure 1: A screenshot of a virtual world. The world consists of interconnecting hallways, rooms and objects. The Cs corpus (Benotti and Denis, 2011), gathered using a single IG, is composed of 63 games and 3417 instructions, and was recorded in a span of 6:09 hs. It took less than 15 hours to collect the corpora through the web and the subjects reported that the experiment was fun.
While the environment is restricted, people describe the same route and the same objects in extremely different ways. Below are some examples of instructions from our corpus all given for the same route shown in Figure 1. 1) out 2) walk down the passage 3) nowgo [sic] to the pink room 4) back to the room with the plant 5) Go through the door on the left 6) go through opening with yellow wall paper People describe routes using landmarks (4) or specific actions (2). They may describe the same object differently (5 vs 6). Instructions also differ in their scope (3 vs 1). Thus, even ignoring spelling and grammatical errors, navigation instructions contain considerable variation which makes interpreting them a challenging problem. 3 Learning from previous interpretations Our algorithm consists of two phases: annotation and interpretation. Annotation is performed only once and consists of automatically associating each IG instruction to an IF reaction. Interpretation is performed every time the system receives an instruc182 tion and consists of predicting an appropriate reaction given reactions observed in the corpus. Our method is based on the assumption that a reaction captures the semantics of the instruction that caused it. Therefore, if two utterances result in the same reaction, they are paraphrases of each other, and similar utterances should generate the same reaction. This approach enables us to predict reactions for previously-unseen instructions. 3.1 Annotation phase The key challenge in learning from massive amounts of easily-collected data is to automatically annotate an unannotated corpus. Our annotation method consists of two parts: first, segmenting a low-level interaction trace into utterances and corresponding reactions, and second, discretizing those reactions into canonical action sequences. Segmentation enables our algorithm to learn from traces of IFs interacting directly with a virtual world. 
Since the IF can move freely in the virtual world, his actions are a stream of continuous behavior. Segmentation divides these traces into reactions that follow from each utterance of the IG. Consider the following example starting at the situation shown in Figure 1: IG(1): go through the yellow opening IF(2): [walks out of the room] IF(3): [turns left at the intersection] IF(4): [enters the room with the sofa] IG(5): stop It is not clear whether the IF is doing ⟨3, 4⟩ because he is reacting to 1 or because he is being proactive. While one could manually annotate this data to remove extraneous actions, our goal is to develop automated solutions that enable learning from massive amounts of data. We decided to approach this problem by experimenting with two alternative formal definitions: 1) a strict definition that considers the maximum reaction according to the IF behavior, and 2) a loose definition based on the empirical observation that, in situated interaction, most instructions are constrained by the current visually perceived affordances (Gibson, 1979; Stoia et al., 2006). We formally define behavior segmentation (Bhv) as follows. A reaction rk to an instruction uk begins right after the instruction uk is uttered and ends right before the next instruction uk+1 is uttered. In the example, instruction 1 corresponds to ⟨2, 3, 4⟩. We formally define visibility segmentation (Vis) as follows. A reaction rk to an instruction uk begins right after the instruction uk is uttered and ends right before the next instruction uk+1 is uttered or right after the IF leaves the area visible at 360◦ from where uk was uttered. In the example, instruction 1’s reaction would be limited to ⟨2⟩ because the intersection is not visible from where the instruction was uttered.
However, users frequently perform noisy behavior that is irrelevant to the goal of the task. For example, after hearing an instruction, an IF might go into the wrong room, realize the error, and leave the room. A reaction should not in- clude such irrelevant actions. In addition, IFs may accomplish the same goal using different behaviors: two different IFs may interpret “go to the pink room” by following different paths to the same destination. We would like to be able to generalize both reactions into one canonical reaction. As a result, our approach discretizes reactions into higher-level action sequences with less noise and less variation. Our discretization algorithm uses an automated planner and a planning representation of the task. This planning representation includes: (1) the task goal, (2) the actions which can be taken in the virtual world, and (3) the current state of the virtual world. Using the planning representation, the planner calculates an optimal path between the starting and ending states of the reaction, eliminating all unnecessary actions. While we use the classical planner FF (Hoffmann, 2003), our technique could also work with classical planning (Nau et al., 2004) or other techniques such as probabilistic planning (Bonet and Geffner, 2005). It is also not dependent on a particular discretization of the world in terms of actions. Now we are ready to define canonical reaction ck formally. Let Sk be the state of the virtual world when instruction uk was uttered, Sk+1 be the state of the world where the reaction ends (as defined by Bhv or Vis segmentation), and D be the planning domain representation of the virtual world. The canonical reaction to uk is defined as the sequence of actions 183 returned by the planner with Sk as initial state, Sk+1 as goal state and D as planning domain. 3.2 Interpretation phase The annotation phase results in a collection of (uk, ck) pairs. 
The interpretation phase uses these pairs to interpret new utterances in three steps. First, we filter the set of pairs into those whose reactions can be directly executed from the current IF position. Second, we group the filtered pairs according to their reactions. Third, we select the group with utterances most similar to the new utterance, and output that group’s reaction. Figure 2 shows the output of the first two steps: three groups of pairs whose reactions can all be executed from the IF’s current position. Figure 2: Utterance groups for this situation. Colored arrows show the reaction associated with each group. We treat the third step, selecting the most similar group for a new utterance, as a classification problem. We compare three different classification methods. One method uses nearest-neighbor classification with three different similarity metrics: Jaccard and Overlap coefficients (both of which measure the degree of overlap between two sets, differing only in the normalization of the final value (Nikravesh et al., 2005)), and Levenshtein Distance (a string metric for measuring the amount of differences between two sequences of words (Levenshtein, 1966)). Our second classification method employs a strategy in which we considered each group as a set of possible machine translations of our utterance, using the BLEU measure (Papineni et al., 2002) to select which group could be considered the best translation of our utterance. Finally, we trained an SVM classifier (Cortes and Vapnik, 1995) using the unigrams of each paraphrase and the position of the IF as features, and setting their group as the output class using a libSVM wrapper (Chang and Lin, 2011).

Table 1: Accuracy comparison between Cm and Cs for Bhv and Vis segmentation

              Cm Bhv   Cm Vis   Cs Bhv   Cs Vis
  Jaccard       47%      54%      54%      70%
  Overlap       43%      53%      45%      60%
  BLEU          44%      52%      54%      50%
  SVM           33%      29%      45%      29%
  Levenshtein   21%      20%       8%      17%
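The two set-based similarity metrics mentioned here have standard definitions, sketched below on word sets (a minimal illustration, not the paper's implementation):

```python
def jaccard(a, b):
    """Jaccard coefficient: intersection over union of two word sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def overlap(a, b):
    """Overlap coefficient: intersection over the smaller set, so a
    subset scores 1.0 regardless of the larger set's size."""
    a, b = set(a), set(b)
    return len(a & b) / min(len(a), len(b)) if a and b else 0.0
```

The two differ only in normalization, which is exactly the distinction the text draws.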
When the system misinterprets an instruction, we use an approach similar to what people do to overcome misunderstandings: if the system executes an incorrect reaction, the IG can tell the system to cancel its current interpretation and try again using a paraphrase, selecting a different reaction.

4 Evaluation

For the evaluation phase, we annotated both the Cm and Cs corpora entirely, and then split them in an 80/20 proportion; the first 80% of data collected in each virtual world was used for training, while the remaining 20% was used for testing. For each pair (uk, ck) in the testing set, we used our algorithm to predict the reaction to the selected utterance, and then compared this result against the automatically annotated reaction. Table 1 shows the results.

Comparing the Bhv and Vis segmentation strategies, Vis tends to obtain better results than Bhv. In addition, accuracy on the Cs corpus was generally higher than on Cm. Given that Cs contained only one IG, we believe this led to less variability in the instructions and less noise in the training data.

We evaluated the impact of user corrections by simulating them using the existing corpus. In case of a wrong response, the algorithm receives a second utterance with the same reaction (a paraphrase of the previous one). The new utterance is then tested over the same set of possible groups, except for the one which was returned before. If the correct reaction is not predicted after four tries, or there are no remaining utterances with the same reaction, the prediction is registered as wrong. To measure the effect of user corrections, we used a different evaluation process for this algorithm: first, we split the corpus in a 50/50 proportion, and then we moved correctly predicted utterances from the testing set to the training set, until either there was nothing more to learn or the training set reached 80% of the entire corpus size.
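The correction simulation can be sketched as a retry loop that excludes previously returned groups (the `jaccard` nearest-neighbor selector, group names, and button utterances here are invented stand-ins for the paper's classifiers and corpus data):

```python
def jaccard(a, b):
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b) if a | b else 0.0

def nearest_group(utterance, groups):
    # Group whose closest member utterance is most similar to the input.
    return max(groups, key=lambda r: max(jaccard(utterance, u) for u in groups[r]))

def with_corrections(gold, paraphrases, groups, max_tries=4):
    # After each wrong guess the IG supplies a paraphrase and the group
    # returned before is excluded; give up after max_tries attempts or
    # when the paraphrases run out.
    excluded = set()
    for utterance in paraphrases[:max_tries]:
        live = {r: us for r, us in groups.items() if r not in excluded}
        if not live:
            return False
        guess = nearest_group(utterance, live)
        if guess == gold:
            return True
        excluded.add(guess)
    return False

groups = {
    "press-red": ["press the red button"],
    "press-blue": ["press the blue button"],
}
# The first paraphrase is misleading, so the first guess is wrong; the
# correction resolves it on the second try.
with_corrections("press-blue",
                 ["press the red one please", "hit the blue button"],
                 groups)  # -> True
```

Shrinking the candidate set after each wrong guess is what makes accuracy climb so quickly with corrections: even a weak classifier is eventually forced onto the right group.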
As expected, user corrections significantly improve accuracy, as shown in Figure 3. The results of the worst algorithms improve linearly with each try, while the best ones behave asymptotically, barely improving after the second try. The best algorithm reaches 92% with just one correction from the IG.

5 Discussion and future work

We presented an approach to instruction interpretation which learns from non-annotated logs of human behavior. Our empirical analysis shows that our best algorithm achieves 70% accuracy on this task, with no manual annotation required. When corrections are added, accuracy goes up to 92% with just one correction. We consider our results promising, since state-of-the-art semi-unsupervised approaches to instruction interpretation (Chen and Mooney, 2011) report a 55% accuracy on manually segmented data.

We plan to compare our system’s performance against human performance in comparable situations. Our informal observations of the GIVE corpus indicate that humans often follow instructions incorrectly, so our automated system’s performance may be on par with human performance. Although we have presented our approach in the context of 3D virtual worlds, we believe our technique is also applicable to other domains such as the web, video games, or human-robot interaction.

Figure 3: Accuracy values with corrections over Cs

References

Luciana Benotti and Alexandre Denis. 2011. CL system: Giving instructions by corpus based selection. In Proceedings of the Generation Challenges Session at the 13th European Workshop on Natural Language Generation, pages 296–301, Nancy, France, September. Association for Computational Linguistics.

Blai Bonet and Héctor Geffner. 2005. mGPT: a probabilistic planner based on heuristic search. Journal of Artificial Intelligence Research, 24:933–944.

S.R.K. Branavan, Harr Chen, Luke Zettlemoyer, and Regina Barzilay. 2009. Reinforcement learning for mapping instructions to actions.
In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 82–90, Suntec, Singapore, August. Association for Computational Linguistics.

Chih-Chung Chang and Chih-Jen Lin. 2011. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

David L. Chen and Raymond J. Mooney. 2011. Learning to interpret natural language navigation instructions from observations. In Proceedings of the 25th AAAI Conference on Artificial Intelligence (AAAI-2011), pages 859–865, August.

Corinna Cortes and Vladimir Vapnik. 1995. Support-vector networks. Machine Learning, 20:273–297.

Myroslava O. Dzikovska, James F. Allen, and Mary D. Swift. 2008. Linking semantic and knowledge representations in a multi-domain dialogue system. Journal of Logic and Computation, 18:405–430, June.

Andrew Gargett, Konstantina Garoufi, Alexander Koller, and Kristina Striegnitz. 2010. The GIVE-2 corpus of giving instructions in virtual environments. In Proceedings of the 7th Conference on International Language Resources and Evaluation (LREC), Malta.

James J. Gibson. 1979. The Ecological Approach to Visual Perception. Houghton Mifflin.

Peter Gorniak and Deb Roy. 2007. Situated language understanding as filtering perceived affordances. Cognitive Science, 31(2):197–231.

Jörg Hoffmann. 2003. The Metric-FF planning system: Translating “ignoring delete lists” to numeric state variables. Journal of Artificial Intelligence Research (JAIR), 20:291–341.

Alexander Koller, Kristina Striegnitz, Andrew Gargett, Donna Byron, Justine Cassell, Robert Dale, Johanna Moore, and Jon Oberlander. 2010. Report on the second challenge on generating instructions in virtual environments (GIVE-2).
In Proceedings of the 6th International Natural Language Generation Conference (INLG), Dublin.

Tessa Lau, Clemens Drews, and Jeffrey Nichols. 2009. Interpreting written how-to instructions. In Proceedings of the 21st International Joint Conference on Artificial Intelligence, pages 1433–1438, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

Vladimir I. Levenshtein. 1966. Binary codes capable of correcting deletions, insertions, and reversals. Technical Report 8.

Matt MacMahon, Brian Stankiewicz, and Benjamin Kuipers. 2006. Walk the talk: connecting language, knowledge, and action in route instructions. In Proceedings of the 21st National Conference on Artificial Intelligence - Volume 2, pages 1475–1482. AAAI Press.

Cynthia Matuszek, Dieter Fox, and Karl Koscher. 2010. Following directions using statistical machine translation. In Proceedings of the 5th ACM/IEEE International Conference on Human-Robot Interaction, HRI ’10, pages 251–258, New York, NY, USA. ACM.

Dana Nau, Malik Ghallab, and Paolo Traverso. 2004. Automated Planning: Theory & Practice. Morgan Kaufmann Publishers Inc., California, USA.

Masoud Nikravesh, Tomohiro Takagi, Masanori Tajima, Akiyoshi Shinmura, Ryosuke Ohgaya, Koji Taniguchi, Kazuyosi Kawahara, Kouta Fukano, and Akiko Aizawa. 2005. Soft computing for perception-based decision processing and analysis: Web-based BISC-DSS. In Masoud Nikravesh, Lotfi Zadeh, and Janusz Kacprzyk, editors, Soft Computing for Information Processing and Analysis, volume 164 of Studies in Fuzziness and Soft Computing, chapter 4, pages 93–188. Springer Berlin / Heidelberg.

Jeff Orkin and Deb Roy. 2009. Automatic learning and generation of social behavior from collective human gameplay. In Proceedings of the 8th International Conference on Autonomous Agents and Multiagent Systems - Volume 1, pages 385–392. International Foundation for Autonomous Agents and Multiagent Systems.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL ’02, pages 311–318, Stroudsburg, PA, USA. Association for Computational Linguistics.

Verena Rieser and Oliver Lemon. 2010. Learning human multimodal dialogue strategies. Natural Language Engineering, 16:3–23.

Laura Stoia, Donna K. Byron, Darla Magdalene Shockley, and Eric Fosler-Lussier. 2006. Sentence planning for realtime navigational instructions. In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers, NAACL-Short ’06, pages 157–160, Stroudsburg, PA, USA. Association for Computational Linguistics.

Adam Vogel and Dan Jurafsky. 2010. Learning to follow navigational directions. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL ’10, pages 806–814, Stroudsburg, PA, USA. Association for Computational Linguistics.
3 0.076765627 159 acl-2012-Pattern Learning for Relation Extraction with a Hierarchical Topic Model
Author: Enrique Alfonseca ; Katja Filippova ; Jean-Yves Delort ; Guillermo Garrido
Abstract: We describe the use of a hierarchical topic model for automatically identifying syntactic and lexical patterns that explicitly state ontological relations. We leverage distant supervision using relations from the knowledge base FreeBase, but do not require any manual heuristic nor manual seed list selections. Results show that the learned patterns can be used to extract new relations with good precision.
4 0.076107986 51 acl-2012-Collective Generation of Natural Image Descriptions
Author: Polina Kuznetsova ; Vicente Ordonez ; Alexander Berg ; Tamara Berg ; Yejin Choi
Abstract: We present a holistic data-driven approach to image description generation, exploiting the vast amount of (noisy) parallel image data and associated natural language descriptions available on the web. More specifically, given a query image, we retrieve existing human-composed phrases used to describe visually similar images, then selectively combine those phrases to generate a novel description for the query image. We cast the generation process as constraint optimization problems, collectively incorporating multiple interconnected aspects of language composition for content planning, surface realization and discourse structure. Evaluation by human annotators indicates that our final system generates more semantically correct and linguistically appealing descriptions than two nontrivial baselines.
5 0.068471812 93 acl-2012-Fast Online Lexicon Learning for Grounded Language Acquisition
Author: David Chen
Abstract: Learning a semantic lexicon is often an important first step in building a system that learns to interpret the meaning of natural language. It is especially important in language grounding, where the training data usually consist of language paired with an ambiguous perceptual context. Recent work by Chen and Mooney (2011) introduced a lexicon learning method that deals with ambiguous relational data by taking intersections of graphs. While the algorithm produced good lexicons for the task of learning to interpret navigation instructions, it only works in batch settings and does not scale well to large datasets. In this paper we introduce a new online algorithm that is an order of magnitude faster and surpasses the state-of-the-art results. We show that by changing the grammar of the formal meaning representation language and training on additional data collected from Amazon’s Mechanical Turk we can further improve the results. We also include experimental results on a Chinese translation of the training data to demonstrate the generality of our approach.
6 0.064726032 40 acl-2012-Big Data versus the Crowd: Looking for Relationships in All the Right Places
7 0.064631782 64 acl-2012-Crosslingual Induction of Semantic Roles
8 0.064353257 201 acl-2012-Towards the Unsupervised Acquisition of Discourse Relations
9 0.053196236 78 acl-2012-Efficient Search for Transformation-based Inference
10 0.051192742 191 acl-2012-Temporally Anchored Relation Extraction
11 0.048816286 33 acl-2012-Automatic Event Extraction with Structured Preference Modeling
12 0.048812605 73 acl-2012-Discriminative Learning for Joint Template Filling
13 0.048157118 90 acl-2012-Extracting Narrative Timelines as Temporal Dependency Structures
14 0.047485147 85 acl-2012-Event Linking: Grounding Event Reference in a News Archive
15 0.046402026 147 acl-2012-Modeling the Translation of Predicate-Argument Structure for SMT
16 0.044608019 208 acl-2012-Unsupervised Relation Discovery with Sense Disambiguation
17 0.043689657 12 acl-2012-A Graph-based Cross-lingual Projection Approach for Weakly Supervised Relation Extraction
18 0.04302888 177 acl-2012-Sentence Dependency Tagging in Online Question Answering Forums
19 0.042994015 182 acl-2012-Spice it up? Mining Refinements to Online Instructions from User Generated Content
20 0.042183731 95 acl-2012-Fast Syntactic Analysis for Statistical Language Modeling via Substructure Sharing and Uptraining
2 0.77658188 59 acl-2012-Corpus-based Interpretation of Instructions in Virtual Environments
Author: Luciana Benotti ; Martin Villalba ; Tessa Lau ; Julian Cerruti
Abstract: Previous approaches to instruction interpretation have required either extensive domain adaptation or manually annotated corpora. This paper presents a novel approach to instruction interpretation that leverages a large amount of unannotated, easy-to-collect data from humans interacting with a virtual world. We compare several algorithms for automatically segmenting and discretizing this data into (utterance, reaction) pairs and training a classifier to predict reactions given the next utterance. Our empirical analysis shows that the best algorithm achieves 70% accuracy on this task, with no manual annotation required.

1 Introduction and motivation

Mapping instructions into automatically executable actions would enable the creation of natural language interfaces to many applications (Lau et al., 2009; Branavan et al., 2009; Orkin and Roy, 2009). In this paper, we focus on the task of navigation and manipulation of a virtual environment (Vogel and Jurafsky, 2010; Chen and Mooney, 2011). Current symbolic approaches to the problem are brittle to the natural language variation present in instructions and require intensive rule authoring to be fit for a new task (Dzikovska et al., 2008). Current statistical approaches require extensive manual annotations of the corpora used for training (MacMahon et al., 2006; Matuszek et al., 2010; Gorniak and Roy, 2007; Rieser and Lemon, 2010). Manual annotation and rule authoring by natural language engineering experts are bottlenecks for developing conversational systems for new domains.

This paper proposes a fully automated approach to interpreting natural language instructions to complete a task in a virtual world, based on unsupervised recordings of human-human interactions performing that task in that virtual world.
Given unannotated corpora collected from humans following other humans’ instructions, our system automatically segments the corpus into labeled training data for a classification algorithm. Our interpretation algorithm is based on the observation that similar instructions uttered in similar contexts should lead to similar actions being taken in the virtual world. Given a previously unseen instruction, our system outputs actions that can be directly executed in the virtual world, based on what humans did when given similar instructions in the past.

2 Corpora situated in virtual worlds

Our environment consists of six virtual worlds designed for the natural language generation shared task known as the GIVE Challenge (Koller et al., 2010), where a pair of partners must collaborate to solve a task in a 3D space (Figure 1). The “instruction follower” (IF) can move around in the virtual world, but has no knowledge of the task. The “instruction giver” (IG) types instructions to the IF in order to guide him to accomplish the task. Each corpus contains the IF’s actions and position recorded every 200 milliseconds, as well as the IG’s instructions with their timestamps.

We used two corpora for our experiments. The Cm corpus (Gargett et al., 2010) contains instructions given by multiple people, consisting of 37 games spanning 2163 instructions over 8:17 hs. The Cs corpus (Benotti and Denis, 2011), gathered using a single IG, is composed of 63 games and 3417 instructions, and was recorded in a span of 6:09 hs. It took less than 15 hours to collect the corpora through the web, and the subjects reported that the experiment was fun.

Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pages 181–186, Jeju, Republic of Korea, 8-14 July 2012. ©2012 Association for Computational Linguistics

Figure 1: A screenshot of a virtual world. The world consists of interconnecting hallways, rooms and objects.
While the environment is restricted, people describe the same route and the same objects in extremely different ways. Below are some examples of instructions from our corpus, all given for the same route shown in Figure 1.

1) out
2) walk down the passage
3) nowgo [sic] to the pink room
4) back to the room with the plant
5) Go through the door on the left
6) go through opening with yellow wall paper

People describe routes using landmarks (4) or specific actions (2). They may describe the same object differently (5 vs 6). Instructions also differ in their scope (3 vs 1). Thus, even ignoring spelling and grammatical errors, navigation instructions contain considerable variation, which makes interpreting them a challenging problem.

3 Learning from previous interpretations

Our algorithm consists of two phases: annotation and interpretation. Annotation is performed only once and consists of automatically associating each IG instruction to an IF reaction. Interpretation is performed every time the system receives an instruction and consists of predicting an appropriate reaction given reactions observed in the corpus. Our method is based on the assumption that a reaction captures the semantics of the instruction that caused it. Therefore, if two utterances result in the same reaction, they are paraphrases of each other, and similar utterances should generate the same reaction. This approach enables us to predict reactions for previously-unseen instructions.

3.1 Annotation phase

The key challenge in learning from massive amounts of easily-collected data is to automatically annotate an unannotated corpus. Our annotation method consists of two parts: first, segmenting a low-level interaction trace into utterances and corresponding reactions, and second, discretizing those reactions into canonical action sequences. Segmentation enables our algorithm to learn from traces of IFs interacting directly with a virtual world.
Since the IF can move freely in the virtual world, his actions are a stream of continuous behavior. Segmentation divides these traces into reactions that follow from each utterance of the IG. Consider the following example starting at the situation shown in Figure 1:

IG(1): go through the yellow opening
IF(2): [walks out of the room]
IF(3): [turns left at the intersection]
IF(4): [enters the room with the sofa]
IG(5): stop

It is not clear whether the IF is doing ⟨3, 4⟩ because he is reacting to 1 or because he is being proactive. While one could manually annotate this data to remove extraneous actions, our goal is to develop automated solutions that enable learning from massive amounts of data. We decided to approach this problem by experimenting with two alternative formal definitions: 1) a strict definition that considers the maximum reaction according to the IF behavior, and 2) a loose definition based on the empirical observation that, in situated interaction, most instructions are constrained by the current visually perceived affordances (Gibson, 1979; Stoia et al., 2006).

We formally define behavior segmentation (Bhv) as follows. A reaction rk to an instruction uk begins right after the instruction uk is uttered and ends right before the next instruction uk+1 is uttered. In the example, instruction 1 corresponds to ⟨2, 3, 4⟩.

We formally define visibility segmentation (Vis) as follows. A reaction rk to an instruction uk begins right after the instruction uk is uttered and ends right before the next instruction uk+1 is uttered or right after the IF leaves the area visible at 360° from where uk was uttered. In the example, instruction 1’s reaction would be limited to ⟨2⟩ because the intersection is not visible from where the instruction was uttered.

The Bhv and Vis methods define how to segment an interaction trace into utterances and their corresponding reactions.
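The two strategies can be sketched over a timestamped trace (the trace format, timestamps, positions, and `visible` predicate are invented for illustration; the real system works over GIVE logs and checks 360° visibility geometrically):

```python
INF = float("inf")

def segment_bhv(utterances, actions):
    # Bhv: reaction k = every IF action between utterance k and k+1.
    # utterances: [(time, text)], actions: [(time, action, position)].
    cuts = [t for t, _ in utterances] + [INF]
    return [
        [a for t, a, _ in actions if cuts[k] < t < cuts[k + 1]]
        for k in range(len(utterances))
    ]

def segment_vis(utterances, actions, visible):
    # Vis: additionally truncate a reaction once the IF reaches a position
    # no longer visible from where utterance k was uttered.
    cuts = [t for t, _ in utterances] + [INF]
    reactions = []
    for k in range(len(utterances)):
        reaction = []
        for t, a, pos in actions:
            if not (cuts[k] < t < cuts[k + 1]):
                continue
            if not visible(cuts[k], pos):
                break
            reaction.append(a)
        reactions.append(reaction)
    return reactions

# The example dialogue above, with invented timestamps and positions:
utts = [(0, "go through the yellow opening"), (40, "stop")]
acts = [(10, "walk out", "hall"),
        (20, "turn left", "intersection"),
        (30, "enter sofa room", "sofa-room")]
segment_bhv(utts, acts)                                 # instruction 1 -> all three actions
segment_vis(utts, acts, lambda t0, pos: pos == "hall")  # instruction 1 -> ["walk out"] only
```

This mirrors the worked example: Bhv assigns ⟨2, 3, 4⟩ to instruction 1, while Vis cuts the reaction at the first position outside the visible area.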
3 0.56062925 93 acl-2012-Fast Online Lexicon Learning for Grounded Language Acquisition
Author: David Chen
Abstract: Learning a semantic lexicon is often an important first step in building a system that learns to interpret the meaning of natural language. It is especially important in language grounding, where the training data usually consist of language paired with an ambiguous perceptual context. Recent work by Chen and Mooney (2011) introduced a lexicon learning method that deals with ambiguous relational data by taking intersections of graphs. While the algorithm produced good lexicons for the task of learning to interpret navigation instructions, it only works in batch settings and does not scale well to large datasets. In this paper we introduce a new online algorithm that is an order of magnitude faster and surpasses the state-of-the-art results. We show that by changing the grammar of the formal meaning representation language and training on additional data collected from Amazon's Mechanical Turk we can further improve the results. We also include experimental results on a Chinese translation of the training data to demonstrate the generality of our approach.
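The intersection idea behind this line of work can be illustrated with a toy sketch. All names here are our own, and sets of meaning atoms stand in for the paper's graph representations: each instruction is paired with several candidate meanings, and a word's meaning is estimated by intersecting the candidates across the ambiguous contexts in which the word appears.

```python
from functools import reduce

def learn_lexicon(observations):
    """Toy lexicon learner: intersect candidate meanings across
    the ambiguous contexts in which each word occurs.

    observations: list of (instruction_words, candidate_meanings) pairs.
    Returns a dict mapping each word to the meaning atoms shared by
    every context it appeared in.
    """
    contexts = {}  # word -> list of candidate meaning sets
    for words, meanings in observations:
        for w in words:
            contexts.setdefault(w, []).append(set(meanings))
    return {w: reduce(set.intersection, sets) for w, sets in contexts.items()}

# Hypothetical ambiguous training data: each instruction is paired with
# all plausible actions observed around it, not just the correct one.
data = [
    (["turn", "left"], {"TURN", "LEFT", "MOVE"}),
    (["turn", "right"], {"TURN", "RIGHT"}),
    (["move", "left"], {"MOVE", "LEFT"}),
]
lexicon = learn_lexicon(data)
print(lexicon["turn"])  # {'TURN'} is the only meaning shared by both contexts
```

The batch-vs-online distinction the abstract draws is about when these intersections are computed: the sketch above recomputes from all observations at once, whereas an online learner would update `contexts` incrementally as each pair arrives.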
4 0.46809763 14 acl-2012-A Joint Model for Discovery of Aspects in Utterances
Author: Asli Celikyilmaz ; Dilek Hakkani-Tur
Abstract: We describe a joint model for understanding user actions in natural language utterances. Our multi-layer generative approach uses both labeled and unlabeled utterances to jointly learn aspects regarding the utterance's target domain (e.g., movies) and intention (e.g., finding a movie) along with other semantic units (e.g., movie name). We inject information extracted from unstructured web search query logs as prior information to enhance the generative process of the natural language utterance understanding model. Using utterances from five domains, our approach shows up to 4.5% improvement on domain and dialog act performance over a cascaded approach in which each semantic component is learned sequentially, and over a supervised joint learning model (which requires fully labeled data).
5 0.4452742 182 acl-2012-Spice it up? Mining Refinements to Online Instructions from User Generated Content
Author: Gregory Druck ; Bo Pang
Abstract: There are a growing number of popular web sites where users submit and review instructions for completing tasks as varied as building a table and baking a pie. In addition to providing their subjective evaluation, reviewers often provide actionable refinements. These refinements clarify, correct, improve, or provide alternatives to the original instructions. However, identifying and reading all relevant reviews is a daunting task for a user. In this paper, we propose a generative model that jointly identifies user-proposed refinements in instruction reviews at multiple granularities, and aligns them to the appropriate steps in the original instructions. Labeled data is not readily available for these tasks, so we focus on the unsupervised setting. In experiments in the recipe domain, our model provides 90.1% F1 for predicting refinements at the review level, and 77.0% F1 for predicting refinement segments within reviews.
6 0.39684433 51 acl-2012-Collective Generation of Natural Image Descriptions
7 0.39396361 6 acl-2012-A Comprehensive Gold Standard for the Enron Organizational Hierarchy
8 0.38700667 133 acl-2012-Learning to "Read Between the Lines" using Bayesian Logic Programs
9 0.37164885 40 acl-2012-Big Data versus the Crowd: Looking for Relationships in All the Right Places
10 0.37079966 76 acl-2012-Distributional Semantics in Technicolor
11 0.36898375 211 acl-2012-Using Rejuvenation to Improve Particle Filtering for Bayesian Word Segmentation
12 0.36520326 24 acl-2012-A Web-based Evaluation Framework for Spatial Instruction-Giving Systems
13 0.36435002 33 acl-2012-Automatic Event Extraction with Structured Preference Modeling
14 0.35413301 159 acl-2012-Pattern Learning for Relation Extraction with a Hierarchical Topic Model
15 0.34979004 113 acl-2012-INPRO_iSS: A Component for Just-In-Time Incremental Speech Synthesis
16 0.34923187 120 acl-2012-Information-theoretic Multi-view Domain Adaptation
17 0.3491461 201 acl-2012-Towards the Unsupervised Acquisition of Discourse Relations
18 0.34702638 112 acl-2012-Humor as Circuits in Semantic Networks
19 0.34274808 64 acl-2012-Crosslingual Induction of Semantic Roles
20 0.34271219 16 acl-2012-A Nonparametric Bayesian Approach to Acoustic Model Discovery
topicId topicWeight
[(25, 0.019), (26, 0.027), (28, 0.037), (30, 0.02), (37, 0.046), (38, 0.327), (39, 0.069), (49, 0.02), (59, 0.031), (74, 0.028), (82, 0.025), (84, 0.034), (85, 0.019), (90, 0.101), (92, 0.047), (94, 0.022), (99, 0.052)]
simIndex simValue paperId paperTitle
same-paper 1 0.70595765 129 acl-2012-Learning High-Level Planning from Text
Author: S.R.K. Branavan ; Nate Kushman ; Tao Lei ; Regina Barzilay
Abstract: Comprehending action preconditions and effects is an essential step in modeling the dynamics of the world. In this paper, we express the semantics of precondition relations extracted from text in terms of planning operations. The challenge of modeling this connection is to ground language at the level of relations. This type of grounding enables us to create high-level plans based on language abstractions. Our model jointly learns to predict precondition relations from text and to perform high-level planning guided by those relations. We implement this idea in the reinforcement learning framework using feedback automatically obtained from plan execution attempts. When applied to a complex virtual world and text describing that world, our relation extraction technique performs on par with a supervised baseline, yielding an F-measure of 66% compared to the baseline's 65%. Additionally, we show that a high-level planner utilizing these extracted relations significantly outperforms a strong, text-unaware baseline, successfully completing 80% of planning tasks as compared to 69% for the baseline.
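The planning side of this idea — using precondition relations extracted from text to order subgoals before pursuing a goal — can be sketched as a topological pass over a precondition graph. The relations and names below are invented illustrations, not the paper's reinforcement-learning model:

```python
def plan_subgoals(goal, preconditions):
    """Order the subgoals needed for `goal` so that each object's
    preconditions are achieved before the object itself (DFS-based
    topological sort over a precondition graph).

    preconditions: dict mapping an object to the objects required first.
    """
    order, seen = [], set()

    def visit(node):
        if node in seen:
            return
        seen.add(node)
        for dep in preconditions.get(node, []):
            visit(dep)
        order.append(node)

    visit(goal)
    return order

# Hypothetical relations of the kind extracted from text, e.g. from
# sentences like "seeds are needed to grow wheat".
relations = {"bread": ["wheat"], "wheat": ["seeds"]}
print(plan_subgoals("bread", relations))  # ['seeds', 'wheat', 'bread']
```

In the paper, relation predictions and low-level plan execution interact (execution feedback trains the extractor); the sketch shows only the ordering step a high-level planner performs once relations are available.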
2 0.60878712 54 acl-2012-Combining Word-Level and Character-Level Models for Machine Translation Between Closely-Related Languages
Author: Preslav Nakov ; Jorg Tiedemann
Abstract: We propose several techniques for improving statistical machine translation between closely-related languages with scarce resources. We use character-level translation trained on n-gram-character-aligned bitexts and tuned using word-level BLEU, which we further augment with character-based transliteration at the word level and combine with a word-level translation model. The evaluation on Macedonian-Bulgarian movie subtitles shows an improvement of 2.84 BLEU points over a phrase-based word-level baseline.
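A drastically simplified view of the word/character combination is a backoff: translate at the word level where the model has coverage, and fall back to character-level mapping for out-of-vocabulary words. All mappings below are invented toy data; the actual system combines full statistical models rather than lookup tables:

```python
def translate(sentence, word_table, char_table):
    """Word-level translation with character-level backoff for OOV words.

    word_table: known word-to-word translations.
    char_table: character correspondences for closely related languages.
    """
    out = []
    for word in sentence.split():
        if word in word_table:
            out.append(word_table[word])  # word-level model has coverage
        else:
            # character-level "transliteration" of an out-of-vocabulary word
            out.append("".join(char_table.get(c, c) for c in word))
    return " ".join(out)

# Invented toy mappings standing in for learned models.
words = {"zdravo": "zdravei"}
chars = {"k": "c"}
print(translate("zdravo kola", words, chars))  # zdravei cola
```

For closely related languages most characters map to themselves, which is why the `char_table.get(c, c)` default is a reasonable fallback in this sketch.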
3 0.41834694 191 acl-2012-Temporally Anchored Relation Extraction
Author: Guillermo Garrido ; Anselmo Penas ; Bernardo Cabaleiro ; Alvaro Rodrigo
Abstract: Although much work on relation extraction has aimed at obtaining static facts, many of the target relations are actually fluents, as their validity is naturally anchored to a certain time period. This paper proposes a methodological approach to temporally anchored relation extraction. Our proposal performs distant supervised learning to extract a set of relations from a natural language corpus, and anchors each of them to an interval of temporal validity, aggregating evidence from documents supporting the relation. We use a rich graphbased document-level representation to generate novel features for this task. Results show that our implementation for temporal anchoring is able to achieve a 69% of the upper bound performance imposed by the relation extraction step. Compared to the state of the art, the overall system achieves the highest precision reported.
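The anchoring step — aggregating per-document evidence into an interval of temporal validity — can be sketched as taking the span covered by supporting mentions. This is a simplification of the paper's aggregation, with invented evidence dates:

```python
from datetime import date

def anchor_relation(evidence):
    """Aggregate document-level evidence for a fluent relation into one
    interval of temporal validity: earliest to latest supporting date.

    evidence: non-empty list of dates at which documents assert the
    relation holds.
    """
    return (min(evidence), max(evidence))

# Invented evidence: documents asserting "X is CEO of Y" at various times.
mentions = [date(2003, 5, 1), date(2001, 2, 10), date(2005, 11, 30)]
print(anchor_relation(mentions))  # (date(2001, 2, 10), date(2005, 11, 30))
```

A real system would additionally weigh conflicting evidence and handle open-ended intervals; the min/max span is only the simplest aggregation consistent with all supporting documents.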
Author: Pradeep Dasigi ; Weiwei Guo ; Mona Diab
Abstract: We describe an unsupervised approach to the problem of automatically detecting subgroups of people holding similar opinions in a discussion thread. An intuitive way of identifying this is to detect the attitudes of discussants towards each other or towards named entities or topics mentioned in the discussion. Sentiment tags play an important role in this detection, but we also note another dimension to the detection of people's attitudes in a discussion: if two persons share the same opinion, they tend to use similar language content. We consider the latter to be an implicit attitude. In this paper, we investigate the impact of implicit and explicit attitudes in two genres of social media discussion data: more formal Wikipedia discussions and a much more informal debate discussion forum. Experimental results strongly suggest that implicit attitude is an important complement to explicit attitudes (expressed via sentiment) and that it can improve subgroup detection performance independently of genre.
5 0.41516984 187 acl-2012-Subgroup Detection in Ideological Discussions
Author: Amjad Abu-Jbara ; Pradeep Dasigi ; Mona Diab ; Dragomir Radev
Abstract: The rapid and continuous growth of social networking sites has led to the emergence of many communities of communicating groups. Many of these groups discuss ideological and political topics. It is not uncommon that the participants in such discussions split into two or more subgroups. The members of each subgroup share the same opinion toward the discussion topic and are more likely to agree with members of the same subgroup and disagree with members from opposing subgroups. In this paper, we propose an unsupervised approach for automatically detecting discussant subgroups in online communities. We analyze the text exchanged between the participants of a discussion to identify the attitude they carry toward each other and towards the various aspects of the discussion topic. We use attitude predictions to construct an attitude vector for each discussant. We use clustering techniques to cluster these vectors and, hence, determine the subgroup membership of each participant. We compare our methods to text clustering and other baselines, and show that our method achieves promising results.
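The final step described here — grouping discussants by their attitude vectors — can be illustrated with a deliberately minimal stand-in for clustering: splitting discussants by the sign of their mean attitude score. The scores are invented, and the paper uses richer attitude vectors with standard clustering techniques rather than this sign rule:

```python
def detect_subgroups(attitude_vectors):
    """Split discussants into two subgroups by the sign of their mean
    attitude score: overall positive vs. overall negative stance.

    attitude_vectors: dict mapping a discussant to a list of attitude
    scores (e.g. sentiment toward the topic and other discussants).
    """
    groups = {"support": [], "oppose": []}
    for person, vec in attitude_vectors.items():
        mean = sum(vec) / len(vec)
        groups["support" if mean >= 0 else "oppose"].append(person)
    return groups

# Invented attitude scores: positive = agreement, negative = disagreement.
attitudes = {
    "alice": [0.8, 0.5, -0.1],
    "bob": [-0.6, -0.9, 0.2],
    "carol": [0.4, 0.7, 0.9],
}
print(detect_subgroups(attitudes))  # alice and carol together, bob apart
```

Replacing the sign rule with k-means over the full vectors recovers the setting the abstract describes, where subgroup membership falls out of cluster assignments.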
6 0.41506875 80 acl-2012-Efficient Tree-based Approximation for Entailment Graph Learning
7 0.41488135 21 acl-2012-A System for Real-time Twitter Sentiment Analysis of 2012 U.S. Presidential Election Cycle
8 0.41469508 130 acl-2012-Learning Syntactic Verb Frames using Graphical Models
9 0.41326424 28 acl-2012-Aspect Extraction through Semi-Supervised Modeling
10 0.41317913 146 acl-2012-Modeling Topic Dependencies in Hierarchical Text Categorization
11 0.41279173 10 acl-2012-A Discriminative Hierarchical Model for Fast Coreference at Large Scale
12 0.41246036 206 acl-2012-UWN: A Large Multilingual Lexical Knowledge Base
13 0.41240135 219 acl-2012-langid.py: An Off-the-shelf Language Identification Tool
14 0.41159296 40 acl-2012-Big Data versus the Crowd: Looking for Relationships in All the Right Places
15 0.41157812 63 acl-2012-Cross-lingual Parse Disambiguation based on Semantic Correspondence
16 0.41144621 36 acl-2012-BIUTEE: A Modular Open-Source System for Recognizing Textual Entailment
17 0.41141817 214 acl-2012-Verb Classification using Distributional Similarity in Syntactic and Semantic Structures
18 0.41063118 167 acl-2012-QuickView: NLP-based Tweet Search
19 0.41026863 62 acl-2012-Cross-Lingual Mixture Model for Sentiment Classification
20 0.40956318 174 acl-2012-Semantic Parsing with Bayesian Tree Transducers