Abstract: Traditional computational approaches to referring expression generation operate in a deliberate manner, choosing the attributes to be included on the basis of their ability to distinguish the intended referent from its distractors. However, work in psycholinguistics suggests that speakers align their referring expressions with those used previously in the discourse, implying less deliberate choice and more subconscious reuse. This raises the question as to which is a more accurate characterisation of what people do. Using a corpus of dialogues containing 16,358 referring expressions, we explore this question via the generation of subsequent references in shared visual scenes. We use a machine learning approach to referring expression generation and demonstrate that incorporating features that correspond to the computational tradition does not match human referring behaviour as well as using features corresponding to the process of alignment. The results support the view that the traditional model of referring expression generation that is widely assumed in work on natural language generation may not in fact be correct; our analysis may also help explain the oft-observed redundancy found in human-produced referring expressions.

1 However, work in psycholinguistics suggests that speakers align their referring expressions with those used previously in the discourse, implying less deliberate choice and more subconscious reuse. [sent-12, score-0.661]

2 Using a corpus of dialogues containing 16,358 referring expressions, we explore this question via the generation of subsequent references in shared visual scenes. [sent-14, score-0.885]

3 We use a machine learning approach to referring expression generation and demonstrate that incorporating features that correspond to the computational tradition does not match human referring behaviour as well as using features corresponding to the process of alignment. [sent-15, score-1.202]

4 The results support the view that the traditional model of referring expression generation that is widely assumed in work on natural language generation may not in fact be correct; our analysis may also help explain the oft-observed redundancy found in human-produced referring expressions. [sent-16, score-1.208]

5 1 Introduction Computational work on referring expression generation (REG) has an extensive history, and a wide variety of algorithms have been proposed, dealing with various facets of what is recognised to be a complex problem. [sent-17, score-0.633]

6 Almost all of this work sees the task as being concerned with choosing those attributes of an intended referent that distinguish it from the other entities with which it might be confused (see, for example, Dale (1989), Dale and Reiter (1995), Krahmer et al. [sent-18, score-0.332]

7 Independently, an alternative way of thinking about reference has arisen within the psycholinguistics community: there is now a long tradition of work that explores how a dialogue participant’s forms of reference are influenced by those previously used for a given entity. [sent-20, score-0.366]

8 Most recently, this line of work has been discussed in terms of the notions of alignment (Pickering and Garrod, 2004) and conceptual pacts (Clark and Wilkes-Gibbs, 1986; Brennan and Clark, 1996). [sent-21, score-0.272]

9 Using a large corpus of referring expressions in task-oriented dialogues, this paper presents a machine learning approach that allows us to combine features corresponding to the two perspectives. [sent-23, score-0.131]

10 Our results show that models based on the alignment perspective outperform models based on traditional REG considerations, as well as a number of simpler baselines. [sent-24, score-0.168]

11 In Section 3, we describe the iMAP Corpus and the referring expressions it contains. [sent-27, score-0.559]

12 1 The Algorithmic Approach We use the term algorithmic approach here to refer to the perspective that is common to the considerable body of work within computational linguistics on the problem of referring expression generation developed over the last 20 years. [sent-33, score-0.748]

13 This work has focused on the design of algorithms which take into account the context of reference in order to decide what properties of an entity should be mentioned in order to distinguish that entity from others with which it might be confused. [sent-35, score-0.123]

14 Early work was concerned with subsequent reference in discourse, inspired by Grosz and Sidner’s (1986) observations on how the attentional structure of a discourse made particular referents accessible at any given point. [sent-36, score-0.305]

15 More recently, attention has shifted to initial reference in visual domains, driven in large part by the availability of the TUNA dataset and the shared tasks that make use of it (Gatt et al. [sent-37, score-0.318]

16 Scenarios that require the generation of references in multi-turn dialogues that concern visual scenes are likely to be among the first where we can expect computational approaches to referring expression generation to be practically useful. [sent-40, score-1.034]

17 Surprisingly, however, the more recent work on initial reference in visual domains and the earlier work on subsequent reference in discourse remain somewhat distinct and separate from each other, despite much the same algorithms having been used in both. [sent-41, score-0.574]

18 There is very little work that brings these two strands together by looking at both initial and subsequent references in dialogues that concern visual scenes. [sent-42, score-0.346]

19 However, their approach was concerned with choosing the type of reference to use (definite or indefinite, pronominal, bare or modified head noun), and not with the content of the reference; and their data set consisted of only 1242 referring expressions. [sent-45, score-0.73]

20 2 The Alignment Approach Meanwhile, starting with the early work of Carroll (1980), a quite distinct strand of research in psycholinguistics has explored how a speaker’s form of reference to an entity is impacted by the way that entity has been previously referred to in the discourse or dialogue. [sent-47, score-0.222]

21 The general idea behind what we will call the alignment approach is that a conversational participant will often adopt the same semantic, syntactic and lexical alternatives as the other party in a dialogue. [sent-48, score-0.178]

22 With respect to reference in particular, speakers are said to form conceptual pacts in their use of language (Clark and Wilkes-Gibbs, 1986; Brennan and Clark, 1996). [sent-50, score-0.296]

23 Recent work by Goudbeek and Krahmer (2010) supports the view that subconscious alignment does indeed take place at the level of content selection for referring expressions. [sent-52, score-0.69]

24 The participants in their study were more likely to use a dispreferred attribute to describe a target referent if this attribute had recently been used in a description by a confederate. [sent-53, score-0.403]

25 There is some work within natural language generation that attempts to model the process of alignment (Buschmeier et al. [sent-54, score-0.209]

26 , 2009; Janarthanam and Lemon, 2009), but this is predominantly concerned with what we might think of as the ‘lexical perspective’, focussing on lexical choice rather than the selection of appropriate semantic content for distinguishing descriptions. [sent-55, score-0.177]

27 3 Combined Models This paper is not the first to look at how the algorithmic approach and the alignment approach might be integrated in REG. [sent-57, score-0.194]

28 However, their data set consists of only 393 referring expressions, compared to our 16,358, and these expressions had functions other than identification; most importantly, the entities referred to were not part of a shared visual scene as is the case in our data. [sent-60, score-0.754]

29 Gupta and Stent (2005) instantiated Dale and Reiter’s (1995) Incremental Algorithm with a preference ordering that favours the attributes that were used in the previous mention of the same referent. [sent-61, score-0.258]

30 In a second variant, they even require these attributes to be included in a subsequent reference. [sent-62, score-0.259]
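
The two variants just described can be illustrated with a deliberately simplified sketch of an incremental attribute selector. This is not Gupta and Stent's implementation and it omits refinements of the published Incremental Algorithm; the dictionary-based entity representation, the function names and the `required` parameter are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Sequence

Entity = Dict[str, str]  # hypothetical representation: attribute name -> value


def aligned_order(default_order: Sequence[str], previous_mention: Iterable[str]) -> List[str]:
    """Move attributes used in the previous mention to the front of the ordering."""
    used = set(previous_mention)
    return [a for a in default_order if a in used] + [a for a in default_order if a not in used]


def incremental_algorithm(target: Entity, distractors: List[Entity],
                          preference_order: Sequence[str],
                          required: Iterable[str] = ()) -> List[str]:
    """Simplified incremental selection: add an attribute if it rules out a distractor."""
    required = set(required)
    # Second variant: attributes from the previous mention are included unconditionally.
    chosen = [a for a in preference_order if a in required and a in target]
    remaining = [d for d in distractors if all(d.get(a) == target[a] for a in chosen)]
    for att in preference_order:
        if not remaining:
            break  # the target is already distinguished
        if att in chosen or att not in target:
            continue
        if any(d.get(att) != target[att] for d in remaining):
            chosen.append(att)
            remaining = [d for d in remaining if d.get(att) == target[att]]
    return chosen
```

With a single distractor that differs from the target in both type and colour, the default ordering selects only the type, whereas an ordering aligned with a previous mention that used colour selects only the colour.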

31 Differently from most other work on REG, they extended the task to include ordering of the attributes in the surface form. [sent-63, score-0.188]

32 They therefore create a special evaluation metric that takes ordering into account, which makes it hard to compare the performance they report to that of any system that is not concerned with attribute ordering, such as ours. [sent-64, score-0.16]

33 Their evaluation set was also considerably smaller than ours: they used 1294 and 471 referring expressions from two different corpora, compared to our test set of 4947 referring expressions. [sent-65, score-1.021]

34 , 2007) is a collection of 256 dialogues between 32 participant pairs who contributed 8 dialogues each. [sent-70, score-0.16]

35 Both participants had a map of the same environment, but one participant’s map showed a route winding its way between the landmarks on the map; see Figure 1. [sent-71, score-0.248]

36 In addition to these inherent attributes of the landmarks, participants used spatial relations to other items on the map. [sent-84, score-0.282]

37 Each referring expression in the corpus is annotated with a unique identifier corresponding to the landmark that it describes and the semantic values of the attributes that it contains. [sent-85, score-0.84]

38 For each landmark R referred to in a dialogue, we view the sequence of references to this landmark as a coreference chain, notated ⟨R1, R2, . [sent-87, score-0.192]

39 From the corpus as a whole we extracted 34,127 referring expressions in 9558 chains. [sent-92, score-0.559]

40 References may be contributed to a chain by either speaker, and can be arbitrarily far apart: in the data, 4201 references are in the utterance immediately following the preceding reference in the chain, but the distance between references in a chain can be as high as 423 utterances. [sent-95, score-0.215]
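
A minimal sketch of how such coreference chains and the distances between their members might be computed. The record fields `dialogue`, `landmark` and `utterance` are hypothetical names and do not come from the corpus annotation scheme.

```python
from collections import defaultdict


def build_chains(references):
    """Group references by (dialogue, landmark), ordered by utterance index."""
    chains = defaultdict(list)
    for ref in sorted(references, key=lambda r: r["utterance"]):
        chains[(ref["dialogue"], ref["landmark"])].append(ref)
    return dict(chains)


def utterance_gaps(chain):
    """Distance in utterances between each reference and its predecessor in the chain."""
    return [later["utterance"] - earlier["utterance"]
            for earlier, later in zip(chain, chain[1:])]
```

A gap of 1 corresponds to a reference in the utterance immediately following the preceding reference in the chain.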

41 We removed from the data any annotation that was not concerned with the four landmark attributes, type, colour, relation, or the landmark’s other distinguishing attribute. [sent-96, score-0.177]

42 Ordinal numbers that were annotated as the use of the number attribute were re-tagged as spatial relations, as these usually described the position of the target within a line of landmarks. [sent-98, score-0.213]

43 As a result of the removal of annotations not pertaining to the use of the four landmark attributes, 2785 referring expressions had no annotation left; we removed these instances from the final data set. [sent-99, score-0.655]

44 We also do not attempt to replicate the remaining 5552 plural referring expressions or the 3062 pronouns found in the corpus. [sent-100, score-0.559]

[Table: Content Pattern, Count, Proportion]

45 However, we do include all of these instances in the feature extraction step, on the assumption that they might impact on the content of subsequent references. [sent-117, score-0.167]

46 Similarly, we filter out 6369 initial references after we have extracted features from them, since we focus here on the generation of subsequent reference only. [sent-118, score-0.305]

47 The remaining 16,358 referring expressions form the data which we use in our experiments. [sent-119, score-0.559]

48 Contrary to findings from other corpora, in which colour was used much more frequently (Gatt, 2007; Viethen and Dale, 2008), the colour attribute was used in only 26. [sent-120, score-0.491]

49 This is probably due to the often low reliability of colour in this task caused by the ink stains. [sent-122, score-0.245]

50 The proportion of referring expressions mentioning the target’s type might, at 38. [sent-123, score-0.598]

51 16% of the referring expressions, comparable to other corpora in the literature. [sent-129, score-0.462]

52 We can think of each referring expression as being a linguistic realisation of a content pattern: this is the collection of attributes that are used in that instance. [sent-131, score-0.84]

53 The attributes can be derived from the property-level annotation given in the corpus. [sent-132, score-0.188]

54 So, for example, if a particular reference appears as the noun phrase the blue penguin, annotated semantically as ⟨blue, penguin⟩, then the corresponding content pattern is ⟨colour, kind⟩. [sent-133, score-0.164]

55 Our aim is to replicate the content pattern of each referring expression in the corpus. [sent-134, score-0.693]
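
As a small illustration of how a content pattern can be derived from a property-level annotation, the sketch below assumes that each annotation is a list of (attribute, value) pairs; that representation and the function name are assumptions, not the corpus format.

```python
def content_pattern(annotation):
    """Return the set of attribute names used, ignoring their values."""
    return frozenset(attribute for attribute, _value in annotation)


# "the blue penguin": only the attribute names survive in the pattern.
blue_penguin = [("colour", "blue"), ("kind", "penguin")]
pattern = content_pattern(blue_penguin)
```

Using a frozenset makes patterns hashable, so they can serve directly as class labels or dictionary keys.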

56 1 The Two Perspectives Our task is defined simply as follows: for each subsequent reference R in the corpus, can we predict the content pattern that will be used in that reference? [sent-137, score-0.331]

57 The alignment approach, on the other hand, can be summarised thus: Speakers align the forms of reference they use to be similar or identical to references that have been used before. [sent-140, score-0.255]

58 In particular, once a form of reference to the intended referent has been established, they tend to re-use that form of reference, or perhaps an abbreviated version of it. [sent-141, score-0.218]

59 The alignment approach would appear to be preferable on the grounds of computational cost: we would expect that retrieving a previously-used referring expression, or parts thereof, generally requires less computation than building a new referring expression from scratch. [sent-142, score-1.15]

60 shape of the ink blot(s) on the IF’s map
LMprop Features
other Att: type of the other attribute of the target
[att] Value: value for each att of target
[att] Difference: was att of target different between the two maps? [sent-144, score-0.994]

61 Footnote 2: Unfortunately, determining what counts as a change of context, especially in visual scenes, is fraught with difficulty. [sent-154, score-0.195]

62 TradREG Features (Visual)
Count Vis Distractors: number of visual distractors
Prop Vis Same [att]: proportion of visual distractors with same att
Dist Closest: distance to the closest visual distractor
Closest Same [att]: has the closest distractor the same att? [sent-155, score-1.275]

63 Dist Closest Same [att]: distance to the closest distractor of same att as target
Cl Same type Same [att]: has the closest distractor of the same type also the same att? [sent-156, score-0.572]

64 TradREG Features (Discourse)
Count Intervening LMs: number of other LMs mentioned since the last mention of the target
Prop Intervening [att]: proportion of intervening LMs for which att was used AND which have the same att as target
Table 3: The TradREG feature set. [sent-157, score-0.894]
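
A few of the visual TradREG features could be computed along the following lines. This is a sketch under stated assumptions: landmarks are represented as dictionaries with a `pos` coordinate pair, distance is taken to be Euclidean, and the feature keys are illustrative names only. How the paper measures distance on the maps is not stated here.

```python
import math


def visual_features(target, distractors, att):
    """Compute a handful of TradREG-style visual features for one attribute."""
    def dist(d):
        return math.dist(target["pos"], d["pos"])  # assumed Euclidean distance

    same = [d for d in distractors if d.get(att) == target.get(att)]
    closest = min(distractors, key=dist) if distractors else None
    n = len(distractors)
    return {
        "count_vis_distractors": n,
        f"prop_vis_same_{att}": len(same) / n if n else 0.0,
        "dist_closest": dist(closest) if closest else None,
        f"closest_same_{att}": (closest.get(att) == target.get(att)) if closest else None,
        f"dist_closest_same_{att}": min(map(dist, same)) if same else None,
    }
```

The function would be called once per attribute, mirroring the way att is instantiated for each of the four attributes in the feature tables.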

65 2 Features The number of factors that can be hypothesised as having an impact on the form of a referring expression in a dialogic setting associated with a visual do- main is very large. [sent-160, score-0.751]

66 Instead, we here capture a wide range of factors as features that can be used by a machine learning algorithm to automatically induce from the data a classifier that predicts for a given set of features the attributes that should be used in a referring expression. [sent-162, score-0.718]

67 Map Features capture design characteristics of the maps the current dialogue is about; Speaker Features capture the identity and role of the participants; and LMprop Features capture the inherent visual properties of the target referent. [sent-165, score-0.325]

68 Most importantly for our present considerations, ... Footnote 3: In these tables, att is an abbreviatory variable that is instantiated once for each of the four attributes type, colour, relation, and the other distinguishing attribute of the landmark. [sent-167, score-0.65]

69 The abbreviation LM stands for landmark.
Alignment Features (Recency)
Last Men Speaker Same: who made the last mention of target? [sent-168, score-0.219]

70 Last Mention [att]: was att used in the last mention of target? [sent-169, score-0.442]

71 Alignment Features (Frequency)
Count [att] Dial: how often has att been used in the dialogue? [sent-171, score-0.319]

72 Count [att] LM: how often has att been used for target? [sent-172, score-0.319]

73 Quartile: quartile of the dialogue the RE was uttered in
Dial No: number of dialogues already completed +1
Mention No: number of previous mentions of target +1
Table 4: The Alignment feature set. [sent-173, score-0.273]
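
The recency and frequency features listed above might be extracted from the dialogue history roughly as follows. The mention records (`landmark`, `speaker`, `pattern`) and the feature keys are hypothetical, and reading Last Men Speaker Same as a boolean "same speaker as the current one" is an interpretation rather than something the table states.

```python
def alignment_features(history, target, speaker, att):
    """history: earlier mentions in the dialogue, oldest first."""
    target_mentions = [m for m in history if m["landmark"] == target]
    last = target_mentions[-1] if target_mentions else None
    return {
        # Recency features (None when the target has not been mentioned yet)
        "last_men_speaker_same": (last["speaker"] == speaker) if last else None,
        f"last_mention_{att}": (att in last["pattern"]) if last else None,
        # Frequency features
        f"count_{att}_dial": sum(att in m["pattern"] for m in history),
        f"count_{att}_lm": sum(att in m["pattern"] for m in target_mentions),
        "mention_no": len(target_mentions) + 1,
    }
```

Because the features are computed from the history alone, mentions that are later excluded from prediction (such as initial references) can still contribute to them.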

74 In addition to the main prediction class content pattern, the split was stratified for Speaker ID and Quartile to ensure that training and test set contained the same proportion of descriptions from each speaker and each quartile of the dialogues. [sent-177, score-0.358]
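
A stratified split of this kind can be sketched with the standard library alone. The `test_fraction` default, the seed and the record fields used in the usage note are assumptions; the sentence above does not state the split ratio or the procedure actually used.

```python
import random
from collections import defaultdict


def stratified_split(items, key, test_fraction=0.3, seed=0):
    """Split items so that each stratum (as defined by key) is divided proportionally."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    train, test = [], []
    for group in strata.values():
        group = group[:]          # leave the caller's data untouched
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test
```

For the setting described here, `key` could be something like `lambda r: (r["pattern"], r["speaker"], r["quartile"])`, so that content pattern, Speaker ID and Quartile are all balanced across the two sets.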

75 We used the J48 algorithm implemented in the Weka toolkit (Witten and Frank, 2005) to train decision trees with the task of judging, based on the given features, which content pattern should be used. [sent-178, score-0.184]

76 It generates the same content pattern that was used in the previous mention of the target referent. [sent-182, score-0.254]

77 MajorityClass generates the content pattern most commonly used in the training set. [sent-183, score-0.137]
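
The two baselines just described are simple enough to state in a few lines. The sketch below assumes content patterns are hashable values such as frozensets; the function names are illustrative.

```python
from collections import Counter


def repeat_last(previous_pattern):
    """RepeatLast: reuse the content pattern of the previous mention of the target."""
    return previous_pattern


def majority_class(training_patterns):
    """MajorityClass: always predict the most frequent pattern in the training set."""
    return Counter(training_patterns).most_common(1)[0][0]
```

RepeatLast needs only the coreference chain of the target, while MajorityClass ignores the context entirely and returns the same pattern for every test instance.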

78 This gives some measure of the overlap between two referring expressions, assigning a partial score if the two sets share attributes but are not identical. [sent-189, score-0.65]

79 Table 5 compares the performances of the three baselines and the decision trees based on the five feature subsets for each of the individual attributes and for the combined content pattern; note that the HeadNounOnly and RepeatLast baselines do not make attribute-specific predictions. [sent-211, score-0.331]

80 The table shows that the learned systems outperform all three baselines for the individual attributes as well as for the combined content pattern. [sent-212, score-0.284]

81 Also consistent with the results of the three individual feature sets, dropping the Ind features hurts performance more than dropping the TradREG features, but less than dropping the Alignment features. [sent-219, score-0.193]

82 These results suggest that considerations at the heart of traditional REG approaches do not play as important a role as those postulated by alignment-based models for the selection of semantic content for subsequent referring expressions. [sent-222, score-0.723]

83 Even in the arguably much simpler non-dialogic domains of the REG competitions concerned with pure content selection, the best performing system achieved only 53% Accuracy (see Gatt et al. [sent-230, score-0.145]

84 ... different set of attributes from those included by the human speaker; however, the Accuracy score also counts as incorrect any set that only partly overlaps with the reference found in the test set. [sent-244, score-0.311]

85 A DICE score that is equal to the Accuracy score would mean that each referring expression was either reproduced perfectly, or that a set of attributes was chosen that did not overlap with the original one at all. [sent-246, score-0.744]
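
The relationship between the two metrics can be made concrete with a short sketch, assuming content patterns are represented as sets of attribute names. The Dice coefficient used here is the standard 2|A ∩ B| / (|A| + |B|); treating two empty sets as a perfect match is a convention chosen for this sketch.

```python
def dice(predicted, reference):
    """Dice coefficient between two attribute sets."""
    if not predicted and not reference:
        return 1.0  # convention for this sketch: two empty patterns match
    return 2 * len(predicted & reference) / (len(predicted) + len(reference))


def accuracy(pairs):
    """Proportion of (predicted, reference) pairs that match exactly."""
    return sum(p == r for p, r in pairs) / len(pairs)


def mean_dice(pairs):
    """Average Dice score over (predicted, reference) pairs."""
    return sum(dice(p, r) for p, r in pairs) / len(pairs)
```

Mean DICE can never fall below Accuracy, since an exact match scores 1 under both; the two are equal only when every non-matching prediction shares no attributes with its reference.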

86 Interestingly, a large number of the referring expressions produced by the model trained only on TradREG features are subsets of the human reference. [sent-249, score-0.593]

87 This indicates that human speakers tend to include more attributes than are strictly speaking necessary to distinguish the landmark. [sent-250, score-0.221]

88 Table 7: Comparison of the predictions for the combined content pattern between the models trained on mutually exclusive feature sets. [sent-258, score-0.137]

89 the apparent redundancy that human-produced referring expressions contain. [sent-259, score-0.559]

90 Table 7 lists for those pairings of our learned models which were based on mutually exclusive feature sets how many referring expressions both models predicted correctly, how many both failed to predict, and how many were predicted correctly by either of the two models. [sent-261, score-0.559]
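
A pairwise comparison of this kind amounts to a four-way tally over the test instances. The sketch below is an illustration only; the function name and the keys of the result dictionary are assumptions.

```python
def compare_predictions(preds_a, preds_b, gold):
    """Tally agreement between two models against the human reference patterns."""
    counts = {"both_correct": 0, "both_wrong": 0, "only_a": 0, "only_b": 0}
    for a, b, g in zip(preds_a, preds_b, gold):
        if a == g and b == g:
            counts["both_correct"] += 1
        elif a == g:
            counts["only_a"] += 1
        elif b == g:
            counts["only_b"] += 1
        else:
            counts["both_wrong"] += 1
    return counts
```

The `only_a` and `only_b` cells are the interesting ones: they show how often each feature set captures something the other misses.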

91 7 Conclusions Using the largest corpus of referring expressions to date, we have shown how both the traditional computational view of REG and the alternative psycholinguistic alignment approach can be captured via a large set of features for machine learning. [sent-271, score-0.807]

92 First, we have demonstrated that a model using all these features to predict content patterns in subsequent references in shared visual scenes delivers an Accuracy of 58. [sent-274, score-0.445]

93 Second, our error analysis showed that the main reason for the low performance of a model based on traditional algorithmic features is that it often chooses too few attributes. [sent-278, score-0.132]

94 The fact that the model based on the alignment features does not make this mistake so frequently suggests that it may be the psycholinguistic considerations incorporated in our alignment features that lead people to add those additional attributes. [sent-279, score-0.436]

95 Finally, while the different models make the same correct predictions about the content of referring expressions in many cases, there are also a considerable number of cases where the models based on either the traditional algorithmic features (10. [sent-280, score-0.787]

96 Computational interpretations of the Gricean maxims in the generation of referring expressions. [sent-306, score-0.539]

97 Learning lexical alignment policies for generating referring expressions for spoken dialogue systems. [sent-346, score-0.774]

98 Learning content selection rules for generating object descriptions in dialogue. [sent-357, score-0.133]

99 Graphs and Booleans: On the generation of referring expressions. [sent-389, score-0.539]

100 The use of spatial relations in referring expression generation. [sent-395, score-0.611]

