acl acl2013 acl2013-184 knowledge-graph by maker-knowledge-mining

184 acl-2013-Identification of Speakers in Novels

Source: pdf

Author: Hua He ; Denilson Barbosa ; Grzegorz Kondrak

Abstract: Speaker identification is the task of attributing utterances to characters in a literary narrative. It is challenging to automate because the speakers of the majority of utterances are not explicitly identified in novels. In this paper, we present a supervised machine learning approach for the task that incorporates several novel features. The experimental results show that our method is more accurate and general than previous approaches to the problem.

Reference: text

Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract Speaker identification is the task of attributing utterances to characters in a literary narrative. [sent-3, score-0.61]

2 It is challenging to automate because the speakers of the majority of utterances are not explicitly identified in novels. [sent-4, score-0.259]

3 In spite of a frequently expressed opinion that all novels are simply variations of a certain number of basic plots (Tobias, 2012), every novel has a unique plot (or several plots) and a different set of characters. [sent-8, score-0.297]

4 A precondition for understanding the relationship between characters and plot development in a novel is the identification of speakers behind all utterances. [sent-11, score-0.465]

5 However, the majority of utterances are not explicitly tagged with speaker names, as is the case in stage plays and film scripts. [sent-12, score-0.925]

6 Since manual annotation of novels is costly, a system for automatically determining speakers of utterances would facilitate other tasks related to ... (‡Department of Computing Science, University of Alberta) [sent-14, score-0.699]

7 In this paper, we investigate the task of speaker identification in novels. [sent-20, score-0.637]

8 Since every novel has its own set of characters, speaker identification cannot be formulated as a straightforward tagging problem with a universal set of fixed tags. [sent-22, score-0.684]

9 We propose several novel features, including the speaker alternation pattern, the presence of vocatives in utterances, and unsupervised actor-topic features that associate speakers with utterances on the basis of their content. [sent-26, score-1.359]

10 2 Related Work Previous work on speaker identification includes both rule-based and machine-learning approaches. [sent-34, score-0.637]

11 The speech-verb-actor pattern is applied to the utterance, and the speaker is chosen from the available candidates on the basis of a scoring scheme. [sent-37, score-0.584]

12 They manually define 19 variations of frequent speaker patterns, and identify a total of 35 candidate speech verbs. [sent-39, score-0.616]

13 Elson and McKeown (2010) (henceforth referred to as EM2010) apply the supervised machine learning paradigm to a corpus of utterances extracted from novels. [sent-41, score-0.335]

14 They construct a single feature vector for each pair of an utterance and a speaker candidate, and experiment with various WEKA classifiers and score-combination methods. [sent-42, score-0.819]

15 To identify the speaker of a given utterance, they assume that all previous utterances are already correctly assigned to their speakers. [sent-43, score-0.888]

16 Our approach differs in considering the utterances in a sequence, rather than independently from each other, and in removing the unrealistic assumption that the previous utterances are correctly identified. [sent-44, score-0.67]

17 The speaker identification task has also been investigated in other domains. [sent-45, score-0.637]

18 (2010) implement a rule-based system to enrich German cabinet protocols with automatic speaker attribution. [sent-53, score-0.586]

19 An utterance is a connected text that can be attributed to a single speaker. [sent-57, score-0.3]

20 Our task is to associate each utterance with a single speaker. [sent-58, score-0.266]

21 Utterances that are attributable to more than one speaker are rare; in such cases, we accept correctly identifying one of the speakers as sufficient. [sent-59, score-0.742]

22 In some cases, an utterance may include more than one quotation-delimited sequence of words, as in the following example. [sent-60, score-0.266]

23 ” In this case, the words said Jane are simply a speaker tag inserted into the middle of the quoted sentence. [sent-62, score-0.629]

24 We assume that all utterances within a paragraph can be attributed to a single speaker. [sent-64, score-0.423]

25 This “one speaker per paragraph” property is rarely violated in novels: we identified only five such cases in Pride & Prejudice, usually involving one character citing another, or characters reading letters containing quotations. [sent-65, score-0.923]

26 We further assume that each utterance is contained within a single paragraph. [sent-67, score-0.266]

27 The term dialogue denotes a series of utterances together with related narratives, which provide the context of conversations. [sent-70, score-0.375]

28 We define a dialogue as a series of utterances and intervening narratives, with no more than three continuous narratives. [sent-71, score-0.375]

29 The rationale here is that more than three narratives without any utterances are likely to signal the end of a particular dialogue. [sent-72, score-0.374]
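The dialogue definition above can be sketched as a single segmentation pass. This is a minimal illustration, not the authors' implementation: the `"U"`/`"N"` paragraph labels and the name `split_dialogues` are assumptions made for the example, and "no more than three continuous narratives" is read here as allowing up to three consecutive narrative paragraphs between two utterances of the same dialogue.

```python
from typing import List

MAX_NARRATIVE_RUN = 3  # from the definition: at most three continuous narratives


def split_dialogues(kinds: List[str]) -> List[List[int]]:
    """Group paragraph indices into dialogues.

    kinds[i] is "U" for an utterance paragraph and "N" for a narrative
    paragraph (a hypothetical encoding chosen for this sketch).
    """
    dialogues: List[List[int]] = []
    current: List[int] = []  # paragraphs of the dialogue being built
    pending: List[int] = []  # narratives seen since the last utterance

    for i, kind in enumerate(kinds):
        if kind != "U":
            pending.append(i)
            continue
        if current and len(pending) <= MAX_NARRATIVE_RUN:
            current.extend(pending)    # short gap: the same dialogue continues
        elif current:
            dialogues.append(current)  # long gap: close the previous dialogue
            current = []
        current.append(i)
        pending = []

    if current:
        dialogues.append(current)
    return dialogues
```

In this sketch, narrative paragraphs are kept only when they fall between two utterances of the same dialogue; a real system would probably also keep some leading and trailing narration as context.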

30 We distinguish three types of utterances, which are listed with examples in Table 1: explicit speaker (identified by name within the paragraph), anaphoric speaker (identified by an anaphoric expression), and implicit speaker (no speaker information within the paragraph). [sent-73, score-2.576]

31 Typically, the majority of utterances belong to the implicit-speaker category. [sent-74, score-0.335]

32 In Pride & Prejudice only roughly 25% of the utterances have explicit speakers, and an even smaller 15% belong to the anaphoric-speaker category. [sent-75, score-0.401]

33 4 Speaker Identification In this section, we describe our method of extracting explicit speakers, and our ranking approach, which is designed to capture the speaker alternation pattern. [sent-77, score-0.732]

34 1 Extracting Speakers We extract explicit speakers by focusing on the speech verbs that appear before, after, or between quotations. [sent-79, score-0.257]

35 If a verb from the above short list cannot be found, any verb that is preceded by a name or a personal pronoun in the vicinity of the utterance is selected as the speech verb. [sent-81, score-0.435]

36 Once an explicit speaker’s name or an anaphoric expression is located, we determine the corresponding gender information by referring to the character list or by following straightforward rules to handle the anaphora. [sent-87, score-0.399]

37 For example, if the utterance is followed by the phrase she said, we infer that the gender of the speaker is female. [sent-88, score-0.916]
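The extraction step described above (speech verbs next to quotations, then gender from the character list or the pronoun) might look roughly like the following. This is a hedged sketch only: the verb list `SPEECH_VERBS` is a made-up stand-in because the paper's list is not reproduced on this page, and `speaker_tag` and the `characters` mapping (name to gender) are hypothetical names.

```python
import re
from typing import Dict, Optional, Tuple

# Hypothetical short list; the paper's actual speech-verb list is not shown here.
SPEECH_VERBS = {"said", "replied", "cried", "asked", "answered"}
PRONOUN_GENDER = {"he": "male", "she": "female"}
QUOTATION = re.compile(r'["“][^"”]*["”]')


def speaker_tag(
    paragraph: str, characters: Dict[str, str]
) -> Tuple[str, Optional[str], Optional[str]]:
    """Return (utterance_type, speaker, gender) for one paragraph."""
    # Keep only the text before, after, or between quotations.
    narration = QUOTATION.sub(" ", paragraph)
    tokens = re.findall(r"[A-Za-z]+", narration)
    for i, token in enumerate(tokens):
        if token.lower() not in SPEECH_VERBS:
            continue
        # Inspect the word right before and right after the speech verb.
        for j in (i - 1, i + 1):
            if not 0 <= j < len(tokens):
                continue
            word = tokens[j]
            if word in characters:
                return "explicit", word, characters[word]
            if word.lower() in PRONOUN_GENDER:
                return "anaphoric", word.lower(), PRONOUN_GENDER[word.lower()]
    return "implicit", None, None
```

For instance, `speaker_tag('"Where is he?" she said.', {"Jane": "female"})` classifies the paragraph as an anaphoric-speaker utterance with a female speaker, mirroring the "she said" example. Multi-word names such as "Mr. Bennet" are deliberately out of scope for this sketch.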

38 2 Ranking Model In spite of the highly sequential nature of the chains of utterances, the speaker identification task is difficult to model as sequential prediction. [sent-90, score-0.709]

39 Although the sequential information is not directly modeled with tags, our system is able to indirectly utilize the speaker alternation pattern using the method described in the following section. [sent-94, score-0.77]

40 3 Speaker Alternation Pattern The speaker alternation pattern is often employed by authors in dialogues between two characters. [sent-97, score-0.744]

41 After the speakers are identified explicitly at the beginning of a dialogue, the remaining odd-numbered and even-numbered utterances are attributable to the first and second speaker, respectively. [sent-98, score-0.59]

42 Based on the speaker alternation pattern, we make the following two observations: 1. [sent-100, score-0.666]

43 The speaker of the n-th utterance in a dialogue is likely to be the same as the speaker of the (n − 2)-th utterance. [sent-103, score-1.412]

44 Our ranking model incorporates the speaker alternation pattern by utilizing a feature expansion scheme. [sent-104, score-0.727]

45 Features are set for the utterances in the range [n − 2, n + 1] if the corresponding explicit speaker matches the candidate speaker of the current utterance. [sent-110, score-1.535]
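The speaker alternation pattern can be illustrated in two small pieces. The sketch below is an assumption-laden simplification rather than the paper's ranking model: `propagate_alternation` applies the (n − 2) observation as a hard rule, and `neighbor_match_features` shows one plausible reading of the feature expansion over the range [n − 2, n + 1], where a feature fires when a neighboring utterance's explicit speaker equals the candidate. Both function names are hypothetical.

```python
from typing import List, Optional


def propagate_alternation(speakers: List[Optional[str]]) -> List[Optional[str]]:
    """Fill unknown speakers (None) with the speaker of utterance n - 2."""
    filled = list(speakers)
    for n in range(2, len(filled)):
        if filled[n] is None:
            filled[n] = filled[n - 2]
    return filled


def neighbor_match_features(
    explicit: List[Optional[str]], n: int, candidate: str
) -> List[int]:
    """Binary features for offsets -2, -1, 0, +1 around utterance n.

    A feature is 1 when the utterance at that offset has an explicit
    speaker equal to the candidate, and 0 otherwise.
    """
    features = []
    for offset in (-2, -1, 0, 1):
        i = n + offset
        known = explicit[i] if 0 <= i < len(explicit) else None
        features.append(int(known is not None and known == candidate))
    return features
```

The hard rule breaks as soon as a third character joins the conversation, which is one reason to expose the pattern to a learned ranker as features instead of applying it directly.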

46 For example, since names of speakers are often mentioned in the vicinity of their utterances, we count the number of words separating the utterance and a name mention. [sent-116, score-0.557]

47 However, unlike EM2010, we consider only the two nearest characters in each direction, to reflect the observation that speakers tend to be mentioned by name immediately before or after their corresponding utterances. [sent-117, score-0.336]
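The distance feature can be sketched as a word count from the utterance boundary to the nearest name mentions, keeping only the two nearest characters on each side. The tokenization, the function names, and the dictionary layout below are illustrative assumptions, not details taken from the paper.

```python
from typing import Dict, List, Set


def nearest_characters(
    tokens: List[str], names: Set[str], limit: int = 2
) -> Dict[str, int]:
    """Map each of the `limit` nearest names to its distance in words.

    `tokens` must be ordered outward from the utterance, so the index of
    a name equals the number of words separating it from the utterance.
    """
    found: Dict[str, int] = {}
    for distance, token in enumerate(tokens):
        if token in names and token not in found:
            found[token] = distance
            if len(found) == limit:
                break
    return found


def distance_features(
    before: List[str], after: List[str], names: Set[str]
) -> Dict[str, Dict[str, int]]:
    """Distances to the two nearest characters in each direction."""
    return {
        "before": nearest_characters(list(reversed(before)), names),
        "after": nearest_characters(after, names),
    }
```

With the narration "said Jane" following an utterance, `Jane` gets distance 1, because one word separates the name from the quotation.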

48 Another feature is used to represent the number of appearances for speaker candidates. [sent-118, score-0.553]

49 2 Vocatives We propose a novel vocative feature, which encodes the character that is explicitly addressed in an utterance. [sent-167, score-0.251]

50 ” Intuitively, the speaker of the utterance is neither Mr. [sent-170, score-0.819]

51 Bingley nor Lizzy; however, the speaker of the next utterance is likely to be Lizzy. [sent-171, score-0.819]

52 We manually annotated vocatives in about 900 utterances from the training set. [sent-173, score-0.428]

53 About 25% of the names within utterance were tagged as vocatives. [sent-174, score-0.293]

54 We incorporate vocatives in our speaker identification system by means of three binary features that correspond to the utterances n − 1, n − 2, and n − 3. [sent-182, score-1.094]

55 The features are set if the detected vocative matches the candidate speaker of the current utterance n. [sent-184, score-0.553]

56 The gender matching feature encodes the gender agreement between a speaker candidate and the speaker of the current utterance. [sent-187, score-1.328]

57 The gender information extraction is applied to two utterance groups: the anaphoric-speaker utterances, and the explicit-speaker utterances. [sent-188, score-0.39]

58 1 to determine the gender of a speaker of the current utterance. [sent-190, score-0.65]

59 The presence matching feature indicates whether a speaker candidate is a likely participant in a dialogue. [sent-192, score-0.614]

60 Each dialogue consists of continuous utterance paragraphs together with neighboring narration paragraphs as defined in Section 3. [sent-193, score-0.483]

61 It can model dialogues in a literary text, which take place between two or more speakers conversing on different topics, as distributions over topics, which are also mixtures of the term distributions associated with multiple speakers. [sent-200, score-0.292]

62 The ACTM predicts the most likely speakers of a given utterance by considering the content of an utterance and its surrounding contexts. [sent-203, score-0.688]

63 The Actor- Topic-Term probabilities are calculated by using both the relationship of utterances and the surrounding textual clues. [sent-204, score-0.376]

64 In order to ensure high-quality speaker annotations, we developed a graphical interface (Figure 2), which displays the current utterance in context, and a list of characters in the novel. [sent-208, score-0.921]

65 After the speaker is selected by clicking a button, the text is scrolled automatically, with the next utterance highlighted in yellow. [sent-209, score-0.819]

66 For the purpose of a generalization experiment, we also utilize a corpus of utterances from the 19th and 20th century English novels compiled by EM2010. [sent-212, score-0.543]

67 Second, our data set includes annotations for all utterances in the novel, as opposed to only a subset of utterances from several novels, which are not necessarily contiguous. [sent-215, score-0.67]

68 For example, out of 308 utterances from The Steppe, 244 are in fact annotated, which raises the question whether the discarded utterances tend to be more difficult to annotate. [sent-217, score-0.67]

69 Table 4 shows the number of utterances in all 1www . [sent-218, score-0.335]

70 Table 4: The number of utterances in various data sets by the type (IS - Implicit Speaker; AS - Anaphoric Speaker; ES - Explicit Speaker). [sent-223, score-0.335]

Data set      IS    AS    ES    Total
(test)        65    29    32    126
Emma          236   55    106   397
The Steppe    93    39    112   244

71 Since our goal is to match utterances to characters rather than to name mentions, a preprocessing step is performed to produce a list of characters in the novel and their aliases. [sent-226, score-0.664]

72 We apply a name entity tagger, and then group the names into sets of character aliases, together with their gender information. [sent-228, score-0.27]

73 , 2012); however, since our focus is on speaker identification, we decided to avoid introducing annotation errors at this stage. [sent-231, score-0.553]

74 7 Evaluation In this section, we describe experiments conducted to evaluate our speaker identification approach. [sent-237, score-0.637]

75 In an attempt to reproduce the evaluation methodology of EM2010, we also test the ORACLE model, which has access to the gold-standard information about the speakers of eight neighboring utterances in the Pride & P. [sent-241, score-0.549]

76 1 Results Table 5 shows the results of the models trained on annotated utterances from Pride & Prejudice on three test sets. [sent-258, score-0.335]

77 This renders our gender feature virtually useless, and results in lower accuracy on anaphoric speakers than on explicit speakers. [sent-263, score-0.439]

78 On the other hand, Chekhov prefers to mention speaker names in the dialogues (46% of utterances are in the explicit-speaker category), which makes his prose slightly easier in terms of speaker identification. [sent-264, score-1.515]

79 The relative order of the models is the same on all three test sets, with the NEIGHBORS model consistently outperforming the INDIVIDUAL model, which indicates the importance of capturing the speaker alternation pattern. [sent-265, score-0.666]

80 Unsurprisingly, the explicit speaker is the easiest category, with nearly perfect accuracy. [sent-268, score-0.619]

81 Both the INDIVIDUAL and the NEIGHBORS models do better on anaphoric speakers than on implicit speakers, which is also expected. [sent-269, score-0.286]

82 In addition, anaphoric speaker is the least frequent of the three categories. [sent-285, score-0.643]

83 In addition, they utilize the ground truth speaker information of the preceding utterances. [sent-304, score-0.59]

84 8 Extracting Family Relationships In this section, we describe an application of the speaker identification system to the extraction of family relationships. [sent-314, score-0.735]

85 Our goal is to construct networks in which edges are labeled by the mutual relationships between characters in a novel. [sent-317, score-0.227]

86 Our approach to building a social network from the novel is to build an active database of relationships explicitly mentioned in the text, which is expanded by triggering the execution of queries that deduce implicit relations. [sent-329, score-0.264]

87 The following example illustrates how speaker identification helps in the extraction of social relations among characters. [sent-331, score-0.715]

88 ” If the speakers are correctly identified, the utterances are attributed to Mr. [sent-336, score-0.525]

89 Furthermore, the second utterance implies that its speaker is the wife of the preceding speaker. [sent-339, score-0.875]

90 Table Character contains all characters in the book, each with a unique identifier and gender information, while Table Relation contains all relationships that are explicitly mentioned in the text or derived through reasoning. [sent-346, score-0.325]

91 The rule derives a new relationship indicating that character c1 is the wife of character c2 if it is known (through an explicit mention in the text) that c2 is the husband of c1. [sent-348, score-0.325]
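The husband/wife rule can be illustrated with a small fixpoint loop over relation triples. This sketch replaces the active database and its triggered queries with an in-memory set; the triple layout `(c1, relation, c2)`, read as "c1 is the relation of c2", and the names `INVERSE` and `derive_relations` are assumptions made for the example.

```python
from typing import Set, Tuple

Relation = Tuple[str, str, str]  # (c1, relation, c2): "c1 is the <relation> of c2"

# Inverse pairs used by the rule; only the husband/wife pair comes from the text.
INVERSE = {"husband": "wife", "wife": "husband"}


def derive_relations(explicit: Set[Relation]) -> Set[Relation]:
    """Expand explicit relations with derived inverses until nothing changes."""
    relations = set(explicit)
    changed = True
    while changed:
        changed = False
        for c1, relation, c2 in list(relations):
            inverse = INVERSE.get(relation)
            if inverse and (c2, inverse, c1) not in relations:
                relations.add((c2, inverse, c1))
                changed = True
    return relations
```

Starting from the single explicit fact `("Mr. Bennet", "husband", "Mrs. Bennet")`, the loop adds `("Mrs. Bennet", "wife", "Mr. Bennet")`. Gender checks against the Character table are omitted here for brevity.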

92 In our experiment with Pride & Prejudice, a total of 55 explicitly indicated relationships were automatically identified once the utterances were attributed to the characters. [sent-352, score-0.524]

93 9 Conclusion and Future Work We have presented a novel approach to identifying speakers of utterances in novels. [sent-356, score-0.538]

94 Our system incorporates a variety of novel features which utilize vocatives, unsupervised actor-topic models, and the speaker alternation pattern. [sent-357, score-0.809]

95 The extraction of social networks in novels that we discussed in Section 8 would benefit from the introduction of additional inference rules, and could be extended to capture more subtle notions of sentiment or relationship among characters, as well as their development over time. [sent-362, score-0.326]

96 We have demonstrated that speaker identification can help extract family relationships, but the converse is also true. [sent-363, score-0.708]

97 ” In order to deduce the speaker of the utterance, we need to combine the three pieces of information: (a) the utterance is addressed to Lizzy (vocative prediction), (b) the utterance is produced by Lizzy’s father (pronoun resolution), and (c) Mr. [sent-365, score-1.12]

98 Similarly, in the task of compiling a list of characters, which involves resolving aliases such as Caroline, Caroline Bingley, and Miss Bingley, simultaneous extraction of family relationships would help detect the ambiguity of Miss Bennet, which can refer to any of several sisters. [sent-367, score-0.241]

99 A joint approach to resolving speaker attribution, relationship extraction, co-reference resolution, and alias-to-character mapping would not only improve the accuracy on all these tasks, but also represent a step towards deeper understanding of complex plots and stories. [sent-368, score-0.668]

100 A naive salience-based method for speaker identification in fiction books. [sent-408, score-0.637]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('speaker', 0.553), ('utterances', 0.335), ('utterance', 0.266), ('pride', 0.214), ('novels', 0.171), ('speakers', 0.156), ('lizzy', 0.13), ('steppe', 0.13), ('bennet', 0.121), ('emma', 0.121), ('prejudice', 0.115), ('alternation', 0.113), ('characters', 0.102), ('vocative', 0.099), ('gender', 0.097), ('actm', 0.093), ('bingley', 0.093), ('vocatives', 0.093), ('anaphoric', 0.09), ('literary', 0.089), ('relationships', 0.089), ('identification', 0.084), ('elson', 0.082), ('name', 0.078), ('austen', 0.074), ('family', 0.071), ('miss', 0.068), ('character', 0.068), ('neighbors', 0.067), ('explicit', 0.066), ('oracle', 0.063), ('yes', 0.059), ('neighboring', 0.058), ('chekhov', 0.056), ('denilson', 0.056), ('wife', 0.056), ('aliases', 0.054), ('paragraph', 0.054), ('social', 0.051), ('dear', 0.049), ('jane', 0.048), ('dialogues', 0.047), ('novel', 0.047), ('principal', 0.046), ('quotations', 0.046), ('plots', 0.044), ('chapters', 0.043), ('relationship', 0.041), ('paragraphs', 0.041), ('said', 0.04), ('dialogue', 0.04), ('implicit', 0.04), ('narratives', 0.039), ('celikyilmaz', 0.038), ('kondrak', 0.038), ('utilize', 0.037), ('isasestotal', 0.037), ('krestel', 0.037), ('makazhanov', 0.037), ('narration', 0.037), ('ofutterances', 0.037), ('salamin', 0.037), ('ualbe', 0.037), ('explicitly', 0.037), ('quoted', 0.036), ('sequential', 0.036), ('networks', 0.036), ('grzegorz', 0.035), ('father', 0.035), ('speech', 0.035), ('plot', 0.035), ('caroline', 0.034), ('elizabeth', 0.034), ('attribution', 0.034), ('attributed', 0.034), ('presence', 0.033), ('attributable', 0.033), ('cabinet', 0.033), ('sarmento', 0.033), ('typographical', 0.033), ('pattern', 0.031), ('barbosa', 0.03), ('vicinity', 0.03), ('glass', 0.03), ('accuracy', 0.03), ('incorporates', 0.03), ('features', 0.029), ('identified', 0.029), ('candidate', 0.028), ('names', 0.027), ('quotes', 0.027), ('andrews', 0.027), ('extraction', 0.027), ('extracting', 0.027), ('pouliquen', 0.026), ('bethard', 0.026), 
('conventions', 0.026), ('husband', 0.026), ('individual', 0.026), ('pronoun', 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000006 184 acl-2013-Identification of Speakers in Novels

Author: Hua He ; Denilson Barbosa ; Grzegorz Kondrak

2 0.20898114 190 acl-2013-Implicatures and Nested Beliefs in Approximate Decentralized-POMDPs

Author: Adam Vogel ; Christopher Potts ; Dan Jurafsky

Abstract: Conversational implicatures involve reasoning about multiply nested belief structures. This complexity poses significant challenges for computational models of conversation and cognition. We show that agents in the multi-agent Decentralized-POMDP reach implicature-rich interpretations simply as a by-product of the way they reason about each other to maximize joint utility. Our simulations involve a reference game of the sort studied in psychology and linguistics as well as a dynamic, interactional scenario involving implemented artificial agents.

3 0.1701894 282 acl-2013-Predicting and Eliciting Addressee's Emotion in Online Dialogue

Author: Takayuki Hasegawa ; Nobuhiro Kaji ; Naoki Yoshinaga ; Masashi Toyoda

Abstract: While there have been many attempts to estimate the emotion of an addresser from her/his utterance, few studies have explored how her/his utterance affects the emotion of the addressee. This has motivated us to investigate two novel tasks: predicting the emotion of the addressee and generating a response that elicits a specific emotion in the addressee’s mind. We target Japanese Twitter posts as a source of dialogue data and automatically build training data for learning the predictors and generators. The feasibility of our approaches is assessed by using 1099 utterance-response pairs that are built by five human workers.

4 0.12494691 379 acl-2013-Utterance-Level Multimodal Sentiment Analysis

Author: Veronica Perez-Rosas ; Rada Mihalcea ; Louis-Philippe Morency

Abstract: During real-life interactions, people are naturally gesturing and modulating their voice to emphasize specific points or to express their emotions. With the recent growth of social websites such as YouTube, Facebook, and Amazon, video reviews are emerging as a new source of multimodal and natural opinions that has been left almost untapped by automatic opinion analysis techniques. This paper presents a method for multimodal sentiment classification, which can identify the sentiment expressed in utterance-level visual datastreams. Using a new multimodal dataset consisting of sentiment annotated utterances extracted from video reviews, we show that multimodal sentiment analysis can be effectively performed, and that the joint use of visual, acoustic, and linguistic modalities can lead to error rate reductions of up to 10.5% as compared to the best performing individual modality.

5 0.11883128 49 acl-2013-An annotated corpus of quoted opinions in news articles

Author: Tim O'Keefe ; James R. Curran ; Peter Ashwell ; Irena Koprinska

Abstract: Quotes are used in news articles as evidence of a person’s opinion, and thus are a useful target for opinion mining. However, labelling each quote with a polarity score directed at a textually-anchored target can ignore the broader issue that the speaker is commenting on. We address this by instead labelling quotes as supporting or opposing a clear expression of a point of view on a topic, called a position statement. Using this we construct a corpus covering 7 topics with 2,228 quotes.

6 0.11763806 79 acl-2013-Character-to-Character Sentiment Analysis in Shakespeare's Plays

7 0.10396747 311 acl-2013-Semantic Neighborhoods as Hypergraphs

8 0.090632737 90 acl-2013-Conditional Random Fields for Responsive Surface Realisation using Global Features

9 0.085130297 197 acl-2013-Incremental Topic-Based Translation Model Adaptation for Conversational Spoken Language Translation

10 0.079925045 315 acl-2013-Semi-Supervised Semantic Tagging of Conversational Understanding using Markov Topic Regression

11 0.071938016 121 acl-2013-Discovering User Interactions in Ideological Discussions

12 0.07102605 252 acl-2013-Multigraph Clustering for Unsupervised Coreference Resolution

13 0.068001717 203 acl-2013-Is word-to-phone mapping better than phone-phone mapping for handling English words?

14 0.066924907 255 acl-2013-Name-aware Machine Translation

15 0.062517345 220 acl-2013-Learning Latent Personas of Film Characters

16 0.060636424 373 acl-2013-Using Conceptual Class Attributes to Characterize Social Media Users

17 0.059655719 129 acl-2013-Domain-Independent Abstract Generation for Focused Meeting Summarization

18 0.058040623 312 acl-2013-Semantic Parsing as Machine Translation

19 0.057423603 185 acl-2013-Identifying Bad Semantic Neighbors for Improving Distributional Thesauri

20 0.057272337 193 acl-2013-Improving Chinese Word Segmentation on Micro-blog Using Rich Punctuations

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.156), (1, 0.071), (2, -0.033), (3, 0.009), (4, -0.006), (5, -0.0), (6, 0.024), (7, -0.011), (8, 0.022), (9, 0.058), (10, -0.064), (11, 0.012), (12, -0.033), (13, 0.027), (14, -0.069), (15, -0.069), (16, -0.039), (17, 0.064), (18, 0.032), (19, -0.045), (20, -0.142), (21, -0.129), (22, 0.105), (23, -0.024), (24, -0.018), (25, 0.154), (26, 0.083), (27, -0.017), (28, 0.061), (29, 0.059), (30, 0.042), (31, -0.024), (32, 0.035), (33, 0.042), (34, 0.014), (35, -0.002), (36, 0.095), (37, -0.032), (38, 0.028), (39, 0.08), (40, -0.047), (41, 0.031), (42, 0.037), (43, 0.047), (44, 0.075), (45, 0.012), (46, 0.083), (47, -0.1), (48, -0.001), (49, -0.24)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94398695 184 acl-2013-Identification of Speakers in Novels

Author: Hua He ; Denilson Barbosa ; Grzegorz Kondrak

2 0.81433976 190 acl-2013-Implicatures and Nested Beliefs in Approximate Decentralized-POMDPs

Author: Adam Vogel ; Christopher Potts ; Dan Jurafsky

3 0.66480178 282 acl-2013-Predicting and Eliciting Addressee's Emotion in Online Dialogue

Author: Takayuki Hasegawa ; Nobuhiro Kaji ; Naoki Yoshinaga ; Masashi Toyoda

4 0.61837071 203 acl-2013-Is word-to-phone mapping better than phone-phone mapping for handling English words?

Author: Naresh Kumar Elluru ; Anandaswarup Vadapalli ; Raghavendra Elluru ; Hema Murthy ; Kishore Prahallad

Abstract: In this paper, we relook at the problem of pronunciation of English words using native phone set. Specifically, we investigate methods of pronouncing English words using Telugu phoneset in the context of Telugu Text-to-Speech. We compare phone-phone substitution and word-phone mapping for pronunciation of English words using Telugu phones. We are not considering other than native language phoneset in all our experiments. This differentiates our approach from other works in polyglot speech synthesis.

5 0.59066534 90 acl-2013-Conditional Random Fields for Responsive Surface Realisation using Global Features

Author: Nina Dethlefs ; Helen Hastie ; Heriberto Cuayahuitl ; Oliver Lemon

Abstract: Surface realisers in spoken dialogue systems need to be more responsive than conventional surface realisers. They need to be sensitive to the utterance context as well as robust to partial or changing generator inputs. We formulate surface realisation as a sequence labelling task and combine the use of conditional random fields (CRFs) with semantic trees. Due to their extended notion of context, CRFs are able to take the global utterance context into account and are less constrained by local features than other realisers. This leads to more natural and less repetitive surface realisation. It also allows generation from partial and modified inputs and is therefore applicable to incremental surface realisation. Results from a human rating study confirm that users are sensitive to this extended notion of context and assign ratings that are significantly higher (up to 14%) than those for taking only local context into account.

6 0.51787466 30 acl-2013-A computational approach to politeness with application to social factors

7 0.51195675 239 acl-2013-Meet EDGAR, a tutoring agent at MONSERRATE

8 0.50919163 141 acl-2013-Evaluating a City Exploration Dialogue System with Integrated Question-Answering and Pedestrian Navigation

9 0.47733039 364 acl-2013-Typesetting for Improved Readability using Lexical and Syntactic Information

10 0.4717086 278 acl-2013-Patient Experience in Online Support Forums: Modeling Interpersonal Interactions and Medication Use

11 0.46351096 63 acl-2013-Automatic detection of deception in child-produced speech using syntactic complexity features

12 0.45765376 311 acl-2013-Semantic Neighborhoods as Hypergraphs

13 0.45622507 86 acl-2013-Combining Referring Expression Generation and Surface Realization: A Corpus-Based Investigation of Architectures

14 0.44468117 209 acl-2013-Joint Modeling of News Reader's and Comment Writer's Emotions

15 0.43420649 337 acl-2013-Tag2Blog: Narrative Generation from Satellite Tag Data

16 0.42788133 379 acl-2013-Utterance-Level Multimodal Sentiment Analysis

17 0.41216078 36 acl-2013-Adapting Discriminative Reranking to Grounded Language Learning

18 0.4030987 340 acl-2013-Text-Driven Toponym Resolution using Indirect Supervision

19 0.40109843 49 acl-2013-An annotated corpus of quoted opinions in news articles

20 0.39641714 287 acl-2013-Public Dialogue: Analysis of Tolerance in Online Discussions

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.042), (6, 0.032), (11, 0.039), (15, 0.013), (24, 0.48), (26, 0.052), (35, 0.067), (42, 0.03), (48, 0.031), (70, 0.046), (88, 0.027), (90, 0.02), (95, 0.045)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.98130208 29 acl-2013-A Visual Analytics System for Cluster Exploration

Author: Andreas Lamprecht ; Annette Hautli ; Christian Rohrdantz ; Tina Bogel

Abstract: This paper offers a new way of representing the results of automatic clustering algorithms by employing a Visual Analytics system which maps members of a cluster and their distance to each other onto a two-dimensional space. A case study on Urdu complex predicates shows that the system allows for an appropriate investigation of linguistically motivated data. 1 Motivation In recent years, Visual Analytics systems have increasingly been used for the investigation of linguistic phenomena in a number of different areas, starting from literary analysis (Keim and Oelke, 2007) to the cross-linguistic comparison of language features (Mayer et al., 2010a; Mayer et al., 2010b; Rohrdantz et al., 2012a) and lexical semantic change (Rohrdantz et al., 2011; Heylen et al., 2012; Rohrdantz et al., 2012b). Visualization has also found its way into the field of computational linguistics by providing insights into methods such as machine translation (Collins et al., 2007; Albrecht et al., 2009) or discourse parsing (Zhao et al., 2012). One issue in computational linguistics is the interpretability of results coming from machine learning algorithms and the lack of insight they offer on the underlying data. This drawback often prevents theoretical linguists, who work with computational models and need to see patterns on large data sets, from drawing detailed conclusions. The present paper shows that a Visual Analytics system facilitates “analytical reasoning [...] by an interactive visual interface” (Thomas and Cook, 2006) and helps resolving this issue by offering a customizable, in-depth view on the statistically generated result and simultaneously an at-a-glance overview of the overall data set. In particular, we focus on the visual representation of automatically generated clusters, in itself not a novel idea as it has been applied in other fields like the financial sector, biology or geography (Schreck et al., 2009).
But as far as the literature is concerned, interactive systems are still less common, particularly in computational linguistics, and they have not been designed for the specific needs of theoretical linguists. This paper offers a method of visually encoding clusters and their internal coherence with an interactive user interface, which allows users to adjust underlying parameters and their views on the data depending on the particular research question. By this, we partly open up the “black box” of machine learning. The linguistic phenomenon under investigation, for which the system has originally been designed, is the varied behavior of nouns in N+V CP complex predicates in Urdu (e.g., memory+do = ‘to remember’) (Mohanan, 1994; Ahmed and Butt, 2011), where, depending on the lexical semantics of the noun, a set of different light verbs is chosen to form a complex predicate. The aim is an automatic detection of the different groups of nouns, based on their light verb distribution. Butt et al. (2012) present a static visualization for the phenomenon, whereas the present paper proposes an interactive system which alleviates some of the previous issues with respect to noise detection, filtering, data interaction and cluster coherence. For this, we proceed as follows: section 2 explains the proposed Visual Analytics system, followed by the linguistic case study in section 3. Section 4 concludes the paper. 
2 The system

The system requires a plain text file as input, where each line corresponds to one data object. In our case, each line corresponds to one Urdu noun (data object) and contains its unique ID (the name of the noun) and its bigram frequencies with the four light verbs under investigation, namely kar ‘do’, ho ‘be’, hu ‘become’ and rakH ‘put’; an exemplary input file is shown in Figure 1.

Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 109–114, Sofia, Bulgaria, August 4-9 2013. © 2013 Association for Computational Linguistics

From a data analysis perspective, we have four-dimensional data objects, where each dimension corresponds to a bigram frequency previously extracted from a corpus. Note that more than four dimensions can be loaded and analyzed, but for the sake of simplicity we focus on the four-dimensional Urdu example for the remainder of this paper. Moreover, it is possible to load files containing either absolute or relative bigram frequencies. When loading absolute frequencies, the program will automatically calculate the relative frequencies, as they are the input for the clustering. The absolute frequencies, however, are still available and can be used for further processing (e.g. filtering).

Figure 1: Preview of appropriate file structures

2.1 Initial opening and processing of a file

It is necessary to define a metric distance function between data objects for both clustering and visualization. Thus, each data object is represented by a high-dimensional (in our example four-dimensional) numerical vector and we use the Euclidean distance to calculate the distances between pairs of data objects. The smaller the distance between two data objects, the more similar they are. For visualization, the high-dimensional data is projected onto the two-dimensional space of a computer screen using a principal component analysis (PCA) algorithm (footnote 1).
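The preprocessing described above (relative frequencies, Euclidean distance, PCA projection to 2D) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the paper uses a Java PCA library, whereas this sketch approximates the two leading principal axes by power iteration, and all function names and the sample counts are hypothetical.

```python
import math

def relative_frequencies(counts):
    """Convert one noun's absolute bigram counts into relative frequencies."""
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)

def euclidean(a, b):
    """Euclidean distance between two equally long numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def _mat_vec(m, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

def _top_eigenvector(m, iterations=500):
    """Power iteration (assumed stand-in for a real eigen solver)."""
    d = len(m)
    v = [1.0 / (i + 1) for i in range(d)]  # fixed, non-uniform start vector
    for _ in range(iterations):
        w = _mat_vec(m, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-12:        # no variance left along any direction
            return [0.0] * d
        v = [x / norm for x in w]
    return v

def pca_2d(rows):
    """Project row vectors onto their two leading principal axes."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / max(n - 1, 1)
            for j in range(d)] for i in range(d)]
    axes = []
    for _ in range(2):
        v = _top_eigenvector(cov)
        lam = sum(vi * wi for vi, wi in zip(v, _mat_vec(cov, v)))
        axes.append(v)
        # Deflation: remove the component just found before the next pass.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    return [tuple(sum(cj * aj for cj, aj in zip(c, axis)) for axis in axes)
            for c in centered]

# Hypothetical absolute counts for three nouns with kar, ho, hu, rakH.
counts = {"noun_a": [12, 3, 0, 1], "noun_b": [0, 7, 5, 0], "noun_c": [4, 4, 1, 1]}
vectors = [relative_frequencies(c) for c in counts.values()]
for noun, point in zip(counts, pca_2d(vectors)):
    print(noun, point)
```

Because the projection discards dimensions, distances between the printed 2D points are in general only approximations of the distances returned by `euclidean` on the original vectors.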
In the 2D projection, the distances between data objects in the high-dimensional space, i.e. the dissimilarities of the bigram distributions, are preserved as accurately as possible. However, when projecting a high-dimensional data space onto a lower dimension, some distinctions necessarily level out: two data objects may be far apart in the high-dimensional space, but end up close together in the 2D projection. It is important to bear in mind that the 2D visualization is often quite insightful, but interpretations have to be verified by interactively investigating the data.

Footnote 1: http://workshop.mkobos.com/2011/java-pca-transformation-library/

The initial clusters are calculated (in the high-dimensional data space) using a default k-Means algorithm (footnote 2), with k being a user-defined parameter. There is also the option of selecting another clustering algorithm, called Greedy Variance Minimization (GVM; footnote 3), and an extension to include further algorithms is under development.

2.2 Configuration & Interaction

2.2.1 The main window

The main window in Figure 2 consists of three areas, namely the configuration area (a), the visualization area (b) and the description area (c). The visualization area is mainly built with the piccolo2d library (footnote 4) and initially shows data objects as colored circles with a variable diameter, where color indicates cluster membership (four clusters in this example). Hovering over a dot displays information on the particular noun, the cluster membership and the light verb distribution in the description area to the right. By using the mouse wheel, the user can zoom in and out of the visualization. A very important feature for the task at hand is the possibility to select multiple data objects for further processing or for filtering, with a list of selected data objects shown in the description area. By right-clicking on these data objects, the user can assign a unique class (and class color) to them.
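The initial clustering step from Section 2.1 can be illustrated with a plain Lloyd-style k-means over the relative-frequency vectors. This is a generic sketch under stated assumptions, not the Java-ML routine the system calls: the function name, the seeded random initialization and the stopping rule are all choices made for this example.

```python
import math
import random

def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm: returns (cluster label per point, centroids)."""
    rng = random.Random(seed)
    # Assumed initialization: k distinct data points picked at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid
        # (Euclidean distance, computed in the original space, not in 2D).
        new_labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:   # assignments stable, stop early
            break
        labels = new_labels
    return labels, centroids

# Hypothetical relative frequencies with kar, ho, hu, rakH for four nouns.
nouns = [[1.0, 0.0, 0.0, 0.0], [0.9, 0.1, 0.0, 0.0],
         [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.9, 0.1]]
labels, centroids = kmeans(nouns, k=2)
print(labels)
print(centroids)
```

Each centroid is itself a light verb distribution, which is what makes the centroid display discussed later in the paper linguistically interpretable.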
Different clustering methods can be employed using the options item in the menu bar. Another feature of the system is that the user can fade in the cluster centroids (illustrated by a larger dot in the respective cluster color in Figure 2), where the overall feature distribution of the cluster can be examined in a tooltip hovering over the corresponding centroid.

Footnote 2: http://java-ml.sourceforge.net/api/0.1.7/ (from the JML library)
Footnote 3: http://www.tomgibara.com/clustering/fast-spatial/
Footnote 4: http://www.piccolo2d.org/

Figure 2: Overview of the main window of the system, including the configuration area (a), the visualization area (b) and the description area (c). Large circles are cluster centroids.

2.2.2 Visually representing data objects

To gain further insight into the data distribution based on the 2D projection, the user can choose between several ways to visualize the individual data objects, all of which are shown in Figure 3. The standard visualization type is shown on the left and consists of a circle which encodes cluster membership via color.

Figure 3: Different visualizations of data points

Alternatively, normal glyphs and star glyphs can be displayed. The middle part of Figure 3 shows the data displayed with normal glyphs. [...] clockwise around the center according to their occurrence in the input file. This view has the advantage that overall feature dominance in a cluster can be seen at a glance. The visualization type on the right in Figure 3 [...] endings are connected, forming a “star”. As in the representation with the glyphs, this makes similar data objects easily recognizable and comparable with each other.

2.2.3 Filtering options

Our system offers options for filtering data according to different criteria.
Filter by means of bigram occurrence: By activating the bigram occurrence filtering, it is possible to show only those nouns which occur in bigrams with a selected subset of the features (light verbs). This is especially useful when examining possible commonalities.

Filter selected words: Another way of showing only items of interest is to select and display them separately. The PCA is recalculated for these data objects and the visualization is stretched to the whole area.

Filter selected cluster: Additionally, the user can visualize a specific cluster of interest. Again, the PCA is recalculated and the visualization stretched to the whole area. The cluster can then be manually fine-tuned and cleaned, for instance by removing wrongly assigned items.

2.2.4 Options to handle overplotting

Due to the nature of the data, much overplotting occurs. For example, there are many words which occur with only one light verb. The PCA assigns the same position to these words and, as a consequence, only the top bigram can be viewed in the visualization. In order to improve visual access to overplotted data objects, several methods that allow for a more differentiated view of the data have been included and are described in the following paragraphs.

Change transparency of data objects: By modifying the transparency with the given slider, areas with a dense data population can be readily identified, as shown in the following example: [figure]

Repositioning of data objects: To reduce the overplotting in densely populated areas, data objects can be repositioned randomly with a fixed deviation from their initial position. The degree of deviation can be determined interactively by the user with the corresponding slider: [figure] The user has the option to reposition either all data objects or only those that are selected in advance.

Frequency filtering: If the initial data contains absolute bigram frequencies, the user can filter the visualized words by frequency.
For example, many nouns occur only once and therefore have an observed probability of 100% for co-occurring with one of the light verbs. In most cases it is useful to filter such data out.

Scaling data objects: If the user zooms beyond the maximum zoom factor, the data objects are scaled down. This is especially useful if data objects are only partly covered by many other objects. In this case, they become fully visible, as shown in the following example: [figure]

2.3 Alternative views on the data

In order to enable a holistic analysis, it is often valuable to provide the user with different views on the data. Consequently, we have integrated the option to explore the data with further standard visualization methods.

2.3.1 Correlation matrix

The correlation matrix in Figure 4 shows the correlations between features, which are visualized by circles using the following encoding: the size of a circle represents the correlation strength and the color indicates whether the corresponding features are negatively (white) or positively (black) correlated.

Figure 4: Example of a correlation matrix

2.3.2 Parallel coordinates

The parallel coordinates diagram shows the distribution of the bigram frequencies over the different dimensions (Figure 5). Every noun is represented by a line which, when hovered over, shows a tooltip with the most important information. To filter the visualized words, the user has the option of displaying previously selected data objects, or s/he can restrict the value range for a feature and show only the items which lie within this range.

Figure 5: Parallel coordinates diagram

2.3.3 Scatter plot matrix

To further examine the relation between pairs of features, a scatter plot matrix can be used (Figure 6). The individual scatter plots give further insight into the correlation details of pairs of features.

Figure 6: Example showing a scatter plot matrix.
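Three of the operations described in Sections 2.2.4 and 2.3.1 are simple enough to sketch directly: frequency filtering, random repositioning with a fixed deviation, and the feature correlation matrix. The sketch below is illustrative only; the function names, the threshold semantics and the use of Pearson correlation are assumptions, since the paper does not specify them.

```python
import math
import random

def filter_by_frequency(counts_by_noun, min_total):
    """Keep nouns whose summed absolute bigram count reaches min_total."""
    return {noun: counts for noun, counts in counts_by_noun.items()
            if sum(counts) >= min_total}

def reposition(point, deviation, rng):
    """Move a 2D point by exactly `deviation` in a random direction."""
    angle = rng.uniform(0.0, 2.0 * math.pi)
    return (point[0] + deviation * math.cos(angle),
            point[1] + deviation * math.sin(angle))

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either feature is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def correlation_matrix(rows):
    """Feature-by-feature correlations; rows are data objects."""
    columns = list(zip(*rows))
    return [[pearson(a, b) for b in columns] for a in columns]

# Hypothetical counts: noun_x occurs only once and is filtered out.
counts = {"noun_x": [1, 0, 0, 0], "noun_y": [6, 2, 0, 0], "noun_z": [1, 5, 2, 0]}
kept = filter_by_frequency(counts, min_total=2)
matrix = correlation_matrix([[c / sum(v) for c in v] for v in kept.values()])
# In the paper's encoding, abs(value) would set the circle size and the
# sign its color (negative: white, positive: black).
print(sorted(kept))
print(reposition((0.0, 0.0), 0.05, random.Random(1)))
```

Filtering before computing relative frequencies matters here: a noun seen once always yields a vector with a single 1.0 entry, which adds to the overplotting at the corners of the projection.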
3 Case study

In principle, the Visual Analytics system presented above can be used for any kind of cluster visualization, but the built-in options and add-ons are particularly designed for the type of work that linguists tend to be interested in: on the one hand, the user wants to get a quick overview of the overall patterns in the phenomenon, but at the same time, the system needs to allow for an in-depth data inspection. The system provides both: the overall cluster result shown in Figure 2 depicts the coherence of clusters and therefore the overall pattern of the data set. The different glyph visualizations in Figure 3 illustrate the properties of each cluster. Single data points can be inspected in the description area. The randomization of overplotted data points helps to see concentrated cluster patterns where light verbs behave very similarly in different noun+verb complex predicates.

The biggest advantage of the system lies in the ability for interaction: Figure 7 shows an example of the visualization used in Butt et al. (2012), the input being the same text file as shown in Figure 1. In that visualization, the relative frequency of each noun with each light verb is encoded by color saturation: the more saturated the color to the right of the noun, the higher the relative frequency of the light verb occurring with it. The number of the cluster (here, 3) and the respective nouns (e.g. kAm ‘work’) are shown to the left. The user does not get information on the coherence of the cluster, nor does the visualization show prototypical cluster patterns.

Figure 7: Cluster visualization in Butt et al. (2012)

Moreover, the system in Figure 7 only has a limited set of interaction choices, with the consequence that the user is not able to adjust the underlying data set, e.g. by filtering out noise. However, Butt et al. (2012) report that the Urdu data is indeed very noisy and requires a manual cleaning of the data set before the actual clustering.
In the system presented here, the user simply marks conspicuous regions in the visualization panel and removes the respective data points from the original data set. Other noise, e.g. low-frequency items which occur due to data sparsity issues, can be removed from the overall data set by adjusting the filter parameters. A linguistically relevant improvement lies in the display of cluster centroids, in other words the typical noun + light verb distribution of a cluster. This is particularly helpful when the linguist wants to pick out prototypical examples for the cluster in order to stipulate generalizations over the other cluster members.

4 Conclusion

In this paper, we present a novel Visual Analytics system that helps to automatically analyze bigrams extracted from corpora. The main purpose is to enable a more informed and steered cluster analysis than is currently possible with standard methods. This includes rich options for interaction, e.g. display configuration or data manipulation. Initially, the approach was motivated by a concrete research problem, but it has much wider applicability, as any kind of high-dimensional numerical data objects can be loaded and analyzed. However, the system still requires some basic understanding of the algorithms applied for clustering and projection in order to prevent the user from drawing wrong conclusions based on artifacts. Bearing this potential pitfall in mind when performing the analysis, the system enables a much more insightful and informed analysis than standard non-interactive methods. In the future, we aim to conduct user experiments in order to learn more about how the functionality and usability could be further enhanced.
Acknowledgments

This work was partially funded by the German Research Foundation (DFG) under grant BU 1806/7-1 “Visual Analysis of Language Change and Use Patterns” and the German Federal Ministry of Education and Research (BMBF) under grant 01461246 “VisArgue”.

References

Tafseer Ahmed and Miriam Butt. 2011. Discovering Semantic Classes for Urdu N-V Complex Predicates. In Proceedings of the International Conference on Computational Semantics (IWCS 2011), pages 305–309.
Joshua Albrecht, Rebecca Hwa, and G. Elisabeta Marai. 2009. The Chinese Room: Visualization and Interaction to Understand and Correct Ambiguous Machine Translation. Comput. Graph. Forum, 28(3):1047–1054.
Miriam Butt, Tina Bögel, Annette Hautli, Sebastian Sulger, and Tafseer Ahmed. 2012. Identifying Urdu Complex Predication via Bigram Extraction. In Proceedings of COLING 2012, Technical Papers, pages 409–424, Mumbai, India.
Christopher Collins, M. Sheelagh T. Carpendale, and Gerald Penn. 2007. Visualization of Uncertainty in Lattices to Support Decision-Making. In EuroVis 2007, pages 51–58. Eurographics Association.
Kris Heylen, Dirk Speelman, and Dirk Geeraerts. 2012. Looking at word meaning. An interactive visualization of Semantic Vector Spaces for Dutch synsets. In Proceedings of the EACL 2012 Joint Workshop of LINGVIS & UNCLH, pages 16–24.
Daniel A. Keim and Daniela Oelke. 2007. Literature Fingerprinting: A New Method for Visual Literary Analysis. In IEEE VAST 2007, pages 115–122. IEEE.
Thomas Mayer, Christian Rohrdantz, Miriam Butt, Frans Plank, and Daniel A. Keim. 2010a. Visualizing Vowel Harmony. Linguistic Issues in Language Technology, 4(2):1–33, December.
Thomas Mayer, Christian Rohrdantz, Frans Plank, Peter Bak, Miriam Butt, and Daniel A. Keim. 2010b. Consonant Co-Occurrence in Stems across Languages: Automatic Analysis and Visualization of a Phonotactic Constraint. In Proceedings of the 2010 Workshop on NLP and Linguistics: Finding the Common Ground, pages 70–78, Uppsala, Sweden, July. Association for Computational Linguistics.
Tara Mohanan. 1994. Argument Structure in Hindi. Stanford: CSLI Publications.
Christian Rohrdantz, Annette Hautli, Thomas Mayer, Miriam Butt, Frans Plank, and Daniel A. Keim. 2011. Towards Tracking Semantic Change by Visual Analytics. In ACL 2011 (Short Papers), pages 305–310, Portland, Oregon, USA, June. Association for Computational Linguistics.
Christian Rohrdantz, Michael Hund, Thomas Mayer, Bernhard Wälchli, and Daniel A. Keim. 2012a. The World’s Languages Explorer: Visual Analysis of Language Features in Genealogical and Areal Contexts. Computer Graphics Forum, 31(3):935–944.
Christian Rohrdantz, Andreas Niekler, Annette Hautli, Miriam Butt, and Daniel A. Keim. 2012b. Lexical Semantics and Distribution of Suffixes - A Visual Analysis. In Proceedings of the EACL 2012 Joint Workshop of LINGVIS & UNCLH, pages 7–15, April.
Tobias Schreck, Jürgen Bernard, Tatiana von Landesberger, and Jörn Kohlhammer. 2009. Visual cluster analysis of trajectory data with interactive Kohonen maps. Information Visualization, 8(1):14–29.
James J. Thomas and Kristin A. Cook. 2006. A Visual Analytics Agenda. IEEE Computer Graphics and Applications, 26(1):10–13.
Jian Zhao, Fanny Chevalier, Christopher Collins, and Ravin Balakrishnan. 2012. Facilitating Discourse Analysis with Interactive Visualization. IEEE Trans. Vis. Comput. Graph., 18(12):2639–2648.

2 0.9586609 74 acl-2013-Building Comparable Corpora Based on Bilingual LDA Model

Author: Zede Zhu ; Miao Li ; Lei Chen ; Zhenxin Yang

Abstract: Comparable corpora are important basic resources in cross-language information processing. However, the existing methods of building comparable corpora, which use intertranslated words and related features, cannot evaluate the topical relation between document pairs. This paper adopts the bilingual LDA model to predict the topical structures of the documents and proposes three algorithms of document similarity in different languages. Experiments show that the novel method can obtain similar documents with consistent topics and has better adaptability and more stable performance.

3 0.9568873 271 acl-2013-ParaQuery: Making Sense of Paraphrase Collections

Author: Lili Kotlerman ; Nitin Madnani ; Aoife Cahill

Abstract: Pivoting on bilingual parallel corpora is a popular approach for paraphrase acquisition. Although such pivoted paraphrase collections have been successfully used to improve the performance of several different NLP applications, it is still difficult to get an intrinsic estimate of the quality and coverage of the paraphrases contained in these collections. We present ParaQuery, a tool that helps a user interactively explore and characterize a given pivoted paraphrase collection, analyze its utility for a particular domain, and compare it to other popular lexical similarity resources, all within a single interface.

same-paper 4 0.94841152 184 acl-2013-Identification of Speakers in Novels

Author: Hua He ; Denilson Barbosa ; Grzegorz Kondrak

5 0.9281714 128 acl-2013-Does Korean defeat phonotactic word segmentation?

Author: Robert Daland ; Kie Zuraw

Abstract: Computational models of infant word segmentation have not been tested on a wide range of languages. This paper applies a phonotactic segmentation model to Korean. In contrast to the undersegmentation pattern previously found in English and Russian, the model exhibited more oversegmentation errors and more errors overall. Despite the high error rate, analysis suggested that lexical acquisition might not be problematic, provided that infants attend only to frequently segmented items.

6 0.92364568 229 acl-2013-Leveraging Synthetic Discourse Data via Multi-task Learning for Implicit Discourse Relation Recognition

7 0.87966847 72 acl-2013-Bridging Languages through Etymology: The case of cross language text categorization

8 0.82261348 244 acl-2013-Mining Opinion Words and Opinion Targets in a Two-Stage Framework

9 0.7123062 279 acl-2013-PhonMatrix: Visualizing co-occurrence constraints of sounds

10 0.65624946 377 acl-2013-Using Supervised Bigram-based ILP for Extractive Summarization

11 0.63676918 79 acl-2013-Character-to-Character Sentiment Analysis in Shakespeare's Plays

12 0.60805869 140 acl-2013-Evaluating Text Segmentation using Boundary Edit Distance

13 0.60729468 230 acl-2013-Lightly Supervised Learning of Procedural Dialog Systems

14 0.60462403 2 acl-2013-A Bayesian Model for Joint Unsupervised Induction of Sentiment, Aspect and Discourse Representations

15 0.60320246 194 acl-2013-Improving Text Simplification Language Modeling Using Unsimplified Text Data

16 0.60107076 183 acl-2013-ICARUS - An Extensible Graphical Search Tool for Dependency Treebanks

17 0.59260321 73 acl-2013-Broadcast News Story Segmentation Using Manifold Learning on Latent Topic Distributions

18 0.58253443 99 acl-2013-Crowd Prefers the Middle Path: A New IAA Metric for Crowdsourcing Reveals Turker Biases in Query Segmentation

19 0.58011335 85 acl-2013-Combining Intra- and Multi-sentential Rhetorical Parsing for Document-level Discourse Analysis

20 0.57263839 342 acl-2013-Text Classification from Positive and Unlabeled Data using Misclassified Data Correction