acl acl2011 acl2011-223 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Derya Ozkan ; Louis-Philippe Morency
Abstract: In many computational linguistic scenarios, training labels are subjective, making it necessary to acquire the opinions of multiple annotators/experts, which is referred to as “wisdom of crowds”. In this paper, we propose a new approach for modeling wisdom of crowds based on the Latent Mixture of Discriminative Experts (LMDE) model that can automatically learn the prototypical patterns and hidden dynamic among different experts. Experiments show improvement over state-of-the-art approaches on the task of listener backchannel prediction in dyadic conversations.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract In many computational linguistic scenarios, training labels are subjective, making it necessary to acquire the opinions of multiple annotators/experts, which is referred to as “wisdom of crowds”. [sent-3, score-0.054]
2 In this paper, we propose a new approach for modeling wisdom of crowds based on the Latent Mixture of Discriminative Experts (LMDE) model that can automatically learn the prototypical patterns and hidden dynamic among different experts. [sent-4, score-0.907]
3 Experiments show improvement over state-of-the-art approaches on the task of listener backchannel prediction in dyadic conversations. [sent-5, score-1.038]
4 1 Introduction In many real life scenarios, it is hard to collect the actual labels for training, because it is expensive or the labeling is subjective. [sent-6, score-0.075]
5 In simple words, wisdom of crowds enables parallel acquisition of opinions from multiple annotators/experts. [sent-8, score-0.562]
6 In this paper, we propose a new method to fuse wisdom of crowds. [sent-9, score-0.354]
7 Our approach is based on the Latent Mixture of Discriminative Experts (LMDE) model originally introduced for multimodal fusion (Ozkan et al. [sent-10, score-0.139]
8 In our Wisdom-LMDE model, a discriminative expert is trained for each crowd member. [sent-12, score-0.079]
9 The key advantage of our computational model is that it can automatically discover the prototypical patterns of experts and learn the dynamic between these patterns. [sent-13, score-0.469]
10 We validate our model on the challenging task of listener backchannel feedback prediction in dyadic conversations. [sent-15, score-1.19]
11 Backchannel feedback includes the nods and paraverbals such as “uh-huh” and “mm-hmm” that listeners produce while the speaker is talking. [sent-16, score-0.331]
12 Backchannels play a significant role in determining the nature of a social exchange by showing rapport and engagement (Gratch et al. [sent-17, score-0.202]
13 Supporting such fluid interactions has become an important topic of virtual human research. [sent-20, score-0.093]
14 In particular, backchannel feedback has received considerable interest due to its pervasiveness across languages and conversational contexts. [sent-21, score-0.515]
15 By correctly predicting backchannel feedback, virtual agents and robots can create a stronger sense of rapport. [sent-22, score-0.54]
16 What makes the backchannel prediction task well suited for our model is that listener feedback varies between people and is often optional (listeners can always decide to give feedback or not). [sent-23, score-1.222]
17 A successful computational model of backchannel must be able to learn these variations among listeners. [sent-24, score-0.503]
18 In our experiments, we validate the performance of our approach using a dataset of 43 storytelling dyadic interactions. [sent-26, score-0.161]
19 Right: Baseline models used in our experiments: a) Conditional Random Fields (CRF), b) Latent Dynamic Conditional Random Fields (LDCRF), c) CRF Mixture of Experts (no latent variable). [...] prototypical patterns for backchannel feedback. [sent-30, score-0.506]
20 By automatically identifying these prototypical patterns and learning the dynamic, our Wisdom-LMDE model outperforms the previous approaches for listener backchannel prediction. [sent-31, score-1.145]
21 1 Previous Work Several researchers have developed models to predict when backchannel should happen. [sent-33, score-0.41]
22 Ward and Tsukahara (2000) propose a unimodal approach where backchannels are associated with a region of low pitch lasting 110ms during speech. [sent-34, score-0.201]
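The low-pitch cue mentioned above can be sketched as a run-length check over a pitch track. This is only an illustration and not Ward and Tsukahara's full rule: the 10 ms frame step, the nearest-rank percentile threshold (borrowed from the 26th-percentile pitch feature listed later in this paper), the use of `None` for unvoiced frames, and the function name `low_pitch_regions` are all assumptions made here.

```python
def low_pitch_regions(pitch_hz, frame_ms=10, min_ms=110, percentile=26):
    """Return (start, end) frame pairs (end exclusive) for low-pitch runs of at least min_ms.

    pitch_hz: per-frame pitch values; None marks an unvoiced frame (assumed convention).
    """
    voiced = sorted(p for p in pitch_hz if p is not None)
    if not voiced:
        return []
    # Nearest-rank percentile over voiced frames (a simple, assumed definition).
    rank = max(0, -(-percentile * len(voiced) // 100) - 1)
    threshold = voiced[rank]
    min_frames = -(-min_ms // frame_ms)  # ceiling division: 110 ms -> 11 frames
    regions, start = [], None
    # The trailing None acts as a sentinel that closes a run reaching the last frame.
    for i, p in enumerate(list(pitch_hz) + [None]):
        low = p is not None and p <= threshold
        if low and start is None:
            start = i
        elif not low and start is not None:
            if i - start >= min_frames:
                regions.append((start, i))
            start = None
    return regions


# Hypothetical pitch track: 150 ms of low pitch between two higher stretches.
track = [200.0] * 15 + [100.0] * 15 + [200.0] * 20
print(low_pitch_regions(track))
```

A real system would combine such a cue with further conditions; here the end of each returned region is simply a candidate moment for a backchannel.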
23 (2007) present a unimodal decision-tree approach for producing backchannels based on prosodic features. [sent-36, score-0.182]
24 (2003) propose a unimodal model based on pause duration and trigram part-of-speech frequency. [sent-38, score-0.116]
25 Wisdom of crowds was first defined and used in the business world by Surowiecki (2004). [sent-39, score-0.184]
26 (2010) proposed a probabilistic approach for supervised learning tasks for which multiple annotators provide labels but not an absolute gold standard. [sent-43, score-0.054]
27 (2008) show that using non-expert labels for training machine learning algorithms can be as effective as using a gold standard annotation. [sent-46, score-0.033]
28 In this paper, we present a computational approach for listener backchannel prediction that exploits multiple listeners. [sent-47, score-0.956]
29 Our model takes into account the differences in people’s reactions, and automatically learns the hidden structure among them. [sent-48, score-0.088]
30 In Section 2, we present the wisdom acquisition process. [sent-50, score-0.38]
31 Then, we describe our Wisdom-LMDE model in Section 3. [sent-51, score-0.026]
32 2 Wisdom Acquisition It is known that culture, age and gender affect people’s nonverbal behaviors (Linda L. [sent-54, score-0.081]
33 Therefore, there might be variations among people’s reactions even when experiencing the same situation. [sent-56, score-0.094]
34 To efficiently acquire responses from multiple listeners, we employ the Parasocial Consensus Sampling (PCS) paradigm (Huang et al. [sent-57, score-0.103]
35 , 2010), which is based on the theory that people behave similarly when interacting through a medium (e. [sent-58, score-0.052]
36 (2010) showed that a virtual human driven by the PCS approach creates significantly more rapport and is perceived as more believable than a virtual human driven by face-to-face interaction data (from the actual listener). [sent-62, score-0.417]
37 This result indicates that the parasocial paradigm is a viable source of information for wisdom of crowds. [sent-63, score-0.436]
38 In our experiments, we used 43 video-recorded dyadic interactions from the RAPPORT1 dataset (Gratch et al. [sent-67, score-0.103]
39 The videos of the actual listeners were manually annotated for backchannel feedback. [sent-70, score-0.677]
40 For PCS wisdom acquisition, we recruited 9 participants, who were told to pretend they are an active listener and press the keyboard whenever they felt like providing backchannel feedback. [sent-71, score-1.25]
41 This provides us the responses from multiple listeners all interacting with the same speaker, hence the wisdom necessary to model the variability among listeners. [sent-72, score-0.72]
42 3 Modeling Wisdom of Crowds Given the wisdom of multiple listeners, our goal is to create a computational model of backchannel feedback. [sent-73, score-0.79]
43 Although listener responses vary among individuals, we expect some patterns in these responses. [sent-74, score-0.606]
44 Therefore, we first analyze the most predictive features for each listener and search for prototypical patterns (in Section 3. [sent-75, score-0.729]
45 Then, we present our Wisdom-LMDE, which automatically learns the hidden structure within listener responses. [sent-77, score-0.529]
46 1 Wisdom Analysis We analyzed our wisdom data to see the most relevant speaker features when predicting responses from each individual listener. [sent-79, score-0.565]
47 (The complete list of speaker features is described in Section 4. [sent-80, score-0.111]
48 It allows us to identify the speaker features most predictive of each listener's backchannel feedback. [sent-83, score-0.988]
49 The top 3 features for all 9 listeners are listed in Table 1. [sent-84, score-0.204]
50 For the first 3 listeners, pause in speech and syntac1http://rapport. [sent-86, score-0.027]
51 The next 3 experts include a prosodic feature, low pitch, which is coherent with earlier findings (Nishimura et al. [sent-90, score-0.188]
52 It is interesting to see that the last 3 experts incorporate visual information when predicting backchannel feedback. [sent-92, score-0.586]
53 , 1995) work showing that speaker gestures are often correlated with listener feedback. [sent-95, score-0.629]
54 These results clearly suggest that variations are present among listeners and that some prototypical patterns may exist. [sent-96, score-0.47]
55 Based on these observations, we propose a new computational model for listener backchannel. [sent-97, score-0.493]
56 2 Computational Model: Wisdom-LMDE The goals of our computational model are to automatically discover the prototypical patterns of backchannel feedback and learn the dynamic between these patterns. [sent-99, score-0.864]
57 This will allow the computational model to accurately predict the responses of a new listener even if he/she changes his/her backchannel patterns in the middle of the interaction. [sent-100, score-1.019]
58 It will also improve generalization by allowing mixtures of these prototypical patterns. [sent-101, score-0.189]
59 , 2010) which takes full advantage of the wisdom of crowds. [sent-103, score-0.354]
60 In our Wisdom-LMDE, each expert corresponds to a different listener from the wisdom of crowds. [sent-105, score-0.868]
61 As motivated earlier, we focus our experiments on predicting listener backchannel since it is a well-suited task where variability exists among listeners. [sent-109, score-0.962]
62 1 Multimodal Speaker Features The speaker videos were transcribed and annotated to extract the following features: Lexical: Some studies have suggested an association between lexical features and listener feedback (Cathcart et al. [sent-111, score-0.724]
63 (POS) tagger and a data-driven Prosody: Prosody refers to the rhythm, pitch and intonation of speech. [sent-117, score-0.068]
64 Several studies have demonstrated that listener feedback is correlated with a speaker’s prosody (Ward and Tsukahara, 2000; Nishimura et al. [sent-118, score-0.644]
65 Following this, we use downslope in pitch, pitch regions lower than 26th percentile, drop/rise and fast drop/rise in energy of speech, vowel volume, pause. [sent-120, score-0.068]
66 Visual gestures: Gestures performed by the speaker are often correlated with listener feedback (Burgoon et al. [sent-121, score-0.663]
67 Eye gaze, in particular, has often been implicated as eliciting listener feedback. [sent-123, score-0.467]
68 Thus, we encode the following contextual features: speaker looking at listener, smiling, moving eyebrows up and frowning. [sent-124, score-0.114]
69 Although our current method for extracting these features requires the entire utterance to be available for processing, this provides us with a first step towards integrating information about syntactic structure in multimodal prediction models. [sent-125, score-0.188]
70 Many of these features could in principle be computed incrementally with only a slight degradation in accuracy, with the exception of features that require dependency links where a word’s syntactic head is to the right of the word itself. [sent-126, score-0.04]
71 2 Baseline Models Consensus Classifier In our first baseline model, we train a CRF model on consensus labels, which are constructed with an approach similar to the one presented in (Huang et al. [sent-129, score-0.098]
72 The consensus threshold is set to 3 (at least 3 listeners agree to give feedback at a point) so that it contains approximately the same number of head nods as the actual listener. [sent-131, score-0.438]
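The consensus threshold described above can be illustrated with a per-frame vote count. This is a minimal sketch that assumes each listener's responses have already been converted to binary frame-level sequences of equal length; the paper follows the approach of Huang et al., which may differ in detail, and the function name `consensus_labels` is hypothetical.

```python
def consensus_labels(annotations, threshold=3):
    """Mark a frame as backchannel when at least `threshold` listeners responded there.

    annotations: list of per-listener binary sequences (1 = feedback), all the same length.
    """
    if not annotations:
        return []
    n_frames = len(annotations[0])
    if any(len(seq) != n_frames for seq in annotations):
        raise ValueError("all listeners must be annotated over the same frames")
    votes = [sum(seq[t] for seq in annotations) for t in range(n_frames)]
    return [1 if v >= threshold else 0 for v in votes]


# Hypothetical responses from four listeners over four frames.
crowd = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
]
print(consensus_labels(crowd))  # only frame 2 reaches three votes
```

Raising or lowering `threshold` trades off how many consensus backchannels are produced, which is how the paper matches the count to that of the actual listener.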
73 CRF Mixture of Experts To show the importance of the latent variable in our Wisdom-LMDE model, we trained a CRF-based mixture of discriminative experts. [sent-133, score-0.169]
74 This model is similar to the Logarithmic Opinion Pool (LOP) CRF suggested by Smith et al. [sent-134, score-0.026]
75 A graphical representation of a CRF Mixture of Experts is given in Figure 1. [sent-137, score-0.139]
76 Actual Listener (AL) Classifiers This baseline model consists of two models: CRF and LDCRF chains (See Figure 1). [sent-138, score-0.026]
77 To train these models, we use the labels of the ”Actual Listeners” (AL) from the RAPPORT dataset. [sent-139, score-0.033]
78 Multimodal LMDE In this baseline model, we compare our Wisdom-LMDE to a multimodal LMDE, where each expert refers to one of 5 different sets of multimodal features as presented in (Ozkan et al. [sent-140, score-0.245]
79 Random Classifier Our last baseline model is a random backchannel generator as described by Ward and Tsukahara (2000). [sent-142, score-0.459]
80 This model randomly generates backchannels whenever certain pre-defined conditions in the prosody of the speech are met. [sent-143, score-0.164]
81 Regulariza- Table 2: Comparison of our Wisdom-LMDE model with previously proposed models. [sent-148, score-0.026]
82 Numbers of hidden states used in the LDCRF models were 2, 3 and 4. [sent-153, score-0.039]
83 A backchannel is predicted correctly if a peak happens during an actual listener backchannel with high enough probability. [sent-159, score-1.329]
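The peak-based criterion above can be sketched as follows. The extracted text does not spell out how precision and recall are counted, so this sketch assumes that precision is the share of predicted peaks falling inside an actual backchannel and recall is the share of actual backchannels containing at least one peak; the probability cut-off `min_prob` and both function names are hypothetical.

```python
def local_peaks(probs, min_prob):
    """Indices of local maxima whose probability is at least min_prob."""
    peaks = []
    for t in range(len(probs)):
        left = probs[t - 1] if t > 0 else float("-inf")
        right = probs[t + 1] if t + 1 < len(probs) else float("-inf")
        # Strictly above the left neighbour, so a flat plateau yields a single peak.
        if probs[t] >= min_prob and probs[t] > left and probs[t] >= right:
            peaks.append(t)
    return peaks


def peak_f1(probs, true_spans, min_prob=0.5):
    """Return (precision, recall, F1); true_spans are (start, end) frames, end exclusive."""
    peaks = local_peaks(probs, min_prob)
    good_peaks = [p for p in peaks if any(s <= p < e for s, e in true_spans)]
    hit_spans = [1 for s, e in true_spans if any(s <= p < e for p in peaks)]
    precision = len(good_peaks) / len(peaks) if peaks else 0.0
    recall = len(hit_spans) / len(true_spans) if true_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical model output over nine frames and two actual listener backchannels.
probs = [0.1, 0.7, 0.2, 0.1, 0.9, 0.3, 0.1, 0.6, 0.2]
print(peak_f1(probs, [(0, 3), (6, 9)]))
```

Because backchannel feedback is optional, a peak outside every actual backchannel is counted as an error here even though another listener might plausibly have responded at that moment.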
84 4 Results and Discussion Before reviewing the prediction results, it is important to remember that backchannel feedback is an optional phenomenon, where the actual listener may or may not decide on giving feedback (Ward and Tsukahara, 2000). [sent-162, score-1.21]
85 Therefore, results from prediction tasks are expected to have lower accuracies as opposed to recognition tasks where labels are directly observed (e. [sent-163, score-0.091]
86 Table 2 summarizes our experiments comparing our Wisdom-LMDE model with state-of-the-art approaches for behavior prediction (see Section 4. [sent-166, score-0.109]
87 The second best F1 score is achieved by CRF Mixture of experts, which is the only model among other baseline models that combines different listener labels in a late fusion manner. [sent-170, score-0.106]
88 net/projects/hrcf/ [...] supports our claim that wisdom of crowds improves learning of prediction models. [sent-172, score-0.412]
89 The CRF Mixture model is a linear combination of the experts, whereas Wisdom-LMDE enables different weighting of experts at different points in time. [sent-173, score-0.165]
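The contrast between a fixed linear combination and time-varying expert weights can be illustrated with a toy example. This is not the Wisdom-LMDE itself, which learns its latent structure jointly inside a latent-variable conditional model; the sketch simply assumes that per-frame expert probabilities and hidden-state posteriors are already given, and every name in it is hypothetical.

```python
def fixed_mixture(expert_probs, weights):
    """Combine experts with one weight vector shared by all frames.

    expert_probs[k][t]: probability from expert k at frame t.
    """
    n_frames = len(expert_probs[0])
    return [sum(w * probs[t] for w, probs in zip(weights, expert_probs))
            for t in range(n_frames)]


def state_dependent_mixture(expert_probs, state_posteriors, state_weights):
    """Combine experts with weights that depend on the hidden state at each frame.

    state_posteriors[t][h]: posterior of hidden state h at frame t (assumed given).
    state_weights[h][k]: weight of expert k under hidden state h.
    """
    combined = []
    for t, post in enumerate(state_posteriors):
        # Effective expert weights at frame t: posterior-weighted average over states.
        w_t = [sum(post[h] * state_weights[h][k] for h in range(len(post)))
               for k in range(len(expert_probs))]
        combined.append(sum(w * expert_probs[k][t] for k, w in enumerate(w_t)))
    return combined


# Two hypothetical experts that are each reliable in a different part of the interaction.
experts = [[0.9, 0.1], [0.1, 0.9]]
print(fixed_mixture(experts, [0.5, 0.5]))
print(state_dependent_mixture(experts, [[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]))
```

With equal fixed weights the two experts cancel each other out, while state-dependent weights can follow whichever expert matches the current pattern, which is the intuition behind weighting experts differently over time.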
90 By using hidden states, Wisdom-LMDE can automatically learn the prototypical patterns between listeners. [sent-174, score-0.304]
91 One really interesting result is that the optimal number of hidden states in the Wisdom-LMDE model (after cross-validation) is 3. [sent-175, score-0.065]
92 5 Conclusions In this paper, we proposed a new approach called Wisdom-LMDE for modeling wisdom of crowds, which automatically learns the hidden structure in listener responses. [sent-178, score-0.86]
93 We applied this method on the task of listener backchannel feedback prediction, and showed improvement over previous approaches. [sent-179, score-0.982]
94 Both our qualitative analysis and experimental results suggest that prototypical patterns exist when predicting listener backchannel feedback. [sent-180, score-1.175]
95 The Wisdom-LMDE is a generic model applicable to multiple sequence labeling tasks (such as emotion analysis and dialogue intent recognition), where labels are subjective (i. [sent-181, score-0.08]
96 A shallow model of backchannel continuers in spoken dialogue. [sent-203, score-0.436]
97 Parasocial consensus sampling: combining multiple perspectives to learn virtual human behavior. [sent-240, score-0.202]
98 Latent mixture of discriminative experts for multimodal prediction modeling. [sent-281, score-0.412]
99 Failure of rapport: Why psychotherapeutic engagement fails in the treatment of Asian clients. [sent-314, score-0.031]
100 Prosodic features which cue back-channel responses in english and japanese. [sent-320, score-0.083]
wordName wordTfidf (topN-words)
[('listener', 0.467), ('backchannel', 0.41), ('wisdom', 0.354), ('prototypical', 0.189), ('listeners', 0.184), ('lmde', 0.169), ('ozkan', 0.169), ('crowds', 0.161), ('rapport', 0.149), ('experts', 0.139), ('morency', 0.127), ('crf', 0.115), ('ldcrf', 0.106), ('tsukahara', 0.106), ('feedback', 0.105), ('dyadic', 0.103), ('mixture', 0.094), ('virtual', 0.093), ('speaker', 0.091), ('multimodal', 0.089), ('gratch', 0.086), ('nishimura', 0.085), ('nonverbal', 0.081), ('burgoon', 0.074), ('pcs', 0.074), ('backchannels', 0.07), ('linda', 0.069), ('pitch', 0.068), ('consensus', 0.065), ('parasocial', 0.063), ('unimodal', 0.063), ('responses', 0.063), ('prediction', 0.058), ('cathcart', 0.056), ('patterns', 0.053), ('prosodic', 0.049), ('prosody', 0.049), ('gestures', 0.048), ('expert', 0.047), ('latent', 0.043), ('actual', 0.042), ('carli', 0.042), ('drolet', 0.042), ('hcrf', 0.042), ('nods', 0.042), ('raykar', 0.042), ('surowiecki', 0.042), ('tsui', 0.042), ('wisdomlmde', 0.042), ('agents', 0.041), ('videos', 0.041), ('ward', 0.041), ('hidden', 0.039), ('dynamic', 0.039), ('storytelling', 0.037), ('predicting', 0.037), ('sagae', 0.036), ('labels', 0.033), ('sage', 0.032), ('discriminative', 0.032), ('reactions', 0.031), ('iva', 0.031), ('engagement', 0.031), ('logarithmic', 0.029), ('people', 0.028), ('pause', 0.027), ('child', 0.027), ('acquisition', 0.026), ('model', 0.026), ('culture', 0.026), ('huang', 0.025), ('behavior', 0.025), ('variability', 0.025), ('conflict', 0.025), ('interacting', 0.024), ('fusion', 0.024), ('random', 0.023), ('optional', 0.023), ('business', 0.023), ('learn', 0.023), ('among', 0.023), ('correlated', 0.023), ('conditional', 0.022), ('social', 0.022), ('validate', 0.021), ('care', 0.021), ('multiple', 0.021), ('variations', 0.021), ('utterance', 0.021), ('driven', 0.02), ('scenarios', 0.02), ('snow', 0.02), ('features', 0.02), ('qualitative', 0.019), ('whenever', 0.019), ('paradigm', 0.019), ('goals', 0.019), ('examiner', 0.019), 
('experiencing', 0.019), ('classrooms', 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999958 223 acl-2011-Modeling Wisdom of Crowds Using Latent Mixture of Discriminative Experts
Author: Derya Ozkan ; Louis-Philippe Morency
Abstract: In many computational linguistic scenarios, training labels are subjective, making it necessary to acquire the opinions of multiple annotators/experts, which is referred to as “wisdom of crowds”. In this paper, we propose a new approach for modeling wisdom of crowds based on the Latent Mixture of Discriminative Experts (LMDE) model that can automatically learn the prototypical patterns and hidden dynamic among different experts. Experiments show improvement over state-of-the-art approaches on the task of listener backchannel prediction in dyadic conversations.
2 0.29530412 83 acl-2011-Contrasting Multi-Lingual Prosodic Cues to Predict Verbal Feedback for Rapport
Author: Siwei Wang ; Gina-Anne Levow
Abstract: Verbal feedback is an important information source in establishing interactional rapport. However, predicting verbal feedback across languages is challenging due to language-specific differences, inter-speaker variation, and the relative sparseness and optionality of verbal feedback. In this paper, we employ an approach combining classifier weighting and SMOTE algorithm oversampling to improve verbal feedback prediction in Arabic, English, and Spanish dyadic conversations. This approach improves the prediction of verbal feedback, up to 6-fold, while maintaining a high overall accuracy. Analyzing highly weighted features highlights widespread use of pitch, with more varied use of intensity and duration.
3 0.10112531 118 acl-2011-Entrainment in Speech Preceding Backchannels.
Author: Rivka Levitan ; Agustin Gravano ; Julia Hirschberg
Abstract: In conversation, when speech is followed by a backchannel, evidence of continued engagement by one’s dialogue partner, that speech displays a combination of cues that appear to signal to one’s interlocutor that a backchannel is appropriate. We term these cues backchannel-preceding cues (BPC)s, and examine the Columbia Games Corpus for evidence of entrainment on such cues. Entrainment, the phenomenon of dialogue partners becoming more similar to each other, is widely believed to be crucial to conversation quality and success. Our results show that speaking partners entrain on BPCs; that is, they tend to use similar sets of BPCs; this similarity increases over the course of a dialogue; and this similarity is associated with measures of dialogue coordination and task success.
4 0.085352466 100 acl-2011-Discriminative Feature-Tied Mixture Modeling for Statistical Machine Translation
Author: Bing Xiang ; Abraham Ittycheriah
Abstract: In this paper we present a novel discriminative mixture model for statistical machine translation (SMT). We model the feature space with a log-linear combination of multiple mixture components. Each component contains a large set of features trained in a maximum-entropy framework. All features within the same mixture component are tied and share the same mixture weights, where the mixture weights are trained discriminatively to maximize the translation performance. This approach aims at bridging the gap between the maximum-likelihood training and the discriminative training for SMT. It is shown that the feature space can be partitioned in a variety of ways, such as based on feature types, word alignments, or domains, for various applications. The proposed approach improves the translation performance significantly on a large-scale Arabic-to-English MT task.
5 0.078595117 252 acl-2011-Prototyping virtual instructors from human-human corpora
Author: Luciana Benotti ; Alexandre Denis
Abstract: Virtual instructors can be used in several applications, ranging from trainers in simulated worlds to non player characters for virtual games. In this paper we present a novel algorithm for rapidly prototyping virtual instructors from human-human corpora without manual annotation. Automatically prototyping full-fledged dialogue systems from corpora is far from being a reality nowadays. Our algorithm is restricted in that only the virtual instructor can perform speech acts while the user responses are limited to physical actions in the virtual world. We evaluate a virtual instructor, generated using this algorithm, with human users. We compare our results both with human instructors and rule-based virtual instructors hand-coded for the same task.
6 0.076982334 306 acl-2011-Towards Style Transformation from Written-Style to Audio-Style
7 0.074445367 95 acl-2011-Detection of Agreement and Disagreement in Broadcast Conversations
8 0.066798367 228 acl-2011-N-Best Rescoring Based on Pitch-accent Patterns
9 0.057225302 260 acl-2011-Recognizing Authority in Dialogue with an Integer Linear Programming Constrained Model
10 0.05249472 257 acl-2011-Question Detection in Spoken Conversations Using Textual Conversations
11 0.047417212 261 acl-2011-Recognizing Named Entities in Tweets
12 0.045133669 21 acl-2011-A Pilot Study of Opinion Summarization in Conversations
13 0.043399721 156 acl-2011-IMASS: An Intelligent Microblog Analysis and Summarization System
14 0.042516999 77 acl-2011-Computing and Evaluating Syntactic Complexity Features for Automated Scoring of Spontaneous Non-Native Speech
15 0.041176118 288 acl-2011-Subjective Natural Language Problems: Motivations, Applications, Characterizations, and Implications
16 0.039144266 101 acl-2011-Disentangling Chat with Local Coherence Models
17 0.038076073 312 acl-2011-Turn-Taking Cues in a Human Tutoring Corpus
18 0.037230097 103 acl-2011-Domain Adaptation by Constraining Inter-Domain Variability of Latent Feature Representation
19 0.036499631 140 acl-2011-Fully Unsupervised Word Segmentation with BVE and MDL
20 0.03567408 39 acl-2011-An Ensemble Model that Combines Syntactic and Semantic Clustering for Discriminative Dependency Parsing
topicId topicWeight
[(0, 0.104), (1, 0.029), (2, -0.013), (3, -0.006), (4, -0.089), (5, 0.078), (6, 0.005), (7, -0.002), (8, 0.013), (9, 0.02), (10, -0.021), (11, -0.01), (12, -0.025), (13, 0.029), (14, 0.018), (15, -0.013), (16, -0.059), (17, -0.017), (18, 0.018), (19, -0.068), (20, 0.057), (21, -0.036), (22, -0.079), (23, 0.152), (24, -0.014), (25, 0.025), (26, 0.182), (27, -0.05), (28, -0.059), (29, -0.016), (30, -0.077), (31, 0.088), (32, 0.06), (33, -0.014), (34, -0.007), (35, 0.04), (36, 0.08), (37, 0.053), (38, -0.032), (39, 0.067), (40, 0.049), (41, -0.028), (42, 0.04), (43, -0.017), (44, 0.014), (45, 0.03), (46, -0.06), (47, -0.001), (48, -0.047), (49, 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 0.88628864 223 acl-2011-Modeling Wisdom of Crowds Using Latent Mixture of Discriminative Experts
Author: Derya Ozkan ; Louis-Philippe Morency
Abstract: In many computational linguistic scenarios, training labels are subjective, making it necessary to acquire the opinions of multiple annotators/experts, which is referred to as “wisdom of crowds”. In this paper, we propose a new approach for modeling wisdom of crowds based on the Latent Mixture of Discriminative Experts (LMDE) model that can automatically learn the prototypical patterns and hidden dynamic among different experts. Experiments show improvement over state-of-the-art approaches on the task of listener backchannel prediction in dyadic conversations.
2 0.86184245 83 acl-2011-Contrasting Multi-Lingual Prosodic Cues to Predict Verbal Feedback for Rapport
Author: Siwei Wang ; Gina-Anne Levow
Abstract: Verbal feedback is an important information source in establishing interactional rapport. However, predicting verbal feedback across languages is challenging due to language-specific differences, inter-speaker variation, and the relative sparseness and optionality of verbal feedback. In this paper, we employ an approach combining classifier weighting and SMOTE algorithm oversampling to improve verbal feedback prediction in Arabic, English, and Spanish dyadic conversations. This approach improves the prediction of verbal feedback, up to 6-fold, while maintaining a high overall accuracy. Analyzing highly weighted features highlights widespread use of pitch, with more varied use of intensity and duration.
3 0.75614244 228 acl-2011-N-Best Rescoring Based on Pitch-accent Patterns
Author: Je Hun Jeon ; Wen Wang ; Yang Liu
Abstract: In this paper, we adopt an n-best rescoring scheme using pitch-accent patterns to improve automatic speech recognition (ASR) performance. The pitch-accent model is decoupled from the main ASR system, thus allowing us to develop it independently. N-best hypotheses from recognizers are rescored by additional scores that measure the correlation of the pitch-accent patterns between the acoustic signal and lexical cues. To test the robustness of our algorithm, we use two different data sets and recognition setups: the first one is English radio news data that has pitch accent labels, but the recognizer is trained from a small amount of data and has high error rate; the second one is English broadcast news data using a state-of-the-art SRI recognizer. Our experimental results demonstrate that our approach is able to reduce word error rate relatively by about 3%. This gain is consistent across the two different tests, showing promising future directions of incorporating prosodic information to improve speech recognition.
4 0.72197092 95 acl-2011-Detection of Agreement and Disagreement in Broadcast Conversations
Author: Wen Wang ; Sibel Yaman ; Kristin Precoda ; Colleen Richey ; Geoffrey Raymond
Abstract: We present Conditional Random Fields based approaches for detecting agreement/disagreement between speakers in English broadcast conversation shows. We develop annotation approaches for a variety of linguistic phenomena. Various lexical, structural, durational, and prosodic features are explored. We compare the performance when using features extracted from automatically generated annotations against that when using human annotations. We investigate the efficacy of adding prosodic features on top of lexical, structural, and durational features. Since the training data is highly imbalanced, we explore two sampling approaches, random downsampling and ensemble downsampling. Overall, our approach achieves 79.2% (precision), 50.5% (recall), 61.7% (F1) for agreement detection and 69.2% (precision), 46.9% (recall), and 55.9% (F1) for disagreement detection, on the English broadcast conversation data. 1 Introduction In this work, we present models for detecting agreement/disagreement (denoted (dis)agreement) between speakers in English broadcast conversation shows. The Broadcast Conversation (BC) genre differs from the Broadcast News (BN) genre in that it is more interactive and spontaneous, referring to free speech in news-style TV and radio programs and consisting of talk shows, interviews, call-in programs, live reports, and round-tables. [Footnote: This work was performed while the author was at ICSI. syaman@us.ibm.com, graymond@soc.ucsb.edu] Previous work on detecting (dis)agreements has been focused on meeting data. (Hillard et al., 2003), (Galley et al., 2004), (Hahn et al., 2006) used spurt-level agreement annotations from the ICSI meeting corpus (Janin et al., 2003). (Hillard et al., 2003) explored unsupervised machine learning approaches and on manual transcripts, they achieved an overall 3-way agreement/disagreement classification accuracy of 82% with keyword features. 
(Galley et al., 2004) explored Bayesian Networks for the detection of (dis)agreements. They used adjacency pair information to determine the structure of their conditional Markov model and outperformed the results of (Hillard et al., 2003) by improving the 3-way classification accuracy to 86.9%. (Hahn et al., 2006) explored semi-supervised learning algorithms and reached a competitive performance of 86.7% 3-way classification accuracy on manual transcriptions with only lexical features. (Germesin and Wilson, 2009) investigated supervised machine learning techniques and yielded competitive results on the annotated data from the AMI meeting corpus (McCowan et al., 2005). Our work differs from these previous studies in two major categories. One is that a different definition of (dis)agreement was used. In the current work, a (dis)agreement occurs when a responding speaker agrees with, accepts, or disagrees with or rejects, a statement or proposition by a first speaker. Second, we explored (dis)agreement detection in broadcast conversation. Due to the difference in publicity and intimacy/collegiality between speakers in broadcast conversations vs. meetings, (dis)agreement may have different characteristics. [Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: short papers, pages 374–378, Portland, Oregon, June 19-24, 2011.] Different from the unsupervised approaches in (Hillard et al., 2003) and semi-supervised approaches in (Hahn et al., 2006), we conducted supervised training. Also, different from (Hillard et al., 2003) and (Galley et al., 2004), our classification was carried out on the utterance level, instead of on the spurt-level. Galley et al. 
extended Hillard et al.’s work by adding features from previous spurts and features from the general dialog context to infer the class of the current spurt, on top of features from the current spurt (local features) used by Hillard et al. Galley et al. used adjacency pairs to describe the interaction between speakers and the relations between consecutive spurts. In this preliminary study on broadcast conversation, we directly modeled (dis)agreement detection without using adjacency pairs. Still, within the conditional random fields (CRF) framework, we explored features from preceding and following utterances to consider context in the discourse structure. We explored a wide variety of features, including lexical, structural, durational, and prosodic features. To our knowledge, this is the first work to systematically investigate detection of agreement/disagreement for broadcast conversation data. The remainder of the paper is organized as follows. Section 2 presents our data and automatic annotation modules. Section 3 describes various features and the CRF model we explored. Experimental results and discussion appear in Section 4, as well as conclusions and future directions. 2 Data and Automatic Annotation In this work, we selected English broadcast conversation data from the DARPA GALE program collected data (GALE Phase 1 Release 4, LDC2006E91; GALE Phase 4 Release 2, LDC2009E15). Human transcriptions and manual speaker turn labels are used in this study. Also, since the (dis)agreement detection output will be used to analyze social roles and relations of an interacting group, we first manually marked soundbites and then excluded soundbites during annotation and modeling. We recruited annotators to provide manual annotations of speaker roles and (dis)agreement to use for the supervised training of models. We defined a set of speaker roles as follows. Host/chair is a person associated with running the discussions or calling the meeting. 
Reporting participant is a person reporting from the field, from a subcommittee, etc. Commentator participant/Topic participant is a person providing commentary on some subject, or a person who is the subject of the conversation and plays a role, e.g., as a newsmaker. Audience participant is an ordinary person who may call in, ask questions at a microphone at, e.g., a large presentation, or be interviewed because of their presence at a news event. Other is any speaker who does not fit in one of the above categories, such as a voice talent, an announcer doing show openings or commercial breaks, or a translator. Agreements and disagreements are composed of different combinations of initiating utterances and responses. We reformulated the (dis)agreement detection task as the sequence tagging of 11 (dis)agreement-related labels for identifying whether a given utterance in the show is initiating a (dis)agreement opportunity, is a (dis)agreement response to such an opportunity, or is neither of these. For example, a Negative tag question followed by a negation response forms an agreement, that is, A: [Negative tag] This is not black and white, is it? B: [Agreeing Response] No, it isn’t. The data sparsity problem is serious. Among all 27,071 utterances, only 2,589 utterances (about 10% of all data) are involved in (dis)agreement as initiating or response utterances, while 24,482 utterances are not involved. These annotators also labeled shows with a variety of linguistic phenomena (denoted language use constituents, LUC), including discourse markers, disfluencies, person addresses and person mentions, prefaces, extreme case formulations, and dialog act tags (DAT). We categorized dialog acts into statement, question, backchannel, and incomplete. We classified disfluencies (DF) into filled pauses (e.g., uh, um), repetitions, corrections, and false starts. Person address (PA) terms are terms that a speaker uses to address another person. 
Person mentions (PM) are references to non-participants in the conversation. Discourse markers (DM) are words or phrases that are related to the structure of the discourse and express a relation between two utterances, for example, I mean, you know. Prefaces (PR) are sentence-initial lexical tokens serving functions close to discourse markers (e.g., Well, I think that...). Extreme case formulations (ECF) are lexical patterns emphasizing extremeness (e.g., This is the best book I have ever read). In the end, we manually annotated 49 English shows. We preprocessed English manual transcripts by removing transcriber annotation markers and noise, removing punctuation and case information, and conducting text normalization. We also built automatic rule-based and statistical annotation tools for these LUCs.
3 Features and Model
We explored lexical, structural, durational, and prosodic features for (dis)agreement detection. We included a set of “lexical” features, including n-grams extracted from all of that speaker’s utterances, denoted ngram features. Other lexical features include the presence of negation and acquiescence, yes/no equivalents, positive and negative tag questions, and other features distinguishing different types of initiating utterances and responses. We also included various lexical features extracted from LUC annotations, denoted LUC features. These additional features include features related to the presence of prefaces, the counts of types and tokens of discourse markers, extreme case formulations, disfluencies, person addressing events, and person mentions, and the normalized values of these counts by sentence length. We also include a set of features related to the DAT of the current utterance and preceding and following utterances. We developed a set of “structural” and “durational” features, inspired by conversation analysis, to quantitatively represent the different participation and interaction patterns of speakers in a show. 
We extracted features related to pausing and overlaps between consecutive turns, the absolute and relative duration of consecutive turns, and so on. We used a set of prosodic features including pause, duration, and the speech rate of a speaker. We also used pitch and energy of the voice. Prosodic features were computed on words and phonetic alignment of manual transcripts. Features are computed for the beginning and ending words of an utterance. For the duration features, we used the average and maximum vowel duration from forced alignment, both unnormalized and normalized for vowel identity and phone context. For pitch and energy, we calculated the minimum, maximum, range, mean, standard deviation, skewness and kurtosis values. A decision tree model was used to compute posteriors from prosodic features and we used cumulative binning of posteriors as final features, similar to (Liu et al., 2006). As illustrated in Section 2, we reformulated the (dis)agreement detection task as a sequence tagging problem. We used the Mallet package (McCallum, 2002) to implement the linear chain CRF model for sequence tagging. A CRF is an undirected graphical model that defines a global log-linear distribution of the state (or label) sequence E conditioned on an observation sequence, in our case including the sequence of sentences S and the corresponding sequence of features for this sequence of sentences F. The model is optimized globally over the entire sequence. The CRF model is trained to maximize the conditional log-likelihood P(E | S, F) of a given training set. During testing, the most likely sequence E is found using the Viterbi algorithm. One of the motivations of choosing conditional random fields was to avoid the label-bias problem found in hidden Markov models. 
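The Viterbi decoding step mentioned above can be illustrated with a minimal sketch. This is not the Mallet implementation used in the paper; it assumes that per-position label scores and label-to-label transition scores (both in the log domain) have already been produced by some trained linear-chain model, and the names `viterbi`, `emissions`, and `transitions` are hypothetical.

```python
from typing import Dict, List, Sequence


def viterbi(emissions: Sequence[Dict[str, float]],
            transitions: Dict[str, Dict[str, float]],
            labels: Sequence[str]) -> List[str]:
    """Return the highest-scoring label sequence for one utterance sequence.

    emissions[t][y]   : log-score of label y at position t (assumed given)
    transitions[p][y] : log-score of moving from label p to label y
    """
    if not emissions:
        return []
    # best[t][y] holds the best score of any path that ends in label y at t.
    best = [{y: emissions[0][y] for y in labels}]
    back = []  # back[t-1][y] = best previous label for y at position t
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[-1][p] + transitions[p][y])
            scores[y] = best[-1][prev] + transitions[prev][y] + emissions[t][y]
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path
```

Because the transition scores enter every step, a label that looks best locally can be overridden by its neighbours, which is the sequence-level behaviour the text attributes to the CRF.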
Compared to Maximum Entropy modeling, the CRF model is optimized globally over the entire sequence, whereas the ME model makes a decision at each point individually without considering the context event information.
4 Experiments
All (dis)agreement detection results are based on n-fold cross-validation. In this procedure, we held out one show as the test set, randomly held out another show as the dev set, trained models on the rest of the data, and tested the model on the held-out show. We iterated through all shows and computed the overall accuracy. Table 1 shows the results of (dis)agreement detection using all features except prosodic features. We compared two conditions: (1) features extracted completely from the automatic LUC annotations and automatically detected speaker roles, and (2) features from manual speaker role labels and manual LUC annotations when manual annotations are available. Table 1 shows that running a fully automatic system to generate automatic annotations and automatic speaker roles produced comparable performance to the system using features from manual annotations whenever available.
Table 1: Precision (%), recall (%), and F1 (%) of (dis)agreement detection using features extracted from manual speaker role labels and manual LUC annotations when available, denoted Manual Annotation, and automatic LUC annotations and automatically detected speaker roles, denoted Automatic Annotation.
[Table 1 layout: rows Manual Annotation and Automatic Annotation; columns P, R, F1 for Agreement and for Disagreement.]
We then focused on the condition of using features from manual annotations when available and added prosodic features as described in Section 3. The results are shown in Table 2. Adding prosodic features produced a 0.7% absolute gain on F1 on agreement detection, and 1.5% absolute gain on F1 on disagreement detection. 
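The cross-validation procedure described in this section (one show held out for testing, one randomly chosen show for development, the rest for training) could be sketched as follows. This is a minimal sketch rather than the authors' scripts; the function name `leave_one_show_out`, the show identifiers, and the fixed random seed are assumptions made for illustration.

```python
import random
from typing import Iterator, List, Sequence, Tuple


def leave_one_show_out(shows: Sequence[str], seed: int = 0
                       ) -> Iterator[Tuple[List[str], str, str]]:
    """Yield (train_shows, dev_show, test_show), one fold per show.

    Each show is the test set exactly once; the dev show is drawn at
    random from the remaining shows, and everything else is training data.
    Requires at least two shows.
    """
    rng = random.Random(seed)  # seeded so the folds are reproducible
    for test_show in shows:
        remaining = [s for s in shows if s != test_show]
        dev_show = rng.choice(remaining)
        train = [s for s in remaining if s != dev_show]
        yield train, dev_show, test_show
```

With the 49 annotated shows mentioned in Section 2, this would produce 49 folds, each training on 47 shows.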
Table 2: Precision (%), recall (%), and F1 (%) of (dis)agreement detection using manual annotations without and with prosodic features.
[Table 2 layout: rows w/o prosodic and with prosodic; columns P, R, F1 for Agreement and for Disagreement.]
Note that only about 10% of the utterances among all data are involved in (dis)agreement. This indicates a highly imbalanced data set, as one class is more heavily represented than the others. We suspected that this high imbalance has played a major role in the high precision and low recall results we obtained so far. Various approaches have been studied to handle imbalanced data for classification, trying to balance the class distribution in the training set by either oversampling the minority class or downsampling the majority class. In this preliminary study of sampling approaches for handling imbalanced data for CRF training, we investigated two approaches, random downsampling and ensemble downsampling. Random downsampling randomly downsamples the majority class to equate the number of minority and majority class samples. Ensemble downsampling is a refinement of random downsampling which doesn’t discard any majority class samples. Instead, we partitioned the majority class samples into N subspaces with each subspace containing the same number of samples as the minority class. Then we train N CRF models, each based on the minority class samples and one disjoint partition from the N subspaces. During testing, the posterior probability for one utterance is averaged over the N CRF models. The results from these two sampling approaches as well as the baseline are shown in Table 3. Both sampling approaches achieved significant improvement over the baseline, i.e., training on the original data set, and ensemble downsampling produced better performance than random downsampling. 
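The two ingredients of ensemble downsampling described above, partitioning the majority class and averaging posteriors, can be sketched as below. This is an illustrative sketch only: it treats samples as independent items, whereas the paper trains CRFs on sequences and does not say how sequence structure was preserved or what value of N was used. The function names and the round-robin split are assumptions.

```python
import random
from typing import Dict, List, Sequence, TypeVar

T = TypeVar("T")


def partition_majority(majority: Sequence[T], minority_size: int,
                       seed: int = 0) -> List[List[T]]:
    """Split the majority class into disjoint parts of about minority_size.

    No majority sample is discarded; part sizes differ by at most one.
    """
    if minority_size <= 0:
        raise ValueError("minority_size must be positive")
    items = list(majority)
    random.Random(seed).shuffle(items)
    n_parts = max(1, len(items) // minority_size)
    # Deal the shuffled samples round-robin into n_parts disjoint parts.
    return [items[i::n_parts] for i in range(n_parts)]


def average_posteriors(per_model: Sequence[Dict[str, float]]) -> Dict[str, float]:
    """Average the label posteriors that the N models assign to one utterance."""
    labels = per_model[0].keys()
    return {y: sum(p[y] for p in per_model) / len(per_model) for y in labels}
```

Under this rounding choice, the utterance counts reported in Section 2 (24,482 uninvolved versus 2,589 involved) would give 24482 // 2589 = 9 partitions; each of the 9 models would then be trained on one partition plus all minority samples.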
We noticed that both sampling approaches degraded slightly in precision but improved significantly in recall, resulting in 4.5% absolute gain on F1 for agreement detection and 4.7% absolute gain on F1 for disagreement detection.
Table 3: Precision (%), recall (%), and F1 (%) of (dis)agreement detection without sampling, with random downsampling and ensemble downsampling. Manual annotations and prosodic features are used.
[Table 3 layout: rows Baseline, Random downsampling, and Ensemble downsampling; columns P, R, F1 for Agreement and for Disagreement.]
In conclusion, this paper presents our work on detection of agreements and disagreements in English broadcast conversation data. We explored a variety of features, including lexical, structural, durational, and prosodic features. We experimented with these features using a linear-chain conditional random fields model and conducted supervised training. We observed significant improvement from adding prosodic features and employing two sampling approaches, random downsampling and ensemble downsampling. Overall, we achieved 79.2% (precision), 50.5% (recall), and 61.7% (F1) for agreement detection and 69.2% (precision), 46.9% (recall), and 55.9% (F1) for disagreement detection, on English broadcast conversation data. In future work, we plan to continue adding and refining features, explore dependencies between features and contextual cues with respect to agreements and disagreements, and investigate the efficacy of other machine learning approaches such as Bayesian networks and Support Vector Machines.
Acknowledgments
The authors thank Gokhan Tur and Dilek Hakkani-Tür for valuable insights and suggestions. This work has been supported by the Intelligence Advanced Research Projects Activity (IARPA) via Army Research Laboratory (ARL) contract number W911NF-09-C-0089. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. 
The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, ARL, or the U.S. Government.
References
M. Galley, K. McKeown, J. Hirschberg, and E. Shriberg. 2004. Identifying agreement and disagreement in conversational speech: Use of Bayesian networks to model pragmatic dependencies. In Proceedings of ACL.
S. Germesin and T. Wilson. 2009. Agreement detection in multiparty conversation. In Proceedings of International Conference on Multimodal Interfaces.
S. Hahn, R. Ladner, and M. Ostendorf. 2006. Agreement/disagreement classification: Exploiting unlabeled data using constraint classifiers. In Proceedings of HLT/NAACL.
D. Hillard, M. Ostendorf, and E. Shriberg. 2003. Detection of agreement vs. disagreement in meetings: Training with unlabeled data. In Proceedings of HLT/NAACL.
A. Janin, D. Baron, J. Edwards, D. Ellis, D. Gelbart, N. Morgan, B. Peskin, T. Pfau, E. Shriberg, A. Stolcke, and C. Wooters. 2003. The ICSI Meeting Corpus. In Proc. ICASSP, Hong Kong, April.
Y. Liu, E. Shriberg, A. Stolcke, D. Hillard, M. Ostendorf, and M. Harper. 2006. Enriching speech recognition with automatic detection of sentence boundaries and disfluencies. IEEE Transactions on Audio, Speech, and Language Processing, 14(5): 1526–1540, September. Special Issue on Progress in Rich Transcription.
A. McCallum. 2002. Mallet: A machine learning for language toolkit. http://mallet.cs.umass.edu.
I. McCowan, J. Carletta, W. Kraaij, S. Ashby, S. Bourban, M. Flynn, M. Guillemot, T. Hain, J. Kadlec, V. Karaiskos, M. Kronenthal, G. Lathoud, M. Lincoln, A. Lisowska, W. Post, D. Reidsma, and P. Wellner. 2005. The AMI meeting corpus. In Proceedings of Measuring Behavior 2005, the 5th International Conference on Methods and Techniques in Behavioral Research.
5 0.67416233 118 acl-2011-Entrainment in Speech Preceding Backchannels.
Author: Rivka Levitan ; Agustin Gravano ; Julia Hirschberg
Abstract: In conversation, when speech is followed by a backchannel, evidence of continued engagement by one’s dialogue partner, that speech displays a combination of cues that appear to signal to one’s interlocutor that a backchannel is appropriate. We term these cues backchannel-preceding cues (BPCs), and examine the Columbia Games Corpus for evidence of entrainment on such cues. Entrainment, the phenomenon of dialogue partners becoming more similar to each other, is widely believed to be crucial to conversation quality and success. Our results show that speaking partners entrain on BPCs; that is, they tend to use similar sets of BPCs; this similarity increases over the course of a dialogue; and this similarity is associated with measures of dialogue coordination and task success.
6 0.65824664 312 acl-2011-Turn-Taking Cues in a Human Tutoring Corpus
7 0.62808776 257 acl-2011-Question Detection in Spoken Conversations Using Textual Conversations
8 0.57617086 306 acl-2011-Towards Style Transformation from Written-Style to Audio-Style
10 0.48516807 249 acl-2011-Predicting Relative Prominence in Noun-Noun Compounds
11 0.44056764 97 acl-2011-Discovering Sociolinguistic Associations with Structured Sparsity
12 0.43149003 301 acl-2011-The impact of language models and loss functions on repair disfluency detection
13 0.42690459 133 acl-2011-Extracting Social Power Relationships from Natural Language
14 0.42367965 288 acl-2011-Subjective Natural Language Problems: Motivations, Applications, Characterizations, and Implications
15 0.41957119 74 acl-2011-Combining Indicators of Allophony
16 0.41234475 35 acl-2011-An ERP-based Brain-Computer Interface for text entry using Rapid Serial Visual Presentation and Language Modeling
17 0.40486088 165 acl-2011-Improving Classification of Medical Assertions in Clinical Notes
18 0.39078403 55 acl-2011-Automatically Predicting Peer-Review Helpfulness
19 0.38431889 260 acl-2011-Recognizing Authority in Dialogue with an Integer Linear Programming Constrained Model
20 0.36589998 252 acl-2011-Prototyping virtual instructors from human-human corpora
topicId topicWeight
[(1, 0.011), (5, 0.041), (17, 0.038), (26, 0.016), (37, 0.077), (39, 0.04), (41, 0.16), (55, 0.015), (59, 0.032), (72, 0.05), (77, 0.255), (91, 0.054), (96, 0.099)]
simIndex simValue paperId paperTitle
1 0.84268606 342 acl-2011-full-for-print
Author: Kuzman Ganchev
Abstract: unkown-abstract
same-paper 2 0.80381376 223 acl-2011-Modeling Wisdom of Crowds Using Latent Mixture of Discriminative Experts
Author: Derya Ozkan ; Louis-Philippe Morency
Abstract: In many computational linguistic scenarios, training labels are subjective, making it necessary to acquire the opinions of multiple annotators/experts, which is referred to as “wisdom of crowds”. In this paper, we propose a new approach for modeling wisdom of crowds based on the Latent Mixture of Discriminative Experts (LMDE) model that can automatically learn the prototypical patterns and hidden dynamic among different experts. Experiments show improvement over state-of-the-art approaches on the task of listener backchannel prediction in dyadic conversations.
3 0.63744766 189 acl-2011-K-means Clustering with Feature Hashing
Author: Hajime Senuma
Abstract: One of the major problems of K-means is that one must use dense vectors for its centroids, and therefore it is infeasible to store such huge vectors in memory when the feature space is high-dimensional. We address this issue by using feature hashing (Weinberger et al., 2009), a dimension-reduction technique, which can reduce the size of dense vectors while retaining sparsity of sparse vectors. Our analysis gives theoretical motivation and justification for applying feature hashing to K-means, by showing how much the objective of K-means will be (additively) distorted. Furthermore, to empirically verify our method, we experimented on a document clustering task.
4 0.63645768 219 acl-2011-Metagrammar engineering: Towards systematic exploration of implemented grammars
Author: Antske Fokkens
Abstract: When designing grammars of natural language, typically, more than one formal analysis can account for a given phenomenon. Moreover, because analyses interact, the choices made by the engineer influence the possibilities available in further grammar development. The order in which phenomena are treated may therefore have a major impact on the resulting grammar. This paper proposes to tackle this problem by using metagrammar development as a methodology for grammar engineering. I argue that metagrammar engineering as an approach facilitates the systematic exploration of grammars through comparison of competing analyses. The idea is illustrated through a comparative study of auxiliary structures in HPSG-based grammars for German and Dutch. Auxiliaries form a central phenomenon of German and Dutch and are likely to influence many components of the grammar. This study shows that a special auxiliary+verb construction significantly improves efficiency compared to the standard argument-composition analysis for both parsing and generation.
Author: Fabrizio Morbini ; Kenji Sagae
Abstract: Individual utterances often serve multiple communicative purposes in dialogue. We present a data-driven approach for identification of multiple dialogue acts in single utterances in the context of dialogue systems with limited training data. Our approach results in significantly increased understanding of user intent, compared to two strong baselines.
6 0.6286999 328 acl-2011-Using Cross-Entity Inference to Improve Event Extraction
7 0.62786436 83 acl-2011-Contrasting Multi-Lingual Prosodic Cues to Predict Verbal Feedback for Rapport
8 0.62757909 56 acl-2011-Bayesian Inference for Zodiac and Other Homophonic Ciphers
9 0.62393618 232 acl-2011-Nonparametric Bayesian Machine Transliteration with Synchronous Adaptor Grammars
10 0.62384158 139 acl-2011-From Bilingual Dictionaries to Interlingual Document Representations
11 0.62298954 143 acl-2011-Getting the Most out of Transition-based Dependency Parsing
12 0.60125005 65 acl-2011-Can Document Selection Help Semi-supervised Learning? A Case Study On Event Extraction
13 0.58709598 196 acl-2011-Large-Scale Cross-Document Coreference Using Distributed Inference and Hierarchical Models
14 0.57611465 126 acl-2011-Exploiting Syntactico-Semantic Structures for Relation Extraction
15 0.57585382 58 acl-2011-Beam-Width Prediction for Efficient Context-Free Parsing
16 0.57420611 40 acl-2011-An Error Analysis of Relation Extraction in Social Media Documents
17 0.57239795 33 acl-2011-An Affect-Enriched Dialogue Act Classification Model for Task-Oriented Dialogue
18 0.57078362 209 acl-2011-Lexically-Triggered Hidden Markov Models for Clinical Document Coding
19 0.56977892 190 acl-2011-Knowledge-Based Weak Supervision for Information Extraction of Overlapping Relations
20 0.56942821 94 acl-2011-Deciphering Foreign Language