acl acl2011 acl2011-228 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Je Hun Jeon ; Wen Wang ; Yang Liu
Abstract: In this paper, we adopt an n-best rescoring scheme using pitch-accent patterns to improve automatic speech recognition (ASR) performance. The pitch-accent model is decoupled from the main ASR system, thus allowing us to develop it independently. N-best hypotheses from recognizers are rescored by additional scores that measure the correlation of the pitch-accent patterns between the acoustic signal and lexical cues. To test the robustness of our algorithm, we use two different data sets and recognition setups: the first one is English radio news data that has pitch accent labels, but the recognizer is trained from a small amount ofdata and has high error rate; the second one is English broadcast news data using a state-of-the-art SRI recognizer. Our experimental results demonstrate that our approach is able to reduce word error rate relatively by about 3%. This gain is consistent across the two different tests, showing promising future directions of incorporating prosodic information to improve speech recognition.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract In this paper, we adopt an n-best rescoring scheme using pitch-accent patterns to improve automatic speech recognition (ASR) performance. [sent-5, score-0.483]
2 N-best hypotheses from recognizers are rescored by additional scores that measure the correlation of the pitch-accent patterns between the acoustic signal and lexical cues. [sent-7, score-0.429]
3 This gain is consistent across the two different tests, showing promising future directions of incorporating prosodic information to improve speech recognition. [sent-10, score-0.661]
4 Speakers use prosody to convey paralinguistic information such as emphasis, intention, attitude, and emotion. [sent-12, score-0.363]
5 Humans listening to speech with natural prosody are able to understand the content with low cognitive load and high accuracy. [sent-13, score-0.443]
6 They miss useful information contained in the prosody of the speech that may help recognition. [sent-16, score-0.443]
7 Recently, a great deal of research has been done on the automatic annotation of prosodic events (Wightman and Ostendorf, 1994; Sridhar et al. [sent-17, score-0.619]
8 They used acoustic and lexical-syntactic cues to annotate prosodic events with a variety of machine learning approaches and achieved good performance. [sent-19, score-0.956]
9 There are also many studies using prosodic information for various spoken language understanding tasks. [sent-20, score-0.581]
10 However, research using prosodic knowledge for speech recognition is still quite limited. [sent-21, score-0.722]
11 In this study, we investigate leveraging prosodic information for recognition in an n-best rescoring framework. [sent-22, score-0.958]
12 Previous studies showed that prosodic events, such as pitch-accent, are closely related to acoustic prosodic cues and to the lexical structure of the utterance. [sent-23, score-1.534]
13 The pitch-accent pattern given the acoustic signal is strongly correlated with lexical items, such as syllable identity and canonical stress pattern. [sent-24, score-0.58]
14 We develop two separate pitch-accent detection models, using acoustic (observation model) and lexical information (expectation model) respectively, and propose a scoring method for the correlation of pitch-accent patterns between the two models for recognition hypotheses. [sent-26, score-0.473]
15 The fact that it holds across different baseline systems suggests the possibility that prosody can be used to help improve speech recognition performance. [sent-33, score-0.504]
16 The use of prosody in speech understanding applications has been quite extensive. [sent-42, score-0.443]
17 Incorporating prosodic knowledge is expected to improve the performance of speech recognition. [sent-49, score-0.661]
18 However, how to effectively integrate prosody within the traditional ASR framework is a difficult problem, since prosodic features are not well defined and span longer regions than the spectral features used in current ASR systems. [sent-50, score-0.978]
19 Various studies have been conducted trying to incorporate prosodic information into ASR. [sent-51, score-0.581]
20 One way is to directly integrate prosodic features into the ASR framework (Vergyri et al. [sent-52, score-0.581]
21 This kind of integration has advantages in that spectral and prosodic features are more tightly coupled and jointly modeled. [sent-56, score-0.615]
22 Alternatively, prosody was modeled independently of the acoustic and language models of ASR and used to rescore recognition hypotheses in a second pass. [sent-57, score-0.802]
23 This approach makes it possible to independently model and optimize the prosodic knowledge and to combine it with ASR hypotheses without any modification of the conventional ASR modules. [sent-58, score-0.609]
24 In order to improve rescoring performance, various kinds of prosodic knowledge have been studied. [sent-59, score-0.897]
25 (Ananthakrishnan and Narayanan, 2007) used acoustic pitch-accent patterns and their sequential information given lexical cues to rescore n-best hypotheses. [sent-60, score-0.411]
26 (Kalinli and Narayanan, 2009) used acoustic prosodic cues such as pitch and duration along with other knowledge to choose a proper word among several candidates in confusion networks. [sent-61, score-1.057]
27 In this study we take an approach similar to the second one above, in that we develop prosodic models separately and use them in a rescoring framework. [sent-63, score-0.917]
28 In our approach, we explicitly model the symbolic prosodic events based on acoustic and lexical information. [sent-65, score-0.945]
29 We then capture the correlation of pitch-accent patterns between the two different cues, and use that to improve recognition performance in an n-best rescoring paradigm. [sent-66, score-0.403]
30 Since pitch-accent is usually carried by syllables, we use syllables as our units, and the syllable definition of each word is based on the CMU pronunciation dictionary, which has lexical stress and syllable boundary marks (Bartlett et al. [sent-69, score-0.629]
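To make the syllable-unit setup concrete, the following is a minimal sketch (not the authors' code; the phone inventory and the one-vowel-per-syllable attachment rule are simplifying assumptions) of how a CMU-style pronunciation with stress digits could be split into syllable units carrying canonical stress:

```python
# Naive syllabifier over a CMU-style (ARPAbet) pronunciation.
# Stress digits on the vowels (0/1/2) give the canonical stress pattern.
VOWELS = {"AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER",
          "EY", "IH", "IY", "OW", "OY", "UH", "UW"}

def syllabify(phones):
    """Group phones into syllables, one vowel nucleus per syllable.

    Consonants before the first vowel join the first syllable; later
    consonants are (naively) attached to the preceding syllable.
    Returns a list of (phone_list, stress) pairs.
    """
    syllables, current, stress = [], [], None
    for ph in phones:
        base = ph[:-1] if ph[-1].isdigit() else ph
        if base in VOWELS:
            if stress is not None:          # close the previous syllable
                syllables.append((current, stress))
                current = []
            stress = int(ph[-1]) if ph[-1].isdigit() else 0
        current.append(ph)
    if current:
        syllables.append((current, stress if stress is not None else 0))
    return syllables

# e.g. "Massachusetts" (pronunciation approximate) -> four syllables
print(syllabify("M AE1 S AH0 CH UW1 S AH0 T S".split()))
```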
31 1 Acoustic-prosodic Features Similar to most previous work, the prosodic features we use include pitch, energy, and duration. [sent-73, score-0.581]
32 These are used widely in prosodic event detection and emotion detection. [sent-79, score-0.621]
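As an illustration of the kind of acoustic-prosodic features mentioned above, here is a small sketch that computes pitch, energy, and duration statistics for one syllable from frame-level contours; the specific statistics, frame rate, and variable names are assumptions rather than the paper's exact feature set:

```python
import numpy as np

def syllable_prosodic_features(f0, energy, start, end, frame_rate=100):
    """Pitch/energy/duration features for one syllable (illustrative only).

    f0, energy : per-frame arrays for the whole utterance
    start, end : syllable boundaries in frames
    """
    seg_f0 = f0[start:end]
    seg_en = energy[start:end]
    voiced = seg_f0[seg_f0 > 0]                      # ignore unvoiced frames
    return {
        "dur":      (end - start) / frame_rate,      # seconds
        "f0_mean":  float(voiced.mean()) if voiced.size else 0.0,
        "f0_range": float(voiced.max() - voiced.min()) if voiced.size else 0.0,
        "en_mean":  float(seg_en.mean()),
        "en_max":   float(seg_en.max()),
    }

# toy example: 3 s utterance at 100 frames/s, syllable spanning frames 50-80
rng = np.random.default_rng(0)
f0 = rng.uniform(80, 250, 300)
energy = rng.uniform(0.1, 1.0, 300)
print(syllable_prosodic_features(f0, energy, 50, 80))
```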
33 3 Prosodic Model Training We choose to use a support vector machine (SVM) classifier for the prosodic model, based on previous work on prosody labeling in (Jeon and Liu, 2010). [sent-103, score-0.944]
34 In our experiments, we investigate two kinds of training methods for prosodic modeling. [sent-105, score-0.581]
35 It uses Li (i = 1, 2) to train two distinct classifiers: the acoustic classifier h1, and the lexical classifier h2. [sent-115, score-0.368]
36 This co-training method is expected to cope with two problems in prosodic model training. [sent-122, score-0.581]
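A minimal sketch of such a co-training loop is given below, assuming scikit-learn SVMs, two feature views per syllable (acoustic and lexical), and a simple top-k confidence selection rule; the selection criterion, batch size, and number of rounds are assumptions, not the paper's exact settings. Each classifier labels the unlabeled syllables it is most confident about, and those examples extend the other classifier's training set.

```python
import numpy as np
from sklearn.svm import SVC

def co_train(Xa, Xl, y, Xa_u, Xl_u, rounds=5, per_round=50):
    """Co-train an acoustic classifier h1 and a lexical classifier h2.

    Xa, Xl, y  : labeled acoustic / lexical feature views and their labels.
    Xa_u, Xl_u : the same unlabeled syllables seen in both views.
    """
    Xa1, y1 = Xa.copy(), y.copy()          # training set of h1 (acoustic)
    Xl2, y2 = Xl.copy(), y.copy()          # training set of h2 (lexical)
    h1 = SVC(probability=True).fit(Xa1, y1)
    h2 = SVC(probability=True).fit(Xl2, y2)
    for _ in range(rounds):
        if len(Xa_u) == 0:
            break
        conf1 = h1.predict_proba(Xa_u).max(axis=1)
        conf2 = h2.predict_proba(Xl_u).max(axis=1)
        pick1 = np.argsort(-conf1)[:per_round]      # h1's most confident
        pick2 = np.argsort(-conf2)[:per_round]      # h2's most confident
        # h1's confident labels extend h2's data, and vice versa
        Xl2 = np.vstack([Xl2, Xl_u[pick1]])
        y2 = np.concatenate([y2, h1.predict(Xa_u[pick1])])
        Xa1 = np.vstack([Xa1, Xa_u[pick2]])
        y1 = np.concatenate([y1, h2.predict(Xl_u[pick2])])
        h1 = SVC(probability=True).fit(Xa1, y1)
        h2 = SVC(probability=True).fit(Xl2, y2)
        used = np.union1d(pick1, pick2)
        keep = np.setdiff1d(np.arange(len(Xa_u)), used)
        Xa_u, Xl_u = Xa_u[keep], Xl_u[keep]
    return h1, h2
```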
37 Instead of using the prosodic model in the first pass decoding, we use it to rescore n-best candidates from a speech recognizer. [sent-129, score-0.7]
38 This allows us to train the prosody models independently and better optimize the models. [sent-130, score-0.383]
39 For p(Ap|W), the prosody score for a word sequence W, in this work we propose a method to estimate it, also represented as score_prosody(W). [sent-131, score-0.382]
40 The idea of scoring the prosody patterns is that there is some expectation of pitch-accent patterns given the lexical sequence (W), and the acoustic pitch-accent should match this expectation. [sent-132, score-0.792]
41 In order to maximize the agreement between the two sources, we measure how well the acoustic pitch-accent in the speech signal matches the given lexical cues. [sent-134, score-0.406]
42 For each syllable Si in the n-best list, we use acoustic-prosodic cues (ai) to estimate the posterior probability that the syllable is prominent (P), p(P|ai). [sent-135, score-0.465]
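One plausible way to turn these per-syllable posteriors into an agreement-based prosody score is sketched below: for each syllable, take the probability that the acoustic posterior p(P|ai) and the lexical expectation agree on the pitch-accent class, and average the log of this quantity over the hypothesis. The exact form of Equation 3 in the paper may differ; this is an assumed, hedged formulation.

```python
import numpy as np

def syllable_agreement(p_acc_acoustic, p_acc_lexical):
    """Probability that the acoustic and lexical models assign the same
    pitch-accent class to a syllable (one plausible agreement measure)."""
    p, q = p_acc_acoustic, p_acc_lexical
    return p * q + (1.0 - p) * (1.0 - q)

def prosody_score(acoustic_posteriors, lexical_posteriors):
    """Hypothesis-level prosody score: mean log-agreement over syllables."""
    scores = [np.log(syllable_agreement(p, q) + 1e-10)
              for p, q in zip(acoustic_posteriors, lexical_posteriors)]
    return float(np.mean(scores)) if scores else 0.0

# agreement is high when both models point the same way
print(prosody_score([0.9, 0.2, 0.8], [0.8, 0.1, 0.7]))
print(prosody_score([0.9, 0.2, 0.8], [0.1, 0.9, 0.2]))  # mismatched -> lower
```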
43 We notice that syllables without pitch-accent have much shorter duration than the prominent ones, and the prosody scores for the short syllables tend to be high. [sent-138, score-0.77]
44 Part of the data has been labeled with ToBI-style prosodic annotations. [sent-147, score-0.581]
45 In fact, the reason that we use this corpus, instead of other corpora typically used for ASR experiments, is because of its prosodic labels. [sent-148, score-0.581]
46 7 hours of speech (4,234 utterances) for the co-training algorithm for the prosodic models. [sent-169, score-0.661]
47 For prosodic models, we used a simple binary representation of pitch-accent in the form of presence versus absence. [sent-170, score-0.581]
48 For rescoring, not only are the accuracies of the two individual prosodic models important, but the pitch-accent agreement score between the two models (as shown in Equation 3) is also critical; therefore, we present results using both metrics. [sent-183, score-0.64]
49 Table 1 shows the accuracy of each model for pitch-accent detection, and also the average prosody score of the two models (i. [sent-184, score-0.402]
50 After co-training, the overall accuracies improve slightly, and therefore the prosody score also increases. [sent-192, score-0.382]
51 Table 1: Pitch accent detection results: performance of individual acoustic and lexical models, and the agreement between the two models (i. [sent-199, score-0.434]
52 e., prosody score for a syllable, Equation 3) for positive and negative classes. [sent-201, score-0.382]
53 We apply the acoustic and lexical prosodic models to each hypothesis to obtain its prosody score, and combine it with ASR scores to find the top hypothesis. [sent-204, score-1.346]
54 Table 2 shows the rescoring results using the first recognition system on BU data, which was trained with a relatively small amount of data. [sent-207, score-0.377]
55 We used two prosodic models as described in Section 3. [sent-210, score-0.601]
56 The first one is the base prosodic model using supervised training (S-model). [sent-212, score-0.581]
57 The second is the prosodic model with the co-training algorithm (Cmodel). [sent-213, score-0.581]
58 For these rescoring experiments, we tuned λ (in Equation 5) when combining the ASR acoustic and language model scores with the additional prosody score. [sent-214, score-0.998]
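A sketch of this rescoring and tuning step is shown below, assuming the combined score is a simple linear interpolation score_ASR + λ · score_prosody (Equation 5 in the paper may weight the acoustic, language model, and prosody scores differently) and that λ is grid-searched on a development set by minimizing word errors:

```python
def word_errors(hyp, ref):
    """Levenshtein word edit distance between hypothesis and reference."""
    d = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        prev, d[0] = d[0], i
        for j, r in enumerate(ref, 1):
            cur = min(d[j] + 1, d[j - 1] + 1, prev + (h != r))
            prev, d[j] = d[j], cur
    return d[-1]

def rescore_nbest(nbest, lam):
    """Pick the hypothesis maximizing score_ASR + lam * score_prosody.

    nbest: list of (words, asr_score, prosody_score); asr_score is assumed
    to already combine the acoustic and language model scores.
    """
    return max(nbest, key=lambda h: h[1] + lam * h[2])[0]

def tune_lambda(dev_nbests, dev_refs,
                grid=(0.0, 0.1, 0.2, 0.5, 1.0, 2.0, 5.0)):
    """Grid-search lambda on development data by minimizing word errors."""
    def total_errors(lam):
        return sum(word_errors(rescore_nbest(nb, lam), ref)
                   for nb, ref in zip(dev_nbests, dev_refs))
    return min(grid, key=total_errors)

# toy usage: a larger lambda lets the prosody score flip the ranking
nbest = [(["the", "mass", "of"], -10.0, -0.2),
         (["the", "most", "of"], -10.5, -0.05)]
print(rescore_nbest(nbest, lam=5.0))
```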
59 Even though the prosodic event detection performance of these two prosodic models is similar, the improved prosody score between the acoustic and lexical prosodic models using co-training helps rescoring. [sent-219, score-2.531]
60 After rescoring using prosodic knowledge, the WER is reduced by 0. [sent-220, score-0.897]
61 Table 2: WER of the baseline system and after rescoring using prosodic models. [sent-226, score-0.897]
62 test data is smaller when using the C-model than the S-model, which means that the prosodic model with co-training is more stable. [sent-228, score-0.581]
63 These results verify again that the prosodic scores contribute more in the combination with ASR likelihood scores when using the C-model, and are more robust across different tuning sets. [sent-230, score-0.637]
64 Ananthakrishnan and Narayanan (2007) also used acoustic/lexical prosodic models to estimate a prosody score and reported 0. [sent-231, score-0.983]
65 3% recognition error reduction on BU data when rescoring 100-best list (their baseline WER is 22. [sent-232, score-0.401]
66 Next we test our n-best rescoring approach using a state-of-the-art SRI speech recognizer on BN data to verify whether our approach can generalize to better ASR n-best lists. [sent-235, score-0.46]
67 First, the baseline ASR performance is higher, making further improvement hard; second, and more importantly, the prosody models do not match well to the test domain. [sent-245, score-0.383]
68 Table 3: WER of the baseline system and after rescoring using prosodic models. [sent-251, score-0.897]
69 For a better understanding of the improvement using the prosody model, we analyzed the pattern of corrections (the new hypothesis after rescoring is correct while the original 1-best is wrong) and errors. [sent-257, score-0.707]
70 For example, when the acoustic classifier predicts a syllable as pitch-accented and the lexical one as not accented, ‘10’ marker is assigned to the syllable. [sent-264, score-0.54]
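A tiny helper along these lines (hypothetical, for illustration only) that produces the marker strings used in this error analysis:

```python
def agreement_markers(acoustic_preds, lexical_preds):
    """Per-syllable marker strings: first digit = acoustic prediction,
    second digit = lexical prediction, so '10' means the syllable is
    predicted accented acoustically but not expected lexically."""
    return ["%d%d" % (a, l) for a, l in zip(acoustic_preds, lexical_preds)]

print(agreement_markers([1, 0, 1, 1], [1, 0, 0, 1]))  # ['11', '00', '10', '11']
```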
71 As shown in the positive example of Table 4, we find that our prosodic model is effective at identifying an erroneous word when it is split into two words, resulting in different pitch-accent patterns. [sent-267, score-0.581]
72 Table 4: Examples of rescoring results (positive and negative examples of 1-best hypotheses, with per-syllable acoustic/lexical pitch-accent agreement markers shown in parentheses). [sent-268, score-0.386]
73 We conducted more prosody rescoring experiments in order to understand the model behavior. [sent-272, score-0.679]
74 In the first experiment, among the 100 hypotheses in the n-best list, we gave a prosody score of 0 to the 100th hypothesis, and used automatically obtained prosodic scores for the other hypotheses. [sent-274, score-1.019]
75 A zero prosody score means perfect agreement between the acoustic and lexical cues. [sent-275, score-0.708]
76 The original scores from the recognizer were combined with the prosodic scores for rescoring. [sent-276, score-0.701]
77 This was to verify that the range of the weighting factor λ estimated on the development data (using the original, not the modified, prosody scores for all candidates) was reasonable for choosing the proper hypothesis among all the candidates. [sent-277, score-0.419]
78 This hypothesis has the highest prosodic scores, but lowest ASR score. [sent-279, score-0.609]
79 This result showed that if the prosodic models were accurate enough, the correct candidate could be chosen using our rescoring framework. [sent-280, score-0.917]
80 We used the same ASR scores for all candidates, and generated prosodic scores using our prosody model. [sent-282, score-1.0]
81 This was to test whether our model could pick out the correct candidate using only the prosodic score. [sent-283, score-0.581]
82 , 1/100), suggesting the benefit of the prosody model; however, this percentage is not very high, implying the limitation of prosodic information for ASR or the imperfection of the current prosodic models. [sent-287, score-1.525]
83 When using our prosody rescoring approach, we obtained a relative error rate reduction of 6. [sent-289, score-0.703]
84 This demonstrates again that our rescoring method works well: if the correct hypothesis is on the list, even with a low ASR score, prosodic information can help identify the correct candidate. [sent-291, score-0.925]
85 Overall the performance improvement we obtained from rescoring by incorporating prosodic information is very promising. [sent-292, score-0.897]
86 Our evaluation using two different ASR systems shows that the improvement holds even when we use a state-of-the-art recognizer and the training data for the prosody model does not come from the same corpus. [sent-293, score-0.427]
87 – 7 Conclusion In this paper, we attempt to integrate prosodic information for ASR using an n-best rescoring scheme. [sent-295, score-0.897]
88 This approach decouples the prosodic model from the main ASR system, so the prosodic model can be built independently. [sent-296, score-1.162]
89 The prosodic scores that we use for n-best rescoring are based on the match between pitch-accent patterns predicted from acoustic and lexical features. [sent-297, score-1.277]
90 The fact that the gain holds across different baseline systems (including a state-of-theart speech recognizer) suggests the possibility that prosody can be used to improve speech recognition performance. [sent-301, score-0.584]
91 As suggested by our experiments, better prosodic models can result in more WER reduction. [sent-302, score-0.601]
92 The performance of our prosodic model was improved with co-training, but there are still problems, such as the imbalance between the two classifiers' predictions, as well as between the two events. [sent-303, score-0.581]
93 Since the prosodic features we use include cross-word contextual information, it is not straightforward to apply our approach directly to lattices. [sent-307, score-0.581]
94 Improved speech recognition using acoustic and lexical correlates of pitch accent in a n-best rescoring framework. [sent-311, score-0.912]
95 Automatic prosodic event detection using acoustic, lexical and syntactic evidence. [sent-316, score-0.656]
96 Modeling prosodic features with joint factor analysis for speaker verification. [sent-339, score-0.601]
97 Automatic prosodic events detection using syllable-based acoustic and syntactic features. [sent-348, score-0.95]
98 Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. [sent-394, score-0.654]
99 Combining lexical, syntactic and prosodic cues for improved online dialog act tagging. [sent-398, score-0.627]
100 Speech recognition supported by prosodic information for fixed stress languages. [sent-407, score-0.678]
wordName wordTfidf (topN-words)
[('prosodic', 0.581), ('prosody', 0.363), ('asr', 0.337), ('rescoring', 0.316), ('acoustic', 0.291), ('syllable', 0.193), ('syllables', 0.144), ('wer', 0.119), ('bu', 0.114), ('jeon', 0.082), ('pitch', 0.081), ('speech', 0.08), ('recognizer', 0.064), ('shrikanth', 0.063), ('recognition', 0.061), ('duration', 0.058), ('ostendorf', 0.053), ('pitchaccent', 0.051), ('accent', 0.048), ('broadcast', 0.047), ('cues', 0.046), ('ananthakrishnan', 0.041), ('detection', 0.04), ('rescore', 0.039), ('sridhar', 0.039), ('radio', 0.039), ('events', 0.038), ('argmwaxp', 0.038), ('stress', 0.036), ('narayanan', 0.035), ('bn', 0.035), ('lexical', 0.035), ('spectral', 0.034), ('contour', 0.033), ('hun', 0.033), ('scorew', 0.033), ('shriberg', 0.033), ('prominent', 0.033), ('utterances', 0.033), ('vergyri', 0.031), ('parenthesis', 0.031), ('audio', 0.031), ('unlabeled', 0.031), ('icassp', 0.03), ('sri', 0.029), ('pronunciation', 0.028), ('hypotheses', 0.028), ('hypothesis', 0.028), ('scores', 0.028), ('mari', 0.028), ('je', 0.027), ('news', 0.026), ('patterns', 0.026), ('identity', 0.025), ('benus', 0.025), ('dehak', 0.025), ('dhi', 0.025), ('gadde', 0.025), ('grabe', 0.025), ('kalinli', 0.025), ('legendre', 0.025), ('rangarajan', 0.025), ('scoreasr', 0.025), ('szaszak', 0.025), ('tobi', 0.025), ('venkata', 0.025), ('wightman', 0.025), ('xampler', 0.025), ('utterance', 0.025), ('equation', 0.025), ('elizabeth', 0.024), ('reduction', 0.024), ('stolcke', 0.024), ('ap', 0.023), ('andreas', 0.022), ('energy', 0.022), ('bartlett', 0.022), ('durations', 0.022), ('dur', 0.022), ('dimitra', 0.022), ('ferrer', 0.022), ('sdt', 0.022), ('classifier', 0.021), ('interspeech', 0.021), ('rosenberg', 0.021), ('nelson', 0.021), ('hwang', 0.021), ('vivek', 0.021), ('dte', 0.021), ('rescored', 0.021), ('luciana', 0.021), ('sankaranarayanan', 0.021), ('models', 0.02), ('classifiers', 0.02), ('transactions', 0.02), ('speaker', 0.02), ('julia', 0.02), ('reference', 0.019), ('venkataraman', 0.019), ('score', 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000007 228 acl-2011-N-Best Rescoring Based on Pitch-accent Patterns
Author: Je Hun Jeon ; Wen Wang ; Yang Liu
Abstract: In this paper, we adopt an n-best rescoring scheme using pitch-accent patterns to improve automatic speech recognition (ASR) performance. The pitch-accent model is decoupled from the main ASR system, thus allowing us to develop it independently. N-best hypotheses from recognizers are rescored by additional scores that measure the correlation of the pitch-accent patterns between the acoustic signal and lexical cues. To test the robustness of our algorithm, we use two different data sets and recognition setups: the first one is English radio news data that has pitch accent labels, but the recognizer is trained from a small amount ofdata and has high error rate; the second one is English broadcast news data using a state-of-the-art SRI recognizer. Our experimental results demonstrate that our approach is able to reduce word error rate relatively by about 3%. This gain is consistent across the two different tests, showing promising future directions of incorporating prosodic information to improve speech recognition.
2 0.36216009 257 acl-2011-Question Detection in Spoken Conversations Using Textual Conversations
Author: Anna Margolis ; Mari Ostendorf
Abstract: We investigate the use of textual Internet conversations for detecting questions in spoken conversations. We compare the text-trained model with models trained on manuallylabeled, domain-matched spoken utterances with and without prosodic features. Overall, the text-trained model achieves over 90% of the performance (measured in Area Under the Curve) of the domain-matched model including prosodic features, but does especially poorly on declarative questions. We describe efforts to utilize unlabeled spoken utterances and prosodic features via domain adaptation.
3 0.22259627 95 acl-2011-Detection of Agreement and Disagreement in Broadcast Conversations
Author: Wen Wang ; Sibel Yaman ; Kristin Precoda ; Colleen Richey ; Geoffrey Raymond
Abstract: We present Conditional Random Fields based approaches for detecting agreement/disagreement between speakers in English broadcast conversation shows. We develop annotation approaches for a variety of linguistic phenomena. Various lexical, structural, durational, and prosodic features are explored. We compare the performance when using features extracted from automatically generated annotations against that when using human annotations. We investigate the efficacy of adding prosodic features on top of lexical, structural, and durational features. Since the training data is highly imbalanced, we explore two sampling approaches, random downsampling and ensemble downsampling. Overall, our approach achieves 79.2% (precision), 50.5% (recall), 61.7% (F1) for agreement detection and 69.2% (precision), 46.9% (recall), and 55.9% (F1) for disagreement detection, on the English broadcast conversation data.
4 0.1766371 83 acl-2011-Contrasting Multi-Lingual Prosodic Cues to Predict Verbal Feedback for Rapport
Author: Siwei Wang ; Gina-Anne Levow
Abstract: Verbal feedback is an important information source in establishing interactional rapport. However, predicting verbal feedback across languages is challenging due to languagespecific differences, inter-speaker variation, and the relative sparseness and optionality of verbal feedback. In this paper, we employ an approach combining classifier weighting and SMOTE algorithm oversampling to improve verbal feedback prediction in Arabic, English, and Spanish dyadic conversations. This approach improves the prediction of verbal feedback, up to 6-fold, while maintaining a high overall accuracy. Analyzing highly weighted features highlights widespread use of pitch, with more varied use of intensity and duration.
5 0.15413919 272 acl-2011-Semantic Information and Derivation Rules for Robust Dialogue Act Detection in a Spoken Dialogue System
Author: Wei-Bin Liang ; Chung-Hsien Wu ; Chia-Ping Chen
Abstract: In this study, a novel approach to robust dialogue act detection for error-prone speech recognition in a spoken dialogue system is proposed. First, partial sentence trees are proposed to represent a speech recognition output sentence. Semantic information and the derivation rules of the partial sentence trees are extracted and used to model the relationship between the dialogue acts and the derivation rules. The constructed model is then used to generate a semantic score for dialogue act detection given an input speech utterance. The proposed approach is implemented and evaluated in a Mandarin spoken dialogue system for tour-guiding service. Combined with scores derived from the ASR recognition probability and the dialogue history, the proposed approach achieves 84.3% detection accuracy, an absolute improvement of 34.7% over the baseline of the semantic slot-based method with 49.6% detection accuracy.
6 0.15276414 312 acl-2011-Turn-Taking Cues in a Human Tutoring Corpus
7 0.12184155 26 acl-2011-A Speech-based Just-in-Time Retrieval System using Semantic Search
9 0.071899377 203 acl-2011-Learning Sub-Word Units for Open Vocabulary Speech Recognition
10 0.066798367 223 acl-2011-Modeling Wisdom of Crowds Using Latent Mixture of Discriminative Experts
11 0.063341476 118 acl-2011-Entrainment in Speech Preceding Backchannels.
12 0.06289535 249 acl-2011-Predicting Relative Prominence in Noun-Noun Compounds
13 0.059228264 21 acl-2011-A Pilot Study of Opinion Summarization in Conversations
14 0.056831811 306 acl-2011-Towards Style Transformation from Written-Style to Audio-Style
15 0.044950563 33 acl-2011-An Affect-Enriched Dialogue Act Classification Model for Task-Oriented Dialogue
16 0.041928418 167 acl-2011-Improving Dependency Parsing with Semantic Classes
17 0.041883659 183 acl-2011-Joint Bilingual Sentiment Classification with Unlabeled Parallel Corpora
18 0.041461945 301 acl-2011-The impact of language models and loss functions on repair disfluency detection
19 0.033753194 217 acl-2011-Machine Translation System Combination by Confusion Forest
20 0.032844577 122 acl-2011-Event Extraction as Dependency Parsing
topicId topicWeight
[(0, 0.113), (1, 0.028), (2, -0.026), (3, 0.01), (4, -0.149), (5, 0.156), (6, 0.01), (7, -0.026), (8, 0.044), (9, 0.061), (10, 0.024), (11, -0.019), (12, -0.003), (13, 0.054), (14, 0.066), (15, 0.031), (16, -0.078), (17, -0.051), (18, 0.084), (19, -0.132), (20, 0.059), (21, -0.144), (22, -0.137), (23, 0.275), (24, 0.023), (25, 0.074), (26, 0.273), (27, -0.101), (28, -0.103), (29, 0.052), (30, -0.163), (31, -0.006), (32, 0.177), (33, -0.076), (34, 0.006), (35, -0.021), (36, 0.131), (37, 0.015), (38, -0.052), (39, -0.003), (40, -0.095), (41, -0.065), (42, -0.067), (43, 0.134), (44, 0.046), (45, -0.025), (46, -0.042), (47, -0.062), (48, 0.128), (49, -0.012)]
simIndex simValue paperId paperTitle
same-paper 1 0.94503331 228 acl-2011-N-Best Rescoring Based on Pitch-accent Patterns
Author: Je Hun Jeon ; Wen Wang ; Yang Liu
Abstract: In this paper, we adopt an n-best rescoring scheme using pitch-accent patterns to improve automatic speech recognition (ASR) performance. The pitch-accent model is decoupled from the main ASR system, thus allowing us to develop it independently. N-best hypotheses from recognizers are rescored by additional scores that measure the correlation of the pitch-accent patterns between the acoustic signal and lexical cues. To test the robustness of our algorithm, we use two different data sets and recognition setups: the first one is English radio news data that has pitch accent labels, but the recognizer is trained from a small amount ofdata and has high error rate; the second one is English broadcast news data using a state-of-the-art SRI recognizer. Our experimental results demonstrate that our approach is able to reduce word error rate relatively by about 3%. This gain is consistent across the two different tests, showing promising future directions of incorporating prosodic information to improve speech recognition.
2 0.78804338 257 acl-2011-Question Detection in Spoken Conversations Using Textual Conversations
Author: Anna Margolis ; Mari Ostendorf
Abstract: We investigate the use of textual Internet conversations for detecting questions in spoken conversations. We compare the text-trained model with models trained on manuallylabeled, domain-matched spoken utterances with and without prosodic features. Overall, the text-trained model achieves over 90% of the performance (measured in Area Under the Curve) of the domain-matched model including prosodic features, but does especially poorly on declarative questions. We describe efforts to utilize unlabeled spoken utterances and prosodic features via domain adaptation.
3 0.76326615 83 acl-2011-Contrasting Multi-Lingual Prosodic Cues to Predict Verbal Feedback for Rapport
Author: Siwei Wang ; Gina-Anne Levow
Abstract: Verbal feedback is an important information source in establishing interactional rapport. However, predicting verbal feedback across languages is challenging due to languagespecific differences, inter-speaker variation, and the relative sparseness and optionality of verbal feedback. In this paper, we employ an approach combining classifier weighting and SMOTE algorithm oversampling to improve verbal feedback prediction in Arabic, English, and Spanish dyadic conversations. This approach improves the prediction of verbal feedback, up to 6-fold, while maintaining a high overall accuracy. Analyzing highly weighted features highlights widespread use of pitch, with more varied use of intensity and duration.
4 0.73408502 95 acl-2011-Detection of Agreement and Disagreement in Broadcast Conversations
Author: Wen Wang ; Sibel Yaman ; Kristin Precoda ; Colleen Richey ; Geoffrey Raymond
Abstract: We present Conditional Random Fields based approaches for detecting agreement/disagreement between speakers in English broadcast conversation shows. We develop annotation approaches for a variety of linguistic phenomena. Various lexical, structural, durational, and prosodic features are explored. We compare the performance when using features extracted from automatically generated annotations against that when using human annotations. We investigate the efficacy of adding prosodic features on top of lexical, structural, and durational features. Since the training data is highly imbalanced, we explore two sampling approaches, random downsampling and ensemble downsampling. Overall, our approach achieves 79.2% (precision), 50.5% (recall), 61.7% (F1) for agreement detection and 69.2% (precision), 46.9% (recall), and 55.9% (F1) for disagreement detection, on the English broadcast conversation data. 1 ?yIntroduction In ?ythis work, we present models for detecting agre?yement/disagreement (denoted (dis)agreement) betwy?een speakers in English broadcast conversation show?ys. The Broadcast Conversation (BC) genre differs from the Broadcast News (BN) genre in that it is?y more interactive and spontaneous, referring to freey? speech in news-style TV and radio programs and consisting of talk shows, interviews, call-in prog?yrams, live reports, and round-tables. Previous y? y?This work was performed while the author was at ICSI. syaman@us . ibm .com, graymond@ s oc .uc sb . edu work on detecting (dis)agreements has been focused on meeting data. (Hillard et al., 2003), (Galley et al., 2004), (Hahn et al., 2006) used spurt-level agreement annotations from the ICSI meeting corpus (Janin et al., 2003). (Hillard et al., 2003) explored unsupervised machine learning approaches and on manual transcripts, they achieved an overall 3-way agreement/disagreement classification ac- curacy as 82% with keyword features. (Galley et al., 2004) explored Bayesian Networks for the detection of (dis)agreements. They used adjacency pair information to determine the structure of their conditional Markov model and outperformed the results of (Hillard et al., 2003) by improving the 3way classification accuracy into 86.9%. (Hahn et al., 2006) explored semi-supervised learning algorithms and reached a competitive performance of 86.7% 3-way classification accuracy on manual transcriptions with only lexical features. (Germesin and Wilson, 2009) investigated supervised machine learning techniques and yields competitive results on the annotated data from the AMI meeting corpus (McCowan et al., 2005). Our work differs from these previous studies in two major categories. One is that a different definition of (dis)agreement was used. In the current work, a (dis)agreement occurs when a responding speaker agrees with, accepts, or disagrees with or rejects, a statement or proposition by a first speaker. Second, we explored (dis)agreement detection in broadcast conversation. Due to the difference in publicity and intimacy/collegiality between speakers in broadcast conversations vs. meet- ings, (dis)agreement may have different character374 Proceedings ofP thoer t4l9atnhd A, Onrnuegaoln M,e Jeuntineg 19 o-f2 t4h,e 2 A0s1s1o.c?i ac t2io0n11 fo Ar Cssoocmiaptuiotanti foonra Clo Lminpguutiast i ocns:aslh Loirntpgaupisetrics , pages 374–378, istics. Different from the unsupervised approaches in (Hillard et al., 2003) and semi-supervised approaches in (Hahn et al., 2006), we conducted supervised training. 
Also, different from (Hillard et al., 2003) and (Galley et al., 2004), our classification was carried out on the utterance level, instead of on the spurt-level. Galley et al. extended Hillard et al.’s work by adding features from previous spurts and features from the general dialog context to infer the class of the current spurt, on top of features from the current spurt (local features) used by Hillard et al. Galley et al. used adjacency pairs to describe the interaction between speakers and the relations between consecutive spurts. In this preliminary study on broadcast conversation, we directly modeled (dis)agreement detection without using adjacency pairs. Still, within the conditional random fields (CRF) framework, we explored features from preceding and following utterances to consider context in the discourse structure. We explored a wide variety of features, including lexical, structural, du- rational, and prosodic features. To our knowledge, this is the first work to systematically investigate detection of agreement/disagreement for broadcast conversation data. The remainder of the paper is organized as follows. Section 2 presents our data and automatic annotation modules. Section 3 describes various features and the CRF model we explored. Experimental results and discussion appear in Section 4, as well as conclusions and future directions. 2 Data and Automatic Annotation In this work, we selected English broadcast conversation data from the DARPA GALE program collected data (GALE Phase 1 Release 4, LDC2006E91; GALE Phase 4 Release 2, LDC2009E15). Human transcriptions and manual speaker turn labels are used in this study. Also, since the (dis)agreement detection output will be used to analyze social roles and relations of an interacting group, we first manually marked soundbites and then excluded soundbites during annotation and modeling. We recruited annotators to provide manual annotations of speaker roles and (dis)agreement to use for the supervised training of models. We de- fined a set of speaker roles as follows. Host/chair is a person associated with running the discussions 375 or calling the meeting. Reporting participant is a person reporting from the field, from a subcommittee, etc. Commentator participant/Topic participant is a person providing commentary on some subject, or person who is the subject of the conversation and plays a role, e.g., as a newsmaker. Audience participant is an ordinary person who may call in, ask questions at a microphone at e.g. a large presentation, or be interviewed because of their presence at a news event. Other is any speaker who does not fit in one of the above categories, such as a voice talent, an announcer doing show openings or commercial breaks, or a translator. Agreements and disagreements are composed of different combinations of initiating utterances and responses. We reformulated the (dis)agreement detection task as the sequence tagging of 11 (dis)agreement-related labels for identifying whether a given utterance is initiating a (dis)agreement opportunity, is a (dis)agreement response to such an opportunity, or is neither of these, in the show. For example, a Negative tag question followed by a negation response forms an agreement, that is, A: [Negative tag] This is not black and white, is it? B: [Agreeing Response] No, it isn’t. The data sparsity problem is serious. 
Among all 27,071 utterances, only 2,589 utterances are involved in (dis)agreement as initiating or response utterances, about 10% only among all data, while 24,482 utterances are not involved. These annotators also labeled shows with a variety of linguistic phenomena (denoted language use constituents, LUC), including discourse markers, disfluencies, person addresses and person mentions, prefaces, extreme case formulations, and dialog act tags (DAT). We categorized dialog acts into statement, question, backchannel, and incomplete. We classified disfluencies (DF) into filled pauses (e.g., uh, um), repetitions, corrections, and false starts. Person address (PA) terms are terms that a speaker uses to address another person. Person mentions (PM) are references to non-participants in the conversation. Discourse markers (DM) are words or phrases that are related to the structure of the discourse and express a relation between two utter- ances, for example, I mean, you know. Prefaces (PR) are sentence-initial lexical tokens serving functions close to discourse markers (e.g., Well, I think that...). Extreme case formulations (ECF) are lexical patterns emphasizing extremeness (e.g., This is the best book I have ever read). In the end, we manually annotated 49 English shows. We preprocessed English manual transcripts by removing transcriber annotation markers and noise, removing punctuation and case information, and conducting text normalization. We also built automatic rule-based and statistical annotation tools for these LUCs. 3 Features and Model We explored lexical, structural, durational, and prosodic features for (dis)agreement detection. We included a set of “lexical” features, including ngrams extracted from all of that speaker’s utterances, denoted ngram features. Other lexical features include the presence of negation and acquiescence, yes/no equivalents, positive and negative tag questions, and other features distinguishing different types of initiating utterances and responses. We also included various lexical features extracted from LUC annotations, denoted LUC features. These additional features include features related to the presence of prefaces, the counts of types and tokens of discourse markers, extreme case formulations, disfluencies, person addressing events, and person mentions, and the normalized values of these counts by sentence length. We also include a set of features related to the DAT of the current utterance and preceding and following utterances. We developed a set of “structural” and “durational” features, inspired by conversation analysis, to quantitatively represent the different participation and interaction patterns of speakers in a show. We extracted features related to pausing and overlaps between consecutive turns, the absolute and relative duration of consecutive turns, and so on. We used a set of prosodic features including pause, duration, and the speech rate of a speaker. We also used pitch and energy of the voice. Prosodic features were computed on words and phonetic alignment of manual transcripts. Features are computed for the beginning and ending words of an utterance. For the duration features, we used the average and maximum vowel duration from forced align- ment, both unnormalized and normalized for vowel identity and phone context. For pitch and energy, we 376 calculated the minimum, maximum,E range, mean, standard deviation, skewnesSs and kurEtosis values. 
A decision tree model was used to compute posteriors from the prosodic features, and we used cumulative binning of the posteriors as the final features, similar to (Liu et al., 2006).

As illustrated in Section 2, we reformulated the (dis)agreement detection task as a sequence tagging problem. We used the Mallet package (McCallum, 2002) to implement the linear-chain CRF model for sequence tagging. A CRF is an undirected graphical model that defines a global log-linear distribution over the state (or label) sequence E conditioned on an observation sequence, in our case the sequence of sentences S and the corresponding sequence of features F for those sentences. The model is optimized globally over the entire sequence. The CRF model is trained to maximize the conditional log-likelihood P(E|S, F) over a given training set. During testing, the most likely sequence E is found using the Viterbi algorithm. One of the motivations for choosing conditional random fields was to avoid the label-bias problem found in hidden Markov models. Compared to Maximum Entropy modeling, the CRF model is optimized globally over the entire sequence, whereas the ME model makes a decision at each point individually without considering the contextual event information.

4 Experiments

All (dis)agreement detection results are based on n-fold cross-validation. In this procedure, we held out one show as the test set, randomly held out another show as the dev set, trained models on the rest of the data, and tested the model on the held-out show. We iterated through all shows and computed the overall accuracy. Table 1 shows the results of (dis)agreement detection using all features except prosodic features. We compared two conditions: (1) features extracted entirely from the automatic LUC annotations and automatically detected speaker roles, and (2) features from manual speaker role labels and manual LUC annotations when manual annotations are available. Table 1 shows that running a fully automatic system to generate automatic annotations and automatic speaker roles produced performance comparable to the system using features from manual annotations whenever available.

Table 1: Precision (%), recall (%), and F1 (%) of (dis)agreement detection using features extracted from manual speaker role labels and manual LUC annotations when available, denoted Manual Annotation, and automatic LUC annotations and automatically detected speaker roles, denoted Automatic Annotation. [The numeric cells of this table are garbled in the source text and cannot be reliably recovered.]

We then focused on the condition of using features from manual annotations when available and added prosodic features as described in Section 3. The results are shown in Table 2. Adding prosodic features produced a 0.7% absolute gain in F1 for agreement detection and a 1.5% absolute gain in F1 for disagreement detection.

Table 2: Precision (%), recall (%), and F1 (%) of (dis)agreement detection using manual annotations, without and with prosodic features. [The numeric cells of this table are garbled in the source text and cannot be reliably recovered.]

Note that only about 10% of the utterances in the data are involved in (dis)agreement. This is a highly imbalanced data set, as one class is much more heavily represented than the other.
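Before turning to the class-imbalance issue, here is a minimal sketch of the sequence-tagging setup described above. The paper used the Mallet toolkit; sklearn-crfsuite is used here only as a readily available stand-in, and the feature functions and label names are illustrative assumptions rather than the system's actual feature set.

```python
# Minimal linear-chain CRF tagging sketch over the utterances of one show.
import sklearn_crfsuite

def utterance_features(utt, prev_utt, next_utt):
    """Toy feature dict for one utterance; the real system used lexical,
    structural, durational, and prosodic features (Section 3)."""
    feats = {
        "has_negation": ("not" in utt.lower()) or ("n't" in utt.lower()),
        "starts_with_no": utt.lower().startswith("no"),
        "length": len(utt.split()),
    }
    if prev_utt is not None:
        feats["prev_ends_with_question"] = prev_utt.strip().endswith("?")
    if next_utt is not None:
        feats["next_starts_with_no"] = next_utt.lower().startswith("no")
    return feats

def show_to_sequence(utterances):
    padded = [None] + utterances + [None]
    return [utterance_features(padded[i], padded[i - 1], padded[i + 1])
            for i in range(1, len(padded) - 1)]

# X: one feature-dict sequence per show; y: one label sequence per show.
X_train = [show_to_sequence(["This is not black and white, is it?",
                             "No, it isn't."])]
y_train = [["INITIATE", "AGREE_RESPONSE"]]  # illustrative label names

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1,
                           max_iterations=100)
crf.fit(X_train, y_train)        # maximizes the conditional log-likelihood
print(crf.predict(X_train))      # Viterbi decoding of the most likely tags
```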
We suspected that this high imbalance played a major role in the high-precision, low-recall results we obtained so far. Various approaches have been studied to handle imbalanced data for classification, trying to balance the class distribution in the training set by either oversampling the minority class or downsampling the majority class. In this preliminary study of sampling approaches for handling imbalanced data in CRF training, we investigated two approaches, random downsampling and ensemble downsampling. Random downsampling randomly downsamples the majority class to equate the number of minority and majority class samples. Ensemble downsampling is a refinement of random downsampling that does not discard any majority class samples. Instead, we partitioned the majority class samples into N subspaces, with each subspace containing the same number of samples as the minority class. We then trained N CRF models, each based on the minority class samples and one disjoint partition from the N subspaces. During testing, the posterior probability for an utterance is averaged over the N CRF models.

The results from these two sampling approaches, as well as the baseline, are shown in Table 3. Both sampling approaches achieved significant improvement over the baseline, i.e., training on the original data set, and ensemble downsampling produced better performance than random downsampling. We noticed that both sampling approaches degraded slightly in precision but improved significantly in recall, resulting in a 4.5% absolute gain in F1 for agreement detection and a 4.7% absolute gain in F1 for disagreement detection.

Table 3: Precision (%), recall (%), and F1 (%) of (dis)agreement detection without sampling, with random downsampling, and with ensemble downsampling. Manual annotations and prosodic features are used. [The numeric cells of this table are garbled in the source text and cannot be reliably recovered.]

In conclusion, this paper presents our work on detection of agreements and disagreements in English broadcast conversation data. We explored a variety of features, including lexical, structural, durational, and prosodic features. We experimented with these features using a linear-chain conditional random fields model and conducted supervised training. We observed significant improvement from adding prosodic features and from employing two sampling approaches, random downsampling and ensemble downsampling. Overall, we achieved 79.2% precision, 50.5% recall, and 61.7% F1 for agreement detection, and 69.2% precision, 46.9% recall, and 55.9% F1 for disagreement detection, on English broadcast conversation data. In future work, we plan to continue adding and refining features, explore dependencies between features and contextual cues with respect to agreements and disagreements, and investigate the efficacy of other machine learning approaches such as Bayesian networks and Support Vector Machines.

Acknowledgments

The authors thank Gokhan Tur and Dilek Hakkani-Tür for valuable insights and suggestions. This work has been supported by the Intelligence Advanced Research Projects Activity (IARPA) via Army Research Laboratory (ARL) contract number W911NF-09-C-0089. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.
The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, ARL, or the U.S. Government.

References

M. Galley, K. McKeown, J. Hirschberg, and E. Shriberg. 2004. Identifying agreement and disagreement in conversational speech: Use of Bayesian networks to model pragmatic dependencies. In Proceedings of ACL.

S. Germesin and T. Wilson. 2009. Agreement detection in multiparty conversation. In Proceedings of the International Conference on Multimodal Interfaces.

S. Hahn, R. Ladner, and M. Ostendorf. 2006. Agreement/disagreement classification: Exploiting unlabeled data using constraint classifiers. In Proceedings of HLT/NAACL.

D. Hillard, M. Ostendorf, and E. Shriberg. 2003. Detection of agreement vs. disagreement in meetings: Training with unlabeled data. In Proceedings of HLT/NAACL.

A. Janin, D. Baron, J. Edwards, D. Ellis, D. Gelbart, N. Morgan, B. Peskin, T. Pfau, E. Shriberg, A. Stolcke, and C. Wooters. 2003. The ICSI Meeting Corpus. In Proceedings of ICASSP, Hong Kong, April.

Yang Liu, Elizabeth Shriberg, Andreas Stolcke, Dustin Hillard, Mari Ostendorf, and Mary Harper. 2006. Enriching speech recognition with automatic detection of sentence boundaries and disfluencies. IEEE Transactions on Audio, Speech, and Language Processing, 14(5):1526–1540, September. Special Issue on Progress in Rich Transcription.

Andrew McCallum. 2002. MALLET: A Machine Learning for Language Toolkit. http://mallet.cs.umass.edu.

I. McCowan, J. Carletta, W. Kraaij, S. Ashby, S. Bourban, M. Flynn, M. Guillemot, T. Hain, J. Kadlec, V. Karaiskos, M. Kronenthal, G. Lathoud, M. Lincoln, A. Lisowska, W. Post, D. Reidsma, and P. Wellner. 2005. The AMI Meeting Corpus. In Proceedings of Measuring Behavior 2005, the 5th International Conference on Methods and Techniques in Behavioral Research.
5 0.65101182 223 acl-2011-Modeling Wisdom of Crowds Using Latent Mixture of Discriminative Experts
Author: Derya Ozkan ; Louis-Philippe Morency
Abstract: In many computational linguistic scenarios, training labels are subjective, making it necessary to acquire the opinions of multiple annotators/experts, which is referred to as "wisdom of crowds". In this paper, we propose a new approach for modeling wisdom of crowds based on the Latent Mixture of Discriminative Experts (LMDE) model that can automatically learn the prototypical patterns and hidden dynamic among different experts. Experiments show improvement over state-of-the-art approaches on the task of listener backchannel prediction in dyadic conversations.
6 0.64532071 312 acl-2011-Turn-Taking Cues in a Human Tutoring Corpus
7 0.62002951 118 acl-2011-Entrainment in Speech Preceding Backchannels.
9 0.40972584 301 acl-2011-The impact of language models and loss functions on repair disfluency detection
10 0.38777611 249 acl-2011-Predicting Relative Prominence in Noun-Noun Compounds
11 0.34599444 26 acl-2011-A Speech-based Just-in-Time Retrieval System using Semantic Search
12 0.34375489 306 acl-2011-Towards Style Transformation from Written-Style to Audio-Style
13 0.33946002 203 acl-2011-Learning Sub-Word Units for Open Vocabulary Speech Recognition
14 0.29524317 338 acl-2011-Wikulu: An Extensible Architecture for Integrating Natural Language Processing Techniques with Wikis
15 0.2882188 272 acl-2011-Semantic Information and Derivation Rules for Robust Dialogue Act Detection in a Spoken Dialogue System
16 0.25199243 195 acl-2011-Language of Vandalism: Improving Wikipedia Vandalism Detection via Stylometric Analysis
17 0.24481834 260 acl-2011-Recognizing Authority in Dialogue with an Integer Linear Programming Constrained Model
18 0.23597459 55 acl-2011-Automatically Predicting Peer-Review Helpfulness
19 0.22600514 21 acl-2011-A Pilot Study of Opinion Summarization in Conversations
20 0.21960688 321 acl-2011-Unsupervised Discovery of Rhyme Schemes
topicId topicWeight
[(5, 0.025), (17, 0.041), (26, 0.013), (37, 0.077), (39, 0.036), (41, 0.076), (44, 0.013), (55, 0.038), (59, 0.033), (61, 0.262), (72, 0.027), (91, 0.048), (92, 0.012), (96, 0.154), (98, 0.017)]
simIndex simValue paperId paperTitle
1 0.81971478 165 acl-2011-Improving Classification of Medical Assertions in Clinical Notes
Author: Youngjun Kim ; Ellen Riloff ; Stephane Meystre
Abstract: We present an NLP system that classifies the assertion type of medical problems in clinical notes used for the Fourth i2b2/VA Challenge. Our classifier uses a variety of linguistic features, including lexical, syntactic, lexicosyntactic, and contextual features. To overcome an extremely unbalanced distribution of assertion types in the data set, we focused our efforts on adding features specifically to improve the performance of minority classes. As a result, our system reached 94.17% micro-averaged and 79.76% macro-averaged F1-measures, and showed substantial recall gains on the minority classes.
2 0.80894232 271 acl-2011-Search in the Lost Sense of "Query": Question Formulation in Web Search Queries and its Temporal Changes
Author: Bo Pang ; Ravi Kumar
Abstract: Web search is an information-seeking activity. Oftentimes, this amounts to a user seeking answers to a question. However, queries, which encode the user's information need, are typically not expressed as full-length natural language sentences, in particular as questions. Rather, they consist of one or more text fragments. As humans become more search-engine-savvy, do natural-language questions still have a role to play in web search? Through a systematic, large-scale study, we find to our surprise that as time goes by, web users are more likely to use questions to express their search intent.
same-paper 3 0.75456131 228 acl-2011-N-Best Rescoring Based on Pitch-accent Patterns
Author: Je Hun Jeon ; Wen Wang ; Yang Liu
Abstract: In this paper, we adopt an n-best rescoring scheme using pitch-accent patterns to improve automatic speech recognition (ASR) performance. The pitch-accent model is decoupled from the main ASR system, thus allowing us to develop it independently. N-best hypotheses from recognizers are rescored by additional scores that measure the correlation of the pitch-accent patterns between the acoustic signal and lexical cues. To test the robustness of our algorithm, we use two different data sets and recognition setups: the first one is English radio news data that has pitch accent labels, but the recognizer is trained from a small amount ofdata and has high error rate; the second one is English broadcast news data using a state-of-the-art SRI recognizer. Our experimental results demonstrate that our approach is able to reduce word error rate relatively by about 3%. This gain is consistent across the two different tests, showing promising future directions of incorporating prosodic information to improve speech recognition.
4 0.71679425 246 acl-2011-Piggyback: Using Search Engines for Robust Cross-Domain Named Entity Recognition
Author: Stefan Rud ; Massimiliano Ciaramita ; Jens Muller ; Hinrich Schutze
Abstract: We use search engine results to address a particularly difficult cross-domain language processing task, the adaptation of named entity recognition (NER) from news text to web queries. The key novelty of the method is that we submit a token with context to a search engine and use similar contexts in the search results as additional information for correctly classifying the token. We achieve strong gains in NER performance on news, in-domain and out-of-domain, and on web queries.
5 0.70110488 147 acl-2011-Grammatical Error Correction with Alternating Structure Optimization
Author: Daniel Dahlmeier ; Hwee Tou Ng
Abstract: We present a novel approach to grammatical error correction based on Alternating Structure Optimization. As part of our work, we introduce the NUS Corpus of Learner English (NUCLE), a fully annotated one million words corpus of learner English available for research purposes. We conduct an extensive evaluation for article and preposition errors using various feature sets. Our experiments show that our approach outperforms two baselines trained on non-learner text and learner text, respectively. Our approach also outperforms two commercial grammar checking software packages.
6 0.6135354 145 acl-2011-Good Seed Makes a Good Crop: Accelerating Active Learning Using Language Modeling
7 0.61299223 324 acl-2011-Unsupervised Semantic Role Induction via Split-Merge Clustering
8 0.61246228 196 acl-2011-Large-Scale Cross-Document Coreference Using Distributed Inference and Hierarchical Models
9 0.61175001 135 acl-2011-Faster and Smaller N-Gram Language Models
10 0.61158919 36 acl-2011-An Efficient Indexer for Large N-Gram Corpora
11 0.61053932 58 acl-2011-Beam-Width Prediction for Efficient Context-Free Parsing
12 0.61041474 65 acl-2011-Can Document Selection Help Semi-supervised Learning? A Case Study On Event Extraction
13 0.60994053 190 acl-2011-Knowledge-Based Weak Supervision for Information Extraction of Overlapping Relations
14 0.6098851 3 acl-2011-A Bayesian Model for Unsupervised Semantic Parsing
15 0.60978127 12 acl-2011-A Generative Entity-Mention Model for Linking Entities with Knowledge Base
16 0.6085031 15 acl-2011-A Hierarchical Pitman-Yor Process HMM for Unsupervised Part of Speech Induction
17 0.60784018 137 acl-2011-Fine-Grained Class Label Markup of Search Queries
18 0.60674614 244 acl-2011-Peeling Back the Layers: Detecting Event Role Fillers in Secondary Contexts
19 0.60647368 327 acl-2011-Using Bilingual Parallel Corpora for Cross-Lingual Textual Entailment
20 0.6059767 33 acl-2011-An Affect-Enriched Dialogue Act Classification Model for Task-Oriented Dialogue