acl acl2011 acl2011-257 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Anna Margolis ; Mari Ostendorf
Abstract: We investigate the use of textual Internet conversations for detecting questions in spoken conversations. We compare the text-trained model with models trained on manually labeled, domain-matched spoken utterances with and without prosodic features. Overall, the text-trained model achieves over 90% of the performance (measured in Area Under the Curve) of the domain-matched model including prosodic features, but does especially poorly on declarative questions. We describe efforts to utilize unlabeled spoken utterances and prosodic features via domain adaptation.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We investigate the use of textual Internet conversations for detecting questions in spoken conversations. [sent-3, score-0.189]
2 We compare the text-trained model with models trained on manually labeled, domain-matched spoken utterances with and without prosodic features. [sent-4, score-0.675]
3 Overall, the text-trained model achieves over 90% of the performance (measured in Area Under the Curve) of the domain-matched model including prosodic features, but does especially poorly on declarative questions. [sent-5, score-0.531]
4 We describe efforts to utilize unlabeled spoken utterances and prosodic features via domain adaptation. [sent-6, score-0.85]
5 1 Introduction Automatic speech recognition systems, which transcribe words, are often augmented by subsequent processing for inserting punctuation or labeling speech acts. [sent-7, score-0.352]
6 Both prosodic features (extracted from the acoustic signal) and lexical features (extracted from the word sequence) have been shown to be useful for these tasks (Shriberg et al. [sent-8, score-0.629]
7 However, access to labeled speech training data is generally required in order to use prosodic features. [sent-11, score-0.565]
8 On the other hand, the Internet contains large quantities of textual data that is already labeled with punctuation, and which can be used to train a system using lexical features. [sent-12, score-0.087]
9 In this work, we focus on question detection in the Meeting Recorder Dialog Act corpus (MRDA) (Shriberg et al. [sent-13, score-0.185]
10 , 2004), using text sentences with question marks in Wikipedia “talk” pages. [sent-14, score-0.105]
11 We compare the performance of a question detector trained on the text domain using lexical features with one trained on MRDA using lexical features and/or prosodic features. [sent-15, score-0.767]
12 In addition, we experiment with two unsupervised domain adaptation methods to incorporate unlabeled MRDA utterances into the text-based question detector. [sent-16, score-0.578]
13 The goal is to use the unlabeled domain-matched data to bridge stylistic differences as well as to incorporate the prosodic features, which are unavailable in the labeled text data. [sent-17, score-0.537]
14 2 Related Work Question detection can be viewed as a subtask of speech act or dialogue act tagging, which aims to label functions of utterances in conversations, with categories such as question/statement/backchannel, or more specific categories such as request or command (e. [sent-18, score-0.842]
15 (2000) showed that prosodic features were useful for question detection in English conversational speech, but (at least in the absence of recognition errors) most of the performance was achieved with words alone. [sent-24, score-0.777]
16 There has been some previous investigation of domain adaptation for dialogue act classification, including adaptation between: different speech corpora (MRDA and Switchboard) (Guz et al. [sent-25, score-0.683]
17 , 2010), speech corpora in different languages (Margolis et al. [sent-26, score-0.105]
18 , 2010), and from a speech domain (MRDA/Switchboard) to text domains (emails and forums) (Jeong et al. [sent-27, score-0.144]
19 These works did not use prosodic features, although Venkataraman et al. [sent-29, score-0.417]
20 (2003) included prosodic features in a semi-supervised learning approach for dialogue act labeling within a single spoken domain. [sent-32, score-0.763]
21 (2011), who compared question types in different Portuguese corpora, including text and speech. [sent-34, score-0.105]
22 For question detection on speech, they compared performance of a lexical model trained with newspaper text to models trained with speech including acoustic and prosodic features, where the speech-trained model also utilized the text-based model predictions as a feature. [sent-35, score-0.777]
23 They reported that the lexical model mainly identified wh questions, while the speech data helped identify yes-no and tag questions, although results for specific categories were not included. [sent-36, score-0.184]
24 Question detection is related to the task of automatic punctuation annotation, for which the contributions of lexical and prosodic features have been explored in other works, e. [sent-37, score-0.702]
25 (2006) used auxiliary text corpora to train lexical models for punctuation annotation or sentence segmentation, which were used along with speech-trained prosodic models; the text corpora consisted of broadcast news or telephone conversation transcripts. [sent-42, score-0.621]
26 (2009) used lexical models built from web news articles on broadcast news speech, and compared their performance on written news; Shen et al. [sent-44, score-0.108]
27 (2009) trained models on an online encyclopedia, for punctuation annotation of news podcasts. [sent-45, score-0.16]
28 Web text was also used in a domain adaptation strategy for prosodic phrase prediction in news text (Chen et al. [sent-46, score-0.641]
29 In our work, we focus on spontaneous conversational speech, and utilize a web text source that is somewhat matched in style: both domains consist of goal-directed multi-party conversations. [sent-48, score-0.101]
30 We focus specifically on question detection in pre-segmented utterances. [sent-49, score-0.185]
31 This differs from punctuation annotation or segmentation, which is usually seen as a sequence tagging or classification task at word boundaries, and uses mostly local features. [sent-50, score-0.128]
32 Our focus also allows us to clearly analyze the performance on different question types, in isolation from segmentation issues. [sent-51, score-0.105]
33 We compare performance of textual- and speech-trained lexical models, and examine the detection accuracy of each question type. [sent-52, score-0.229]
34 Finally, we compare two domain adaptation approaches to utilize unlabeled speech data: bootstrapping, and Blitzer et al.'s (2006) structural correspondence learning (SCL). [sent-53, score-0.374]
35 SCL is a feature-learning method that uses unlabeled data from both domains. [sent-56, score-0.077]
36 Although it has been applied to several NLP tasks, to our knowledge we are the first to apply SCL to both lexical and prosodic features in order to adapt from text to speech. [sent-57, score-0.52]
37 3.1 Data The Wiki talk pages consist of threaded posts by different authors about a particular Wikipedia entry. [sent-59, score-0.085]
38 While these lack certain properties of spontaneous speech (such as backchannels, disfluencies, and interruptions), they are more conversational than news articles, containing utterances such as: “Are you serious?” [sent-60, score-0.442]
39 We first cleaned the posts (to remove URLs, images, signatures, Wiki markup, and duplicate posts) and then performed automatic segmentation of the posts into sentences using MXTERMINATOR (Reynar and Ratnaparkhi, 1997). [sent-62, score-0.159]
40 We labeled each sentence ending in a question mark (followed optionally by other punctuation) as a question; we also included parentheticals ending in question marks. [sent-63, score-0.253]
41 We then removed all punctuation and capitalization from the resulting sentences and performed some additional text normalization to match the MRDA transcripts, such as number and date expansion. [sent-65, score-0.127]
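As an illustration of this labeling and normalization step, a minimal Python sketch (the helper below is hypothetical; the MXTERMINATOR segmentation step and the full number/date expansion are not reproduced):

```python
import re

def label_and_normalize(sentence):
    """Label a Wiki talk sentence as a question by its final punctuation,
    then strip punctuation and casing to match speech-transcript style."""
    # A sentence counts as a question if it ends in '?', optionally
    # followed by other punctuation (e.g. '?)' or '?!').
    is_question = re.search(r"\?\W*$", sentence.strip()) is not None
    text = sentence.lower()
    text = re.sub(r"[^a-z0-9' ]+", " ", text)   # drop punctuation and markup remnants
    text = re.sub(r"\s+", " ", text).strip()    # collapse whitespace
    return text, is_question

print(label_and_normalize("Are you serious?"))   # ('are you serious', True)
```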
42 For the MRDA corpus, we use the manually transcribed sentences with utterance time alignments. [sent-66, score-0.096]
43 The corpus has been hand-annotated with detailed dialogue act tags, using a hierarchical labeling scheme in which each utterance receives one “general” label plus a variable number of “specific” labels (Dhillon et al. [sent-67, score-0.329]
44 In this work we are only looking at the problem of discriminating questions from non-questions; we consider as questions all complete utterances labeled with one of the general labels wh, yes-no, open-ended, or, or-after-yes-no, or rhetorical question. [sent-69, score-0.419]
45 (To derive the question categories below, we also consider the specific labels tag and declarative, which are appended to one of the general labels.) [sent-70, score-0.14]
46 All remaining utterances, including backchannels and incomplete questions, are considered non-questions, although we removed utterances that are very short (less than 200 ms), have no transcribed words, or are missing segmentation times or a dialogue act label. [sent-71, score-0.548]
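For concreteness, a small sketch of such a label mapping (the tag strings below are illustrative shorthand rather than the literal MRDA tag inventory):

```python
QUESTION_GENERAL = {"wh", "yes-no", "open-ended", "or", "or-after-yes-no", "rhetorical"}

def question_category(general_label, specific_labels):
    """Map an utterance's hierarchical dialogue act labels to the categories
    used for analysis: None for non-questions, otherwise 'tag', 'declarative',
    or the general question label itself."""
    if general_label not in QUESTION_GENERAL:
        return None                        # statements, backchannels, incomplete questions, ...
    if general_label == "yes-no" and "tag" in specific_labels:
        return "tag"                       # tag: yes-no question with the additional tag label
    if "declarative" in specific_labels:
        return "declarative"               # declarative question that is not a tag question
    return general_label                   # remaining categories (wh, yes-no, or, ...)

print(question_category("yes-no", {"tag"}))        # 'tag'
print(question_category("wh", {"declarative"}))    # 'declarative'
print(question_category("statement", set()))       # None
```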
47 For the adaptation experiments, we used the full MRDA training set of 72k utterances as unlabeled adaptation data. [sent-77, score-0.587]
48 3.2 Features and Classifier Lexical features consisted of unigrams through trigrams including start- and end-utterance tags, represented as binary features (presence/absence), plus a total-number-of-words feature. [sent-81, score-0.118]
49 All ngram features were required to occur at least twice in the training set. [sent-82, score-0.104]
50 The MRDA training set contained on the order of 65k ngram features while the Wiki training set contained over 205k. [sent-83, score-0.104]
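A sketch of this n-gram feature extraction over already-normalized, whitespace-tokenized utterances (a simplified reading of the setup; function names are illustrative):

```python
from collections import Counter

def ngrams(tokens, n_max=3):
    """Unigrams through trigrams, with start- and end-utterance tags."""
    padded = ["<s>"] + tokens + ["</s>"]
    for n in range(1, n_max + 1):
        for i in range(len(padded) - n + 1):
            yield " ".join(padded[i:i + n])

def build_vocab(train_utterances, min_count=2):
    """Keep n-grams occurring at least `min_count` times in the training set."""
    counts = Counter(g for u in train_utterances for g in ngrams(u.split()))
    return {g for g, c in counts.items() if c >= min_count}

def featurize(utterance, vocab):
    tokens = utterance.split()
    feats = {g: 1.0 for g in ngrams(tokens) if g in vocab}   # binary presence/absence
    feats["num_words"] = float(len(tokens))                  # total-number-of-words feature
    return feats
```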
51 (2009) showed no clear benefit of these features for question detection on MRDA beyond the ngram features. [sent-85, score-0.289]
52 We extracted 16 prosody features from the speech waveforms defined by the given utterance times, using stylized F0 contours computed based on Sönmez et al. [sent-86, score-0.399]
53 The features are designed to be useful for detecting questions and are similar or identical to some of those in Boakye et al. [sent-88, score-0.145]
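The 16 features themselves are not listed in this summary; purely as an illustration, a few question-oriented statistics one might compute with NumPy from a pre-extracted, stylized F0 contour (this is not the paper's exact feature set):

```python
import numpy as np

def f0_question_cues(f0, frame_rate=100, tail_ms=200):
    """Illustrative pitch statistics over an utterance-level F0 contour
    (Hz per frame, 0 for unvoiced frames); final-rise cues are classic
    correlates of yes-no and declarative questions."""
    voiced = f0[f0 > 0]
    if voiced.size == 0:
        return np.zeros(5)                      # mirrors the zero-fill for missing prosody
    tail = f0[-int(tail_ms * frame_rate / 1000):]
    tail_voiced = tail[tail > 0]
    slope = (np.polyfit(np.arange(tail_voiced.size), tail_voiced, 1)[0]
             if tail_voiced.size > 1 else 0.0)  # F0 slope over the final region
    return np.array([
        voiced.mean(),                          # overall mean F0
        voiced.max() - voiced.min(),            # F0 range
        slope,
        tail_voiced.mean() / voiced.mean() if tail_voiced.size else 0.0,
        voiced.size / f0.size,                  # voicing fraction
    ])
```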
54 Prosodic and lexical features were combined by concatenation into a single feature vector; prosodic features and the number-of-words were z-normalized to place them roughly on the same scale as the binary ngram features. [sent-95, score-0.624]
55 (We substituted 0 for missing prosody features due to, e.g., [sent-96, score-0.198]
56 no voiced frames detected, segmentation errors, or the utterance being too short.) [sent-98, score-0.197]
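A sketch of this combination step (the normalization statistics come from the training set; the exact normalization and zero-fill details are assumptions):

```python
import numpy as np

def combine_features(ngram_vec, num_words, prosody_vec,
                     pros_mean, pros_std, words_mean, words_std):
    """Concatenate binary n-gram indicators with z-normalized prosodic
    features and a z-normalized word-count feature; missing prosody
    (e.g. no voiced frames) is replaced by zeros."""
    if prosody_vec is None:
        pros = np.zeros_like(pros_mean)
    else:
        pros = (prosody_vec - pros_mean) / pros_std
    words = (num_words - words_mean) / words_std
    return np.concatenate([ngram_vec, pros, [words]])
```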
57 Our setup is similar to that of Surendran and Levow (2006), who combined ngram and prosodic features for dialogue act classification using a linear SVM. [sent-99, score-0.754]
58 Since ours is a detection problem, with questions much less frequent than non-questions, we present results in terms of ROC curves, which were computed from the probability scores of the classifier. [sent-100, score-0.166]
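As an illustration of this evaluation (a scikit-learn logistic regression and synthetic data stand in for the actual classifier and features; the 16.5% operating point below is only a placeholder for the fixed false-positive rate used in the per-type analysis):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for sparse, imbalanced question-detection data.
X, y = make_classification(n_samples=5000, n_features=200, weights=[0.87], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]                 # probability of the question class
fpr, tpr, _ = roc_curve(y_te, scores)

print("AUC:", roc_auc_score(y_te, scores))
target_fpr = 0.165                                     # illustrative fixed false-positive rate
print("Detection rate at that FPR:", tpr[np.searchsorted(fpr, target_fpr)])
```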
59 3.3 Baseline Results Figure 1 shows the ROC curves for the baseline Wiki-trained lexical system and the MRDA-trained systems with different feature sets. [sent-105, score-0.083]
60 Table 2 compares performance across different question categories at a fixed false positive rate (16. [sent-106, score-0.176]
61 For analysis purposes we defined the categories in Table 2 as follows: tag includes any yes-no question given the additional tag label; declarative includes any question category given the declarative label that is not a tag question; the remaining categories (yes-no, or, etc.) [sent-108, score-0.508]
62 include utterances in those categories that are not included in declarative or tag. [sent-109, score-0.353]
63 For the MRDA-trained system, prosody alone does best on yes-no and declarative. [sent-112, score-0.139]
64 Along with lexical features, prosody is more useful for declarative, while it appears to be somewhat redundant with lexical features for yes-no. [sent-113, score-0.286]
65 We therefore investigate the use of unlabeled spoken utterances to incorporate prosodic features into the Wiki system, which may improve detection of some kinds of questions. [sent-116, score-0.734]
66 Figure 1: ROC curves with AUC values for question detection on MRDA; comparison between systems trained on MRDA using lexical and/or prosodic features, and Wiki talk pages using lexical features. [sent-117, score-0.763]
67 3.4 Adaptation Results For bootstrapping, we first train an initial baseline classifier using the Wiki training data, then use it to label MRDA data from the unlabeled adaptation set. [sent-119, score-0.256]
68 (Table 3: detection rates are given for each system (L=lexical) at the starred points in Figure 1; boldface indicates adaptation results better than baseline; italics indicate worse than baseline.) [sent-128, score-0.153]
69 In order to use prosodic features, which are available only in the bootstrapped MRDA data, we simply add 16 zeros onto the Wiki examples in place of the missing prosodic features. [sent-129, score-0.417]
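A rough sketch of this bootstrapping procedure (the confidence threshold and the logistic-regression classifier are assumptions; the exact selection criteria are not given here):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def bootstrap_adapt(X_wiki_lex, y_wiki, X_mrda_lex, X_mrda_pros, n_pros=16, conf=0.9):
    """Train on labeled Wiki text (lexical only), self-label the unlabeled MRDA
    adaptation set, then retrain on Wiki plus confidently labeled MRDA examples.
    Wiki rows get zeros in place of the missing prosodic features."""
    base = LogisticRegression(max_iter=1000).fit(X_wiki_lex, y_wiki)
    probs = base.predict_proba(X_mrda_lex)[:, 1]
    keep = (probs > conf) | (probs < 1.0 - conf)        # assumed confidence filter
    pseudo_y = (probs[keep] > 0.5).astype(int)

    X_wiki = np.hstack([X_wiki_lex, np.zeros((X_wiki_lex.shape[0], n_pros))])
    X_mrda = np.hstack([X_mrda_lex[keep], X_mrda_pros[keep]])
    return LogisticRegression(max_iter=1000).fit(
        np.vstack([X_wiki, X_mrda]), np.concatenate([y_wiki, pseudo_y]))
```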
71 SCL (Blitzer et al., 2006) uses the unlabeled target data to learn domain-independent features. [sent-132, score-0.077]
72 In particular, if x is a row vector representing the original feature vector and yi represents the label for auxiliary task i, the linear predictor wi is learned to predict ŷi = wi · x′ (where x′ is a modified version of x that excludes any features completely predictive of yi). The learned predictors wi are stacked into a matrix whose truncated singular value decomposition yields a low-dimensional projection of the features, which is appended to the original feature vector. [sent-135, score-0.138]
73 Ideally, features that behave similarly across many yi will be represented in the same singular vector; thus, the auxiliary tasks can tie together features which may never occur together in the same example. [sent-138, score-0.25]
74 As auxiliary tasks yi, we identify all initial words that begin an utterance at least 5 times in each domain’s training set, and predict the presence of each initial word (yi = 0 or 1). [sent-141, score-0.219]
75 The idea of using the initial words is that they may be related to the interrogative status of an utterance— utterances starting with “do” or “what” are more often questions, while those starting with “i” are usually not. [sent-142, score-0.23]
76 The prediction features x′ used in SCL include all ngrams occurring at least 5 times in the unlabeled Wiki or MRDA data, except those over the first word, as well as prosody features (which are zero in the Wiki data). [sent-144, score-0.334]
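A compact sketch of this feature learning (ridge regression stands in for the pivot predictors, and the pivot selection, masking, and projection dimensionality below are simplifications of the setup described above):

```python
import numpy as np
from sklearn.linear_model import Ridge

def scl_projection(X_unlab, first_words, first_word_cols, pivot_words, k=50):
    """Learn an SCL projection from unlabeled Wiki+MRDA rows X_unlab (dense).
    Each auxiliary task predicts whether an utterance starts with a pivot
    word, using features x' that exclude the first-word indicators."""
    X_masked = X_unlab.copy()
    X_masked[:, first_word_cols] = 0.0              # x': drop features over the first word
    W = []
    for w in pivot_words:
        y = (first_words == w).astype(float)        # auxiliary label y_i
        W.append(Ridge(alpha=1.0).fit(X_masked, y).coef_)
    # Stack the pivot predictors and keep the top-k left singular vectors
    # as a shared, domain-independent feature subspace.
    U, _, _ = np.linalg.svd(np.vstack(W).T, full_matrices=False)
    return U[:, :k]                                 # projection theta (n_features x k)

# At training and test time, X @ theta is appended to the original
# lexical+prosodic feature matrix before training the classifier.
```

Here first_words is assumed to be a NumPy array giving each unlabeled utterance's initial word.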
77 Table 3 shows results by question type at the fixed false positive point chosen for analysis. [sent-147, score-0.141]
78 At this point, both adaptation methods improved detection of declarative and yes-no questions, although they decreased detection of several other types. [sent-148, score-0.427]
79 Note that we also experimented with other adaptation approaches on the dev set: bootstrapping without the prosodic features did not lead to an improvement, nor did training on Wiki using “fake” prosody features predicted based on MRDA examples. [sent-149, score-0.915]
80 We also tried a co-training approach using separate prosodic and lexical classifiers, inspired by the work of Guz et al. [sent-150, score-0.461]
81 Since we tuned and selected adaptation methods on the MRDA dev set, we compare to training with the labeled MRDA dev (with prosodic features) and Wiki data together. [sent-152, score-0.715]
82 This gives superior results compared to adaptation; but note that the adaptation process did not use labeled MRDA data for training, only for model selection. [sent-153, score-0.196]
83 Analysis of the adapted systems suggests prosody features are being utilized to improve performance in both methods, but clearly the effect is small, and the need to tune parameters would present a challenge if no labeled speech data were available. [sent-154, score-0.346]
84 4 Conclusion This work explored the use of conversational web text to detect questions in conversational speech. [sent-156, score-0.238]
85 We found that the web-text-trained model does especially poorly on declarative questions, which can potentially be improved using prosodic features. [sent-157, score-0.531]
86 Unsupervised adaptation methods utilizing unlabeled speech and a small labeled development set are shown to improve performance slightly, although training with the small development set leads to bigger gains. [sent-158, score-0.378]
87 Our work suggests approaches for combining large amounts of “naturally” annotated web text with unannotated speech data, which could be useful in other spoken language processing tasks, e. [sent-159, score-0.159]
88 Automatic dialog act segmentation and classification in multiparty meetings. [sent-165, score-0.309]
89 Improving prosodic phrase prediction by unsupervised adaptation and syntactic features extraction. [sent-182, score-0.629]
90 Co-training using prosodic and lexical information for sentence segmentation. [sent-214, score-0.461]
91 Cascaded model adaptation for dialog act segmentation and tagging. [sent-219, score-0.462]
92 A combined punctuation generation and speech recognition system and its performance enhancement using prosody. [sent-236, score-0.247]
93 Modeling lexical tones for Mandarin large vocabulary continuous speech recognition. [sent-240, score-0.149]
94 Enriching speech recognition with automatic detection of sentence boundaries and disfluencies. [sent-246, score-0.225]
95 Domain adaptation with unlabeled data for dialog act tagging. [sent-251, score-0.482]
96 Can prosody aid the automatic classification of dialog acts in conversational speech? [sent-275, score-0.317]
97 The ICSI meeting recorder dialog act (MRDA) corpus. [sent-279, score-0.318]
98 Dialogue act modeling for automatic tagging and recognition of conversational speech. [sent-290, score-0.266]
99 Dialog act tagging with support vector machines and hidden Markov models. [sent-294, score-0.15]
100 Training a prosody-based dialog act tagger from unlabeled data. [sent-299, score-0.329]
wordName wordTfidf (topN-words)
[('mrda', 0.572), ('prosodic', 0.417), ('wiki', 0.219), ('utterances', 0.204), ('adaptation', 0.153), ('act', 0.15), ('prosody', 0.139), ('shriberg', 0.127), ('declarative', 0.114), ('elizabeth', 0.11), ('scl', 0.107), ('question', 0.105), ('speech', 0.105), ('dialog', 0.102), ('punctuation', 0.102), ('utterance', 0.096), ('boakye', 0.088), ('guz', 0.088), ('questions', 0.086), ('dialogue', 0.083), ('detection', 0.08), ('dhillon', 0.078), ('unlabeled', 0.077), ('conversational', 0.076), ('margolis', 0.066), ('recorder', 0.066), ('roc', 0.064), ('features', 0.059), ('segmentation', 0.057), ('spoken', 0.054), ('dev', 0.051), ('posts', 0.051), ('conversations', 0.049), ('auc', 0.048), ('auxiliary', 0.047), ('ngram', 0.045), ('hannah', 0.044), ('moniz', 0.044), ('sonali', 0.044), ('surendran', 0.044), ('umit', 0.044), ('voiced', 0.044), ('lexical', 0.044), ('labeled', 0.043), ('blitzer', 0.043), ('ang', 0.041), ('stolcke', 0.041), ('recognition', 0.04), ('curves', 0.039), ('woodland', 0.039), ('ebastien', 0.039), ('onmez', 0.039), ('christensen', 0.039), ('domain', 0.039), ('dilek', 0.038), ('bootstrapping', 0.037), ('mari', 0.036), ('acoustics', 0.036), ('false', 0.036), ('starred', 0.036), ('coccaro', 0.036), ('reynar', 0.036), ('gokhan', 0.036), ('icsi', 0.036), ('marie', 0.036), ('jeremy', 0.036), ('categories', 0.035), ('signal', 0.034), ('talk', 0.034), ('electrical', 0.034), ('bates', 0.034), ('jeong', 0.034), ('venkataraman', 0.034), ('news', 0.032), ('ries', 0.032), ('carol', 0.032), ('yi', 0.032), ('andreas', 0.031), ('meetings', 0.03), ('ostendorf', 0.03), ('backchannels', 0.029), ('singular', 0.029), ('emails', 0.028), ('bhagat', 0.028), ('liblinear', 0.027), ('anna', 0.026), ('acoustic', 0.026), ('rachel', 0.026), ('initial', 0.026), ('annotation', 0.026), ('kim', 0.025), ('gravano', 0.025), ('boldface', 0.025), ('klaus', 0.025), ('spontaneous', 0.025), ('capitalization', 0.025), ('transcribed', 0.025), ('regression', 0.024), ('martin', 0.024), ('tasks', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999958 257 acl-2011-Question Detection in Spoken Conversations Using Textual Conversations
Author: Anna Margolis ; Mari Ostendorf
Abstract: We investigate the use of textual Internet conversations for detecting questions in spoken conversations. We compare the text-trained model with models trained on manuallylabeled, domain-matched spoken utterances with and without prosodic features. Overall, the text-trained model achieves over 90% of the performance (measured in Area Under the Curve) of the domain-matched model including prosodic features, but does especially poorly on declarative questions. We describe efforts to utilize unlabeled spoken utterances and prosodic features via domain adaptation.
2 0.36216009 228 acl-2011-N-Best Rescoring Based on Pitch-accent Patterns
Author: Je Hun Jeon ; Wen Wang ; Yang Liu
Abstract: In this paper, we adopt an n-best rescoring scheme using pitch-accent patterns to improve automatic speech recognition (ASR) performance. The pitch-accent model is decoupled from the main ASR system, thus allowing us to develop it independently. N-best hypotheses from recognizers are rescored by additional scores that measure the correlation of the pitch-accent patterns between the acoustic signal and lexical cues. To test the robustness of our algorithm, we use two different data sets and recognition setups: the first one is English radio news data that has pitch accent labels, but the recognizer is trained from a small amount ofdata and has high error rate; the second one is English broadcast news data using a state-of-the-art SRI recognizer. Our experimental results demonstrate that our approach is able to reduce word error rate relatively by about 3%. This gain is consistent across the two different tests, showing promising future directions of incorporating prosodic information to improve speech recognition.
3 0.23491667 95 acl-2011-Detection of Agreement and Disagreement in Broadcast Conversations
Author: Wen Wang ; Sibel Yaman ; Kristin Precoda ; Colleen Richey ; Geoffrey Raymond
Abstract: We present Conditional Random Fields based approaches for detecting agreement/disagreement between speakers in English broadcast conversation shows. We develop annotation approaches for a variety of linguistic phenomena. Various lexical, structural, durational, and prosodic features are explored. We compare the performance when using features extracted from automatically generated annotations against that when using human annotations. We investigate the efficacy of adding prosodic features on top of lexical, structural, and durational features. Since the training data is highly imbalanced, we explore two sampling approaches, random downsampling and ensemble downsampling. Overall, our approach achieves 79.2% (precision), 50.5% (recall), 61.7% (F1) for agreement detection and 69.2% (precision), 46.9% (recall), and 55.9% (F1) for disagreement detection, on the English broadcast conversation data.
Author: Fabrizio Morbini ; Kenji Sagae
Abstract: Individual utterances often serve multiple communicative purposes in dialogue. We present a data-driven approach for identification of multiple dialogue acts in single utterances in the context of dialogue systems with limited training data. Our approach results in significantly increased understanding of user intent, compared to two strong baselines.
5 0.13109289 33 acl-2011-An Affect-Enriched Dialogue Act Classification Model for Task-Oriented Dialogue
Author: Kristy Boyer ; Joseph Grafsgaard ; Eun Young Ha ; Robert Phillips ; James Lester
Abstract: Dialogue act classification is a central challenge for dialogue systems. Although the importance of emotion in human dialogue is widely recognized, most dialogue act classification models make limited or no use of affective channels in dialogue act classification. This paper presents a novel affect-enriched dialogue act classifier for task-oriented dialogue that models facial expressions of users, in particular, facial expressions related to confusion. The findings indicate that the affectenriched classifiers perform significantly better for distinguishing user requests for feedback and grounding dialogue acts within textual dialogue. The results point to ways in which dialogue systems can effectively leverage affective channels to improve dialogue act classification. 1
6 0.12724009 21 acl-2011-A Pilot Study of Opinion Summarization in Conversations
7 0.12565984 83 acl-2011-Contrasting Multi-Lingual Prosodic Cues to Predict Verbal Feedback for Rapport
8 0.12266383 338 acl-2011-Wikulu: An Extensible Architecture for Integrating Natural Language Processing Techniques with Wikis
9 0.11482646 312 acl-2011-Turn-Taking Cues in a Human Tutoring Corpus
10 0.10199284 272 acl-2011-Semantic Information and Derivation Rules for Robust Dialogue Act Detection in a Spoken Dialogue System
11 0.097190939 103 acl-2011-Domain Adaptation by Constraining Inter-Domain Variability of Latent Feature Representation
12 0.089877129 179 acl-2011-Is Machine Translation Ripe for Cross-Lingual Sentiment Classification?
13 0.084956288 169 acl-2011-Improving Question Recommendation by Exploiting Information Need
14 0.081470765 260 acl-2011-Recognizing Authority in Dialogue with an Integer Linear Programming Constrained Model
15 0.079718515 91 acl-2011-Data-oriented Monologue-to-Dialogue Generation
16 0.074389771 183 acl-2011-Joint Bilingual Sentiment Classification with Unlabeled Parallel Corpora
17 0.071337827 301 acl-2011-The impact of language models and loss functions on repair disfluency detection
18 0.066280581 332 acl-2011-Using Multiple Sources to Construct a Sentiment Sensitive Thesaurus for Cross-Domain Sentiment Classification
19 0.065547362 104 acl-2011-Domain Adaptation for Machine Translation by Mining Unseen Words
20 0.064328581 118 acl-2011-Entrainment in Speech Preceding Backchannels.
topicId topicWeight
[(0, 0.152), (1, 0.06), (2, -0.015), (3, 0.005), (4, -0.236), (5, 0.204), (6, 0.002), (7, -0.028), (8, 0.028), (9, 0.076), (10, 0.092), (11, -0.011), (12, 0.022), (13, 0.018), (14, 0.085), (15, 0.032), (16, -0.073), (17, -0.045), (18, 0.085), (19, -0.109), (20, 0.039), (21, -0.167), (22, -0.134), (23, 0.229), (24, 0.027), (25, 0.113), (26, 0.161), (27, -0.086), (28, -0.038), (29, 0.053), (30, -0.103), (31, -0.012), (32, 0.145), (33, -0.11), (34, 0.002), (35, -0.011), (36, 0.053), (37, -0.004), (38, -0.013), (39, -0.042), (40, -0.104), (41, -0.044), (42, -0.092), (43, 0.069), (44, -0.004), (45, -0.066), (46, -0.025), (47, -0.041), (48, 0.008), (49, -0.057)]
simIndex simValue paperId paperTitle
same-paper 1 0.9291262 257 acl-2011-Question Detection in Spoken Conversations Using Textual Conversations
Author: Anna Margolis ; Mari Ostendorf
Abstract: We investigate the use of textual Internet conversations for detecting questions in spoken conversations. We compare the text-trained model with models trained on manuallylabeled, domain-matched spoken utterances with and without prosodic features. Overall, the text-trained model achieves over 90% of the performance (measured in Area Under the Curve) of the domain-matched model including prosodic features, but does especially poorly on declarative questions. We describe efforts to utilize unlabeled spoken utterances and prosodic features via domain adaptation.
2 0.89487422 228 acl-2011-N-Best Rescoring Based on Pitch-accent Patterns
Author: Je Hun Jeon ; Wen Wang ; Yang Liu
Abstract: In this paper, we adopt an n-best rescoring scheme using pitch-accent patterns to improve automatic speech recognition (ASR) performance. The pitch-accent model is decoupled from the main ASR system, thus allowing us to develop it independently. N-best hypotheses from recognizers are rescored by additional scores that measure the correlation of the pitch-accent patterns between the acoustic signal and lexical cues. To test the robustness of our algorithm, we use two different data sets and recognition setups: the first one is English radio news data that has pitch accent labels, but the recognizer is trained from a small amount ofdata and has high error rate; the second one is English broadcast news data using a state-of-the-art SRI recognizer. Our experimental results demonstrate that our approach is able to reduce word error rate relatively by about 3%. This gain is consistent across the two different tests, showing promising future directions of incorporating prosodic information to improve speech recognition.
3 0.78297561 95 acl-2011-Detection of Agreement and Disagreement in Broadcast Conversations
Author: Wen Wang ; Sibel Yaman ; Kristin Precoda ; Colleen Richey ; Geoffrey Raymond
Abstract: We present Conditional Random Fields based approaches for detecting agreement/disagreement between speakers in English broadcast conversation shows. We develop annotation approaches for a variety of linguistic phenomena. Various lexical, structural, durational, and prosodic features are explored. We compare the performance when using features extracted from automatically generated annotations against that when using human annotations. We investigate the efficacy of adding prosodic features on top of lexical, structural, and durational features. Since the training data is highly imbalanced, we explore two sampling approaches, random downsampling and ensemble downsampling. Overall, our approach achieves 79.2% (precision), 50.5% (recall), 61.7% (F1) for agreement detection and 69.2% (precision), 46.9% (recall), and 55.9% (F1) for disagreement detection, on the English broadcast conversation data. 1 ?yIntroduction In ?ythis work, we present models for detecting agre?yement/disagreement (denoted (dis)agreement) betwy?een speakers in English broadcast conversation show?ys. The Broadcast Conversation (BC) genre differs from the Broadcast News (BN) genre in that it is?y more interactive and spontaneous, referring to freey? speech in news-style TV and radio programs and consisting of talk shows, interviews, call-in prog?yrams, live reports, and round-tables. Previous y? y?This work was performed while the author was at ICSI. syaman@us . ibm .com, graymond@ s oc .uc sb . edu work on detecting (dis)agreements has been focused on meeting data. (Hillard et al., 2003), (Galley et al., 2004), (Hahn et al., 2006) used spurt-level agreement annotations from the ICSI meeting corpus (Janin et al., 2003). (Hillard et al., 2003) explored unsupervised machine learning approaches and on manual transcripts, they achieved an overall 3-way agreement/disagreement classification ac- curacy as 82% with keyword features. (Galley et al., 2004) explored Bayesian Networks for the detection of (dis)agreements. They used adjacency pair information to determine the structure of their conditional Markov model and outperformed the results of (Hillard et al., 2003) by improving the 3way classification accuracy into 86.9%. (Hahn et al., 2006) explored semi-supervised learning algorithms and reached a competitive performance of 86.7% 3-way classification accuracy on manual transcriptions with only lexical features. (Germesin and Wilson, 2009) investigated supervised machine learning techniques and yields competitive results on the annotated data from the AMI meeting corpus (McCowan et al., 2005). Our work differs from these previous studies in two major categories. One is that a different definition of (dis)agreement was used. In the current work, a (dis)agreement occurs when a responding speaker agrees with, accepts, or disagrees with or rejects, a statement or proposition by a first speaker. Second, we explored (dis)agreement detection in broadcast conversation. Due to the difference in publicity and intimacy/collegiality between speakers in broadcast conversations vs. meet- ings, (dis)agreement may have different character374 Proceedings ofP thoer t4l9atnhd A, Onrnuegaoln M,e Jeuntineg 19 o-f2 t4h,e 2 A0s1s1o.c?i ac t2io0n11 fo Ar Cssoocmiaptuiotanti foonra Clo Lminpguutiast i ocns:aslh Loirntpgaupisetrics , pages 374–378, istics. Different from the unsupervised approaches in (Hillard et al., 2003) and semi-supervised approaches in (Hahn et al., 2006), we conducted supervised training. 
Also, different from (Hillard et al., 2003) and (Galley et al., 2004), our classification was carried out on the utterance level, instead of on the spurt-level. Galley et al. extended Hillard et al.’s work by adding features from previous spurts and features from the general dialog context to infer the class of the current spurt, on top of features from the current spurt (local features) used by Hillard et al. Galley et al. used adjacency pairs to describe the interaction between speakers and the relations between consecutive spurts. In this preliminary study on broadcast conversation, we directly modeled (dis)agreement detection without using adjacency pairs. Still, within the conditional random fields (CRF) framework, we explored features from preceding and following utterances to consider context in the discourse structure. We explored a wide variety of features, including lexical, structural, du- rational, and prosodic features. To our knowledge, this is the first work to systematically investigate detection of agreement/disagreement for broadcast conversation data. The remainder of the paper is organized as follows. Section 2 presents our data and automatic annotation modules. Section 3 describes various features and the CRF model we explored. Experimental results and discussion appear in Section 4, as well as conclusions and future directions. 2 Data and Automatic Annotation In this work, we selected English broadcast conversation data from the DARPA GALE program collected data (GALE Phase 1 Release 4, LDC2006E91; GALE Phase 4 Release 2, LDC2009E15). Human transcriptions and manual speaker turn labels are used in this study. Also, since the (dis)agreement detection output will be used to analyze social roles and relations of an interacting group, we first manually marked soundbites and then excluded soundbites during annotation and modeling. We recruited annotators to provide manual annotations of speaker roles and (dis)agreement to use for the supervised training of models. We de- fined a set of speaker roles as follows. Host/chair is a person associated with running the discussions 375 or calling the meeting. Reporting participant is a person reporting from the field, from a subcommittee, etc. Commentator participant/Topic participant is a person providing commentary on some subject, or person who is the subject of the conversation and plays a role, e.g., as a newsmaker. Audience participant is an ordinary person who may call in, ask questions at a microphone at e.g. a large presentation, or be interviewed because of their presence at a news event. Other is any speaker who does not fit in one of the above categories, such as a voice talent, an announcer doing show openings or commercial breaks, or a translator. Agreements and disagreements are composed of different combinations of initiating utterances and responses. We reformulated the (dis)agreement detection task as the sequence tagging of 11 (dis)agreement-related labels for identifying whether a given utterance is initiating a (dis)agreement opportunity, is a (dis)agreement response to such an opportunity, or is neither of these, in the show. For example, a Negative tag question followed by a negation response forms an agreement, that is, A: [Negative tag] This is not black and white, is it? B: [Agreeing Response] No, it isn’t. The data sparsity problem is serious. 
Among all 27,071 utterances, only 2,589 utterances are involved in (dis)agreement as initiating or response utterances, about 10% only among all data, while 24,482 utterances are not involved. These annotators also labeled shows with a variety of linguistic phenomena (denoted language use constituents, LUC), including discourse markers, disfluencies, person addresses and person mentions, prefaces, extreme case formulations, and dialog act tags (DAT). We categorized dialog acts into statement, question, backchannel, and incomplete. We classified disfluencies (DF) into filled pauses (e.g., uh, um), repetitions, corrections, and false starts. Person address (PA) terms are terms that a speaker uses to address another person. Person mentions (PM) are references to non-participants in the conversation. Discourse markers (DM) are words or phrases that are related to the structure of the discourse and express a relation between two utter- ances, for example, I mean, you know. Prefaces (PR) are sentence-initial lexical tokens serving functions close to discourse markers (e.g., Well, I think that...). Extreme case formulations (ECF) are lexical patterns emphasizing extremeness (e.g., This is the best book I have ever read). In the end, we manually annotated 49 English shows. We preprocessed English manual transcripts by removing transcriber annotation markers and noise, removing punctuation and case information, and conducting text normalization. We also built automatic rule-based and statistical annotation tools for these LUCs. 3 Features and Model We explored lexical, structural, durational, and prosodic features for (dis)agreement detection. We included a set of “lexical” features, including ngrams extracted from all of that speaker’s utterances, denoted ngram features. Other lexical features include the presence of negation and acquiescence, yes/no equivalents, positive and negative tag questions, and other features distinguishing different types of initiating utterances and responses. We also included various lexical features extracted from LUC annotations, denoted LUC features. These additional features include features related to the presence of prefaces, the counts of types and tokens of discourse markers, extreme case formulations, disfluencies, person addressing events, and person mentions, and the normalized values of these counts by sentence length. We also include a set of features related to the DAT of the current utterance and preceding and following utterances. We developed a set of “structural” and “durational” features, inspired by conversation analysis, to quantitatively represent the different participation and interaction patterns of speakers in a show. We extracted features related to pausing and overlaps between consecutive turns, the absolute and relative duration of consecutive turns, and so on. We used a set of prosodic features including pause, duration, and the speech rate of a speaker. We also used pitch and energy of the voice. Prosodic features were computed on words and phonetic alignment of manual transcripts. Features are computed for the beginning and ending words of an utterance. For the duration features, we used the average and maximum vowel duration from forced align- ment, both unnormalized and normalized for vowel identity and phone context. For pitch and energy, we 376 calculated the minimum, maximum,E range, mean, standard deviation, skewnesSs and kurEtosis values. 
A decision tree model was used to compute posteriors from the prosodic features, and we used cumulative binning of the posteriors as final features, similar to (Liu et al., 2006).

As illustrated in Section 2, we reformulated the (dis)agreement detection task as a sequence tagging problem. We used the Mallet package (McCallum, 2002) to implement the linear-chain CRF model for sequence tagging. A CRF is an undirected graphical model that defines a global log-linear distribution over the state (or label) sequence E conditioned on an observation sequence, in our case the sequence of sentences S and the corresponding sequence of features F for this sequence of sentences. The model is optimized globally over the entire sequence: the CRF is trained to maximize the conditional log-likelihood P(E|S, F) over a given training set, and during testing the most likely sequence E is found using the Viterbi algorithm. One of the motivations for choosing conditional random fields was to avoid the label-bias problem found in hidden Markov models. Compared to Maximum Entropy modeling, the CRF model is optimized globally over the entire sequence, whereas the ME model makes a decision at each point individually without considering the context event information.

4 Experiments

All (dis)agreement detection results are based on n-fold cross-validation. In this procedure, we held out one show as the test set, randomly held out another show as the dev set, trained models on the rest of the data, and tested the model on the held-out show. We iterated through all shows and computed the overall accuracy.

Table 1 shows the results of (dis)agreement detection using all features except prosodic features. We compared two conditions: (1) features extracted completely from the automatic LUC annotations and automatically detected speaker roles, and (2) features from manual speaker role labels and manual LUC annotations when manual annotations are available. Table 1 shows that running a fully automatic system to generate automatic annotations and automatic speaker roles produced performance comparable to the system using features from manual annotations whenever available.

Table 1: Precision (%), recall (%), and F1 (%) of (dis)agreement detection using features extracted from manual speaker role labels and manual LUC annotations when available, denoted Manual Annotation, and automatic LUC annotations and automatically detected speaker roles, denoted Automatic Annotation.

We then focused on the condition of using features from manual annotations when available and added prosodic features as described in Section 3. The results are shown in Table 2. Adding prosodic features produced a 0.7% absolute gain in F1 on agreement detection and a 1.5% absolute gain in F1 on disagreement detection.

Table 2: Precision (%), recall (%), and F1 (%) of (dis)agreement detection using manual annotations without and with prosodic features.

Note that only about 10% of the utterances in the data are involved in (dis)agreement. This indicates a highly imbalanced data set, in which one class is much more heavily represented than the others.
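Before turning to the class-imbalance issue, the following minimal sketch illustrates the linear-chain CRF setup described above. It uses the sklearn-crfsuite package purely for illustration (the paper used the Mallet implementation), and the feature dictionaries and label names are placeholders for the lexical, LUC, structural, durational, and prosodic features of Section 3.

    import sklearn_crfsuite

    # Toy data: each show is a sequence of utterances, each carrying a feature
    # dict and a (dis)agreement-related label (placeholder names).
    train_shows = [
        [
            ({"neg_tag_question": True,  "pause_before": 0.2, "f0_range": 85.0}, "AGREE_INIT"),
            ({"neg_tag_question": False, "pause_before": 0.1, "f0_range": 40.0}, "AGREE_RESP"),
            ({"neg_tag_question": False, "pause_before": 1.3, "f0_range": 55.0}, "NONE"),
        ],
    ]
    X_train = [[feats for feats, _ in show] for show in train_shows]
    y_train = [[label for _, label in show] for show in train_shows]

    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
    crf.fit(X_train, y_train)      # training maximizes the conditional log-likelihood
    y_pred = crf.predict(X_train)  # decoding returns the most likely label sequence (Viterbi)

The class imbalance noted above motivates the sampling strategies examined next, which are applied before this CRF training step.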
We suspected that this high imbalance has played a major role in the high-precision, low-recall results we obtained so far. Various approaches have been studied to handle imbalanced data for classification, trying to balance the class distribution in the training set by either oversampling the minority class or downsampling the majority class. In this preliminary study of sampling approaches for handling imbalanced data in CRF training, we investigated two approaches: random downsampling and ensemble downsampling. Random downsampling randomly downsamples the majority class to equate the number of minority and majority class samples. Ensemble downsampling is a refinement of random downsampling which does not discard any majority class samples. Instead, we partitioned the majority class samples into N subspaces, with each subspace containing the same number of samples as the minority class. Then we train N CRF models, each based on the minority class samples and one disjoint partition from the N subspaces. During testing, the posterior probability for one utterance is averaged over the N CRF models.

The results from these two sampling approaches as well as the baseline are shown in Table 3. Both sampling approaches achieved significant improvement over the baseline, i.e., training on the original data set, and ensemble downsampling produced better performance than random downsampling. We noticed that both sampling approaches degraded slightly in precision but improved significantly in recall, resulting in a 4.5% absolute gain in F1 for agreement detection and a 4.7% absolute gain in F1 for disagreement detection.

Table 3: Precision (%), recall (%), and F1 (%) of (dis)agreement detection without sampling, with random downsampling, and with ensemble downsampling. Manual annotations and prosodic features are used.

In conclusion, this paper presents our work on detection of agreements and disagreements in English broadcast conversation data. We explored a variety of features, including lexical, structural, durational, and prosodic features. We experimented with these features using a linear-chain conditional random fields model and conducted supervised training. We observed significant improvement from adding prosodic features and from employing two sampling approaches, random downsampling and ensemble downsampling. Overall, we achieved 79.2% precision, 50.5% recall, and 61.7% F1 for agreement detection, and 69.2% precision, 46.9% recall, and 55.9% F1 for disagreement detection, on English broadcast conversation data. In future work, we plan to continue adding and refining features, explore dependencies between features and contextual cues with respect to agreements and disagreements, and investigate the efficacy of other machine learning approaches such as Bayesian networks and Support Vector Machines.
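For concreteness, the ensemble downsampling procedure evaluated above can be sketched as follows; train_model and model.posterior are stand-ins for CRF training and per-utterance posterior computation, so this is an assumed interface rather than the paper's actual implementation.

    import numpy as np

    def ensemble_downsampling(minority, majority, train_model, seed=0):
        # Partition the majority class into disjoint subsets, each the size of the
        # minority class, and train one model per (minority + partition) set.
        rng = np.random.default_rng(seed)
        shuffled = [majority[i] for i in rng.permutation(len(majority))]
        k = len(minority)
        n_models = max(1, len(shuffled) // k)
        return [train_model(minority + shuffled[i * k:(i + 1) * k])
                for i in range(n_models)]

    def averaged_posterior(models, utterance):
        # At test time, average the per-utterance posterior over the N models.
        return np.mean([model.posterior(utterance) for model in models], axis=0)

Averaging posteriors over the N models uses every majority-class sample, at the cost of training N CRFs instead of one.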
Acknowledgments

The authors thank Gokhan Tur and Dilek Hakkani-Tür for valuable insights and suggestions. This work has been supported by the Intelligence Advanced Research Projects Activity (IARPA) via Army Research Laboratory (ARL) contract number W911NF-09-C-0089. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, ARL, or the U.S. Government.

References

M. Galley, K. McKeown, J. Hirschberg, and E. Shriberg. 2004. Identifying agreement and disagreement in conversational speech: Use of Bayesian networks to model pragmatic dependencies. In Proceedings of ACL.
S. Germesin and T. Wilson. 2009. Agreement detection in multiparty conversation. In Proceedings of the International Conference on Multimodal Interfaces.
S. Hahn, R. Ladner, and M. Ostendorf. 2006. Agreement/disagreement classification: Exploiting unlabeled data using constraint classifiers. In Proceedings of HLT/NAACL.
D. Hillard, M. Ostendorf, and E. Shriberg. 2003. Detection of agreement vs. disagreement in meetings: Training with unlabeled data. In Proceedings of HLT/NAACL.
A. Janin, D. Baron, J. Edwards, D. Ellis, D. Gelbart, N. Morgan, B. Peskin, T. Pfau, E. Shriberg, A. Stolcke, and C. Wooters. 2003. The ICSI Meeting Corpus. In Proceedings of ICASSP, Hong Kong, April.
Yang Liu, Elizabeth Shriberg, Andreas Stolcke, Dustin Hillard, Mari Ostendorf, and Mary Harper. 2006. Enriching speech recognition with automatic detection of sentence boundaries and disfluencies. IEEE Transactions on Audio, Speech, and Language Processing, 14(5):1526–1540, September. Special Issue on Progress in Rich Transcription.
Andrew McCallum. 2002. Mallet: A machine learning for language toolkit. http://mallet.cs.umass.edu.
I. McCowan, J. Carletta, W. Kraaij, S. Ashby, S. Bourban, M. Flynn, M. Guillemot, T. Hain, J. Kadlec, V. Karaiskos, M. Kronenthal, G. Lathoud, M. Lincoln, A. Lisowska, W. Post, D. Reidsma, and P. Wellner. 2005. The AMI meeting corpus. In Proceedings of Measuring Behavior 2005, the 5th International Conference on Methods and Techniques in Behavioral Research.
4 0.72805923 312 acl-2011-Turn-Taking Cues in a Human Tutoring Corpus
Author: Heather Friedberg
Abstract: Most spoken dialogue systems are still lacking in their ability to accurately model the complex process that is human turn-taking. This research analyzes a human-human tutoring corpus in order to identify prosodic turn-taking cues, with the hopes that they can be used by intelligent tutoring systems to predict student turn boundaries. Results show that while there was variation between subjects, three features were significant turn-yielding cues overall. In addition, a positive relationship between the number of cues present and the probability of a turn yield was demonstrated.
5 0.71087354 83 acl-2011-Contrasting Multi-Lingual Prosodic Cues to Predict Verbal Feedback for Rapport
Author: Siwei Wang ; Gina-Anne Levow
Abstract: Verbal feedback is an important information source in establishing interactional rapport. However, predicting verbal feedback across languages is challenging due to language-specific differences, inter-speaker variation, and the relative sparseness and optionality of verbal feedback. In this paper, we employ an approach combining classifier weighting and SMOTE algorithm oversampling to improve verbal feedback prediction in Arabic, English, and Spanish dyadic conversations. This approach improves the prediction of verbal feedback, up to 6-fold, while maintaining a high overall accuracy. Analyzing highly weighted features highlights widespread use of pitch, with more varied use of intensity and duration.
6 0.70223063 118 acl-2011-Entrainment in Speech Preceding Backchannels.
7 0.63094968 223 acl-2011-Modeling Wisdom of Crowds Using Latent Mixture of Discriminative Experts
8 0.46023065 272 acl-2011-Semantic Information and Derivation Rules for Robust Dialogue Act Detection in a Spoken Dialogue System
9 0.45510629 260 acl-2011-Recognizing Authority in Dialogue with an Integer Linear Programming Constrained Model
10 0.43211398 301 acl-2011-The impact of language models and loss functions on repair disfluency detection
11 0.43157518 77 acl-2011-Computing and Evaluating Syntactic Complexity Features for Automated Scoring of Spontaneous Non-Native Speech
12 0.39118171 33 acl-2011-An Affect-Enriched Dialogue Act Classification Model for Task-Oriented Dialogue
13 0.37783074 249 acl-2011-Predicting Relative Prominence in Noun-Noun Compounds
14 0.36583525 185 acl-2011-Joint Identification and Segmentation of Domain-Specific Dialogue Acts for Conversational Dialogue Systems
15 0.36366493 203 acl-2011-Learning Sub-Word Units for Open Vocabulary Speech Recognition
16 0.3388541 179 acl-2011-Is Machine Translation Ripe for Cross-Lingual Sentiment Classification?
17 0.32871205 238 acl-2011-P11-2093 k2opt.pdf
18 0.3232525 21 acl-2011-A Pilot Study of Opinion Summarization in Conversations
19 0.32266632 306 acl-2011-Towards Style Transformation from Written-Style to Audio-Style
20 0.31366196 278 acl-2011-Semi-supervised condensed nearest neighbor for part-of-speech tagging
topicId topicWeight
[(5, 0.041), (9, 0.196), (17, 0.053), (26, 0.015), (31, 0.012), (37, 0.097), (39, 0.057), (41, 0.133), (44, 0.016), (55, 0.023), (59, 0.034), (61, 0.012), (72, 0.036), (88, 0.019), (91, 0.037), (96, 0.142), (97, 0.011)]
simIndex simValue paperId paperTitle
1 0.8660866 114 acl-2011-End-to-End Relation Extraction Using Distant Supervision from External Semantic Repositories
Author: Truc Vien T. Nguyen ; Alessandro Moschitti
Abstract: In this paper, we extend distant supervision (DS) based on Wikipedia for Relation Extraction (RE) by considering (i) relations defined in external repositories, e.g. YAGO, and (ii) any subset of Wikipedia documents. We show that training data constituted by sentences containing pairs of named entities in target relations is enough to produce reliable supervision. Our experiments with state-of-the-art relation extraction models, trained on the above data, show a meaningful F1 of 74.29% on a manually annotated test set: this highly improves the state-of-art in RE using DS. Additionally, our end-to-end experiments demonstrated that our extractors can be applied to any general text document.
2 0.81937456 231 acl-2011-Nonlinear Evidence Fusion and Propagation for Hyponymy Relation Mining
Author: Fan Zhang ; Shuming Shi ; Jing Liu ; Shuqi Sun ; Chin-Yew Lin
Abstract: This paper focuses on mining the hyponymy (or is-a) relation from large-scale, open-domain web documents. A nonlinear probabilistic model is exploited to model the correlation between sentences in the aggregation of pattern matching results. Based on the model, we design a set of evidence combination and propagation algorithms. These significantly improve the result quality of existing approaches. Experimental results conducted on 500 million web pages and hypernym labels for 300 terms show over 20% performance improvement in terms of P@5, MAP and R-Precision. 1 Introduction1 An important task in text mining is the automatic extraction of entities and their lexical relations; this has wide applications in natural language processing and web search. This paper focuses on mining the hyponymy (or is-a) relation from largescale, open-domain web documents. From the viewpoint of entity classification, the problem is to automatically assign fine-grained class labels to terms. There have been a number of approaches (Hearst 1992; Pantel & Ravichandran 2004; Snow et al., 2005; Durme & Pasca, 2008; Talukdar et al., 2008) to address the problem. These methods typically exploited manually-designed or automatical* This work was performed when Fan Zhang and Shuqi Sun were interns at Microsoft Research Asia 1159 ly-learned patterns (e.g., “NP such as NP”, “NP like NP”, “NP is a NP”). Although some degree of success has been achieved with these efforts, the results are still far from perfect, in terms of both recall and precision. As will be demonstrated in this paper, even by processing a large corpus of 500 million web pages with the most popular patterns, we are not able to extract correct labels for many (especially rare) entities. Even for popular terms, incorrect results often appear in their label lists. The basic philosophy in existing hyponymy extraction approaches (and also many other textmining methods) is counting: count the number of supporting sentences. Here a supporting sentence of a term-label pair is a sentence from which the pair can be extracted via an extraction pattern. We demonstrate that the specific way of counting has a great impact on result quality, and that the state-ofthe-art counting methods are not optimal. Specifically, we examine the problem from the viewpoint of probabilistic evidence combination and find that the probabilistic assumption behind simple counting is the statistical independence between the observations of supporting sentences. By assuming a positive correlation between supporting sentence observations and adopting properly designed nonlinear combination functions, the results precision can be improved. It is hard to extract correct labels for rare terms from a web corpus due to the data sparseness problem. To address this issue, we propose an evidence propagation algorithm motivated by the observation that similar terms tend to share common hypernyms. For example, if we already know that 1) Helsinki and Tampere are cities, and 2) Porvoo is similar to Helsinki and Tampere, then Porvoo is ProceedingPso orftla thned 4,9 Otrhe Agonnn,u Jauln Mee 1e9t-i2ng4, o 2f0 t1h1e. A ?c s 2o0ci1a1ti Aonss foocria Ctioomnp fourta Ctioomnaplu Ltaintigouniaslti Lcisn,g puaigsetsic 1s159–1168, very likely also a city. This intuition, however, does not mean that the labels of a term can always be transferred to its similar terms. For example, Mount Vesuvius and Kilimanjaro are volcanoes and Lhotse is similar to them, but Lhotse is not a volcano. 
Therefore we should be very conservative and careful in hypernym propagation. In our propagation algorithm, we first construct some pseudo supporting sentences for a term from the supporting sentences of its similar terms. Then we calculate label scores for terms by performing nonlinear evidence combination based on the (pseudo and real) supporting sentences. Such a nonlinear propagation algorithm is demonstrated to perform better than linear propagation. Experimental results on a publicly available collection of 500 million web pages with hypernym labels annotated for 300 terms show that our nonlinear evidence fusion and propagation significantly improve the precision and coverage of the extracted hyponymy data. This is one of the technologies adopted in our semantic search and min- ing system NeedleSeek2. In the next section, we discuss major related efforts and how they differ from our work. Section 3 is a brief description of the baseline approach. The probabilistic evidence combination model that we exploited is introduced in Section 4. Our main approach is illustrated in Section 5. Section 6 shows our experimental settings and results. Finally, Section 7 concludes this paper. 2 Related Work Existing efforts for hyponymy relation extraction have been conducted upon various types of data sources, including plain-text corpora (Hearst 1992; Pantel & Ravichandran, 2004; Snow et al., 2005; Snow et al., 2006; Banko, et al., 2007; Durme & Pasca, 2008; Talukdar et al., 2008), semistructured web pages (Cafarella et al., 2008; Shinzato & Torisawa, 2004), web search results (Geraci et al., 2006; Kozareva et al., 2008; Wang & Cohen, 2009), and query logs (Pasca 2010). Our target for optimization in this paper is the approaches that use lexico-syntactic patterns to extract hyponymy relations from plain-text corpora. Our future work will study the application of the proposed algorithms on other types of approaches. 2 http://research.microsoft.com/en-us/projects/needleseek/ or http://needleseek.msra.cn/ 1160 The probabilistic evidence combination model that we exploit here was first proposed in (Shi et al., 2009), for combining the page in-link evidence in building a nonlinear static-rank computation algorithm. We applied it to the hyponymy extraction problem because the model takes the dependency between supporting sentences into consideration and the resultant evidence fusion formulas are quite simple. In (Snow et al., 2006), a probabilistic model was adopted to combine evidence from heterogeneous relationships to jointly optimize the relationships. The independence of evidence was assumed in their model. In comparison, we show that better results will be obtained if the evidence correlation is modeled appropriately. Our evidence propagation is basically about using term similarity information to help instance labeling. There have been several approaches which improve hyponymy extraction with instance clusters built by distributional similarity. In (Pantel & Ravichandran, 2004), labels were assigned to the committee (i.e., representative members) of a semantic class and used as the hypernyms of the whole class. Labels generated by their approach tend to be rather coarse-grained, excluding the possibility of a term having its private labels (considering the case that one meaning of a term is not covered by the input semantic classes). In contrast to their method, our label scoring and ranking approach is applied to every single term rather than a semantic class. 
In addition, we also compute label scores in a nonlinear way, which improves results quality. In Snow et al. (2005), a supervised approach was proposed to improve hypernym classification using coordinate terms. In comparison, our approach is unsupervised. Durme & Pasca (2008) cleaned the set of instance-label pairs with a TF*IDF like method, by exploiting clusters of semantically related phrases. The core idea is to keep a term-label pair (T, L) only if the number of terms having the label L in the term T’s cluster is above a threshold and if L is not the label of too many clusters (otherwise the pair will be discarded). In contrast, we are able to add new (high-quality) labels for a term with our evidence propagation method. On the other hand, low quality labels get smaller score gains via propagation and are ranked lower. Label propagation is performed in (Talukdar et al., 2008; Talukdar & Pereira, 2010) based on multiple instance-label graphs. Term similarity information was not used in their approach. Most existing work tends to utilize small-scale or private corpora, whereas the corpus that we used is publicly available and much larger than most of the existing work. We published our term sets (refer to Section 6. 1) and their corresponding user judgments so researchers working on similar topics can reproduce our results. H eTIsaryApst-eI {N aP nLd(|{iso,}ra (eNisn|.wuPcgalhsu|edwa.gs(e)r{PN|baiPent,c}lg*ur){nd(ai n|dg)o {r}N P,L}* IsA-II NP (is|are|was|were|being) {the, those} NPL IsA-III NP (is|are|was|were|being) {another, any} NPL Table 1. Patterns adopted in this paper (NP: named phrase representing an entity; NPL: label) 3 Preliminaries The problem addressed in this paper is corpusbased is-a relation mining: extracting hypernyms (as labels) for entities from a large-scale, open- domain document corpus. The desired output is a mapping from terms to their corresponding hypernyms, which can naturally be represented as a weighted bipartite graph (term-label graph). Typically we are only interested in top labels of a term in the graph. Following existing efforts, we adopt patternmatching as a basic way of extracting hypernymy/hyponymy relations. Two types of patterns (refer to Table 1) are employed, including the popular “Hearst patterns” (Hearst, 1992) and the IsA patterns which are exploited less frequently in existing hyponym mining efforts. One or more termlabel pairs can be extracted if a pattern matches a sentence. In the baseline approach, the weight of an edge TL (from term T to hypernym label L) in the term-label graph is computed as, ( ) w(TL) ( ) (3.1) where m is the number of times the pair (T, L) is extracted from the corpus, DF(L) is the number of in-links of L in the graph, N is total number of terms in the graph, and IDF means the “inverse document frequency”. A term can only keep its top-k neighbors (according to the edge weight) in the graph as its final labels. 1161 Our pattern matching algorithm implemented in this paper uses part-of-speech (POS) tagging information, without adopting a parser or a chunker. The noun phrase boundaries (for terms and labels) are determined by a manually designed POS tag list. 4 Probabilistic Label-Scoring Model Here we model the hyponymy extraction problem from the probability theory point of view, aiming at estimating the score of a term-label pair (i.e., the score of a label w.r.t. a term) with probabilistic evidence combination. 
The model was studied in (Shi et al., 2009) to combine the page in-link evidence in building a nonlinear static-rank computation algorithm. We represent the score of a term-label pair by the probability of the label being a correct hypernym of the term, and define the following events, AT,L: Label L is a hypernym of term T (the abbreviated form A is used in this paper unless it is ambiguous). Ei: The observation that (T, L) is extracted from a sentence Si via pattern matching (i.e., Si is a sup- porting sentence of the pair). Assuming that we already know m supporting sentences (S1~Sm), our problem is to compute P(A|E1,E2,..,Em), the posterior probability that L is a hypernym of term T, given evidence E1~Em. Formally, we need to find a function f to satisfy, P(A|E1,… ,Em) = f(P(A), P(A|E1)… P(A|Em) ) (4.1) … … …, For simplicity, we first consider the case of m=2. The case of m>2 is quite similar. We start from the simple case of independent supporting sentences. That is, ( ) ( ) ( ) ( ) ( )( ) By applying Bayes rule, we get, ( (4.2) (4.3) ) ( ( ) )() ( ( ) ) ( ) ( () ) ( ) ( ) (4.4) ( ) ( ) ( ) Then define ( ) ( ( )) ( ( )) ( ( )) Here G(A|E) represents the log-probability-gain of A given E, with the meaning of the gain in the log-probability value of A after the evidence E is observed (or known). It is a measure of the impact of evidence E to the probability of event A. With the definition of G(A|E), Formula 4.4 can be transformed to, ( ) ( ) ( ) (4.5) Therefore, if E1 and E2 are independent, the logprobability-gain of A given both pieces of evidence will exactly be the sum of the gains of A given every single piece of evidence respectively. It is easy to prove (by following a similar procedure) that the above Formula holds for the case of m>2, as long as the pieces of evidence are mutually independent. Therefore for a term-label pair with m mutually independent supporting sentences, if we set every gain G(A|Ei) to be a constant value g, the posterior gain score of the pair will be ∑ If the value g is the IDF of label L, the posterior gain will be, . G(AT,L|E1… ,Em) ∑ ( ) ( ) (4.6) This is exactly the Formula 3. 1. By this way, we provide a probabilistic explanation of scoring the candidate labels for a term via simple counting. … TRaAb:le(2.A/E)Rv(ide)ncHd065epa.9r81sn7t-dIec10ys.7Ae3-1I0timEa2o:8nH0Is.4feA2oa3-r78I0sitna- pattern and inter-pattern supporting sentences In the above analysis, we assume the statistical independence of the supporting sentence observations, which may not hold in reality. Intuitively, if we already know one supporting sentence S1 for a term-label pair (T, L), then we have more chance to find another supporting sentence than if we do not know S1. The reason is that, before we find S1, we have to estimate the probability with the chance of discovering a supporting sentence for a random term-label pair. The probability is quite low because most term-label pairs do not have hyponymy relations. Once we have observed S1, however, the chance of (T, L) having a hyponymy relation in1162 creases. Therefore the chance of observing another supporting sentence becomes larger than before. Table 2 shows the rough estimation of ( ( ) ( ) ) (denoted as RA), ( ( ) ( ) ) (denoted as R), and their ratios. The statistics are obtained by performing maximal likelihood estimation (MLE) upon our corpus and a random selection of term-label pairs from our term sets (see Section 6. 1) together with their top labels3. 
The data verifies our analysis about the correlation between E1 and E2 (note that R=1 means independent). In addition, it can be seen that the conditional independence assumption of Formula 4.3 does not hold (because RA>1). It is hence necessary to consider the correlation between supporting sentences in the model. The estimation of Table 2 also indicates that, ( ( ) ( )) ( ( ) ( ) ) (4.7) By following a similar procedure as above, with Formulas 4.2 and 4.3 replaced by 4.7, we have, ( ) ( ) ( ) (4.8) This formula indicates that when the supporting sentences are positively correlated, the posterior score of label L w.r.t. term T (given both the sen- tences) is smaller than the sum of the gains caused by one sentence only. In the extreme case that sentence S2 fully depends on E1 (i.e. P(E2|E1)=1), it is easy to prove that ( ) ( ) It is reasonable, since event E2 does not bring in more information than E1. Formula 4.8 cannot be used directly for computing the posterior gain. What we really need is a function h satisfying () ( ( ) ( )) (4.9) and ( )∑ (4.10) Shi et al. (2009) discussed other constraints to h and suggested the following nonlinear functions, ( ) ( ∑ ( )) (4. 11) 3 RA is estimated from the labels judged as “Good”; whereas the estimation of R is from all judged labels. ( ) √ ∑ (p>1) (4.12) In the next section, we use the above two h func- tions as basic building blocks to compute label scores for terms. 5 Our Approach Multiple types of patterns (Table 1) can be adopted to extract term-label pairs. For two supporting sentences the correlation between them may depend on whether they correspond to the same pattern. In Section 5. 1, our nonlinear evidence fusion formulas are constructed by making specific assumptions about the correlation between intra-pattern supporting sentences and inter-pattern ones. Then in Section 5.2, we introduce our evidence propagation technique in which the evidence of a (T, L) pair is propagated to the terms similar to T. 5.1 Nonlinear evidence fusion For a term-label pair (T, L), assuming K patterns are used for hyponymy extraction and the supporting sentences discovered with pattern iare, (5.1) where mi is the number of supporting sentences corresponding to pattern i. Also assume the gain score of Si,j is xi,j, i.e., xi,j=G(A|Si,j). Generally speaking, supporting sentences corre- sponding to the same pattern typically have a higher correlation than the sentences corresponding to different patterns. This can be verified by the data in Table-2. By ignoring the inter-pattern correlations, we make the following simplified assumption: Assumption: Supporting sentences corresponding to the same pattern are correlated, while those of different patterns are independent. According to this assumption, our label-scoring function is, ( ) ∑ ( ) (5.2) In the simple case that ( ) , if the h function of Formula 4. 12 is adopted, then, ( ) (∑ √ ) ( ) (5.3) 1163 We use an example to illustrate the above formula. Example: For term T and label L1, assume the numbers of the supporting sentences corresponding to the six pattern types in Table 1 are (4, 4, 4, 4, 4, 4), which means the number of supporting sentences discovered by each pattern type is 4. Also assume the supporting-sentence-count vector of label L2 is (25, 0, 0, 0, 0, 0). If we use Formula 5.3 to compute the scores of L1 and L2, we can have the following (ignoring IDF for simplicity), Score(L1) Score(L2) One the other hand, if we simply count the total number of supporting sentences, the score of L2 will be larger. 
The rationale implied in the formula is: For a given term T, the labels supported by multiple types of patterns tend to be more reliable than those supported by a single pattern type, if they have the same number of supporting sentences. √ ; √ 5.2 Evidence propagation According to the evidence fusion algorithm described above, in order to extract term labels reliably, it is desirable to have many supporting sentences of different types. This is a big challenge for rare terms, due to their low frequency in sentences (and even lower frequency in supporting sentences because not all occurrences can be covered by patterns). With evidence propagation, we aim at discovering more supporting sentences for terms (especially rare terms). Evidence propagation is motivated by the following two observations: (I) Similar entities or coordinate terms tend to share some common hypernyms. (II) Large term similarity graphs are able to be built efficiently with state-of-the-art techniques (Agirre et al., 2009; Pantel et al., 2009; Shi et al., 2010). With the graphs, we can obtain the similarity between two terms without their hypernyms being available. The first observation motivates us to “borrow” the supporting sentences from other terms as auxiliary evidence of the term. The second observation means that new information is brought with the state-of-the-art term similarity graphs (in addition to the term-label information discovered with the patterns of Table 1). Our evidence propagation algorithm contains two phases. In phase I, some pseudo supporting sentences are constructed for a term from the supporting sentences of its neighbors in the similarity graph. Then we calculate the label scores for terms based on their (pseudo and real) supporting sentences. Phase I: For every supporting sentence S and every similar term T1 of the term T, add a pseudo supporting sentence S1 for T1, with the gain score, ( ) ( ( ) ) (5.5) where is the propagation factor, and ( ) is the term similarity function taking values in [0, 1]. The formula reasonably assumes that the gain score of the pseudo supporting sentence depends on the gain score of the original real supporting sentence, the similarity between the two terms, and the propagation factor. Phase II: The nonlinear evidence combination formulas in the previous subsection are adopted to combine the evidence of pseudo supporting sentences. Term similarity graphs can be obtained by distributional similarity or patterns (Agirre et al., 2009; Pantel et al., 2009; Shi et al., 2010). We call the first type of graph DS and the second type PB. DS approaches are based on the distributional hypothesis (Harris, 1985), which says that terms appearing in analogous contexts tend to be similar. In a DS approach, a term is represented by a feature vector, with each feature corresponding to a context in which the term appears. The similarity between two terms is computed as the similarity between their corresponding feature vectors. In PB approaches, a list of carefully-designed (or automatically learned) patterns is exploited and applied to a text collection, with the hypothesis that the terms extracted by applying each of the patterns to a specific piece of text tend to be similar. Two categories of patterns have been studied in the literature (Heast 1992; Pasca 2004; Kozareva et al., 2008; Zhang et al., 2009): sentence lexical patterns, and HTML tag patterns. An example of sentence lexical patterns is “T {, T} *{,} (and|or) T”. 
HTML tag patterns include HTML tables, drop-down lists, and other tag repeat patterns. In this paper, we generate the DS and PB graphs by adopting the best-performed methods studied in (Shi et al., 2010). We will compare, by experiments, the propagation performance of utilizing the two categories 1164 of graphs, and also investigate the performance of utilizing both graphs for evidence propagation. 6 Experiments 6.1 Experimental setup Corpus We adopt a publicly available dataset in our experiments: ClueWeb094. This is a very large dataset collected by Carnegie Mellon University in early 2009 and has been used by several tracks of the Text Retrieval Conference (TREC)5. The whole dataset consists of 1.04 billion web pages in ten languages while only those in English, about 500 million pages, are used in our experiments. The reason for selecting such a dataset is twofold: First, it is a corpus large enough for conducting webscale experiments and getting meaningful results. Second, since it is publicly available, it is possible for other researchers to reproduce the experiments in this paper. Term sets Approaches are evaluated by using two sets of selected terms: Wiki200, and Ext100. For every term in the term sets, each approach generates a list of hypernym labels, which are manually judged by human annotators. Wiki200 is constructed by first randomly selecting 400 Wikipedia6 titles as our candidate terms, with the probability of a title T being selected to be ( ( )), where F(T) is the frequency of T in our data corpus. The reason of adopting such a probability formula is to balance popular terms and rare ones in our term set. Then 200 terms are manually selected from the 400 candidate terms, with the principle of maximizing the diversity of terms in terms of length (i.e., number of words) and type (person, location, organization, software, movie, song, animal, plant, etc.). Wiki200 is further divided into two subsets: Wiki100H and Wiki100L, containing respectively the 100 high-frequency and lowfrequency terms. Ext100 is built by first selecting 200 non-Wikipedia-title terms at random from the term-label graph generated by the baseline approach (Formula 3. 1), then manually selecting 100 terms. Some sample terms in the term sets are listed in Table 3. 4 http://boston.lti.cs.cmu.edu/Data/clueweb09/ 5 http://trec.nist.gov/ 6 http://www.wikipedia.org/ Annotation For each term in the term set, the top-5 results (i.e., hypernym labels) of various methods are mixed and judged by human annotators. Each annotator assigns each result item a judgment of “Good”, “Fair” or “Bad”. The annotators do not know the method by which a result item is generated. Six annotators participated in the labeling with a rough speed of 15 minutes per term. We also encourage the annotators to add new good results which are not discovered by any method. The term sets and their corresponding user anno- tations are available for download at the following links (dataset ID=data.queryset.semcat01): http://research.microsoft.com/en-us/projects/needleseek/ http://needleseek.msra.cn/datasets/ Evaluation We adopt the following metrics to evaluate the hypernym list of a term generated by each method. The evaluation score on a term set is the average over all the terms. 
Precision@k: The percentage of relevant (good or fair) labels in the top-k results (labels judged as “Fair” are counted as 0.5) Recall@k: The ratio of relevant labels in the topk results to the total number of relevant labels R-Precision: Precision@R where R is the total number of labels judged as “Good” Mean average precision (MAP): The average of precision values at the positions of all good or fair results Before annotation and evaluation, the hypernym list generated by each method for each term is preprocessed to remove duplicate items. Two hypernyms are called duplicate items if they share the same head word (e.g., “military conflict” and “conflict”). For duplicate hypernyms, only the first (i.e., the highest ranked one) in the list is kept. The goal with such a preprocessing step is to partially con- sider results diversity in evaluation and to make a more meaningful comparison among different methods. Consider two hypernym lists for “subway”: List-1 : restaurant; chain restaurant; worldwide chain restaurant; franchise; restaurant franchise… List-2: restaurant; franchise; transportation; company; fast food… There are more detailed hypernyms in the first list about “subway” as a restaurant or a franchise; while the second list covers a broader range of meanings for the term. It is hard to say which is better (without considering the upper-layer applications). With this preprocessing step, we keep our focus on short hypernyms rather than detailed ones. … … … … evidence fusion methods (Term sets: Wiki200 and Wiki100H; p=2 for PNorm) 6.2 Experimental results We first compare the evaluation results of different evidence fusion methods mentioned in Section 4.1. In Table 4, Linear means that Formula 3. 1 is used to calculate label scores, whereas Log and PNorm represent our nonlinear approach with Formulas 4. 11 and 4. 12 being utilized. The performance improvement numbers shown in the table are based on the linear version; and the upward pointing arrows indicate relative percentage improvement over the baseline. From the table, we can see that the nonlinear methods outperform the linear ones on the Wiki200 term set. It is interesting to note that the performance improvement is more significant on Wiki100H, the set of high frequency terms. By examining the labels and supporting sentences for the terms in each term set, we find that for many low-frequency terms (in Wiki100L), there are only a few supporting sentences (corresponding 1165 to one or two patterns). So the scores computed by various fusion algorithms tend to be similar. In contrast, more supporting sentences can be discov- ered for high-frequency terms. Much information is contained in the sentences about the hypernyms of the high-frequency terms, but the linear function of Formula 3.1 fails to make effective use of it. The two nonlinear methods achieve better performance by appropriately modeling the dependency between supporting sentences and computing the log-probability gain in a better way. The comparison of the linear and nonlinear methods on the Ext100 term set is shown in Table 5. Please note that the terms in Ext100 do not appear in Wikipedia titles. Thanks to the scale of the data corpus we are using, even the baseline approach achieves reasonably good performance. Please note that the terms (refer to Table 3) we are using are “harder” than those adopted for evaluation in many existing papers. 
Again, the results quality is improved with the nonlinear methods, although the performance improvement is not big due to the reason that most terms in Ext100 are rare. Please note that the recall (R@1, R@5) in this paper is pseudo-recall, i.e., we treat the number of known relevant (Good or Fair) results as the total number of relevant ones. MTPLeNainotbhgrlmed5.M012P35A8e96r%5P4f0orRm-.aP40%2rn9ce 50P7o.@625m10p%5 ari0Ps.4o@07%n25 amR307o.@41n526g%0 vaRr0i.3o@8% u5s evidence fusion methods (Term set: Ext100; p=2 for PNorm) The parameter p in the PNorm method is related to the degree of correlations among supporting sentences. The linear method of Formula 3. 1 corresponds to the special case of p=1 ; while p= represents the case that other supporting sentences are fully correlated to the supporting sentence with the maximal log-probability gain. Figure 1 shows that, for most of the term sets, the best performance is obtained for [2.0, 4.0]. The reason may be that the sentence correlations are better estimated with p values in this range. Figure 1. Performance curves of PNorm with different parameter values (Measure: MAP) The experimental results of evidence propagation are shown in Table 6. The methods for comparison are, Base: The linear function without propagation. NL: Nonlinear evidence fusion (PNorm with p=2) without propagation. LP: Linear propagation, i.e., the linear function is used to combine the evidence of pseudo supporting sentences. NLP: Nonlinear propagation where PNorm (p=2) is used to combine the pseudo supporting sentences. NL+NLP: The nonlinear method is used to combine both supporting sentences and pseudo supporting sentences. Wiki200; Similarity graph: PB; Nonlinear formula: PNorm) In this paper, we generate the DS (distributional similarity) and PB (pattern-based) graphs by adopting the best-performed methods studied in (Shi et al., 2010). The performance improvement numbers (indicated by the upward pointing arrows) shown in tables 6~9 are relative percentage improvement 1166 over the base approach (i.e., linear function without propagation). The values of parameter are set to maximize the MAP values. Several observations can be made from Table 6. First, no performance improvement can be obtained with the linear propagation method (LP), while the nonlinear propagation algorithm (NLP) works quite well in improving both precision and recall. The results demonstrate the high correlation between pseudo supporting sentences and the great potential of using term similarity to improve hypernymy extraction. The second observation is that the NL+NLP approach achieves a much larger performance improvement than NL and NLP. Similar results (omitted due to space limitation) can be observed on the Ext100 term set. evidence propagation (Term set: Wiki200; Nonlinear formula: Log) evidence propagation (Term set: Wiki100L) Now let us study whether it is possible to combine the PB and DS graphs to obtain better results. As shown in Tables 7, 8, and 9 (for term sets Wiki200, Wiki100L, and Ext100 respectively, using the Log formula for fusion and propagation), utilizing both graphs really yields additional performance gains. We explain this by the fact that the information in the two term similarity graphs tends 1167 to be complimentary. The performance improvement over Wiki100L is especially remarkable. This is reasonable because rare terms do not have adequate information in their supporting sentences due to data sparseness. 
As a result, they benefit the most from the pseudo supporting sentences propagated with the similarity graphs. evidence propagation (Term set: Ext100) 7 Conclusion We demonstrated that the way of aggregating supporting sentences has considerable impact on results quality of the hyponym extraction task using lexico-syntactic patterns, and the widely-used counting method is not optimal. We applied a series of nonlinear evidence fusion formulas to the problem and saw noticeable performance improvement. The data quality is improved further with the combination of nonlinear evidence fusion and evidence propagation. We also introduced a new evaluation corpus with annotated hypernym labels for 300 terms, which were shared with the research community. Acknowledgments We would like to thank Matt Callcut for reading through the paper. Thanks to the annotators for their efforts in judging the hypernym labels. Thanks to Yueguo Chen, Siyu Lei, and the anonymous reviewers for their helpful comments and suggestions. The first author is partially supported by the NSF of China (60903028,61070014), and Key Projects in the Tianjin Science and Technology Pillar Program. References E. Agirre, E. Alfonseca, K. Hall, J. Kravalova, M. Pasca, and A. Soroa. 2009. A Study on Similarity and Relatedness Using Distributional and WordNet-based Approaches. In Proc. of NAACL-HLT’2009. M. Banko, M.J. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni. 2007. Open Information Extraction from the Web. In Proc. of IJCAI’2007. M. Cafarella, A. Halevy, D. Wang, E. Wu, and Y. Zhang. 2008. WebTables: Exploring the Power of Tables on the Web. In Proceedings of the 34th Conference on Very Large Data Bases (VLDB’2008), pages 538–549, Auckland, New Zealand. B. Van Durme and M. Pasca. 2008. Finding cars, goddesses and enzymes: Parametrizable acquisition of labeled instances for open-domain information extraction. Twenty-Third AAAI Conference on Artificial Intelligence. F. Geraci, M. Pellegrini, M. Maggini, and F. Sebastiani. 2006. Cluster Generation and Cluster Labelling for Web Snippets: A Fast and Accurate Hierarchical Solution. In Proceedings of the 13th Conference on String Processing and Information Retrieval (SPIRE’2006), pages 25–36, Glasgow, Scotland. Z. S. Harris. 1985. Distributional Structure. The Philosophy of Linguistics. New York: Oxford University Press. M. Hearst. 1992. Automatic Acquisition of Hyponyms from Large Text Corpora. In Fourteenth International Conference on Computational Linguistics, Nantes, France. Z. Kozareva, E. Riloff, E.H. Hovy. 2008. Semantic Class Learning from the Web with Hyponym Pattern Linkage Graphs. In Proc. of ACL'2008. P. Pantel, E. Crestan, A. Borkovsky, A.-M. Popescu and V. Vyas. 2009. Web-Scale Distributional Similarity and Entity Set Expansion. EMNLP’2009. Singapore. P. Pantel and D. Ravichandran. 2004. Automatically Labeling Semantic Classes. In Proc. of the 2004 Human Language Technology Conference (HLTNAACL’2004), 321–328. M. Pasca. 2004. Acquisition of Categorized Named Entities for Web Search. In Proc. of CIKM’2004. M. Pasca. 2010. The Role of Queries in Ranking Labeled Instances Extracted from Text. In Proc. of COLING’2010, Beijing, China. S. Shi, B. Lu, Y. Ma, and J.-R. Wen. 2009. Nonlinear Static-Rank Computation. In Proc. of CIKM’2009, Kong Kong. 1168 S. Shi, H. Zhang, X. Yuan, J.-R. Wen. 2010. Corpusbased Semantic Class Mining: Distributional vs. Pattern-Based Approaches. In Proc. of COLING’2010, Beijing, China. K. Shinzato and K. Torisawa. 2004. 
Acquiring Hyponymy Relations from Web Documents. In Proc. of the 2004 Human Language (HLT-NAACL’2004). Technology Conference R. Snow, D. Jurafsky, and A. Y. Ng. 2005. Learning Syntactic Patterns for Automatic Hypernym Discovery. In Proceedings of the 19th Conference on Neural Information Processing Systems. R. Snow, D. Jurafsky, and A. Y. Ng. 2006. Semantic Taxonomy Induction from Heterogenous Evidence. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (COLING-ACL-06), 801–808. P. P. Talukdar and F. Pereira. 2010. Experiments in Graph-based Semi-Supervised Learning Methods for Class-Instance Acquisition. In 48th Annual Meeting of the Association for Computational Linguistics (ACL’2010). P. P. Talukdar, J. Reisinger, M. Pasca, D. Ravichandran, R. Bhagat, and F. Pereira. 2008. Weakly-Supervised Acquisition of Labeled Class Instances using Graph Random Walks. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing (EMNLP’2008), pages 581–589. R.C. Wang. W.W. Cohen. Automatic Set Instance Extraction using the Web. In Proc. of the 47th Annual Meeting of the Association for Computational Lin- guistics (ACL-IJCNLP’2009), gapore. pages 441–449, Sin- H. Zhang, M. Zhu, S. Shi, and J.-R. Wen. 2009. Employing Topic Models for Pattern-based Semantic Class Discovery. In Proc. of the 47th Annual Meeting of the Association for Computational Linguistics (ACL-IJCNLP’2009), pages 441–449, Singapore.
3 0.80689824 277 acl-2011-Semi-supervised Relation Extraction with Large-scale Word Clustering
Author: Ang Sun ; Ralph Grishman ; Satoshi Sekine
Abstract: We present a simple semi-supervised relation extraction system with large-scale word clustering. We focus on systematically exploring the effectiveness of different cluster-based features. We also propose several statistical methods for selecting clusters at an appropriate level of granularity. When training on different sizes of data, our semi-supervised approach consistently outperformed a state-of-the-art supervised baseline system. 1
same-paper 4 0.80333573 257 acl-2011-Question Detection in Spoken Conversations Using Textual Conversations
Author: Anna Margolis ; Mari Ostendorf
Abstract: We investigate the use of textual Internet conversations for detecting questions in spoken conversations. We compare the text-trained model with models trained on manuallylabeled, domain-matched spoken utterances with and without prosodic features. Overall, the text-trained model achieves over 90% of the performance (measured in Area Under the Curve) of the domain-matched model including prosodic features, but does especially poorly on declarative questions. We describe efforts to utilize unlabeled spoken utterances and prosodic features via domain adaptation.
5 0.74923253 65 acl-2011-Can Document Selection Help Semi-supervised Learning? A Case Study On Event Extraction
Author: Shasha Liao ; Ralph Grishman
Abstract: Annotating training data for event extraction is tedious and labor-intensive. Most current event extraction tasks rely on hundreds of annotated documents, but this is often not enough. In this paper, we present a novel self-training strategy, which uses Information Retrieval (IR) to collect a cluster of related documents as the resource for bootstrapping. Also, based on the particular characteristics of this corpus, global inference is applied to provide more confident and informative data selection. We compare this approach to self-training on a normal newswire corpus and show that IR can provide a better corpus for bootstrapping and that global inference can further improve instance selection. We obtain gains of 1.7% in trigger labeling and 2.3% in role labeling through IR and an additional 1.1% in trigger labeling and 1.3% in role labeling by applying global inference. 1
7 0.74262846 196 acl-2011-Large-Scale Cross-Document Coreference Using Distributed Inference and Hierarchical Models
8 0.73886228 58 acl-2011-Beam-Width Prediction for Efficient Context-Free Parsing
9 0.73762798 190 acl-2011-Knowledge-Based Weak Supervision for Information Extraction of Overlapping Relations
10 0.73399663 126 acl-2011-Exploiting Syntactico-Semantic Structures for Relation Extraction
11 0.72802925 324 acl-2011-Unsupervised Semantic Role Induction via Split-Merge Clustering
12 0.72787368 135 acl-2011-Faster and Smaller N-Gram Language Models
13 0.72722185 209 acl-2011-Lexically-Triggered Hidden Markov Models for Clinical Document Coding
14 0.72654521 12 acl-2011-A Generative Entity-Mention Model for Linking Entities with Knowledge Base
15 0.72500253 40 acl-2011-An Error Analysis of Relation Extraction in Social Media Documents
16 0.72432166 316 acl-2011-Unary Constraints for Efficient Context-Free Parsing
17 0.72377789 244 acl-2011-Peeling Back the Layers: Detecting Event Role Fillers in Secondary Contexts
18 0.72296095 223 acl-2011-Modeling Wisdom of Crowds Using Latent Mixture of Discriminative Experts
19 0.7226305 295 acl-2011-Temporal Restricted Boltzmann Machines for Dependency Parsing
20 0.72160476 94 acl-2011-Deciphering Foreign Language