228 acl-2011-N-Best Rescoring Based on Pitch-accent Patterns

Author: Je Hun Jeon ; Wen Wang ; Yang Liu

Abstract: In this paper, we adopt an n-best rescoring scheme using pitch-accent patterns to improve automatic speech recognition (ASR) performance. The pitch-accent model is decoupled from the main ASR system, which allows us to develop it independently. N-best hypotheses from recognizers are rescored with additional scores that measure the correlation of the pitch-accent patterns between the acoustic signal and lexical cues. To test the robustness of our algorithm, we use two different data sets and recognition setups: the first is English radio news data that has pitch-accent labels, but the recognizer is trained on a small amount of data and has a high error rate; the second is English broadcast news data recognized with a state-of-the-art SRI recognizer. Our experimental results demonstrate that our approach reduces word error rate by about 3% relative. This gain is consistent across the two tests, pointing to promising future directions for incorporating prosodic information to improve speech recognition.
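The rescoring idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the scoring formula, so everything below is an assumption made for illustration: the `Hypothesis` fields, the simple agreement measure between the two accent patterns, the interpolation `weight`, and the helper `relative_wer_reduction` are all hypothetical and are not the authors' actual model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hypothesis:
    words: List[str]
    asr_score: float             # recognizer log score (assumed to be given)
    acoustic_accents: List[int]  # 1 = accented, predicted from the signal (assumed)
    lexical_accents: List[int]   # 1 = accented, predicted from the words (assumed)


def accent_agreement(hyp: Hypothesis) -> float:
    """Fraction of units whose acoustic and lexical accent labels agree.

    This stands in for the 'correlation' score mentioned in the abstract;
    the real measure is not specified there.
    """
    pairs = list(zip(hyp.acoustic_accents, hyp.lexical_accents))
    if not pairs:
        return 0.0
    return sum(a == b for a, b in pairs) / len(pairs)


def rescore(nbest: List[Hypothesis], weight: float = 1.0) -> Hypothesis:
    """Pick the hypothesis with the best ASR score plus weighted accent score."""
    return max(nbest, key=lambda h: h.asr_score + weight * accent_agreement(h))


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative reduction, e.g. 20.0 -> 19.4 is a 3% relative gain."""
    return (baseline_wer - new_wer) / baseline_wer


nbest = [
    Hypothesis(["a", "b"], -10.0, [1, 0], [0, 1]),  # patterns disagree everywhere
    Hypothesis(["a", "c"], -10.5, [1, 0], [1, 0]),  # patterns agree everywhere
]
best = rescore(nbest, weight=1.0)
print(best.words)
```

With `weight=0.0` the recognizer's own top hypothesis is returned unchanged, which reflects the decoupling described in the abstract: the accent score is computed outside the recognizer and only reorders its output.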

reference text

Sankaranarayanan Ananthakrishnan and Shrikanth Narayanan. 2007. Improved speech recognition using acoustic and lexical correlates of pitch accent in a n-best rescoring framework. Proc. of ICASSP, pages 65–68.

Sankaranarayanan Ananthakrishnan and Shrikanth Narayanan. 2008. Automatic prosodic event detection using acoustic, lexical and syntactic evidence. IEEE Transactions on Audio, Speech, and Language Processing, 16(1):216–228.

Susan Bartlett, Grzegorz Kondrak, and Colin Cherry. 2009. On the syllabification of phonemes. Proc. of NAACL-HLT, pages 308–316.

Stefan Benus, Agustín Gravano, and Julia Hirschberg. 2007. Prosody, emotions, and whatever. Proc. of Interspeech, pages 2629–2632.

Avrim Blum and Tom Mitchell. 1998. Combining labeled and unlabeled data with co-training. Proc. of the Workshop on Computational Learning Theory, pages 92–100.

Ken Chen and Mark Hasegawa-Johnson. 2006. Prosody dependent speech recognition on radio news corpus of American English. IEEE Transactions on Audio, Speech, and Language Processing, 14(1):232–245.

Najim Dehak, Pierre Dumouchel, and Patrick Kenny. 2007. Modeling prosodic features with joint factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing, 15(7):2095–2103.

Esther Grabe, Greg Kochanski, and John Coleman. 2003. Quantitative modelling of intonational variation. Proc. of SASRTLM, pages 45–57.

Je Hun Jeon and Yang Liu. 2009. Automatic prosodic events detection using syllable-based acoustic and syntactic features. Proc. of ICASSP, pages 4565–4568.

Je Hun Jeon and Yang Liu. 2010. Syllable-level prominence detection with acoustic evidence. Proc. of Interspeech, pages 1772–1775.

Ozlem Kalinli and Shrikanth Narayanan. 2009. Continuous speech recognition using attention shift decoding with soft decision. Proc. of Interspeech, pages 1927–1930.

Diane J. Litman, Julia B. Hirschberg, and Marc Swerts. 2000. Predicting automatic speech recognition performance using prosodic cues. Proc. of NAACL, pages 218–225.

Mari Ostendorf, Patti Price, and Stefanie Shattuck-Hufnagel. 1995. The Boston University radio news corpus. Linguistic Data Consortium.

Mari Ostendorf, Izhak Shafran, and Rebecca Bates. 2003. Prosody models for conversational speech recognition. Proc. of the 2nd Plenary Meeting and Symposium on Prosody and Speech Processing, pages 147–154.

Andrew Rosenberg and Julia Hirschberg. 2006. Story segmentation of broadcast news in English, Mandarin and Arabic. Proc. of HLT-NAACL, pages 125–128.

Elizabeth Shriberg, Andreas Stolcke, Dilek Hakkani-Tür, and Gökhan Tür. 2000. Prosody-based automatic segmentation of speech into sentences and topics. Speech Communication, 32(1-2):127–154.

Elizabeth Shriberg, Luciana Ferrer, Sachin S. Kajarekar, Anand Venkataraman, and Andreas Stolcke. 2005. Modeling prosodic feature sequences for speaker recognition. Speech Communication, 46(3-4):455–472.

Vivek Kumar Rangarajan Sridhar, Srinivas Bangalore, and Shrikanth S. Narayanan. 2008. Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework. IEEE Transactions on Audio, Speech, and Language Processing, 16(4):797–811.

Vivek Kumar Rangarajan Sridhar, Srinivas Bangalore, and Shrikanth Narayanan. 2009. Combining lexical, syntactic and prosodic cues for improved online dialog act tagging. Computer Speech and Language, 23(4):407–422.

Andreas Stolcke, Barry Chen, Horacio Franco, Venkata Ramana Rao Gadde, Martin Graciarena, Mei-Yuh Hwang, Katrin Kirchhoff, Arindam Mandal, Nelson Morgan, Xin Lin, Tim Ng, Mari Ostendorf, Kemal Sönmez, Anand Venkataraman, Dimitra Vergyri, Wen Wang, Jing Zheng, and Qifeng Zhu. 2006. Recent innovations in speech-to-text transcription at SRI-ICSI-UW. IEEE Transactions on Audio, Speech and Language Processing, 14(5):1729–1744. Special Issue on Progress in Rich Transcription.

Gyorgy Szaszak and Klara Vicsi. 2007. Speech recognition supported by prosodic information for fixed stress languages. Proc. of TSD Conference, pages 262–269.

Dimitra Vergyri, Andreas Stolcke, Venkata R. R. Gadde, Luciana Ferrer, and Elizabeth Shriberg. 2003. Prosodic knowledge sources for automatic speech recognition. Proc. of ICASSP, pages 208–211.

Colin W. Wightman and Mari Ostendorf. 1994. Automatic labeling of prosodic patterns. IEEE Transactions on Speech and Audio Processing, 2(4):469–481.

Jing Zheng, Ozgur Cetin, Mei-Yuh Hwang, Xin Lei, Andreas Stolcke, and Nelson Morgan. 2007. Combining discriminative feature, transform, and model training for large vocabulary speech recognition. Proc. of ICASSP, pages 633–636.