acl acl2012 acl2012-16 acl2012-16-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Chia-ying Lee; James Glass
Abstract: We investigate the problem of acoustic modeling in which prior language-specific knowledge and transcribed data are unavailable. We present an unsupervised model that simultaneously segments the speech, discovers a proper set of sub-word units (e.g., phones) and learns a Hidden Markov Model (HMM) for each induced acoustic unit. Our approach is formulated as a Dirichlet process mixture model in which each mixture is an HMM that represents a sub-word unit. We apply our model to the TIMIT corpus, and the results demonstrate that our model discovers sub-word units that are highly correlated with English phones and also produces better segmentation than the state-of-the-art unsupervised baseline. We test the quality of the learned acoustic models on a spoken term detection task. Compared to the baselines, our model improves the relative precision of top hits by at least 22.1% and outperforms a language-mismatched acoustic model.
Chun-An Chan and Lin-Shan Lee. 2011. Unsupervised hidden Markov modeling of spoken queries for spoken term detection without speech recognition. In Proceedings of INTERSPEECH, pages 2141–2144.
Steven B. Davis and Paul Mermelstein. 1980. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. on Acoustics, Speech, and Signal Processing, 28(4):357–366.
Sorin Dusan and Lawrence Rabiner. 2006. On the relation between maximum spectral transition positions and phone boundaries. In Proceedings of INTERSPEECH, pages 1317–1320.
Yago Pereiro Estevan, Vincent Wan, and Odette Scharenborg. 2007. Finding maximum margin segments in speech. In Proceedings of ICASSP, pages 937–940.
Emily Fox, Erik B. Sudderth, Michael I. Jordan, and Alan S. Willsky. 2011. A sticky HDP-HMM with application to speaker diarization. Annals of Applied Statistics.
Alvin Garcia and Herbert Gish. 2006. Keyword spotting of arbitrary words using minimal speech resources. In Proceedings of ICASSP, pages 949–952.
John S. Garofolo, Lori F. Lamel, William M. Fisher, Jonathan G. Fiscus, David S. Pallet, Nancy L. Dahlgren, and Victor Zue. 1993. TIMIT acoustic-phonetic continuous speech corpus.
Andrew Gelman, John B. Carlin, Hal S. Stern, and Donald B. Rubin. 2004. Bayesian Data Analysis. Texts in Statistical Science. Chapman & Hall/CRC, second edition.
James Glass. 2003. A probabilistic framework for segment-based speech recognition. Computer Speech and Language, 17:137–152.
Sharon Goldwater. 2009. A Bayesian framework for word segmentation: exploring the effects of context. Cognition, 112:21–54.
Aren Jansen and Kenneth Church. 2011. Towards unsupervised training of speaker independent acoustic models. In Proceedings of INTERSPEECH, pages 1693–1696.
Frederick Jelinek. 1976. Continuous speech recognition by statistical methods. Proceedings of the IEEE, 64:532–556.
Sawit Kasuriya, Virach Sornlertlamvanich, Patcharika Cotsomrong, Supphanat Kanokphara, and Nattanun Thatphithakkul. 2003. Thai speech corpus for Thai speech recognition. In Proceedings of Oriental COCOSDA, pages 54–61.
Kai-Fu Lee and Hsiao-Wuen Hon. 1989. Speaker-independent phone recognition using hidden Markov models. IEEE Trans. on Acoustics, Speech, and Signal Processing, 37:1641–1648.
Chin-Hui Lee, Frank Soong, and Biing-Hwang Juang. 1988. A segment model based approach to speech recognition. In Proceedings of ICASSP, pages 501–504.
Kevin P. Murphy. 2007. Conjugate Bayesian analysis of the Gaussian distribution. Technical report, University of British Columbia.
Radford M. Neal. 2000. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2):249–265.
Yu Qiao, Naoya Shimomura, and Nobuaki Minematsu. 2008. Unsupervised optimal phoneme segmentation: Objectives, algorithms and comparisons. In Proceedings of ICASSP, pages 3989–3992.
Carl Edward Rasmussen. 2000. The infinite Gaussian mixture model. In Advances in Neural Information Processing Systems, 12:554–560.
Odette Scharenborg, Vincent Wan, and Mirjam Ernestus. 2010. Unsupervised speech segmentation: An analysis of the hypothesized phone boundaries. Journal of the Acoustical Society of America, 127:1084–1095.
Balakrishnan Varadarajan, Sanjeev Khudanpur, and Emmanuel Dupoux. 2008. Unsupervised learning of acoustic sub-word units. In Proceedings of ACL-08: HLT, Short Papers, pages 165–168.
Yaodong Zhang and James Glass. 2009. Unsupervised spoken keyword spotting via segmental DTW on Gaussian posteriorgrams. In Proceedings of ASRU, pages 398–403.
Yaodong Zhang, Ruslan Salakhutdinov, Hung-An Chang, and James Glass. 2012. Resource configurable spoken query detection using deep Boltzmann machines. In Proceedings of ICASSP, pages 5161–5164.